4.4 Many hosts rebooting under load

jmann

Member
May 5, 2017
Hello,

We are a recent Proxmox adopter in an enterprise environment. We have currently rolled out Proxmox to a small number of hosts, but hope to expand. Unfortunately, a few issues have been plaguing us that we would like to resolve.

Currently, we have 27 servers running Proxmox 4.4, installed and configured per the online documentation, hosting approximately 120 VMs.

The servers are:
22x Dell R420, 64GB RAM
5x Supermicro 1028R, 128GB RAM

They are backed by a dedicated storage network, each host having dual 10 Gb/s links, serving a Ceph cluster of 46 SSDs. They are uplinked to our production network on a third 10 Gb/s link.

The specific problem we are having is that the hypervisors seem to be spontaneously rebooting under load. We have ruled out:
1) Power. All machines have redundant and properly specced power.
2) Memory/CPU. The problem drifts from machine to machine and can be replicated by producing artificial load or by moving a greedy VM onto the host (see the sketch just below).
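As a concrete example of what we mean by artificial load, something along these lines is representative (a minimal sketch, assuming the stress utility is installed; worker counts, sizes and duration are illustrative):

Code:
apt-get install -y stress
# hammer CPU and memory for ten minutes (parameters are illustrative)
stress --cpu 8 --vm 4 --vm-bytes 8G --timeout 600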

Suspecting this might be watchdog-related, we disabled the hardware watchdog on the Supermicro chassis, but the random reboots continued. We eventually narrowed it down to a particular VM that is very heavy in its CPU and RAM demands. If that VM is on a host, that host eventually spontaneously reboots.
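As a rough sketch, this is the kind of thing we checked to see which watchdogs were actually active on a node (the module and service names here are assumptions based on a default Proxmox install):

Code:
lsmod | grep -E 'softdog|ipmi_watchdog|iTCO_wdt'   # which watchdog kernel modules are loaded
systemctl status watchdog-mux.service              # Proxmox watchdog multiplexer
ha-manager status                                  # whether HA (and therefore software fencing) is active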

Following that lead, I experimented on a random hypervisor and merely put a lot of I/O stress onto its local disk with bonnie++. After just a few minutes, that machine became the 6th in the cluster to randomly restart. I can't find anything in the logs suggesting this was expected: no errors or warnings, nothing out of the ordinary. The machine is there one minute and gone the next, still serving its VMs up until the point it mysteriously reboots.
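The bonnie++ run itself was nothing exotic; roughly along these lines (the target directory and file size here are illustrative):

Code:
# run as root; -s should comfortably exceed the host's RAM so the page cache doesn't hide the I/O
bonnie++ -d /var/tmp/bonnie -s 128g -u root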

Currently, I suspect Proxmox itself is rebooting the machines, and I am requesting some assistance in verifying that and seeing what I can do about it.
 

Please include the output of "pveversion -v", and enable the persistent journal ("mkdir /var/log/journal; systemctl restart systemd-journald"). When a node reboots, post the last few minutes of the journal from before the reboot ("journalctl -b -1"). Please also include the VM and storage configuration.
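Put together, the requested information can be collected roughly like this (the VMID is a placeholder for the affected guest):

Code:
pveversion -v                         # package versions
mkdir /var/log/journal                # enable the persistent journal
systemctl restart systemd-journald
journalctl -b -1 -n 500               # after the next reboot: tail of the previous boot's journal
qm config <VMID>                      # VM configuration (VMID is a placeholder)
cat /etc/pve/storage.cfg              # storage configuration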
 
I simulated load and caused another machine to reboot at approximately 11:53 our local time.

Here are the last few lines of output from the journal before it rebooted (captured with the -r flag, so entries run newest to oldest):

Code:
-- Logs begin at Thu 2017-03-09 12:02:12 PST, end at Tue 2017-05-09 12:20:08 PDT. --
May 09 11:42:21 AF002162 systemd-journal[30466]: Permanent journal is using 8.0M (max allowed 2.4G, trying to leave 3.6G free of 21.2G available →
May 09 11:42:21 AF002162 systemd-journal[30466]: Permanent journal is using 8.0M (max allowed 2.4G, trying to leave 3.6G free of 21.2G available →
May 09 11:42:21 AF002162 systemd-journal[970]: Journal stopped
May 09 11:42:21 AF002162 systemd[1]: Stopping Journal Service...
May 09 11:42:19 AF002162 sshd[30343]: pam_unix(sshd:session): session opened for user root by (uid=0)
May 09 11:42:19 AF002162 sshd[30343]: Accepted publickey for root from 10.2.6.119 port 51019 ssh2: RSA 
May 09 11:42:01 AF002162 CRON[30177]: (root) CMD (   cd / && nice -n -18 runparallel.sh /usr/lib/imvu-fact/worker.minutely)
May 09 11:42:01 AF002162 CRON[30176]: pam_unix(cron:session): session opened for user root by (uid=0)
May 09 11:41:27 AF002162 CRON[29700]: pam_unix(cron:session): session closed for user root

syslog:
Code:
May  9 11:52:40 AF002162 pvestatd[2040]: status update time (5.360 seconds)
May  9 11:52:51 AF002162 pvestatd[2040]: status update time (5.948 seconds)
May  9 11:56:36 AF002162 systemd-modules-load[942]: Module 'fuse' is builtin
May  9 11:56:36 AF002162 systemd-modules-load[942]: Inserted module 'ipmi_devintf'

kernel log (I manually migrated the VMs off first):
Code:
May  9 11:47:04 AF002162 kernel: [5266013.894104] vmbr0: port 7(tap1104i0) entered disabled state
May  9 11:47:05 AF002162 pve-ha-lrm[32096]: <root@pam> end task UPID:AF002162:00007D62:1F630353:59120E62:qmigrate:1104:root@pam: OK
May  9 11:56:36 AF002162 kernel: [    0.000000] Initializing cgroup subsys cpuset
May  9 11:56:36 AF002162 kernel: [    0.000000] Initializing cgroup subsys cpu
May  9 11:56:36 AF002162 kernel: [    0.000000] Initializing cgroup subsys cpuacct

Here is the output of pveversion -v:

Code:
root@AF002162:~# pveversion -v
proxmox-ve: 4.4-76 (running kernel: 4.4.35-1-pve)
pve-manager: 4.4-1 (running version: 4.4-1/eb2d6f1e)
pve-kernel-4.4.35-1-pve: 4.4.35-76
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-101
pve-firmware: 1.1-10
libpve-common-perl: 4.0-83
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-70
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-9
pve-container: 1.0-88
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-2
lxcfs: 2.0.5-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
ceph: 0.94.10-1~bpo80+1

Attached is a draw.io diagram of the network and storage topology; I cannot link it directly as the forum won't let me, but please see the attached image.

All virtual machines are using RBD storage in the Ceph cluster.
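For reference, these are the kinds of checks we run on the Ceph side (the pool name below is a placeholder):

Code:
ceph -s             # overall cluster health
ceph osd tree       # OSD layout across hosts
rbd ls <pool>       # VM images in the RBD pool (pool name is a placeholder)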

I am able to easily replicate the reboot merely by putting a lot of stress on a hypervisor, whether or not it is running any VMs, and whether or not it also hosts an OSD.

As you can see, absolutely nothing of interest appears in the logs. The machine is there one second and gone the next.
 

Attachments

  • our_storage_network.png (37.1 KB)
After upgrading one host, it is no longer rebooting under stress.

Do you have technical details on the nature of the bug in the original kernel? I would like to understand why it failed in the first place and where the problem was. Thank you.
 

Ubuntu cherry-picked a bunch of OOM-handling commits from newer kernels as part of a bug fix. The commits turned out to be rather buggy on their own, and could trigger premature OOM-kills in certain memory situations. We subsequently reverted the commits in our pve-kernel packages (as did Ubuntu in theirs, after a few weeks of user complaints ;)).
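For anyone else landing on this thread: a normal package upgrade to the current pve-kernel resolves it, roughly as follows (assuming the appropriate Proxmox repository is already configured):

Code:
apt-get update
apt-get dist-upgrade      # pulls in the fixed pve-kernel package
reboot
# after the reboot:
uname -r                  # confirm the new kernel is running
pveversion -v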