Random Reboots During Proxmox Backups

redmop

It crashes and reboots when backing up a specific VM (102) and almost always on Saturday or Sunday, though it has happened outside of those days. During the Proxmox backup window, there is almost nothing else going on.

This is an HP ProLiant ML350p running the latest firmware as of 2015-09-21; HP tells me there is nothing newer. They don't see anything in the AHS logs. They have replaced the motherboard and the SSDs, and the problem remains.

After HP replaced the motherboard, I ran Insight Diagnostics for 4 straight days without error.

This problem has existed for about a year now. I usually just disable the Proxmox backup and use other tools (ZFS, tar, BackupPC). I had the same problem on 3.4.

I just enabled kexec-tools and crash dump so I'll get more info on that after it crashes.
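As a rough sketch of what enabling crash dumps involves on Debian jessie (the base for Proxmox VE 4.x): the package name, crashkernel size, and paths below are typical defaults, not details confirmed in this thread.

```shell
# Sketch: enabling kernel crash dumps on Debian jessie / Proxmox VE 4.x.
# Package name, crashkernel size, and dump path are assumptions (common defaults).
apt-get install kdump-tools

# Reserve memory for the crash kernel by adding to /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet crashkernel=256M"
update-grub

# Enable kdump in /etc/default/kdump-tools:
#   USE_KDUMP=1

# Reboot so the crash kernel is loaded; after the next crash,
# the dump should appear under /var/crash for analysis with the crash utility.
reboot
```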

The root is ZFS from the installer. ZFS swap is disabled; swap is on a RAID 1 of SSDs.

The server crashed 2016-03-06 at 4:04 AM. Here is the pastebin of the syslog: http://pastebin.com/JeMauzhc

What else do I need?

root@cb-prox1:~# pveversion -v
proxmox-ve: 4.1-37 (running kernel: 4.2.8-1-pve)
pve-manager: 4.1-13 (running version: 4.1-13/cfb599fb)
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.2.8-1-pve: 4.2.8-37
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-32
qemu-server: 4.0-55
pve-firmware: 1.1-7
libpve-common-perl: 4.0-48
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-40
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-5
pve-container: 1.0-44
pve-firewall: 2.0-17
pve-ha-manager: 1.0-21
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 0.13-pve3
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie
 
Do you have the HP tools installed (hp-health, ...)? Do you have a real SATA/SAS controller in AHCI mode for your ZFS filesystem? No hardware RAID controller, no fake-RAID controller: real SATA or real SAS. What generation is your server? G8, G9?

You can disable the watchdog in the HP BIOS, and you can disable the watchdog service in the HP tools as well.
 
Model HP H220 Host Bus Adapter
Firmware Version 15.10.09.00
Controller Type SAS/SATA

No RAID is enabled on the controller. It is AHCI: no caching, direct access.

root@cb-prox1:~# apt-cache policy hp-health
hp-health:
  Installed: 10.0.0.1.3-4.
  Candidate: 10.0.0.1.3-4.
  Version table:
 *** 10.0.0.1.3-4. 0
        500 http://downloads.linux.hpe.com/SDR/repo/mcp/ jessie/current/non-free amd64 Packages
        100 /var/lib/dpkg/status

What could be triggering the watchdog?

It is a G8
Product Name ProLiant ML350p Gen8

There is an HP Smart Array P420i controller installed, but nothing at all is on it.


The Proxmox backup is set to start at 3:00 AM; the syslog showed errors before that.

Mar 6 01:46:49 cb-prox1 corosync[3369]: [MAIN ] Corosync main process was not scheduled for 1722.5845 ms (threshold is 800.0000 ms). Consider token timeout increase.
Mar 6 01:46:49 cb-prox1 corosync[3369]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Mar 6 01:46:49 cb-prox1 corosync[3369]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.


The crash happened during the Proxmox backup. The errors in the syslog began during the BackupPC backup.

2016-03-06 01:05:01 full backup started for directory / (baseline backup #131)
2016-03-06 02:32:03 full backup 132 complete, 1900280 files, 147069160492 bytes, 0 xferErrs (0 bad files, 0 bad shares, 0 other)


However, the crashes began before BackupPC was even installed. They persisted through Proxmox VE 3.1, 3.2, 3.3, and 3.4.
 
Mar 6 01:46:49 cb-prox1 corosync[3369]: [MAIN ] Corosync main process was not scheduled for 1722.5845 ms (threshold is 800.0000 ms). Consider token timeout increase.
Mar 6 01:46:49 cb-prox1 corosync[3369]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Mar 6 01:46:49 cb-prox1 corosync[3369]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.

Network overloaded, maybe?
 
It shouldn't be. Dual gigabit on LACP. Line speed on Prox1 never went over 81 megs per second on a single link, and that was for about 3 minutes spread out through the backup. Prox2 (much weaker hardware, also LACP) peaked at 77, and its whole backup took 3 minutes.
 
I wonder: are these virtual machines on ZVOLs? ZVOLs on ZFS have shown a tendency to reboot the machine under low-memory conditions, very easily in Proxmox 4 and somewhat less easily in Proxmox 3 when used for swap. I find there is a little too much correlation between these unscheduled reboots (usually under high I/O) and machines running ZFS.

It may help to constrain the ARC to reasonable limits, or to raise vm.vfs_cache_pressure to something like 10000. You could also try disabling ARC caching for the ZVOL altogether during backups. As I understand it, the ARC limits in ZFS on Linux are not hard limits: if data is not written to disk fast enough, the limits will be exceeded, and this can lead to unexpected low-memory conditions that may crash a Linux kernel accessing a ZVOL, which after all is an emulated block device possibly needing lots of buffers at the moment of access.
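As a sketch of those suggestions (the 8 GiB cap and the dataset name rpool/data/vm-102-disk-1 are example values, not values taken from this thread):

```shell
# Cap the ZFS ARC (value in bytes; 8 GiB here is only an example):
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
update-initramfs -u    # rebuild initramfs so the cap applies early with ZFS root

# Or apply it at runtime without a reboot:
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max

# Make the kernel reclaim dentry/inode caches more aggressively:
sysctl -w vm.vfs_cache_pressure=10000

# Stop caching the zvol's data in ARC (metadata is still cached);
# the dataset name is hypothetical:
zfs set primarycache=metadata rpool/data/vm-102-disk-1
```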
 
They are VMs on ZVOLs. The machine isn't running low on memory: I have 64 GB of RAM, ARC is limited to 32 GB, and my VMs use 13 GB. I have a little over 16 GB of free RAM right now.

Also, I have Proxmox backups disabled, and it just crashed while idle.

I got a lot of this:
Mar 20 04:58:47 cb-prox1 corosync[3473]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.

and this shows up sometimes:
Mar 20 04:58:47 cb-prox1 corosync[3473]: [MAIN ] Corosync main process was not scheduled for 2942.3713 ms (threshold is 800.0000 ms). Consider token timeout increase.
Mar 20 04:58:47 cb-prox1 corosync[3473]: [TOTEM ] A new membership (192.168.251.9:3696) was formed. Members joined: 2 left: 2
Mar 20 04:58:47 cb-prox1 corosync[3473]: [TOTEM ] Failed to receive the leave message. failed: 2
 
Did you solve the issue?

I often find that Proxmox VE 3.x/4.1/4.2 clusters (3-4 nodes) reboot themselves, whether using ext4 or ZFS, and I have no idea what the root cause is.

How can we increase the corosync timeout?
 
No, I am just living with it. Since I am not doing Proxmox (vzdump) backups from the host, it seems to run about 3 months between reboots instead of about 3 days. Still not acceptable; I'm looking for a solution. As I run zfs send/receive every 15 minutes to both onsite and offsite hosts, it is not a huge problem, but it is still a problem.

I have an HP ML10 that doesn't have this problem, even though Proxmox is configured the same way.
 
I think I found some possible root causes of the random reboots in a Proxmox VE 4.2 HA cluster. They were caused by out-of-sync date and time, and perhaps network glitches.

The default Debian NTP pool drifted off by more than 5 minutes across the 3 nodes, which made the Proxmox Cluster File System (pmxcfs) fail to sync and write, and then the Proxmox nodes rebooted themselves. I don't know which one Proxmox uses, the NTP service or systemd-timesyncd. Either way, I set one node as the NTP master, syncing to the Debian NTP pool, and the rest of the nodes sync to this master NTP node.
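The layout described above might look like this in /etc/ntp.conf; the master's hostname is made up for illustration.

```
# On the master node: sync to the Debian pool as usual.
server 0.debian.pool.ntp.org iburst
server 1.debian.pool.ntp.org iburst

# On every other node: sync only to the master (hypothetical hostname).
server prox-master.example.lan iburst
```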

I also found one of the nodes complaining to increase the corosync token timeout. So I went ahead and edited /etc/pve/corosync.conf as follows:
  • config_version: this value needs to be incremented each time a change is made
  • token: 10000
  • token_retransmits_before_loss_const: 10
  • consensus: 12000
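Put into corosync.conf terms, and assuming everything else is left as the Proxmox installer wrote it (the cluster name and version number here are placeholders), the resulting totem section would look roughly like this:

```
totem {
  cluster_name: mycluster          # hypothetical name
  config_version: 5                # bump this on every edit
  version: 2
  token: 10000
  token_retransmits_before_loss_const: 10
  consensus: 12000                 # corosync requires consensus >= 1.2 * token
}
```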
Quoted from OpenStack HA Guide:
The token value specifies the time, in milliseconds, during which the Corosync token is expected to be transmitted around the ring. When this timeout expires, the token is declared lost, and after token_retransmits_before_loss_const lost tokens, the non-responding processor (cluster node) is declared dead. In other words, token × token_retransmits_before_loss_const is the maximum time a node is allowed to not respond to cluster messages before being considered dead. The default for token is 1000 milliseconds (1 second), with 4 allowed retransmits. These defaults are intended to minimize failover times, but can cause frequent "false alarms" and unintended failovers in case of short network interruptions. The values used here are safer, albeit with slightly extended failover times.

Hope this will stop the random reboots of the Proxmox cluster nodes. :/

PS:
I always find that Proxmox pve-manager fails to start VMs at boot due to no quorum. Perhaps this is caused by the LACP bonding and VLANs I use, which add a delay to network connectivity, so pve-manager starts too early. I had to add a crontab entry as follows:
  • @reboot sleep 300; systemctl restart pve-manager.service &>/dev/null

References:
http://manpages.ubuntu.com/manpages/xenial/man5/corosync.conf.5.html
http://docs.openstack.org/ha-guide/controller-ha-pacemaker.html
 
