PVE nodes spontaneously reboot during backup

budy

Hi,

I have two separate PVE clusters: one hosts my Ceph storage, while the other hosts only the guests. The PVE nodes have two 1 GbE and two 10 GbE interfaces, with the 10 GbE ones configured as an LACP bond. Originally, all communication ran over different VLANs on that bond, which led to performance and stability issues with corosync whenever the backup took up too much of the bond's capacity. The result was that some nodes got fenced while the backup was running.
What's interesting, though, is that these fences always occurred at around the same time. That surprised me, since I'd have expected them to happen at varying times depending on how saturated the LACP link was, but … it's almost always around 02:09 in the morning.

I then added a second ring to the corosync config to provide some redundancy. The second ring runs over a 1 GbE active/backup bond, since the switches in the blade chassis don't support vPC towards our Cisco Nexus switches, but anyway. The fencing stopped after that, yet during the last backup run two of my PVE nodes rebooted again at around 02:10 in the morning, this time without being actively fenced by corosync. There is simply nothing in the debug log or syslog.
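For anyone setting up the same thing, the second ring is simply an additional ringX_addr per node in the corosync config (on PVE you edit /etc/pve/corosync.conf, which then gets distributed to the nodes). A rough sketch, with placeholder node names and addresses rather than my actual values:

Code:
nodelist {
  node {
    name: pve01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.0.2.11     # ring 0: VLAN on the 10 GbE LACP bond
    ring1_addr: 198.51.100.11  # ring 1: the 1 GbE active/backup bond
  }
  # ...one node block per cluster member...
}

totem {
  cluster_name: example
  config_version: 5            # must be incremented on every change
  version: 2
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}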

Any idea how this can be diagnosed? Since these PVE nodes are only Ceph clients (RBD), Ceph itself seems out of the picture, but if neither Ceph nor corosync causes the reboots, what does?
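In the meantime, these are the kinds of checks I plan to start with to catch an unexplained reboot (standard tools; the kdump part additionally needs a crashkernel= boot parameter):

Code:
# was the previous shutdown clean, and what was logged last?
journalctl --list-boots
journalctl -b -1 -e                 # tail end of the previous boot

# hardware-level events (power, thermal, watchdog resets) via the BMC
ipmitool sel elist

# capture kernel panics that never make it to disk
apt install kdump-tools             # plus crashkernel= on the kernel cmdline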

Thanks,
budy
 
Thanks for the offer. I have just had another instance of that, and this time I also got a fence notification. I probably didn't take into account that the guest that runs our internal mail relay could itself have been fenced along with the host it was running on… duh…

So this time I got a fence notification for one PVE node, and a look at its syslog and messages log revealed some issues with a Ceph node from the storage cluster, which I was rather surprised to see. It looks like the RADOS clients are kept pretty well informed about what is going on with the Ceph nodes. Anyway, this is what I found in messages and syslog:

Code:
Jul 14 01:43:32 proteus kernel: [257627.941803] libceph: osd14 (1)10.11.15.69:6803 socket closed (con state OPEN)
Jul 14 01:43:33 proteus kernel: [257628.127390] libceph: osd14 (1)10.11.15.69:6803 socket closed (con state CONNECTING)
Jul 14 01:43:33 proteus kernel: [257628.733854] libceph: osd14 (1)10.11.15.69:6803 socket closed (con state CONNECTING)
Jul 14 01:43:33 proteus kernel: [257628.824683] libceph: osd14 down
Jul 14 01:43:51 proteus kernel: [257646.434917] libceph: osd14 up
Jul 14 02:35:38 proteus kernel: [260753.606481] libceph: osd17 (1)10.11.15.69:6820 socket closed (con state OPEN)
Jul 14 02:35:39 proteus kernel: [260754.454034] libceph: osd17 down
Jul 14 02:35:59 proteus kernel: [260774.792993] libceph: osd17 up
Jul 14 02:35:59 proteus kernel: [260775.010850] libceph: osd16 (1)10.11.15.67:6815 socket closed (con state OPEN)
Jul 14 02:38:02 proteus kernel: [260897.858356] libceph: osd4 (1)10.11.15.67:6805 socket closed (con state OPEN)
Jul 14 02:39:34 proteus kernel: [260989.830535] libceph: osd3 (1)10.11.15.66:6803 socket closed (con state OPEN)
Jul 14 03:04:42 proteus kernel: [262497.716835] libceph: osd17 (1)10.11.15.69:6820 socket closed (con state OPEN)
Jul 14 03:04:42 proteus kernel: [262497.845329] libceph: osd17 (1)10.11.15.69:6820 socket closed (con state CONNECTING)
Jul 14 03:04:42 proteus kernel: [262498.338340] libceph: osd17 down
Jul 14 03:05:05 proteus kernel: [262520.983516] libceph: osd17 up
Jul 14 03:42:16 proteus kernel: [    0.000000] Linux version 5.4.44-2-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.44-2 (Wed, 01 Jul 2020 16:37:57 +0200) ()
Jul 14 03:42:16 proteus kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.44-2-pve root=/dev/mapper/pve-root ro quiet
Jul 14 03:42:16 proteus kernel: [    0.000000] KERNEL supported cpus:
Jul 14 03:42:16 proteus kernel: [    0.000000]   Intel GenuineIntel
Jul 14 03:42:16 proteus kernel: [    0.000000]   AMD AuthenticAMD
Jul 14 03:42:16 proteus kernel: [    0.000000]   Hygon HygonGenuine
Jul 14 03:42:16 proteus kernel: [    0.000000]   Centaur CentaurHauls
Jul 14 03:42:16 proteus kernel: [    0.000000]   zhaoxin   Shanghai

And the syslog…
Code:
Jul 14 03:38:05 proteus systemd[1]: check_mk@1246-10.11.14.63:6556-10.11.24.41:47023.service: Succeeded.
Jul 14 03:38:05 proteus systemd[1]: Started Check_MK (10.11.24.41:47023).
Jul 14 03:39:00 proteus systemd[1]: Starting Proxmox VE replication runner...
Jul 14 03:39:13 proteus pve-firewall[2117]: firewall update time (7.281 seconds)
Jul 14 03:39:13 proteus systemd[1]: pvesr.service: Succeeded.
Jul 14 03:39:13 proteus systemd[1]: Started Proxmox VE replication runner.
Jul 14 03:39:14 proteus pvestatd[2129]: status update time (17.512 seconds)
Jul 14 03:39:41 proteus pve-firewall[2117]: firewall update time (15.115 seconds)
Jul 14 03:40:00 proteus systemd[1]: Starting Proxmox VE replication runner...
Jul 14 03:40:02 proteus pvesr[32446]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 14 03:40:03 proteus pvesr[32446]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 14 03:40:04 proteus pvesr[32446]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 14 03:40:05 proteus pvesr[32446]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 14 03:40:06 proteus systemd[1]: pvesr.service: Succeeded.
Jul 14 03:40:06 proteus systemd[1]: Started Proxmox VE replication runner.
Jul 14 03:42:16 proteus lvm[758]:   1 logical volume(s) in volume group "pve" monitored
Jul 14 03:42:16 proteus systemd[1]: Starting Flush Journal to Persistent Storage...
Jul 14 03:42:16 proteus kernel: [    0.000000] Linux version 5.4.44-2-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.44-2 (Wed, 01 Jul 2020 16:37:57 +0200) ()
Jul 14 03:42:16 proteus kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.44-2-pve root=/dev/mapper/pve-root ro quiet
Jul 14 03:42:16 proteus kernel: [    0.000000] KERNEL supported cpus:

I then checked the Ceph node where these OSDs are located and noticed some OOM-killer messages, which always seem to come up while the backups are being made. I knew that RAM was rather tight on that node, especially considering it has 6 SSDs, each hosting an OSD. I will soon upgrade the RAM on this node to 64 or maybe 96 GB, and then we'll see if these OOMs go away.
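For anyone hitting the same, checks along these lines show the OOM hits and what the OSDs are currently allowed to use (the OSD ID and config keys below are just examples for a BlueStore setup, not my exact ones):

Code:
# which process did the OOM killer pick, and when?
journalctl -k | grep -i -e "out of memory" -e oom

# what is one of the OSDs on this node currently allowed to use?
ceph daemon osd.0 config show | grep -e osd_memory_target -e bluestore_cache_size

# rough rule of thumb: ~4 GB per OSD plus OS overhead, so 6 OSDs
# on one node want a good deal more than 32 GB of RAM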
 
I reduced bluestore_cache_size_ssd from 3 GB to 2 GB on the Ceph node with the tight memory situation and restarted all OSDs. This time there were no OOM kills while the backup ran. Interestingly, the memory used on that system has slowly crept back up to the same level as before, but… without the OOM kills. We'll see if this holds through the next backup run.
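In case someone wants to try the same: the value takes bytes and can be set either per node in ceph.conf or centrally in the monitor config database on newer releases (note that with bluestore_cache_autotune enabled, osd_memory_target is the knob that matters instead):

Code:
# centrally, via the monitor config database (value is in bytes)
ceph config set osd bluestore_cache_size_ssd 2147483648   # 2 GiB

# or per node in /etc/ceph/ceph.conf under [osd]:
#   bluestore_cache_size_ssd = 2147483648

# then restart the OSDs on that node so the change takes effect
systemctl restart ceph-osd.target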
 
