[SOLVED] Unexpected Cluster Reboot

drjaymz@

Member
Jan 19, 2022
All, I have a cluster of 3 Dell servers running about a dozen VMs and containers. The whole cluster rebooted itself and I cannot find a reason why. At the moment it looks like all 3 nodes rebooted at precisely the same time.

I have had this happen before (as expected) due to a watchdog timeout when networking was lost to all 3 nodes at the same time, but in that case you see evidence in the log. This time I can't find anything at all.
The following syslog entries look normal, then there's a gap, then it's just the kernel booting.

Node #1:
Code:
Jan 22 11:05:34 proxmoxy1 kernel: [3470160.691921] x86/split lock detection: #AC: CPU 2/KVM/372576 took a split_lock trap at address: 0x742caa8d
Jan 22 11:16:34 proxmoxy1 systemd-modules-load[4820]: Inserted module 'iscsi_tcp'

Node #2:
Code:
Jan 22 11:05:01 proxmoxy2 CRON[4155610]: (root) CMD (/mnt/pve/replica/scripts/svnCheckout.sh >/dev/null )
Jan 22 11:16:41 proxmoxy2 systemd-modules-load[4569]: Inserted module 'iscsi_tcp'

Node #3:
Code:
Jan 22 11:05:13 proxmoxy3 systemd[1]: session-611519.scope: Succeeded.
Jan 22 11:16:32 proxmoxy3 kernel: [    0.000000] Linux version 5.15.108-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110,
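
To make the gap in each node's log easier to spot, something like the following could be run over the syslog lines. This is only a rough sketch: the helper names parse_stamp and largest_gap are made up for illustration, and the year 2024 is an assumption taken from the iDRAC log, since classic syslog timestamps carry no year.

```python
from datetime import datetime

# Node #1 lines as quoted in this post (the address line was wrapped in the paste).
NODE1 = [
    "Jan 22 11:05:34 proxmoxy1 kernel: [3470160.691921] x86/split lock detection: "
    "#AC: CPU 2/KVM/372576 took a split_lock trap at address:",
    "0x742caa8d",
    "Jan 22 11:16:34 proxmoxy1 systemd-modules-load[4820]: Inserted module 'iscsi_tcp'",
]


def parse_stamp(line, year=2024):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line.

    Returns None for lines that do not start with a timestamp
    (for example wrapped continuation lines). The year is an assumption.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def largest_gap(lines, year=2024):
    """Return (before, after) for the widest gap between consecutive timestamps."""
    stamps = [t for t in (parse_stamp(l, year) for l in lines) if t is not None]
    best = None
    for prev, cur in zip(stamps, stamps[1:]):
        if best is None or (cur - prev) > (best[1] - best[0]):
            best = (prev, cur)
    return best


if __name__ == "__main__":
    before, after = largest_gap(NODE1)
    print(f"last entry {before.time()}, first entry after {after.time()}, gap {after - before}")
```

Running the same function over each node's full syslog and comparing the "last entry" times is one way to check whether all nodes really went silent within the same minute.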

They all go into the same switch, which reports the iDRAC interfaces going down at the same time as the main ports, which sounds like a power-off.
They are all connected to the same UPS, but no other hosts on that UPS were powered off, and since the servers have dual power supplies, the other PSU goes direct to AC mains. Again, nothing else indicates a power-off, and the UPS logs show none.

The iDRACs report similar:
2024-01-22 11:13:08  NIC101  The Embedded NIC 1 Port 2 network link is started.
2024-01-22 11:08:21  NIC100  The Embedded NIC 1 Port 2 network link is down.

Basically the NIC goes down and then comes back up, but I don't know whether a power cycle should show up there; it doesn't seem to.
One of the VMs logs to syslog every minute, and that shows the VM was actually running until 11:08:17, which is consistent with the iDRAC entry.

The VMs themselves have nothing in the task history except the start at 11:18.
I checked the output of "last -xF reboot shutdown | head", auth.log, the journal etc. and didn't find anything.

So it looks like a power loss to all 3 at the same time, followed by a restart about 5 minutes later.
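
As a quick sanity check on that timeline, the timestamps quoted in this post can be lined up like this. It is just a sketch of the arithmetic: the date on the VM's last log entry is assumed to be the same day as the iDRAC events, and the variable and function names are only illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps quoted in this post; the date of the VM log entry is an assumption.
last_vm_log = datetime.strptime("2024-01-22 11:08:17", FMT)  # last per-minute VM syslog entry
link_down = datetime.strptime("2024-01-22 11:08:21", FMT)    # iDRAC NIC100, link is down
link_up = datetime.strptime("2024-01-22 11:13:08", FMT)      # iDRAC NIC101, link is started


def seconds_between(start, end):
    """Whole seconds from start to end."""
    return int((end - start).total_seconds())


vm_to_down = seconds_between(last_vm_log, link_down)
outage = seconds_between(link_down, link_up)

if __name__ == "__main__":
    print(f"VM last logged {vm_to_down} s before the iDRAC link went down")
    print(f"iDRAC link was down for {outage} s ({outage // 60} min {outage % 60} s)")
```

The link-down window works out to a little under 5 minutes, which matches the "restart about 5 minutes later" reading.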

Does anyone have any other ideas of what I might be able to check? I'm really looking for any clues at all. Nearly everything I have so far indicates power loss, except that nothing else on the same circuit saw a power loss.
 
I visited the site. There had been a delivery, and boxes were placed in such a way as to snag the UPS outlet cable. So this was a power-off. The servers have redundant power supplies, but someone had put them all on the same UPS circuit, so I have now split them and locked the server room door.

Proxmox wasn't to blame and did fine; physical interference was the culprit.
 
