[SOLVED] Unexpected Cluster Reboot

drjaymz@ · Jan 23, 2024

All, I have a cluster of 3 Dell servers running about a dozen VMs and containers. The cluster rebooted itself and I cannot find a reason why. At the moment it looks like all 3 nodes rebooted at precisely the same time.

Previously, I have had this happen (as expected) due to a watchdog timeout when networking was lost to all 3 at the same time - but in that case you see evidence in the log. This time I can't find anything at all.
The following syslog entries look normal, then there's a gap, then it's just the kernel booting.

Node #1:

Code:

Jan 22 11:05:34 proxmoxy1 kernel: [3470160.691921] x86/split lock detection: #AC: CPU 2/KVM/372576 took a split_lock trap at address: 0x742caa8d
Jan 22 11:16:34 proxmoxy1 systemd-modules-load[4820]: Inserted module 'iscsi_tcp'

Node #2:

Code:

Jan 22 11:05:01 proxmoxy2 CRON[4155610]: (root) CMD (/mnt/pve/replica/scripts/svnCheckout.sh >/dev/null )
Jan 22 11:16:41 proxmoxy2 systemd-modules-load[4569]: Inserted module 'iscsi_tcp'

Node #3:

Code:

Jan 22 11:05:13 proxmoxy3 systemd[1]: session-611519.scope: Succeeded.
Jan 22 11:16:32 proxmoxy3 kernel: [    0.000000] Linux version 5.15.108-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110,
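
To put numbers on the gap, here is a minimal Python sketch that takes the last timestamp before the silence and the first one after it from each excerpt above. It was not part of the original troubleshooting: the helper names (`parse_syslog_ts`, `log_gaps`, `first_boot_spread`) are made up for illustration, and the year 2024 is assumed because classic syslog timestamps do not carry one.

```python
from datetime import datetime

# Last entry before the gap and first entry after it, copied from the
# three excerpts above. Syslog timestamps have no year, so 2024 is assumed.
EXCERPTS = {
    "proxmoxy1": ("Jan 22 11:05:34", "Jan 22 11:16:34"),
    "proxmoxy2": ("Jan 22 11:05:01", "Jan 22 11:16:41"),
    "proxmoxy3": ("Jan 22 11:05:13", "Jan 22 11:16:32"),
}


def parse_syslog_ts(stamp, year=2024):
    """Parse a 'Mon DD HH:MM:SS' stamp (English month names assumed)."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def log_gaps(excerpts):
    """Return the silent period per node as a timedelta."""
    return {
        node: parse_syslog_ts(after) - parse_syslog_ts(before)
        for node, (before, after) in excerpts.items()
    }


def first_boot_spread(excerpts):
    """Return how far apart the first post-boot entries are across nodes."""
    boots = [parse_syslog_ts(after) for _, after in excerpts.values()]
    return max(boots) - min(boots)


if __name__ == "__main__":
    for node, gap in log_gaps(EXCERPTS).items():
        print(node, gap)
    print("spread of first post-boot entries:", first_boot_spread(EXCERPTS))
```

The pre-gap entries are just whatever each node happened to log last, so the gap lengths are an upper bound on the downtime rather than the downtime itself. The post-boot entries are the more telling part: all three nodes come back within a few seconds of each other.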

They all go into the same switch, which reports the iDRAC interfaces going down at the same time as the main ports - which sounds like a power-off.
They are all connected to the same UPS, but no other hosts on that UPS were powered off. Also, since the servers have dual power supplies, the second PSU on each goes direct to AC mains. Again, nothing else indicates a power-off, and the UPS logs show none.

The iDRACs report something similar:

		2024-01-22 11:13:08	NIC101	The Embedded NIC 1 Port 2 network link is started.
		2024-01-22 11:08:21	NIC100	The Embedded NIC 1 Port 2 network link is down.

Basically the NIC goes down and then comes back up. I don't know whether the iDRAC log should also show a power cycle, but it doesn't seem to.
One of the VMs logs to syslog every minute, and that shows the VM was actually running until 11:08:17, which is consistent with the iDRAC entry.

The VMs themselves have nothing in the task history except the start at 11:18.
I checked last -xF reboot shutdown | head, auth.log, the journal etc. and didn't find anything.

So it looks like a power loss to all 3 at the same time, followed by a restart about 5 minutes later.
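
The "about 5 minutes" figure can be checked from the timestamps already quoted. The sketch below is only an illustration, assuming the iDRAC and VM clocks are reasonably in sync; the function name `outage_summary` is hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps quoted in this post (clocks assumed to be in sync).
LINK_DOWN = "2024-01-22 11:08:21"    # iDRAC NIC100, link is down
LINK_UP = "2024-01-22 11:13:08"      # iDRAC NIC101, link is started
VM_LAST_LOG = "2024-01-22 11:08:17"  # last once-a-minute syslog line from the VM


def outage_summary(link_down, link_up, vm_last_log):
    """Return (link outage, time from last VM log line to link down)."""
    down = datetime.strptime(link_down, FMT)
    up = datetime.strptime(link_up, FMT)
    last = datetime.strptime(vm_last_log, FMT)
    return up - down, down - last


if __name__ == "__main__":
    outage, lead = outage_summary(LINK_DOWN, LINK_UP, VM_LAST_LOG)
    print("link outage:", outage)              # 0:04:47
    print("VM last log before link down:", lead)  # 0:00:04
```

The link was down for 4 minutes 47 seconds, and the VM's last log line landed 4 seconds before the link dropped. Because the VM only logs once a minute, that is as close a match as the data allows.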

Does anyone have any other ideas of what I might be able to check? I'm really looking for any clues at all. Nearly everything I have so far indicates power loss, except for everything else on the same circuit that didn't see a power loss.

drjaymz@ · Jan 24, 2024

Visited the site. There had been a delivery and boxes were placed in such a way as to snag the UPS outlet cable. So this was a power off. The servers have redundant power and someone had put them all on the same UPS circuit, so now I split them and locked the server room door.

Proxmox wasn't to blame and did fine; physical interference was the culprit.
