Intermittent reboots of nodes in a cluster

inbfit

Member
May 23, 2020
Hi All,

I currently have 10 nodes in a Proxmox cluster, and they are randomly restarting; I'm not quite sure why. Power is plentiful, as other servers and switches in the rack have not shut down.

What kind of information should I give to help diagnose this?
 
Hi,

What kind of information should I give to help diagnose this?

Check out the syslog (or journalctl, if the systemd journal is persistent) around the time of a reboot; if the reboot was software-triggered, you may see something logged there.
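
For example, something like this (a minimal sketch; adjust the timestamps to your actual reboot window):

Bash:
# everything logged shortly before and after the reboot
journalctl --since "2020-05-23 08:40" --until "2020-05-23 09:10"
# or grep the plain syslog file for that window
grep 'May 23 08:5' /var/log/syslog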
 
I see that the time it rebooted today was about 9:00, but I don't particularly see anything that is off?
 
I see that the time it rebooted today was about 9:00, but I don't particularly see anything that is off?

Are the logs suddenly cut off, as in the node was hard reset (power loss, CPU watchdog, crash, …), or was a clean shutdown initiated?
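
If the journal is persistent, something along these lines can tell the two apart (a rough sketch):

Bash:
# list the boots the journal knows about
journalctl --list-boots
# last entries of the previous boot; a clean shutdown ends with the
# shutdown targets being reached, a hard reset just stops mid-stream
journalctl -b -1 -n 50
# wtmp view: explicit "shutdown" records mean a clean stop was logged
last -x shutdown reboot | head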
 
Here's what we get from the syslog.

May 23 08:44:00 systemd[1]: Starting Proxmox VE replication runner...
May 23 08:44:00 systemd[1]: Started Proxmox VE replication runner.
May 23 08:45:00 systemd[1]: Starting Proxmox VE replication runner...
May 23 08:45:00 systemd[1]: Started Proxmox VE replication runner.
May 23 08:51:46 systemd-modules-load[1404]: Inserted module 'iscsi_tcp'
May 23 08:51:46 kernel: [ 0.000000] Linux version 4.15.18-27-pve (build@pve) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.18-55 (Tue, 17 Mar 2020 15:32:02 +0100) ()
May 23 08:51:46 kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-27-pve root=/dev/mapper/pve-root ro quiet
May 23 08:51:46 systemd-modules-load[1404]: Inserted module 'ib_iser'
May 23 08:51:46 systemd[1]: Starting Flush Journal to Persistent Storage...
May 23 08:51:46 systemd[1]: Started Create Static Device Nodes in /dev.
May 23 08:51:46 kernel: [ 0.000000] KERNEL supported cpus:
May 23 08:51:46 kernel: [ 0.000000] Intel GenuineIntel
May 23 08:51:46 kernel: [ 0.000000] AMD AuthenticAMD
May 23 08:51:46 kernel: [ 0.000000] Centaur CentaurHauls
May 23 08:51:46 systemd[1]: Starting udev Kernel Device Manager...
May 23 08:51:46 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
May 23 08:51:46 systemd[1]: Started udev Coldplug all Devices.
 
May 23 08:45:00 systemd[1]: Started Proxmox VE replication runner.
May 23 08:51:46 systemd-modules-load[1404]: Inserted module 'iscsi_tcp'
May 23 08:51:46 kernel: [ 0.000000] Linux version 4.15.18-27-pve (build@pve) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.18-55 (Tue, 17 Mar 2020 15:32:02 +0100) ()

OK, so no normal shutdown: the log jumps from the 08:45 replication runner straight into the next boot at 08:51, so the node was reset the hard way. Do you use the HA manager?
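
If you are not sure, you can check on any node:

Bash:
# quorum, current master and the LRM state of each node
ha-manager status
# configured HA resources; empty output means HA is not in use
ha-manager config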
 
So the HA manager's watchdog could be the one resetting the node; not certain, but it would be plausible. Why it does that would then be the real question.

Any log messages from it prior to the reset?
What did another, working node see at and before the time this node reset?
 
I do see this

May 23 08:51:46 systemd[1]: Started Proxmox VE watchdog multiplexer.
May 23 08:51:46 watchdog-mux[2874]: Watchdog driver 'Software Watchdog', version 0
May 23 08:51:46 kernel: [ 0.139287] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
May 23 08:51:49 corosync[3869]: [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog systemd xmlconf qdevices qnetd snmp pie relro bindnow
May 23 08:51:49 corosync[3869]: info [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog systemd xmlconf qdevices qnetd snmp pie relro bindnow
May 23 08:51:49 corosync[3869]: notice [SERV ] Service engine loaded: corosync watchdog service [7]
May 23 08:51:49 corosync[3869]: [SERV ] Service engine loaded: corosync watchdog service [7]

on the one that just reset
 
I do see this

This is expected and normal...

In the log you should see something like:

status change active => lost_agent_lock
client watchdog expired - disable watchdog updates

A few seconds to minutes before the reset happened; otherwise, it may not have been the HA watchdog.
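
Something like this should surface them if they are there (a sketch; adjust the window to just before the reset):

Bash:
journalctl -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux \
  --since "2020-05-23 08:00" --until "2020-05-23 09:00" \
  | grep -Ei 'lost_agent_lock|watchdog'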
 
Sorry, yes, corosync has its own network; I just had to poke around quite a bit to see, but yes!
 
corosync has its own network
Is it using its own switch? I.e., not just VLAN-separated but physically separated?

I would monitor that network, mainly for latency spikes and packet bursts. Also monitor the logs and check for frequent "retransmit" messages.
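
A rough sketch of what I mean (node1, node2, node3 stand in for your actual node names; omping comes from the omping package and needs to run on all listed nodes at the same time):

Bash:
# link/ring status as corosync sees it
corosync-cfgtool -s
# count corosync retransmits over the last week
journalctl -u corosync --since -7day | grep -ci retransmit
# latency/loss test between the cluster nodes
omping -c 600 -i 1 -q node1 node2 node3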

Also enable persistent journaling; it may hold slightly more info from before a reset, as we tell the journal to flush logs to disk when we do a watchdog reset, so if the watchdog is the cause, it normally shows up there. To enable persistent journaling, simply do:
mkdir /var/log/journal/ && systemctl restart systemd-journald.service

Then you can check the logs with journalctl; it has some nice switches for limiting the time range or the services shown.

Bash:
journalctl -b   # current boot
journalctl -f   # follow along
 # ha/cluster related stuff from the last 3 days:
journalctl -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux.service --since -3day
 
Yes, they are on two separate physical 40G switches on two different networks.

Would "retransmit" appear in the journalctl logs? i do see one of the servers that has persistant journaling on it what about the word retrans?
like in corosync[3556]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1?
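
For reference, this is roughly the search I have been running on the node with persistent journaling:

Bash:
journalctl -u corosync --since -3day | grep -iE 'retrans'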

Thank you for the bash commands; this most definitely helps with searching.
 
