Intermittent Reboots of nodes in cluster

inbfit
May 23, 2020
Hi All,

I currently have 10 nodes in a Proxmox cluster, and they are randomly restarting; I'm not quite sure why. Power is plentiful, as other servers and switches in the rack have not shut down.

What kind of information should I provide to help diagnose this?
 
Hi,

What kind of information should I provide to help diagnose this?

Check out the syslog (or journalctl, if the systemd journal is persistent) around the time of a reboot; if it was software triggered, you may see something logged there.
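
For example, something along these lines (just a sketch - adjust the date and times to your own reboot window):

Bash:
# syslog entries around the reboot window:
grep 'May 23 08:5' /var/log/syslog
# or, with a persistent journal, by explicit time range:
journalctl --since "2020-05-23 08:40" --until "2020-05-23 09:10"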
 
The reboot today was at about 9:00, but I don't particularly see anything that is off?
 
The reboot today was at about 9:00, but I don't particularly see anything that is off?

Are the logs suddenly cut off, as in the node was hard reset (power loss, CPU watchdog, crash, …), or was a clean shutdown initiated?
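
One quick way to tell (a sketch; the exact output format varies a bit) is the wtmp reboot/shutdown history - a clean shutdown leaves a matching "shutdown" record, while a hard reset usually does not:

Bash:
# recent reboot and shutdown records from wtmp:
last -x shutdown reboot | head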
 
Here's what we get from the syslog.

May 23 08:44:00 systemd[1]: Starting Proxmox VE replication runner...
May 23 08:44:00 systemd[1]: Started Proxmox VE replication runner.
May 23 08:45:00 systemd[1]: Starting Proxmox VE replication runner...
May 23 08:45:00 systemd[1]: Started Proxmox VE replication runner.
May 23 08:51:46 systemd-modules-load[1404]: Inserted module 'iscsi_tcp'
May 23 08:51:46 kernel: [ 0.000000] Linux version 4.15.18-27-pve (build@pve) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.18-55 (Tue, 17 Mar 2020 15:32:02 +0100) ()
May 23 08:51:46 kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-27-pve root=/dev/mapper/pve-root ro quiet
May 23 08:51:46 systemd-modules-load[1404]: Inserted module 'ib_iser'
May 23 08:51:46 systemd[1]: Starting Flush Journal to Persistent Storage...
May 23 08:51:46 systemd[1]: Started Create Static Device Nodes in /dev.
May 23 08:51:46 kernel: [ 0.000000] KERNEL supported cpus:
May 23 08:51:46 kernel: [ 0.000000] Intel GenuineIntel
May 23 08:51:46 kernel: [ 0.000000] AMD AuthenticAMD
May 23 08:51:46 kernel: [ 0.000000] Centaur CentaurHauls
May 23 08:51:46 systemd[1]: Starting udev Kernel Device Manager...
May 23 08:51:46 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
May 23 08:51:46 systemd[1]: Started udev Coldplug all Devices.
 
May 23 08:45:00 systemd[1]: Started Proxmox VE replication runner.
May 23 08:51:46 systemd-modules-load[1404]: Inserted module 'iscsi_tcp'
May 23 08:51:46 kernel: [ 0.000000] Linux version 4.15.18-27-pve (build@pve) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.18-55 (Tue, 17 Mar 2020 15:32:02 +0100) ()

OK, so no normal shutdown - it was reset the hard way... Do you use the HA manager?
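
If unsure, a quick way to check (a minimal sketch) is the HA status and whether the watchdog multiplexer is running:

Bash:
# shows quorum/manager state and configured HA resources (mostly empty if HA is unused):
ha-manager status
# the watchdog multiplexer that HA relies on:
systemctl status watchdog-mux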
 
So the HA manager's watchdog could be the one resetting the node - not certain, but it would be plausible. Why it does that would then be the real question.

Any log messages from it prior to the reset?
What did another, working node see at and before the time this node reset?
 
I do see this

May 23 08:51:46 systemd[1]: Started Proxmox VE watchdog multiplexer.
May 23 08:51:46 watchdog-mux[2874]: Watchdog driver 'Software Watchdog', version 0
May 23 08:51:46 kernel: [ 0.139287] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
May 23 08:51:49 corosync[3869]: [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog systemd xmlconf qdevices qnetd snmp pie relro bindnow
May 23 08:51:49 corosync[3869]: info [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog systemd xmlconf qdevices qnetd snmp pie relro bindnow
May 23 08:51:49 corosync[3869]: notice [SERV ] Service engine loaded: corosync watchdog service [7]
May 23 08:51:49 corosync[3869]: [SERV ] Service engine loaded: corosync watchdog service [7]

on the one that just reset
 
I do see this

This is expected and normal...

In the log you should see something like:

status change active => lost_agent_lock
client watchdog expired - disable watchdog updates

These should appear a few seconds to minutes before the reset happened; otherwise, it may not have been the HA watchdog.
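
A quick way to search for those messages (a sketch - zgrep also covers the rotated, compressed syslog files, which should still hold entries from before the reset):

Bash:
# look for HA watchdog activity in current and rotated syslogs:
zgrep -E 'lost_agent_lock|watchdog expired' /var/log/syslog*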
 
Sorry, yes, corosync has its own network. I just had to poke around quite a bit to see, but yes!
 
corosync has its own network
Is it using its own switch? I.e., not just VLAN separated but physically separated?

I would monitor that network, mainly for latency spikes and packet bursts. Also monitor the logs and check for frequent "retransmit" messages.
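
For the latency side, a rough sketch (the node names here are placeholders - use your own corosync hostnames, and run omping on each node at the same time):

Bash:
# link/ring status as corosync sees it:
corosync-cfgtool -s
# multicast/unicast latency test between cluster nodes:
omping -c 600 -i 1 -q node1 node2 node3
# frequent retransmits in the corosync logs are a warning sign:
journalctl -u corosync | grep -i retransmit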

Also enable persistent journaling; it may hold slightly more info from before a reset, as we tell it to flush logs to disk when we do a watchdog reset, so if that is the cause it normally really shows up there. To enable persistent journaling, simply do:
Bash:
mkdir /var/log/journal/ && systemctl restart systemd-journald.service

Then you can check out the logs with journalctl - it has some nice switches for limiting time range or services to show.

Bash:
journalctl -b   # current boot
journalctl -f   # follow along
# HA/cluster related entries from the last 3 days:
journalctl -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux.service --since -3day
 
Yes, they are on two separate physical 40G switches on two different networks.

Would "retransmit" appear in the journalctl logs? i do see one of the servers that has persistant journaling on it what about the word retrans?
like in corosync[3556]: [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1?

Thank you for the bash commands; they most definitely help with searching.
 
