Cluster node member - sudden reboot

AlexLup

Hi,
I have a node that all of a sudden began acting up. Memory pressure has been high on all 3 nodes (90%+) since the newest Luminous release, which I attribute to Ceph reserving memory for future use(?), but the 2 other identical nodes don't have this problem and are working perfectly.
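To check whether it really is Ceph holding that memory, I was planning to ask one of the OSDs directly over its admin socket (the BlueStore option names below are just what I believe Luminous uses; adjust the OSD id):

Code:
# Effective BlueStore cache settings of a running OSD
ceph daemon osd.0 config show | grep bluestore_cache
# Memory actually used per pool inside that OSD
ceph daemon osd.0 dump_mempools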


So I am pasting an excerpt from syslog, since I don't know whether your Bugzilla allows the full log (100 MB) to be uploaded.

pve-manager/5.3-6/37b3c8df (running kernel: 4.15.18-9-pve)

Syslog:
Code:
root@pve23:~# cat /var/log/syslog.1 | grep "Jan  7 04:"
Jan  7 04:00:00 pve23 systemd[1]: Starting Proxmox VE replication runner...
Jan  7 04:00:01 pve23 systemd[1]: Started Proxmox VE replication runner.
Jan  7 04:01:00 pve23 systemd[1]: Starting Proxmox VE replication runner...
Jan  7 04:01:01 pve23 systemd[1]: Started Proxmox VE replication runner.
Jan  7 04:02:00 pve23 systemd[1]: Starting Proxmox VE replication runner...
Jan  7 04:02:01 pve23 systemd[1]: Started Proxmox VE replication runner.
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
....

Jan  7 04:03:05 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbad7 4dbad8 4dbad9 4dbada 4dbacd 4dbadb 4dbadc 4dbadd 4dbade 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5
Jan  7 04:03:05 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbad7 4dbad8 4dbad9 4dbada 4dbacd 4dbadb 4dbadc 4dbadd 4dbade 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5
Jan  7 04:03:05 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbaf5 4dbacd 4dbadb 4dbadc 4dbadd 4dbade 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7 4dbae8 4dbae9
Jan  7 04:03:05 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbaf5 4dbacd 4dbadb 4dbadc 4dbadd 4dbade 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7 4dbae8 4dbae9
Jan  7 04:03:05 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbad8 4dbad9 4dbada 4dbaf3 4dbaf4 4dbacd 4dbadb 4dbadc 4dbadd 4dbade 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6
Jan  7 04:03:06 pve23 corosync[1772]: error   [TOTEM ] FAILED TO RECEIVE
Jan  7 04:03:06 pve23 corosync[1772]:  [TOTEM ] FAILED TO RECEIVE
Jan  7 04:03:08 pve23 corosync[1772]: notice  [TOTEM ] A new membership (192.168.1.23:1112) was formed. Members left: 3 4
Jan  7 04:03:08 pve23 corosync[1772]: notice  [TOTEM ] Failed to receive the leave message. failed: 3 4
Jan  7 04:03:08 pve23 corosync[1772]:  [TOTEM ] A new membership (192.168.1.23:1112) was formed. Members left: 3 4
Jan  7 04:03:08 pve23 corosync[1772]:  [TOTEM ] Failed to receive the leave message. failed: 3 4
Jan  7 04:03:08 pve23 corosync[1772]:  [CPG   ] downlist left_list: 2 received
Jan  7 04:03:08 pve23 corosync[1772]: warning [CPG   ] downlist left_list: 2 received
Jan  7 04:03:08 pve23 corosync[1772]: notice  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jan  7 04:03:08 pve23 corosync[1772]: notice  [QUORUM] Members[1]: 2
Jan  7 04:03:08 pve23 corosync[1772]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jan  7 04:03:08 pve23 corosync[1772]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jan  7 04:03:08 pve23 corosync[1772]:  [QUORUM] Members[1]: 2
Jan  7 04:03:08 pve23 corosync[1772]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jan  7 04:03:08 pve23 pmxcfs[1720]: [dcdb] notice: members: 2/1720
Jan  7 04:03:08 pve23 pmxcfs[1720]: [status] notice: node lost quorum
Jan  7 04:03:08 pve23 pmxcfs[1720]: [status] notice: members: 2/1720
Jan  7 04:03:08 pve23 pmxcfs[1720]: [dcdb] crit: received write while not quorate - trigger resync
Jan  7 04:03:08 pve23 pmxcfs[1720]: [dcdb] crit: leaving CPG group
Jan  7 04:03:08 pve23 systemd[1]: Started Proxmox VE replication runner.
Jan  7 04:03:08 pve23 pve-ha-lrm[4378]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve23/lrm_status.tmp.4378' - Permission denied
Jan  7 04:03:08 pve23 pmxcfs[1720]: [dcdb] notice: start cluster connection
Jan  7 04:03:08 pve23 pmxcfs[1720]: [dcdb] notice: members: 2/1720
Jan  7 04:03:08 pve23 pmxcfs[1720]: [dcdb] notice: all data is up to date
Jan  7 04:03:13 pve23 pve-ha-crm[2423]: status change slave => wait_for_quorum
Jan  7 04:03:18 pve23 pve-ha-lrm[4378]: lost lock 'ha_agent_pve23_lock - cfs lock update failed - Permission denied
Jan  7 04:03:23 pve23 pve-ha-lrm[4378]: status change active => lost_agent_lock

<<<<< REBOOT ?

^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ ...
Jan  7 04:05:18 pve23 systemd-modules-load[393]: Inserted module 'iscsi_tcp'
Jan  7 04:05:18 pve23 systemd-modules-load[393]: Inserted module 'ib_iser'
Jan  7 04:05:18 pve23 systemd-modules-load[393]: Inserted module 'vhost_net'
Jan  7 04:05:18 pve23 keyboard-setup.sh[384]: cannot open file /tmp/tmpkbd.AuAvPt
Jan  7 04:05:18 pve23 systemd[1]: Starting Flush Journal to Persistent Storage...
Jan  7 04:05:18 pve23 systemd[1]: Started udev Coldplug all Devices.
Jan  7 04:05:18 pve23 systemd[1]: Starting udev Wait for Complete Device Initialization...
Jan  7 04:05:18 pve23 systemd[1]: Started Flush Journal to Persistent Storage.
Jan  7 04:05:18 pve23 systemd-modules-load[393]: Inserted module 'zfs'
Jan  7 04:05:18 pve23 systemd[1]: Started Load Kernel Modules.
Jan  7 04:05:18 pve23 systemd[1]: Mounting Configuration File System...
Jan  7 04:05:18 pve23 systemd[1]: Mounting FUSE Control File System...
Jan  7 04:05:18 pve23 systemd[1]: Starting Apply Kernel Variables...
Jan  7 04:05:18 pve23 systemd[1]: Listening on Load/Save RF Kill Switch Status /dev/rfkill Watch.
Jan  7 04:05:18 pve23 kernel: [    0.000000] Linux version 4.15.18-9-pve (build@pve) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.18-30 (Thu, 15 Nov 2018 13:32:46 +0100) ()
Jan  7 04:05:18 pve23 systemd[1]: apt-daily-upgrade.timer: Adding 21min 59.729743s random time.
Jan  7 04:05:18 pve23 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-9-pve root=/dev/mapper/pve-root ro quiet
Jan  7 04:05:18 pve23 kernel: [    0.000000] KERNEL supported cpus:
Jan  7 04:05:18 pve23 kernel: [    0.000000]   Intel GenuineIntel
Jan  7 04:05:18 pve23 kernel: [    0.000000]   AMD AuthenticAMD
Jan  7 04:05:18 pve23 systemd[1]: Started Daily apt upgrade and clean activities.
Jan  7 04:05:18 pve23 kernel: [    0.000000]   Centaur CentaurHauls
Jan  7 04:05:18 pve23 kernel: [    0.000000] x86/fpu: x87 FPU will use FXSAVE
 
Hi,

Your node was fenced because it lost quorum, and it looks like your network is the problem.

Retransmit List: 4dbad8 4dbad9 4dbada 4dbaf3 4dbaf4 4dbacd 4dbadb 4dbadc 4dbadd 4dbade 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6
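You can check the quorum and membership state on each node with, for example (just a quick check):

Code:
# Cluster/quorum state as seen by this node
pvecm status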
 
That is strange, as the traffic to and from that node looked fine according to the RRD graphs, and the other nodes on the same switch worked just fine. So where do I go from here? Should I replace my NIC?
 
Do you have everything on the same NIC?
If so, then this is normal.
It is vital for HA to have a separate corosync network, and this network should also be redundant.
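To verify the corosync network itself, you can measure latency and multicast between all nodes, for example with omping (run it on all nodes at the same time; the hostnames here are only placeholders):

Code:
# ~10 second burst test for latency and multicast
omping -c 10000 -i 0.001 -F -q pve21 pve22 pve23
# ~10 minute test to catch intermittent problems
omping -c 600 -i 1 -q pve21 pve22 pve23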
 
Do you have everything on the same NIC?
If so, then this is normal.
It is vital for HA to have a separate corosync network, and this network should also be redundant.
I have split the cluster and public networking (1x 10 Gbit NIC and 1x 1 Gbit NIC).

I don't necessarily see a forced reboot as something normal, though?
 
I have split the cluster and public networking (1x 10 Gbit NIC and 1x 1 Gbit NIC).
You have to explain in a little more detail which service runs on which network.

I don't necessarily see a forced reboot as something normal, though?
If you have HA active and operate under the required hardware specification, it is normal.
Or better: it is expected with this setup.
 
Hi,
The corosync network is on the 10 Gbit NIC, on the 172.16 net. The public network is the 1 Gbit one, on the 192.168 net.
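To double-check which address corosync is actually bound to, I guess I can look at the ring status and the config (my assumption of the right commands):

Code:
# Local ring/interface status as corosync sees it
corosync-cfgtool -s
# The cluster-wide corosync config managed by Proxmox
cat /etc/pve/corosync.conf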

I had no idea that it was both normal and expected for a node to magically reboot itself. The way I see it, the network worked just fine; it was just corosync that messed up, so no reboot was needed, IMHO.
 
I had no idea that it was both normal and expected for a node to magically reboot itself.
As far as I can see from the logs, your node was fenced. I guess you have HA activated?
If HA is active, or was formerly activated, a working cluster network is essential for it [1].
On the Wikipedia page [2], under principle 3, "Detection of failures as they occur", you can see that a cluster communication issue is a failure and must be handled.
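You can see whether HA is (or was) configured, and how the manager sees the nodes, with for example:

Code:
# Current HA manager view (quorum, node states, resources)
ha-manager status
# Resources configured for HA
ha-manager config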

The way I see it, the network worked just fine; it was just corosync that messed up, so no reboot was needed, IMHO.
Your logs show packet retransmits. Retransmits never happen on a working network.
A non-working network also means the latency gets too high, which happens if you transmit a huge amount of storage data.

see
1.) https://pve.proxmox.com/wiki/High_Availability
2.) https://en.wikipedia.org/wiki/High_availability
 
Thank you. I will check the Ceph logs further to see if the 172.16 net is dropping packets or carrying heavy traffic.

However, corosync should speak on both nets as they are in a totem, I am thinking.
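What I had in mind is a second ring in the totem section, roughly like this (only a sketch from memory for corosync 2.x; the addresses and cluster name are made up, and every node would also need a ring1_addr in the nodelist):

Code:
totem {
  version: 2
  cluster_name: mycluster      # made-up name
  rrp_mode: passive            # redundant ring protocol in corosync 2.x
  interface {
    ringnumber: 0
    bindnetaddr: 172.16.0.0    # corosync/cluster net (assumed)
  }
  interface {
    ringnumber: 1
    bindnetaddr: 192.168.1.0   # public net as fallback (assumed)
  }
}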
 
