Cluster node member - sudden reboot

AlexLup

Hi,
I have a node that all of a sudden began acting up. Memory pressure has been high on all 3 nodes (90%+) since the newest Luminous release, which I attribute to Ceph reserving memory for future use(?), but the 2 other identical nodes don't have this problem and are working perfectly.
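To check whether it really is Ceph holding that memory, I was planning to ask one of the OSDs directly over its admin socket (the BlueStore option names below are just what I believe Luminous uses; adjust the OSD id):

Code:
# Effective BlueStore cache settings of a running OSD
ceph daemon osd.0 config show | grep bluestore_cache
# Memory actually used per pool inside that OSD
ceph daemon osd.0 dump_mempools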


So I am pasting an excerpt from syslog, since I don't know whether your Bugzilla allows the full log (100 MB) to be uploaded.

pve-manager/5.3-6/37b3c8df (running kernel: 4.15.18-9-pve)

Syslog:
Code:
root@pve23:~# cat /var/log/syslog.1 | grep "Jan  7 04:"
Jan  7 04:00:00 pve23 systemd[1]: Starting Proxmox VE replication runner...
Jan  7 04:00:01 pve23 systemd[1]: Started Proxmox VE replication runner.
Jan  7 04:01:00 pve23 systemd[1]: Starting Proxmox VE replication runner...
Jan  7 04:01:01 pve23 systemd[1]: Started Proxmox VE replication runner.
Jan  7 04:02:00 pve23 systemd[1]: Starting Proxmox VE replication runner...
Jan  7 04:02:01 pve23 systemd[1]: Started Proxmox VE replication runner.
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
Jan  7 04:02:58 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7
....

Jan  7 04:03:05 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbad7 4dbad8 4dbad9 4dbada 4dbacd 4dbadb 4dbadc 4dbadd 4dbade 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5
Jan  7 04:03:05 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbad7 4dbad8 4dbad9 4dbada 4dbacd 4dbadb 4dbadc 4dbadd 4dbade 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5
Jan  7 04:03:05 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbaf5 4dbacd 4dbadb 4dbadc 4dbadd 4dbade 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7 4dbae8 4dbae9
Jan  7 04:03:05 pve23 corosync[1772]:  [TOTEM ] Retransmit List: 4dbaf5 4dbacd 4dbadb 4dbadc 4dbadd 4dbade 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6 4dbae7 4dbae8 4dbae9
Jan  7 04:03:05 pve23 corosync[1772]: notice  [TOTEM ] Retransmit List: 4dbad8 4dbad9 4dbada 4dbaf3 4dbaf4 4dbacd 4dbadb 4dbadc 4dbadd 4dbade 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6
Jan  7 04:03:06 pve23 corosync[1772]: error   [TOTEM ] FAILED TO RECEIVE
Jan  7 04:03:06 pve23 corosync[1772]:  [TOTEM ] FAILED TO RECEIVE
Jan  7 04:03:08 pve23 corosync[1772]: notice  [TOTEM ] A new membership (192.168.1.23:1112) was formed. Members left: 3 4
Jan  7 04:03:08 pve23 corosync[1772]: notice  [TOTEM ] Failed to receive the leave message. failed: 3 4
Jan  7 04:03:08 pve23 corosync[1772]:  [TOTEM ] A new membership (192.168.1.23:1112) was formed. Members left: 3 4
Jan  7 04:03:08 pve23 corosync[1772]:  [TOTEM ] Failed to receive the leave message. failed: 3 4
Jan  7 04:03:08 pve23 corosync[1772]:  [CPG   ] downlist left_list: 2 received
Jan  7 04:03:08 pve23 corosync[1772]: warning [CPG   ] downlist left_list: 2 received
Jan  7 04:03:08 pve23 corosync[1772]: notice  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jan  7 04:03:08 pve23 corosync[1772]: notice  [QUORUM] Members[1]: 2
Jan  7 04:03:08 pve23 corosync[1772]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jan  7 04:03:08 pve23 corosync[1772]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jan  7 04:03:08 pve23 corosync[1772]:  [QUORUM] Members[1]: 2
Jan  7 04:03:08 pve23 corosync[1772]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jan  7 04:03:08 pve23 pmxcfs[1720]: [dcdb] notice: members: 2/1720
Jan  7 04:03:08 pve23 pmxcfs[1720]: [status] notice: node lost quorum
Jan  7 04:03:08 pve23 pmxcfs[1720]: [status] notice: members: 2/1720
Jan  7 04:03:08 pve23 pmxcfs[1720]: [dcdb] crit: received write while not quorate - trigger resync
Jan  7 04:03:08 pve23 pmxcfs[1720]: [dcdb] crit: leaving CPG group
Jan  7 04:03:08 pve23 systemd[1]: Started Proxmox VE replication runner.
Jan  7 04:03:08 pve23 pve-ha-lrm[4378]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve23/lrm_status.tmp.4378' - Permission denied
Jan  7 04:03:08 pve23 pmxcfs[1720]: [dcdb] notice: start cluster connection
Jan  7 04:03:08 pve23 pmxcfs[1720]: [dcdb] notice: members: 2/1720
Jan  7 04:03:08 pve23 pmxcfs[1720]: [dcdb] notice: all data is up to date
Jan  7 04:03:13 pve23 pve-ha-crm[2423]: status change slave => wait_for_quorum
Jan  7 04:03:18 pve23 pve-ha-lrm[4378]: lost lock 'ha_agent_pve23_lock - cfs lock update failed - Permission denied
Jan  7 04:03:23 pve23 pve-ha-lrm[4378]: status change active => lost_agent_lock

<<<<< REBOOT ?

^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ ...
Jan  7 04:05:18 pve23 systemd-modules-load[393]: Inserted module 'iscsi_tcp'
Jan  7 04:05:18 pve23 systemd-modules-load[393]: Inserted module 'ib_iser'
Jan  7 04:05:18 pve23 systemd-modules-load[393]: Inserted module 'vhost_net'
Jan  7 04:05:18 pve23 keyboard-setup.sh[384]: cannot open file /tmp/tmpkbd.AuAvPt
Jan  7 04:05:18 pve23 systemd[1]: Starting Flush Journal to Persistent Storage...
Jan  7 04:05:18 pve23 systemd[1]: Started udev Coldplug all Devices.
Jan  7 04:05:18 pve23 systemd[1]: Starting udev Wait for Complete Device Initialization...
Jan  7 04:05:18 pve23 systemd[1]: Started Flush Journal to Persistent Storage.
Jan  7 04:05:18 pve23 systemd-modules-load[393]: Inserted module 'zfs'
Jan  7 04:05:18 pve23 systemd[1]: Started Load Kernel Modules.
Jan  7 04:05:18 pve23 systemd[1]: Mounting Configuration File System...
Jan  7 04:05:18 pve23 systemd[1]: Mounting FUSE Control File System...
Jan  7 04:05:18 pve23 systemd[1]: Starting Apply Kernel Variables...
Jan  7 04:05:18 pve23 systemd[1]: Listening on Load/Save RF Kill Switch Status /dev/rfkill Watch.
Jan  7 04:05:18 pve23 kernel: [    0.000000] Linux version 4.15.18-9-pve (build@pve) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.18-30 (Thu, 15 Nov 2018 13:32:46 +0100) ()
Jan  7 04:05:18 pve23 systemd[1]: apt-daily-upgrade.timer: Adding 21min 59.729743s random time.
Jan  7 04:05:18 pve23 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.18-9-pve root=/dev/mapper/pve-root ro quiet
Jan  7 04:05:18 pve23 kernel: [    0.000000] KERNEL supported cpus:
Jan  7 04:05:18 pve23 kernel: [    0.000000]   Intel GenuineIntel
Jan  7 04:05:18 pve23 kernel: [    0.000000]   AMD AuthenticAMD
Jan  7 04:05:18 pve23 systemd[1]: Started Daily apt upgrade and clean activities.
Jan  7 04:05:18 pve23 kernel: [    0.000000]   Centaur CentaurHauls
Jan  7 04:05:18 pve23 kernel: [    0.000000] x86/fpu: x87 FPU will use FXSAVE
 
Hi,

Your node was fenced because it lost quorum, and it looks like your network is the problem.

Retransmit List: 4dbad8 4dbad9 4dbada 4dbaf3 4dbaf4 4dbacd 4dbadb 4dbadc 4dbadd 4dbade 4dbadf 4dbae0 4dbae1 4dbae2 4dbae3 4dbae4 4dbae5 4dbae6
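You can check the quorum and membership state on each node with, for example (just a quick check):

Code:
# Cluster/quorum state as seen by this node
pvecm status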
 
That is strange, as the traffic to and from that node looked fine according to the RRD graphs, and the other nodes on the same switch worked just fine. So where do I go from here? Should I replace my NIC?
 
Do you have everything on the same NIC?
If so, then this is normal.
It is vital for HA to have a separate corosync network, and this network should also be redundant.
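To verify the corosync network itself, you can measure latency and multicast between all nodes, for example with omping (run it on all nodes at the same time; the hostnames here are only placeholders):

Code:
# ~10 second burst test for latency and multicast
omping -c 10000 -i 0.001 -F -q pve21 pve22 pve23
# ~10 minute test to catch intermittent problems
omping -c 600 -i 1 -q pve21 pve22 pve23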
 
Do you have everything on the same NIC?
If so, then this is normal.
It is vital for HA to have a separate corosync network, and this network should also be redundant.
I have split the cluster and public networking (1x 10 Gbit NIC and 1x 1 Gbit NIC).

I don't necessarily see a forced reboot as something normal, though?
 
I have split the cluster and public networking (1x 10 Gbit NIC and 1x 1 Gbit NIC).
You have to explain in a little more detail which service runs on which network.

I don't necessarily see a forced reboot as something normal, though?
If you have HA active and operate under the required hardware specification, it is normal.
Or better: it is expected with this setup.
 
Hi,
The corosync network is on the 10 Gbit NIC, on the 172.16 net. The public network is the 1 Gbit one, on the 192.168 net.
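To double-check which address corosync is actually bound to, I guess I can look at the ring status and the config (my assumption of the right commands):

Code:
# Local ring/interface status as corosync sees it
corosync-cfgtool -s
# The cluster-wide corosync config managed by Proxmox
cat /etc/pve/corosync.conf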

I had no idea that it was both normal and expected for a node to magically reboot itself. The way I see it, the network worked just fine; it was just corosync that messed up, so no reboot was needed, IMHO.
 
I had no idea that it was both normal and expected for a node to magically reboot itself.
As far as I can see from the logs, your node was fenced. I guess you have HA activated?
If HA is active, or was formerly activated, a working cluster network is essential for it [1].
On the Wikipedia page [2], under principle 3, "Detection of failures as they occur", you can see that a cluster communication issue is a failure and must be handled.
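You can see whether HA is (or was) configured, and how the manager sees the nodes, with for example:

Code:
# Current HA manager view (quorum, node states, resources)
ha-manager status
# Resources configured for HA
ha-manager config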

The way I see it, the network worked just fine; it was just corosync that messed up, so no reboot was needed, IMHO.
Your logs show packet retransmits. Retransmits never happen on a working network.
A non-working network also means the latency gets too high, which happens if you transmit a huge amount of storage data.

see
1.) https://pve.proxmox.com/wiki/High_Availability
2.) https://en.wikipedia.org/wiki/High_availability
 
Thank you. I will check the Ceph logs further to see if the 172.16 net is dropping packets or carrying heavy traffic.

However, corosync should speak on both nets as they are in a totem, I am thinking.
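What I had in mind is a second ring in the totem section, roughly like this (only a sketch from memory for corosync 2.x; the addresses and cluster name are made up, and every node would also need a ring1_addr in the nodelist):

Code:
totem {
  version: 2
  cluster_name: mycluster      # made-up name
  rrp_mode: passive            # redundant ring protocol in corosync 2.x
  interface {
    ringnumber: 0
    bindnetaddr: 172.16.0.0    # corosync/cluster net (assumed)
  }
  interface {
    ringnumber: 1
    bindnetaddr: 192.168.1.0   # public net as fallback (assumed)
  }
}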
 
