Node hard reboot after quorum loss?

Stereo973

Hello!

I hope you are doing well! First of all, sorry for my bad English!

At my company we run a 3-node Proxmox cluster in production, with only LXC containers and no Ceph (we use ZFS replication).
The first two nodes are hosted at OVH and the third at Online.net for HA reasons; corosync uses only link0, which runs over the public NIC.
Last evening node 3 experienced a series of short network failures. The first two nodes detected this without problems and everything kept working.

But after about 3 minutes of node 3 flapping (lost, back again, lost again...), node 1 and node 2 hard rebooted.
I checked the syslog on each server: there was no SIGTERM / clean shutdown.
I experienced something similar once in the past with a two-node PVE cluster: when one node rebooted, the second one rebooted too.

Do you have any idea what happened? I'm a little bit lost...
Thank you in advance for your help!

PVE1 Package Version
Code:
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-3
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1

PVE1 Syslog just before hard reboot
Code:
Nov 28 20:10:52 pve1 pmxcfs[2141]: [status] notice: cpg_send_message retry 80
Nov 28 20:10:53 pve1 pmxcfs[2141]: [status] notice: cpg_send_message retry 90
Nov 28 20:10:53 pve1 corosync[2231]:   [TOTEM ] Token has not been received in 1250 ms
Nov 28 20:10:54 pve1 snmpd[2003]: error on subcontainer 'ia_addr' insert (-1)
Nov 28 20:10:54 pve1 pmxcfs[2141]: [status] notice: cpg_send_message retry 100
Nov 28 20:10:54 pve1 pmxcfs[2141]: [status] notice: cpg_send_message retried 100 times
Nov 28 20:10:54 pve1 pmxcfs[2141]: [status] crit: cpg_send_message failed: 6
Nov 28 20:10:55 pve1 corosync[2231]:   [TOTEM ] Token has not been received in 2948 ms
Nov 28 20:10:55 pve1 pmxcfs[2141]: [status] notice: cpg_send_message retry 10
Nov 28 20:10:55 pve1 corosync[2231]:   [TOTEM ] A new membership (1:520) was formed. Members
Nov 28 20:10:56 pve1 pmxcfs[2141]: [status] notice: cpg_send_message retry 20
Nov 28 20:12:39 pve1 systemd-modules-load[1518]: Inserted module 'iscsi_tcp'
Nov 28 20:12:39 pve1 kernel: [    0.000000] Linux version 5.0.15-1-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.0.15-1 (Wed, 03 Jul 2019 10:51:57 +0200) ()
Nov 28 20:12:39 pve1 systemd-modules-load[1518]: Inserted module 'ib_iser'

PVE2 Syslog just before hard reboot
Code:
Nov 28 20:10:52 pve2 corosync[2050]:   [TOTEM ] A new membership (1:508) was formed. Members
Nov 28 20:10:52 pve2 pmxcfs[1969]: [status] notice: cpg_send_message retry 50
Nov 28 20:12:42 pve2 systemd-modules-load[1478]: Inserted module 'iscsi_tcp'
Nov 28 20:12:42 pve2 systemd-modules-load[1478]: Inserted module 'ib_iser'
Nov 28 20:12:42 pve2 kernel: [    0.000000] Linux version 5.0.15-1-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.0.15-1 (Wed, 03 Jul 2019 10:51:57 +0200) ()
Nov 28 20:12:42 pve2 systemd-modules-load[1478]: Inserted module 'vhost_net'

PVE3 Syslog just before node 1 & 2 reboot
Code:
Nov 28 20:10:52 pve3 corosync[1402]:   [TOTEM ] A new membership (3:508) was formed. Members
Nov 28 20:10:52 pve3 corosync[1402]:   [CPG   ] downlist left_list: 0 received
Nov 28 20:10:52 pve3 corosync[1402]:   [QUORUM] Members[1]: 3
Nov 28 20:10:52 pve3 corosync[1402]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 28 20:10:55 pve3 corosync[1402]:   [TOTEM ] A new membership (3:520) was formed. Members
Nov 28 20:10:55 pve3 corosync[1402]:   [CPG   ] downlist left_list: 0 received
Nov 28 20:10:55 pve3 corosync[1402]:   [QUORUM] Members[1]: 3
Nov 28 20:10:55 pve3 corosync[1402]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 28 20:10:57 pve3 corosync[1402]:   [TOTEM ] A new membership (3:524) was formed. Members
Nov 28 20:10:57 pve3 corosync[1402]:   [CPG   ] downlist left_list: 0 received
Nov 28 20:10:57 pve3 corosync[1402]:   [QUORUM] Members[1]: 3
Nov 28 20:10:57 pve3 corosync[1402]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 28 20:10:58 pve3 corosync[1402]:   [KNET  ] link: host: 2 link: 0 is down
Nov 28 20:10:58 pve3 corosync[1402]:   [KNET  ] link: host: 1 link: 0 is down
Nov 28 20:10:58 pve3 corosync[1402]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 28 20:10:58 pve3 corosync[1402]:   [KNET  ] host: host: 2 has no active links
Nov 28 20:10:58 pve3 corosync[1402]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 28 20:10:58 pve3 corosync[1402]:   [KNET  ] host: host: 1 has no active links
Nov 28 20:11:00 pve3 systemd[1]: Starting Proxmox VE replication runner...
Nov 28 20:11:00 pve3 pvesr[2175]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 28 20:11:01 pve3 pvesr[2175]: trying to acquire cfs lock 'file-replication_cfg' ...
 
Is HA active? If so, then that's to be expected: with HA active, a node that loses quorum fences itself (the watchdog hard-resets it).
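For example, something like the following (standard PVE commands, run on any node) shows whether HA resources are configured, the current quorum state, and the services that arm the watchdog:

Code:
# list HA resources and the CRM/LRM state of each node
ha-manager status
# corosync membership and whether this node is quorate
pvecm status
# the HA services that feed the watchdog through watchdog-mux
systemctl status pve-ha-lrm pve-ha-crm watchdog-mux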
 
Thank you for your reply!

Yes, I use HA.
So the solution is either to turn off HA, or to add more PVE nodes for quorum?

Thank you in advance!
 
A stable, separate network is recommended for corosync, especially when using HA.
So yes, either disable HA or make sure the corosync network is stable and has low latency.
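If a private path between the nodes is available (for example a VPN between the datacenters), a second corosync link can be added so that the public link0 is no longer the only path. A minimal sketch of /etc/pve/corosync.conf with a redundant link1 (host names and the 203.0.113.x / 10.0.0.x addresses are placeholders; adapt them to your setup and remember to bump config_version):

Code:
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 203.0.113.11
    ring1_addr: 10.0.0.11
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 203.0.113.12
    ring1_addr: 10.0.0.12
  }
  node {
    name: pve3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 203.0.113.13
    ring1_addr: 10.0.0.13
  }
}

totem {
  cluster_name: mycluster
  config_version: 4
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
    knet_link_priority: 10
  }
}

With kronosnet in its default passive mode, the link with the higher knet_link_priority is used, so corosync would prefer the private link1 and fall back to the public link0 only if it goes down.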
 
OK, so I have removed all the groups and container entries from HA.
Just to confirm (see the attached picture): is HA disabled once all entries are removed, or do I need to stop a service?

Thank you!
 

Attachments

  • HAProxmox.png
You have to reboot the node, otherwise the watchdog is still active (or restart pve-ha-crm and pve-ha-lrm).
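For example, to release the watchdog without a full reboot (a sketch, assuming all HA resources and groups have already been removed):

Code:
# restart the HA services; with no resources left they go idle and
# close their connection to watchdog-mux
systemctl restart pve-ha-crm pve-ha-lrm
# verify: no services listed, the lrm on each node reported as idle
ha-manager status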
 
