Node hard reboot after quorum loss?

Stereo973

Hello!

I hope you are doing well! First of all, sorry for my bad English!

At my company we run a 3-node Proxmox cluster in production, with only LXC containers and no Ceph (we use ZFS replication).
The first two nodes are hosted at OVH and the third at Online.net for HA reasons; corosync uses only link0, which runs over the public NIC.
Last evening node 3 experienced a series of short network failures. The first two nodes detected this without problems and everything kept working.

But after about 3 minutes of node 3 flapping (lost, back again, lost again...), node 1 and node 2 hard rebooted.
I checked the syslog on each server: there was no SIGTERM / clean shutdown.
I experienced something similar once in the past with a two-node PVE cluster: when one node rebooted, the second one rebooted too.

Do you have any idea what happened? I'm a little bit lost...
Thank you in advance for your help!

PVE1 Package Version
Code:
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-3
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1

PVE1 Syslog just before hard reboot
Code:
Nov 28 20:10:52 pve1 pmxcfs[2141]: [status] notice: cpg_send_message retry 80
Nov 28 20:10:53 pve1 pmxcfs[2141]: [status] notice: cpg_send_message retry 90
Nov 28 20:10:53 pve1 corosync[2231]:   [TOTEM ] Token has not been received in 1250 ms
Nov 28 20:10:54 pve1 snmpd[2003]: error on subcontainer 'ia_addr' insert (-1)
Nov 28 20:10:54 pve1 pmxcfs[2141]: [status] notice: cpg_send_message retry 100
Nov 28 20:10:54 pve1 pmxcfs[2141]: [status] notice: cpg_send_message retried 100 times
Nov 28 20:10:54 pve1 pmxcfs[2141]: [status] crit: cpg_send_message failed: 6
Nov 28 20:10:55 pve1 corosync[2231]:   [TOTEM ] Token has not been received in 2948 ms
Nov 28 20:10:55 pve1 pmxcfs[2141]: [status] notice: cpg_send_message retry 10
Nov 28 20:10:55 pve1 corosync[2231]:   [TOTEM ] A new membership (1:520) was formed. Members
Nov 28 20:10:56 pve1 pmxcfs[2141]: [status] notice: cpg_send_message retry 20
Nov 28 20:12:39 pve1 systemd-modules-load[1518]: Inserted module 'iscsi_tcp'
Nov 28 20:12:39 pve1 kernel: [    0.000000] Linux version 5.0.15-1-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.0.15-1 (Wed, 03 Jul 2019 10:51:57 +0200) ()
Nov 28 20:12:39 pve1 systemd-modules-load[1518]: Inserted module 'ib_iser'

PVE2 Syslog just before hard reboot
Code:
Nov 28 20:10:52 pve2 corosync[2050]:   [TOTEM ] A new membership (1:508) was formed. Members
Nov 28 20:10:52 pve2 pmxcfs[1969]: [status] notice: cpg_send_message retry 50
Nov 28 20:12:42 pve2 systemd-modules-load[1478]: Inserted module 'iscsi_tcp'
Nov 28 20:12:42 pve2 systemd-modules-load[1478]: Inserted module 'ib_iser'
Nov 28 20:12:42 pve2 kernel: [    0.000000] Linux version 5.0.15-1-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.0.15-1 (Wed, 03 Jul 2019 10:51:57 +0200) ()
Nov 28 20:12:42 pve2 systemd-modules-load[1478]: Inserted module 'vhost_net'

PVE3 Syslog just before node 1 & 2 reboot
Code:
Nov 28 20:10:52 pve3 corosync[1402]:   [TOTEM ] A new membership (3:508) was formed. Members
Nov 28 20:10:52 pve3 corosync[1402]:   [CPG   ] downlist left_list: 0 received
Nov 28 20:10:52 pve3 corosync[1402]:   [QUORUM] Members[1]: 3
Nov 28 20:10:52 pve3 corosync[1402]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 28 20:10:55 pve3 corosync[1402]:   [TOTEM ] A new membership (3:520) was formed. Members
Nov 28 20:10:55 pve3 corosync[1402]:   [CPG   ] downlist left_list: 0 received
Nov 28 20:10:55 pve3 corosync[1402]:   [QUORUM] Members[1]: 3
Nov 28 20:10:55 pve3 corosync[1402]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 28 20:10:57 pve3 corosync[1402]:   [TOTEM ] A new membership (3:524) was formed. Members
Nov 28 20:10:57 pve3 corosync[1402]:   [CPG   ] downlist left_list: 0 received
Nov 28 20:10:57 pve3 corosync[1402]:   [QUORUM] Members[1]: 3
Nov 28 20:10:57 pve3 corosync[1402]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 28 20:10:58 pve3 corosync[1402]:   [KNET  ] link: host: 2 link: 0 is down
Nov 28 20:10:58 pve3 corosync[1402]:   [KNET  ] link: host: 1 link: 0 is down
Nov 28 20:10:58 pve3 corosync[1402]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 28 20:10:58 pve3 corosync[1402]:   [KNET  ] host: host: 2 has no active links
Nov 28 20:10:58 pve3 corosync[1402]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 28 20:10:58 pve3 corosync[1402]:   [KNET  ] host: host: 1 has no active links
Nov 28 20:11:00 pve3 systemd[1]: Starting Proxmox VE replication runner...
Nov 28 20:11:00 pve3 pvesr[2175]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 28 20:11:01 pve3 pvesr[2175]: trying to acquire cfs lock 'file-replication_cfg' ...
 
Is HA active? If so, then that's to be expected: with HA active, a node that loses quorum fences itself (the watchdog hard-resets it).
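For example, something like the following (standard PVE commands, run on any node) shows whether HA resources are configured, the current quorum state, and the services that arm the watchdog:

Code:
# list HA resources and the CRM/LRM state of each node
ha-manager status
# corosync membership and whether this node is quorate
pvecm status
# the HA services that feed the watchdog through watchdog-mux
systemctl status pve-ha-lrm pve-ha-crm watchdog-mux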
 
Thank you for your reply!

Yes, I use HA.
So the solution is either to turn off HA, or to add more PVE nodes for quorum?

Thank you in advance!
 
A stable, separate network is recommended for corosync, especially when using HA.
So yes, either disable HA or make sure the corosync network is stable and has low latency.
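If a private path between the nodes is available (for example a VPN between the datacenters), a second corosync link can be added so that the public link0 is no longer the only path. A minimal sketch of /etc/pve/corosync.conf with a redundant link1 (host names and the 203.0.113.x / 10.0.0.x addresses are placeholders; adapt them to your setup and remember to bump config_version):

Code:
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 203.0.113.11
    ring1_addr: 10.0.0.11
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 203.0.113.12
    ring1_addr: 10.0.0.12
  }
  node {
    name: pve3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 203.0.113.13
    ring1_addr: 10.0.0.13
  }
}

totem {
  cluster_name: mycluster
  config_version: 4
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
    knet_link_priority: 10
  }
}

With kronosnet in its default passive mode, the link with the higher knet_link_priority is used, so corosync would prefer the private link1 and fall back to the public link0 only if it goes down.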
 
OK, so I have removed all the groups and container entries from HA.
Just to confirm (see the attached picture): is HA disabled once all entries are removed, or do I need to stop a service?

Thank you!
 

Attachments

  • HAProxmox.png
You have to reboot the node, otherwise the watchdog is still active (or restart pve-ha-crm and pve-ha-lrm).
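For example, to release the watchdog without a full reboot (a sketch, assuming all HA resources and groups have already been removed):

Code:
# restart the HA services; with no resources left they go idle and
# close their connection to watchdog-mux
systemctl restart pve-ha-crm pve-ha-lrm
# verify: no services listed, the lrm on each node reported as idle
ha-manager status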
 
