Nodes hard reboot after quorum is lost?

Stereo973

Nov 29, 2019
Hello!

I hope you are doing well! First of all, sorry for my bad English!

At my company we run a 3-node Proxmox cluster in production, with only LXC containers and no Ceph (we use ZFS replication).
The first two nodes are hosted at OVH and the third at Online.net for HA reasons; they use only link0 for corosync (the public NIC).
Yesterday evening, node 3 experienced a lot of short network failures. The first two nodes detected them without problems and everything kept working.

But after about 3 minutes of node 3 dropping out, coming back, and dropping out again, nodes 1 and 2 hard rebooted.
I checked the syslog on each server; there was no SIGTERM shutdown.
I experienced this once in the past: when I had a two-node PVE cluster, if one node rebooted, the second one rebooted too.

Do you have any idea what happened? I'm a little bit lost...
Thank you in advance for your help!

PVE1 Package Version
Code:
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-4 (running version: 6.0-4/2a719255)
pve-kernel-5.0: 6.0-5
pve-kernel-helper: 6.0-5
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-2
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-5
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-61
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-4
pve-container: 3.0-3
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-5
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-3
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-5
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1

PVE1 Syslog just before hard reboot
Code:
Nov 28 20:10:52 pve1 pmxcfs[2141]: [status] notice: cpg_send_message retry 80
Nov 28 20:10:53 pve1 pmxcfs[2141]: [status] notice: cpg_send_message retry 90
Nov 28 20:10:53 pve1 corosync[2231]:   [TOTEM ] Token has not been received in 1250 ms
Nov 28 20:10:54 pve1 snmpd[2003]: error on subcontainer 'ia_addr' insert (-1)
Nov 28 20:10:54 pve1 pmxcfs[2141]: [status] notice: cpg_send_message retry 100
Nov 28 20:10:54 pve1 pmxcfs[2141]: [status] notice: cpg_send_message retried 100 times
Nov 28 20:10:54 pve1 pmxcfs[2141]: [status] crit: cpg_send_message failed: 6
Nov 28 20:10:55 pve1 corosync[2231]:   [TOTEM ] Token has not been received in 2948 ms
Nov 28 20:10:55 pve1 pmxcfs[2141]: [status] notice: cpg_send_message retry 10
Nov 28 20:10:55 pve1 corosync[2231]:   [TOTEM ] A new membership (1:520) was formed. Members
Nov 28 20:10:56 pve1 pmxcfs[2141]: [status] notice: cpg_send_message retry 20
Nov 28 20:12:39 pve1 systemd-modules-load[1518]: Inserted module 'iscsi_tcp'
Nov 28 20:12:39 pve1 kernel: [    0.000000] Linux version 5.0.15-1-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.0.15-1 (Wed, 03 Jul 2019 10:51:57 +0200) ()
Nov 28 20:12:39 pve1 systemd-modules-load[1518]: Inserted module 'ib_iser'

PVE2 Syslog just before hard reboot
Code:
Nov 28 20:10:52 pve2 corosync[2050]:   [TOTEM ] A new membership (1:508) was formed. Members
Nov 28 20:10:52 pve2 pmxcfs[1969]: [status] notice: cpg_send_message retry 50
Nov 28 20:12:42 pve2 systemd-modules-load[1478]: Inserted module 'iscsi_tcp'
Nov 28 20:12:42 pve2 systemd-modules-load[1478]: Inserted module 'ib_iser'
Nov 28 20:12:42 pve2 kernel: [    0.000000] Linux version 5.0.15-1-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.0.15-1 (Wed, 03 Jul 2019 10:51:57 +0200) ()
Nov 28 20:12:42 pve2 systemd-modules-load[1478]: Inserted module 'vhost_net'

PVE3 Syslog just before node 1 & 2 reboot
Code:
Nov 28 20:10:52 pve3 corosync[1402]:   [TOTEM ] A new membership (3:508) was formed. Members
Nov 28 20:10:52 pve3 corosync[1402]:   [CPG   ] downlist left_list: 0 received
Nov 28 20:10:52 pve3 corosync[1402]:   [QUORUM] Members[1]: 3
Nov 28 20:10:52 pve3 corosync[1402]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 28 20:10:55 pve3 corosync[1402]:   [TOTEM ] A new membership (3:520) was formed. Members
Nov 28 20:10:55 pve3 corosync[1402]:   [CPG   ] downlist left_list: 0 received
Nov 28 20:10:55 pve3 corosync[1402]:   [QUORUM] Members[1]: 3
Nov 28 20:10:55 pve3 corosync[1402]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 28 20:10:57 pve3 corosync[1402]:   [TOTEM ] A new membership (3:524) was formed. Members
Nov 28 20:10:57 pve3 corosync[1402]:   [CPG   ] downlist left_list: 0 received
Nov 28 20:10:57 pve3 corosync[1402]:   [QUORUM] Members[1]: 3
Nov 28 20:10:57 pve3 corosync[1402]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 28 20:10:58 pve3 corosync[1402]:   [KNET  ] link: host: 2 link: 0 is down
Nov 28 20:10:58 pve3 corosync[1402]:   [KNET  ] link: host: 1 link: 0 is down
Nov 28 20:10:58 pve3 corosync[1402]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Nov 28 20:10:58 pve3 corosync[1402]:   [KNET  ] host: host: 2 has no active links
Nov 28 20:10:58 pve3 corosync[1402]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 28 20:10:58 pve3 corosync[1402]:   [KNET  ] host: host: 1 has no active links
Nov 28 20:11:00 pve3 systemd[1]: Starting Proxmox VE replication runner...
Nov 28 20:11:00 pve3 pvesr[2175]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 28 20:11:01 pve3 pvesr[2175]: trying to acquire cfs lock 'file-replication_cfg' ...
 
Is HA active? If so, then that's to be expected. With HA active, if the cluster no longer has quorum, the nodes fence themselves.
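
If you are not sure whether HA is actually in use, something like the following (run on any node) shows the configured HA resources and the current quorum state; no resources listed in the ha-manager output means HA is not managing anything:
Code:
# list configured HA resources and the state of the HA managers
ha-manager status
# show corosync quorum and membership information
pvecm status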
 
Thank you for your reply!

Yes, I use HA.
So the solution is either to turn off HA, or to add more PVE nodes for quorum?

Thank you in advance!
 
A stable, separate network is recommended for corosync, especially when using HA.
So yes, either disable HA or make sure the network is stable and has low latency.
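
As an illustration only (this assumes the servers have a second, private NIC available, which may not be possible across two different hosting providers; the addresses and config_version below are placeholders), a redundant corosync link is added by giving each node a second ring address in /etc/pve/corosync.conf and bumping config_version in the totem section:
Code:
# /etc/pve/corosync.conf (excerpt, placeholder addresses)
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 203.0.113.11   # public NIC, corosync link0
    ring1_addr: 10.10.10.11    # private NIC, corosync link1
  }
  # ... repeat for pve2 and pve3 with their own addresses ...
}

totem {
  # increment config_version whenever you edit this file
  config_version: 4
  ...
}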
 
OK, so I have removed all groups and container entries from HA.
Just to confirm (see the attached picture): is HA disabled after removing all entries, or do I need to stop a service?

Thank you!
 

Attachments

  • HAProxmox.png
You have to reboot the node, otherwise the watchdog is still active (or restart pve-ha-crm and pve-ha-lrm).
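
If a reboot is not convenient, restarting the two HA services should release the watchdog; roughly:
Code:
# restart the HA local resource manager and cluster resource manager
# so they close the watchdog (do this on every node that had HA active)
systemctl restart pve-ha-lrm
systemctl restart pve-ha-crm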
 
