Cluster reboot after single node reboot

sander93

Renowned Member
Sep 30, 2014
Hello,

I have a strange problem that happens sometimes, not always.

We reboot a node for whatever reason (hardware maintenance, for example).

When the node comes back online, all the nodes running VMs with HA enabled get rebooted.
This has also happened before when adding a node to the cluster.

It is a cluster of 9 nodes.
Three of the nodes only run VMs on local storage, so HA is not enabled on them; those servers do not get rebooted.

I have tried to search the logs of the rebooted node, but I cannot find anything specific.
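
For reference, something like this is what I mean by searching the logs (example timestamps, adjust to the actual maintenance window):

# corosync + HA manager logs around the reboot
journalctl -u corosync -u pve-ha-lrm -u pve-ha-crm --since "2022-01-22 17:00" --until "2022-01-22 17:30"
# check whether the node was fenced by the watchdog
journalctl -k --since "2022-01-22 17:00" | grep -i -e watchdog -e fence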

Things I found / think could maybe be the issue:
- spanning tree (maybe a short disconnect when the host comes online, hello time)
- LACP; we use LACP on all the nodes

Maybe this says something: this is the part of the corosync log where I first get all the members and then it reports the links going down.

Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 10 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 7 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: Global data MTU changed to: 1397
Jan 22 17:15:25 IDC-PVE002 corosync[1437]: [KNET ] rx: host: 9 link: 0 is up
Jan 22 17:15:25 IDC-PVE002 corosync[1437]: [KNET ] host: host: 9 (passive) best link: 0 (pri: 1)
Jan 22 17:15:25 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 9 link: 0 from 469 to 1397
Jan 22 17:15:25 IDC-PVE002 corosync[1437]: [QUORUM] Sync members[9]: 1 2 3 4 5 6 7 9 10
Jan 22 17:15:25 IDC-PVE002 corosync[1437]: [QUORUM] Sync joined[8]: 1 2 3 4 5 7 9 10
Jan 22 17:15:25 IDC-PVE002 corosync[1437]: [TOTEM ] A new membership (1.25b4) was formed. Members joined: 1 2 3 4 5 7 9 10
Jan 22 17:15:52 IDC-PVE002 corosync[1437]: [TOTEM ] Token has not been received in 5662 ms
Jan 22 17:15:59 IDC-PVE002 corosync[1437]: [KNET ] link: host: 1 link: 0 is down
Jan 22 17:15:59 IDC-PVE002 corosync[1437]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 22 17:15:59 IDC-PVE002 corosync[1437]: [KNET ] host: host: 1 has no active links
Jan 22 17:16:11 IDC-PVE002 corosync[1437]: [QUORUM] Sync members[3]: 3 4 6
Jan 22 17:16:11 IDC-PVE002 corosync[1437]: [QUORUM] Sync joined[2]: 3 4
Jan 22 17:16:11 IDC-PVE002 corosync[1437]: [TOTEM ] A new membership (3.25b8) was formed. Members left: 1 2 5 7 9 10
Jan 22 17:16:11 IDC-PVE002 corosync[1437]: [TOTEM ] Failed to receive the leave message. failed: 1 2 5 7 9 10
Jan 22 17:16:11 IDC-PVE002 corosync[1437]: [QUORUM] Members[3]: 3 4 6
Jan 22 17:16:11 IDC-PVE002 corosync[1437]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 22 17:16:27 IDC-PVE002 corosync[1437]: [KNET ] link: host: 9 link: 0 is down
Jan 22 17:16:27 IDC-PVE002 corosync[1437]: [KNET ] host: host: 9 (passive) best link: 0 (pri: 1)
Jan 22 17:16:27 IDC-PVE002 corosync[1437]: [KNET ] host: host: 9 has no active links
Jan 22 17:16:27 IDC-PVE002 corosync[1437]: [KNET ] link: host: 10 link: 0 is down
 
root@IDC-PVE001:~# pveversion
pve-manager/6.4-13/9f411e79 (running kernel: 5.4.143-1-pve)
root@IDC-PVE001:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.143-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-helper: 6.4-8
pve-kernel-5.4: 6.4-7
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-4.15: 5.4-9
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-2-pve: 5.0.21-7
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve1~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.6-pve1~bpo10+1
 
Hello, did you figure out this issue? This is very strange, but I have the same issue on my 6-node cluster. I just rejoined a node to the cluster and applied updates, and I had the same issue you describe in your post. So far I couldn't find anything in the logs.

This is the pveversion -v output of the last node in my cluster, the one I rebooted:

proxmox-ve: 7.2-1 (running kernel: 5.15.35-1-pve)
pve-manager: 7.2-3 (running version: 7.2-3/c743d6c1)
pve-kernel-5.15: 7.2-3
pve-kernel-helper: 7.2-3
pve-kernel-5.15.35-1-pve: 5.15.35-2
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-8
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-6
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.2-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.1.8-1
proxmox-backup-file-restore: 2.1.8-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-10
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-1
pve-qemu-kvm: 6.2.0-5
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
 
Hello,

The problem was a bug in a specific corosync version:
corosync: 3.1.2-pve1
https://bugzilla.proxmox.com/show_bug.cgi?id=3672
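
For anyone else running into this: roughly, check which corosync/libknet versions are installed and upgrade them from the Proxmox repositories (see the bug report for details), something like:

pveversion -v | grep -E 'corosync|libknet'
apt update
apt full-upgrade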

We have also disabled spanning tree on the switch ports for the Proxmox servers (set them as edge ports), so these ports don't flap when the STP topology changes.
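
For example, on a Cisco-style switch that would look roughly like this (syntax and naming differ per vendor, with LACP the setting may belong on the port-channel rather than the member ports, and the interface name here is just a placeholder):

interface GigabitEthernet1/0/10
 ! untagged/access port towards the Proxmox node
 spanning-tree portfast
 ! or, if the port carries tagged VLANs:
 ! spanning-tree portfast trunk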

We haven't had the problem again since, but I am always a bit nervous/anxious when I have to do maintenance/updates on the cluster.

It would be very useful if there were some kind of maintenance mode in Proxmox that disables all the HA options etc.
 
Hello,

Thanks for sharing the extra info. I added more details here, in case you want to follow the issue I'm having: https://forum.proxmox.com/threads/nodes-reboot-after-upgrade-to-7-1.103767/post-470159

I know that feeling of being nervous when touching the cluster, and that's a shame, because one of the reasons to have a cluster is peace of mind, haha.

By the way, I think that if you stop and disable the pve-ha-lrm and pve-ha-crm services, that will disable HA on Proxmox, so in theory the fencing mechanism shouldn't be triggered by nodes going in and out or by a flapping network port.
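
A rough sketch of what I mean (run on every node, start the services again after the maintenance, and check the current Proxmox docs before relying on this):

# on every node, before the maintenance
systemctl stop pve-ha-lrm pve-ha-crm
# ... reboot / maintenance ...
# on every node, afterwards
systemctl start pve-ha-lrm pve-ha-crm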
 
