Cluster reboot after single node reboot

sander93

Renowned Member
Sep 30, 2014
Hello,

I have a strange problem that happens sometimes, not always.

We reboot a node for whatever reason (hardware maintenance, for example).

When the node comes back online, all the nodes running VMs with HA enabled get rebooted.
This has also happened before when adding a node to the cluster.

It is a cluster of 9 nodes.
Three of the nodes only run VMs on local storage, so HA is not enabled on them; those servers do not get rebooted.

I have tried to search the logs of the rebooted node, but I cannot find anything specific.
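
For reference, something like this is what I mean by searching the logs (example timestamps, adjust to the actual maintenance window):

# corosync + HA manager logs around the reboot
journalctl -u corosync -u pve-ha-lrm -u pve-ha-crm --since "2022-01-22 17:00" --until "2022-01-22 17:30"
# check whether the node was fenced by the watchdog
journalctl -k --since "2022-01-22 17:00" | grep -i -e watchdog -e fence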

Things I found / think could maybe be the issue:
- spanning tree (maybe a short disconnect when the host comes online, hello time)
- LACP; we use LACP on all the nodes

Maybe this says something: this is the part of the corosync log where I first get all the members and then it reports the links going down.

Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 10 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 5 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 7 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Jan 22 17:15:23 IDC-PVE002 corosync[1437]: [KNET ] pmtud: Global data MTU changed to: 1397
Jan 22 17:15:25 IDC-PVE002 corosync[1437]: [KNET ] rx: host: 9 link: 0 is up
Jan 22 17:15:25 IDC-PVE002 corosync[1437]: [KNET ] host: host: 9 (passive) best link: 0 (pri: 1)
Jan 22 17:15:25 IDC-PVE002 corosync[1437]: [KNET ] pmtud: PMTUD link change for host: 9 link: 0 from 469 to 1397
Jan 22 17:15:25 IDC-PVE002 corosync[1437]: [QUORUM] Sync members[9]: 1 2 3 4 5 6 7 9 10
Jan 22 17:15:25 IDC-PVE002 corosync[1437]: [QUORUM] Sync joined[8]: 1 2 3 4 5 7 9 10
Jan 22 17:15:25 IDC-PVE002 corosync[1437]: [TOTEM ] A new membership (1.25b4) was formed. Members joined: 1 2 3 4 5 7 9 10
Jan 22 17:15:52 IDC-PVE002 corosync[1437]: [TOTEM ] Token has not been received in 5662 ms
Jan 22 17:15:59 IDC-PVE002 corosync[1437]: [KNET ] link: host: 1 link: 0 is down
Jan 22 17:15:59 IDC-PVE002 corosync[1437]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 22 17:15:59 IDC-PVE002 corosync[1437]: [KNET ] host: host: 1 has no active links
Jan 22 17:16:11 IDC-PVE002 corosync[1437]: [QUORUM] Sync members[3]: 3 4 6
Jan 22 17:16:11 IDC-PVE002 corosync[1437]: [QUORUM] Sync joined[2]: 3 4
Jan 22 17:16:11 IDC-PVE002 corosync[1437]: [TOTEM ] A new membership (3.25b8) was formed. Members left: 1 2 5 7 9 10
Jan 22 17:16:11 IDC-PVE002 corosync[1437]: [TOTEM ] Failed to receive the leave message. failed: 1 2 5 7 9 10
Jan 22 17:16:11 IDC-PVE002 corosync[1437]: [QUORUM] Members[3]: 3 4 6
Jan 22 17:16:11 IDC-PVE002 corosync[1437]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 22 17:16:27 IDC-PVE002 corosync[1437]: [KNET ] link: host: 9 link: 0 is down
Jan 22 17:16:27 IDC-PVE002 corosync[1437]: [KNET ] host: host: 9 (passive) best link: 0 (pri: 1)
Jan 22 17:16:27 IDC-PVE002 corosync[1437]: [KNET ] host: host: 9 has no active links
Jan 22 17:16:27 IDC-PVE002 corosync[1437]: [KNET ] link: host: 10 link: 0 is down
 
root@IDC-PVE001:~# pveversion
pve-manager/6.4-13/9f411e79 (running kernel: 5.4.143-1-pve)
root@IDC-PVE001:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.143-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-helper: 6.4-8
pve-kernel-5.4: 6.4-7
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-4.15: 5.4-9
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-2-pve: 5.0.21-7
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve1~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.6-pve1~bpo10+1
 
Hello, did you figure out this issue? This is very strange, but I have the same issue on my 6-node cluster. I just rejoined a node to the cluster and applied updates, and I had the same issue you describe in your post. So far I couldn't find anything in the logs.

This is the pveversion -v output of the last node in my cluster, the one I rebooted:

proxmox-ve: 7.2-1 (running kernel: 5.15.35-1-pve)
pve-manager: 7.2-3 (running version: 7.2-3/c743d6c1)
pve-kernel-5.15: 7.2-3
pve-kernel-helper: 7.2-3
pve-kernel-5.15.35-1-pve: 5.15.35-2
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-8
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-6
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.2-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.1.8-1
proxmox-backup-file-restore: 2.1.8-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-10
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-1
pve-qemu-kvm: 6.2.0-5
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
 
Hello,

The problem was a bug in a specific corosync version:
corosync: 3.1.2-pve1
https://bugzilla.proxmox.com/show_bug.cgi?id=3672
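
For anyone else running into this: roughly, check which corosync/libknet versions are installed and upgrade them from the Proxmox repositories (see the bug report for details), something like:

pveversion -v | grep -E 'corosync|libknet'
apt update
apt full-upgrade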

We have also disabled spanning tree on the switch ports for the Proxmox servers (set them as edge ports), so these ports don't flap when the STP topology changes.
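
For example, on a Cisco-style switch that would look roughly like this (syntax and naming differ per vendor, with LACP the setting may belong on the port-channel rather than the member ports, and the interface name here is just a placeholder):

interface GigabitEthernet1/0/10
 ! untagged/access port towards the Proxmox node
 spanning-tree portfast
 ! or, if the port carries tagged VLANs:
 ! spanning-tree portfast trunk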

We haven't had the problem again since, but I am always a bit nervous/anxious when I have to do maintenance/updates on the cluster.

It would be very useful if there were some kind of maintenance mode in Proxmox that disables all the HA options etc.
 
Hello,

Thanks for sharing the extra info. I added more details here, in case you want to follow the issue I'm having: https://forum.proxmox.com/threads/nodes-reboot-after-upgrade-to-7-1.103767/post-470159

I know that feeling of being nervous when touching the cluster, and that's a shame, because one of the reasons to have a cluster is peace of mind, haha.

By the way, I think that if you stop and disable the pve-ha-lrm and pve-ha-crm services, that will disable HA on Proxmox, so in theory the fencing mechanism shouldn't be triggered by nodes going in and out or by a flapping network port.
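
A rough sketch of what I mean (run on every node, start the services again after the maintenance, and check the current Proxmox docs before relying on this):

# on every node, before the maintenance
systemctl stop pve-ha-lrm pve-ha-crm
# ... reboot / maintenance ...
# on every node, afterwards
systemctl start pve-ha-lrm pve-ha-crm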
 
