Cluster reboot after apt upgrade on one node

Craig St George

Well-Known Member
I'm confused about what happened here: we ran apt upgrade on one node and the nine-node cluster rebooted, except for one node, which is on PVE 7.1-6.
The corosync package was one of the items being upgraded.

The other nodes saw this one go down, which looks right:
Dec 28 10:57:35 sgrc-kv-pmox-04 corosync[17473]: [CFG ] Node 7 was shut down by sysadmin
Dec 28 10:57:35 sgrc-kv-pmox-04 corosync[17473]: [QUORUM] Sync members[8]: 1 2 3 4 5 6 8 9
Dec 28 10:57:35 sgrc-kv-pmox-04 corosync[17473]: [QUORUM] Sync left[1]: 7
Dec 28 10:57:35 sgrc-kv-pmox-04 corosync[17473]: [TOTEM ] A new membership (1.884) was formed. Members left: 7
Dec 28 10:57:35 sgrc-kv-pmox-04 corosync[17473]: [QUORUM] Members[8]: 1 2 3 4 5 6 8 9
Dec 28 10:57:35 sgrc-kv-pmox-04 corosync[17473]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 28 10:57:36 sgrc-kv-pmox-04 corosync[17473]: [KNET ] link: host: 7 link: 0 is down
Dec 28 10:57:36 sgrc-kv-pmox-04 corosync[17473]: [KNET ] link: host: 7 link: 1 is down
Dec 28 10:57:36 sgrc-kv-pmox-04 corosync[17473]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Dec 28 10:57:36 sgrc-kv-pmox-04 corosync[17473]: [KNET ] host: host: 7 has no active links
Dec 28 10:57:36 sgrc-kv-pmox-04 corosync[17473]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Dec 28 10:57:36 sgrc-kv-pmox-04 corosync[17473]: [KNET ] host: host: 7 has no active links
Dec 28 10:57:39 sgrc-kv-pmox-04 corosync[17473]: [TOTEM ] Retransmit List: 2a 2b 29
Dec 28 10:57:39 sgrc-kv-pmox-04 corosync[17473]: [TOTEM ] Retransmit List: 2a 2b 3b 3c 29 3d


Then two other nodes started complaining and the [TOTEM ] Retransmit List started increasing:
[TOTEM ] Retransmit List: 29 3d
Dec 28 10:57:41 sgrc-kv-pmox-04 corosync[17473]: [KNET ] link: host: 5 link: 0 is down
Dec 28 10:57:41 sgrc-kv-pmox-04 corosync[17473]: [KNET ] link: host: 6 link: 0 is down
Dec 28 10:57:41 sgrc-kv-pmox-04 corosync[17473]: [KNET ] host: host: 5 (passive) best link: 1 (pri: 1)
Dec 28 10:57:41 sgrc-kv-pmox-04 corosync[17473]: [KNET ] host: host: 6 (passive) best link: 1 (pri: 1)
Dec 28 10:57:41 sgrc-kv-pmox-04 corosync[17473]: [KNET ] rx: host: 7 link: 0 is up
Dec 28 10:57:41 sgrc-kv-pmox-04 corosync[17473]: [KNET ] rx: host: 7 link: 1 is up
Dec 28 10:57:41 sgrc-kv-pmox-04 corosync[17473]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Dec 28 10:57:41 sgrc-kv-pmox-04 corosync[17473]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Dec 28 10:57:45 sgrc-kv-pmox-04 corosync[17473]: [KNET ] rx: host: 6 link: 0 is up
Dec 28 10:57:45 sgrc-kv-pmox-04 corosync[17473]: [KNET ] rx: host: 5 link: 0 is up
Dec 28 10:57:45 sgrc-kv-pmox-04 corosync[17473]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Dec 28 10:57:45 sgrc-kv-pmox-04 corosync[17473]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Dec 28 10:57:47 sgrc-kv-pmox-04 corosync[17473]: [TOTEM ] Retransmit List: 3f 40 41 42 43 44 45 46 47 48 49 4a 4b 4c
Dec 28 10:57:49 sgrc-kv-pmox-04 corosync[17473]: [TOTEM ] Retransmit List: 4c e 12 13 1c 3f 40 41 42 43 44 45 46 47 48 49 4a 4b 6b 6c 6d 6a 6e 6f 70 71 7>
Dec 28 10:57:51 sgrc-kv-pmox-04 corosync[17473]: [TOTEM ] Retransmit List: 4a 4b 6b 6c 6d 6e 6f 70 71 73 74 75 76 77 45 46 47 48 49

Finally the nodes started rebooting, but that is what I do not understand, since pvecm status says 5 nodes make a quorum:
Expected votes: 9
Highest expected: 9
Total votes: 9
Quorum: 5
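If I have the math right, quorum for 9 expected votes is floor(9/2) + 1 = 5, so even with two or three nodes having trouble there should still have been 6 or 7 votes, well above quorum.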

Also, the corosync config has two links:
node {
name: sgrc-kv-pmox-07
nodeid: 9
quorum_votes: 1
ring0_addr: 10.10.1.107
ring1_addr: 10.10.2.107
}
}

quorum {
provider: corosync_votequorum
}

totem {
cluster_name: kvcluster1
config_version: 9
interface {
bindnetaddr: 10.10.1.103
ringnumber: 0
}
interface {
bindnetaddr: 10.10.2.103
ringnumber: 1
}
ip_version: ipv4
rrp_mode: passive
secauth: on
version: 2
}

I know the bindnetaddr should not be there since this is the new corosync, but the logs say:
interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used

So I assume what happened is that the watchdog timed out and the nodes fenced, but I'm confused why that happened when maybe only two or three nodes had issues.
Maybe there is some setting in the corosync config to delay this, some timeout to increase, or raising the timeout on the watchdog kernel module could help; a rough sketch of the corosync idea is below.
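Something like this in the totem section of /etc/pve/corosync.conf is what I have in mind; token is a standard corosync option (total token timeout in milliseconds), but the 10000 below is just a guess on my part, not a tested value:

totem {
  cluster_name: kvcluster1
  config_version: 10   # bump config_version when editing /etc/pve/corosync.conf
  ip_version: ipv4
  rrp_mode: passive
  secauth: on
  version: 2
  # token: how long corosync waits before declaring the token lost (milliseconds);
  # 10000 here is only an example, not something I have verified
  token: 10000
}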

Any ideas?
 
Which corosync version was installed pre-update?

There was a hard-to-reproduce bug that would only trigger under the right circumstances and cause Corosync to lock up when a node joined.

From the changelog:
corosync (3.1.5-pve2) bullseye; urgency=medium

* cherry-pick fix for high retransmit load

* cherry-pick fix for CPG corruption during membership change bug

If you want to make sure that Corosync problems will not cause fencing during updates (it could be other things as well, like network issues), you can stop the HA services for the duration of the upgrade.
First stop all pve-ha-lrm services on all nodes, then the pve-ha-crm services. Once you are done with the upgrade and everything is working as expected, start them in the same order on all nodes, roughly as sketched below.
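A minimal sketch, assuming a standard Proxmox VE setup where pve-ha-lrm and pve-ha-crm are the HA services (run each step on every node):

# before the upgrade: stop the local resource manager on every node first
systemctl stop pve-ha-lrm
# then stop the cluster resource manager on every node
systemctl stop pve-ha-crm

# after the upgrade, once everything looks healthy, start them again in the same order
systemctl start pve-ha-lrm
systemctl start pve-ha-crm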
 
