I'm confused about what happened here: we apt-upgraded one node, and the whole nine-node cluster rebooted except for one node.
This is PVE 7.1-6.
The corosync package was one of the items being upgraded.
The other nodes saw this one go down, which looks right:
Dec 28 10:57:35 sgrc-kv-pmox-04 corosync[17473]: [CFG ] Node 7 was shut down by sysadmin
Dec 28 10:57:35 sgrc-kv-pmox-04 corosync[17473]: [QUORUM] Sync members[8]: 1 2 3 4 5 6 8 9
Dec 28 10:57:35 sgrc-kv-pmox-04 corosync[17473]: [QUORUM] Sync left[1]: 7
Dec 28 10:57:35 sgrc-kv-pmox-04 corosync[17473]: [TOTEM ] A new membership (1.884) was formed. Members left: 7
Dec 28 10:57:35 sgrc-kv-pmox-04 corosync[17473]: [QUORUM] Members[8]: 1 2 3 4 5 6 8 9
Dec 28 10:57:35 sgrc-kv-pmox-04 corosync[17473]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 28 10:57:36 sgrc-kv-pmox-04 corosync[17473]: [KNET ] link: host: 7 link: 0 is down
Dec 28 10:57:36 sgrc-kv-pmox-04 corosync[17473]: [KNET ] link: host: 7 link: 1 is down
Dec 28 10:57:36 sgrc-kv-pmox-04 corosync[17473]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Dec 28 10:57:36 sgrc-kv-pmox-04 corosync[17473]: [KNET ] host: host: 7 has no active links
Dec 28 10:57:36 sgrc-kv-pmox-04 corosync[17473]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Dec 28 10:57:36 sgrc-kv-pmox-04 corosync[17473]: [KNET ] host: host: 7 has no active links
Dec 28 10:57:39 sgrc-kv-pmox-04 corosync[17473]: [TOTEM ] Retransmit List: 2a 2b 29
Dec 28 10:57:39 sgrc-kv-pmox-04 corosync[17473]: [TOTEM ] Retransmit List: 2a 2b 3b 3c 29 3d
Then two other nodes started complaining, and the [TOTEM ] Retransmit List kept growing:
[TOTEM ] Retransmit List: 29 3d
Dec 28 10:57:41 sgrc-kv-pmox-04 corosync[17473]: [KNET ] link: host: 5 link: 0 is down
Dec 28 10:57:41 sgrc-kv-pmox-04 corosync[17473]: [KNET ] link: host: 6 link: 0 is down
Dec 28 10:57:41 sgrc-kv-pmox-04 corosync[17473]: [KNET ] host: host: 5 (passive) best link: 1 (pri: 1)
Dec 28 10:57:41 sgrc-kv-pmox-04 corosync[17473]: [KNET ] host: host: 6 (passive) best link: 1 (pri: 1)
Dec 28 10:57:41 sgrc-kv-pmox-04 corosync[17473]: [KNET ] rx: host: 7 link: 0 is up
Dec 28 10:57:41 sgrc-kv-pmox-04 corosync[17473]: [KNET ] rx: host: 7 link: 1 is up
Dec 28 10:57:41 sgrc-kv-pmox-04 corosync[17473]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Dec 28 10:57:41 sgrc-kv-pmox-04 corosync[17473]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Dec 28 10:57:45 sgrc-kv-pmox-04 corosync[17473]: [KNET ] rx: host: 6 link: 0 is up
Dec 28 10:57:45 sgrc-kv-pmox-04 corosync[17473]: [KNET ] rx: host: 5 link: 0 is up
Dec 28 10:57:45 sgrc-kv-pmox-04 corosync[17473]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Dec 28 10:57:45 sgrc-kv-pmox-04 corosync[17473]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Dec 28 10:57:47 sgrc-kv-pmox-04 corosync[17473]: [TOTEM ] Retransmit List: 3f 40 41 42 43 44 45 46 47 48 49 4a 4b 4c
Dec 28 10:57:49 sgrc-kv-pmox-04 corosync[17473]: [TOTEM ] Retransmit List: 4c e 12 13 1c 3f 40 41 42 43 44 45 46 47 48 49 4a 4b 6b 6c 6d 6a 6e 6f 70 71 7>
Dec 28 10:57:51 sgrc-kv-pmox-04 corosync[17473]: [TOTEM ] Retransmit List: 4a 4b 6b 6c 6d 6e 6f 70 71 73 74 75 76 77 45 46 47 48 49
Finally the nodes started rebooting, but that is what I don't understand, since pvecm status says 5 nodes form a quorum:
Expected votes: 9
Highest expected: 9
Total votes: 9
Quorum: 5
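Just to double-check my reading of that output, the quorum number is a strict majority of the expected votes, so with 9 votes the cluster should tolerate losing up to 4 nodes (my own arithmetic, not Proxmox code):

```python
# Sanity check: votequorum requires a strict majority of expected votes.
def quorum(expected_votes: int) -> int:
    return expected_votes // 2 + 1

print(quorum(9))  # 5 -> up to 4 of 9 nodes can drop out and quorum holds
```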
Also, corosync is configured with two links:
node {
name: sgrc-kv-pmox-07
nodeid: 9
quorum_votes: 1
ring0_addr: 10.10.1.107
ring1_addr: 10.10.2.107
}
}
quorum {
provider: corosync_votequorum
}
totem {
cluster_name: kvcluster1
config_version: 9
interface {
bindnetaddr: 10.10.1.103
ringnumber: 0
}
interface {
bindnetaddr: 10.10.2.103
ringnumber: 1
}
ip_version: ipv4
rrp_mode: passive
secauth: on
version: 2
}
I know the bind addresses should not be there with the new corosync, but the logs say:
interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used
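If I'm reading corosync.conf(5) right, the corosync 3 way would be to drop the interface bindnetaddr/ringnumber sections entirely and let the nodelist ring0_addr/ring1_addr entries define the links. A sketch of what I think the totem section should look like (link_mode replacing the old rrp_mode; this is my assumption, not a tested config):

```
totem {
  cluster_name: kvcluster1
  config_version: 10      # must be incremented on any change
  ip_version: ipv4
  link_mode: passive      # corosync 3 / knet equivalent of rrp_mode
  secauth: on
  version: 2
}
```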
So I assume what happened is that the watchdog timed out and the nodes fenced themselves, but I'm confused why that happened when maybe only two or three nodes had issues.
Maybe there is some corosync config or timeout that could delay this?
Or would increasing the timeout on the watchdog kernel module help?
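On the corosync side, the knob I would look at first (if I understand corosync.conf(5) correctly) is the totem token timeout rather than the watchdog module. A sketch only, the value is a guess:

```
totem {
  cluster_name: kvcluster1
  config_version: 10   # bump on every edit so it propagates
  version: 2
  # token: time in ms before a lost token triggers a new membership.
  # Corosync 3 already scales this with node count (token_coefficient),
  # so only raise it if the links genuinely need more slack.
  token: 10000
}
```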
Any ideas?