Mass reboot of all cluster nodes

Hello, everybody.

There was a crash on our cluster today.
At first one node lost its connection to the cluster, and then all the other nodes rebooted.

Code:
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [TOTEM ] A new membership (8.17f8) was formed. Members
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [QUORUM] Members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [QUORUM] Sync members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [TOTEM ] A new membership (8.17fc) was formed. Members
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [QUORUM] Members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [QUORUM] Sync members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [TOTEM ] A new membership (8.1800) was formed. Members
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [QUORUM] Members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [QUORUM] Sync members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [TOTEM ] A new membership (8.1804) was formed. Members
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [QUORUM] Members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [QUORUM] Sync members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [TOTEM ] A new membership (8.1808) was formed. Members
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [QUORUM] Members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [QUORUM] Sync members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [TOTEM ] A new membership (8.180c) was formed. Members
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [QUORUM] Members[1]: 8
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [QUORUM] Sync members[10]: 1 2 3 4 6 7 8 9 10 11
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [QUORUM] Sync joined[9]: 1 2 3 4 6 7 9 10 11
Jan 28 03:56:33 tvr-pve-09 corosync[3318]:   [TOTEM ] A new membership (1.1810) was formed. Members joined: 1 2 3 4 6 7 9 10 11
Jan 28 03:56:34 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 10
Jan 28 03:56:35 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 20
Jan 28 03:56:36 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 30
Jan 28 03:56:37 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 40
Jan 28 03:56:38 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 50
Jan 28 03:56:39 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 60
Jan 28 03:56:40 tvr-pve-09 pve-ha-lrm[3526]: loop take too long (33 seconds)
Jan 28 03:56:40 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 70
Jan 28 03:56:41 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 80
Jan 28 03:56:42 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 90
Jan 28 03:56:43 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 100
Jan 28 03:56:43 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retried 100 times
Jan 28 03:56:43 tvr-pve-09 pmxcfs[3208]: [status] crit: cpg_send_message failed: 6
Jan 28 03:56:44 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 10
Jan 28 03:56:45 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 20
Jan 28 03:56:46 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 30
Jan 28 03:56:47 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 40
Jan 28 03:56:48 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 50
Jan 28 03:56:49 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 60
Jan 28 03:56:50 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 70
Jan 28 03:56:51 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 80
Jan 28 03:56:52 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 90
Jan 28 03:56:53 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 100
Jan 28 03:56:53 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retried 100 times
Jan 28 03:56:53 tvr-pve-09 pmxcfs[3208]: [status] crit: cpg_send_message failed: 6
Jan 28 03:56:53 tvr-pve-09 pve-ha-lrm[3526]: lost lock 'ha_agent_tvr-pve-09_lock - cfs lock update failed - Permission denied
Jan 28 03:56:54 tvr-pve-09 pmxcfs[3208]: [status] notice: cpg_send_message retry 10

The attachments contain logs from the node that failed first and from one of the other nodes.

Please help me figure out the reason for the cluster failure. I am ready to provide all the necessary additional information.
 

Hello,

Could you please send us the output from

Code:
corosync-cfgtool -n

and share with us the Corosync config? The latter is located at `/etc/pve/corosync.conf`. Do you have a dedicated network for Corosync? How many extra Corosync links do you have configured? Is the Corosync network configured on top of a network bond?
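
For reference, a couple of additional commands that can also be helpful when collecting this kind of information (just a suggestion; the exact output format varies a bit between versions):

Code:
# link status of the local node's knet links
corosync-cfgtool -s
# quorum and membership information as seen by Proxmox VE
pvecm status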
 
Thank you so much for your quick reply!

Code:
root@tvr-pve-10:~# corosync-cfgtool -n
Local node ID 10, transport knet
nodeid: 1 reachable
   LINK: 0 udp (10.10.10.17->10.10.10.10) enabled connected mtu: 8885

nodeid: 2 reachable
   LINK: 0 udp (10.10.10.17->10.10.10.11) enabled connected mtu: 8885

nodeid: 3 reachable
   LINK: 0 udp (10.10.10.17->10.10.10.12) enabled connected mtu: 8885

nodeid: 4 reachable
   LINK: 0 udp (10.10.10.17->10.10.10.13) enabled connected mtu: 8885

nodeid: 6 reachable
   LINK: 0 udp (10.10.10.17->10.10.10.14) enabled connected mtu: 8885

nodeid: 7 reachable
   LINK: 0 udp (10.10.10.17->10.10.10.15) enabled connected mtu: 8885

nodeid: 8 reachable
   LINK: 0 udp (10.10.10.17->10.10.10.16) enabled connected mtu: 8885

nodeid: 9 reachable
   LINK: 0 udp (10.10.10.17->10.10.10.19) enabled connected mtu: 8885

nodeid: 11 reachable
   LINK: 0 udp (10.10.10.17->10.10.10.18) enabled connected mtu: 8885


Code:
root@tvr-pve-10:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: tvr-pve-01
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.10.10.8
  }
  node {
    name: tvr-pve-03
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.10
  }
  node {
    name: tvr-pve-04
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.11
  }
  node {
    name: tvr-pve-05
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.12
  }
  node {
    name: tvr-pve-06
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.10.10.13
  }
  node {
    name: tvr-pve-07
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.10.10.14
  }
  node {
    name: tvr-pve-08
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.10.10.15
  }
  node {
    name: tvr-pve-09
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.10.10.16
  }
  node {
    name: tvr-pve-10
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 10.10.10.17
  }
  node {
    name: tvr-pve-11
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 10.10.10.18
  }
  node {
    name: tvr-pve-12
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 10.10.10.19
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: tvr-pve
  config_version: 11
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

No, the corosync network is shared with the Ceph network (100 Gbit).
Only one link for corosync.
Yes, we have a bond.
 
Bonds can introduce some latency for corosync, and depending on the bond mode Corosync might react badly to how fail-over is handled. It is preferable to give it dedicated NICs instead. I would highly recommend giving Corosync at least one dedicated NIC; even 1G is more than enough for Corosync. You can then keep the 100G network as a fallback option.
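
As a rough sketch only (the 10.10.20.x addresses below are placeholders, not something from your setup): a second link can be added per node via a ring1_addr in `/etc/pve/corosync.conf`, and with `link_mode: passive` corosync should prefer the link with the higher `knet_link_priority` while it is up. The usual care when editing that file applies (bump config_version and follow the admin guide procedure).

Code:
totem {
  ...
  link_mode: passive
  interface {
    linknumber: 0
  }
  interface {
    # with link_mode passive, the link with the higher priority is used while it is up
    linknumber: 1
    knet_link_priority: 10
  }
}

nodelist {
  node {
    name: tvr-pve-09
    nodeid: 8
    quorum_votes: 1
    # link 0 stays on the existing 100G network (fallback)
    ring0_addr: 10.10.10.16
    # placeholder address on the dedicated corosync NIC (one per node)
    ring1_addr: 10.10.20.16
  }
  ...
}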
 
The problem is that we do not have a spare switch for a dedicated corosync network; a dedicated NIC would have to go into the switch that carries the cluster's client network.
Can we just add a backup corosync link via the client network? Or is this also a bad option?
Can you explain why our cluster essentially rebooted (all nodes)? I provided the logs above. What mechanism was triggered, or is this a software bug?
 
When a cluster has HA services running, nodes will fence themselves if they lose (Corosync) quorum for over a minute, so that their HA resources can safely be migrated to other nodes. This is explained at [1]. Additionally, Corosync is extremely sensitive to latency; we recommend that the network operate below 5 ms at all times [2]. If many services are running on the 100G NIC, it is possible for the NIC to become saturated and, in turn, for latency to rise to a point where Corosync deems the link unusable. This is why it is important to have a stable and dedicated Corosync network.
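
If you want to see what latency corosync itself is measuring, something along these lines can help (the exact stats key names may differ between corosync 3.x versions; the target IP is just one of your node addresses as an example):

Code:
# knet's per-node, per-link statistics, including the measured latency
corosync-cmapctl -m stats | grep -i latency
# plain round-trip latency over the corosync network to another node
ping -c 100 -i 0.2 10.10.10.16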

Since quorum requires half-plus-one of the votes in the cluster, if half-plus-one of the nodes lose connectivity (2 in a 3-node cluster, or 6 in your 11-node cluster), the remaining nodes no longer hold a majority either, so every node loses quorum and the entire cluster fences. The output you shared does not look quite normal, but it is easier to debug this once the network is suitable for Corosync.

Can we just add a backup corosync link via the client network? Or is this also a bad option?
If you don't have the hardware for a dedicated corosync network, then a stop-gap solution could consist of using a VLAN for Corosync and setting some quality-of-service on the switch side to prioritize that traffic; however, this might not help if the switch itself is saturated. 1G hardware is readily available, and a dedicated NIC would be my first recommendation for the long term.
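
As a very rough illustration (the bond name, VLAN tag and address below are placeholders, not taken from your setup), such a VLAN could be defined on top of the existing bond in `/etc/network/interfaces`, and the resulting per-node addresses could then be used as an extra corosync link as sketched earlier:

Code:
# placeholder: VLAN 50 on top of the existing bond, reserved for corosync traffic
auto bond0.50
iface bond0.50 inet static
        address 10.10.50.16/24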


[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ha_manager_fencing
[2] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_cluster_network_requirements
 
I agree with you about the dedicated network. Thank you for your recommendations.
But still, why did all the nodes fall out of the cluster at some point? After the node with the network problems had been successfully isolated, the quorum-member synchronization attempts kept repeating, followed by a timeout (`[TOTEM ] Process pause detected for 4542 ms, flushing membership messages.`).

Code:
2025-01-28T03:56:23.483640+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] A new membership (1.1518) was formed. Members
2025-01-28T03:56:23.496063+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Members[9]: 1 2 3 4 6 7 9 10 11
2025-01-28T03:56:23.496094+03:00 tvr-pve-03 corosync[3798]:   [MAIN  ] Completed service synchronization, ready to provide service.
2025-01-28T03:56:23.633830+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Sync members[9]: 1 2 3 4 6 7 9 10 11
2025-01-28T03:56:23.633878+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] A new membership (1.151c) was formed. Members
2025-01-28T03:56:23.646028+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Members[9]: 1 2 3 4 6 7 9 10 11
2025-01-28T03:56:23.646073+03:00 tvr-pve-03 corosync[3798]:   [MAIN  ] Completed service synchronization, ready to provide service.
2025-01-28T03:56:23.783824+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Sync members[9]: 1 2 3 4 6 7 9 10 11
2025-01-28T03:56:23.783931+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] A new membership (1.1520) was formed. Members
2025-01-28T03:56:23.796766+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Members[9]: 1 2 3 4 6 7 9 10 11
2025-01-28T03:56:23.796813+03:00 tvr-pve-03 corosync[3798]:   [MAIN  ] Completed service synchronization, ready to provide service.
2025-01-28T03:56:23.934237+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Sync members[9]: 1 2 3 4 6 7 9 10 11
2025-01-28T03:56:23.934333+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] A new membership (1.1524) was formed. Members
2025-01-28T03:56:23.946631+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Members[9]: 1 2 3 4 6 7 9 10 11
2025-01-28T03:56:23.946664+03:00 tvr-pve-03 corosync[3798]:   [MAIN  ] Completed service synchronization, ready to provide service.
2025-01-28T03:56:24.083640+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Sync members[9]: 1 2 3 4 6 7 9 10 11
2025-01-28T03:56:24.083704+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] A new membership (1.1528) was formed. Members
2025-01-28T03:56:24.095836+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Members[9]: 1 2 3 4 6 7 9 10 11
2025-01-28T03:56:24.095867+03:00 tvr-pve-03 corosync[3798]:   [MAIN  ] Completed service synchronization, ready to provide service.
2025-01-28T03:56:24.194581+03:00 tvr-pve-03 pmxcfs[3611]: [status] notice: cpg_send_message retried 1 times
2025-01-28T03:56:24.233802+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Sync members[9]: 1 2 3 4 6 7 9 10 11
2025-01-28T03:56:24.233871+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] A new membership (1.152c) was formed. Members
2025-01-28T03:56:24.246029+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Members[9]: 1 2 3 4 6 7 9 10 11
2025-01-28T03:56:24.246071+03:00 tvr-pve-03 corosync[3798]:   [MAIN  ] Completed service synchronization, ready to provide service.
2025-01-28T03:56:24.384045+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Sync members[9]: 1 2 3 4 6 7 9 10 11
2025-01-28T03:56:24.384122+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] A new membership (1.1530) was formed. Members
2025-01-28T03:56:24.396132+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Members[9]: 1 2 3 4 6 7 9 10 11
2025-01-28T03:56:24.396194+03:00 tvr-pve-03 corosync[3798]:   [MAIN  ] Completed service synchronization, ready to provide service.
2025-01-28T03:56:24.425431+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4442 ms, flushing membership messages.
2025-01-28T03:56:24.425480+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4442 ms, flushing membership messages.
2025-01-28T03:56:24.425502+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4442 ms, flushing membership messages.
2025-01-28T03:56:24.425520+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4442 ms, flushing membership messages.
2025-01-28T03:56:24.425541+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4442 ms, flushing membership messages.
2025-01-28T03:56:24.425574+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4442 ms, flushing membership messages.
2025-01-28T03:56:24.425595+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4442 ms, flushing membership messages.
2025-01-28T03:56:24.425725+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4442 ms, flushing membership messages.
2025-01-28T03:56:24.475421+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4492 ms, flushing membership messages.
2025-01-28T03:56:24.475587+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4492 ms, flushing membership messages.
2025-01-28T03:56:24.475607+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4492 ms, flushing membership messages.
2025-01-28T03:56:24.475627+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4492 ms, flushing membership messages.
2025-01-28T03:56:24.475643+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4492 ms, flushing membership messages.
2025-01-28T03:56:24.475660+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4492 ms, flushing membership messages.
2025-01-28T03:56:24.475685+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4492 ms, flushing membership messages.
2025-01-28T03:56:24.475715+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4492 ms, flushing membership messages.
2025-01-28T03:56:24.475733+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4492 ms, flushing membership messages.
2025-01-28T03:56:24.475750+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4492 ms, flushing membership messages.
2025-01-28T03:56:24.475780+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4492 ms, flushing membership messages.
2025-01-28T03:56:24.475798+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4492 ms, flushing membership messages.
2025-01-28T03:56:24.475815+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4492 ms, flushing membership messages.
2025-01-28T03:56:24.475895+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4492 ms, flushing membership messages.
2025-01-28T03:56:24.525410+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4542 ms, flushing membership messages.
2025-01-28T03:56:24.525709+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4542 ms, flushing membership messages.
2025-01-28T03:56:24.525763+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Process pause detected for 4542 ms, flushing membership messages.

Then this message appeared: `pvescheduler[2422888]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout`
And after a while, the rest of the nodes began to drop out of the quorum.

Code:
2025-01-28T03:56:30.548932+03:00 tvr-pve-03 corosync[3798]:   [MAIN  ] Completed service synchronization, ready to provide service.
2025-01-28T03:56:30.604399+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Sync members[9]: 1 2 3 4 6 7 9 10 11
2025-01-28T03:56:30.604466+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Sync joined[2]: 2 3
2025-01-28T03:56:30.604505+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Sync left[2]: 2 3
2025-01-28T03:56:30.604527+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] A new membership (1.1598) was formed. Members joined: 2 3 left: 2 3
2025-01-28T03:56:30.604546+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Failed to receive the leave message. failed: 2 3
2025-01-28T03:56:30.614087+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Sync members[9]: 1 2 3 4 6 7 9 10 11
2025-01-28T03:56:30.614146+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Sync joined[8]: 2 3 4 6 7 9 10 11
2025-01-28T03:56:30.614168+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Sync left[8]: 2 3 4 6 7 9 10 11
2025-01-28T03:56:30.614187+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] A new membership (1.15a0) was formed. Members joined: 2 3 4 6 7 9 10 11 left: 2 3 4 6 7 9 10 11
2025-01-28T03:56:30.614209+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Failed to receive the leave message. failed: 2 3 4 6 7 9 10 11
2025-01-28T03:56:30.625254+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Sync members[9]: 1 2 3 4 6 7 9 10 11
2025-01-28T03:56:30.625282+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Sync joined[6]: 2 3 7 9 10 11
2025-01-28T03:56:30.625304+03:00 tvr-pve-03 corosync[3798]:   [QUORUM] Sync left[6]: 2 3 7 9 10 11
2025-01-28T03:56:30.625340+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] A new membership (1.15a8) was formed. Members joined: 2 3 7 9 10 11 left: 2 3 7 9 10 11
2025-01-28T03:56:30.625361+03:00 tvr-pve-03 corosync[3798]:   [TOTEM ] Failed to receive the leave message. failed: 2 3 7 9 10 11

Why did the cluster collapse after the problematic node was isolated?
Notably, there were no direct reports of cluster network link loss (like the example below) before the incident.
Code:
2025-01-28T03:57:44.722168+03:00 tvr-pve-03 corosync[3798]:   [KNET  ] link: host: 11 link: 0 is down
2025-01-28T03:57:44.722290+03:00 tvr-pve-03 corosync[3798]:   [KNET  ] link: host: 10 link: 0 is down
2025-01-28T03:57:44.722343+03:00 tvr-pve-03 corosync[3798]:   [KNET  ] link: host: 8 link: 0 is down
2025-01-28T03:57:44.722387+03:00 tvr-pve-03 corosync[3798]:   [KNET  ] link: host: 7 link: 0 is down
2025-01-28T03:57:44.722434+03:00 tvr-pve-03 corosync[3798]:   [KNET  ] link: host: 6 link: 0 is down
2025-01-28T03:57:44.722475+03:00 tvr-pve-03 corosync[3798]:   [KNET  ] link: host: 4 link: 0 is down
2025-01-28T03:57:44.722517+03:00 tvr-pve-03 corosync[3798]:   [KNET  ] link: host: 3 link: 0 is down
2025-01-28T03:57:44.722564+03:00 tvr-pve-03 corosync[3798]:   [KNET  ] link: host: 2 link: 0 is down