Corosync / HA - cluster wide reboot

Jul 6, 2024
Hello,
A few times per month I have experienced an issue where the entire cluster reboots. By rebooting I mean all physical servers rebooting.
I don't see anything in particular in the logs, but I believe it is caused by the Corosync / HA agent's watchdog rebooting the physical servers because of a network issue.

Previously I just had ring_0, and I know for a fact that it sometimes loses connectivity. I have a janky bond0 LACP config where 10.10.10.0/24 is only reachable via one of the uplinks and not the other (yes, it's bad, but it works until bond0 decides to move traffic over to the other uplink).

Now I've added ring_1, using a dedicated NIC and switch just for this. The config looks like this:
corosync.conf
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: server-04-vm
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.4
    ring1_addr: 10.250.30.4
  }
  node {
    name: server-05-vm
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.5
    ring1_addr: 10.250.30.5
  }
  node {
    name: server-06-vm
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.6
    ring1_addr: 10.250.30.6
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Proxmox-VM
  config_version: 30
  interface {
    linknumber: 0
    knet_link_priority: 100
  }
  interface {
    linknumber: 1
    knet_link_priority: 200
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

Based on my understanding, ring1_addr / link 1 should now always be used for corosync. Checking the logs, I can confirm this is the case:
Code:
May 23 15:25:40 server-04-vm corosync[2418]:   [KNET  ] link: host: 2 link: 0 is down
May 23 15:25:40 server-04-vm corosync[2418]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 200)
May 23 15:25:43 server-04-vm corosync[2418]:   [KNET  ] link: host: 3 link: 0 is down
May 23 15:25:43 server-04-vm corosync[2418]:   [KNET  ] host: host: 3 (passive) best link: 1 (pri: 200)
May 23 15:25:49 server-04-vm corosync[2418]:   [KNET  ] rx: host: 3 link: 0 is up
May 23 15:25:49 server-04-vm corosync[2418]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
May 23 15:25:49 server-04-vm corosync[2418]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 200)
May 23 15:25:49 server-04-vm corosync[2418]:   [KNET  ] pmtud: Global data MTU changed to: 1397
May 23 15:25:59 server-04-vm corosync[2418]:   [KNET  ] rx: host: 2 link: 0 is up
May 23 15:25:59 server-04-vm corosync[2418]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
May 23 15:25:59 server-04-vm corosync[2418]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 200)
May 23 15:25:59 server-04-vm corosync[2418]:   [KNET  ] pmtud: Global data MTU changed to: 1397


Am I in the clear now? When link 0 flaps again (on a majority of servers), can I expect it not to care, or do I need any other config so that all the servers in the cluster aren't rebooted? For your information, the cluster consists of more than 3 servers, but I trimmed the config so this post isn't too long!


I'm unsure if I need to do any other configuration here, or if corosync is the only config that needs to be touched. I use the HA feature; does it have a separate config to tell it to always use 10.250.30.0/24 and not use 10.10.10.0/24 (the unstable one) unless the primary goes down?

Thank you for the help.
If the question was unclear: can 10.10.10.0/24 die and everything is fine, no reboots?
 
Can 10.10.10.0/24 die and everything is fine, no reboots?

Cross-check the output of corosync-cfgtool -s for the actual current status of all connections.

Ideally all nodes are "connected" in both LINK groups. Anything else is asking for trouble.

In my personal understanding we are fine as long as enough nodes in one of the LINK groups are present and can produce Quorum.
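
For reference, that command plus pvecm status (both stock tools on every node) give a quick picture:
Code:
# Status of every knet link as corosync on this node sees it
corosync-cfgtool -s
# Quorum / membership summary from this node's point of view
pvecm status
Run the first one on every node; the links have to look healthy from each node's perspective, not just from one of them.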
 
Cross-check the output of corosync-cfgtool -s for the actual current status of all connections.

Ideally all nodes are "connected" in both LINK groups. Anything else is asking for trouble.

In my personal understanding we are fine as long as enough nodes in one of the LINK groups are present and can produce Quorum.
Yeah. I'm just really worried the HA agent would fence the Proxmox servers again when LINK0 goes bad.
I never knew it would reboot them, but one thing's for sure: I'm always adding ring0 and ring1 in the future! For now the command shows all servers as connected on both links.
 
HA acts locally on each host and will fence a host if the host loses quorum. To lose quorum, corosync on that host has to decide that neither link0 nor link1 is operating properly (NIC link down, switch down, too much jitter, too much packet loss, etc.). As long as a host has at least one corosync link up, it won't lose quorum and won't be rebooted by HA.

HA waits around 60s after both corosync links go down before fencing, so if the links are only down briefly, no fence will happen. Depending on the size of the cluster, latency, etc., corosync may or may not have enough time to reestablish quorum within those 60 seconds [1].

[1] https://forum.proxmox.com/threads/h...-of-the-hypervisors-failed.164873/post-764076
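
If you want to keep an eye on this from the node side, the built-in tools are enough; a minimal example:
Code:
# Is this node currently quorate?
pvecm status
# HA stack overview: manager/master state, LRM state per node, resource states
ha-manager status
# HA and corosync logs around the time of a suspected fence (adjust the window)
journalctl -u pve-ha-lrm -u pve-ha-crm -u corosync --since "1 hour ago"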
 
HA waits around 60s after both corosync links go down before fencing, so if the links are only down briefly, no fence will happen. Depending on the size of the cluster, latency, etc., corosync may or may not have enough time to reestablish quorum within those 60 seconds [1].
Things are improving, guys. After adding the two links, I haven't had any cluster-wide reboots in a week. I see link 0 flapping like crazy. Check the logs if you want, it's wild, but no reboots because link 1 is stable, yay!
Code:
May 23 15:25:40 server1-vm corosync[2418]:   [KNET  ] link: host: 4 link: 0 is down
May 23 15:25:40 server1-vm corosync[2418]:   [KNET  ] host: host: 4 (passive) best link: 1 (pri: 200)
May 23 15:25:43 server1-vm corosync[2418]:   [KNET  ] link: host: 14 link: 0 is down
May 23 15:25:43 server1-vm corosync[2418]:   [KNET  ] host: host: 14 (passive) best link: 1 (pri: 200)
May 23 15:25:49 server1-vm corosync[2418]:   [KNET  ] rx: host: 14 link: 0 is up
May 23 15:25:49 server1-vm corosync[2418]:   [KNET  ] link: Resetting MTU for link 0 because host 14 joined
May 23 15:25:49 server1-vm corosync[2418]:   [KNET  ] host: host: 14 (passive) best link: 1 (pri: 200)
May 23 15:25:49 server1-vm corosync[2418]:   [KNET  ] pmtud: Global data MTU changed to: 1397
May 23 15:25:59 server1-vm corosync[2418]:   [KNET  ] rx: host: 4 link: 0 is up
May 23 15:25:59 server1-vm corosync[2418]:   [KNET  ] link: Resetting MTU for link 0 because host 4 joined
May 23 15:25:59 server1-vm corosync[2418]:   [KNET  ] host: host: 4 (passive) best link: 1 (pri: 200)
May 23 15:25:59 server1-vm corosync[2418]:   [KNET  ] pmtud: Global data MTU changed to: 1397
May 26 20:12:56 server1-vm corosync[2418]:   [KNET  ] link: host: 5 link: 0 is down
May 26 20:12:56 server1-vm corosync[2418]:   [KNET  ] host: host: 5 (passive) best link: 1 (pri: 200)
May 26 20:13:00 server1-vm corosync[2418]:   [KNET  ] rx: host: 5 link: 0 is up
May 26 20:13:00 server1-vm corosync[2418]:   [KNET  ] link: Resetting MTU for link 0 because host 5 joined
May 26 20:13:00 server1-vm corosync[2418]:   [KNET  ] host: host: 5 (passive) best link: 1 (pri: 200)
May 26 20:13:00 server1-vm corosync[2418]:   [KNET  ] pmtud: Global data MTU changed to: 1397
May 27 05:04:22 server1-vm corosync[2418]:   [KNET  ] link: host: 5 link: 0 is down
May 27 05:04:22 server1-vm corosync[2418]:   [KNET  ] host: host: 5 (passive) best link: 1 (pri: 200)
May 27 05:04:29 server1-vm corosync[2418]:   [KNET  ] rx: host: 5 link: 0 is up
May 27 05:04:29 server1-vm corosync[2418]:   [KNET  ] link: Resetting MTU for link 0 because host 5 joined
May 27 05:04:29 server1-vm corosync[2418]:   [KNET  ] host: host: 5 (passive) best link: 1 (pri: 200)
May 27 05:04:29 server1-vm corosync[2418]:   [KNET  ] pmtud: Global data MTU changed to: 1397
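
Side note on how bad the flapping actually is: a quick and dirty count of the link 0 drops per peer, based on the knet messages above (the grep pattern and time window are just examples):
Code:
# Total "link: 0 is down" events this node logged in the last 7 days
journalctl -u corosync --since "7 days ago" | grep -c "link: 0 is down"
# Same, but grouped by peer host id
journalctl -u corosync --since "7 days ago" | grep "link: 0 is down" | awk '{print $(NF-4)}' | sort | uniq -c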

However, there was one server that rebooted on May 25th due to HA fencing. I didn't notice anything special: the server was at 40% CPU usage across 128 vCPUs with 200G of RAM available, the load was just at 40, and nothing was wrong in the BMC etc. I did see "RT throttling activated", and afterwards corosync losing connectivity, followed later by the fencing. Quite strange. I'm going to migrate the VMs from this server to the rest and wait and see. If I get any more reboots, I'll need to disable HA so I can investigate what's causing it; rough plan below the log. If there's high load, I need to find what's causing the load.

Code:
May 25 18:55:06 server-1-vm pvestatd[2327]: restarting server after 379 cycles to reduce memory usage (free 167688 (15700) KB)
May 25 18:55:06 server-1-vm pvestatd[2327]: server shutdown (restart)
May 25 18:55:07 server-1-vm pvestatd[2327]: restarting server
May 25 18:55:17 server-1-vm kernel: libceph: osd26 (1)10.2.4.5:6815 socket closed (con state OPEN)
May 25 18:55:22 server-1-vm pvestatd[2327]: local sdn network configuration is not yet generated, please reload
May 25 18:55:23 server-1-vm pvestatd[2327]: status update time (5.678 seconds)
May 25 18:55:41 server-1-vm kernel: libceph: osd10 (1)10.2.4.4:6809 socket closed (con state OPEN)
May 25 18:55:47 server-1-vm kernel: libceph: osd9 (1)10.2.4.4:6815 socket closed (con state OPEN)
May 25 18:56:12 server-1-vm pvestatd[2327]: status update time (5.204 seconds)
May 25 18:56:20 server-1-vm kernel: libceph: osd19 (1)10.2.4.9:6804 socket closed (con state OPEN)
May 25 18:56:23 server-1-vm kernel: libceph: osd19 (1)10.2.4.9:6804 socket closed (con state OPEN)
May 25 18:56:33 server-1-vm pvestatd[2327]: status update time (5.294 seconds)
May 25 18:56:38 server-1-vm kernel: sched: RT throttling activated
May 25 18:56:38 server-1-vm corosync[2301]:   [KNET  ] link: host: 1 link: 0 is down
May 25 18:56:38 server-1-vm corosync[2301]:   [KNET  ] host: host: 1 (passive) best link: 1 (pri: 200)
May 25 18:56:44 server-1-vm corosync[2301]:   [KNET  ] link: host: 2 link: 0 is down
May 25 18:56:45 server-1-vm corosync[2301]:   [KNET  ] link: host: 2 link: 1 is down
May 25 18:56:46 server-1-vm corosync[2301]:   [KNET  ] link: host: 1 link: 1 is down
May 25 18:56:46 server-1-vm corosync[2301]:   [KNET  ] link: host: 16 link: 0 is down
May 25 18:56:47 server-1-vm corosync[2301]:   [KNET  ] link: host: 16 link: 1 is down
May 25 18:56:47 server-1-vm corosync[2301]:   [KNET  ] link: host: 15 link: 0 is down
May 25 18:56:47 server-1-vm corosync[2301]:   [TOTEM ] Token has not been received in 8606 ms
May 25 18:56:48 server-1-vm corosync[2301]:   [KNET  ] link: host: 15 link: 1 is down
May 25 18:56:48 server-1-vm corosync[2301]:   [KNET  ] link: host: 13 link: 0 is down
May 25 18:56:49 server-1-vm corosync[2301]:   [KNET  ] link: host: 13 link: 1 is down
May 25 18:56:49 server-1-vm corosync[2301]:   [KNET  ] link: host: 12 link: 0 is down
May 25 18:56:50 server-1-vm corosync[2301]:   [KNET  ] link: host: 12 link: 1 is down
May 25 18:56:50 server-1-vm corosync[2301]:   [KNET  ] link: host: 8 link: 0 is down
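
For the record, the rough plan if it fences again (vm:100 and the target node name are placeholders, not my real IDs):
Code:
# Ask HA to move a VM off the suspect node
ha-manager migrate vm:100 server-05-vm
# Take a VM out of HA entirely, so a quorum loss on this node no longer arms the watchdog
ha-manager remove vm:100
# Sanity check: nothing on this node should be listed as an HA resource any more
ha-manager status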
 