Corosync - how sensitive to network performance and interruptions

stevehughes · Jul 24, 2024

Hi, I'm building a new PVE cluster with 6 hosts. I understand that Corosync pushes very little data but is sensitive to latency. I'll be using VLAN'd 100G NICs for storage, VM traffic and the PVE management GUI, but I thought I'd dedicate the 1G NICs that come standard with the servers to Corosync, so I don't need to worry about Corosync behaviour if the 100G NICs hit a busy spike. I'd use two 1G NICs on each server, through two switches, running active/standby. I'd appreciate any pearls of wisdom regarding this proposed config.

If the active switch dies I expect a pause of a few seconds before the standby kicks in. Would this cause any problem for the cluster? What happens to the cluster if there is an interruption or a break in Corosync communications? Would it cause running VMs to falter?

bkinigadner · Jul 24, 2024

This is a very good idea. You could theoretically also add the 100G connection as a third (backup) link, but two 1G links should be plenty.
If you are running high availability, the magic number is two minutes until the servers start fencing, which is something you don't want to happen when the server doesn't have a real problem. So if your switch breaks down, a few seconds are fine (just be aware that some things stop working until traffic fails over to the other switch).

sw-omit · Jul 24, 2024

A few more notes:
1. You are planning an even number of hosts. It's generally recommended to have an odd number of hosts, or to add an off-host vote daemon so that the total number of votes is odd.
2. The preferred method (at least according to my Proxmox supplier) is to let Corosync itself handle the failover instead of running active/backup at the network-interface level. That way, if a switch breaks but keeps the link "up" on the device, it isn't the network adapter's job to notice. So keep both ports active on different IP ranges, and in the Corosync config list those two IPs as the first two links, possibly with the 100G link as a third. If you do use a vote daemon, this also lets you give that device two or three IPs in the Corosync ranges so that it has multiple routes too.
3. Like bkini said, there is a delay before HA starts kicking servers offline (although from what I have seen that delay is 1 minute, not 2).
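
To make point 1 concrete, here is a minimal sketch of the majority arithmetic, assuming every host (and the optional vote daemon) carries exactly one vote and that quorum means a strict majority of all votes. The function names are made up for illustration and are not part of any Proxmox or Corosync tool.

```python
def quorum_threshold(total_votes: int) -> int:
    """Smallest number of votes that is a strict majority of total_votes."""
    return total_votes // 2 + 1


def quorate_sides(total_votes: int, side_a_votes: int) -> tuple[bool, bool]:
    """For a network split into two sides, report which side still has quorum."""
    need = quorum_threshold(total_votes)
    side_b_votes = total_votes - side_a_votes
    return side_a_votes >= need, side_b_votes >= need


if __name__ == "__main__":
    # 6 hosts, no extra vote: a 3/3 split leaves neither side with a majority.
    print(quorum_threshold(6), quorate_sides(6, 3))
    # 6 hosts plus one off-host vote (7 votes): a 4/3 split leaves one side quorate.
    print(quorum_threshold(7), quorate_sides(7, 4))
```

With 6 votes the threshold is 4, so an even split stalls both halves; adding a seventh vote keeps the threshold at 4 but guarantees that any two-way split has exactly one majority side.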

stevehughes · Jul 25, 2024

Thank you both for your suggestions. That's very helpful.

What happens to running VMs if corosync fails? I'm coming from vSphere. We have configured the available options so that VMs will continue to run if vCenter disappears or if a host becomes isolated.

sw-omit · Jul 25, 2024

Depends on what you have set up.
If you only have a cluster (so no HA or similar):
All running VMs will remain running, and rebooting from within a VM SHOULD work as well (since the KVM/QEMU process never stops), but restarting VMs from Proxmox, starting new VMs and making changes (including migrations) will not work until the Corosync connection is restored.

If you do have HA set up, though, and one node (or any group of fewer than half the nodes) loses contact with the rest for long enough (somewhere between 10 and 60 seconds, I believe), the affected host will hard-reset: all of its VMs are killed, not shut down, and the host is restarted. This ensures that the other hosts can safely start the VMs that were on the failed host.
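
The two cases above can be summarised in a small decision sketch. This is only a model of the behaviour described in this thread, not Proxmox code: the function name, the outcome strings and the default `fence_delay_s` of 60 seconds are assumptions (the thread mentions figures between 10 seconds and 2 minutes), so treat the delay as a parameter to check against your own setup.

```python
def node_outcome(has_quorum: bool, ha_active: bool,
                 outage_s: float, fence_delay_s: float = 60.0) -> str:
    """Rough model of what happens to one node during a Corosync outage.

    fence_delay_s is an assumed value, not a documented constant.
    """
    if has_quorum:
        # The node still sees a majority of the cluster: nothing changes.
        return "normal operation"
    if ha_active and outage_s >= fence_delay_s:
        # HA in use and the outage outlasted the delay: the host resets itself.
        return "host fenced, VMs killed"
    # No HA, or the outage was short: VMs keep running, but starting VMs,
    # migrations and config changes are blocked until quorum returns.
    return "VMs keep running, changes blocked"


if __name__ == "__main__":
    print(node_outcome(has_quorum=False, ha_active=False, outage_s=300))
    print(node_outcome(has_quorum=False, ha_active=True, outage_s=5))
    print(node_outcome(has_quorum=False, ha_active=True, outage_s=120))
```

Under these assumptions, a switch failover of a few seconds falls in the "changes blocked" case even with HA enabled, and only an outage longer than the fence delay resets the host.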

stevehughes · Jul 26, 2024

Thanks - very much appreciated.
