Corosync - how sensitive is it to network performance and interruptions?

Hi, I'm building a new PVE kit with 6 hosts. I understand that Corosync pushes very little data but is sensitive to latency. I'll be using VLAN'd 100G NICs for storage, VM traffic and the PVE management GUI, but I thought I'd dedicate the 1G NICs that come standard with the servers to Corosync, so I don't need to worry about Corosync behaviour if the 100G NICs hit a busy spike. I'd use two 1G NICs on each server, through two switches, running active/standby. I'd appreciate any pearls of wisdom regarding this proposed config.
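
For the record, this is roughly what I was planning to put in /etc/network/interfaces on each node; the interface names and addresses are just placeholders:

    auto eno1
    iface eno1 inet manual

    auto eno2
    iface eno2 inet manual

    # active-backup bond across the two 1G NICs, one cable to each switch
    auto bond0
    iface bond0 inet static
        address 10.10.0.1/24
        bond-slaves eno1 eno2
        bond-mode active-backup
        bond-primary eno1
        bond-miimon 100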

If the active switch dies I expect a few seconds of pause before the standby kicks in. Would this cause any problem for the cluster? What happens to the cluster if there is an interruption or a break in Corosync communications? Would it cause running VMs to falter?
 
This is a very good idea. You could theoretically also add the 100G connection as a third, backup link, but two 1G links should be very good.
If you are running high availability, the magic number is two minutes until the servers start fencing, something you would not want to happen if the server doesn't have a real problem. So if your switch breaks down, a few seconds are fine (just know that some things stop working until traffic fails over to the other switch).
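
If you want to watch what Corosync actually sees while a switch dies, the standard tools are enough; I won't quote exact output since it varies a bit between versions, but roughly:

    # quorum / membership state of the cluster
    pvecm status

    # per-link status as this node's kronosnet sees it
    corosync-cfgtool -s

    # follow Corosync's log during the failover
    journalctl -u corosync -f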
 
A few more notes:
1. You are planning an even number of hosts; it's generally recommended to have an odd number, or to add a vote-daemon service off-host so the vote count is odd.
2. What seems to be the preferred method (at least from discussing with my Proxmox supplier) is to let Corosync handle the redundancy itself, instead of running active/backup at the network-interface level. That way, if the switch breaks but keeps the link "up" on the device, it isn't the network adapter's task to notice. So have both ports active, with different IP ranges, and in the Corosync config list the two IPs as the first two links, possibly with the 100G link as a third option (see the sketch after this list). This also gives you the option, if you do use a vote daemon, to give that device IPs in two or three of the Corosync ranges so it also has multiple routes.
3. Like bkini said, there is a delay before HA starts kicking servers offline (although from what I have seen that delay is 1 minute, not 2).
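
To illustrate point 2, here is a rough sketch of how the relevant parts of /etc/pve/corosync.conf could look with the two 1G networks as links 0 and 1 and the 100G VLAN as a lower-priority link 2. The names, addresses and priorities are made up, so adjust them to your own ranges, and remember to bump config_version whenever you edit the file:

    totem {
      cluster_name: pve-cluster
      config_version: 4
      ip_version: ipv4-6
      link_mode: passive
      secauth: on
      version: 2
      interface {
        linknumber: 0
        knet_link_priority: 30
      }
      interface {
        linknumber: 1
        knet_link_priority: 20
      }
      interface {
        linknumber: 2
        knet_link_priority: 10
      }
    }

    nodelist {
      node {
        name: pve1
        nodeid: 1
        quorum_votes: 1
        ring0_addr: 10.10.0.1
        ring1_addr: 10.20.0.1
        ring2_addr: 10.30.0.1
      }
      # ...and so on for the other five nodes with their own addresses
    }

With link_mode set to passive, the healthy link with the highest priority value carries the traffic, so the 100G link would only be used if both 1G links are down. If you go the vote-daemon route, the external vote is normally added with "pvecm qdevice setup <address>" after installing corosync-qnetd on that machine, but double-check the docs for the exact steps.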
 
Depends on what you have set up:
If you only have a cluster (so no HA or similar):
All running VMs will remain running, and rebooting from within the VM SHOULD work as well (since KVM/QEMU never stops), but restarting from Proxmox and/or starting new VMs / making changes (including migrations) will not work until the sync is restored.

If you do have HA set up, though, and one node / less than half of the nodes become unreachable for long enough (somewhere between 10 and 60 seconds, I believe), the affected host will reset-reboot (all its VMs are killed, not shut down, and the host is restarted, to ensure that the other hosts can safely start the VMs from the failed host).
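
A quick way to see which of those two situations applies on a node is to check whether the HA stack is actually managing anything; these are just the standard tools, so take it as a rough checklist:

    # shows the HA stack state and any HA-managed resources;
    # if nothing is under HA control, the watchdog never arms and the node won't self-fence
    ha-manager status

    # the HA resource definitions live here; empty or absent means no guests under HA
    cat /etc/pve/ha/resources.cfg

    # confirm the node currently sees a quorate cluster
    pvecm status | grep -i quorate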
 