Proxmox Corosync /cluster dedicated network, why?

pgro

Hi Everyone,

Until now I have always installed a fresh Proxmox cluster with its own dedicated network port for the Corosync/cluster network. Can someone please explain WHY this is important (or not)? Let's assume we have a server with 4 LAN ports. If we install a cluster with only vmbr0 (the default management interface) plus an additional VLAN for the cluster, e.g. vmbr0.10, what could possibly go wrong? What's the benefit of using a 10G network card to separate the LAN traffic and the Corosync traffic? (There is no storage traffic at the moment; instead we are using replication every hour.)

Another example: we have a Proxmox cluster of 3 nodes with HA, with vmbr0 on bond0 over eno1 & eno2. We also have a cluster network on eno3 (no bridge interface, just the plain port). My node-1 has 3 VMs, while the rest of the nodes don't have any VMs at all, just sitting there. Now I pull the eno1 and eno2 cables out of node-1, what will happen? The cluster will stay healthy, but my VMs will be alive without network connectivity, right? So I have a healthy cluster, but the bridge interface of node-1 is down. So I lose VM connectivity until someone (or something) manually migrates the VMs to the rest of the nodes, correct?

Now let's take the single-NIC example: we run the same scenario as above, but this time everything is running under a single vmbr0, with the cluster network under vmbr0.10. What will happen if I unplug the network cable from eno1?

Thank you
 
Corosync is used for quite a bit in a Proxmox VE cluster and needs low latency.
If the latency goes up, for example because another service is congesting the physical link, Corosync will consider that link unusable. If you have configured multiple Corosync links, it will fall back to those and the cluster will stay stable.

If you have only one Corosync network, or if all Corosync networks are congested, the result will be more or less dramatic.
If there are no HA guests on the node, the Proxmox Cluster FS (pmxcfs, aka pve-cluster.service) will become read-only until the node is part of the majority of the cluster again. The result is that writing to /etc/pve will not work. This affects certain actions, like changing a VM config or starting a VM.

If the node does have HA guests currently running (LRM status is active) and the node cannot join the quorate part of the cluster within 1 minute (as of Jan 2024), it will fence itself. A fence is similar to pushing the reset button on the server.
This happens to make sure that the HA guests are definitely powered off before the (hopefully) remaining rest of the cluster recovers these guests. If more nodes, or all nodes, in the cluster are affected, then they will all fence. From the outside, it looks like the whole cluster just rebooted.

So anything that could congest the network is a problem, which is why the recommendation for a dedicated physical network for Corosync exists. Adding multiple Corosync links is a good idea to cover the case where the dedicated Corosync network itself has a problem, e.g. a broken cable.

Well-known causes of network congestion are Corosync sharing its network(s) with Ceph, network shares, backup targets, the migration network, ...

Putting the Corosync network in a VLAN is not a solution to this, unless you can guarantee a bandwidth reservation. But that is another thing that could be misconfigured or not work as expected, which is why the recommended best practice is to have at least one physical network for Corosync alone.
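If you want to see how close you are to that limit, here is a minimal sketch (my own illustration, not an official tool) that pings the other nodes over the Corosync network and warns once the average round trip creeps up. The peer addresses and the 2 ms warning threshold are made-up assumptions:

Code:
# Minimal latency check for the Corosync network (a sketch, not an official tool).
# The peer addresses and the 2 ms threshold are made-up examples.
import re
import subprocess

COROSYNC_PEERS = ["10.10.10.2", "10.10.10.3"]   # hypothetical ring0 addresses of the other nodes
WARN_MS = 2.0                                   # Corosync wants ~1 ms, so warn well before real trouble

def ping_avg_ms(host: str, count: int = 5) -> float:
    """Average round-trip time in ms, using the system ping."""
    out = subprocess.run(["ping", "-c", str(count), "-q", host],
                         capture_output=True, text=True, check=True).stdout
    # Summary line looks like: rtt min/avg/max/mdev = 0.211/0.254/0.311/0.040 ms
    match = re.search(r"= [\d.]+/([\d.]+)/", out)
    if not match:
        raise RuntimeError(f"could not parse ping output for {host}")
    return float(match.group(1))

for peer in COROSYNC_PEERS:
    avg = ping_avg_ms(peer)
    status = "OK" if avg < WARN_MS else "WARNING: too slow for Corosync"
    print(f"{peer}: avg rtt {avg:.3f} ms -> {status}")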


Now I pull the eno1 and eno2 cables out of node-1, what will happen? The cluster will stay healthy, but my VMs will be alive without network connectivity, right? So I have a healthy cluster, but the bridge interface of node-1 is down. So I lose VM connectivity until someone (or something) manually migrates the VMs to the rest of the nodes, correct?
IIUC yep.
Now let's take the single-NIC example: we run the same scenario as above, but this time everything is running under a single vmbr0, with the cluster network under vmbr0.10. What will happen if I unplug the network cable from eno1?
The node will fence itself if there are HA guests currently running on it.
 
Hi aaron and thank you for your reply,

Corosync is used for quite a bit in a Proxmox VE cluster and needs low latency.
If the latency goes up, for example because another service is congesting the physical link, Corosync will consider that link unusable. If you have configured multiple Corosync links, it will fall back to those and the cluster will stay stable.

If you have only one Corosync network, or if all Corosync networks are congested, the result will be more or less dramatic.
What do you mean by congested? Is this related to latency?
If there are no HA guests on the node, the Proxmox Cluster FS (pmxcfs, aka pve-cluster.service) will become read-only until the node is part of the majority of the cluster again. The result is that writing to /etc/pve will not work. This affects certain actions, like changing a VM config or starting a VM.
To understand this further, can you please describe whether this is the scenario with the all-in-one LAN port, or with a dedicated cluster network?
Since that node will be out of order due to the unplugged cable, it sounds normal to me that pmxcfs becomes read-only. But all VM changes made on the other nodes will be synced once the node is part of the cluster again, correct?

If the node does have HA guests currently running (LRM status is active) and the node cannot join the quorate part of the cluster within 1 minute (as of Jan 2024), it will fence itself. A fence is similar to pushing the reset button on the server.
Fencing via IPMI / PDU or the guest agent? OK, so this is the case if the node's cluster network becomes faulty for any reason?
This happens to make sure that the HA guests are definitely powered off before the (hopefully) remaining rest of the cluster recovers these guests. If more nodes, or all nodes, in the cluster are affected, then they will all fence. From the outside, it looks like the whole cluster just rebooted.

So anything that could congest the network is a problem, which is why the recommendation for a dedicated physical network for Corosync exists. Adding multiple Corosync links is a good idea to cover the case where the dedicated Corosync network itself has a problem, e.g. a broken cable.
It's worth asking here: is it a good idea to create a bond interface for the cluster network, or not?
Well-known causes of network congestion are Corosync sharing its network(s) with Ceph, network shares, backup targets, the migration network, ...
And if Corosync doesn't share its network with anything from the above, except VM and management traffic?
Putting the Corosync network in a VLAN is not a solution to this, unless you can guarantee a bandwidth reservation. But that is another thing that could be misconfigured or not work as expected, which is why the recommended best practice is to have at least one physical network for Corosync alone.



IIUC yep.

The node will fence itself if there are HA guests currently running on it.

OK, but it's still not clear to me.

- Assume a single cluster of 3 nodes, just using a bond across two ports (active-active) with a single IP address for the cluster network, VM/LAN network and management, and the nodes replicate their VMs to each other.
What will happen to a cluster node if the bond (both ports) becomes unavailable, due to network issues like an unplugged cable or a dead switch? In that case the physical server itself remains healthy. What decision will the other two nodes take? The VMs will start on the other two nodes from the last successful replication, but what will happen to the first node when it becomes active again in the cluster network? How does the data stay consistent between the hosts/nodes?


Thank you
 
It's not only about a completely failing network. Corosync wants latency below 1 ms. Some people say higher numbers are fine and that it will still work with something like 10 or even up to 30 ms of latency. Once you saturate the network bandwidth by sending lots of packets (backup, replication, Ceph, ...), the latency goes up and may exceed this limit. Once it does, the cluster will fail even though the NICs are working perfectly fine... just too slowly... the packets time out and the cluster isn't in sync anymore.
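Just to show the scale of the problem, here is a quick back-of-the-envelope sketch (all numbers are made-up assumptions, not measurements): even a modest backlog queued in front of a Corosync packet on a shared 1 Gbit/s link already pushes the delay far past 1 ms.

Code:
# Back-of-the-envelope sketch: how a saturated link turns into latency.
# All numbers are illustrative assumptions, not measurements.
LINK_GBIT = 1.0                  # assumed shared 1 Gbit/s NIC
QUEUED_MB = [0.1, 1, 10, 100]    # assumed backlog sitting in front of a Corosync packet

link_bytes_per_s = LINK_GBIT * 1e9 / 8

for mb in QUEUED_MB:
    wait_ms = (mb * 1e6) / link_bytes_per_s * 1000
    verdict = "fine" if wait_ms < 1 else "already past the ~1 ms comfort zone"
    print(f"{mb:>6} MB queued ahead -> ~{wait_ms:8.2f} ms extra delay ({verdict})")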
 
OK, I will test this out. Regarding cluster, fencing and latency: in order to better understand how Proxmox works, is there any flow, in point form? For example something like the explanation below, taken from https://www.alteeve.com/w/AN!Cluster_Tutorial_2#Concept;_Fencing (a rough sketch of the counter-and-quorum logic follows the list):

  • The totem token moves around the cluster members. As each member gets the token, it sends sequenced messages to the CPG members.
  • The token is passed from one node to the next, in order and continuously during normal operation.
  • Suddenly, one node stops responding.
    • A timeout starts (~238 ms by default); each time the timeout is hit, an error counter increments and a replacement token is created.
    • The silent node responds before the failure counter reaches the limit.
      • The failure counter is reset to 0
      • The cluster operates normally again.
  • Again, one node stops responding.
    • Again, the timeout begins. As each totem token times out, a new packet is sent and the error count increments.
    • The error count exceeds the limit (4 errors is the default); roughly one second has passed (238 ms × 4 plus some overhead).
    • The node is declared dead.
    • The cluster checks which members it still has, and if that provides enough votes for quorum.
      • If there are too few votes for quorum, the cluster software freezes and the node(s) withdraw from the cluster.
      • If there are enough votes for quorum, the silent node is declared dead.
        • corosync calls fenced, telling it to fence the node.
        • The fenced daemon notifies DLM and locks are blocked.
        • Which fence device(s) to use, that is, what fence_agent to call and what arguments to pass, is gathered.
        • For each configured fence device:
          • The agent is called and fenced waits for the fence_agent to exit.
          • The fence_agent's exit code is examined. If it's a success, recovery starts. If it failed, the next configured fence agent is called.
        • If all (or the only) configured fence fails, fenced will start over.
        • fenced will wait and loop forever until a fence agent succeeds. During this time, the cluster is effectively hung.
      • Once a fence_agent succeeds, fenced notifies DLM and lost locks are recovered.
        • GFS2 partitions recover using their journal.
        • Lost cluster resources are recovered as per rgmanager's configuration (including file system recovery as needed).
  • Normal cluster operation is restored, minus the lost node.
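As a rough sketch of the counter-and-quorum logic in those steps (the 238 ms timeout and the limit of 4 failures are the tutorial's example values, not necessarily the defaults Proxmox VE / Corosync 3 ship with):

Code:
# Toy model of the counter-and-quorum logic from the steps above.
# 238 ms and the limit of 4 failures are the tutorial's example values,
# not necessarily the defaults shipped with Proxmox VE / Corosync 3.
TOKEN_TIMEOUT_MS = 238
FAIL_LIMIT = 4
TOTAL_VOTES = 3          # e.g. a 3-node cluster, one vote per node

def handle_silent_node(missed_timeouts: int, reachable_votes: int) -> str:
    """What the remaining members do about a node that stopped responding."""
    if missed_timeouts < FAIL_LIMIT:
        # It answered before the failure counter hit the limit: counter resets.
        return "node recovered in time, failure counter reset"
    # Roughly FAIL_LIMIT * TOKEN_TIMEOUT_MS ms have passed -> declare it dead.
    if reachable_votes <= TOTAL_VOTES // 2:
        # No majority left: the remaining members must not act on their own.
        return "no quorum, remaining members freeze"
    # Majority still present: the silent node gets fenced before recovery starts.
    return "quorum kept, silent node declared dead and fenced"

print(handle_silent_node(missed_timeouts=2, reachable_votes=2))   # came back in time
print(handle_silent_node(missed_timeouts=4, reachable_votes=2))   # 2 of 3 votes -> fence it
print(handle_silent_node(missed_timeouts=4, reachable_votes=1))   # 1 of 3 votes -> freeze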
Thank you
 
As @Dunuin already mentioned, a network can congest if some service(s) use up all the available bandwidth. Packets of other services will need to wait longer until they can be transmitted -> latency for that service goes up.

To understand this further, can you please describe whether this is the scenario with the all-in-one LAN port, or with a dedicated cluster network?
Since that node will be out of order due to the unplugged cable, it sounds normal to me that pmxcfs becomes read-only. But all VM changes made on the other nodes will be synced once the node is part of the cluster again, correct?
In the case of no HA guests, the guests will remain and keep running on the node that lost the cluster communication.
Fencing via IPMI / PDU or the guest agent? OK, so this is the case if the node's cluster network becomes faulty for any reason?
By default, via the Linux kernel's watchdog on the PVE host.

It's worth asking here: is it a good idea to create a bond interface for the cluster network, or not?
Not really, as Corosync can handle multiple networks by itself, something most other networked services cannot do. Having multiple networks configured as Corosync links gives it the option to switch if one network shows problems. It will also switch a lot faster than a bond would.
You can either configure them when creating the cluster, or even later: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy

Keep in mind that ideally you have one physical network dedicated to Corosync, to avoid congestion caused by other services. The additional Corosync links will then hopefully keep the cluster working should the dedicated Corosync network run into problems.
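If you want a quick look at how many links each node currently has, something like this rough sketch works. It naively scans /etc/corosync/corosync.conf with a regex and assumes the usual layout where "name:" comes before the "ringX_addr:" lines; it is not a full parser of the corosync.conf syntax.

Code:
# Rough check of how many Corosync links (ringX_addr) each node has configured.
# Naively scans /etc/corosync/corosync.conf; assumes "name:" precedes the
# "ringX_addr:" lines inside each node block. Not a full parser.
import re
from collections import defaultdict
from pathlib import Path

conf = Path("/etc/corosync/corosync.conf").read_text()

links = defaultdict(list)
current = None
for line in conf.splitlines():
    if m := re.match(r"\s*name\s*:\s*(\S+)", line):
        current = m.group(1)
    elif (m := re.match(r"\s*ring(\d+)_addr\s*:\s*(\S+)", line)) and current:
        links[current].append((int(m.group(1)), m.group(2)))

for node, addrs in links.items():
    pretty = ", ".join(f"link{n}: {a}" for n, a in sorted(addrs))
    note = "" if len(addrs) > 1 else "   <- only one link, consider adding a second"
    print(f"{node}: {pretty}{note}")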
And if Corosync doesn't share its network with anything from the above, except VM and management traffic?
Then you will not run into the situation where the latency goes up high enough for Corosync to consider the network unusable/down. But it could still have other problems, like a broken or bad cable -> hence multiple links.


The link you found is roughly valid for Proxmox VE as well. But the details differ.

The part that matters for fencing works closely together with the HA stack. There are two services on each node handling HA: the CRM (cluster resource manager) and the LRM (local resource manager).

The LRM updates its status via the pmxcfs every 10 seconds. You can see on the HA status page in the GUI when a node was last seen online.
On top of the Linux kernel watchdog device, we have the "watchdog-mux" service. The LRM will renew that watchdog via the watchdog-mux service.

Once a node realizes that Corosync lost the connection to the quorate part of the cluster, the LRM will keep renewing the watchdog for one minute. If the cluster connection cannot be reestablished in that time, it simply stops renewing the watchdog. That means that once the watchdog runs out, the kernel resets the machine.

The other nodes in the cluster, which can still form a quorum, will also notice that too much time has passed since the lost node last updated its status. They will wait for a total of two minutes before the current CRM master organizes where to recover the HA-configured guests that were on the lost node.
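Put as a simplified timeline (the 10 s, 60 s and 120 s values are the ones described above; the real HA stack has more moving parts):

Code:
# Simplified timeline of the behavior described above. The 10 s / 60 s / 120 s
# values are the ones from this explanation; the real HA stack has more detail.
LRM_UPDATE_S = 10        # LRM writes its status via pmxcfs
SELF_FENCE_AFTER_S = 60  # LRM stops renewing the watchdog -> node resets
RECOVER_AFTER_S = 120    # CRM master starts recovering the guests elsewhere

def timeline(horizon_s: int = 140) -> None:
    for since in range(0, horizon_s + 1, LRM_UPDATE_S):
        if since < SELF_FENCE_AFTER_S:
            event = "lost node still renews its watchdog, hoping to rejoin"
        elif since < RECOVER_AFTER_S:
            event = "watchdog no longer renewed -> node has reset itself (fenced)"
        else:
            event = "CRM master recovers the HA guests on the remaining nodes"
        print(f"t+{since:3d}s  {event}")

timeline()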
 
Thank you all for the great explanation. So, to keep things simple: it's always preferable to use a separate, dedicated NIC for the cluster network, in order to avoid unexpected behavior with Proxmox. Is it better to put the IP directly on the physical interface, or instead create a bridge interface and then assign the IP to the bridge?
 
In case my cluster is working with an iSCSI shared storage device (on another NIC): if I lose the server, what is going to happen to the VM data? Won't it remain locked? Will replication or HA to another node succeed?
 
Thank you all for the great explanation. So, to keep things simple: it's always preferable to use a separate, dedicated NIC for the cluster network, in order to avoid unexpected behavior with Proxmox. Is it better to put the IP directly on the physical interface, or instead create a bridge interface and then assign the IP to the bridge?
A bridge is only needed if you plan to give guests access to that network, because a bridge is basically an internal switch. Otherwise you can always configure the host's IP directly on the NIC or bond. Just make sure that the "autostart" checkbox is enabled.
In case my cluster is working with an iSCSI shared storage device (on another NIC): if I lose the server, what is going to happen to the VM data? Won't it remain locked? Will replication or HA to another node succeed?
That is why the node with HA guests will fence itself. Proxmox VE makes sure that there is only one instance of the guest accessing the disk image. Once the node is fenced, the guests that used the LV on which the disk image is stored are no longer running. So even if the guest is recovered and started on another node, there will only ever be one instance accessing the disk image at a time.
 