Corosync network interaction

jdurand
Jun 3, 2020
Hello,

Like many other users, I have a problem with node instability when another node reboots: https://forum.proxmox.com/threads/all-servers-from-cluster-reboot-after-one-server-reboots.37278/

I'm using Proxmox 5.3.5 and corosync with auth and multicast. (I have read the multicast notes about latency, snooping, and querier problems.)
But I can't understand what network traffic could interact badly with corosync.
Message #10 posted by Alwin:
Based on my assumption, if that is the case, then what you are seeing is possibly caused by the recovery of Ceph (+ other traffic) while one node reboots, and this interferes with corosync's traffic.

What traffic can interfere with corosync?
 
What traffic can interfere with corosync?

Traffic that uses some bandwidth constantly. This can be storage traffic, especially the Ceph private network, as it acts as the backbone for mirroring objects and rebalancing data. VM migrations may also have some impact, though they can be rate-limited. Those patterns are often not directly a problem due to their bandwidth use alone - corosync actually doesn't require that much bandwidth. They become a problem because they increase latency: with the network stack constantly chattering, new packets may need a few milliseconds more to get from being sent by an application (e.g., corosync) until they really hit the network and their receivers. This is an issue because corosync uses a consensus-forming (quorum) algorithm which depends on packets being transmitted in under a certain time - ideally < 2-5 ms, with a maximum of about 8-9 ms latency.
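
To give one example of the rate limiting mentioned above (just a sketch - the bwlimit option and its exact syntax depend on the PVE version, see man datacenter.cfg): a cluster-wide cap for migration traffic can be set in /etc/pve/datacenter.cfg, the value is in KiB/s.

# /etc/pve/datacenter.cfg - sketch: cap migration traffic at roughly 100 MiB/s
bwlimit: migration=102400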

So, in general, the most important thing is that latency spikes are avoided at all cost. A rather straightforward way to guarantee that is to give the main corosync link its own physical network.
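
A quick way to get a feeling for the latency on the corosync link is a longer ping run, looking at the max/mdev values in the summary line (just a sketch, assuming 10.0.0.2 is the corosync address of the other node):

# 300 probes, 5 per second; the summary line shows min/avg/max/mdev round-trip times
ping -c 300 -i 0.2 10.0.0.2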

I'm using Proxmox 5.3.5

Note here that I highly recommend upgrading to Proxmox VE 5.4, and then as soon as possible to Proxmox VE 6.x, as the 5.x release will go end of life at the end of July.
 
Thank you very much, t.lamprecht.
Do you know of bad interactions with multicast when two clusters are in the same physical network? (Different auth in corosync, but the same multicast IP.)
 
but the same multicast IP)

In Proxmox VE 5.4 the multicast IP is derived from the cluster name, and we do not support multiple clusters with the same name in the same network. So this should not happen.

We're running multiple clusters sharing the same network here; this is generally not a problem, neither with multicast and corosync from PVE 5.x nor with the unicast kronosnet which is used in PVE 6.x.
 
In Proxmox VE 5.4 the multicast IP is derived from the cluster name

I have the following configuration:
Cluster 1:
- node1 name: ABC05PVE01; multicast IPs: 239.192.104.3, 239.192.104.2
- node2 name: CDE03PVE01; multicast IPs: 239.192.104.3, 239.192.104.2
Cluster 2:
- node1 name: ABC01PVE02; multicast IPs: 239.192.104.4, 239.192.104.3
- node2 name: CDE05PVE2; multicast IPs: 239.192.104.4, 239.192.104.3

All nodes have different names and the IPs are not defined in corosync.conf, but the two clusters share one identical IP: 239.192.104.3

Do you think it's a bug in the automatic IP generation?
 
All nodes have different names

The node name is not relevant; as said, the cluster name is.

What does grep cluster_name /etc/pve/corosync.conf show on both clusters?

How did you get the multicast IP? With corosync-cmapctl -g totem.interface.0.mcastaddr?

Also note that shared multicast IPs do not necessarily have to be a problem.

And you probably have node stability issues when one node reboots because you only have two nodes per cluster. So, if one goes down, the other isn't quorate anymore.
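
To compare the two clusters, you could run both checks on one node of each cluster, for example (just the two commands from above combined):

grep cluster_name /etc/pve/corosync.conf          # the name the multicast address is derived from
corosync-cmapctl -g totem.interface.0.mcastaddr   # the multicast address corosync actually uses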
 
What does grep cluster_name /etc/pve/corosync.conf show on both clusters?

The cluster names in corosync.conf are "cluster1" and "cluster2".

How did you get the multicast IP?

With "ip maddr".
corosync-cmapctl gives the same IPs.

And you probably have node stability issues when one node reboots because you only have two nodes per cluster. So, if one goes down, the other isn't quorate anymore.

Is this a problem with the following configuration in corosync.conf?
quorum {
expected_votes: 1
provider: corosync_votequorum
two_node: 1
}
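
For checking the resulting vote and quorum state, something like this should work (a sketch - pvecm status prints the expected votes, total votes, and quorum flags):

# run on a node to see the current membership and quorum information
pvecm status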
 
Also, what exactly is your stability issue? What happens?

Sometimes when a node reboots, the other node in the same cluster is unstable: it reboots or VMs restart.
And I suspect a bad interaction between cluster1 and cluster2. I'm not sure, but it seems that a node reboot on cluster1 could affect a node in cluster2 (and vice versa).

two_node isn't supported by us

Are you saying that it's not possible to make a cluster of two nodes without a separate quorum host, and that the two_node option in corosync.conf is not supported by Proxmox? (I'm repeating what you said to be sure of what we are talking about.)

https://pve.proxmox.com/wiki/Two-Node_High_Availability_Cluster#System_requirements
 
And I suspect a bad interaction between cluster1 and cluster2. I'm not sure, but it seems that a node reboot on cluster1 could affect a node in cluster2 (and vice versa).

I'd first check the syslog/journal during that time to get some idea what's really happening...
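
For example, something like this pulls the journal of the cluster-related services (a sketch - adjust the time window to the actual incident):

# journal of corosync, the cluster filesystem and the HA services around the incident
journalctl -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm --since "2020-03-31 18:30" --until "2020-03-31 18:45"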

Are you saying that it's not possible to make a cluster of two nodes without a separate quorum host, and that the two_node option in corosync.conf is not supported by Proxmox? (I'm repeating what you said to be sure of what we are talking about.)

It is possible and often done, but while one node is offline, the other may not change cluster-related stuff.
VMs and CTs should keep working, at least as long as they're not HA-managed, but you cannot create a new VM/CT or change any VM/CT, storage, or user settings during that time.
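
If you really need to change something while the other node is down, the expected vote count can be lowered temporarily (a sketch - use with care, this deliberately overrides quorum and should only be done while the other node is really offline):

# tell the remaining node that a single vote is enough to be quorate
pvecm expected 1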
 
I'd first check the syslog/journal during that time to get some idea what's really happening...

I have only one example, and it's not the same behaviour every time.
In the log I can see the node changing state: changed from 'online' => 'unknown'

## Node2 rebooted
Mar 31 18:38:41 10.0.0.2 pve-ha-lrm[18589]: stopping service vm:104
Mar 31 18:39:58 10.0.0.2 pve-firewall[26174]: received signal TERM
Mar 31 18:39:58 10.0.0.2 pve-firewall[26174]: server closing
Mar 31 18:39:58 10.0.0.2 pve-firewall[26174]: clear firewall rules
Mar 31 18:39:58 10.0.0.2 pve-firewall[26174]: server stopped
Mar 31 18:39:58 10.0.0.2 pve-ha-lrm[26267]: received signal TERM
Mar 31 18:39:58 10.0.0.2 pve-ha-lrm[26267]: reboot LRM, stop and freeze all services
Mar 31 18:39:59 10.0.0.2 pvefw-logger[1085]: received terminate request (signal)
Mar 31 18:39:59 10.0.0.2 pvefw-logger[1085]: stopping pvefw logger
Mar 31 18:39:59 10.0.0.2 pveproxy[26244]: received signal TERM
Mar 31 18:39:59 10.0.0.2 pveproxy[26244]: server closing
Mar 31 18:39:59 10.0.0.2 pveproxy[1136]: worker exit
Mar 31 18:39:59 10.0.0.2 pveproxy[1138]: worker exit
Mar 31 18:39:59 10.0.0.2 pveproxy[26244]: worker 1138 finished
Mar 31 18:39:59 10.0.0.2 pveproxy[26244]: worker 1136 finished
Mar 31 18:39:59 10.0.0.2 pveproxy[26244]: worker 1137 finished
Mar 31 18:39:59 10.0.0.2 pveproxy[26244]: server stopped
Mar 31 18:40:00 10.0.0.2 pveproxy[30357]: worker exit
Mar 31 18:40:02 10.0.0.2 pve-ha-lrm[26267]: watchdog closed (disabled)
Mar 31 18:40:02 10.0.0.2 pve-ha-lrm[26267]: server stopped
Mar 31 18:40:03 10.0.0.2 pve-ha-crm[26226]: received signal TERM
Mar 31 18:40:03 10.0.0.2 pve-ha-crm[26226]: server received shutdown request
Mar 31 18:40:04 10.0.0.2 pve-ha-crm[26226]: server stopped


## Node1 impacted
Mar 31 18:38:38 10.0.0.1 pve-ha-crm[5347]: service 'vm:104': state changed from 'started' to 'request_stop'
Mar 31 18:39:08 10.0.0.1 pve-ha-crm[5347]: service 'vm:104': state changed from 'request_stop' to 'stopped'
Mar 31 18:39:18 10.0.0.1 pve-ha-crm[5347]: service 'vm:1001': state changed from 'started' to 'request_stop'
Mar 31 18:39:38 10.0.0.1 pve-ha-crm[5347]: service 'vm:1001': state changed from 'request_stop' to 'stopped'
Mar 31 18:39:47 10.0.0.1 pveproxy[15531]: worker exit
Mar 31 18:39:47 10.0.0.1 pveproxy[5365]: worker 15531 finished
Mar 31 18:39:47 10.0.0.1 pveproxy[5365]: starting 1 worker(s)
Mar 31 18:39:47 10.0.0.1 pveproxy[5365]: worker 29478 started
Mar 31 18:40:08 10.0.0.1 pve-ha-crm[5347]: node 'NODE01_PVE01': state changed from 'online' => 'unknown'
Mar 31 18:40:08 10.0.0.1 pve-ha-crm[5347]: service 'vm:1001': state changed from 'stopped' to 'freeze'
Mar 31 18:40:08 10.0.0.1 pve-ha-crm[5347]: service 'vm:104': state changed from 'stopped' to 'freeze'
Mar 31 18:40:41 10.0.0.1 pveproxy[13983]: proxy detected vanished client connection
 
OK, you're trying to use HA with two nodes - that cannot work and is really unsupported.

So, to clarify the answer to your previous question - are two-node clusters supported?
* with HA: no, you need a QDevice or more nodes (see the sketch below for a QDevice setup). https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#_requirements
* without HA: yes, with some limitations while one node is offline - these can be worked around manually though.
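
A minimal sketch for adding a QDevice as a third vote, assuming an external host at 10.0.0.10 (hypothetical address) and a recent enough PVE version - see the pvecm documentation for the full procedure:

# on the external host (not a cluster node)
apt install corosync-qnetd

# on all cluster nodes
apt install corosync-qdevice

# on one cluster node: register the external host as QDevice
pvecm qdevice setup 10.0.0.10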
 
