Corosync network interaction

jdurand
Jun 3, 2020
Hello,

Like many other users, I have a problem with node instability when another node reboots: https://forum.proxmox.com/threads/all-servers-from-cluster-reboot-after-one-server-reboots.37278/

I'm using Proxmox 5.3.5 and corosync with auth and multicast. (I have read the multicast notes about latency, snooping, and querier problems.)
But I can't understand what network traffic could interact badly with corosync.
Message #10 posted by Alwin:
Based on my assumption, if that is the case, then what you are seeing is possibly caused by the recovery of Ceph (+ other traffic) while one node reboots, and this interferes with corosync's traffic.

What traffic can interfere with corosync?
 
What traffic can interfere with corosync?

Traffic that uses some bandwidth constantly. This can be storage traffic, especially the Ceph private network, as it acts as the backbone for mirroring objects and rebalancing data. VM migrations may also have some impact, though they can be rate-limited. Those patterns are often not directly a problem due to their bandwidth use alone - corosync actually doesn't require that much bandwidth. They become a problem because they increase latency: with the network stack constantly chattering, new packets may need a few milliseconds more to get from being sent by an application (e.g., corosync) until they really hit the network and their receivers. This is an issue because corosync uses a consensus-forming (quorum) algorithm which depends on packets being transmitted in under a certain time - ideally < 2-5 ms, with a maximum of about 8-9 ms latency.
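
To give one example of the rate limiting mentioned above (just a sketch - the bwlimit option and its exact syntax depend on the PVE version, see man datacenter.cfg): a cluster-wide cap for migration traffic can be set in /etc/pve/datacenter.cfg, the value is in KiB/s.

# /etc/pve/datacenter.cfg - sketch: cap migration traffic at roughly 100 MiB/s
bwlimit: migration=102400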

So, in general, the most important thing is that latency spikes are avoided at all cost. A rather straightforward way to guarantee that is to give the main corosync link its own physical network.
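
A quick way to get a feeling for the latency on the corosync link is a longer ping run, looking at the max/mdev values in the summary line (just a sketch, assuming 10.0.0.2 is the corosync address of the other node):

# 300 probes, 5 per second; the summary line shows min/avg/max/mdev round-trip times
ping -c 300 -i 0.2 10.0.0.2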

I'm using Proxmox 5.3.5

Note here that I highly recommend upgrading to Proxmox VE 5.4, and then as soon as possible to Proxmox VE 6.x, as the 5.x release will go end of life at the end of July.
 
Thank you very much, t.lamprecht.
Do you know of bad interactions with multicast when two clusters are in the same physical network? (Different auth in corosync, but the same multicast IP.)
 
but the same multicast IP)

In Proxmox VE 5.4 the multicast IP is derived from the cluster name, and we do not support multiple clusters with the same name in the same network. So this should not happen.

We're running multiple clusters sharing the same network here; this is generally not a problem, neither with multicast and corosync from PVE 5.x nor with the unicast kronosnet which is used in PVE 6.x.
 
In Proxmox VE 5.4 the multicast IP is derived from the cluster name

I have the following configuration:
Cluster 1:
- node1 name: ABC05PVE01; multicast IPs: 239.192.104.3, 239.192.104.2
- node2 name: CDE03PVE01; multicast IPs: 239.192.104.3, 239.192.104.2
Cluster 2:
- node1 name: ABC01PVE02; multicast IPs: 239.192.104.4, 239.192.104.3
- node2 name: CDE05PVE2; multicast IPs: 239.192.104.4, 239.192.104.3

All nodes have different names and the IPs are not defined in corosync.conf, but the two clusters share one identical IP: 239.192.104.3

Do you think it's a bug in the automatic IP generation?
 
All nodes have different names

The node name is not relevant; as said, the cluster name is.

What does grep cluster_name /etc/pve/corosync.conf show on both clusters?

How did you get the multicast IP? With corosync-cmapctl -g totem.interface.0.mcastaddr?

Also note that shared multicast IPs do not necessarily have to be a problem.

And you probably have node stability issues when one node reboots because you only have two nodes per cluster. So, if one goes down, the other isn't quorate anymore.
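
To compare the two clusters, you could run both checks on one node of each cluster, for example (just the two commands from above combined):

grep cluster_name /etc/pve/corosync.conf          # the name the multicast address is derived from
corosync-cmapctl -g totem.interface.0.mcastaddr   # the multicast address corosync actually uses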
 
What does grep cluster_name /etc/pve/corosync.conf show on both clusters?

The cluster names in corosync.conf are "cluster1" and "cluster2".

How did you get the multicast IP?

With "ip maddr".
corosync-cmapctl gives the same IPs.

And you probably have node stability issues when one node reboots because you only have two nodes per cluster. So, if one goes down, the other isn't quorate anymore.

Is this a problem with the following configuration in corosync.conf?
quorum {
expected_votes: 1
provider: corosync_votequorum
two_node: 1
}
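
For checking the resulting vote and quorum state, something like this should work (a sketch - pvecm status prints the expected votes, total votes, and quorum flags):

# run on a node to see the current membership and quorum information
pvecm status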
 
Also, what exactly is your stability issue? What happens?

Sometimes when a node reboots, the other node in the same cluster is unstable: it reboots or VMs restart.
And I suspect a bad interaction between cluster1 and cluster2. I'm not sure, but it seems that a node reboot on cluster1 could affect a node in cluster2 (and vice versa).

two_node isn't supported by us

Are you saying that it's not possible to make a cluster of two nodes without a separate quorum host, and that the two_node option in corosync.conf is not supported by Proxmox? (I'm repeating what you said to be sure of what we are talking about.)

https://pve.proxmox.com/wiki/Two-Node_High_Availability_Cluster#System_requirements
 
And I suspect a bad interaction between cluster1 and cluster2. I'm not sure, but it seems that a node reboot on cluster1 could affect a node in cluster2 (and vice versa).

I'd first check the syslog/journal during that time to get some idea what's really happening...
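
For example, something like this pulls the journal of the cluster-related services (a sketch - adjust the time window to the actual incident):

# journal of corosync, the cluster filesystem and the HA services around the incident
journalctl -u corosync -u pve-cluster -u pve-ha-crm -u pve-ha-lrm --since "2020-03-31 18:30" --until "2020-03-31 18:45"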

Are you saying that it's not possible to make a cluster of two nodes without a separate quorum host, and that the two_node option in corosync.conf is not supported by Proxmox? (I'm repeating what you said to be sure of what we are talking about.)

It is possible and often done, but while one node is offline, the other may not change cluster-related stuff.
VMs and CTs should keep working, at least as long as they're not HA-managed, but you cannot create a new VM/CT or change any VM/CT, storage, or user settings during that time.
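
If you really need to change something while the other node is down, the expected vote count can be lowered temporarily (a sketch - use with care, this deliberately overrides quorum and should only be done while the other node is really offline):

# tell the remaining node that a single vote is enough to be quorate
pvecm expected 1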
 
I'd first check the syslog/journal during that time to get some idea what's really happening...

I have only one example, and it's not the same behaviour every time.
In the log I can see the node changing state: changed from 'online' => 'unknown'

## Node2 rebooted
Mar 31 18:38:41 10.0.0.2 pve-ha-lrm[18589]: stopping service vm:104
Mar 31 18:39:58 10.0.0.2 pve-firewall[26174]: received signal TERM
Mar 31 18:39:58 10.0.0.2 pve-firewall[26174]: server closing
Mar 31 18:39:58 10.0.0.2 pve-firewall[26174]: clear firewall rules
Mar 31 18:39:58 10.0.0.2 pve-firewall[26174]: server stopped
Mar 31 18:39:58 10.0.0.2 pve-ha-lrm[26267]: received signal TERM
Mar 31 18:39:58 10.0.0.2 pve-ha-lrm[26267]: reboot LRM, stop and freeze all services
Mar 31 18:39:59 10.0.0.2 pvefw-logger[1085]: received terminate request (signal)
Mar 31 18:39:59 10.0.0.2 pvefw-logger[1085]: stopping pvefw logger
Mar 31 18:39:59 10.0.0.2 pveproxy[26244]: received signal TERM
Mar 31 18:39:59 10.0.0.2 pveproxy[26244]: server closing
Mar 31 18:39:59 10.0.0.2 pveproxy[1136]: worker exit
Mar 31 18:39:59 10.0.0.2 pveproxy[1138]: worker exit
Mar 31 18:39:59 10.0.0.2 pveproxy[26244]: worker 1138 finished
Mar 31 18:39:59 10.0.0.2 pveproxy[26244]: worker 1136 finished
Mar 31 18:39:59 10.0.0.2 pveproxy[26244]: worker 1137 finished
Mar 31 18:39:59 10.0.0.2 pveproxy[26244]: server stopped
Mar 31 18:40:00 10.0.0.2 pveproxy[30357]: worker exit
Mar 31 18:40:02 10.0.0.2 pve-ha-lrm[26267]: watchdog closed (disabled)
Mar 31 18:40:02 10.0.0.2 pve-ha-lrm[26267]: server stopped
Mar 31 18:40:03 10.0.0.2 pve-ha-crm[26226]: received signal TERM
Mar 31 18:40:03 10.0.0.2 pve-ha-crm[26226]: server received shutdown request
Mar 31 18:40:04 10.0.0.2 pve-ha-crm[26226]: server stopped


## Node1 impacted
Mar 31 18:38:38 10.0.0.1 pve-ha-crm[5347]: service 'vm:104': state changed from 'started' to 'request_stop'
Mar 31 18:39:08 10.0.0.1 pve-ha-crm[5347]: service 'vm:104': state changed from 'request_stop' to 'stopped'
Mar 31 18:39:18 10.0.0.1 pve-ha-crm[5347]: service 'vm:1001': state changed from 'started' to 'request_stop'
Mar 31 18:39:38 10.0.0.1 pve-ha-crm[5347]: service 'vm:1001': state changed from 'request_stop' to 'stopped'
Mar 31 18:39:47 10.0.0.1 pveproxy[15531]: worker exit
Mar 31 18:39:47 10.0.0.1 pveproxy[5365]: worker 15531 finished
Mar 31 18:39:47 10.0.0.1 pveproxy[5365]: starting 1 worker(s)
Mar 31 18:39:47 10.0.0.1 pveproxy[5365]: worker 29478 started
Mar 31 18:40:08 10.0.0.1 pve-ha-crm[5347]: node 'NODE01_PVE01': state changed from 'online' => 'unknown'
Mar 31 18:40:08 10.0.0.1 pve-ha-crm[5347]: service 'vm:1001': state changed from 'stopped' to 'freeze'
Mar 31 18:40:08 10.0.0.1 pve-ha-crm[5347]: service 'vm:104': state changed from 'stopped' to 'freeze'
Mar 31 18:40:41 10.0.0.1 pveproxy[13983]: proxy detected vanished client connection
 
OK, you're trying to use HA with two nodes - that cannot work and is really unsupported.

So, to clarify the answer to your previous question - are two-node clusters supported?
* with HA: no, you need a QDevice or more nodes (see the sketch below for a QDevice setup). https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#_requirements
* without HA: yes, with some limitations while one node is offline - these can be worked around manually though.
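
A minimal sketch for adding a QDevice as a third vote, assuming an external host at 10.0.0.10 (hypothetical address) and a recent enough PVE version - see the pvecm documentation for the full procedure:

# on the external host (not a cluster node)
apt install corosync-qnetd

# on all cluster nodes
apt install corosync-qdevice

# on one cluster node: register the external host as QDevice
pvecm qdevice setup 10.0.0.10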
 
