corosync show link flapping (down/up) about every 3-4 minutes, but switch shows no problem

skraw

Well-Known Member
Aug 13, 2019
Hello all,

I have recently been experiencing a problem with corosync reporting link flapping, but the reports appear to be spurious: neither the corresponding switch nor the kernels of the boxes (a 3-node cluster) show any link problem. I use 10G fiber main links and 1G copper backup links; the flapping is reported on the copper links.
Is this some kind of timing problem within corosync?
example log:
Apr 01 17:13:13 pm-248 corosync[2090]: [KNET ] rx: host: 2 link: 1 is up
Apr 01 17:13:13 pm-248 corosync[2090]: [KNET ] link: Resetting MTU for link 1 because host 2 joined
Apr 01 17:13:13 pm-248 corosync[2090]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Apr 01 17:13:13 pm-248 corosync[2090]: [KNET ] pmtud: Global data MTU changed to: 1397
Apr 01 17:15:26 pm-248 corosync[2090]: [KNET ] link: host: 2 link: 1 is down
Apr 01 17:15:26 pm-248 corosync[2090]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Apr 01 17:15:28 pm-248 corosync[2090]: [KNET ] rx: host: 2 link: 1 is up
Apr 01 17:15:28 pm-248 corosync[2090]: [KNET ] link: Resetting MTU for link 1 because host 2 joined
Apr 01 17:15:28 pm-248 corosync[2090]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Apr 01 17:15:28 pm-248 corosync[2090]: [KNET ] pmtud: Global data MTU changed to: 1397
 
Any chance that you have an IP-address conflict? Triple-check this detail...
 
Is the 1 Gbit link dedicated to corosync? Link saturation could cause flapping like this.
No, the fiber link is dedicated; the copper link is also used for other purposes. I checked that in detail and found that the latest kernel networking is not as robust as one might think, after all these years. If there is NFS traffic (heavy NFS traffic, admittedly) on the link, ICMP packets are lost (below 1%). I don't know how corosync checks such a link; my suspicion is that it checks the interface statistics for drops and misses. Does anybody know the facts?
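As background (this is how knet behaves in general, not something specific to this setup): knet judges link health from its own heartbeat pings over each link, not from interface drop counters. A link is pinged periodically and marked down when no pong arrives within a timeout, and both values can be tuned per link in corosync.conf. A sketch with illustrative values, not tuning advice:

```
# corosync.conf fragment (illustrative values, not tuning advice)
totem {
  interface {
    linknumber: 1              # the 1G copper backup link
    knet_ping_interval: 750    # ms between heartbeat pings on this link
    knet_ping_timeout: 1500    # ms without a pong before "link ... is down"
  }
}
```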
 
Basically, if you have network saturation, latency increases and corosync removes the node from the cluster. (Down/up does not mean the physical link is flapping; it simply means the node whose log you are reading did not get a response from the remote node fast enough, so it shows the link as down and then up again.)
As it seems to happen at short intervals, it could be brief saturation spikes. (You should use Grafana or another tool that monitors traffic every second to be sure.)
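As a rough stand-in for a full monitoring stack, per-second throughput can be sampled straight from the kernel's byte counters; a minimal POSIX-shell sketch (the interface name lo is a placeholder for your corosync link):

```shell
# Sketch: one-second RX throughput sample from kernel byte counters.
# The interface name is a placeholder -- substitute your corosync link.

rx_bytes() {
    cat "/sys/class/net/$1/statistics/rx_bytes"
}

# One-second delta of the RX counter, converted to integer Mbit/s.
sample_rate_mbps() {
    iface="$1"
    prev=$(rx_bytes "$iface")
    sleep 1
    cur=$(rx_bytes "$iface")
    echo $(( (cur - prev) * 8 / 1000000 ))
}

sample_rate_mbps lo
```

Run in a loop: even brief readings near 1000 Mbit/s on a 1 Gbit link would explain drops that a per-minute average hides.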
 
That is not the complete truth. Look at this:

--- 192.168.192.250 ping statistics ---
14000 packets transmitted, 13862 received, 0.985714% packet loss, time 14133836ms
rtt min/avg/max/mdev = 0.103/1.065/3.615/1.127 ms

This is from quite a long-running ping during NFS load. If latency were really rising sharply, the max should be a lot higher than 3.6 ms. My feeling is that the kernel instead completely fills the interface queue with one user's data (NFS) and drops other users' packets when there is no room left in the queue. I suspect there would not be much throughput deficit if it filled only half the buffers with one user's data and kept the rest for others, so it would not need to drop a user sending one small packet per second.
And if you really think about the situation: it could well be that you are just transferring some NFS-based drive from one server to another within Proxmox. Would you expect corosync to lose its connection to the other nodes in such a case, only because the ongoing action saturates the network?
Because that would mean you could only safely use Proxmox if your network has more bandwidth than your local disks...
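Incidentally, that loss percentage can be extracted from ping's summary line for scripted checks; a small sketch assuming the standard iputils output format:

```shell
# Sketch: pull the packet-loss percentage out of ping's summary line
# (assumes the standard iputils summary format).

loss_pct() {
    grep -o '[0-9.]*% packet loss' | cut -d'%' -f1
}

echo '14000 packets transmitted, 13862 received, 0.985714% packet loss, time 14133836ms' | loss_pct
# prints 0.985714
```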
 
I wanted to test the problem situation with the BBR congestion-control variant, but I found that the kernel delivered with Proxmox does not ship this congestion protocol. Why is this?
 
Did you try 7.0 kernel?

sudo modprobe tcp_bbr
sysctl net.ipv4.tcp_available_congestion_control

On 7.0 I get:
net.ipv4.tcp_available_congestion_control = reno cubic bbr
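If tcp_bbr loads, it can be made persistent with a sysctl fragment (the file name below is arbitrary); fq is the qdisc usually recommended alongside BBR:

```
# /etc/sysctl.d/90-bbr.conf (sketch; file name is arbitrary)
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
```

Keep in mind this only governs TCP flows such as NFS; corosync's knet heartbeats are UDP and are unaffected by the TCP congestion-control choice.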
 
This is from quite a long-running ping during NFS load. If latency were really rising sharply, the max should be a lot higher than 3.6 ms.
but you have packet loss, which is even worse. (You can look at the corosync stats too.)
And if you really think about the situation. it could well be that you are just transferring some nfs-based drive from one server to another within proxmox. would you expect corosync to loose connection to the other nodes in such a case, only because the ongoing action saturates the network?
yes, definitely
Because this would mean you can only safely use proxmox if you have a network situation with higher bandwidth than your local disks...
the recommendation is to have dedicated links for corosync
https://pve.proxmox.com/pve-docs/chapter-pvecm.html
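The linked chapter covers redundant links; a minimal sketch of a two-link layout where knet prefers the dedicated fiber (all addresses, IDs and priorities below are illustrative):

```
# corosync.conf fragment (illustrative addresses, IDs and priorities)
totem {
  interface {
    linknumber: 0
    knet_link_priority: 10   # dedicated 10G fiber, preferred in passive mode
  }
  interface {
    linknumber: 1
    knet_link_priority: 1    # shared 1G copper, backup only
  }
}
nodelist {
  node {
    name: pm-248
    nodeid: 1                      # illustrative
    ring0_addr: 10.0.0.248         # fiber network (example address)
    ring1_addr: 192.168.192.248    # copper network (example address)
  }
}
```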





but note that recent corosync versions support DSCP marking:
https://github.com/corosync/corosync/commit/5678836caf7ff21bf0abe81fe61b092f89528665

so it could be possible to do traffic prioritization on your switch if it supports that
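Alternatively, the knet UDP traffic could be DSCP-marked on the hosts themselves so the switch can prioritize it; an nftables sketch, assuming the default knet port 5405 and class CS6:

```
# nftables fragment (sketch: default knet port 5405 and DSCP class CS6 assumed)
table inet qos {
    chain output {
        type filter hook output priority mangle; policy accept;
        udp dport 5405 ip dscp set cs6
    }
}
```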
 
Just to make that clear again: the network link is not saturated. Monitoring shows an average of around 400 Mbit/s on a Gbit interface, which is quite far from a bandwidth problem. So the real question here is: why are packets lost at all? And still: how does corosync really detect the problem? Should I imagine that some of its UDP packets are lost, too?
The corosync docs say latency around 10 ms starts to get problematic, but I am nowhere near that either. I really think this is more of a kernel/config problem. I can see no reason for packet drops here.
The switch is the last thing on the list to blame for this; it is not a matter of priority at all...
 
Monitoring shows an average of around 400 Mbit/s on a Gbit interface.
Do you have granular monitoring (Prometheus, ...) that can check bandwidth every second? Don't trust averages.

Have you also checked your switch port buffer stats?


If it were a kernel bug, you would see it on your 10 Gbit NIC too.
I'm running 100 nodes in production without any problems.

I have already seen exactly this behaviour with different customers, and it was always network spike saturation.