corosync: OOM-killed after some days at 100% CPU, cluster nodes isolated ("Quorate")

Hi,

I have an almost idle test cluster with 12 nodes (pve-7.4). After an uptime of a year or so without issues, last week I noticed in the web GUI that only the local node's information showed up. I logged in and saw on three randomly picked nodes that corosync was at 100% CPU, apparently busy-looping over some network error; the log showed many "Retransmit List:" entries. Today I planned to take a closer look. The load is gone, because the corosync processes have been out-of-memory-killed, and the nodes report "Flags: Quorate".
Code:
ansible all -m shell -a "systemctl status corosync.service" | grep Active

     Active: inactive (dead)
     Active: active (running) since Fri 2023-09-22 07:08:15 CEST; 1 years 3 months ago
     Active: failed (Result: oom-kill) since Thu 2025-01-16 18:02:25 CET; 3 days ago
     Active: active (running) since Mon 2023-08-14 19:45:28 CEST; 1 years 5 months ago
     Active: active (running) since Mon 2023-08-14 19:45:25 CEST; 1 years 5 months ago
     Active: failed (Result: oom-kill) since Thu 2025-01-16 16:04:09 CET; 3 days ago
     Active: failed (Result: oom-kill) since Thu 2025-01-16 17:56:14 CET; 3 days ago
     Active: active (running) since Thu 2025-01-16 16:15:10 CET; 3 days ago
     Active: failed (Result: oom-kill) since Thu 2025-01-16 17:47:49 CET; 3 days ago
     Active: active (running) since Fri 2023-09-22 07:08:23 CEST; 1 years 3 months ago
     Active: active (running) since Tue 2023-11-21 19:44:39 CET; 1 years 1 months ago
     Active: active (running) since Fri 2023-09-22 07:08:23 CEST; 1 years 3 months ago
     Active: active (running) since Fri 2023-09-22 07:08:08 CEST; 1 years 3 months ago
     Active: active (running) since Fri 2023-09-22 07:08:08 CEST; 1 years 3 months ago
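For reference, something like this should show what each node itself thinks about the quorum (just a sketch, reusing the same ansible inventory as above; corosync-quorumtool will of course fail on the nodes where corosync is dead):
Code:
# ask every node how it currently sees the quorum
ansible all -m shell -a "corosync-quorumtool -s" | grep -E "Quorate|Total votes|Expected votes"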
I have no clue what the best next steps would be.
I think I will start by upgrading to the latest PVE version and consider building corosync with debug symbols so I can debug it, but I'm walking in the dark...
Any hints or recommendations?
 
well, the corosync logs might shed some light.. most of our packages come with matching -dbgsym counterparts btw, but I doubt debugging with gdb is the way to go here before taking an in-depth look at the logs..
 
Hi, thanks for your quick reply!
I see I had already installed corosync-dbgsym back in 2023.
According to the log, the issues start with corosync[1321]: [KNET ] link: host: 101 link: 0 is down for each node, but I think this could be a consequence of the high load (interactive testing with ping and the like shows the connections are fine). Maybe some temporary problem causes an error, and as a follow-up error corosync ends up at 100% CPU. Without fixing the 100% CPU issue, the latency could become so large that the real cause stays invisible.
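To narrow this down, something like the following should pull out just the KNET/TOTEM messages around the suspected start of the problem (the timestamps are only an example, taken from the day of the OOM kills):
Code:
# collect the interesting corosync messages around the incident for later comparison
journalctl -u corosync --since "2025-01-16 15:00" --until "2025-01-16 19:00" \
    | grep -E "KNET|TOTEM|Retransmit" > corosync-incident.log
wc -l corosync-incident.log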

It is just a test cluster, so I'll leave it running in the hope that it happens again and that I notice it, so I can provide further details.
I was able to build corosync and libqb from source, install and run them; let's see if it happens again.

Best regards

NB: sources suggest that the root cause could be network issues, but a flood ping is stable and SSH stays fast even then:
Code:
root@l-101:~# ping -s 1500 -f l-102
PING l-102 (10.1.1.102) 1500(1528) bytes of data.
.^C
--- l-102 ping statistics ---
4423004 packets transmitted, 4423003 received, 2.26091e-05% packet loss, time 810948ms
rtt min/avg/max/mdev = 0.117/0.171/3.755/0.019 ms, pipe 2, ipg/ewma 0.183/0.181 ms
(IP and hostname are placeholders)
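A flood ping only exercises ICMP, of course; to test UDP more like corosync uses it, omping might be closer to the real thing (assuming the package is installed and the command is started on all listed nodes at roughly the same time, as the PVE docs suggest):
Code:
# ~10 minutes of 1 packet/s between the listed nodes, reports loss and latency
omping -c 600 -i 1 -q l-101 l-102 l-103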
 
it is possible that you are running into some old, already fixed bug.. note that corosync is very latency sensitive, links being marked down can also mean that they struggle to keep up or lose single packets. could you maybe describe the network setup/corosync configuration/cluster size?
 
it is possible that you are running into some old, already fixed bug..
Ahh that would be perfect, just installing updates and being done :)

But I think I updated everything:

Code:
pve-manager/8.3.3/f157a38b211595d6 (running kernel: 6.8.12-7-pve)

root@l-103:/home/sdettmer/work/corosync/corosync-pve# dpkg -l "coro*" libqb* *knet*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                  Version      Architecture Description
+++-=====================-============-============-=============================================================
ii  corosync              3.1.7-pve3   amd64        cluster engine daemon and utilities
ii  corosync-dbgsym       3.1.7-pve3   amd64        debug symbols for corosync
ii  libknet1:amd64        1.28-pve1    amd64        kronosnet core switching implementation
ii  libqb-dev:amd64       2.0.3-1      amd64        high performance client server features library (devel files)
ii  libqb100:amd64        2.0.3-1      amd64        high performance client server features library
ii  libqb100-dbgsym:amd64 2.0.3-1      amd64        debug symbols for libqb100

NB: I built corosync and libqb from source, but the packages have the same versions, so I hope it's right.
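One way to double-check that the self-built binary is really the one running (and not the packaged one) would be something like:
Code:
# which executable is the running corosync, and which version does it report?
readlink /proc/$(pidof corosync)/exe
corosync -v | head -n 1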

note that corosync is very latency sensitive,
Yes, it is quite.... interesting. For example, timers are dynamically re-created all the time, so even cyclic calls are non-trivial to spot. I did some interactive debugging (with latencies in the range of minutes) and it behaved quite well!

Code:
Jan 22 13:00:48 l-103 corosync[41243]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 22 13:00:48 l-103 corosync[41243]:   [KNET  ] host: host: 1 has no active links
Jan 22 13:00:55 l-103 corosync[41243]:   [TOTEM ] Process pause detected for 6857 ms, flushing membership messages.
Jan 22 13:00:55 l-103 corosync[41243]:   [MAIN  ] Corosync main process was not scheduled (@1737547255253) for 16256.2598 ms (threshold is 7600.0000 ms). Consider token timeout increase.
Jan 22 13:00:58 l-103 corosync[41243]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Jan 22 13:00:58 l-103 corosync[41243]:   [KNET  ] link: Resetting MTU for link 0 because host 112 joined
Jan 22 13:03:11 l-103 corosync[41243]:   [MAIN  ] Corosync main process was not scheduled (@1737547391528) for 136274.9062 ms (threshold is 7600.0000 ms). Consider token timeout increase.
Jan 22 13:03:11 l-103 corosync[41243]:   [KNET  ] link: host: 112 link: 0 is down
Jan 22 13:03:11 l-103 corosync[41243]:   [KNET  ] link: host: 1 link: 0 is down
Jan 22 13:03:11 l-103 corosync[41243]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jan 22 13:03:11 l-103 corosync[41243]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 22 13:03:11 l-103 corosync[41243]:   [KNET  ] host: host: 1 has no active links
Jan 22 13:03:11 l-103 corosync[41243]:   [KNET  ] host: host: 112 (passive) best link: 0 (pri: 1)
Jan 22 13:03:11 l-103 corosync[41243]:   [KNET  ] host: host: 112 has no active links
Jan 22 13:03:11 l-103 corosync[41243]:   [KNET  ] host: host: 112 (passive) best link: 0 (pri: 1)
Jan 22 13:03:11 l-103 corosync[41243]:   [KNET  ] host: host: 112 has no active links
Jan 22 13:03:11 l-103 corosync[41243]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 22 13:03:11 l-103 corosync[41243]:   [KNET  ] host: host: 1 has no active links
Jan 22 13:03:12 l-103 corosync[41243]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Jan 22 13:03:12 l-103 corosync[41243]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 22 13:03:12 l-103 corosync[41243]:   [KNET  ] link: Resetting MTU for link 0 because host 107 joined
Jan 22 13:03:12 l-103 corosync[41243]:   [KNET  ] host: host: 107 (passive) best link: 0 (pri: 1)
Jan 22 13:03:12 l-103 corosync[41243]:   [KNET  ] link: Resetting MTU for link 0 because host 104 joined
Jan 22 13:03:12 l-103 corosync[41243]:   [KNET  ] host: host: 104 (passive) best link: 0 (pri: 1)
Jan 22 13:03:12 l-103 corosync[41243]:   [KNET  ] link: Resetting MTU for link 0 because host 111 joined
Jan 22 13:03:12 l-103 corosync[41243]:   [KNET  ] host: host: 111 (passive) best link: 0 (pri: 1)
Jan 22 13:03:12 l-103 corosync[41243]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jan 22 13:03:13 l-103 corosync[41243]:   [KNET  ] link: Resetting MTU for link 0 because host 102 joined
Jan 22 13:03:13 l-103 corosync[41243]:   [KNET  ] host: host: 102 (passive) best link: 0 (pri: 1)
Jan 22 13:03:13 l-103 corosync[41243]:   [KNET  ] link: Resetting MTU for link 0 because host 105 joined
Jan 22 13:03:13 l-103 corosync[41243]:   [KNET  ] link: Resetting MTU for link 0 because host 112 joined
Jan 22 13:03:13 l-103 corosync[41243]:   [KNET  ] host: host: 105 (passive) best link: 0 (pri: 1)
Jan 22 13:03:13 l-103 corosync[41243]:   [KNET  ] host: host: 112 (passive) best link: 0 (pri: 1)
Jan 22 13:03:13 l-103 corosync[41243]:   [KNET  ] link: Resetting MTU for link 0 because host 106 joined
Jan 22 13:03:13 l-103 corosync[41243]:   [KNET  ] host: host: 106 (passive) best link: 0 (pri: 1)
It's difficult to even check, because every node could have a different understanding of the quorum...
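Regarding the "Consider token timeout increase" hint from the log: as far as I understand it, that would just be the token value in the totem section of /etc/pve/corosync.conf (with config_version bumped so the change gets distributed); the value below is only a guess for illustration:
Code:
totem {
  # keep the existing settings; only add/raise the token timeout
  # (milliseconds; 10000 is just an example value)
  token: 10000
}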

Also some things are surprising:
Code:
Jan 22 13:03:13 l-103 corosync[41243]:   [TOTEM ] A new membership (67.10f35) was formed. Members
Jan 22 13:03:13 l-103 corosync[41243]:   [QUORUM] Members[1]: 103
Jan 22 13:03:13 l-103 corosync[41243]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 22 13:03:13 l-103 corosync[41243]:   [QUORUM] Sync members[1]: 103
Jan 22 13:03:13 l-103 corosync[41243]:   [TOTEM ] A new membership (67.10f39) was formed. Members
Jan 22 13:03:13 l-103 corosync[41243]:   [QUORUM] Members[1]: 103
Jan 22 13:03:13 l-103 corosync[41243]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 22 13:03:13 l-103 corosync[41243]:   [QUORUM] Sync members[1]: 103
In total this "loops" many times, roughly 1000 lines within the same second.


links being marked down can also mean that they struggle to keep up or lose single packets. could you maybe describe the network setup/corosync configuration/cluster size?
I think this "link down" does not refer to ethernet links, but to KNET UDP links ("connections" would be slighly better).
When I stop corosync service on host 110, I get:

Code:
Jan 22 15:28:23 l-103 corosync[494507]:   [KNET  ] udp: Received ICMP error from 10.1.1.110: Connection refused 10.1.1.110
Jan 22 15:28:23 l-103 corosync[494507]:   [KNET  ] udp: Setting down host 110 link 0
Jan 22 15:28:23 l-103 corosync[494507]:   [KNET  ] link: host: 110 link: 0 is down
Jan 22 15:28:23 l-103 corosync[494507]:   [KNET  ] host: host: 110 (passive) best link: 0 (pri: 1)
Jan 22 15:28:23 l-103 corosync[494507]:   [KNET  ] host: host: 110 has no active links
and when starting it again:
Code:
[KNET ] pmtud: Starting PMTUD for host: 110 link: 0
[KNET ] udp: detected kernel MTU: 1500
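For a quick view of how knet itself currently rates the links (independent of what the Ethernet layer says), corosync-cfgtool should show the per-host link status, something like:
Code:
# local node id plus connected/disconnected state of each knet link to the other hosts
corosync-cfgtool -s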



I have top -d 0.2 and ping running in separate terminals, and even if I "load" the network a bit with a flood ping, things work well:

Code:
root@l-103:/home/sdettmer/work/corosync/corosync-pve# ping -f l-110
PING l-110.x.net (10.1.1.110) 56(84) bytes of data.
.^C
--- l-110.x.net ping statistics ---
285355 packets transmitted, 285354 received, 0.000350441% packet loss, time 33961ms
rtt min/avg/max/mdev = 0.081/0.112/0.733/0.012 ms, ipg/ewma 0.119/0.113 ms
(This ran while I stopped corosync on 110, waited maybe 20 seconds, and restarted it.)

Maybe I should just leave the debugger attached, because then it works lol :) just kidding
 
it is possible that you are running into some old, already fixed bug.. note that corosync is very latency sensitive, links being marked down can also mean that they struggle to keep up or lose single packets. could you maybe describe the network setup/corosync configuration/cluster size?
Interestingly this cluster has 12 nodes; in an earlier thread I wrote that with more than 10 nodes it was unstable, and then found it working with 12.

I don't think I noticed 100% CPU back then, but it seems I have a related issue now. I plotted two hours of corosync memory-usage data, but it looks stable (i.e. it does not explain the OOM kill).
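One simple way to collect such memory data (a crude sketch; interval and log file are arbitrary) is to periodically sample the RSS of the corosync process:
Code:
# append a timestamped RSS sample (in kB) for corosync every 60 seconds
while sleep 60; do
    echo "$(date -Is) $(ps -o rss= -p "$(pidof corosync)")" >> /var/log/corosync-rss.log
done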

The nodes are essentially small devices (i5-10400H CPU @ 2.60GHz, 8 cores, 32 GB RAM, 1 TB SSD, I219-LM NIC), mostly idle, connected directly to a switch (some Cisco enterprise-grade model, I think) over 1 GbE (a single NIC for all traffic). Some nodes run 1-2 idle Devuan Linux VMs (CPU < 10%).

Any suggestions what to test / try / do are welcome.
 