Corosync very high CPU and Memory usage on v 6.4-13

Bidi

Hello Guys,

In the last month I started having problems with the servers and didn't know why, like losing the cluster and having to restart pve-cluster and corosync. This was happening weekly, and I just ignored the cause, made it work again and nothing else.

Today I started checking this problem because I have it again now. On all servers I saw massive memory usage; the one with 64GB was completely full.

I saw the corosync process (in htop) using 300-400% CPU and 30% memory :| so I restarted corosync, and RAM usage dropped from 62GB down to 30GB after the restart.

I went through all the servers one by one restarting corosync, and on every server about 30GB of RAM was freed just after restarting corosync (systemctl restart corosync).
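For anyone who wants to check the same thing, something like this shows corosync's memory and CPU before and after a restart (just an illustration, the exact numbers will differ):

Code:
# resident memory (RSS, in KB) and CPU of the corosync process
ps -C corosync -o pid,rss,pcpu,etime,cmd

# restart corosync on this node only, then check again
systemctl restart corosync
ps -C corosync -o pid,rss,pcpu,etime,cmd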

Does anyone else have this issue?

I never had this problem before I upgraded to 6.4-13.

I have 8 servers in my cluster, all with the same issue.
 

Attachments: USAGE.png (101.4 KB)
anything out of the ordinary visible in the corosync logs?
 
by default corosync logs to syslog/journal (journalctl -u corosync --since ... --until ...)
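for example, to look at the time window around an incident (timestamps here are only placeholders, adjust them to your case):

Code:
# placeholder window covering the night of the incident
journalctl -u corosync --since "2022-01-16 22:00:00" --until "2022-01-17 09:00:00"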
 
I have plenty of this:

[TOTEM ] Retransmit List:................

And this is from last night, when it lost the cluster in the same way:


Jan 17 00:33:09 d3 corosync[20686]: [TOTEM ] Process pause detected for 6884 ms, flushing membership messages.
Jan 17 00:33:09 d3 corosync[20686]: [TOTEM ] Process pause detected for 6933 ms, flushing membership messages.
Jan 17 00:33:09 d3 corosync[20686]: [TOTEM ] Process pause detected for 6933 ms, flushing membership messages.
Jan 17 00:33:09 d3 corosync[20686]: [TOTEM ] Process pause detected for 6934 ms, flushing membership messages.
Jan 17 00:33:09 d3 corosync[20686]: [TOTEM ] Process pause detected for 6934 ms, flushing membership messages.
Jan 17 00:33:09 d3 corosync[20686]: [TOTEM ] Process pause detected for 6934 ms, flushing membership messages.
Jan 17 00:33:09 d3 corosync[20686]: [TOTEM ] Process pause detected for 6983 ms, flushing membership messages.
Jan 17 00:33:18 d3 corosync[20686]: [QUORUM] Sync members[6]: 1 2 3 4 6 9
Jan 17 00:33:18 d3 corosync[20686]: [TOTEM ] A new membership (1.1f4a5) was formed. Members
Jan 17 00:33:18 d3 corosync[20686]: [QUORUM] Members[6]: 1 2 3 4 6 9
Jan 17 00:33:18 d3 corosync[20686]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 17 00:33:18 d3 corosync[20686]: [TOTEM ] Process pause detected for 6634 ms, flushing membership messages.
Jan 17 00:33:18 d3 corosync[20686]: [TOTEM ] Process pause detected for 6635 ms, flushing membership messages.
Jan 17 00:33:18 d3 corosync[20686]: [TOTEM ] Process pause detected for 6635 ms, flushing membership messages.
Jan 17 00:33:18 d3 corosync[20686]: [TOTEM ] Process pause detected for 6635 ms, flushing membership messages.
Jan 17 00:33:18 d3 corosync[20686]: [TOTEM ] Process pause detected for 6635 ms, flushing membership messages.
Jan 17 00:33:18 d3 corosync[20686]: [QUORUM] Sync members[6]: 1 2 3 4 6 9
Jan 17 00:33:18 d3 corosync[20686]: [TOTEM ] A new membership (1.1f4a9) was formed. Members
Jan 17 00:33:18 d3 corosync[20686]: [QUORUM] Members[6]: 1 2 3 4 6 9
Jan 17 00:33:18 d3 corosync[20686]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 17 00:33:18 d3 corosync[20686]: [TOTEM ] Process pause detected for 6684 ms, flushing membership messages.
Jan 17 00:33:18 d3 corosync[20686]: [TOTEM ] Process pause detected for 6684 ms, flushing membership messages.
Jan 17 00:33:18 d3 corosync[20686]: [TOTEM ] Process pause detected for 6684 ms, flushing membership messages.
Jan 17 00:33:18 d3 corosync[20686]: [TOTEM ] Process pause detected for 6685 ms, flushing membership messages.
Jan 17 00:33:18 d3 corosync[20686]: [TOTEM ] Process pause detected for 6685 ms, flushing membership messages.
Jan 17 00:33:18 d3 corosync[20686]: [TOTEM ] Process pause detected for 6728 ms, flushing membership messages.
Jan 17 00:33:18 d3 corosync[20686]: [TOTEM ] Process pause detected for 6729 ms, flushing membership messages.

This repeated all night until:

Jan 17 00:33:22 d3 corosync[20686]: [TOTEM ] A new membership (1.1f509) was formed. Members joined: 2 3 4 6 9 left: 2 3 4 6 9
Jan 17 00:33:22 d3 corosync[20686]: [TOTEM ] Failed to receive the leave message. failed: 2 3 4 6 9
Jan 17 00:33:22 d3 corosync[20686]: [QUORUM] Sync members[6]: 1 2 3 4 6 9
Jan 17 00:33:22 d3 corosync[20686]: [QUORUM] Sync joined[5]: 2 3 4 6 9
Jan 17 00:33:22 d3 corosync[20686]: [QUORUM] Sync left[5]: 2 3 4 6 9
Jan 17 00:33:27 d3 corosync[20686]: [TOTEM ] A new membership (1.1f975) was formed. Members joined: 4 6 9 left: 4 6 9
Jan 17 00:33:27 d3 corosync[20686]: [TOTEM ] Failed to receive the leave message. failed: 4 6 9
Jan 17 00:33:27 d3 corosync[20686]: [QUORUM] Sync members[7]: 1 2 3 4 5 6 9
Jan 17 00:33:27 d3 corosync[20686]: [QUORUM] Sync joined[1]: 5
Jan 17 00:33:27 d3 corosync[20686]: [TOTEM ] A new membership (1.1f979) was formed. Members joined: 5

This repeated all night until:

Jan 17 03:44:30 d3 corosync[20686]: [KNET ] link: host: 3 link: 0 is down
Jan 17 03:44:30 d3 corosync[20686]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jan 17 03:44:30 d3 corosync[20686]: [KNET ] host: host: 3 has no active links
Jan 17 03:44:37 d3 corosync[20686]: [KNET ] rx: host: 3 link: 0 is up
Jan 17 03:44:37 d3 corosync[20686]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)

I restarted corosync because the cluster was lost and not working, and this is from after the restart:

Jan 17 08:19:18 d3 corosync[17949]: [TOTEM ] A new membership (1.1f98b) was formed. Members joined: 5
Jan 17 08:19:18 d3 corosync[17949]: [QUORUM] Members[2]: 1 5
Jan 17 08:19:18 d3 corosync[17949]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 17 08:19:18 d3 corosync[17949]: [QUORUM] Sync members[7]: 1 2 3 4 5 6 9
Jan 17 08:19:18 d3 corosync[17949]: [QUORUM] Sync joined[5]: 2 3 4 6 9
Jan 17 08:19:18 d3 corosync[17949]: [TOTEM ] A new membership (1.1f98f) was formed. Members joined: 2 3 4 6 9
Jan 17 08:19:18 d3 corosync[17949]: [QUORUM] This node is within the primary component and will provide service.
Jan 17 08:19:18 d3 corosync[17949]: [QUORUM] Members[7]: 1 2 3 4 5 6 9
Jan 17 08:19:18 d3 corosync[17949]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 17 08:19:19 d3 corosync[17949]: [TOTEM ] Retransmit List: 82
Jan 17 08:19:20 d3 corosync[17949]: [TOTEM ] Retransmit List: 8c 8e 8f 91 92 94
Jan 17 08:19:20 d3 corosync[17949]: [TOTEM ] Retransmit List: 92 94
Jan 17 08:19:20 d3 corosync[17949]: [TOTEM ] Retransmit List: 89 8b 8c 8e 8f 91 92 94


And I also saw these two:



Jan 17 12:21:30 d3 corosync[17949]: [CPG ] *** 0x5586dd1bc1a0 can't mcast to group pve_kvstore_v1 state:1, error:12
Jan 17 12:21:30 d3 corosync[17949]: [CPG ] *** 0x5586dd1bc1a0 can't mcast to group pve_kvstore_v1 state:1, error:12
Jan 17 12:21:30 d3 corosync[17949]: [CPG ] *** 0x5586dd1bc1a0 can't mcast to group pve_kvstore_v1 state:1, error:12
Jan 17 12:21:30 d3 corosync[17949]: [CPG ] *** 0x5586dd1bc1a0 can't mcast to group pve_kvstore_v1 state:1, error:12
Jan 17 12:21:30 d3 corosync[17949]: [CPG ] *** 0x5586dd1bc1a0 can't mcast to group pve_kvstore_v1 state:1, error:12
Jan 17 12:21:30 d3 corosync[17949]: [CPG ] *** 0x5586dd1bc1a0 can't mcast to group pve_kvstore_v1 state:1, error:12
Jan 17 12:21:30 d3 corosync[17949]: [CPG ] *** 0x5586dd1bc1a0 can't mcast to group pve_kvstore_v1 state:1, error:12
Jan 17 12:21:30 d3 corosync[17949]: [CPG ] *** 0x5586dd1bc1a0 can't mcast to group pve_kvstore_v1 state:1, error:12
Jan 17 12:21:30 d3 corosync[17949]: [CPG ] *** 0x5586dd1bc1a0 can't mcast to group pve_kvstore_v1 state:1, error:12
Jan 17 12:21:30 d3 corosync[17949]: [CPG ] *** 0x5586dd1bc1a0 can't mcast to group pve_kvstore_v1 state:1, error:12
Jan 17 12:21:30 d3 corosync[17949]: [CPG ] *** 0x5586dd1bc1a0 can't mcast to group pve_kvstore_v1 state:1, error:12
Jan 17 12:21:30 d3 corosync[17949]: [MAIN ] qb_ipcs_event_send: Transport endpoint is not connected (107)


The log is huge, full of these.


● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
     Active: active (running) since Sun 2022-01-16 22:10:49 EET; 16h ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 23303 (corosync)
      Tasks: 9 (limit: 4915)
     Memory: 4.1G
     CGroup: /system.slice/corosync.service
             └─23303 /usr/sbin/corosync -f

Jan 17 14:21:06 d4 corosync[23303]: [TOTEM ] Retransmit List: 5d51
Jan 17 14:28:46 d4 corosync[23303]: [TOTEM ] Retransmit List: 6ba1
Jan 17 14:43:16 d4 corosync[23303]: [TOTEM ] Retransmit List: 86ce
Jan 17 14:49:06 d4 corosync[23303]: [TOTEM ] Retransmit List: 9196
Jan 17 14:51:29 d4 corosync[23303]: [TOTEM ] Retransmit List: 9603
Jan 17 14:59:03 d4 corosync[23303]: [KNET ] link: host: 3 link: 0 is down
Jan 17 14:59:03 d4 corosync[23303]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jan 17 14:59:03 d4 corosync[23303]: [KNET ] host: host: 3 has no active links
Jan 17 14:59:11 d4 corosync[23303]: [KNET ] rx: host: 3 link: 0 is up
Jan 17 14:59:11 d4 corosync[23303]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)

That is 4.1G after one week; it will be back at 30GB again soon.
 
frequent retransmit lists indicate some sort of network issue / instability. these messages might indicate scheduling/load problems (whether one causes the other and in which direction is hard to tell with the limited information):

Code:
Jan 17 00:33:18 d3 corosync[20686]:   [TOTEM ] Process pause detected for 6684 ms, flushing membership messages.

there is a known (but not trivial to fix) issue with corosync where membership changes cause a small memory leak - if you have very frequent membership changes caused by network issues/overload, this might lead to high memory usage by the corosync process. if you fix the underlying issue causing the membership changes this issue should disappear as well..
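a rough way to see how often memberships are re-forming (every "A new membership ..." line in the log is one membership change):

Code:
# count membership changes since yesterday - a large number indicates churn
journalctl -u corosync --since yesterday | grep -c "A new membership"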
 

Well, I don't think we have network issues; our entire network is 10/40Gb, and we have a different cluster with 3 nodes where there is never a problem.

What can you recommend checking?
 
configure corosync to use a dedicated physical link (or two, if you can) - not shared with other usage that can affect the latency. monitor the corosync logs for "hiccups" like retransmits or link down events and try to cross-reference with logs from other services/monitoring/.. to find out what the cause is.
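as a sketch, adding a second, dedicated link in /etc/pve/corosync.conf could look roughly like this (the 192.168.10.x addresses are made up, every node gets a ring1_addr on that network, and config_version must be bumped whenever the file is edited):

Code:
nodelist {
  node {
    name: server3
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 89.XX.XX.253    # existing link 0
    ring1_addr: 192.168.10.3    # hypothetical dedicated cluster link 1
  }
  # ... add a ring1_addr to every other node in the same way ...
}

totem {
  # ...
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
    # set knet_link_priority here if the dedicated link should be preferred
    # (see corosync.conf(5) for the exact priority semantics)
  }
}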
 
"configure corosync to use a dedicated physical link"
Using local network ? Like: 192.168..... on the secand dedicated network port on the server ?
It will cause speed problems when i migrate VMs ?

On all the servers we have dedicated nic 10GB port and the integrated server has ar 1GB, if i use the 1GB ports for local network and migrate an VM witch nic he will use ? :)
 
Found the issue: on one of our servers the cable was broken and the link was at 100Mbps. When one VM used all of that bandwidth, the cluster broke. We changed the cable, now it is at 1000Mbps and everything is OK.

The other servers are at 10/40Gb and they don't have this issue.
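For anyone hitting something similar: the negotiated link speed is easy to check with ethtool (the interface name is just an example):

Code:
# a bad cable typically shows up as "Speed: 100Mb/s" instead of 1000Mb/s or more
ethtool eno1 | grep -E "Speed|Duplex|Link detected"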

Thank you.
 
One more thing, and I think I found another issue :)


logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: server12
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 89.XX.XX.4
  }
  node {
    name: server3
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 89.XX.XX.253
  }
  node {
    name: server4
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 89.XX.XX.243
  }
  node {
    name: server5
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 89.XX.XX.245
  }
  node {
    name: server7
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 188.XX.XX.253
  }
  node {
    name: server8
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 188.XX.XX.249
  }
  node {
    name: server9
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 188.XX.XX.254
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Cluster
  config_version: 41
  interface {
    bindnetaddr: 89.XX.XX.251   <-- this was a server that has been removed from the cluster; is it normal for this IP to still be here? I think it should be the IP of another server in the cluster, right?
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  token: 10000
  version: 2
}
 
the bindnetaddr isn't really relevant anymore (it was used for udp / multicast, but for knet it doesn't do anything)
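if you want to tidy it up anyway, the totem section could be reduced to something like this (just a sketch; bump config_version whenever you edit /etc/pve/corosync.conf):

Code:
totem {
  cluster_name: Cluster
  config_version: 42   # was 41, must be increased on every edit
  ip_version: ipv4
  secauth: on
  token: 10000
  version: 2
  # the old interface/bindnetaddr block is not needed with knet,
  # the addresses come from the nodelist
}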
 
