Corosync 100% CPU load [solved]

SellerOfSmiles

Feb 29, 2024
Hi. After installing PVE 8.0.3 on a new node and adding it to my cluster (the other nodes run PVE 7.*-***), Corosync is using 100% CPU in one thread.

Why is it doing that? :) Or maybe I am doing something wrong?

Code:
# apt list corosync
Listing... Done
corosync/now 3.1.7-pve3 amd64 [installed,local]

Code:
# journalctl -u corosync -f -n 30
Feb 29 11:54:07 S-VIRT04 corosync[433373]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Feb 29 11:54:07 S-VIRT04 corosync[433373]:   [KNET  ] host: host: 2 has no active links
Feb 29 11:54:07 S-VIRT04 corosync[433373]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 0)
Feb 29 11:54:07 S-VIRT04 corosync[433373]:   [KNET  ] host: host: 3 has no active links
Feb 29 11:54:07 S-VIRT04 corosync[433373]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 29 11:54:07 S-VIRT04 corosync[433373]:   [KNET  ] host: host: 3 has no active links
Feb 29 11:54:07 S-VIRT04 corosync[433373]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 29 11:54:07 S-VIRT04 corosync[433373]:   [KNET  ] host: host: 3 has no active links
Feb 29 11:54:07 S-VIRT04 corosync[433373]:   [KNET  ] link: Resetting MTU for link 0 because host 4 joined
Feb 29 11:54:07 S-VIRT04 corosync[433373]:   [QUORUM] Sync members[1]: 4
Feb 29 11:54:07 S-VIRT04 corosync[433373]:   [QUORUM] Sync joined[1]: 4
Feb 29 11:54:07 S-VIRT04 corosync[433373]:   [TOTEM ] A new membership (4.4f32) was formed. Members joined: 4
Feb 29 11:54:07 S-VIRT04 corosync[433373]:   [QUORUM] Members[1]: 4
Feb 29 11:54:07 S-VIRT04 corosync[433373]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 29 11:54:07 S-VIRT04 systemd[1]: Started corosync.service - Corosync Cluster Engine.
Feb 29 11:54:16 S-VIRT04 corosync[433373]:   [KNET  ] rx: host: 3 link: 0 is up
Feb 29 11:54:16 S-VIRT04 corosync[433373]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
Feb 29 11:54:16 S-VIRT04 corosync[433373]:   [KNET  ] rx: host: 1 link: 0 is up
Feb 29 11:54:16 S-VIRT04 corosync[433373]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Feb 29 11:54:16 S-VIRT04 corosync[433373]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 29 11:54:16 S-VIRT04 corosync[433373]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 29 11:54:16 S-VIRT04 corosync[433373]:   [QUORUM] Sync members[4]: 1 2 3 4
Feb 29 11:54:16 S-VIRT04 corosync[433373]:   [QUORUM] Sync joined[3]: 1 2 3
Feb 29 11:54:16 S-VIRT04 corosync[433373]:   [TOTEM ] A new membership (1.4f36) was formed. Members joined: 1 2 3
Feb 29 11:54:16 S-VIRT04 corosync[433373]:   [QUORUM] This node is within the primary component and will provide service.
Feb 29 11:54:16 S-VIRT04 corosync[433373]:   [QUORUM] Members[4]: 1 2 3 4
Feb 29 11:54:16 S-VIRT04 corosync[433373]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 29 11:54:16 S-VIRT04 corosync[433373]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 453 to 65397
Feb 29 11:54:16 S-VIRT04 corosync[433373]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 453 to 65397
Feb 29 11:54:16 S-VIRT04 corosync[433373]:   [KNET  ] pmtud: Global data MTU changed to: 65397

Code:
# htop

    0[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||95.0%] Tasks: 58, 39 thr, 131 kthr; 3 running
    1[||||||                                                                                5.0%] Load average: 2.13 2.06 1.95
    2[|||||                                                                                 4.1%] Uptime: 2 days, 03:51:21
    3[|||||                                                                                 4.5%]
  Mem[||||||||||||                                                                   2.39G/31.1G]
  Swp[                                                                                  0K/8.00G]

  [Main] [I/O]
    PID USER       PRI  NI  VIRT   RES   SHR S  CPU%▽MEM%   TIME+  Command
 433373 root        RT   0  676M  165M 53084 S  96.3  0.5  2h01:17 /usr/sbin/corosync -f
 433380 root        RT   0  676M  165M 53084 R  95.3  0.5  2h00:12 /usr/sbin/corosync -f
   1220 root        20   0 2935M 1168M 19840 S   6.6  3.7  2h44:52 /usr/bin/kvm -id 112...
 
Can you post your /etc/network/interfaces file please?
 
Code:
auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual

auto eno2
iface eno2 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3

auto vmbr0
iface vmbr0 inet static
        address x.x.x.19/24
        gateway x.x.x.254
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
 
Where does this MTU come from?


Code:
Feb 29 11:54:16 S-VIRT04 corosync[433373]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 453 to 65397
Feb 29 11:54:16 S-VIRT04 corosync[433373]:   [KNET  ] pmtud: Global data MTU changed to: 65397

This looks wrong; it should look more like this (depending on whether you use jumbo frames on one of your links):

Code:
Feb 28 13:29:50 PMX4 corosync[2498]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
Feb 28 13:29:50 PMX4 corosync[2498]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 1 from 469 to 8885
Feb 28 13:29:50 PMX4 corosync[2498]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Feb 28 13:29:50 PMX4 corosync[2498]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 1 from 469 to 8885
Feb 28 13:29:50 PMX4 corosync[2498]:   [KNET  ] pmtud: Global data MTU changed to: 1397
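
To compare the two cases quickly, the value knet reports can be pulled out of the journal and checked against the interface MTU. This is only a rough sketch: `last_knet_mtu` is a made-up helper name, and it assumes the log wording stays exactly as in the excerpts above.

```shell
# Hypothetical helper (not part of corosync): reads corosync journal lines
# on stdin and prints the last "Global data MTU" value that knet reported.
last_knet_mtu() {
    sed -n 's/.*Global data MTU changed to: \([0-9][0-9]*\).*/\1/p' | tail -n 1
}

# Typical use on a node (left commented out because it needs journald access):
# journalctl -u corosync --no-pager | last_knet_mtu
# ip link show vmbr0    # interface MTU of the bridge from the posted config
```

A reported value far above the MTU shown by `ip link` for the corosync interface, as in the 65397 case, is the mismatch being asked about here.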
 
I don't understand why it says "Global data MTU changed to: 65397". It is still like that.
But when I set netmtu: 1400 in corosync.conf and restarted Corosync on all nodes, CPU utilization decreased.
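
For anyone who finds this later, the change sits in the totem section of corosync.conf. This is only a sketch of the relevant fragment: the other totem settings are left out and must stay as they are in your own file. On Proxmox the cluster-wide file is /etc/pve/corosync.conf, and config_version has to be incremented when it is edited.

```
totem {
  # ... existing settings (cluster_name, config_version, interface, ...) unchanged ...
  netmtu: 1400
}
```

After that I restarted corosync on every node (systemctl restart corosync).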

o_O