Corosync breaks the cluster and takes down the whole network

pidumenk

New Member
Oct 2, 2023
Good afternoon,

Over the last few days we have been experiencing unpredictable behaviour of our Proxmox cluster based on pve-manager/7.4-16/0f39f621 (running kernel: 5.15.83-1-pve). Every single time we leave corosync.service running overnight, it breaks the whole network and stops all services from working properly. Initially, we thought the problem could be two Proxmox 8 nodes that had accidentally been joined to the cluster. After removing them, we still keep getting the same problem. Below I've attached a full log from one of the nodes. The corosync problem and the network issue appear around 00:40. Around 5:00 the corosync component was disabled on all nodes, and that healed everything. Does anyone have any ideas about the root cause of such a problem?

Moreover, I noticed that one of the nodes has the same issue as described here: https://forum.proxmox.com/threads/one-node-in-cluster-brings-everything-down.128862/

I've noticed that the authkey.pub file on that node is older, so following notes from other threads, I removed it and rebooted the machine. The cluster became unstable: corosync errors, the UI is unresponsive (invalid ticket on that node), and the GUI throws me out.

Bash:
# systemctl start corosync
Job for corosync.service failed because the control process exited with error code.
See "systemctl status corosync.service" and "journalctl -xe" for details.
# systemctl status corosync
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2024-01-03 14:29:31 CET; 1s ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
    Process: 1633553 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=8)
   Main PID: 1633553 (code=exited, status=8)
        CPU: 24ms


Jan 03 14:29:31 r1c2s2 systemd[1]: Starting Corosync Cluster Engine...
Jan 03 14:29:31 r1c2s2 corosync[1633553]:   [MAIN  ] Corosync Cluster Engine 3.1.7 starting up
Jan 03 14:29:31 r1c2s2 corosync[1633553]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf>
Jan 03 14:29:31 r1c2s2 corosync[1633553]:   [MAIN  ] Could not open /etc/corosync/authkey: No such file or directory
Jan 03 14:29:31 r1c2s2 corosync[1633553]:   [MAIN  ] Corosync Cluster Engine exiting with status 8 at main.c:1417.
Jan 03 14:29:31 r1c2s2 systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
Jan 03 14:29:31 r1c2s2 systemd[1]: corosync.service: Failed with result 'exit-code'.
Jan 03 14:29:31 r1c2s2 systemd[1]: Failed to start Corosync Cluster Engine.

This is the output of pveversion -v from all nodes:
Bash:
proxmox-ve: 7.4-1 (running kernel: 5.15.83-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-4
pve-kernel-5.15.108-1-pve: 5.15.108-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 14.2.21-1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx4
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
 

Hello,

Do you have HA resources in your cluster? I see that you lose Corosync connection quite often on many nodes, e.g.

```
Jan 3 00:31:13 r2c13 corosync[2241956]: [QUORUM] Sync members[1]: 3
Jan 3 00:31:13 r2c13 corosync[2241956]: [TOTEM ] A new membership (3.16810) was formed. Members
Jan 3 00:31:13 r2c13 corosync[2241956]: [QUORUM] Members[1]: 3
Jan 3 00:31:13 r2c13 corosync[2241956]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 3 00:31:13 r2c13 corosync[2241956]: [QUORUM] Sync members[6]: 1 3 5 15 17 18
Jan 3 00:31:13 r2c13 corosync[2241956]: [QUORUM] Sync joined[5]: 1 5 15 17 18
```
Could you please send us the output of `pvecm status` and the contents of `/etc/pve/corosync.conf`? Additionally, could you verify that the nodes that are not part of the cluster are not listed in either output?
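For reference, a quick way to collect that information on a PVE 7 node (standard tools; skip the ha-manager calls if you do not use HA):

```
# quorum / membership overview
pvecm status

# cluster-wide corosync configuration distributed via pmxcfs
cat /etc/pve/corosync.conf

# check whether any HA resources are configured
ha-manager status
ha-manager config
```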
 
I've noticed that the authkey.pub file on that node is older, so following notes from other threads, I removed it and rebooted the machine. The cluster became unstable: corosync errors, the UI is unresponsive (invalid ticket on that node), and the GUI throws me out.

Jan 03 14:29:31 r1c2s2 corosync[1633553]: [MAIN ] Could not open /etc/corosync/authkey: No such file or directory

You really need the authkey on each node (you can recopy it; it should be the same on every node).

I'm not sure of the cluster behaviour if a node keeps trying again and again to connect with a missing authkey ...
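
A minimal sketch of recopying it, assuming root SSH between the nodes works and `r1c2s1` stands in for any healthy cluster member (hypothetical name):

Bash:
# pull the corosync key from a healthy cluster member
scp root@r1c2s1:/etc/corosync/authkey /etc/corosync/authkey

# the key is normally owned by root with mode 0400
chown root:root /etc/corosync/authkey
chmod 400 /etc/corosync/authkey

# try starting corosync again
systemctl start corosync
systemctl status corosync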
 
You really need the authkey on each node (you can recopy it; it should be the same on every node).

I'm not sure of the cluster behaviour if a node keeps trying again and again to connect with a missing authkey ...
I cannot copy it into the /etc/pve/ folder; the target file there is read-only.

Bash:
mv authkey.pub /etc/pve/
mv: inter-device move failed: 'authkey.pub' to '/etc/pve/authkey.pub'; unable to remove target: Permission denied
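
For context, /etc/pve is not a regular directory but the pmxcfs FUSE mount, and it becomes read-only when the node has no quorum. A quick sketch to check that (the `pvecm expected 1` workaround forces write access on a single node and should only be used temporarily):

Bash:
# confirm /etc/pve is the pmxcfs FUSE mount
mount | grep /etc/pve

# check the quorum state; without quorum /etc/pve stays read-only
pvecm status

# temporary single-node workaround to regain write access
pvecm expected 1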
 
Hello,

Do you have HA resources in your cluster? I see that you lose Corosync connection quite often on many nodes, e.g.

```
Jan 3 00:31:13 r2c13 corosync[2241956]: [QUORUM] Sync members[1]: 3
Jan 3 00:31:13 r2c13 corosync[2241956]: [TOTEM ] A new membership (3.16810) was formed. Members
Jan 3 00:31:13 r2c13 corosync[2241956]: [QUORUM] Members[1]: 3
Jan 3 00:31:13 r2c13 corosync[2241956]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 3 00:31:13 r2c13 corosync[2241956]: [QUORUM] Sync members[6]: 1 3 5 15 17 18
Jan 3 00:31:13 r2c13 corosync[2241956]: [QUORUM] Sync joined[5]: 1 5 15 17 18
```
Could you please send us the output of `pvecm status` and the contents of `/etc/pve/corosync.conf`? Additionally, could you verify that the nodes that are not part of the cluster are not listed in either output?
Hello, the problem still exists even after I deleted all suspicious nodes from the cluster. By the way, the deletion was done manually; below is how it was done.

Code:
# Remove nodes from the cluster locally
# ssh to all nodes sequentially

systemctl stop corosync
systemctl stop pve-cluster
pmxcfs -l
cd /etc/pve
rm corosync.conf.*

# delete <NODE_NAME>
vim corosync.conf
 
# Delete node dependencies from the other nodes in the cluster, where NODE_NAME is the node to delete
rm -r /etc/pve/nodes/<NODE_NAME>

killall pmxcfs
systemctl start pve-cluster
systemctl start corosync

When corosync is on and the cluster has a green status, it only lasts a short time before the network goes crazy again. Yesterday I used Ansible to compare /etc/pve/corosync.conf across all nodes - everything is OK, there is no difference between the configs. I also compared /etc/pve/corosync.conf against /etc/corosync/corosync.conf - same thing, it's OK.
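
For what it's worth, that comparison can be done with a simple ad-hoc run, assuming an inventory group named `pve` (hypothetical name):

Bash:
# identical checksums on every node mean identical files
ansible pve -m command -a "md5sum /etc/pve/corosync.conf /etc/corosync/corosync.conf"

# the config_version should also match everywhere
ansible pve -m command -a "grep config_version /etc/corosync/corosync.conf"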

But I noticed that /etc/corosync/authkey has different permissions compared to the other nodes. Why is that happening? Last time, one node had no /etc/corosync/authkey at all. Could this be a problem?

Bash:
r3c3s1 | CHANGED | rc=0 >>
-rw-r--r-- 1 root root 256 Mar 28  2023 /etc/corosync/authkey
r3c4s1 | CHANGED | rc=0 >>
-r-------- 1 root root 256 Apr  1  2022 /etc/corosync/authkey
r3c7s4 | CHANGED | rc=0 >>
-rw-r--r-- 1 root root 256 Apr  1  2022 /etc/corosync/authkey
r3c9s1 | CHANGED | rc=0 >>
-rw-r--r-- 1 root root 256 Mar 30  2023 /etc/corosync/authkey
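
On a stock install corosync-keygen creates the key with mode 0400 and owner root:root, so the -rw-r--r-- entries stand out. A sketch of normalizing that across nodes (same hypothetical `pve` inventory group; whether the looser permissions are related to the outages is not certain):

Bash:
ansible pve -b -m command -a "chown root:root /etc/corosync/authkey"
ansible pve -b -m command -a "chmod 400 /etc/corosync/authkey"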

Corosync is disabled now, therefore the output of pvecm is the following:
Bash:
Cluster information
-------------------
Name:             pve-production
Config Version:   33
Transport:        knet
Secure auth:      on

Cannot initialize CMAP service
 
