Corosync breaks the cluster and takes down the whole network

pidumenk

New Member
Oct 2, 2023
Good afternoon,

Over the last few days we have been experiencing unpredictable behaviour of our Proxmox cluster based on pve-manager/7.4-16/0f39f621 (running kernel: 5.15.83-1-pve). Every single time we leave corosync.service running overnight, it breaks the whole network and stops all services from working properly. Initially, we thought the problem could be two Proxmox 8 nodes that had accidentally been joined to the cluster. After removing them, we still keep getting the same problem. Below I've attached a full log from one of the nodes. The corosync problem and the network issue appear around 00:40. Around 5:00 the corosync component was disabled on all nodes, and that healed everything. Does anyone have any ideas about the root cause of such a problem?

Moreover, I noticed that one of the nodes has the same issue as described here: https://forum.proxmox.com/threads/one-node-in-cluster-brings-everything-down.128862/

I've noticed that the authkey.pub file on that node is older, so following notes from other threads, I removed it and rebooted the machine. The cluster became unstable: corosync errors, the UI is unresponsive (invalid ticket on that node), and the GUI throws me out.

Bash:
# systemctl start corosync
Job for corosync.service failed because the control process exited with error code.
See "systemctl status corosync.service" and "journalctl -xe" for details.
# systemctl status corosync
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2024-01-03 14:29:31 CET; 1s ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
    Process: 1633553 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=8)
   Main PID: 1633553 (code=exited, status=8)
        CPU: 24ms


Jan 03 14:29:31 r1c2s2 systemd[1]: Starting Corosync Cluster Engine...
Jan 03 14:29:31 r1c2s2 corosync[1633553]:   [MAIN  ] Corosync Cluster Engine 3.1.7 starting up
Jan 03 14:29:31 r1c2s2 corosync[1633553]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf>
Jan 03 14:29:31 r1c2s2 corosync[1633553]:   [MAIN  ] Could not open /etc/corosync/authkey: No such file or directory
Jan 03 14:29:31 r1c2s2 corosync[1633553]:   [MAIN  ] Corosync Cluster Engine exiting with status 8 at main.c:1417.
Jan 03 14:29:31 r1c2s2 systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
Jan 03 14:29:31 r1c2s2 systemd[1]: corosync.service: Failed with result 'exit-code'.
Jan 03 14:29:31 r1c2s2 systemd[1]: Failed to start Corosync Cluster Engine.

This is the output of pveversion -v from all nodes:
Bash:
proxmox-ve: 7.4-1 (running kernel: 5.15.83-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-4
pve-kernel-5.15.108-1-pve: 5.15.108-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 14.2.21-1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx4
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
 

Hello,

Do you have HA resources in your cluster? I see that you lose Corosync connection quite often on many nodes, e.g.

```
Jan 3 00:31:13 r2c13 corosync[2241956]: [QUORUM] Sync members[1]: 3
Jan 3 00:31:13 r2c13 corosync[2241956]: [TOTEM ] A new membership (3.16810) was formed. Members
Jan 3 00:31:13 r2c13 corosync[2241956]: [QUORUM] Members[1]: 3
Jan 3 00:31:13 r2c13 corosync[2241956]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 3 00:31:13 r2c13 corosync[2241956]: [QUORUM] Sync members[6]: 1 3 5 15 17 18
Jan 3 00:31:13 r2c13 corosync[2241956]: [QUORUM] Sync joined[5]: 1 5 15 17 18
```
Could you please send us the output of `pvecm status` and the contents of `/etc/pve/corosync.conf`? Additionally, could you verify that the nodes that are not part of the cluster are not listed in either output?
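For reference, a quick way to collect that information on a PVE 7 node (standard tools; skip the ha-manager calls if you do not use HA):

```
# quorum / membership overview
pvecm status

# cluster-wide corosync configuration distributed via pmxcfs
cat /etc/pve/corosync.conf

# check whether any HA resources are configured
ha-manager status
ha-manager config
```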
 
I've noticed that the authkey.pub file on that node is older, so following notes from other threads, I removed it and rebooted the machine. The cluster became unstable: corosync errors, the UI is unresponsive (invalid ticket on that node), and the GUI throws me out.

Jan 03 14:29:31 r1c2s2 corosync[1633553]: [MAIN ] Could not open /etc/corosync/authkey: No such file or directory

You really need the authkey on each node (you can recopy it; it should be the same on every node).

I'm not sure of the cluster behaviour if a node keeps trying again and again to connect with a missing authkey ...
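
A minimal sketch of recopying it, assuming root SSH between the nodes works and `r1c2s1` stands in for any healthy cluster member (hypothetical name):

Bash:
# pull the corosync key from a healthy cluster member
scp root@r1c2s1:/etc/corosync/authkey /etc/corosync/authkey

# the key is normally owned by root with mode 0400
chown root:root /etc/corosync/authkey
chmod 400 /etc/corosync/authkey

# try starting corosync again
systemctl start corosync
systemctl status corosync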
 
You really need the authkey on each node (you can recopy it; it should be the same on every node).

I'm not sure of the cluster behaviour if a node keeps trying again and again to connect with a missing authkey ...
I cannot copy it into the /etc/pve/ folder; the target file there is read-only.

Bash:
mv authkey.pub /etc/pve/
mv: inter-device move failed: 'authkey.pub' to '/etc/pve/authkey.pub'; unable to remove target: Permission denied
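
For context, /etc/pve is not a regular directory but the pmxcfs FUSE mount, and it becomes read-only when the node has no quorum. A quick sketch to check that (the `pvecm expected 1` workaround forces write access on a single node and should only be used temporarily):

Bash:
# confirm /etc/pve is the pmxcfs FUSE mount
mount | grep /etc/pve

# check the quorum state; without quorum /etc/pve stays read-only
pvecm status

# temporary single-node workaround to regain write access
pvecm expected 1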
 
Hello,

Do you have HA resources in your cluster? I see that you lose Corosync connection quite often on many nodes, e.g.

```
Jan 3 00:31:13 r2c13 corosync[2241956]: [QUORUM] Sync members[1]: 3
Jan 3 00:31:13 r2c13 corosync[2241956]: [TOTEM ] A new membership (3.16810) was formed. Members
Jan 3 00:31:13 r2c13 corosync[2241956]: [QUORUM] Members[1]: 3
Jan 3 00:31:13 r2c13 corosync[2241956]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 3 00:31:13 r2c13 corosync[2241956]: [QUORUM] Sync members[6]: 1 3 5 15 17 18
Jan 3 00:31:13 r2c13 corosync[2241956]: [QUORUM] Sync joined[5]: 1 5 15 17 18
```
Could you please send us the output of `pvecm status` and the contents of `/etc/pve/corosync.conf`? Additionally, could you verify that the nodes that are not part of the cluster are not listed in either output?
Hello, the problem still exists even after I deleted all suspicious nodes from the cluster. By the way, the deletion was done manually; below is how it was done.

Code:
# Remove nodes from the cluster locally
# ssh to all nodes sequentially

systemctl stop corosync
systemctl stop pve-cluster
pmxcfs -l
cd /etc/pve
rm corosync.conf.*

# delete <NODE_NAME>
vim corosync.conf
 
# Delete node dependencies from the other nodes in the cluster, where NODE_NAME is the node to delete
rm -r /etc/pve/nodes/<NODE_NAME>

killall pmxcfs
systemctl start pve-cluster
systemctl start corosync

When corosync is on and the cluster has a green status, it only lasts a short time before the network goes crazy again. Yesterday I used Ansible to compare /etc/pve/corosync.conf across all nodes - everything is OK, there is no difference between the configs. I also compared /etc/pve/corosync.conf against /etc/corosync/corosync.conf - same thing, it's OK.
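
For what it's worth, that comparison can be done with a simple ad-hoc run, assuming an inventory group named `pve` (hypothetical name):

Bash:
# identical checksums on every node mean identical files
ansible pve -m command -a "md5sum /etc/pve/corosync.conf /etc/corosync/corosync.conf"

# the config_version should also match everywhere
ansible pve -m command -a "grep config_version /etc/corosync/corosync.conf"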

But I noticed that /etc/corosync/authkey has different permissions compared to the other nodes. Why is that happening? Last time, one node had no /etc/corosync/authkey at all. Could this be a problem?

Bash:
r3c3s1 | CHANGED | rc=0 >>
-rw-r--r-- 1 root root 256 Mar 28  2023 /etc/corosync/authkey
r3c4s1 | CHANGED | rc=0 >>
-r-------- 1 root root 256 Apr  1  2022 /etc/corosync/authkey
r3c7s4 | CHANGED | rc=0 >>
-rw-r--r-- 1 root root 256 Apr  1  2022 /etc/corosync/authkey
r3c9s1 | CHANGED | rc=0 >>
-rw-r--r-- 1 root root 256 Mar 30  2023 /etc/corosync/authkey
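
On a stock install corosync-keygen creates the key with mode 0400 and owner root:root, so the -rw-r--r-- entries stand out. A sketch of normalizing that across nodes (same hypothetical `pve` inventory group; whether the looser permissions are related to the outages is not certain):

Bash:
ansible pve -b -m command -a "chown root:root /etc/corosync/authkey"
ansible pve -b -m command -a "chmod 400 /etc/corosync/authkey"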

Corosync is disabled now, therefore the output of pvecm is the following:
Bash:
Cluster information
-------------------
Name:             pve-production
Config Version:   33
Transport:        knet
Secure auth:      on

Cannot initialize CMAP service
 
