Good afternoon,
For the last few days we have been experiencing unpredictable behaviour in our Proxmox cluster based on pve-manager/7.4-16/0f39f621 (running kernel: 5.15.83-1-pve). Every time we leave corosync.service running overnight, it breaks the whole network and stops all services from working properly. Initially we thought the cause might be two Proxmox 8 nodes that had been accidentally joined to the cluster, but after removing them we still get the same problem. Below I've attached a full log from one of the nodes. The corosync problem and the network issue appear around 00:40. Around 5:00 corosync was disabled on all nodes, and that restored everything. Does anyone have an idea what the root cause of this problem could be?
Moreover, I noticed that one of the nodes has the same issue as described here: https://forum.proxmox.com/threads/one-node-in-cluster-brings-everything-down.128862/
I noticed that the authkey.pub file on that node is older, so, following advice from other threads, I removed it and rebooted the machine. The cluster became unstable: corosync errors, the UI is unresponsive (invalid ticket on that node), and the GUI throws me out.
Bash:
# systemctl start corosync
Job for corosync.service failed because the control process exited with error code.
See "systemctl status corosync.service" and "journalctl -xe" for details.
# systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2024-01-03 14:29:31 CET; 1s ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Process: 1633553 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=8)
Main PID: 1633553 (code=exited, status=8)
CPU: 24ms
Jan 03 14:29:31 r1c2s2 systemd[1]: Starting Corosync Cluster Engine...
Jan 03 14:29:31 r1c2s2 corosync[1633553]: [MAIN ] Corosync Cluster Engine 3.1.7 starting up
Jan 03 14:29:31 r1c2s2 corosync[1633553]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf>
Jan 03 14:29:31 r1c2s2 corosync[1633553]: [MAIN ] Could not open /etc/corosync/authkey: No such file or directory
Jan 03 14:29:31 r1c2s2 corosync[1633553]: [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1417.
Jan 03 14:29:31 r1c2s2 systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
Jan 03 14:29:31 r1c2s2 systemd[1]: corosync.service: Failed with result 'exit-code'.
Jan 03 14:29:31 r1c2s2 systemd[1]: Failed to start Corosync Cluster Engine.
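The error above says that /etc/corosync/authkey could not be opened. A minimal sketch for checking which key files are present on a node follows. The helper name report_key_files is hypothetical, and /etc/pve/authkey.pub is only an assumption about where the removed authkey.pub lives; adjust the paths to your setup.

```shell
#!/bin/sh
# Sketch: report which cluster key files exist on this node.
# report_key_files is a hypothetical helper, not a Proxmox tool.
report_key_files() {
    missing=0
    for f in "$@"; do
        if [ -e "$f" ]; then
            echo "present: $f"
        else
            echo "MISSING: $f"
            missing=$((missing + 1))
        fi
    done
    # Exit status is non-zero when at least one file is absent.
    [ "$missing" -eq 0 ]
}

# /etc/corosync/authkey is the path named in the corosync error;
# /etc/pve/authkey.pub is an assumed location for the removed file.
report_key_files /etc/corosync/authkey /etc/pve/authkey.pub || true
```

Running this on every node would show whether only this node lacks the key or whether the file is missing more widely. It only reports and does not change anything.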
This is the output of pveversion -v from all nodes:
Bash:
proxmox-ve: 7.4-1 (running kernel: 5.15.83-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-4
pve-kernel-5.15.108-1-pve: 5.15.108-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 14.2.21-1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx4
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
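To double-check that the package versions really match on every node, the saved outputs can be compared pairwise. This is only a sketch: diff_versions is a hypothetical helper, and it assumes the pveversion -v output of each node has been saved to its own text file beforehand.

```shell
#!/bin/sh
# Sketch: list lines that differ between two saved `pveversion -v` outputs.
# diff_versions is a hypothetical helper; the input files are assumed to
# have been created beforehand, e.g. with: pveversion -v > node.txt
diff_versions() {
    a=$(mktemp) || return 2
    b=$(mktemp) || { rm -f "$a"; return 2; }
    # Sort both inputs so that line order does not matter.
    sort "$1" > "$a"
    sort "$2" > "$b"
    # comm -3 hides lines common to both files; lines found only in the
    # second file are printed with a leading tab.
    comm -3 "$a" "$b"
    rm -f "$a" "$b"
}

# Example call (file names are placeholders):
#   diff_versions r1c2s2.txt other-node.txt
```

Empty output means the two files list identical versions. Any line that is printed points at a package that differs between the two nodes.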