Dear all,
I am currently dealing with a problem in a hyperconverged (Ceph) Proxmox VE cluster in which the whole cluster reboots seemingly at random. All seven nodes reset at the same time. I suspect that corosync is not able to communicate properly. The problem only appeared after the latest upgrade. After the hard reset, the cluster boots normally. Any help would be highly appreciated.
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 14.2.16-pve1
ceph-fuse: 14.2.16-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-4
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
root@PX-LI-04:~# cat /var/log/syslog | grep corosync
Jan 28 08:13:48 PX-LI-04 corosync[1831]: [QUORUM] Members[7]: 1 2 3 4 5 6 7
Jan 28 08:13:48 PX-LI-04 corosync[1831]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 28 08:13:50 PX-LI-04 corosync[1831]: [KNET ] link: host: 6 link: 0 is down
Jan 28 08:13:50 PX-LI-04 corosync[1831]: [KNET ] link: host: 5 link: 0 is down
Jan 28 08:13:50 PX-LI-04 corosync[1831]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Jan 28 08:13:50 PX-LI-04 corosync[1831]: [KNET ] host: host: 6 has no active links
Jan 28 08:13:50 PX-LI-04 corosync[1831]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Jan 28 08:13:50 PX-LI-04 corosync[1831]: [KNET ] host: host: 5 has no active links
Jan 28 08:13:52 PX-LI-04 corosync[1831]: [KNET ] rx: host: 6 link: 0 is up
Jan 28 08:13:52 PX-LI-04 corosync[1831]: [KNET ] rx: host: 5 link: 0 is up
Jan 28 08:13:52 PX-LI-04 corosync[1831]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Jan 28 08:13:52 PX-LI-04 corosync[1831]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Jan 28 08:13:53 PX-LI-04 corosync[1831]: [TOTEM ] Token has not been received in 154 ms
Jan 28 08:13:57 PX-LI-04 corosync[1831]: [TOTEM ] Token has not been received in 3442 ms
Jan 28 08:14:01 PX-LI-04 corosync[1831]: [TOTEM ] A new membership (1.46c9) was formed. Members
Jan 28 08:14:01 PX-LI-04 corosync[1831]: [QUORUM] Members[7]: 1 2 3 4 5 6 7
Jan 28 08:14:01 PX-LI-04 corosync[1831]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 28 08:14:07 PX-LI-04 corosync[1831]: [TOTEM ] Token has not been received in 153 ms
Jan 28 08:14:12 PX-LI-04 corosync[1831]: [TOTEM ] Token has not been received in 154 ms
Jan 28 08:14:19 PX-LI-04 corosync[1831]: [TOTEM ] Token has not been received in 155 ms
Jan 28 08:14:19 PX-LI-04 corosync[1831]: [TOTEM ] A new membership (1.46d5) was formed. Members
Jan 28 08:14:29 PX-LI-04 corosync[1831]: [KNET ] link: host: 6 link: 0 is down
Jan 28 08:14:29 PX-LI-04 corosync[1831]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Jan 28 08:14:29 PX-LI-04 corosync[1831]: [KNET ] host: host: 6 has no active links
Jan 28 08:14:32 PX-LI-04 corosync[1831]: [KNET ] rx: host: 6 link: 0 is up
Jan 28 08:14:32 PX-LI-04 corosync[1831]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Jan 28 08:14:40 PX-LI-04 corosync[1831]: [TOTEM ] Token has not been received in 155 ms
Jan 28 08:14:54 PX-LI-04 corosync[1831]: [TOTEM ] Token has not been received in 154 ms
Jan 28 08:14:58 PX-LI-04 corosync[1831]: [KNET ] link: host: 3 link: 0 is down
Jan 28 08:14:58 PX-LI-04 corosync[1831]: [KNET ] link: host: 6 link: 0 is down
Jan 28 08:14:58 PX-LI-04 corosync[1831]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jan 28 08:14:58 PX-LI-04 corosync[1831]: [KNET ] host: host: 3 has no active links
Jan 28 08:14:58 PX-LI-04 corosync[1831]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Jan 28 08:14:58 PX-LI-04 corosync[1831]: [KNET ] host: host: 6 has no active links
Jan 28 08:14:58 PX-LI-04 corosync[1831]: [TOTEM ] Token has not been received in 2176 ms
Jan 28 08:14:59 PX-LI-04 corosync[1831]: [TOTEM ] A processor failed, forming new configuration.
Jan 28 08:15:01 PX-LI-04 corosync[1831]: [KNET ] rx: host: 3 link: 0 is up
Jan 28 08:15:01 PX-LI-04 corosync[1831]: [KNET ] rx: host: 6 link: 0 is up
Jan 28 08:15:01 PX-LI-04 corosync[1831]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jan 28 08:15:01 PX-LI-04 corosync[1831]: [KNET ] host: host: 6 (passive) best link: 0 (pri: 1)
Jan 28 08:15:02 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b
Jan 28 08:15:02 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:02 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:02 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:02 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:02 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:02 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:02 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:02 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:03 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:03 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:03 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:03 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:03 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:03 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:04 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:04 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:04 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:04 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:04 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:04 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:04 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:04 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:04 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:04 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:04 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:05 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:06 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:07 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:15:07 PX-LI-04 corosync[1831]: [TOTEM ] Retransmit List: 1b 1c
Jan 28 08:18:01 PX-LI-04 corosync[1817]: [MAIN ] Corosync Cluster Engine 3.0.4 starting up
Jan 28 08:18:01 PX-LI-04 corosync[1817]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Jan 28 08:18:01 PX-LI-04 corosync[1817]: [TOTEM ] Initializing transport (Kronosnet).
Jan 28 08:18:02 PX-LI-04 corosync[1817]: [TOTEM ] kronosnet crypto initialized: aes256/sha256
Jan 28 08:18:02 PX-LI-04 corosync[1817]: [TOTEM ] totemknet initialized
Jan 28 08:18:02 PX-LI-04 corosync[1817]: [KNET ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Jan 28 08:18:02 PX-LI-04 corosync[1817]: [SERV ] Service engine loaded: corosync configuration map access [0]
Jan 28 08:18:02 PX-LI-04 corosync[1817]: [QB ] server name: cmap
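
In case it helps with reading the excerpt above, here is a minimal sketch for tallying the KNET link-down events per host and the reported token delays. The function names `summarize` and `summarize_file` are made up for illustration, and the regular expressions only assume the message wording shown in the log lines above; other corosync versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns derived from the corosync messages quoted in the log above.
LINK_DOWN = re.compile(r"\[KNET\s*\] link: host: (\d+) link: (\d+) is down")
TOKEN_LATE = re.compile(r"\[TOTEM\s*\] Token has not been received in (\d+) ms")


def summarize(lines):
    """Return (link-down count per host id, list of token delays in ms)."""
    down_per_host = Counter()
    token_delays = []
    for line in lines:
        match = LINK_DOWN.search(line)
        if match:
            down_per_host[int(match.group(1))] += 1
            continue
        match = TOKEN_LATE.search(line)
        if match:
            token_delays.append(int(match.group(1)))
    return down_per_host, token_delays


def summarize_file(path):
    """Convenience wrapper (hypothetical helper): summarize a log file on disk."""
    with open(path, errors="replace") as handle:
        return summarize(handle)
```

Calling `summarize_file("/var/log/syslog")` on each node and comparing the per-host counts might show whether the link drops cluster around particular nodes (hosts 3, 5 and 6 in the excerpt above) or are spread evenly across the cluster.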