PVE6 - corosync 3 issues

lcaron

Hi,


After upgrading our 4-node cluster from PVE 5 to 6, we experience constant crashes (once every 2 days).

Those crashes seem related to corosync.

Since numerous users are reporting such issues (broken cluster after upgrade, instabilities, ...), I wonder if it is possible to downgrade corosync to version 2.4.4 without impacting functionality?

Basic steps would be:

On all nodes:

# systemctl stop pve-ha-lrm

Once done, on all nodes:

# systemctl stop pve-ha-crm

Once done, on all nodes:

# apt-get install corosync=2.4.4-pve1 libcorosync-common4=2.4.4-pve1 libcmap4=2.4.4-pve1 libcpg4=2.4.4-pve1 libqb0=1.0.3-1~bpo9 libquorum5=2.4.4-pve1 libvotequorum8=2.4.4-pve1

Then, once corosync has been downgraded, on all nodes:

# systemctl start pve-ha-lrm
# systemctl start pve-ha-crm

Would that work?

Thanks
 
What's your current pveversion? (pveversion -v)
Please also provide the syslog (/var/log/syslog) or journal output for the time frame in which these issues happened.

A simple package downgrade will not work as there were config changes with the switch to corosync 3 which are incompatible with corosync 2.
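
To see which transport-related settings a given config actually contains, something like the following could be used. This is only a sketch: the function name `corosync_transport_keys` is made up for illustration, and the choice of keys (`transport`, `bindnetaddr`, `mcastaddr`, `mcastport`, `linknumber`, `ringN_addr`, all documented in corosync.conf(5)) is my assumption about where the two major versions are most likely to differ, not a complete list of the incompatibilities.

```shell
#!/bin/sh
# List transport-related settings from a corosync.conf, with line numbers.
# The default path is the one used on PVE nodes in this thread.
# The set of keys is an assumption, not an exhaustive compatibility check.
corosync_transport_keys() {
    conf=${1:-/etc/pve/corosync.conf}
    grep -nE '^[[:space:]]*(transport|bindnetaddr|mcastaddr|mcastport|linknumber|ring[0-9]+_addr)[[:space:]]*:' "$conf"
}
```

Calling `corosync_transport_keys` without arguments reads /etc/pve/corosync.conf; passing a path lets you compare a saved copy of the old config with the current one.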
 
Hi,

Here you go:

# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.21-1-pve)
pve-manager: 6.0-7 (running version: 6.0-7/28984024)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-4.15: 5.4-8
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.4.134-1-pve: 4.4.134-112
pve-kernel-4.4.98-6-pve: 4.4.98-107
pve-kernel-4.4.40-1-pve: 4.4.40-82
pve-kernel-4.4.8-1-pve: 4.4.8-52
pve-kernel-2.6.32-45-pve: 2.6.32-174
pve-kernel-2.6.32-37-pve: 2.6.32-150
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-2.6.32-19-pve: 2.6.32-96
pve-kernel-2.6.32-17-pve: 2.6.32-83
pve-kernel-2.6.32-16-pve: 2.6.32-82
pve-kernel-2.6.32-14-pve: 2.6.32-74
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-10-pve: 2.6.32-63
pve-kernel-2.6.32-7-pve: 2.6.32-60
pve-kernel-2.6.32-6-pve: 2.6.32-55
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-8
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
 
Please also provide the previous syslogs. In this one there's nothing regarding corosync issues.
Also provide the corosync config please. (/etc/pve/corosync.conf)
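
For collecting the previous syslogs, a small helper along these lines might save some time. It is a sketch under assumptions: the name `corosync_lines` is hypothetical, the file layout is assumed to be the usual Debian rotation (`syslog`, `syslog.1`, `syslog.2.gz`, ...), and it relies on `zgrep`, which reads both plain and gzip-compressed files.

```shell
#!/bin/sh
# Print every corosync line from the given log files (plain or .gz).
# -H prefixes each match with its file name, so the origin stays visible.
corosync_lines() {
    zgrep -H 'corosync\[' "$@"
}
# Example (not run here):
#   corosync_lines /var/log/syslog* > corosync_syslog.txt
```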
 
Here you go.

Thanks
 

Attachments

  • corosync_conf.txt (655 bytes)
  • corosync_syslog_6.txt (94.7 KB)
  • corosync_syslog_5.txt (94.7 KB)
  • corosync_syslog_4.txt (233.9 KB)
  • corosync_syslog_3.txt (247.1 KB)
  • corosync_syslog_2.txt (11.6 KB)
  • corosync_syslog_1.txt (23.2 KB)
  • corosync_syslog.txt (32.8 KB)
  • corosync_syslog_7.txt (94.7 KB)
Code:
oxmox-vty-002 corosync[15827]:   [TOTEM ] Token has not been received in 84 ms
syslog.4.gz:Sep 12 21:50:53 proxmox-vty-002 corosync[15827]:   [TOTEM ] Token has not been received in 10117 ms
syslog.4.gz:Sep 12 21:50:58 proxmox-vty-002 corosync[15827]:   [TOTEM ] Token has not been received in 15177 ms
syslog.4.gz:Sep 12 21:51:03 proxmox-vty-002 corosync[15827]:   [TOTEM ] Token has not been received in 20237 ms
syslog.4.gz:Sep 12 21:51:08 proxmox-vty-002 corosync[15827]:   [TOTEM ] Token has not been received in 25297 ms
syslog.4.gz:Sep 12 21:51:09 proxmox-vty-002 corosync[15827]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
syslog.4.gz:Sep 12 21:51:11 proxmox-vty-002 corosync[15827]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
syslog.4.gz:Sep 12 21:51:12 proxmox-vty-002 corosync[15827]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
syslog.4.gz:Sep 12 21:51:13 proxmox-vty-002 corosync[15827]:   [TOTEM ] Token has not been received in 30357 ms
syslog.4.gz:Sep 12 21:51:14 proxmox-vty-002 corosync[15827]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
syslog.4.gz:Sep 12 21:51:15 proxmox-vty-002 corosync[15827]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
syslog.4.gz:Sep 12 21:51:17 proxmox-vty-002 corosync[15827]:   [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
syslog.4.gz:Sep 12 21:51:18 proxmox-vty-002 corosync[15827]:   [TOTEM ] Token has not been received in 35417 ms

Besides the fact that we all have issues with the corosync 3 sh*tshow, your main problem at least seems to be the network.

Please grab all dmesg and syslog output regarding your network cards.
As we can see, ping takes 35 seconds, which is probably a new world record.

Seriously, are your machines, by any chance, networked with carrier pigeons? :)
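
To put a number on the token delays in a log excerpt like the one above, a rough sketch follows. The function name `token_gaps` is made up for illustration. Note that the "Token has not been received" message comes from corosync's totem token timer rather than from an ICMP ping, so the value is the time without a token, not a packet round trip.

```shell
#!/bin/sh
# Summarise "Token has not been received in N ms" warnings from
# syslog-style input (files given as arguments, or stdin).
token_gaps() {
    awk '
        match($0, /Token has not been received in [0-9]+ ms/) {
            # Skip the 31-character prefix "Token has not been received in "
            # and drop the trailing " ms" to get the number.
            ms = substr($0, RSTART + 31, RLENGTH - 34) + 0
            if (ms > max) max = ms
            n++
        }
        END { printf "warnings=%d max_ms=%d\n", n, max }
    ' "$@"
}
# Example (not run here):
#   zcat /var/log/syslog.4.gz | token_gaps
```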
 
Sure... the network was working fine _before_ migrating to Proxmox 6, and suddenly (without any network change) the packets need 35 seconds to travel from one port to another...

I'm more inclined to think the token has not been received _because_ corosync caused trouble before...
 
Could you provide some more information about your setup? Which NICs are you using? Is corosync running on a separate network (a different network and a different switch from all other traffic, not just a different VLAN)?
 
Corosync is running on a different VLAN on the same bond as production traffic.

This setup has been running rock-stable since Proxmox 4.
 
It was, and still is, not recommended to run the corosync traffic over the same network as other traffic.
Corosync requires low latency, which is hard to guarantee with other traffic on the same network. Try separating it; 1G NICs/switches should be enough, as it does not require high bandwidth.
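
A quick way to check latency between two nodes on the corosync network could look like the sketch below. The function name `check_rtt` is hypothetical, the default limit of 2 ms is an assumed value that you should adjust to whatever your setup requires, and the parsing assumes the summary line printed by the Linux iputils `ping`.

```shell
#!/bin/sh
# Read `ping -q` output on stdin and compare the maximum round-trip time
# with a limit in milliseconds (first argument; default 2 is an assumption).
# Prints a verdict; returns 0 when within the limit, 1 when above it,
# and 2 when no summary line was found.
check_rtt() {
    awk -v limit="${1:-2}" '
        # iputils prints: rtt min/avg/max/mdev = 0.045/0.052/0.061/0.006 ms
        /min\/avg\/max/ {
            split($0, halves, " = ")
            split(halves[2], t, "/")
            found = 1; avg = t[2]; max = t[3] + 0
        }
        END {
            if (!found) { print "no rtt summary found"; exit 2 }
            ok = (max <= limit + 0)
            printf "avg=%s ms max=%s ms limit=%s ms: %s\n", avg, max, limit, (ok ? "ok" : "too slow")
            exit (ok ? 0 : 1)
        }'
}
# Example (not run here; 10.0.0.2 is a placeholder for a peer's corosync address):
#   ping -c 100 -i 0.2 -q 10.0.0.2 | check_rtt 2
```

Running it while production traffic is at its peak is more telling than running it on an idle network, since a single slow reply is enough to push the maximum over the limit.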
 
Hi,
We did separate corosync traffic from the rest.
Our setup is now (on each node):
2x10G LACP for production traffic
2x10G LACP for cluster traffic

We still experience corosync failures:
Oct 7 04:23:57 proxmox-siege-001 corosync[13778]: [KNET ] link: host: 4 link: 0 is down
Oct 7 04:23:57 proxmox-siege-001 corosync[13778]: [KNET ] host: host: 4 (passive) best link: 1 (pri: 1)
Oct 7 04:23:59 proxmox-siege-001 corosync[13778]: [KNET ] rx: host: 4 link: 0 is up
Oct 7 04:23:59 proxmox-siege-001 corosync[13778]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 7 04:31:42 proxmox-siege-001 corosync[13778]: [TOTEM ] Token has not been received in 84 ms
Oct 7 06:49:37 proxmox-siege-001 corosync[13778]: [KNET ] link: host: 4 link: 1 is down
Oct 7 06:49:37 proxmox-siege-001 corosync[13778]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 7 06:49:38 proxmox-siege-001 corosync[13778]: [KNET ] rx: host: 4 link: 1 is up
Oct 7 06:49:38 proxmox-siege-001 corosync[13778]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 7 08:00:17 proxmox-siege-001 corosync[13778]: [KNET ] link: host: 4 link: 1 is down
Oct 7 08:00:17 proxmox-siege-001 corosync[13778]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 7 08:00:18 proxmox-siege-001 corosync[13778]: [KNET ] rx: host: 4 link: 1 is up
Oct 7 08:00:18 proxmox-siege-001 corosync[13778]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 7 09:57:28 proxmox-siege-001 systemd[1]: corosync.service: Main process exited, code=killed, status=11/SEGV
Oct 7 09:57:28 proxmox-siege-001 systemd[1]: corosync.service: Failed with result 'signal'.
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [MAIN ] Corosync Cluster Engine 3.0.2-dirty starting up
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [TOTEM ] Initializing transport (Kronosnet).
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [TOTEM ] kronosnet crypto initialized: aes256/sha256
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [TOTEM ] totemknet initialized
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [KNET ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [SERV ] Service engine loaded: corosync configuration map access [0]
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [QB ] server name: cmap
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [SERV ] Service engine loaded: corosync configuration service [1]
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [QB ] server name: cfg
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [QB ] server name: cpg
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [SERV ] Service engine loaded: corosync profile loading service [4]
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [SERV ] Service engine loaded: corosync resource monitoring service [6]
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [WD ] Watchdog not enabled by configuration
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [WD ] resource load_15min missing a recovery key.
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [WD ] resource memory_used missing a recovery key.
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [WD ] no resources configured.
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [SERV ] Service engine loaded: corosync watchdog service [7]
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [QUORUM] Using quorum provider corosync_votequorum
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [QB ] server name: votequorum
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [QB ] server name: quorum
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 0)
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [KNET ] host: host: 3 has no active links
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [KNET ] host: host: 3 has no active links
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [KNET ] host: host: 3 has no active links
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [KNET ] host: host: 2 has no active links
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [KNET ] host: host: 2 has no active links
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [KNET ] host: host: 2 has no active links
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 0)
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [KNET ] host: host: 1 has no active links
...
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [KNET ] host: host: 4 has no active links
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 7 09:57:47 proxmox-siege-001 corosync[29665]: [KNET ] host: host: 4 has no active links
Oct 7 09:57:49 proxmox-siege-001 corosync[29665]: [KNET ] rx: host: 2 link: 0 is up
...
Oct 7 09:57:49 proxmox-siege-001 corosync[29665]: [QUORUM] This node is within the primary component and will provide service.
Oct 7 09:57:49 proxmox-siege-001 corosync[29665]: [QUORUM] Members[4]: 1 2 3 4
Oct 7 09:57:49 proxmox-siege-001 corosync[29665]: [MAIN ] Completed service synchronization, ready to provide service.

Corosync was automatically restarted by monit.
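
To summarise a log like the one above, here is a rough sketch that counts knet "link down" events per host/link and abnormal exits of corosync.service. The function name `knet_summary` is made up for illustration, and the patterns are taken from the message formats shown in this thread, so other corosync or systemd versions may word them differently.

```shell
#!/bin/sh
# Count knet "link ... is down" events per host/link, plus corosync.service
# exits that systemd reports as killed or dumped, from syslog-style input
# (files given as arguments, or stdin).
knet_summary() {
    awk '
        /\[KNET *\] link: host: [0-9]+ link: [0-9]+ is down/ {
            # Use "host: N link: M" as the counter key.
            match($0, /host: [0-9]+ link: [0-9]+/)
            down[substr($0, RSTART, RLENGTH)]++
        }
        /corosync\.service: Main process exited, code=(killed|dumped)/ { crashes++ }
        END {
            for (k in down) printf "%s down=%d\n", k, down[k]
            printf "crashes=%d\n", crashes
        }
    ' "$@" | sort
}
# Example (not run here):
#   knet_summary /var/log/syslog
```

If one host or link clearly dominates the counts, that points at a specific NIC, cable or switch port; evenly spread flaps on all links would fit a problem on the local node instead.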
 
