Corosync upgrade udpu vs knet

Helmo

I've upgraded corosync in my cluster in preparation for the upgrade to Proxmox 6.x (following https://pve.proxmox.com/wiki/Upgrade_from_5.x_to_6.0).

I did have some difficulties ... but it worked.

After upgrading the first node I got:
Code:
# pvecm status
Cannot initialize CMAP service

Then it started logging this very frequently:

Code:
Mar 10 22:19:12 host pmxcfs[13939]: [quorum] crit: quorum_initialize failed: 2
Mar 10 22:19:12 host pmxcfs[13939]: [confdb] crit: cmap_initialize failed: 2
Mar 10 22:19:12 host pmxcfs[13939]: [dcdb] crit: cpg_initialize failed: 2
Mar 10 22:19:12 host pmxcfs[13939]: [status] crit: cpg_initialize failed: 2
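
Those messages just mean that pmxcfs cannot reach corosync at all; with hindsight, checking the corosync service directly would have shown the real problem right away, e.g.:
Code:
# is corosync actually running?
systemctl status corosync
# and why did it stop or fail to start?
journalctl -u corosync -b --no-pager | tail -n 50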

After that I also found an earlier log:
Code:
Mar 10 22:06:26 host corosync[13992]:   [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Mar 10 22:06:26 host corosync[13992]:   [MAIN  ] Please migrate config file to nodelist.
Mar 10 22:06:26 host corosync[13992]:   [MAIN  ] parse error in config: crypto_cipher & crypto_hash are only valid for the Knet transport.
Mar 10 22:06:26 host corosync[13992]:   [MAIN  ] Corosync Cluster Engine exiting with status 8 at main.c:1386.

After some research I found "transport: udpu" in corosync.conf which needed to be changed to "transport: knet". Thanks to https://forum.proxmox.com/threads/unicast.56141/
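
The change itself was tiny. Stripped down, the relevant part of /etc/pve/corosync.conf ends up roughly like this (cluster name and config_version are placeholders, not my real values):
Code:
totem {
  cluster_name: examplecluster
  # must be incremented on every edit
  config_version: 12
  ip_version: ipv4
  secauth: on
  # changed from 'udpu'
  transport: knet
  version: 2
}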

Maybe a nice extra check to add in the pve5to6 tool?
There were multicast issues some years ago when setting up the cluster, probably the reason why udpu was set. It seems to work now ...

It's a bit noisy in the log though. The lines below are repeated every minute ... is that normal?

Code:
Mar 10 22:42:09 hostname corosync[24995]:   [KNET  ] pmtud: Starting PMTUD for host: 1 link: 0
Mar 10 22:42:09 hostname corosync[24995]:   [KNET  ] udp: detected kernel MTU: 1500
Mar 10 22:42:09 hostname corosync[24995]:   [KNET  ] pmtud: PMTUD completed for host: 1 link: 0 current link mtu: 1397
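
If it ever gets too chatty: corosync.conf(5) seems to document a knet_pmtud_interval option for the totem section that controls how often PMTUD runs (in seconds, if I read it right). I have not tried changing it myself, but it would look something like:
Code:
totem {
  # ... existing options ...
  # re-run path MTU discovery every 5 minutes instead of the default
  knet_pmtud_interval: 300
}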

The pve5to6 tool also suggested updating the ring0_addr of a few nodes to an IP (they still had a hostname).
Can we add a WARNING in the file itself that the config_version number needs to be incremented?
I'm not sure if this contributed to the trouble I had here, but I came across https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_edit_corosync_conf while investigating and manually updated it while also changing the transport value.
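
For anyone else landing here: the gist of that docs section, as I understood it, is to never edit the live file in place, but to work on a copy and only move it back once config_version has been bumped:
Code:
# procedure roughly as described in the docs linked above
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak
# edit the copy AND increment totem -> config_version
nano /etc/pve/corosync.conf.new
# activate it; pmxcfs propagates the file to all nodes
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf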
 
AFAICT this is flagged when running the pve5to6 check tool:
"Corosync transport explicitly set to '$transport' instead of implicit default!" results in an error message.

And how to manually modify corosync.conf is well explained in our docs; I would encourage everyone to read them before changing files manually.

The first log messages are related and are simply due to the cluster not being able to establish communication.

The last log block is hard to interpret as I don't know what the overall cluster status was at that time.
 
Thanks. I don't have a screen log of the check output from before, but I must have interpreted it as relating to having to do the corosync upgrade.

Maybe extend the current
"Corosync transport explicitly set to '$transport' instead of implicit default!"
with something like

"Change to 'knet' before upgrading to Corosync 3.x"

About the "how to manually modify the corosync.conf is well explained in our docs" ... That is very true, these docs are fine. I just did not think of it to look for them at this moment in time. A quick comment line at the top of that file would have helped to remember.

I've now added this to mine:
Code:
#
# Update totem -> config_version below when editing...
#
# See https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_edit_corosync_conf
 
Hello!
Finally I have upgraded my cluster to Proxmox 6 too, but I had some problems. First, pve5to6 showed me this error:
Code:
FAIL: Corosync transport explicitly set to 'udpu' instead of implicit default!
I had not used multicast in Corosync 2 with Proxmox 5, because some nodes of my cluster are located in different places and connected via slow networks (some nodes can talk to each other at 1 Gbit/s, others at 200 Mbit/s, some at about 50 Mbit/s, and some at just 15 Mbit/s). I had issues with multicast, but "udpu" worked well. I had 6 nodes in the cluster and planned to add 3 more after the upgrade.

I had read that knet in Corosync 3 does not use multicast anymore, so I switched to the default knet via UDP. I upgraded my nodes one by one, following https://pve.proxmox.com/wiki/Upgrade_from_5.x_to_6.0: first step, corosync itself on all nodes; second step, all the remaining packages, node by node. The first step went fine, but during the second step, when two nodes were already upgraded to Proxmox 6, I was hit by a corosync UDP storm. I later found that other users had reported the same storm.
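
Something like the following would at least have made the storm visible on the wire (the interface name here is just an example, and corosync uses UDP port 5405 by default, as far as I know):
Code:
# watch corosync traffic on the cluster interface (interface name is an example)
tcpdump -ni eth0 udp port 5405
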
At first I switched the knet transport in corosync.conf to the SCTP protocol. That helped me upgrade all my nodes, but the storm came back several times afterwards.
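
If I remember correctly that was the per-link knet_transport option described in corosync.conf(5); roughly:
Code:
totem {
  # ...
  transport: knet
  interface {
    linknumber: 0
    # run this knet link over SCTP instead of UDP
    knet_transport: sctp
  }
}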

Some days later I figured out how to switch corosync back to the udpu transport: in Corosync 3 you need to explicitly add three parameters to the totem section: "crypto_cipher: none", "crypto_hash: none" and "transport: udpu" (shown in isolation below). That made corosync work with udpu again for me. The pve5to6 script still shows the same error I mentioned above, but the cluster works. pve5to6 had just confused me at the start, as I thought Proxmox 6 could not run corosync with "transport: udpu". But it can. My cluster now contains 8 nodes and works much better with udpu than it did with knet... despite this warning in "...if all else fails":
Unicast is a technology for sending messages to a single network destination. In corosync, unicast is implemented as UDP-unicast (UDPU). Due to increased network traffic (compared to multicast) the number of supported nodes is limited, do not use it with more that 4 cluster nodes
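
In other words, the relevant additions to the totem section were just these three lines:
Code:
totem {
  # ...
  crypto_cipher: none
  crypto_hash: none
  transport: udpu
}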

Now the corosync.conf looks like this:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: Faraway1
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.58.10
  }
  node {
    name: Star1
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 192.168.58.12
  }
  node {
    name: local1
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 192.168.58.13
  }
  node {
    name: remote1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.58.8
  }
  node {
    name: remote2
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 192.168.58.14
  }
  node {
    name: remote3
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.58.15
  }
  node {
    name: local2
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.58.6
  }
  node {
    name: local3
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.58.7
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cloud1
  config_version: 44
  crypto_cipher: none
  crypto_hash: none
  interface {
    bindnetaddr: 192.168.58.0
    join: 150
    linknumber: 0
    ringnumber: 0
    token: 5000
    token_retransmits_before_loss_const: 10
  }
  ip_version: ipv4
  secauth: on
  transport: udpu
  version: 2
}
The Proxmox version is:
Code:
pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.140-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-6
pve-kernel-helper: 6.4-6
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-1
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.5-pve1~bpo10+1
But on some nodes some package versions are a bit older, as I updated my nodes one by one while fighting the storm over several days.

Conclusion: it's a pity that knet is not working for me, but I'm happy that udpu still works. Thanks for that! Maybe I missed something important with knet, but perhaps my story will help somebody.
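
For anyone trying the same, what I keep an eye on after every config change is simply:
Code:
# quorum and membership as Proxmox sees it
pvecm status
# corosync's own log while the change propagates
journalctl -u corosync -f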
 
