latest update for proxmox community edition 8.3.5 broke corosync

lix

New Member
Apr 4, 2025
I am running Proxmox 8.3.5. After the latest update I receive the following message in a loop, every 2-5 seconds, on all cluster nodes:

Apr 04 22:17:04 prx-01 corosync[1488]: [KNET ] udp: Received packet on ifindex 1664013570 when expected ifindex 3
Apr 04 22:17:04 prx-01 corosync[1488]: [KNET ] udp: Received packet on ifindex 1664013570 when expected ifindex 3
Apr 04 22:17:04 prx-01 corosync[1488]: [KNET ] udp: Received packet on ifindex -714920235 when expected ifindex 3
Apr 04 22:17:03 prx-01 corosync[1488]: [KNET ] udp: Received packet on ifindex 1664013570 when expected ifindex 3
Apr 04 22:17:03 prx-01 corosync[1488]: [KNET ] udp: Received packet on ifindex 1664013570 when expected ifindex 3

The Proxmox cluster has 4 nodes.

Corosync version -
corosync/stable,now 3.1.9-pve1 amd64 [installed,automatic]
cluster engine daemon and utilities

How can I debug and solve this? Could you please help?
 
How can I debug and solve this?

Step zero: restart corosync on all nodes, one node at a time: systemctl restart corosync.service ; systemctl status corosync.service
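That rolling restart can be sketched as a small script. The node names are taken from the log sample above, so adjust them to your cluster; by default it only prints the plan, and you pass "run" to actually execute over ssh:

```shell
#!/bin/sh
# Sketch: restart corosync one node at a time.
# $1 = space-separated node list, $2 = "run" to actually execute via ssh.
rolling_restart() {
    for node in $1; do
        if [ "$2" = run ]; then
            # Restart, then show quorum state before moving to the next node.
            ssh "root@$node" 'systemctl restart corosync.service && corosync-quorumtool -s'
        else
            # Dry run: print the command that would be executed.
            echo "ssh root@$node 'systemctl restart corosync.service && corosync-quorumtool -s'"
        fi
    done
}

rolling_restart "prx-01 prx-02 prx-03 prx-04"   # dry run: prints the plan
```

Review the printed plan first, then re-run with `run` as the second argument.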

Step one: verify that all rings are actually both enabled and connected, similar to this:
Code:
~# corosync-cfgtool   -n
Local node ID 3, transport knet
nodeid: 2 reachable
   LINK: 0 udp (10.3.16.6->10.3.16.9) enabled connected mtu: 1397
   LINK: 1 udp (10.11.16.6->10.11.16.9) enabled connected mtu: 1397

nodeid: 4 reachable
   LINK: 0 udp (10.3.16.6->10.3.16.10) enabled connected mtu: 1397
   LINK: 1 udp (10.11.16.6->10.11.16.10) enabled connected mtu: 1397

nodeid: 6 reachable
   LINK: 0 udp (10.3.16.6->10.3.16.7) enabled connected mtu: 1397
   LINK: 1 udp (10.11.16.6->10.11.16.7) enabled connected mtu: 1397
...
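With more nodes that output gets long, so a quick filter can flag any link that is not both enabled and connected (a sketch; the function just wraps awk and reads the cfgtool output on stdin):

```shell
#!/bin/sh
# Flag corosync-cfgtool links that are not both "enabled" and "connected".
# Pipe the output of `corosync-cfgtool -n` into this function.
check_links() {
    awk '/LINK:/ && !(/enabled/ && /connected/) { bad = 1; print "NOT OK:", $0 }
         END { exit bad }'
}
```

Usage: `corosync-cfgtool -n | check_links` prints nothing and exits 0 when every link is healthy, and prints the offending LINK lines otherwise.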

Disclaimer: just random thoughts...
 
I am having similar problems. The error I get differs slightly, but it is caused by the same piece of code: the function check_dst_addr_is_valid in libknet/transport_udp.c from the kronosnet package. This check was added to kronosnet last year and only found its way into Proxmox a few days ago.

I am running a 5-node cluster with some mesh networking between the nodes, which is what is triggering the warning (as the comment for the function says, this is indeed "weird routing"...)

At this time, the only workaround I've found is to set the minimum log level for corosync to error (syslog_priority: error in the logging section), but I'm not too fond of that. If I don't do it though, all nodes spam the warning continuously, filling the log partition and saturating my Graylog's index as well.
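For reference, that workaround is a one-line change in the logging section of corosync.conf; a minimal sketch (on Proxmox the file lives at /etc/pve/corosync.conf, and config_version must be bumped when editing it):

```
logging {
  to_syslog: yes
  # Drop everything below error severity, which suppresses the KNET
  # warnings. Side effect: legitimate warning/notice messages are lost too.
  syslog_priority: error
}
```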
 
Could you describe your network setup/routing in more detail? Thanks!
 
My setup has 6 physical network interfaces on each node.

Two of these interfaces are configured using Linux bonding, then connected to a VLAN-aware bridge interface on which the VMs are connected and another, non-VLAN aware bridge through which the hosts have network access. This is the "front" network.

The other interfaces are used to connect each node directly to the others. A dummy0 network interface bears the node's address on this "back" network. FRR is configured for OSPFv3 with point-to-point connections through these interfaces, letting another node route between two peers if the direct link is down. If all 4 interfaces are down on a node, routing occurs through the "front" network instead. This back network carries Ceph traffic and VM migrations.

Corosync is configured with two rings. ring0 goes through the back network, while ring1 goes through the front network.
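A sketch of what such a two-ring layout looks like in corosync.conf (node name and addresses here are hypothetical; ring0_addr would be the address held by dummy0 on the back network, ring1_addr the node's front-network address):

```
nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.13    # back network (address on dummy0)
    ring1_addr: 192.0.2.13   # front network (bridge)
  }
  # ... one node {} block per cluster member
}
```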

In my case, the problem is caused by kronosnet expecting packets to arrive on the dummy0 network interface, while they in fact arrive on one of the 4 back interfaces (and could arrive via the front bridge if the back network were down).
 
Hello everyone,

We hit the same issue this morning after upgrading to version 1.30-pve1 of libknet1 and 3.1.9-pve1 of corosync, libvotequorum8, libquorum5, libcmap4, libcfg7, libcpg4 and libcorosync-common4.
(The issue may be due to libknet1 alone, but since the update pulled all these packages together, I am listing them all.)

Downgrading those packages fixed the issue:
Code:
apt-get install libknet1=1.28-pve1 corosync=3.1.7-pve3 libcfg7=3.1.7-pve3 libcmap4=3.1.7-pve3 libcorosync-common4=3.1.7-pve3 libcpg4=3.1.7-pve3 libquorum5=3.1.7-pve3 libvotequorum8=3.1.7-pve3
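To keep a routine `apt upgrade` from immediately pulling the new versions back in, the downgraded packages can be pinned; a sketch using an apt preferences file (alternatively, `apt-mark hold` on the same package list works too):

```
# /etc/apt/preferences.d/corosync-downgrade
Package: libknet1
Pin: version 1.28-pve1
Pin-Priority: 1001

Package: corosync libcfg7 libcmap4 libcorosync-common4 libcpg4 libquorum5 libvotequorum8
Pin: version 3.1.7-pve3
Pin-Priority: 1001
```

Remember to remove the pin (or the hold) once a fixed libknet1 is released.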

As ebenoit pointed out, the issue seems to be that libknet now checks a packet's origin and does so by associating an IP with a single network interface.
That looks incorrect, as in a mesh network the IP is shared across interfaces (see the interfaces configuration below).

We have a three-node cluster with mesh network. The error messages appeared for both rings despite corosync-cfgtool reporting that everything was fine.

Here is a sample (modified for privacy) of /etc/network/interfaces of our node1 for one of the rings:
Code:
auto eno399np0
iface eno399np0 inet static
        address  10.1.1.1/24
        up ip route add 10.1.1.2/32 dev eno399np0
        down ip route del 10.1.1.2/32
# MESH Corosync node1 <-> node2

auto eno409np1
iface eno409np1 inet static
        address  10.1.1.1/24
        up ip route add 10.1.1.3/32 dev eno409np1
        down ip route del 10.1.1.3/32
# MESH Corosync node1 <-> node3
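For reference, the mirrored half of this mesh on node2 would look like this (an assumption extrapolated from the node1 sample above; interface names are hypothetical):

```
auto eno399np0
iface eno399np0 inet static
        address  10.1.1.2/24
        up ip route add 10.1.1.1/32 dev eno399np0
        down ip route del 10.1.1.1/32
# MESH Corosync node2 <-> node1

auto eno409np1
iface eno409np1 inet static
        address  10.1.1.2/24
        up ip route add 10.1.1.3/32 dev eno409np1
        down ip route del 10.1.1.3/32
# MESH Corosync node2 <-> node3
```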

Best regards
 
My guess is that somebody assumed that if an IP address is configured on an interface, then any packets addressed to it must come in through that interface, and implemented this check in knet.
But that assumption is not correct, and it does not hold for many real-world setups.

This is only a guess, it could be wrong, but would make sense on our machines:
Code:
[KNET  ] udp: Received packet from 2a01:1234::12 to 2a01:1234::13 on i/f enp123s0f1 when expected lo

The IP-address 2a01:1234::13 is on the loopback lo interface:

Code:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 10.0.0.13/32 scope global lo
       valid_lft forever preferred_lft forever
    inet6 2a01:1234::13/128 scope global
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever

On the other interfaces we have no non-link-local IP addresses:
Code:
5: enp123s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP group default qlen 1000
    link/ether xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
    inet6 fe80::xx:xx:xx:xx/64 scope link
       valid_lft forever preferred_lft forever
 