Corosync - Cluster retransmit issues | pmxcfs / corosync synchronization problems | Proxmox cluster 3 nodes

kd-infradijon
New Member
Apr 10, 2026
FRANCE
Hello,

I am opening a forum discussion because I am experiencing an issue with the Proxmox cluster in my environment.

I installed Proxmox VE 9.1.6 on each node and everything went very smoothly, right up until the cluster was created.
I’ll provide you with all the relevant information about my environment, followed by service logs, a record of actions taken, and so on...


Environment

  • Proxmox VE version: Proxmox VE 9.1.6
  • Kernel version: 6.17.13-2-pve
  • Proxmox VE cluster: 3 nodes
  • Transport: knet
  • Nodes:
    • pve1: 10.100.37.250
    • pve2: 10.100.37.251
    • pve3: 10.100.37.252

The 3 Proxmox nodes are installed on 3 Dell PowerEdge XR4510c servers with the following dedicated network ports:

  • NIC 1 (10 Gbps Ethernet Connection) --> Dedicated for the VMs network
  • NIC 2 (10 Gbps Ethernet Connection) --> Dedicated for the host and cluster network (web access on port 8006, SSH access, and corosync communication for the PVE cluster) [the 10.100.37.0/27 network]
  • NIC 3 & NIC 4 (10 Gbps optical connection) --> Dedicated for the Ceph cluster, which we have not installed yet.

Actions taken


Here are the actions we have taken:
  • Installed and configured Proxmox VE properly on each node --> OK
  • Added the Basic license on each node --> OK
  • Updated each node with the Enterprise repository --> OK
  • Configured the network interfaces --> OK

We restarted the servers multiple times, of course, and everything was OK; each node was stable.

Then we created the cluster through the UI of the first node [pve1] in "Datacenter --> Cluster --> Create Cluster --> Select the 10.100.37.0/27 interface", copied the join information, and joined the cluster from the 2 other nodes.

From that point on, the nodes became unstable, with issues including UI access (complete loss of web access; only SSH still works) and intermittent problems with the cluster.



Logs of services and some diagnostics

In the logs we see retransmit list messages:
Apr 09 14:44:04 pve2-cdcserris corosync[4045]: [TOTEM ] Retransmit List: 1f 20 22 23 24 25 26 27 28 33

Apr 09 14:43:59 pve3-cdcserris corosync[3860]: [TOTEM ] Retransmit List: 28 1d

Apr 09 14:35:19 pve1-cdcserris corosync[3405]: [TOTEM ] Retransmit List: 15 16 17 19

It seems network issues are putting Proxmox services into uninterruptible (D) state:

Apr 09 14:31:14 pve2-cdcserris kernel: task:pvestatd state:D stack:0 pid:1296 tgid:1296 ppid:1 flags:0x00004002

Apr 09 14:41:24 pve1-cdcserris kernel: task:pvescheduler state:D stack:0 pid:4380 tgid:4380 ppid:1349 flags:0x00004002
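A quick way to spot such hung tasks on a node is to filter the process list for state D; a minimal sketch (assuming a Linux `ps` from procps-ng, as on Proxmox):

```shell
# List tasks in uninterruptible sleep ("D"), the state the kernel reported
# for pvestatd and pvescheduler above. Empty output means no hung tasks.
ps -eo pid,stat,comm --no-headers | awk '$2 ~ /^D/ {print $1, $3}'
```

`pvestatd` or `pvescheduler` stuck in D usually means they are blocked on pmxcfs, which in turn is blocked on corosync traffic.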
 
Hi,
have you checked yet to see if you're experiencing network errors (drops)?
Code:
ip -s link show nic2 # change nic2 to your interface with corosync.

I understand that you might want to combine Corosync, SSH, and the WebUI. For stable operation, please keep in mind that Corosync is very sensitive (see pvecm_cluster_requirements). A separate, dedicated network interface would be best. As a backup, you could also use the management/SSH NIC, for example.

However, I assume you don’t have any additional interfaces available, so please check whether you’re experiencing any packet loss on the Proxmox node (or switch if possible).
 
Could it be an MTU issue? Check for an MTU mismatch: ensure that the MTU is consistent across all nodes and switches.
Check the switch for increased latency/drops.
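One way to test for a path-MTU problem is a do-not-fragment ping at full frame size; a sketch, with the payload computed for a 1500-byte MTU (the target address is just an example from this thread):

```shell
# A 20-byte IPv4 header plus an 8-byte ICMP header leave 1472 bytes of
# payload in a 1500-byte MTU.
MTU=1500
PAYLOAD=$((MTU - 28))
echo "probe payload: $PAYLOAD bytes"
# Run on one node against another (example address); if this fails while a
# plain ping succeeds, some hop mishandles full-size frames:
# ping -M do -s "$PAYLOAD" -c 3 10.100.37.251
```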
 
Hi,
have you checked yet to see if you're experiencing network errors (drops)?
Code:
ip -s link show nic2 # change nic2 to your interface with corosync.

I understand that you might want to combine Corosync, SSH, and the WebUI. For stable operation, please keep in mind that Corosync is very sensitive (see pvecm_cluster_requirements). A separate, dedicated network interface would be best. As a backup, you could also use the management/SSH NIC, for example.

However, I assume you don’t have any additional interfaces available, so please check whether you’re experiencing any packet loss on the Proxmox node (or switch if possible).

Hello Abamalu,

Thank you for your help.

We have no drops or errors on the NIC2 interface:
ip -s link show nic2
3: nic2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP mode DEFAULT group default qlen 1000
link/ether c4:d6:d3:5f:07:65 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
1776849 4299 0 0 0 17
TX: bytes packets errors dropped carrier collsns
4770 70 0 0 0 0
altname enp137s0f1np1
altname enxc4d6d35f0765

Same on the switch's port side.

We also tried to create the cluster on a dedicated network, without the UI management and SSH access, but the same behavior was observed.

Regards,
IDEZ Ugo
 
Could it be an MTU issue? Check for an MTU mismatch: ensure that the MTU is consistent across all nodes and switches.
Check the switch for increased latency/drops.
Hello YaZoal,

Thank you for your help.

The MTU is the same on every interface of each node:

ip link | grep mtu
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
2: nic1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr2 state UP mode DEFAULT group default qlen 1000
3: nic2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP mode DEFAULT group default qlen 1000
4: nic3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr1 state UP mode DEFAULT group default qlen 1000
5: nic4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr1 state UP mode DEFAULT group default qlen 1000
6: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
7: vmbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
8: vmbr2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000


Regards,
IDEZ Ugo
 
Hello IDEZ,
Alright. Also, if you haven't already, double-check that the MTU settings are consistent between the nodes in the cluster.
I usually find it easier to verify the configuration by generating a system report from each node using the command:
pvereport > $(hostname)-report.txt
Once generated, copy the reports from all nodes to a single location and compare them using a tool like vimdiff or meld [0]:
Bash:
~# vimdiff pve1-pve-report.txt pve2-pve-report.txt pve3-pve-report.txt
Look for any discrepancies in the network configurations.
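Collecting the reports can be scripted; a sketch that assumes the node hostnames from this thread and passwordless root SSH between the nodes:

```shell
# Pull a pvereport from every node into the current directory, ready for
# vimdiff. The ssh line is commented out so the sketch runs anywhere.
nodes="pve1-cdcprod pve2-cdcprod pve3-cdcprod"
for host in $nodes; do
  out="$host-report.txt"
  echo "collecting $out"
  # ssh root@"$host" pvereport > "$out"
done
```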

NIC 2 (10 Gbps Ethernet Connection) --> Dedicated for the host and cluster network (web access on port 8006, SSH access, and corosync communication for the PVE cluster) [the 10.100.37.0/27 network]
Just for the sake of completeness, as it is probably not the cause of the issue here: Proxmox VE defaults to using the management network (the IP linked to your node in /etc/hosts) for both cluster communication and migration traffic, so it's advised to configure a separate migration network:
  • To configure the migration network on a different interface from the Corosync interface, go to:
    Datacenter → Options → Migration Settings. Check the wiki section 5.14.2 Migration Network for details [1].
[0]
vimdiff: https://www.freecodecamp.org/news/compare-two-files-in-linux-using-vim/
meld: https://opensource.com/article/20/3/meld
[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_migration_network
 
Hello IDEZ,
Alright. Also, if you haven't already, double-check that the MTU settings are consistent between the nodes in the cluster.
I usually find it easier to verify the configuration by generating a system report from each node using the command:
pvereport > $(hostname)-report.txt
Once generated, copy the reports from all nodes to a single location and compare them using a tool like vimdiff [0]:
Bash:
~# vimdiff pve1-pve-report.txt pve2-pve-report.txt pve3-pve-report.txt
Look for any discrepancies in the network configurations.


Just for the sake of completeness, as it is probably not the cause of the issue here: Proxmox VE defaults to using the management network (the IP linked to your node in /etc/hosts) for both cluster communication and migration traffic, so it's advised to configure a separate migration network:
  • To configure the migration network on a different interface from the Corosync interface, go to:
    Datacenter → Options → Migration Settings. Check the wiki section 5.14.2 Migration Network for details [1].
[0] https://www.freecodecamp.org/news/compare-two-files-in-linux-using-vim/
[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_migration_network
I ran vimdiff between the different reports; there are no differences in the versions of the dependencies.

I changed the migration network too, thank you for the advice.

Regards,
IDEZ Ugo
 
Please also check /etc/corosync/corosync.conf. Are the addresses correct, i.e. the ones you wanted? Compare the nodes.
If possible please share from 2 nodes on this cluster:
Code:
~# pvecm status
~# cat /etc/pve/corosync.conf
Also collect and share journalctl output, for example:
Code:
~# journalctl -b -u corosync -u pve-cluster -u pveproxy -u pvedaemon | gzip > log.txt.gz
from any 2 nodes of this cluster.
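Once exported, the journal can be mined for when the retransmits cluster in time; a sketch (the printf sample stands in for the real export, e.g. `zcat log.txt.gz`):

```shell
# Count Retransmit List events per minute; bursts at specific times often
# line up with other network activity on the same link.
printf '%s\n' \
  'Apr 09 14:44:04 pve2 corosync[4045]: [TOTEM ] Retransmit List: 1f 20' \
  'Apr 09 14:44:59 pve2 corosync[4045]: [TOTEM ] Retransmit List: 22 23' \
  'Apr 09 14:45:10 pve2 corosync[4045]: [TOTEM ] Retransmit List: 28 1d' \
  | grep 'Retransmit List' \
  | awk '{print $1, $2, substr($3, 1, 5)}' | sort | uniq -c | sort -rn
```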

The investigation in the thread below might help:
 
Hello,

Here is the output from the pve1 and pve2 nodes:

PVE1 :
Cluster information
-------------------
Name: cluster-cdcprod
Config Version: 3
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Apr 13 15:09:52 2026
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000002
Ring ID: 1.96a
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.100.37.251
0x00000002 1 10.100.37.250 (local)
0x00000003 1 10.100.37.252
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1-cdcprod
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.100.37.250
  }
  node {
    name: pve2-cdcprod
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.100.37.251
  }
  node {
    name: pve3-cdcprod
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.100.37.252
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cluster-cdcprod
  config_version: 3
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  token_coefficient: 125
  version: 2
}

PVE2 :
Cluster information
-------------------
Name: cluster-cdcprod
Config Version: 3
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Apr 13 15:09:59 2026
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.96a
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.100.37.251 (local)
0x00000002 1 10.100.37.250
0x00000003 1 10.100.37.252
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1-cdcprod
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.100.37.250
  }
  node {
    name: pve2-cdcprod
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.100.37.251
  }
  node {
    name: pve3-cdcprod
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.100.37.252
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cluster-cdcprod
  config_version: 3
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  token_coefficient: 125
  version: 2
}


I have attached the 2 compressed files with the logs for both nodes, and I will take a look at the topic you referenced.
And please know that I truly appreciate your time and help. It's really kind of you.

Best regards,
IDEZ Ugo
 

Attachments

Hmm, I didn't find anything concrete, but could you try running corosync on another NIC?
Also, make sure the current 10.100.37.x IPs are not used as a gateway or for routing.
Code:
~# grep "has no active links" pve1-log.txt
Apr 13 15:01:37 pve1-cdcprod corosync[1451]:   [KNET  ] host: host: 1 has no active links
Apr 13 15:01:37 pve1-cdcprod corosync[1451]:   [KNET  ] host: host: 1 has no active links
Apr 13 15:01:37 pve1-cdcprod corosync[1451]:   [KNET  ] host: host: 1 has no active links
Apr 13 15:01:37 pve1-cdcprod corosync[1451]:   [KNET  ] host: host: 3 has no active links
Apr 13 15:01:37 pve1-cdcprod corosync[1451]:   [KNET  ] host: host: 3 has no active links
Apr 13 15:01:37 pve1-cdcprod corosync[1451]:   [KNET  ] host: host: 3 has no active links
Apr 13 15:02:03 pve1-cdcprod corosync[1451]:   [KNET  ] host: host: 3 has no active links
Apr 13 15:02:14 pve1-cdcprod corosync[1451]:   [KNET  ] host: host: 3 has no active links
Apr 13 15:04:01 pve1-cdcprod corosync[1451]:   [KNET  ] host: host: 3 has no active links
~# grep "has no active links" pve2-log.txt
Apr 13 15:01:41 pve2-cdcprod corosync[1449]:   [KNET  ] host: host: 2 has no active links
Apr 13 15:01:41 pve2-cdcprod corosync[1449]:   [KNET  ] host: host: 2 has no active links
Apr 13 15:01:41 pve2-cdcprod corosync[1449]:   [KNET  ] host: host: 2 has no active links
Apr 13 15:01:41 pve2-cdcprod corosync[1449]:   [KNET  ] host: host: 3 has no active links
Apr 13 15:01:41 pve2-cdcprod corosync[1449]:   [KNET  ] host: host: 3 has no active links
Apr 13 15:01:41 pve2-cdcprod corosync[1449]:   [KNET  ] host: host: 3 has no active links
Apr 13 15:03:07 pve2-cdcprod corosync[1449]:   [KNET  ] host: host: 3 has no active links
Are the 10.100.37.0/27 IPs real public IPs used for internet access?
They might have high latency, and corosync is sensitive to that. The recommendation is to keep latency below 5 ms for stable operation.
In any case, try removing any routing/gateway from this NIC and see if that helps.
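Checking for a stray default route on the corosync interface can be done by filtering `ip route` output; a sketch (the device names and sample route line are assumptions based on this thread):

```shell
# Succeeds (exit 0) when no default route uses the given device;
# reads `ip route` output on stdin.
check_no_default() {
  awk -v dev="$1" '
    $1 == "default" { for (i = 1; i <= NF; i++) if ($i == "dev" && $(i+1) == dev) bad = 1 }
    END { exit bad }'
}
# On a live node: ip route show | check_no_default vmbr0 && echo "vmbr0 clean"
printf 'default via 10.30.7.129 dev vmbr0\n10.100.37.0/27 dev vmbr1 proto kernel\n' \
  | check_no_default vmbr1 && echo "vmbr1 carries no default route"
```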

Also see:
 
Hmm, I didn't find anything concrete, but could you try running corosync on another NIC?
Also, make sure the current 10.100.37.x IPs are not used as a gateway or for routing.

Hello,

We also tried to install the cluster on the NIC2 interfaces instead of NIC1; nothing changed.
No, the interfaces are not routed, of course.
They are private IPs.

The latency between the nodes is very low, as you can see here:
ping -I vmbr0 -c 1000 10.100.37.251

PING 10.100.37.251 (10.100.37.251) from 10.100.37.250 vmbr0: 56(84) bytes of data.
64 bytes from 10.100.37.251: icmp_seq=1 ttl=64 time=0.082 ms
64 bytes from 10.100.37.251: icmp_seq=2 ttl=64 time=0.075 ms
64 bytes from 10.100.37.251: icmp_seq=3 ttl=64 time=0.056 ms
64 bytes from 10.100.37.251: icmp_seq=4 ttl=64 time=0.058 ms
64 bytes from 10.100.37.251: icmp_seq=5 ttl=64 time=0.065 ms
64 bytes from 10.100.37.251: icmp_seq=6 ttl=64 time=0.060 ms
64 bytes from 10.100.37.251: icmp_seq=7 ttl=64 time=0.070 ms
64 bytes from 10.100.37.251: icmp_seq=8 ttl=64 time=0.071 ms
64 bytes from 10.100.37.251: icmp_seq=9 ttl=64 time=0.069 ms
64 bytes from 10.100.37.251: icmp_seq=10 ttl=64 time=0.072 ms
64 bytes from 10.100.37.251: icmp_seq=11 ttl=64 time=0.072 ms
64 bytes from 10.100.37.251: icmp_seq=12 ttl=64 time=0.077 ms
64 bytes from 10.100.37.251: icmp_seq=13 ttl=64 time=0.061 ms
64 bytes from 10.100.37.251: icmp_seq=14 ttl=64 time=0.060 ms
64 bytes from 10.100.37.251: icmp_seq=15 ttl=64 time=0.072 ms
64 bytes from 10.100.37.251: icmp_seq=16 ttl=64 time=0.076 ms
64 bytes from 10.100.37.251: icmp_seq=17 ttl=64 time=0.063 ms
64 bytes from 10.100.37.251: icmp_seq=18 ttl=64 time=0.066 ms
64 bytes from 10.100.37.251: icmp_seq=19 ttl=64 time=0.069 ms
64 bytes from 10.100.37.251: icmp_seq=20 ttl=64 time=0.068 ms
64 bytes from 10.100.37.251: icmp_seq=21 ttl=64 time=0.076 ms
64 bytes from 10.100.37.251: icmp_seq=22 ttl=64 time=0.071 ms
64 bytes from 10.100.37.251: icmp_seq=23 ttl=64 time=0.072 ms
64 bytes from 10.100.37.251: icmp_seq=24 ttl=64 time=0.066 ms
64 bytes from 10.100.37.251: icmp_seq=25 ttl=64 time=0.060 ms
64 bytes from 10.100.37.251: icmp_seq=26 ttl=64 time=0.064 ms
64 bytes from 10.100.37.251: icmp_seq=27 ttl=64 time=0.080 ms
64 bytes from 10.100.37.251: icmp_seq=28 ttl=64 time=0.080 ms
64 bytes from 10.100.37.251: icmp_seq=29 ttl=64 time=0.058 ms
64 bytes from 10.100.37.251: icmp_seq=30 ttl=64 time=0.068 ms
64 bytes from 10.100.37.251: icmp_seq=31 ttl=64 time=0.077 ms
64 bytes from 10.100.37.251: icmp_seq=32 ttl=64 time=0.070 ms
64 bytes from 10.100.37.251: icmp_seq=33 ttl=64 time=0.073 ms
64 bytes from 10.100.37.251: icmp_seq=34 ttl=64 time=0.084 ms

Same for the tcpdump capture:
tcpdump -i vmbr0 -n udp port 5405 -c 20

tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vmbr0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
15:29:42.769154 IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472
15:29:42.769269 IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472
15:29:42.769271 IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472
15:29:42.769273 IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472
15:29:42.769275 IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472
15:29:42.769276 IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472
15:29:42.769277 IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472
15:29:42.769278 IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472
15:29:42.769280 IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472
15:29:42.769282 IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472
15:29:42.769283 IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472
15:29:42.769395 IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472
15:29:42.769397 IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472
15:29:42.769399 IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472
15:29:42.769400 IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472
15:29:42.769401 IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472
15:29:42.769402 IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472
15:29:42.769404 IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472
15:29:42.769405 IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472
15:29:42.769408 IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472
20 packets captured
195 packets received by filter
0 packets dropped by kernel
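The captured datagrams are all 1472 bytes, i.e. exactly full-MTU UDP payloads; summarizing the sizes from a saved capture makes that easy to see. A sketch (the printf sample stands in for live tcpdump output):

```shell
# Tally the "UDP, length N" values from tcpdump's text output. On a node:
#   tcpdump -i vmbr0 -n udp port 5405 -c 200 2>/dev/null | summarize_lengths
summarize_lengths() {
  awk '/UDP, length/ {print $NF}' | sort -n | uniq -c
}
printf '%s\n' \
  'IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472' \
  'IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 120' \
  'IP 10.100.37.251.5405 > 10.100.37.252.5405: UDP, length 1472' \
  | summarize_lengths
```

A heavy skew toward 1472-byte datagrams means corosync is sending full-size frames, so any hop that drops or fragments large frames would show up as retransmits.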

I'll take a look at the referenced topic, thank you for your time.

Best regards,
IDEZ Ugo
 
Please check the logs and configuration of all three nodes - these symptoms look like something is wired up or routed asymmetrically, which tends to get corosync confused.
 
Please check the logs and configuration of all three nodes - these symptoms look like something is wired up or routed asymmetrically, which tends to get corosync confused.
Hello,

The 3 nodes are on the same network, so no routing...
I also checked the network logs multiple times; no errors appear.

Regards,
IDEZ Ugo
 
well, if you want help with analysis, it would still be helpful to provide the needed information/logs/configs/.. ;)
 
Please also share /etc/network/interfaces:
~# cat /etc/network/interfaces
Yes, here it is:
cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface nic1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 10.30.7.151/27
    gateway 10.30.7.129
    bridge-ports nic1
    bridge-stp off
    bridge-fd 0

iface nic2 inet manual

iface nic3 inet manual

iface nic4 inet manual


source /etc/network/interfaces.d/*
 
Yes, here it is:
cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface nic1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 10.30.7.151/27
    gateway 10.30.7.129
    bridge-ports nic1
    bridge-stp off
    bridge-fd 0

iface nic2 inet manual

iface nic3 inet manual

iface nic4 inet manual


source /etc/network/interfaces.d/*
Where is your 10.100.37.x IP address?
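The config above only carries 10.30.7.151 on vmbr0, so it is worth checking where (if anywhere) the 10.100.37.x ring address is configured; a sketch using the standard Debian paths:

```shell
# Search the interface configs for the corosync ring subnet; falls through
# to a message when nothing matches (as with the file shown above).
grep -rn '10\.100\.37\.' /etc/network/interfaces /etc/network/interfaces.d/ 2>/dev/null \
  || echo "ring address not found in the interface configuration"
# Also check whether the kernel holds the address at all:
# ip -4 -o addr show | grep '10\.100\.37\.'
```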
 
I also provided log files in a previous message :).

No, you didn't - the third node is missing. Like I wrote, please provide the configs and logs from all three nodes (covering the same boot/time period). Please include the network configuration as well!