Corosync - Cluster retransmit issues | pmxcfs / corosync synchronization problems | Proxmox cluster 3 nodes

kd-infradijon

New Member
Apr 10, 2026
3
0
1
FRANCE
Hello,

I am opening a forum discussion because I am experiencing an issue with the Proxmox cluster in my environment.

I installed Proxmox VE 9.1.6 on each node and everything went very smoothly, right up until the cluster was created.
I’ll provide you with all the relevant information about my environment, followed by service logs, a record of actions taken, and so on...


Environment

  • Proxmox VE version: 9.1.6
  • Kernel version: 6.17.13-2-pve
  • Proxmox VE cluster: 3 nodes
  • Transport: knet
  • Nodes:
    • pve1: 10.100.37.250
    • pve2: 10.100.37.251
    • pve3: 10.100.37.252

The 3 Proxmox nodes are installed on 3 Dell PowerEdge XR4510c servers with the following dedicated network ports:

  • NIC 1 (10 Gbps Ethernet connection) --> Dedicated to the VM network
  • NIC 2 (10 Gbps Ethernet connection) --> Dedicated to the host and cluster network (web access on port 8006, SSH access, and corosync communication for the PVE cluster) [the 10.100.37.0/27 network]
  • NIC 3 & NIC 4 (10 Gbps optical connection) --> Dedicated to the Ceph cluster, which we have not installed yet.
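For context, here is a minimal /etc/network/interfaces sketch matching that layout; the interface and bridge names (nic1, nic2, vmbr0) are placeholders, not the real names from these hosts:

```
# NIC 1: VM traffic, bridged (hypothetical interface name "nic1")
auto vmbr0
iface vmbr0 inet manual
    bridge-ports nic1
    bridge-stp off
    bridge-fd 0

# NIC 2: host/cluster network -- web UI :8006, SSH, corosync
auto nic2
iface nic2 inet static
    address 10.100.37.250/27
```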

Actions taken


Here are the actions we have taken:
  • Installed and properly configured Proxmox VE on each node --> OK
  • Added the Basic subscription on each node --> OK
  • Updated each node from the Enterprise repository --> OK
  • Configured the network interfaces --> OK

We restarted the servers multiple times; everything was OK and each node was stable.

Then we created the cluster through the UI of the first node [pve1] via "Datacenter --> Cluster --> Create Cluster --> Select the 10.100.37.0/27 interface", copied the join information, and joined the cluster from the two other nodes.
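For reference, the same steps can be done from the CLI (`pvecm create` on the first node, `pvecm add` on the others), and `pvecm status` then shows whether the cluster is quorate. As a sketch, here is how the relevant line can be pulled out of a captured status dump; the sample text below is illustrative, not from our cluster:

```shell
# Extract the quorum flag from a captured `pvecm status` dump.
# The heredoc stands in for real output saved from any node.
cat <<'EOF' | awk -F':[ \t]*' '/^Quorate:/ {print "quorate=" $2}'
Quorum information
------------------
Nodes:            3
Quorate:          Yes
EOF
```

On a healthy 3-node cluster this should report `quorate=Yes`; if a node drops out of the totem ring, it flips to No and pmxcfs goes read-only.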

From that point on, the nodes became unstable: we completely lost access to the web UI (only SSH still works) and the cluster itself only works intermittently.



Logs of services and some diagnostics

In the logs we see retransmit list messages:
Apr 09 14:44:04 pve2-cdcserris corosync[4045]: [TOTEM ] Retransmit List: 1f 20 22 23 24 25 26 27 28 33

Apr 09 14:43:59 pve3-cdcserris corosync[3860]: [TOTEM ] Retransmit List: 28 1d

Apr 09 14:35:19 pve1-cdcserris corosync[3405]: [TOTEM ] Retransmit List: 15 16 17 19
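To put numbers on those retransmit lists, here is a small sketch that counts the sequence numbers corosync flagged per line, using the exact journal lines quoted above; a list that keeps growing between messages points at sustained packet loss:

```shell
# Count the sequence numbers in each corosync "Retransmit List" line.
# The three sample lines are copied verbatim from the journal excerpts above.
cat <<'EOF' | awk -F'Retransmit List: ' '{n = split($2, seq, " "); total += n; print n " packets retransmitted"} END {print "total: " total}'
Apr 09 14:44:04 pve2-cdcserris corosync[4045]: [TOTEM ] Retransmit List: 1f 20 22 23 24 25 26 27 28 33
Apr 09 14:43:59 pve3-cdcserris corosync[3860]: [TOTEM ] Retransmit List: 28 1d
Apr 09 14:35:19 pve1-cdcserris corosync[3405]: [TOTEM ] Retransmit List: 15 16 17 19
EOF
```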

It looks like the network issues are causing Proxmox services to hang in the D (uninterruptible sleep) state:

Apr 09 14:31:14 pve2-cdcserris kernel: task:pvestatd state:D stack:0 pid:1296 tgid:1296 ppid:1 flags:0x00004002

Apr 09 14:41:24 pve1-cdcserris kernel: task:pvescheduler state:D stack:0 pid:4380 tgid:4380 ppid:1349 flags:0x00004002
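To see which tasks are currently stuck in state D (like the pvestatd and pvescheduler tasks in the kernel traces above), something like this can be run on each node; on a healthy host the output is normally empty:

```shell
# List processes in uninterruptible sleep: state column starts with "D".
ps -eo pid,stat,comm --no-headers | awk '$2 ~ /^D/ {print $1, $3}'
```

pmxcfs services typically end up in D state when /etc/pve blocks, which in turn happens when corosync loses quorum or stalls, so this is consistent with the retransmit storm above.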
 
Hi,
have you checked yet to see if you're experiencing network errors (drops)?
Code:
ip -s link show nic2 # replace nic2 with the interface carrying corosync traffic

I understand that you might want to combine Corosync, SSH, and the web UI on one NIC. For stable operation, please keep in mind that Corosync is very sensitive to latency (see pvecm_cluster_requirements). A separate, dedicated network interface would be best; as a fallback link, you could then also use the management/SSH NIC, for example.
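If a second interface ever becomes available, knet supports redundant links per node. A sketch of the corosync.conf nodelist entry for pve1, assuming a hypothetical dedicated 10.0.0.0/24 corosync network as link 0 with the existing management network as fallback:

```
node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.250     # hypothetical dedicated corosync network
    ring1_addr: 10.100.37.250  # existing management network as fallback
}
```

The pve2/pve3 entries follow the same pattern; always bump `config_version` in the totem section and edit via /etc/pve/corosync.conf so the change propagates.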

However, I assume you don’t have any additional interfaces available, so please check whether you’re experiencing any packet loss on the Proxmox node (or switch if possible).
 
Could it be an MTU issue? Check for an MTU mismatch: ensure that the MTU is consistent across all nodes and switches.
Also check the switch for increased latency/drops.
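A quick way to compare MTUs once collected (e.g. from `ip link show nic2` on each node); the node/MTU pairs below are hypothetical stand-ins, not values from the actual hosts:

```shell
# Flag a mismatch if the nodes report more than one distinct MTU value.
cat <<'EOF' | awk '{if (!seen[$2]++) distinct++} END {print (distinct > 1 ? "MTU mismatch across nodes" : "MTU consistent")}'
pve1 1500
pve2 1500
pve3 9000
EOF
```

To probe the path between two nodes, `ping -M do -s 1472 <peer>` (1472 + 28 bytes of headers = 1500) will fail if anything along the path has a smaller MTU and refuses to fragment.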