Proxmox Ports randomly restart

ithrasiel

Aug 30, 2024
Hi everyone,

I have installed Proxmox 8.3.3 on my Intel NUCs and have been having massive problems with the networking for a long time. The network ports on both hosts restart randomly multiple times a day, especially while network traffic is higher (backups, writes to NFS, etc.).

This leads to lag on VMs / CTs and to failing replication and backup jobs.

The hardware is the following (2 identical systems):
Intel NUC 13th Gen
i7-1360P
64GB RAM
1x SATA 1TB SSD (enterprise)
1x NVMe 1TB SSD (consumer)

For storage I'm using two ZFS datastores, one per disk, which replicate the virtual machines to the other host.
Both hosts are connected to a switch via RJ45 at 2.5G. At first I thought the switch was my problem, but I have since replaced it with another model from a different manufacturer and the problems persist. I have tried the configuration with 1500 MTU (default) and with 9000 MTU; the problem is identical. At the moment the configuration is still running with 9000 MTU.
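
For reference, the per-host network config currently looks roughly like this (interface name and addresses are placeholders):
Code:
# /etc/network/interfaces (sketch) - single 2.5G uplink bridged for the VMs
auto enp86s0
iface enp86s0 inet manual
        mtu 9000

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.11/24
        gateway 192.168.1.1
        bridge-ports enp86s0
        bridge-stp off
        bridge-fd 0
        mtu 9000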

I'm also using a Synology DS923+ NAS as an NFS target for a bigger VM with an approx. 30TB disk. The NAS is connected to the switch via RJ45 at 10G.

Just as a side note: I have a virtualized pfSense running on this cluster which does the routing, DNS and everything else networking-related. But both hosts are still able to reach each other and have local DNS entries to prepare for a DNS outage, so this shouldn't have any impact.
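
The local entries are essentially just the two nodes resolving each other (the first hostname and the addresses are placeholders):
Code:
# /etc/hosts on each node (sketch) - so the nodes resolve each other even if the pfSense DNS is down
192.168.1.11    kd-node01
192.168.1.12    kd-node02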


Parts of the log from when a port restart was happening:
Code:
Feb 04 23:47:02 kd-node02 corosync[1871]:   [KNET  ] pmtud: Global data MTU changed to: 8885
Feb 04 23:47:03 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:03 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:03 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:03 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:04 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:04 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:04 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:04 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:04 kd-node02 pvestatd[1902]: Backup: error fetching datastores - 500 Can't connect to kd-pbs:8007 (Name or service not known)
Feb 04 23:47:04 kd-node02 pvestatd[1902]: status update time (10.150 seconds)
Feb 04 23:47:05 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:05 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:05 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:05 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:06 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:06 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:06 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:06 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:06 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:06 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:06 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:06 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:06 kd-node02 pve-ha-lrm[2052]: status change active => lost_agent_lock
Feb 04 23:47:06 kd-node02 corosync[1871]:   [KNET  ] link: host: 1 link: 0 is down
Feb 04 23:47:06 kd-node02 corosync[1871]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 04 23:47:06 kd-node02 corosync[1871]:   [KNET  ] host: host: 1 has no active links
Feb 04 23:47:07 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:07 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:07 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: -2069891428
Feb 04 23:47:07 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: -2069891428
Feb 04 23:47:07 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 0
Feb 04 23:47:07 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 0
Feb 04 23:47:07 kd-node02 corosync[1871]:   [TOTEM ] Token has not been received in 2250 ms
Feb 04 23:47:08 kd-node02 corosync[1871]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Feb 04 23:47:10 kd-node02 corosync[1871]:   [KNET  ] rx: host: 1 link: 0 is up
Feb 04 23:47:10 kd-node02 corosync[1871]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Feb 04 23:47:10 kd-node02 corosync[1871]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 04 23:47:10 kd-node02 teleport[1730]: 2025-02-04T23:47:10.619+01:00 WARN [UPLOAD:1]  The Instance connector is still not available, process-wide services such as session uploading will not function pid:1730.1 service/service.go:3136
Feb 04 23:47:10 kd-node02 corosync[1871]:   [QUORUM] Sync members[2]: 1 2

Thank you in advance for your help and kind regards
 
Hi,

The network ports on both hosts restart randomly multiple times a day, especially while network traffic is higher (backups, writes to NFS, etc.)
Both hosts are connected to a switch via RJ45 at 2.5G
Corosync is very latency-sensitive, i.e. it requires a link latency of 5 ms or below - see Cluster Network Requirements.
With high network traffic (esp. storage traffic), the latency can go up and thus corosync loses quorum - as can also be seen from the log.
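
For example, the link state, quorum and latency can be checked from either node while the traffic is high (the address below is a placeholder for the other node's corosync IP):
Code:
# corosync link status on this node
corosync-cfgtool -s

# quorum / membership overview
pvecm status

# rough latency check towards the other node during a backup run
ping -c 100 -i 0.2 192.168.1.12 | tail -n 2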

We generally recommend a dedicated link for corosync traffic to avoid these problems. For corosync traffic, an MTU of 1500 is also recommended, again to counter any latency.
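
A minimal sketch of how a dedicated corosync link could look, assuming a second NIC is available (interface name and addresses are placeholders):
Code:
# /etc/network/interfaces - dedicated corosync interface, default MTU
auto eno2
iface eno2 inet static
        address 10.10.10.2/24
        mtu 1500
# no gateway, no bridge - corosync traffic only

# then point the cluster at that network via ring0_addr in
# /etc/pve/corosync.conf, e.g.
#   node {
#     name: kd-node02
#     ring0_addr: 10.10.10.2
#   }
# and bump config_version in the totem section when editing the file.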

The hardware is the following (2 identical systems):
Do you have a QDevice configured? While possible, two-node clusters can present problems on reboots/hardware failures.
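
If not, one can be set up roughly like this (the IP is a placeholder for the external device):
Code:
# on the external device (e.g. a Raspberry Pi)
apt install corosync-qnetd

# on all cluster nodes
apt install corosync-qdevice

# then, from one of the nodes
pvecm qdevice setup 192.168.1.50
pvecm status    # should now list the QDevice vote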
 
Hi,

I ran this constellation for a long time with 1500 MTU and the scenario seemed identical. I do have a QDevice (a Raspberry Pi), which is also connected to the switch via 1G RJ45. Because of the physical design of the Intel NUC I only have one uplink; I agree it would be more best practice to use a dedicated link here, but I don't think that is my problem in this case. I have a Check_MK instance running on this cluster, and that is where I got the information that my NICs are not reachable. I get multiple alerts a day that several VMs, and the host they are running on, are not reachable. This matches the dmesg entries where the ports restart. If I had cluster problems because of the QDevice, I'm sure the VMs would be restarted on the other host, but the uptime has been fine the whole time (HA is configured).
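
For completeness, this is roughly how I have been looking at the resets on the hosts (the interface name is a placeholder):
Code:
# kernel messages around the link resets
journalctl -k | grep -iE 'enp86s0|link is (up|down)'

# driver / firmware of the onboard NIC
ethtool -i enp86s0

# error / drop / reset counters of the NIC
ethtool -S enp86s0 | grep -iE 'error|drop|reset'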

Thank you for the answer :) Maybe you have another idea of where I can continue my research?