Hi everyone,
I have Proxmox 8.3.3 installed on my Intel NUCs and have been having massive networking problems for a long time. The ports on both hosts restart randomly multiple times a day, especially when network traffic is higher (backups, writes to NFS, etc.).
This leads to lag in the VMs/CTs and to failing replication and backup jobs.
The hardware is as follows (2 identical systems):
Intel NUC 13th Gen
i7-1360P
64GB RAM
1x SATA 1TB SSD (Enterprise)
1x NVMe 1TB SSD (Consumer)
For storage I'm using two ZFS datastores, one per disk, which replicate the virtual machines to the other host.
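For context, the replication jobs are set up with pvesr, roughly like this (the VM ID, job number and schedule are just examples, and I'm using kd-node01 as the other node's name here):

Code:
# one replication job per guest, pushing snapshots to the other node every 15 minutes
pvesr create-local-job 100-0 kd-node01 --schedule "*/15"
# check the jobs and the result of the last sync
pvesr status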
Both hosts are connected to a switch via RJ45 at 2.5G. At first I thought the switch was the problem, but I have since swapped it for a different model from another manufacturer and the problem persists. I have tried the configuration with 1500 MTU (the default) and with 9000 MTU; the problem is identical. At the moment the configuration is still running with 9000 MTU.
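For reference, the MTU is set on both the physical NIC and the bridge in /etc/network/interfaces, along these lines (interface name and addresses are placeholders, not my exact config):

Code:
auto enp86s0
iface enp86s0 inet manual
        mtu 9000

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.2/24
        gateway 192.168.1.1
        bridge-ports enp86s0
        bridge-stp off
        bridge-fd 0
        mtu 9000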
I'm also using a Synology DS923+ NAS as an NFS target for a bigger VM with a roughly 30 TB disk. The NAS is connected to the switch via RJ45 at 10G.
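The NFS export is registered as a normal PVE storage via pvesm, roughly like this (storage ID, IP and export path are placeholders):

Code:
# register the Synology NFS export as a VM disk store
pvesm add nfs synology-nfs --server 192.168.1.10 --export /volume1/proxmox --content images
# verify the storage is online and reachable
pvesm status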
Just as a side note: I have a virtualized pfSense running on this cluster, which handles routing, DNS and everything else network-related. But both hosts can still reach each other and have local DNS entries as a fallback for a DNS outage, so this shouldn't have any impact.
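The fallback entries are simply static lines in /etc/hosts on both nodes, something like this (IPs are placeholders):

Code:
# static entries so the cluster and PBS stay resolvable during a pfSense/DNS outage
192.168.1.2   kd-node01
192.168.1.3   kd-node02
192.168.1.20  kd-pbs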
Here are parts of the log from a moment when a port restart was happening:
Code:
Feb 04 23:47:02 kd-node02 corosync[1871]: [KNET ] pmtud: Global data MTU changed to: 8885
Feb 04 23:47:03 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:03 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:03 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:03 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:04 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:04 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:04 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:04 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:04 kd-node02 pvestatd[1902]: Backup: error fetching datastores - 500 Can't connect to kd-pbs:8007 (Name or service not known)
Feb 04 23:47:04 kd-node02 pvestatd[1902]: status update time (10.150 seconds)
Feb 04 23:47:05 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:05 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:05 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:05 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:06 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:06 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:06 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:06 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:06 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:06 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:06 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:06 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:06 kd-node02 pve-ha-lrm[2052]: status change active => lost_agent_lock
Feb 04 23:47:06 kd-node02 corosync[1871]: [KNET ] link: host: 1 link: 0 is down
Feb 04 23:47:06 kd-node02 corosync[1871]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 04 23:47:06 kd-node02 corosync[1871]: [KNET ] host: host: 1 has no active links
Feb 04 23:47:07 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:07 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 9
Feb 04 23:47:07 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: -2069891428
Feb 04 23:47:07 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: -2069891428
Feb 04 23:47:07 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 0
Feb 04 23:47:07 kd-node02 pmxcfs[1846]: [dcdb] crit: cpg_send_message failed: 0
Feb 04 23:47:07 kd-node02 corosync[1871]: [TOTEM ] Token has not been received in 2250 ms
Feb 04 23:47:08 kd-node02 corosync[1871]: [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Feb 04 23:47:10 kd-node02 corosync[1871]: [KNET ] rx: host: 1 link: 0 is up
Feb 04 23:47:10 kd-node02 corosync[1871]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Feb 04 23:47:10 kd-node02 corosync[1871]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 04 23:47:10 kd-node02 teleport[1730]: 2025-02-04T23:47:10.619+01:00 WARN [UPLOAD:1] The Instance connector is still not available, process-wide services such as session uploading will not function pid:1730.1 service/service.go:3136
Feb 04 23:47:10 kd-node02 corosync[1871]: [QUORUM] Sync members[2]: 1 2
Thanks in advance for your help, and kind regards