Random network glitches on VM

A_L_E_X

Hi,

Recently I started experiencing a lot of issues with a Docker VM hosted on Proxmox, which runs the majority of my home services.
The same VM has been running just fine for multiple years and has moved across 3 different physical hosts without a problem - until the last few weeks.

I get many random connectivity glitches to this particular VM - occasional drops that self-restore after 5/10/15 minutes.
Rebooting the Proxmox host does seem to make the glitches rarer, but after a week of uptime they happen every couple of hours.

Setup: a small cluster of 3 nodes. Lately I keep only two of them up, as there is no good cooling in the rack.
The only thing changed on the VM is the addition of a PCI passthrough GPU (RTX A2000).

Some high-level observations:
- Connections from the PVE host to the VM work 100% of the time - no drops, no broken pipes.
- Connections from my laptop to the VM drop quite often. When connecting from the laptop to PVE and then to the VM - all good.
alex@onewithforce:~$ ssh alex@192.168.1.11
ssh: connect to host 192.168.1.11 port 22: Connection refused
alex@onewithforce:~$ ssh alex@192.168.1.11
ssh: connect to host 192.168.1.11 port 22: Connection refused
alex@onewithforce:~$ date
Fri Aug 2 10:07:52 PM EEST 2024
alex@docker:~$ client_loop: send disconnect: Broken pipe

Fri Aug 2 10:07:55 PM EEST 2024
root@docker:/etc/ssh# ip a l | grep 192.
inet 192.168.1.11/24 brd 192.168.1.255 scope global enp6s18
root@docker:/etc/ssh#
When the glitches start, the VM is only reachable from the PVE host itself - not from my laptop or from any other LXCs/VMs running on the other PVE host.
I tried disabling the pve-firewall - no effect.
root@pve:~# pve-firewall status
Status: disabled/stopped
root@pve:~# date
Fri Aug 2 10:19:18 PM EEST 2024
root@pve:~#
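
Since the VM stays reachable from the PVE host but not from anything else on the LAN, this smells like an L2/ARP problem. A rough check I plan to run from the laptop during the next glitch (just a sketch - wlp3s0 is a placeholder for the laptop's interface name):

ip neigh show | grep 192.168.1.11          # is there a MAC entry, and is it REACHABLE/STALE/FAILED?
sudo arping -I wlp3s0 -c 4 192.168.1.11    # does the VM still answer ARP at all?

If the neighbour entry goes FAILED and arping gets no reply while the host-to-VM connection still works, that would point at the host bridge/NIC rather than at the guest.
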
dmesg does not show anything useful. I have seen the vmbr0 messages before as well, and they have not been an issue.
[ 33.955990] tap112i0: entered promiscuous mode
[ 33.968276] vmbr0: port 5(tap112i0) entered blocking state
[ 33.968280] vmbr0: port 5(tap112i0) entered disabled state
[ 33.968296] tap112i0: entered allmulticast mode
[ 33.968342] vmbr0: port 5(tap112i0) entered blocking state
[ 33.968343] vmbr0: port 5(tap112i0) entered forwarding state
[ 54.665645] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000000)
[ 165.084973] tap112i0: left allmulticast mode
[ 165.084988] vmbr0: port 5(tap112i0) entered disabled state
[ 166.470584] tap112i0: entered promiscuous mode
[ 166.482798] vmbr0: port 5(tap112i0) entered blocking state
[ 166.482801] vmbr0: port 5(tap112i0) entered disabled state
[ 166.482812] tap112i0: entered allmulticast mode
[ 166.482854] vmbr0: port 5(tap112i0) entered blocking state
[ 166.482855] vmbr0: port 5(tap112i0) entered forwarding state
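
To rule out the bridge forgetting the guest, I also want to check whether vmbr0 still has the VM's MAC in its forwarding table while a glitch is active (sketch, run on the PVE host; aa:bb:cc:dd:ee:ff is a placeholder for the MAC of enp6s18):

bridge fdb show br vmbr0 | grep -i aa:bb:cc:dd:ee:ff
bridge link show dev tap112i0              # the port should report "state forwarding"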

journalctl does not show much of interest, apart from:

Aug 02 18:45:54 pve corosync[1348]: [KNET ] link: host: 2 link: 0 is down
Aug 02 18:45:54 pve corosync[1348]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 02 18:45:54 pve corosync[1348]: [KNET ] host: host: 2 has no active links
Aug 02 18:45:54 pve corosync[1348]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Aug 02 18:45:54 pve corosync[1348]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 02 18:45:54 pve corosync[1348]: [KNET ] pmtud: Global data MTU changed to: 1397
Aug 02 19:11:52 pve corosync[1348]: [KNET ] link: host: 2 link: 0 is down
Aug 02 19:11:52 pve corosync[1348]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 02 19:11:52 pve corosync[1348]: [KNET ] host: host: 2 has no active links
Aug 02 19:11:53 pve corosync[1348]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Aug 02 19:11:53 pve corosync[1348]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 02 19:11:53 pve corosync[1348]: [KNET ] pmtud: Global data MTU changed to: 1397

root@pve:~# corosync-cmapctl -m stats | grep -i down
stats.knet.node1.link0.down_count (u32) = 0
stats.knet.node2.link0.down_count (u32) = 20
stats.knet.node3.link0.down_count (u32) = 1
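
The down_count of 20 on node2's link suggests the corosync link between the hosts is flapping as well, so I want to correlate those knet events with plain packet loss on the LAN. Rough sketch (192.168.1.12 is an assumed address for node2 - adjust to the real one):

while true; do
  ping -c 1 -W 1 192.168.1.12 > /dev/null || echo "$(date '+%F %T') node2 unreachable"
  sleep 1
done >> /root/node2-ping.log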

I have changed LAN cables and used different ports on the switch - so far no result. I plan to swap the switch as well on Sunday.
It does not look like the switch, as the connection from the laptop to the host is 100% reliable based on several days of stats without a single packet drop, while the VMs inside PVE keep dropping.
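
Next time it happens I also want to capture on the host side to see whether the laptop's ARP requests even reach the bridge/tap while the VM is unreachable (sketch; 192.168.1.50 stands in for the laptop's IP):

tcpdump -eni vmbr0 arp and host 192.168.1.11
tcpdump -eni tap112i0 arp or host 192.168.1.50

If the requests show up on vmbr0 but never on tap112i0, the problem sits in the host bridging/NIC; if they do not show up on vmbr0 at all, it is upstream of the host.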

My Plan B is to drop the cluster completely and see whether a standalone PVE instance shows the same behavior. I have to admit the cluster is not network-redundant yet and the network is a flat one. I am struggling to find the time to optimize it and introduce VLANs, but the first priority is to improve the reliability of the hosted services.

P.S.: Running the latest PVE:
root@pve:~# pveversion
pve-manager/8.2.4/faa83925c9641325 (running kernel: 6.8.8-3-pve)
root@pve:~#
The VM was restored from backup onto the new physical host. Nothing was changed on the network side - router, switches, etc. I only changed the LAN cables after the glitches started.
 
Update: I removed the cluster - unfortunately, same situation. I now suspect a vmbr issue with the new physical NIC. The previous host had an Intel one, the new one has a cheap Realtek; no extra drivers or configuration were done - just a clean new install and the VM restored from backup.
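
If it really is the Realtek NIC, the usual suspects are the hardware offloads. A sketch of what I plan to try (assuming the bridge port is called enp3s0 - check the bridge-ports line in /etc/network/interfaces for the real name):

ethtool -i enp3s0                              # confirm which driver is in use (likely r8169)
ethtool -K enp3s0 tso off gso off gro off      # disable offloads at runtime to test
# if it helps, make it persistent in /etc/network/interfaces:
#   post-up /sbin/ethtool -K enp3s0 tso off gso off gro off

The alternative would be the r8168-dkms driver or simply putting an Intel NIC back in, but disabling the offloads is the cheapest thing to test first.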
 
