Random Network Glitches on a VM

A_L_E_X

Hi,



I recently started experiencing a lot of issues with a Docker VM hosted in Proxmox, which hosts the majority of my home services.

The same VM has been running fine for multiple years and has moved across 3 different physical hosts without a problem; the trouble only started in the last few weeks.



I'm getting many random connectivity glitches to this particular VM - occasional drops that recover on their own after 5/10/15 minutes.

Rebooting the Proxmox host does seem to make the glitches rarer, but after a week of uptime they happen every couple of hours.





Setup: a small cluster of 3 nodes. Lately I keep only two of them up, as there is no good cooling in the rack.

The only thing that has changed on the VM is the addition of a PCI passthrough GPU (RTX A2000).
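Since the GPU passthrough is the only recent change, one thing I still want to rule out is the passed-through card sharing an IOMMU group with anything network-related on the host. The sysfs walk below is generic, not specific to my boxes:

root@pve:~# for d in /sys/kernel/iommu_groups/*/devices/*; do g=${d%/devices/*}; echo -n "IOMMU group ${g##*/}: "; lspci -nns "${d##*/}"; done

If the A2000 and the host NIC turned up in the same group, vfio would have to claim the whole group, which could destabilize the host-side networking - although host-to-VM traffic being solid makes that less likely.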



Some high-level observations:

- Connection from the PVE host to the VM works 100% of the time: no drops, no broken pipes.

- Connections from my laptop to the VM drop quite often. When connecting from the laptop to the PVE host and then hopping to the VM, everything is fine.



alex@onewithforce:~$ ssh alex@192.168.1.11
ssh: connect to host 192.168.1.11 port 22: Connection refused
alex@onewithforce:~$ ssh alex@192.168.1.11
ssh: connect to host 192.168.1.11 port 22: Connection refused
alex@onewithforce:~$ date
Fri Aug 2 10:07:52 PM EEST 2024
alex@docker:~$ client_loop: send disconnect: Broken pipe



Fri Aug 2 10:07:55 PM EEST 2024
root@docker:/etc/ssh# ip a l | grep 192.
inet 192.168.1.11/24 brd 192.168.1.255 scope global enp6s18
root@docker:/etc/ssh#
When the glitches start, the VM is only reachable from the PVE host itself - not from my laptop or from any LXC/VM running on the other PVE host.
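Next time a glitch hits, I also want to check from the laptop whether the VM's MAC still resolves while SSH is being refused, to tell a pure L2/ARP problem apart from something on the VM side. A sketch, assuming a Linux laptop; the wireless interface name is a guess:

alex@onewithforce:~$ ip neigh show 192.168.1.11
alex@onewithforce:~$ ping -c 3 192.168.1.11
alex@onewithforce:~$ sudo arping -c 3 -I wlp2s0 192.168.1.11   # iputils arping, interface name assumed

If a neighbour entry is present but the MAC doesn't match enp6s18 on the VM, that would point at an address conflict rather than the VM itself.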

I tried disabling the pve-firewall - no effect.

root@pve:~# pve-firewall status
Status: disabled/stopped
root@pve:~# date
Fri Aug 2 10:19:18 PM EEST 2024
root@pve:~#

dmesg does not show anything useful. I have seen the vmbr0 messages before as well, and they have not been an issue.

[ 33.955990] tap112i0: entered promiscuous mode
[ 33.968276] vmbr0: port 5(tap112i0) entered blocking state
[ 33.968280] vmbr0: port 5(tap112i0) entered disabled state
[ 33.968296] tap112i0: entered allmulticast mode
[ 33.968342] vmbr0: port 5(tap112i0) entered blocking state
[ 33.968343] vmbr0: port 5(tap112i0) entered forwarding state
[ 54.665645] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000000)
[ 165.084973] tap112i0: left allmulticast mode
[ 165.084988] vmbr0: port 5(tap112i0) entered disabled state
[ 166.470584] tap112i0: entered promiscuous mode
[ 166.482798] vmbr0: port 5(tap112i0) entered blocking state
[ 166.482801] vmbr0: port 5(tap112i0) entered disabled state
[ 166.482812] tap112i0: entered allmulticast mode
[ 166.482854] vmbr0: port 5(tap112i0) entered blocking state
[ 166.482855] vmbr0: port 5(tap112i0) entered forwarding state


journalctl does not show much of interest, apart from:



Aug 02 18:45:54 pve corosync[1348]: [KNET ] link: host: 2 link: 0 is down
Aug 02 18:45:54 pve corosync[1348]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 02 18:45:54 pve corosync[1348]: [KNET ] host: host: 2 has no active links
Aug 02 18:45:54 pve corosync[1348]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Aug 02 18:45:54 pve corosync[1348]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 02 18:45:54 pve corosync[1348]: [KNET ] pmtud: Global data MTU changed to: 1397
Aug 02 19:11:52 pve corosync[1348]: [KNET ] link: host: 2 link: 0 is down
Aug 02 19:11:52 pve corosync[1348]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 02 19:11:52 pve corosync[1348]: [KNET ] host: host: 2 has no active links
Aug 02 19:11:53 pve corosync[1348]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Aug 02 19:11:53 pve corosync[1348]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 02 19:11:53 pve corosync[1348]: [KNET ] pmtud: Global data MTU changed to: 1397


root@pve:~# corosync-cmapctl -m stats | grep -i down
stats.knet.node1.link0.down_count (u32) = 0
stats.knet.node2.link0.down_count (u32) = 20
stats.knet.node3.link0.down_count (u32) = 1
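Node 2's link flapping 20 times lines up with the journal above, so I'm going to leave something watching the corosync link state live and compare the timestamps against the VM drops. A minimal sketch:

root@pve:~# journalctl -u corosync -f | grep --line-buffered -Ei 'link.*down|no active links|joined'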


I have changed LAN cables and used different ports on the switch, so far with no result. I'm planning to swap the switch as well on Sunday.

It does not point to the switch, as the connection from the laptop to the host has been 100% reliable based on several days of stats with not a single packet drop, while the VMs inside PVE keep dropping.
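To put numbers behind "laptop to host is fine, laptop to VM drops", I'll keep a timestamped ping log running from the laptop against both targets at once. A minimal sketch; the PVE host address is a placeholder, the VM IP is the one from above:

#!/bin/bash
# Log one timestamped line per dropped ping, for the PVE host and the VM in parallel.
PVE_HOST=192.168.1.10    # placeholder - substitute the real PVE host IP
VM=192.168.1.11
for target in "$PVE_HOST" "$VM"; do
  ( while true; do
      ping -c 1 -W 1 "$target" >/dev/null || echo "$(date '+%F %T') drop $target"
      sleep 1
    done ) >> ~/pingwatch.log &
done
wait

If the host side stays clean while the VM side drops around the same minutes corosync complains, that narrows it down to the bridge/VM path rather than the physical link or the switch.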



My Plan B is to drop the cluster completely and see if a standalone PVE instance shows the same behavior. I have to admit the cluster is not network-redundant yet and the network is a flat one. I'm struggling to find the time to optimize it and introduce VLANs, but the first priority is to improve the reliability of the hosted services.

P.S.: Running the latest PVE:
root@pve:~# pveversion
pve-manager/8.2.4/faa83925c9641325 (running kernel: 6.8.8-3-pve)
root@pve:~#

P.S.: This is a duplicate of my previous thread, created because the original disappeared. Feel free to delete it as a duplicate.
 