VMs losing network

Drunkm0nk

New Member
May 23, 2025
Hi, for the past 2 days some of our VMs have been losing network connectivity. Nobody changed anything on the VMs, the hosts, or the physical switches.
The connection just drops, and yesterday it came back after an hour or so. This happens on different VMs on different hosts. We tried migrating them to other hosts, to no avail.
The issue came back today on a different VM: we see outgoing traffic but nothing incoming. One of our admins performed the steps below yesterday and assumed they fixed the issue, but it came back. If anyone has an idea how to get the VM to communicate, please help!

A CVE fix for the BIND9 package (version 1:9.18.33-1~deb12u2), which is used and integrated into our PVE 1 clusters, was installed. A Debian note was issued mentioning
a connectivity issue (blocking cache) which caused a network flow shutdown.

Remediation:
1- We cleared the cache, which released the blockage, and the entire platform and the tenants (VMs) returned to normal.
2- A comprehensive check was conducted and revealed that this version must be replaced by version 1:9.18.44-1~deb12u2.
3- The version was installed on the environment and tested successfully, resolving the issue.
 
Hi @Drunkm0nk , welcome to the forum.

It does not seem probable that upgrading the BIND9 package on a hypervisor would cause intermittent connectivity issues on random VMs.

You have not provided nearly enough information to get started. Here are some questions, but this is truly just a start:
- PVE version
- Cluster configuration
- Network configuration
- VM OS
- VM network configuration
- VM logs during the issue
- Network traces during the issue (you mentioned that you do not see incoming traffic - how and where was this determined? How did you determine the outgoing traffic is flowing? Did it make it to the destination? Were any replies sent from the destination? Did they get lost at a certain point in the network?)
- What is the state of the VM during the issue? Can you access it via a console? Can you run troubleshooting commands?
- What cache was cleared?
- The statement says that the entire platform returned to normal, in addition to the VM, so was more than the one VM affected?

Cheers,


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi bbgeek, thank you for getting back so quickly.
- PVE version: pve-manager/8.4.14/b502d23c55afcba1 (running kernel: 6.8.12-16-pve)
- Cluster configuration: 4 nodes in a cluster with HA
- Network configuration: OVS bridge, OVS bond with 2 NICs in active-backup mode on all 4 hosts.
- VM OS: Windows 2016
- VM network configuration: static IP with /26 mask
- VM logs during the issue: I cannot locate anything regarding the VMID in the logs
- Network traces during the issue (you mentioned that you do not see incoming traffic - how and where was this determined? How did you determine the outgoing traffic is flowing? Did it make it to the destination? Were any replies sent from the destination? Did they get lost at a certain point in the network?)
The VM cannot be reached from the outside, nor can it even reach its gateway. I can see the traffic in the VM summary but not within the VM.
- What is the state of the VM during the issue? Can you access it via a console? Can you run troubleshooting commands? The VM was on, and yes, we can connect locally to the VM using the console. We ran basic ping and tracert commands, but got nothing. We disabled the Windows firewall, and there is none in Proxmox.
- What cache was cleared? This is what I cannot find out how to do. Even after rebooting and migrating the VM, would it not be cleared?
- The statement says that entire platform returned to normal, in addition to the VM, so it was more than VM affected? Yes, we had around 4 VMs that lost networking yesterday, and out of nowhere it resumed. All VMs are on different VLANs and hosts.

Thank you for your time!
 
Were the Windows Event Logs checked? Can the VM ping/access another VM located within the same VLAN and on the same hypervisor?

Perhaps you should have a rolling network capture running, or be prepared to start one at various points in the network (hypervisor, bridge, VM, switch) when/if the issue repeats.
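
As a rough illustration of the rolling capture idea, here is a minimal Python sketch that only builds tcpdump command lines for a size-limited ring buffer of capture files. The helper name `rolling_capture_cmd`, the output directory, the file sizes, and the interface names (`tap100i0` for a VM with ID 100, `vmbr0`, `bond0`) are assumptions for the example, not values from this thread; adjust them to your own hosts.

```python
import shlex


def rolling_capture_cmd(iface, out_dir="/var/tmp", file_mb=100, file_count=10, snaplen=128):
    """Build a tcpdump command that writes a ring buffer of pcap files.

    Hypothetical helper: it only assembles the argument list, it does not run tcpdump.
    """
    return [
        "tcpdump",
        "-n",                    # do not resolve addresses to names
        "-i", iface,             # interface to capture on
        "-s", str(snaplen),      # bytes kept per packet (headers are usually enough)
        "-C", str(file_mb),      # start a new file after roughly this many megabytes
        "-W", str(file_count),   # keep at most this many files, overwriting the oldest
        "-w", f"{out_dir}/{iface}.pcap",
    ]


if __name__ == "__main__":
    # Assumed capture points: the VM's tap device, the bridge, and the bond.
    for iface in ("tap100i0", "vmbr0", "bond0"):
        print(shlex.join(rolling_capture_cmd(iface)))
```

Running one capture per point at the same time makes it possible to see afterwards at which hop the replies stop showing up.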

This sounds like a business environment with many moving pieces. None, so far, has pointed to PVE being the culprit. The symptoms are very generic.

When asking for help in the forum you are expected to do most of the legwork and be proactive with advanced troubleshooting steps. You will likely find that engaging a Network/OS/PVE specialist to assist you would be more efficient in your case.

There is an existing thread about VIRTIO driver versions for Windows networking, and how versions above a particular level can experience issues. You should check it out, but note that the symptoms there are not the same as yours.

Cheers,


 
That is what I am trying to establish: that it is not Proxmox related.
The Windows VM cannot ping anything, not even its gateway.
We had our network team look into layers 2 and 3 and all the switches.
We had our Windows sysadmins look into the VM.
Posting on this forum was my last resort, as I ran out of ideas and the pressure to get that VM to talk is getting intense.
Anyhow, thank you for your time and have a great day.
 
To finish on a positive note, we were able to pinpoint the issue on the physical switches. It appears someone removed the VLAN tagging in the fabric during lunch time and just walked away.
Thank you bbgeek17!
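
For anyone landing here with similar symptoms: one way to confirm this kind of fault is to look at frames captured on the uplink and check whether they still carry the expected 802.1Q tag. Below is a minimal Python sketch of reading the VLAN ID from a raw Ethernet frame. The function name `vlan_id` and the sample frames are made up for illustration, and reading frames out of a capture file is not shown.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q VLAN tag


def vlan_id(frame: bytes):
    """Return the 802.1Q VLAN ID of a raw Ethernet frame, or None if it is untagged."""
    if len(frame) < 16:
        return None  # too short to hold MAC addresses plus a VLAN tag
    # Bytes 0-11 are the destination and source MACs; the tag, if present, follows.
    tpid, tci = struct.unpack("!HH", frame[12:16])
    if tpid != TPID_8021Q:
        return None
    return tci & 0x0FFF  # the low 12 bits of the TCI are the VLAN ID


if __name__ == "__main__":
    macs = bytes(6) + bytes(6)  # placeholder destination and source MACs
    tagged = macs + struct.pack("!HH", TPID_8021Q, 42) + struct.pack("!H", 0x0800) + b"payload"
    untagged = macs + struct.pack("!H", 0x0800) + b"payload"
    print(vlan_id(tagged), vlan_id(untagged))
```

If frames for a VM that should be on a tagged VLAN show up untagged (or with a different ID) on the switch side, the problem is likely in the fabric rather than in the VM or the hypervisor.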