[SOLVED] Win2019 VMs losing network connectivity

AraToken · Apr 28, 2022

Hello,
i am currently trying to troubleshoot a problem with some of my windows server 2019 vms. They repeatedly lose network connectivity, regardless of the vm settings used. Rebooting the vm seems to solve the problem for a while until it reappears again. I already tried switching between virtio, e1000 and changed the machine mode to q35-5.1 and similar versions.
The proxmox-hostmachines are also all up to date.
It also seems that this problem is only present with my win 2019 VMs, other instances with Win10 or Win 7 run without problems.

Are there any known issues regarding win2019 and networking that i'm missing here? Or is there some other "basic settings" inside proxmox that i can try in order to fix it?

Thank you in advance for any help!

oguz · Apr 28, 2022

hi,

AraToken said:
They repeatedly lose network connectivity, regardless of the vm settings used.

could you describe the symptoms little more?

for example:
- is the network adapter in working state? (check the device manager in windows)
- can you still ping the gateway IP or a public IP address?

AraToken said:
ebooting the vm seems to solve the problem for a while until it reappears again. I already tried switching between virtio, e1000 and changed the machine mode to q35-5.1 and similar versions.

could you post your current VM configuration? qm config VMID (replace VMID with yours)

AraToken said:
Are there any known issues regarding win2019 and networking that i'm missing here?

nothing that i know of, probably network is misconfigured..

AraToken said:
some of my windows server 2019 vms

also please post the config of a working win2019 VM

AraToken · Apr 28, 2022

Hi!
thank you for your quick reply!

oguz said:
could you describe the symptoms little more?

As soon as the vms lose connection, windows shows "no internet access", i also cannot ping the gateway.

We use the following vm config:

Code:

qm config 104
boot: order=virtio0
cores: 4
machine: pc-q35-5.1
memory: 16088
net0: e1000=0A:F5:23:87:3F:85,bridge=vmbr2,firewall=1
numa: 0
ostype: win10
sata3: default:vm-104-disk-2,size=3500G
scsihw: virtio-scsi-pci
smbios1: uuid=e0af959f-15ee-41fe-9735-8097da0f5b4d
sockets: 4
virtio0: default:vm-104-disk-0,size=100G
vmgenid: f3453246-de00-4127-80fe-634cbaf1f284

oguz said:
nothing that i know of, probably network is misconfigured..

Network connectivity also works fine when the system is freshly bootet, so i assume that it is basically correctly configured.

oguz said:
also please post the config of a working win2019 VM

Sorry if i misspelled anything but i meant that this issue occures on all Win2019 VMs so we currently have no working 2019 box. I can, however, add a config of one of the working Win10 or Win7 instances if that helps.

oguz · Apr 28, 2022

AraToken said:
on all Win2019 VMs so we currently have no working 2019 box.

oh okay, from your post i thought only some of them aren't working

could you also check that the win2019 VMs have different MAC addresses for their network interfaces?

your network interface inside the vm:

Code:

net0: e1000=0A:F5:23:87:3F:85,bridge=vmbr2,firewall=1

is using the bridge vmbr2. could you show your /etc/network/interfaces file from your PVE node?

AraToken said:
can, however, add a config of one of the working Win10 or Win7 instances if that helps.

are those VMs also using the vmbr2 bridge?

also i noticed that you're using an e1000 adapter, this is not the recommended setting. maybe you should take a look at our windows guest best practices guide [0]

[0]: https://pve.proxmox.com/wiki/Windows_2019_guest_best_practices

ffabian · Apr 28, 2022

Same problem here with Server 2019 DE and 2022 DE, but also with Server 2022 ENG after adding a VLAN-ID

AraToken · Apr 28, 2022

Hi,

yes, every vm uses a different MAC address.

Here's the interface-config from our proxmox host:

Code:

cat /etc/network/interfaces

auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

auto enp94s0f0
iface enp94s0f0 inet manual

auto enp94s0f1
iface enp94s0f1 inet manual

auto bond0
iface bond0 inet manual
    bond-slaves enp94s0f0 enp94s0f1
    bond-miimon 100
    bond-mode active-backup
    bond-primary enp94s0f0

auto vmbr0
iface vmbr0 inet static
    address xxx.xxx.xxx.xxx/24
    gateway xxx.xxx.xxx.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0

auto vmbr1
iface vmbr1 inet manual
    bridge-ports none
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
#intern-nodewide

auto vmbr2
iface vmbr2 inet static
    address 10.10.10.10/24
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
#intern-clusterwide

vmbr2 is our bridge for clusterwide connection (We are running a 3 node Proxmox Cluster, sorry if i forgot to mention that, i thought this is not necessary because as mentioned before, other vms work perfectly fine).

Some of our other WIndows VMs (7 and 10) are also using vmbr2.

Currently i'm setting up another test box with your recommended settings without the e1000 but as i said i already tried switching to virtio with our boxes with no avail.
But i assume that it should still work with the e1000 aswell...

EDIT: Regarding ffabian's answer, could that issue maybe be related to the vlan aware of the bridge?

AraToken · May 6, 2022

Hello,

apparently switching to virtio did not solve the problem. We still experience disconnects with our windows machines.
Any news regarding my question if it may be related to vlan settings?

Thanks and best regards

ffabian · May 7, 2022

In our case a damaged arp table / a switch was the problem - not a proxmox bug as initially assumed. We had to flush the arp-table and reboot the switch. I'm sorry but i don't think this is related to your problem.

AraToken · Jun 20, 2022

Hello,

we still experience random disconnects within our vms. Switching the network devices (like virtio) did not solve it and we are unable to determine the cause of it within windows itself.

Is there any other way to perform troubleshooting on the linux bridge of the host to check if there are any problems with that? (even though i doubt it because problems with the bridge would result in all vms to lose connection i assume)

Thank you in advance and best regards.

oguz · Jun 20, 2022

AraToken said:
Is there any other way to perform troubleshooting on the linux bridge of the host to check if there are any problems with that? (even though i doubt it because problems with the bridge would result in all vms to lose connection i assume)

you could check the journalctl and syslog on your PVE node to see if there are any related error messages, but yes you're right, if there was a problem with the bridge itself then you would most likely experience the issue on all VMs (on non-windows ones as well).

AraToken said:
could that issue maybe be related to the vlan aware of the bridge?

you could test it and let us know if it makes a difference

AraToken · Jun 21, 2022

Hello Oguz,

i checked the logfiles of our PVE hosts but i cannot find any evidence of problems with the bridge.
We will disable the VLAN feature next and monitor the situation.
I still think it's weird that those disconnects only occur in our Windows 2019 machines.. If theres an issue with the vlans itself, other VMs should be affected aswell..

oguz · Jun 21, 2022

AraToken said:
If theres an issue with the vlans itself, other VMs should be affected aswell..

AraToken said:
I still think it's weird that those disconnects only occur in our Windows 2019 machines..

yes that's what i would think...
which makes it more likely to be something in the windows stack.

AraToken said:
As soon as the vms lose connection, windows shows "no internet access",

have you went through the windows troubleshooting menus?

does resetting the network adapter inside windows make a difference?

AraToken said:
Is there any other way to perform troubleshooting on the linux bridge of the host to check if there are any problems with that?

you can also try running tcpdump on your PVE machine, for example tcpdump -envi vmbr1 -w vmbr1.pcap or vmbr2 depending on which interface you want to watch, and use filters for your windows machines while viewing the packet dump (or for example in wireshark on windows)

AraToken · Jun 21, 2022

I tried different troubleshootings on the windows machines itself but resetting the network did not resolve the issue. Connectivity got restored after the whole windows machine was rebooted.
When i checked the windows event logs the only messages that occur are notifications that the network profile changed and (from our installed software) that the physical connection is not ready.

I can try to run a tcp dump on my host in case i see anything. Thank you for that tipp! But it could take a while until i see results at all because currently the disconnects appear at random intervals (the last two weeks, for example, everything worked just fine).

AraToken · Jul 12, 2022

Update:
We performed maintenance on our cluster and disabled the vlan feature on the bridge but the problem is still persistent. All nodes got updated to the current pve version as well (currently running: PVE 7.2-6 with kernel 5.15.39-1). This issue is giving me a serious headache..
Because of some of our policies i cannot provide a tcp dump of the bridge itself but at this point i somewhat doubt it that the bridge is the cause..

We might try - again - to use different network interface types, switching back to virtio instead of e1000 once more and hoping that we at least find some more clues about this, even though we tried those already.
According to my colleague, similar PVE-Cluster work fine, the main difference here being that they still run on PVE 6 instead of 7 but i can hardly believe that the newer version would have problems with Win 2019.

As i find myself in a somewhat-dead-end, i will continue looking for the cause on the windows-side of things.

Still, if there are any more tips or possible causes from any of you that may help with this i would really apprechiate it!

merkkg · Jul 13, 2022

Hi,

I have a large cluster and I have had similar issues with windows servers as well the version im using is windows server 2012 R2

I intially used to reboot the whole VM as well until I wrote a script for each interface that ran in a loop that pings another host in network or default gateway and if no ping response would disable and enable the interface from CLI

Issue has been there since pve 6 and still present on pve7

Code:

@echo off
set INTERFACE="vlan100"
set COUNT=1
set SLEEP=60
set IP=192.168.100.1
set LOG="C:\WD.log"

echo %DATE% %TIME%: Watchdog started %INTERFACE% >> %LOG%

:loop
rem Checking if have network connection and reset if down
timeout %SLEEP%
ping -n %COUNT% -w 1000 -l 0 %IP%
if %errorlevel% EQU 0 goto :loop

echo %DATE% %TIME%: Connection failed. Restarting interface %INTERFACE%.. >> %LOG%
netsh interface set interface %INTERFACE% disable
netsh interface set interface %INTERFACE% enable
goto :loop

NOTE: My windows VM's was migrated from vmware so I alwasy thought it could have been because of that.

AraToken · Jul 14, 2022

Thank you for this script merkkg! I will definetely keep that in mind as a workaround if we really cannot find the solution for this one.
Currently i am troubleshooting if there is a possible problem with one of my clusternodes as i noticed that all problematic vms sit on the same node, evne though everything else seems to work as expected.

merkkg · Jul 14, 2022

My VM's are on differnt nodes I tried moving them around a few times, it did not make difference at least in my case.

Noteq · Aug 4, 2022

Posting for exposure; we are also experiencing issues with Windows 2019 VM's losing connectivity on different proxmox hosts. No VLAN tagging to the VMs, single WAN uplink, single NIC on the VM.

AraToken · Aug 8, 2022

Update: apparently moving VMs to another host seems not to be the solution so i dont really think its a problem with one PVE host itself. The disconnects appeared on other clusternodes as well but still only on win 2019 machines.
I also tried reaching out for help or advise in microsoft forums but so far got successfully ignored.

Research on my windows machines seem to point to some event log messages regarding the NCSI service with error message "suspectArpProbeFailed" and therefore resulting in a network reset. But so far i'm not quite sure if that is the source (regarding windows itself) or just another symptome.

@Noteq and @merkkg did any of you notice something similar?

Best regards

merkkg · Aug 8, 2022

@AraToken I did something i never did before and I installed before which was installed the Qemu Guest Agent and drivers and since then i have not had a single issue anymore with my script stopped.

Im not sure why i didn't try that before but its been over 2 weeks with no incident.

https://pve.proxmox.com/wiki/Qemu-guest-agent#Windows

Fred

[SOLVED] Win2019 VMs losing network connectivity

Member

Proxmox Retired Staff

Member

Proxmox Retired Staff

Member

Member

Member

Member

Member

Proxmox Retired Staff

Member

Proxmox Retired Staff

Member

Member

Member

Member

Member

New Member

Member

Member