KVM Windows VMs and losing network connectivity

Dec 27, 2019
Update from my environment.
Open vSwitch
Proxmox 6.1-7
VirtIO drivers 0.1.173

I just experienced the disconnect issue a couple more times yesterday with a Windows Server 2019 VM. I notice it mostly on boot: *sometimes*, after a reboot, the NIC will fail to pass traffic. It happens on *some* reboots, but definitely not all, as in the past. Either rebooting the VM again, or disconnecting the NIC and reconnecting it (either from within Windows or from the Hardware tab in Proxmox), will get the NIC working again.
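For what it's worth, the disconnect/reconnect workaround can also be scripted on the Proxmox host with `qm set` and the `link_down` flag. The VM ID, NIC model, and MAC below are placeholders, not values from this thread:

```shell
#!/bin/sh
# Toggle the virtual link on net0 of a VM to kick a hung NIC back to
# life. link_down=1 corresponds to the GUI's "Disconnect" checkbox.
# Note that qm set replaces the whole net0 string, so the model, MAC,
# and bridge must be repeated exactly as they appear in the VM config.
VMID=100                                      # placeholder VM ID
NETSPEC="virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0"  # placeholder MAC

qm set "$VMID" --net0 "${NETSPEC},link_down=1"
sleep 2
qm set "$VMID" --net0 "$NETSPEC"
```

The current net0 string can be read back with `qm config $VMID | grep net0` before toggling, to make sure nothing else in the spec is dropped.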

I have two VMs that are configured identically and were cloned from the same template. One of them experienced the issue twice last night; the other did not. (Both VMs were having the same application installed on them last night, hence the reboots.)

I have tried to reproduce the issue on a third VM (also cloned from the same template) by rebooting it many times this morning, without success. I also tried loading the NIC heavily, since Cristiano and Luca reported that it happens when the NIC is under heavy load. I ran an iperf test from a VM to the hypervisor for an hour, passing 11.5-12 Gbps the entire time. The network stayed up.
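The load test described above looks roughly like this (assuming iperf3; the hypervisor address is a placeholder, and multiple parallel streams are usually needed to sustain 10+ Gbps):

```shell
# On the Proxmox host: start an iperf3 server.
iperf3 -s

# Inside the VM (e.g. from a second shell): push traffic for one hour.
# 192.0.2.10 is a placeholder for the hypervisor's address;
# -t 3600 runs for an hour, -P 4 opens four parallel streams.
iperf3 -c 192.0.2.10 -t 3600 -P 4
```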
 
Nov 19, 2019
Singapore
I can also confirm this. Two identical Windows VMs, and only one of them gets this network hiccup. I had to restart the network adapter within the Windows VM to fix it. Both are using Intel NICs.
 
Apr 8, 2020
Norway
voit.no
We also have problems with this.
For years.

We had an old 2008 R2 server that suddenly lost network connectivity; the network card showed as still connected and everything, but nothing could get through. Resetting the network card sometimes worked, and sometimes didn't.
A full reboot of the VM was the only thing that was guaranteed to work.
This could happen once a month, or once a week.

We recently set up two new servers running Windows Server 2019: a DC and a terminal server running some ERP software.
These are standard new Windows VMs with the E1000 NIC.
After a week we have had to restart the DC once, and the terminal server we sometimes have to restart every 12 hours because of the connectivity issues.

We have had this weird problem for over two years now, and we always thought it was the old Windows Server OS's doing, but the problem now happens even more often on new VMs running fully updated Windows Server 2019.

The trend seems to be that if a server is running software that uses a lot of network resources/connections, like our ERP software, it will end up losing network connectivity.
It does not seem to be related to raw transfer volume, since our SQL server does not have this issue, and we transfer a lot of data out of that VM every day.

We have updated Proxmox regularly for years and the issue just won't go away.
In retrospect, I think this issue was introduced in Proxmox VE 5.

I will try the latest VirtIO drivers (virtio-win-0.1.173) and see if it gets any better after people get back from Easter vacation.
 

Luca Guerrini

Member
Feb 12, 2020
[quoting voit.no's post above]

Hi,

(Sorry for my bad English.)
I solved the problem I had had for months on Server 2019. I didn't touch the network parameters except for the "max RSS queues" setting: I set the RSS queues to 1, both in the network device inside Windows and in the Proxmox NIC hardware config. I also disabled the firewall on the network interface.
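If I understand Luca's fix correctly, the Proxmox side of it maps to the multiqueue count on the virtual NIC. A sketch with a placeholder VM ID and MAC (this assumes a virtio NIC, since `queues` is a virtio-net option):

```shell
# Proxmox side: pin the virtio NIC to a single queue (queues=1),
# and omit firewall=1 from the net0 string to disable the
# per-interface Proxmox firewall, as Luca describes.
# VM ID and MAC are placeholders.
qm set 100 --net0 "virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0,queues=1"
```

The matching Windows-side setting is the "Maximum number of RSS queues" property in the adapter's Advanced tab in Device Manager.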
 
Apr 8, 2020
Norway
voit.no
[quoting Luca Guerrini's reply above]

Thank you, I will try this if the new VirtIO driver does not fix the problem.
 
Apr 8, 2020
Norway
voit.no
[quoting my earlier post above]

After changing the network card to virtio and using virtio-win-0.1.173, the servers have not had any network issues for the past two weeks.
 

yahouzheng

New Member
Sep 21, 2020
I have the same problem with Proxmox 5.2.3 and Windows 2012 R2. It happens with both the virtio and E1000 drivers. There is no entry in the Windows or Proxmox logs; the VM just loses connectivity, and the only way to fix it is to reboot it. I installed the drivers from virtio-win-0.1.141.iso. The guest is running as a Ceph OSD host with 40 OSD processes, and the Ceph cluster shares out RBD images to other hosts. Is this problem related to storage?

1. On the Proxmox host, a tcpdump on the affected interface shows only ARP requests being sent by the server, all unanswered.
2. The virtual machine can send ARP packets to the outside.
3. The virtual machine cannot receive packets.

VM142, network: tap142i0
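The observation above can be reproduced by sniffing ARP on the VM's tap interface from the Proxmox host (tap142i0 here, taken from this post; adjust for your VM ID):

```shell
# Watch ARP traffic on the VM's tap interface. A healthy guest shows
# both who-has requests and is-at replies; a "bugged" guest shows only
# outgoing, unanswered who-has requests from the VM.
tcpdump -eni tap142i0 arp
```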
[screenshots attached]
 
Apr 8, 2020
Norway
voit.no
[quoting yahouzheng's post above]

Try virtio-win-0.1.173
 

tinfever

Member
Jun 30, 2019
I believe I might be running into the same issue. Somewhat randomly, I'll have Windows guests lose network connectivity completely. I can't see inside the guest, but running tcpdump on the VM's tap interface, tapXXi0 (where XX is the VM ID), shows the guest sending repeated ARP requests for the gateway IP address, receiving a response, and then just looping. Rebooting the guest fixes the issue, at least temporarily.

All guests are running virtio-win-0.1.173. I'm going to switch this guest from the e1000 NIC to a virtio NIC and see how it goes. Other than that, I'm all out of ideas.
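That NIC-model switch can be made from the host CLI. The VM ID 213 matches the config dump below; the MAC is a placeholder (the real one is redacted in the dump):

```shell
# Change net0 from e1000 to virtio-net, keeping the same MAC and
# bridge/firewall settings so only the emulated NIC model changes.
# The MAC here is a placeholder.
qm set 213 --net0 "virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0,firewall=1"
```

Windows will detect the virtio NIC as a new adapter, so the NetKVM driver from the virtio-win ISO needs to be installed in the guest first.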

Oddly enough, when I made a snapshot of the guest while it was bugged out, the snapshotting seemed to reset the interfaces or something, since the network started working again once the snapshot was done.

[screenshot attached]

agent: 1
balloon: 0
bios: ovmf
bootdisk: scsi0
cores: 8
cpu: host,hidden=1,flags=+md-clear;+pcid;+spec-ctrl;+ssbd;+pdpe1gb;+hv-tlbflush;+aes,hv-vendor-id=whatever
efidisk0: NVME-thin:vm-213-disk-1,size=4M
hostpci0: 09:00,pcie=1
hugepages: 1024
ide2: none,media=cdrom
lock: snapshot
machine: q35
memory: 22528
name: ___________
net0: e1000=______:1C:84,bridge=vmbr0,firewall=1
numa: 1
numa0: cpus=0-7,hostnodes=0,memory=22528,policy=bind
ostype: win10
scsi0: NVME-thin:vm-213-disk-0,size=200G
scsihw: virtio-scsi-pci
smbios1: uuid=________________
sockets: 1
vmgenid: ___________________

cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto enp12s0f1
iface enp12s0f1 inet static
address ______0.6/24
#cluster network

iface enp12s0f0 inet manual

auto vmbr0
iface vmbr0 inet static
address ______.1.6/24
gateway ______.1.1
bridge-ports enp12s0f0
bridge-stp off
bridge-fd 0
 


tinfever

Member
Jun 30, 2019
Hi,
So the Windows guest receives the ARP reply, but doesn't register it in its ARP table?

I wish I could check that, since that's exactly how it acts. Unfortunately, due to the arrangement of these guests, with GPUs passed through and the VNC display disabled from inside the guest, I can't run commands or check anything inside the guest without rebooting it into a different config.
 

mcdowellster

Member
Jun 13, 2018
So I can reproduce this CONSTANTLY...

I'm running Blue Iris on a Windows Server 2016 VM. Once half or more of my cameras connect, recent VirtIO drivers (post-.141) start producing some CRAZY latency. The e1000 works perfectly but will randomly drop after backups run, FROM A COMPLETELY DIFFERENT 10 GBIT INTERFACE! I have found that Check_MK reports NFS drops during backups (an NFS share holds all VM backups), but it doesn't really drop. If I'm copying large data to a Windows VM, shares hosted on it will randomly drop. No matter what I do, Windows networking under load results in packet loss... I'm running a three-node cluster: Ceph (10 Gbit fibre SAN), GBit Intel NICs, and Linux bridges. Something is up with Linux bridges: when I ran Open vSwitch I never had these issues, but I would prefer to use the GUI to configure VLANs and such... Help?

See the screenshot: this is the result of an ICMP ping to Google when I have 5 or more cameras connected to Blue Iris. This happens to ALL my Windows VMs under load...
 

[attachment: ToGoogleUnderLoad.png]

tinfever

Member
Jun 30, 2019
[quoting mcdowellster's post above]
Does using a virtio NIC on the VMs make any difference? I think that may have fixed it in my case, but it's still somewhat too soon to tell.

Also, I think (although I could be wrong) that the E1000 NICs just use the drivers included with Windows, and don't have any applicable drivers in the virtio driver package.
 

mcdowellster

Member
Jun 13, 2018
[quoting tinfever's reply above]

So the virtio NIC is massively dropping packets while under load.

The E1000 works perfectly under load BUT drops after backups over my 10 Gbit NICs. I assume it's caused by the "Snapshot" backup mode. Selecting Disconnect for the NIC in the GUI and then reconnecting it always instantly fixes it.

I'm trying the VMware NIC now to see if it's any better...

What I'm finding, however, is that any time the network is under extreme load, things drop. NFS drops during backups (it always comes back quickly enough that backups never fail). SMB shares drop for a few seconds when copying 30+ GB files from one Windows VM to another (note: it is the Linux-mounted CIFS clients that report the shares as down, not Windows).
 

Josh Douglas

Member
Dec 29, 2015
I'm having similar issues: double-digit load on the Proxmox node, and rather "high" CPU as well (200-600% on some threads). My first thought was to upgrade the VirtIO drivers (to .185), but after about a week or so, same issue; I'm back to stopping the VM and starting it back up, since the console won't respond. Has anybody tried the RTL8139 NIC, or is that just a folly?
 

mcdowellster

Member
Jun 13, 2018
[quoting Josh Douglas's post above]
Shockingly, the VMware emulated NICs seem rock solid. My camera server hasn't needed a reboot in weeks, and the monitoring server is happier too. I also switched backups to a NAS via a CIFS share on the SAN subnet.
I passed through an Intel i210 NIC on my primary SMB server to see if it works better. It does, but I'm ready to switch back to emulated NICs.
 

IanCH

Member
Jun 6, 2017
I'm having the exact same issue.

Proxmox 6.4-6 running Windows Server 2016 VMs with Intel (E1000) NICs.

I need to run the Intel NICs because Routing and Remote Access (VPN) doesn't work with the VirtIO NIC.

It seems the NIC stops under load: copying the Windows ISO from a share on the VM to a local PC can cause the issue.

The resolution is to disable and then re-enable the NIC, but this isn't ideal.

Any ideas?
 
Dec 15, 2018
I have this issue too, both with the VirtIO and (sadly) VMware NICs, and while the E1000 alternative is functional, its CPU usage is prohibitively high.

In my case this impacts Linux guests.

Has anyone figured out the root cause, or better yet a viable solution?

Edit: For me, I believe this is VLAN-related; transferring data between non-tagged interfaces is fine. Unfortunately, I remain unclear on a resolution.
 

tomstephens89

Active Member
Mar 10, 2014
Kingsclere, United Kingdom
It's been a long time since I initially posted in this thread.

The problem was caused by bad VirtIO drivers. Switching to the latest stable release, instead of the latest beta which I had downloaded by mistake, solved the issue.

I have used the NetKVM VirtIO driver from the latest stable VirtIO package on every VM since I first hit this issue back in 2018 on Proxmox 5.2.
 
