KVM Windows VMs and losing network connectivity

Dec 27, 2019
Update from my environment.
Open vSwitch
Proxmox 6.1-7
VirtIO drivers 0.1.173

I just experienced the disconnect issue a couple more times yesterday with a Windows Server 2019 VM. I notice it mostly on boot: *sometimes*, after a reboot, the NIC will fail to pass traffic. It happens on *some* reboots, but definitely not all, as in the past. Either rebooting the VM again, or disconnecting the NIC and reconnecting it (either from within Windows or from the Hardware tab in Proxmox), will get the NIC working again.
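For what it's worth, the disconnect/reconnect workaround can also be scripted on the Proxmox host with `qm set` and the `link_down` flag. The VM ID, NIC model, and MAC below are placeholders, not values from this thread:

```shell
#!/bin/sh
# Toggle the virtual link on net0 of a VM to kick a hung NIC back to
# life. link_down=1 corresponds to the GUI's "Disconnect" checkbox.
# Note that qm set replaces the whole net0 string, so the model, MAC,
# and bridge must be repeated exactly as they appear in the VM config.
VMID=100                                      # placeholder VM ID
NETSPEC="virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0"  # placeholder MAC

qm set "$VMID" --net0 "${NETSPEC},link_down=1"
sleep 2
qm set "$VMID" --net0 "$NETSPEC"
```

The current net0 string can be read back with `qm config $VMID | grep net0` before toggling, to make sure nothing else in the spec is dropped.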

I have two VMs that are configured identically and were cloned from the same template. One of them experienced the issue twice last night; the other did not. (Both VMs were having the same application installed on them last night, hence the reboots.)

I have tried to reproduce the issue on a third VM (also cloned from the same template) by rebooting it many times this morning, without success. I also tried loading the NIC heavily, since Cristiano and Luca reported that it happens when the NIC is under heavy load. I ran an iperf test from a VM to the hypervisor for an hour, passing 11.5-12 Gbps the entire time. The network stayed up.
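The load test described above looks roughly like this (assuming iperf3; the hypervisor address is a placeholder, and multiple parallel streams are usually needed to sustain 10+ Gbps):

```shell
# On the Proxmox host: start an iperf3 server.
iperf3 -s

# Inside the VM (e.g. from a second shell): push traffic for one hour.
# 192.0.2.10 is a placeholder for the hypervisor's address;
# -t 3600 runs for an hour, -P 4 opens four parallel streams.
iperf3 -c 192.0.2.10 -t 3600 -P 4
```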
 
Nov 19, 2019
Singapore
I can also confirm this. Two identical Windows VMs, and only one of them gets this network hiccup. I had to restart the network adapter within the Windows VM to fix it. Both are using Intel NICs.
 
Apr 8, 2020
Norway
voit.no
We also have problems with this.
For years.

We had an old 2008 R2 server that suddenly lost network connectivity; the network card showed as still connected and everything, but nothing could get through. Resetting the network card sometimes worked, and sometimes didn't.
A full reboot of the VM was the only thing that was guaranteed to work.
This could happen once a month, or once a week.

We recently set up two new servers running Windows Server 2019: a DC and a terminal server running some ERP software.
These are standard new Windows VMs with the E1000 NIC.
After a week we have had to restart the DC once, and the terminal server we sometimes have to restart every 12 hours because of the connectivity issues.

We have had this weird problem for over two years now, and we always thought it was the old Windows Server OS's doing, but the problem now happens even more often on new VMs running fully updated Windows Server 2019.

The trend seems to be that if a server is running software that uses a lot of network resources/connections, like our ERP software, it will end up losing network connectivity.
It does not seem to be related to raw transfer volume, since our SQL server does not have this issue, and we transfer a lot of data out of that VM every day.

We have updated Proxmox regularly for years and the issue just won't go away.
In retrospect, I think this issue was introduced in Proxmox VE 5.

I will try the latest VirtIO drivers (virtio-win-0.1.173) and see if it gets any better after people get back from Easter vacation.
 

Luca Guerrini

Member
Feb 12, 2020
[quoting voit.no's post above]

Hi,

(Sorry for my bad English.)
I solved the problem I had had for months on Server 2019. I didn't touch the network parameters except for the "max RSS queues" setting: I set the RSS queues to 1, both in the network device inside Windows and in the Proxmox NIC hardware config. I also disabled the firewall on the network interface.
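If I understand Luca's fix correctly, the Proxmox side of it maps to the multiqueue count on the virtual NIC. A sketch with a placeholder VM ID and MAC (this assumes a virtio NIC, since `queues` is a virtio-net option):

```shell
# Proxmox side: pin the virtio NIC to a single queue (queues=1),
# and omit firewall=1 from the net0 string to disable the
# per-interface Proxmox firewall, as Luca describes.
# VM ID and MAC are placeholders.
qm set 100 --net0 "virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0,queues=1"
```

The matching Windows-side setting is the "Maximum number of RSS queues" property in the adapter's Advanced tab in Device Manager.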
 
Apr 8, 2020
Norway
voit.no
[quoting Luca Guerrini's reply above]

Thank you, I will try this if the new VirtIO driver does not fix the problem.
 
Apr 8, 2020
Norway
voit.no
[quoting my earlier post above]

After changing the network card to virtio and using virtio-win-0.1.173, the servers have not had any network issues for the past two weeks.
 

yahouzheng

New Member
Sep 21, 2020
I have the same problem with Proxmox 5.2.3 and Windows 2012 R2. It happens with both the virtio and E1000 drivers. There is no entry in the Windows or Proxmox logs; the VM just loses connectivity, and the only way to fix it is to reboot it. I installed the drivers from virtio-win-0.1.141.iso. The guest is running as a Ceph OSD host with 40 OSD processes, and the Ceph cluster shares out RBD images to other hosts. Is this problem related to storage?

1. On the Proxmox host, a tcpdump on the affected interface shows only ARP requests being sent by the server, all unanswered.
2. The virtual machine can send ARP packets to the outside.
3. The virtual machine cannot receive packets.

VM142, network: tap142i0
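The observation above can be reproduced by sniffing ARP on the VM's tap interface from the Proxmox host (tap142i0 here, taken from this post; adjust for your VM ID):

```shell
# Watch ARP traffic on the VM's tap interface. A healthy guest shows
# both who-has requests and is-at replies; a "bugged" guest shows only
# outgoing, unanswered who-has requests from the VM.
tcpdump -eni tap142i0 arp
```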
[screenshots attached]
 
Apr 8, 2020
Norway
voit.no
[quoting yahouzheng's post above]

Try virtio-win-0.1.173
 

tinfever

Member
Jun 30, 2019
I believe I might be running into the same issue. Somewhat randomly, I'll have Windows guests lose network connectivity completely. I can't see inside the guest, but running tcpdump on the VM's tap interface, tapXXi0 (where XX is the VM ID), shows the guest sending repeated ARP requests for the gateway IP address, receiving a response, and then just looping. Rebooting the guest fixes the issue, at least temporarily.

All guests are running virtio-win-0.1.173. I'm going to switch this guest from the e1000 NIC to a virtio NIC and see how it goes. Other than that, I'm all out of ideas.
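That NIC-model switch can be made from the host CLI. The VM ID 213 matches the config dump below; the MAC is a placeholder (the real one is redacted in the dump):

```shell
# Change net0 from e1000 to virtio-net, keeping the same MAC and
# bridge/firewall settings so only the emulated NIC model changes.
# The MAC here is a placeholder.
qm set 213 --net0 "virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0,firewall=1"
```

Windows will detect the virtio NIC as a new adapter, so the NetKVM driver from the virtio-win ISO needs to be installed in the guest first.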

Oddly enough, when I made a snapshot of the guest while it was bugged out, the snapshotting seemed to reset the interfaces or something, since the network started working again once the snapshot was done.

[screenshot attached]

agent: 1
balloon: 0
bios: ovmf
bootdisk: scsi0
cores: 8
cpu: host,hidden=1,flags=+md-clear;+pcid;+spec-ctrl;+ssbd;+pdpe1gb;+hv-tlbflush;+aes,hv-vendor-id=whatever
efidisk0: NVME-thin:vm-213-disk-1,size=4M
hostpci0: 09:00,pcie=1
hugepages: 1024
ide2: none,media=cdrom
lock: snapshot
machine: q35
memory: 22528
name: ___________
net0: e1000=______:1C:84,bridge=vmbr0,firewall=1
numa: 1
numa0: cpus=0-7,hostnodes=0,memory=22528,policy=bind
ostype: win10
scsi0: NVME-thin:vm-213-disk-0,size=200G
scsihw: virtio-scsi-pci
smbios1: uuid=________________
sockets: 1
vmgenid: ___________________

cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto enp12s0f1
iface enp12s0f1 inet static
address ______0.6/24
#cluster network

iface enp12s0f0 inet manual

auto vmbr0
iface vmbr0 inet static
address ______.1.6/24
gateway ______.1.1
bridge-ports enp12s0f0
bridge-stp off
bridge-fd 0
 


tinfever

Member
Jun 30, 2019
Hi,
So the Windows guest receives the ARP reply, but doesn't register it in its ARP table?

I wish I could check that, since that's exactly how it acts. Unfortunately, due to the arrangement of these guests, with GPUs passed through and the VNC display disabled from inside the guest, I can't run commands or check anything inside the guest without rebooting it into a different config.
 

mcdowellster

Member
Jun 13, 2018
So I can reproduce this CONSTANTLY...

I'm running Blue Iris on a Windows Server 2016 VM. Once half or more of my cameras connect, recent VirtIO drivers (post-.141) start producing some CRAZY latency. The e1000 works perfectly but will randomly drop after backups run, FROM A COMPLETELY DIFFERENT 10 GBIT INTERFACE! I have found that Check_MK reports NFS drops during backups (an NFS share holds all VM backups), but it doesn't really drop. If I'm copying large data to a Windows VM, shares hosted on it will randomly drop. No matter what I do, Windows networking under load results in packet loss... I'm running a three-node cluster: Ceph (10 Gbit fibre SAN), GBit Intel NICs, and Linux bridges. Something is up with Linux bridges: when I ran Open vSwitch I never had these issues, but I would prefer to use the GUI to configure VLANs and such... Help?

See the screenshot: this is the result of an ICMP ping to Google when I have 5 or more cameras connected to Blue Iris. This happens to ALL my Windows VMs under load...
 

[attachment: ToGoogleUnderLoad.png]

tinfever

Member
Jun 30, 2019
[quoting mcdowellster's post above]
Does using a virtio NIC on the VMs make any difference? I think that may have fixed it in my case, but it's still somewhat too soon to tell.

Also, I think (although I could be wrong) that the E1000 NICs just use the drivers included with Windows, and don't have any applicable drivers in the virtio driver package.
 

mcdowellster

Member
Jun 13, 2018
[quoting tinfever's reply above]

So the virtio NIC is massively dropping packets while under load.

The E1000 works perfectly under load BUT drops after backups over my 10 Gbit NICs. I assume it's caused by the "Snapshot" backup mode. Selecting Disconnect for the NIC in the GUI and then reconnecting it always instantly fixes it.

I'm trying the VMware NIC now to see if it's any better...

What I'm finding, however, is that any time the network is under extreme load, things drop. NFS drops during backups (it always comes back quickly enough that backups never fail). SMB shares drop for a few seconds when copying 30+ GB files from one Windows VM to another (note: it is the Linux-mounted CIFS clients that report the shares as down, not Windows).
 

Josh Douglas

Member
Dec 29, 2015
I'm having similar issues: double-digit load on the Proxmox node, and rather "high" CPU as well (200-600% on some threads). My first thought was to upgrade the VirtIO drivers (to .185), but after about a week or so, same issue; I'm back to stopping the VM and starting it back up, since the console won't respond. Has anybody tried the RTL8139 NIC, or is that just a folly?
 

mcdowellster

Member
Jun 13, 2018
[quoting Josh Douglas's post above]
Shockingly, the VMware emulated NICs seem rock solid. My camera server hasn't needed a reboot in weeks, and the monitoring server is happier too. I also switched backups to a NAS via a CIFS share on the SAN subnet.
I passed through an Intel i210 NIC on my primary SMB server to see if it works better. It does, but I'm ready to switch back to emulated NICs.
 

IanCH

Member
Jun 6, 2017
I'm having the exact same issue.

Proxmox 6.4-6 running Windows Server 2016 VMs with Intel (E1000) NICs.

I need to run the Intel NICs because Routing and Remote Access (VPN) doesn't work with the VirtIO NIC.

It seems the NIC stops under load: copying the Windows ISO from a share on the VM to a local PC can cause the issue.

The resolution is to disable and then re-enable the NIC, but this isn't ideal.

Any ideas?
 
Dec 15, 2018
I have this issue too, both with the VirtIO and (sadly) VMware NICs, and while the E1000 alternative is functional, its CPU usage is prohibitively high.

In my case this impacts Linux guests.

Has anyone figured out the root cause, or better yet a viable solution?

Edit: For me, I believe this is VLAN-related; transferring data between non-tagged interfaces is fine. Unfortunately, I remain unclear on a resolution.
 

tomstephens89

Active Member
Mar 10, 2014
Kingsclere, United Kingdom
It's been a long time since I initially posted in this thread.

The problem was caused by bad VirtIO drivers. Switching to the latest stable release, instead of the latest beta which I had downloaded by mistake, solved the issue.

I have used the NetKVM VirtIO driver from the latest stable VirtIO package on every VM since I first hit this issue back in 2018 on Proxmox 5.2.
 
