Windows VM transfer speeds drop to 0 to/from HDD ZFS mirrored pool

debsque

Renowned Member
Sep 28, 2016
Hello,

When I use Windows 10/11 VMs, stored on an SSD ZFS mirror pool, to transfer files to/from a ZFS HDD mirrored pool on the same host, speeds frequently drop to 0 and hang there for a few minutes.

I don't have this issue with Ubuntu or other Linux VMs, so I'm guessing it has something to do with the Windows virtio drivers and the ZFS cache on the HDDs, but I can't put my finger on it. My workaround at the moment is to rate limit the virtio network device to 180 MB/s and the virtio disk (for disks on the HDD pool) to 100 MB/s.
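For anyone wanting to replicate the workaround, the same limits can also be set from the host CLI; a rough sketch with a placeholder VM ID, MAC and volume name (adjust to your own config):

```bash
# Placeholder VM ID 100; rate= is in MB/s, so this caps the virtio NIC at ~180 MB/s
qm set 100 -net0 virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0,rate=180

# Cap reads and writes of the disk stored on the HDD pool at ~100 MB/s
# (placeholder volume name; re-specifying the disk replaces its whole option string)
qm set 100 -scsi1 tank:vm-100-disk-0,mbps_rd=100,mbps_wr=100
```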

There are several forum and reddit posts on the topic, but my situation seems a bit different.

I'm testing two scenarios:

(1) Transfers inside Windows VM:
  • Boot (C drive) on SSD pool
  • first test drive on HDD pool
  • second test drive on SSD pool
I'm transferring about 50 GB consisting of several Windows and Linux ISOs (literally), mostly files between 1 and 10 GB.

Transfers between the disks stored on the SSDs work mostly within expected parameters. Transfers between anything SSD and HDD hang. When I rate limit the HDD test drive to 100 MB/s via the Proxmox interface, speeds fluctuate between 50 and 80 MB/s, rarely drop to 0, and even when they do drop, they recover quickly.

(2) Using the Windows VM to copy from a Samba LXC, which serves files (via bind mounts) from the HDD pool (on the same host)

This works even worse. VMs start at 200-400 MB/s, then drop to 0 and hang there for over a minute. Sometimes the transfer completes with fluctuating speeds above 150 MB/s, but other times it just bounces between 0 and 400 MB/s and hangs frequently. My workaround is to rate limit the virtio network controller to about 180 MB/s.

Hardware setup:
  • 2 x WD SSD - rpool mirror
  • 2 x Samsung EVO 860 2TB - ZFS mirror for the VM storage pool (called vmstorage)
  • 2 x 8TB WD Red Pro - ZFS mirror data storage pool (called tank)
  • CPU: Intel Xeon Silver 4314 @ 2.4 GHz
  • 128 GB RAM (ZFS ARC limited to 16 GB; see the sketch after this list)
  • Dell X520-DA2 10 Gbit optical transceiver
  • Proxmox 8.3.4 - VMs and everything are up to date
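For reference, the ARC cap from the list above is set the usual way; a sketch with the 16 GiB value (these are the standard paths, nothing specific to my setup):

```bash
# Cap the ZFS ARC at 16 GiB (16 * 1024^3 = 17179869184 bytes)
echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf
update-initramfs -u -k all   # rebuild the initramfs so the limit applies at boot

# Or apply it immediately without a reboot:
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
```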
The logical setup is something like this:
  • data folders stored on tank (mirrored WD Red 8TB)
  • VMs stored on vmstorage (mirrored EVO 2 TB)
  • a Debian LXC container has bind mounts to folders on tank (see the sketch after this list)
  • the LXC container is a Samba server
  • VMs access the LXC container via Samba => VMs access folders on tank via Samba
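The bind-mount part of this layout boils down to something like the following, with a placeholder container ID and placeholder paths:

```bash
# Bind-mount a folder from the tank pool into the Samba LXC (placeholder container ID 101)
pct set 101 -mp0 /tank/data,mp=/srv/data
# Inside the container, /srv/data is then exported as a normal Samba share
```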
The VMs are fresh Windows 10 LTSC and Windows 11 LTSC installs:
  • CPU = x86-64-v2-AES with flags md-clear, pcid, spec-ctrl, ssbd, aes
  • RAM = 16 GB, no ballooning
  • Machine = q35, OVMF (UEFI) BIOS
  • SCSI controller = VirtIO SCSI single
  • HDD = scsi0
I've tried every combination I found on the forum, without results, such as:
  • changing the CPU type to host, kvm64, different flags, different architectures
  • changing various disk flags such as SSD emulation, native async IO, discard, and IO thread (didn't modify cache though); see the sketch after this list
  • changing the machine type
  • reinstalling Windows and using older virtio drivers
  • VirtIO SCSI, VirtIO SCSI single, and virtio block storage instead of SCSI
  • NIC multiqueues
  • storing the VM on LVM-Thin instead of ZFS (detached one disk from the vmstorage mirror just to test this)
  • storing the VM on the rpool mirror
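For completeness, the disk-flag and multiqueue attempts from the list above were along these lines (placeholder VM ID, volume name and MAC):

```bash
# One of the disk-flag combinations tried: IO thread, native AIO, discard, SSD emulation
qm set 100 -scsi1 tank:vm-100-disk-0,iothread=1,aio=native,discard=on,ssd=1

# NIC multiqueue (queues usually matches the vCPU count)
qm set 100 -net0 virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0,queues=4
```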
The only setting that had a visible effect was zfs sync=disabled. The speeds still fluctuated a lot, but transfers didn't hang at 0 anymore.
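For anyone who wants to repeat that test, this is the only knob that changed the behaviour here; note that disabling sync is a diagnostic, not a fix, since it trades data safety on power loss for speed:

```bash
# Diagnostic only: disable synchronous writes on the HDD pool
zfs set sync=disabled tank

# Revert to the default behaviour afterwards
zfs set sync=standard tank
```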

Some other observations:
  • set up a new TrueNAS VM on this host, with 2 mirrored virtio disks stored on the host's tank ZFS pool - no issues, but speeds never went above 150 MB/s
  • transfers from another TrueNAS host to the Windows VM work normally; in the opposite direction the Windows VM pushes only 20 MB/s
  • an Ubuntu 24 VM sustained transfers over 200 MB/s without stalling, with only occasional slowdowns
  • other file transfers work without interruption (e.g. via Linux guests) while the Windows ones stall
  • I can't find hints in the logs, and neither CPU nor RAM looks anywhere near 100% utilization. Windows Task Manager shows 100% active time while the transfer stalls.
So, to summarise: the Windows VMs generally act up while copying files.
 
Be sure to use Windows virtio SCSI drivers v266, as prior versions have problems.
 
Thanks for the suggestions.

I've tried both the newest virtio drivers and several older ones.

The dirty_cache thing doesn't do anything and it also doesn't apply in my case: I don't have issues with transfers between SSDs, only while transferring to/from the HDD-based ZFS pool. I'll probably try an LSI card next weekend, but I doubt it's that, since I have no issues at all under Linux and none when using disks from the SSD pool under Windows.
 
It seems that creating a SATA disk (instead of SCSI or virtio) stored on the tank pool (the HDD-based ZFS mirror) eliminates the hangs when copying files between different disks inside the same Windows VM.

In scenario (2) (copying files from a Samba server in an LXC container with bind mounts pointing to folders on tank), the problem is still there even when copying to the SATA disk. I've changed the ethernet device to e1000 and the controller to LSI, but it didn't make a difference. The only way to avoid drops and hangs while copying over the "network" is to set the network limit to 200 MB/s.
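For reference, the SATA test disk was added roughly like this (placeholder VM ID and size; this assumes tank is also defined as a ZFS storage in Proxmox):

```bash
# Allocate a new 64 GB disk on the 'tank' storage and attach it as a SATA device
qm set 100 -sata1 tank:64
```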

This also resonates with another post from your thread regarding better SATA controller stability:
In the meantime I've finished 3 other tests.

Created/installed 3 new VMs, identically configured except for the storage controller.

IDE controller (slow, but no issue)

SATA controller (slow, but no issue)

SCSI controller, but with older Windows drivers (fast, but with the issue)

(benchmark screenshots attached for each test)

In conclusion, it seems that when the "not ideal" controller limits the VM's I/O, the issue is not present because the disks are not saturated.

The hypothesis and suggestion of eclipse10000 in post #8 seem more and more likely. I'll test soon and let you know

Thanks!
 
Double-check the driver version of the VirtIO SCSI storage controller in Windows Device Manager when you test it.
Versions after 208 and before 266 are known to hang, especially with CrystalDiskMark.
Upgrading with the .msi can fail, and updating the driver from Windows Device Manager may be required.

 
So far, it seems to have something to do with multiple networks on the same LXC container (I'm using several VLANs and bridges). The logs don't show anything meaningful and it's fairly easy to reproduce. When I remove the network interfaces from the config, or use "Disconnect" from the GUI, the speeds remain constant (Linux and BSD guests still don't show this issue).
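The CLI equivalent of the "remove the extra interfaces" test, with a placeholder container ID and placeholder interface names:

```bash
# Drop the extra VLAN interfaces (here net1 and net2) from a placeholder container 101
pct set 101 --delete net1,net2
```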

I went through several tests to arrive at this conclusion.

(1) connected 2 identical but new HDDs via a different HBA
  • created a new ZFS pool on the new HDDs via the Proxmox WebUI (CLI equivalent sketched after this list)
  • created a folder on the new pool and bind-mounted it into the LXC container
  • => issue manifests as usual
(2) passed the HBA controller through to a fresh TrueNAS VM
  • imported pool, created shares
  • => issue gone
(3) created a new Ubuntu 24.04 LXC (the original is Debian-based)
  • copied over the same bind mounts from the other container, standard Samba configuration, but only 1 network card
  • => no issues
  • joined the Samba server to the domain and copied smb.conf from the other LXC => still no issues
    • wanted to rule out a Samba server config issue
  • added the rest of the virtio network interfaces => issue manifests
(4) removed the network cards from the original LXC container => issues gone
(5) disconnecting the extra network interfaces of the LXC container via the WebUI, while the Windows VM transfers files, makes the problem go away in real time
(6) added multiple network interfaces to the fresh TrueNAS VM, no slowdowns or issues
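The CLI equivalent of what the WebUI did for the new pool in test (1), with placeholder disk IDs and a placeholder pool name:

```bash
# Create a mirrored pool on the two new HDDs (use your actual /dev/disk/by-id paths)
zpool create -o ashift=12 newtank mirror \
    /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B

# Register the pool as a Proxmox storage so it can hold guest volumes
pvesm add zfspool newtank -pool newtank -content images,rootdir
```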

I think I'll save myself some time now and in the future and just pass an HBA with the disks to an Ubuntu VM. I was never very comfortable with managing the storage on Proxmox to begin with, but the solution seemed elegant and worked for a few years.
Edit: I've found several threads with similar problems. My searches didn't lead me there because I thought it was a ZFS issue, but it looks like an LXC/Samba issue. A Reddit comment described the exact initial symptom (speeds dropping to 0 and hanging there).
https://forum.proxmox.com/threads/very-poor-lxc-network-performance.139701
https://forum.proxmox.com/threads/slow-smb-speed-beetween-windows-machines.42814/
https://forum.proxmox.com/threads/linux-samba-server-or-lxc-container.141177
https://forum.proxmox.com/threads/samba-under-pve-8-sometimes-extremely-slow.130116
 