Windows 11 VM IO drops to 0Mbit

yarcod

Member
Sep 30, 2020
I recently rebuilt my workstation (Ryzen 5900X, 64 GB RAM) with Proxmox and a primary Windows 11 VM with GPU (GTX 1080) passthrough. While most things work fine overall, I have been having trouble getting reliable throughput to the underlying disks.

There are three disks: one NVMe and two SATA SSDs. The NVMe hosts both Proxmox and the Windows 11 VM, while the SATA pair is a mirrored ZFS vdev passed through as a VirtIO SCSI device. When transferring data to the mirror, the device sustains full bandwidth for a while and then Windows reports 0 MB/s for some time; this then repeats on and off throughout the transfer. During these stalls, Performance Monitor in Windows reports the disk at 100% load despite the 0 MB/s rate, and if you try to pause the transfer nothing happens until Windows sees real progress again. Proxmox, however, apparently continues to report full bandwidth for the whole transfer.

I have attached pictures from both systems. The Windows screenshot shows the transfer stalling almost at the end (it happens to say 21 MB/s, but it sits there with no progress; other times it drops all the way to 0 MB/s) after writing about 5 GB, with the disk reporting 100% utilisation. The corresponding Proxmox graph shows no sign of slowing down. That was a limited test transfer, but the second Proxmox graph (the one that includes RAM usage) is from a larger transfer I did yesterday. There the bandwidth is reported at full speed until a sudden spike of several GB/s (the NVMe may have had some traffic at that exact point, but nothing else was happening on the system then).

Some quick background on the transfers above, in case the rates seem low: I have been copying loads of RAW images off a couple of SD cards. These are capped at about 250 MB/s, which is why the transfer rates go no higher than that. The SSDs are SATA drives, but they should be able to keep up with that kind of speed. I have also seen the same behaviour with some 90 MB/s SD cards.

I have not been able to determine where the problem lies, but it feels like there is a problem with either caching or the reporting of IO operations to the Windows VM. Either there is a cache/buffer that fills up and is then drained, during which Windows sees a 0 Mbit transfer because nothing new is being accepted; or there is a stall at the IO reporting layer, where the VM does not receive a successful write confirmation and waits until it arrives. But that is just a feeling; in reality I have probably just misconfigured something. Any thoughts on what I might be missing?
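In case it helps with the diagnosis, this is roughly how I imagine one could watch the pool from the Proxmox side while a transfer runs (the pool name "tank" below is only a placeholder for my mirror):

# On the Proxmox host, during a transfer; 'tank' is a placeholder pool name
zpool iostat -v tank 1   # per-device read/write throughput, refreshed every second
arcstat 1                # ARC size and hit rate, if the arcstat tool is installed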

Thanks in advance!
 

Attachments

  • Skärmbild 2024-08-12 181911.png (132.2 KB)
  • tempProxmoxView.png (69.4 KB)
  • tempWindowsView.png (166.1 KB)
Are you using ZFS on drives with QLC flash memory? That can become really, really slow on sustained writes.
Most likely, yes. The two SATA drives were quite cheap (Insenso brand, if that says anything), so they are probably QLC. But would it get this slow, to the point that transfers actually stop while data is written out to the NAND cells?
 
Welcome to ZFS.

ZFS has a cache (the ARC) and a separate cache for sync writes (the LOG). Inside the ARC you will find both read cache and write (dirty) cache.
ZFS does not write constantly (under light write load). It flushes data every 5 seconds (zfs_txg_timeout) or whenever the dirty cache fills up.
When the ZFS write cache is full, it stops accepting (holds) new write requests until write cache is freed again.
For sync writes: if you do not have a dedicated LOG device and sync != disabled, then every write is written twice.

If your SSD is not built for heavy-duty writes, it will suffer.
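If you want to look at those thresholds on your own host, they are exposed as ZFS module parameters (paths as on ZFS on Linux, which Proxmox ships):

# Flush interval in seconds (default 5)
cat /sys/module/zfs/parameters/zfs_txg_timeout
# Maximum dirty (write cache) bytes before ZFS starts throttling writers
cat /sys/module/zfs/parameters/zfs_dirty_data_max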
 
Yes, QLC is unusable with ZFS, but people find that hard to believe (see earlier discussions about this): https://forum.proxmox.com/search/7408381/?q=QLC&t=post&c[child_nodes]=1&c[nodes][0]=16&o=date
I see -- I had no idea about this limitation! Is there a technical explanation as to why QLC specifically makes a drive incompatible with ZFS? I have seen some discussion of why SMR (on HDDs) does not work, but not QLC.
Thanks for that! I did not know any specifics about ZFS caching. But in my case, is it the sync writes across both drives that hurt performance so badly and also shorten the drives' lifespan? Is there some way to work around this behaviour with my current drives, or do I need to get new ones?

On that topic, if I do need a new pair of (1 TB) drives, what should I be looking for? Brands, series, or tiers? Any specific models that work well with ZFS and are relatively affordable? Also, is there a reason for me to get a LOG device at this point? I am not looking for huge performance -- they are SATA drives after all -- but the system should work.
 
If you are working with critical data, then you will also invest in a UPS, more disks for backups, and so on...

For now, I can suggest disabling ZFS sync (sync writes will then be handled as regular writes) to see whether that helps, and, if possible, raising the ARC cache size:

zfs set sync=disabled your/pool/name
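For the ARC part, a minimal sketch of how that is usually done on Proxmox (the 16 GiB cap below is only an example value, and the file may already exist, so check it before appending):

# Cap the ARC at 16 GiB (17179869184 bytes); adjust to your RAM
echo "options zfs zfs_arc_max=17179869184" >> /etc/modprobe.d/zfs.conf
update-initramfs -u   # takes effect on next reboot

# And to undo the sync experiment later:
zfs set sync=standard your/pool/name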
 
I finally got around to buying a new pair of SSDs -- Kingston KC600 1TB, which are TLC. I've heard that they should be good drives, and if I am not mistaken they have also been recommended by Jim Salter on the 2.5 Admins podcast. While the experience did improve, with far fewer lock-ups and better overall performance, the transfer rate still drops to 0 Mbit from time to time. Again, this is reading from a relatively high-speed SD card, which works perfectly when reading directly to the internal NVMe.

Does anyone have any other ideas on what could cause this performance degradation? Should I just expect this level of performance from a mirrored pair of SATA SSDs under ZFS?
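For what it's worth, I imagine a host-side benchmark along these lines (the pool name and path are placeholders) would show whether the pool itself can sustain the rate, taking Windows and VirtIO out of the picture:

# Run on the Proxmox host; /tank/fio-test is a placeholder path on the mirror
mkdir -p /tank/fio-test
fio --name=seqwrite --rw=write --bs=1M --size=8G --directory=/tank/fio-test --end_fsync=1
rm -r /tank/fio-test   # clean up the test file afterwards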

Also, out of curiosity, now that I have two spare SATA SSDs of limited value, would they work better as a striped pool (the ZFS equivalent of RAID0)? I know I will lose all data if one disk breaks, but for storing, e.g., games that can easily be reinstalled, that does not matter much. Would it be usable, despite the drives being cheap QLC? Something like the sketch below is what I have in mind.
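(Sketch only; the device IDs are placeholders for the two spare drives.)

zpool create scratch /dev/disk/by-id/ata-QLC-SSD-1 /dev/disk/by-id/ata-QLC-SSD-2   # two top-level vdevs = striped, no redundancy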
 
I finally got around to buying a new pair of SSDs -- Kingston KC600 1TB, which are TLC. I've heard that they should be good drives, and if I am not mistaken they have also been recommended by Jim Salter on the 2.5 Admins podcast.
Nope--I recommended the DC600M SSDs. Totally different line-up, a bit more pricey, aimed squarely at enterprise/datacenter use.

I don't have any data on how good the KC600 series is, but it's consumer-targeted. I've seen totally decent Kingston consumer SSDs, and I've seen very "meh" Kingston consumer SSDs. Sorry, I wish I could tell you for sure whether your KC600 is a good drive or a bad one, but all I can tell you for sure is "that's not a drive I've recommended, nor one that I've directly tested."

I don't see anything in the spec sheet about the enterprise features the DC600M offers, so my guess is this is a modern run-of-the-mill consumer targeted SSD. Which means it's probably okay for a desktop drive, but WAY lower on the food chain than the DC600M series. Don't get me wrong, you can certainly use standard consumer SSDs for virtualization hosting, but you're missing out on the hardware QoS, more than double the write endurance spec, powerloss protection, and more.
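If you want a rough read on how hard your current consumer drives are being worked, SMART wear data is one hedged indicator (attribute names vary by vendor, so treat the grep pattern as a guess, and /dev/sda as a placeholder):

smartctl -a /dev/sda | grep -i -e wear -e written -e percent   # wear-leveling / total-bytes-written attributes, naming varies by vendor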

https://www.kingston.com/en/ssd/dc600m-data-center-solid-state-drive
 
Hi Jim! Sorry for misquoting you, so much so that you had to create an account to correct me!

Apart from recommending another drive, would you have any idea why my Windows VM is still unable to perform well? To clarify, it is not the OS drive that misbehaves*, but the extra (D:) drive that performs slowly. Since I have already bought new TLC drives, is there anything I can try to make them faster? Or have I simply set them up badly, e.g. by mirroring them? Do you think this specific issue would be better solved by switching to another set of drives again?


* While the OS stutters from time to time during some disk-intensive operations, e.g. starting Lightroom, this might still be connected to the slowness of the secondary drive(s) -- I have stumbled upon this before, when a seemingly empty D: HDD could halt Windows until the disk had spun up.
 
I agree, Windows will absolutely act like an idiot even if it's an alternate drive that's occasionally non-responsive. I don't have access to the source code, but it seems pretty obvious that PLENTY of suboptimal decisions have landed in that codebase over the decades. :)

I'd need a lot more information to try to advise you on potential software fixes. And it's possible there might be some, but honestly... you could easily be describing the way Samsung EVO consumer drives "fall off a cliff" when their SLC write cache area gets saturated and they have to start flushing it out to the primary media as a background process, and I suspect that is the actual issue.
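If you want to check whether that is what you're hitting, one hedged way is to watch per-vdev latency on the Proxmox host while a stall happens (pool name is a placeholder):

zpool iostat -l tank 1   # adds per-vdev latency columns; disk write latency should spike during a stall
zpool iostat -w tank     # one-shot latency histograms for the pool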

If you want to noodle through other possible ideas that DON'T involve buying yet another set of hardware, you're welcome to head on over to the Practical ZFS discourse and we can talk it over there. Apologies, but I really never had any intention of becoming a regular here, I just noticed you Beetlejuicing me and wanted to make sure I got the record straight on drive recommendations. :)
 
