Windows guest with very low SSD performance

Soilage

Member
Aug 13, 2021
I have a problem with my Windows 10 guest VM, where the disk write speed drops from 500-600 MB/s to around 50 MB/s in a matter of seconds. After a while it goes down to around 2 MB/s and then back up again to around 50 MB/s, leaving the OS almost unusable.
I have tested it with a single 10 GB file, making a copy of it on the same disk.

I have Proxmox 7 installed on bare metal with a single NVMe M.2 disk using ZFS.
The Windows guest is configured with 8 cores (CPU type host), 8 GB of RAM and the SCSI controller set to VirtIO.
The virtual hard disk is configured with Write back as the cache mode.
When installing Windows, I installed the VirtIO drivers and the QEMU agent.

Do any of you know what the problem could be and what I should do? Thanks!
 
First, you shouldn't use writeback with ZFS. Your ZFS is already caching stuff in RAM (with its ARC) on the host. If you enable writeback, this tells your host to cache stuff in RAM again. So you basically cache stuff twice in RAM, wasting RAM and lowering the performance because everything is done twice. And your Windows guest is caching in RAM too, so the same data gets cached 3 times in RAM. It's recommended to set the cache mode to "none" if using ZFS.

And I think what you are seeing is that your SSD is slow and can only write with 50 MB/s, and the 500-600 MB/s speed is just when you write stuff to the write cache in RAM. But your RAM isn't unlimited, and as soon as it is full it will stop caching and the performance sinks down to the real performance of your SSD. Then the SSD is working hard because it needs to store the new data from the guest as well as the old data that is cached in RAM and waits to be stored on the SSD. Over time your cache gets written to disk, your RAM is free again and you can write with 600 MB/s.
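If you want to see what the SSD is really doing on the host while the guest copies the file, something like this should work (assuming your pool is named rpool, the Proxmox default; adjust if yours differs):

zpool iostat -v rpool 1

That prints the pool's read and write bandwidth once per second, so you should see the short burst while the RAM cache fills and then the much lower sustained rate that your SSD can actually handle.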
 
Last edited:
Thanks for taking the time to write that explanation. I haven't tried installing e.g. Windows on bare metal just to see the disk's performance prior to installing Proxmox. It's a Samsung 980 NVMe M.2, so it should be performing a bit better than what I'm seeing :)
Would it make sense for me to recreate the disk as LVM instead of ZFS, or is that pointless?
Can I test the SSD's speed directly from the Proxmox OS?

Just to tell you my "level of expertise": I'm not an expert in Linux, but I can learn if I get a few pointers. My Proxmox server is just a home lab server, so nothing crucial.
 
Thanks for taking the time to write that explanation. I haven't tried installing e.g. Windows on bare metal just to see the disk's performance prior to installing Proxmox. It's a Samsung 980 NVMe M.2, so it should be performing a bit better than what I'm seeing :)
You also get a lot of write amplification using ZFS and virtualization in general. My benchmarks, for example, showed write amplification between factor 3 and 81 for the same VM and disks, just with different types of writes (async/sync, sequential/random, small/big blocksize). So let's say you got a write amplification of factor 10. That means if your guest writes with 50 MB/s, your SSD is actually writing with 500 MB/s. If you got a write amplification of factor 81 and your guest is writing with 50 MB/s, your host would write with 4050 MB/s...
So a factor 10 write amplification will also make your guest write 10 times slower.
And write amplification isn't only reducing performance, it will also kill your SSD faster. Your 980 PRO only has 600 TBW. I, for example, saw an average write amplification of around factor 20. So after writing 30 TB inside the guest, the host would have written 600 TB and the SSD may die (or at least I would lose the 5-year warranty). That's one reason why enterprise SSDs can handle much more writes compared to consumer SSDs. They are built to perform better and survive longer with problems like write amplification, which you don't encounter at that scale if you are not running server stuff like advanced filesystems (ZFS) or nested filesystems like when using virtualization.
Depending on the workload, every NVMe SSD can get down to below 10 MB/s. I tested a pool of 8 SSDs and wasn't able to write with more than 2 MB/s when running the worst workload.
But if you just copy a big file, the write amplification shouldn't be that bad, because these should be big sequential async writes, where I saw write amplification in the factor 3-4 range.
So you basically can't compare a bare-metal Windows installation with a virtualized one, because the write/read amplification is lower on bare metal.
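If you want a rough idea of your own write amplification, one way is to compare what the guest writes with what the NVMe reports via SMART before and after a test copy. A sketch, assuming the drive shows up as /dev/nvme0 (the device name may differ on your system):

smartctl -a /dev/nvme0 | grep -i "data units written"

Note the value, copy your 10 GB test file inside the guest, then run the command again. Each data unit should correspond to 1000 x 512 bytes, so the difference tells you roughly how much the SSD really wrote for those 10 GB.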
Would it make sense for me to recreate the disk as LVM instead of ZFS, or is that pointless?
LVM should be faster, save RAM, and the SSD should survive longer because the write amplification should be lower. But you would also lose a lot of features that ZFS offers, like compression on block level, replication, data corruption detection (but it can't repair the found corruption anyway, because you don't have parity data with only one drive) and so on.
It's by the way not recommended to use consumer SSDs like your Samsung 980 with ZFS, because they die too fast and are too slow if you try to run server workloads like running DBs.
Can I test the SSD's speed directly from the Proxmox OS?
The best tool would be fio, but it's also the most complicated one. I spent over 30 hours benchmarking with it and still didn't fully understand what exactly is going on in detail. You could run a quick pveperf, but that won't help you much to get absolute numbers.
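As a rough starting point, a simple sequential write test with fio on the host could look like this (just a sketch; the path assumes your pool is mounted at /rpool, and you should delete the test file afterwards):

fio --name=seqwrite --filename=/rpool/fio-test.file --rw=write --bs=1M --size=10G --ioengine=psync --end_fsync=1
rm /rpool/fio-test.file

That mimics the big file copy you did in the guest, but without the VM and NTFS layers in between, so it gives you a baseline for what the SSD plus ZFS can do on the host.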
 
Last edited:
  • Like
Reactions: Soilage
LVM should be faster, save RAM, and the SSD should survive longer because the write amplification should be lower. But you would also lose a lot of features that ZFS offers, like compression on block level, replication, data corruption detection (but it can't repair the found corruption anyway, because you don't have parity data with only one drive) and so on.
It's by the way not recommended to use consumer SSDs like your Samsung 980 with ZFS, because they die too fast and are too slow if you try to run server workloads like running DBs.
That's a huge letdown that I didn't know all of this before I made my purchase. I honestly thought that my main focus should be the CPU and a large amount of RAM, and the SSD would just give me bare metal-ish performance.

I guess I just have to live with it for now.

Thanks for all your help. I highly appreciate it!
 
8 GB RAM for the VM, but how much RAM is in the host?
Any other VM usage?
Because ZFS uses 50% of the host's RAM by default...
Retry with the ARC limited to 1 GB:
#echo options zfs zfs_arc_max=1024000000 > /etc/modprobe.d/zfs.conf
#update-initramfs -u
#reboot
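after the reboot you can verify the limit was applied, for example with (arc_summary should come with the ZFS tools on Proxmox):

cat /sys/module/zfs/parameters/zfs_arc_max
arc_summary | grep -i "arc size"

the first command should print 1024000000, and arc_summary shows how much the ARC is currently using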
 
For the last 2 hours I have been doing Windows guest performance testing for SSD-backed storage.

I do plan to post my results, but to summarise:

Volumes are faster than dataset+raw, and 64k is considerably better than 4k, even with random 4k I/O.

Uncached write performance drags even on the best configuration I tested, though (excluding writeback with a volume, which seems to still cache using another caching layer).
 
Last edited:
  • Like
Reactions: Dunuin
8 GB RAM for the VM, but how much RAM is in the host?
Any other VM usage?
Because ZFS uses 50% of the host's RAM by default...
Retry with the ARC limited to 1 GB:
#echo options zfs zfs_arc_max=1024000000 > /etc/modprobe.d/zfs.conf
#update-initramfs -u
#reboot
I have a total of 32 GB in the host, and for testing purposes I have turned off the other VM I have running.
I have limited it to 1 GB as you told me, but I'm still having the same issue. I'm only copying the file from the network, just to have a write speed test, and it swings between 0 and 50 MB/s.
 
For the last 2 hours I have been doing Windows guest performance testing for SSD-backed storage.

I do plan to post my results, but to summarise:

Volumes are faster than dataset+raw, and 64k is considerably better than 4k, even with random 4k I/O.

Uncached write performance drags even on the best configuration I tested, though (excluding writeback with a volume, which seems to still cache using another caching layer).
As someone who has just started with Proxmox, can you tell me where to change this, please?
 
Assuming you are using a recent Proxmox, as it looks like how this is done has changed.

The volblocksize should be set when you create a volume; ashift can only be set when you make a pool. First you need to be sure you have a good ashift for your hardware. I am not a fan of auto-detection, but thankfully Proxmox lets you manually configure it on installation. Check your ashift with this command:

zpool get ashift rpool

If it's 9, start again and reinstall Proxmox (or remake the pool if it's a different pool to rpool); hopefully it is 12 or 13.

Next we make the volume. Log in to the Proxmox UI.
Go to Datacenter, then Storage.
Click Add.
Choose ZFS.
Type a name of your choosing in the ID box to name the volume, e.g. 'rpool-64k-vol'.
Select the pool in the "ZFS Pool" box.
In the bottom-right box, type the volblocksize, e.g. '64k'.
That's it, you're done. Now when you add a new disk you can choose this volume; make sure the cluster size matches on the OS drive when you format it.

All my testing was done with lz4 compression, which I think is the best to use unless you're chasing percentages on archival drives.

If you want to be sure it's the same:

Run zfs set compression=lz4 on the volume; if you do this before you create disks, they will inherit the value.
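To double-check afterwards from the host shell, something like this should do (pool and disk names are just examples, e.g. rpool and a disk of VM 100; yours may differ):

zfs get compression rpool
zfs set compression=lz4 rpool
zfs get volblocksize rpool/vm-100-disk-0
cat /etc/pve/storage.cfg

The last command shows the storage entries Proxmox created, including the blocksize you typed in the UI.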
 
Last edited:
That's a huge letdown that I didn't know all of this before I made my purchase. I honestly thought that my main focus should be the CPU and a large amount of RAM, and the SSD would just give me bare metal-ish performance.

I guess I just have to live with it for now.

Thanks for all your help. I highly appreciate it!
It really depends on your workload. If you really want to torture it, you could kill that SSD within weeks. But if you mainly use LXCs, don't run any DBs and so on, that SSD might work for many years. Just don't forget to monitor your SMART values so you see early if the wear gets too high. My homeserver, for example, writes 900 GB per day while idling. Most of that is just caused by wear leveling and by writing logs and metrics to DBs, which gets amplified like hell. I also bought 6 consumer SSDs first and replaced them later with enterprise SSDs because I calculated that they wouldn't survive a single year. So I'm now using them inside my gaming PC, where they will possibly survive for decades even if I'm writing much more real data to them (all the big Steam downloads and so on).
If you buy enterprise SSDs second hand, they aren't that expensive. I, for example, paid around 125€ per TB (but SATA, not NVMe) and the SSD still has around 20,500 TBW left. That Samsung SSD was more expensive and only has 600 TBW. So the enterprise SSDs sound expensive, but if you look at the price per TBW they are super cheap.
 
Last edited:
If it's 9, start again and reinstall Proxmox (or remake the pool if it's a different pool to rpool); hopefully it is 12 or 13.
Mine's 12. Does this mean that I don't need to reinstall Proxmox?
Next we make the volume. Log in to the Proxmox UI.
Go to Datacenter, then Storage.
Click Add.
Choose ZFS.
And in my case, I need to wipe my already created ZFS disk and recreate it per your instructions?
That's it, you're done. Now when you add a new disk you can choose this volume; make sure the cluster size matches on the OS drive when you format it.
Sorry if I sound stupid, but what does the above mean? When I initialize the disk in the Windows setup, is there a cluster size?

I'll try it sometime this weekend. I just need to back up my VMs first :)
 
It really depends on your workload. If you really want to torture it, you could kill that SSD within weeks. But if you mainly use LXCs, don't run any DBs and so on, that SSD might work for many years. Just don't forget to monitor your SMART values so you see early if the wear gets too high. My homeserver, for example, writes 900 GB per day while idling. I also bought 6 consumer SSDs and replaced them with enterprise SSDs because I calculated that they wouldn't survive a single year. So I'm now using them inside my gaming PC.
If you buy enterprise SSDs second hand, they aren't that expensive. I, for example, paid around 125€ per TB (but SATA, not NVMe) and the SSD still has around 20,500 TBW left. That Samsung SSD was more expensive and only has 600 TBW. So the enterprise SSDs sound expensive, but if you look at the price per TBW they are super cheap.
Thanks for the hint about the SMART wearout status. I'll keep an eye on it.
I have an Ubuntu server running 24/7 with a handful of Docker containers. Some of them can be pretty active, though.
The other VMs are a couple of Windows machines that I only turn on when I need them. One of them is a development machine that feels incredibly sluggish because of this SSD issue.

Besides the much higher TBW on the enterprise-grade SSDs you have, are you experiencing a "much higher" speed compared to your previous consumer SSDs?
 
Mine's 12. Does this mean that I don't need to reinstall Proxmox?

And in my case, I need to wipe my already created ZFS disk and recreate it per your instructions?

Sorry if I sound stupid, but what does the above mean? When I initialize the disk in the Windows setup, is there a cluster size?

I'll try it sometime this weekend. I just need to back up my VMs first :)

Yeah, you're good, no need to reinstall Proxmox.

If you want 64k clusters on the disk then yes, recreate the disk.

When you initialise the disk inside Windows, it asks you to choose the partition type. After you pick that, right-click on the partition in Disk Management and select "New Simple Volume"; on the screen where you can type the volume name and choose quick format, there is an allocation unit size option, so pick 64K (65536 bytes) there.

NTFS volumes that are not 4K lose compression (ZFS compression is better), lose native encryption (no ransomware, thank you) and I think it might also affect defrag, but I never defrag disks anyway.
 
Last edited:
Mine's 12. Does this mean that I don't need to reinstall Proxmox?
ashift=12 should be fine; that means a 4K blocksize. Every manufacturer is lying and will tell you that the SSD uses a 512B or 4K blocksize, but in reality it should be way higher, more like 8K/16K/32K or something like that. And an SSD can only write a block if it first erases a complete row, and a row might be 128K or even up into the MBs. And writing isn't what's damaging the SSD, it's the erasing. Every write is a read-erase-write, and if your row is 128K and you want to write a 4K block, it will read 128K, erase 128K and write 128K.
Or in short:
Just stick with the ashift of 12. Manufacturers want the SSDs to do well in benchmarks to sell more stuff, and 99.9% of all people will use a 4K blocksize, so I'm sure they optimised it for 4K to get high numbers that are good for selling stuff.
And in my case, I need to wipe my already created ZFS disk and recreate it per your instructions?
Yes, but your virtual disk, not your physical one. And restoring a VM from backup will also recreate the virtual disk, so you don't need to start your VM from scratch... except if you also want to change your NTFS cluster size.
Sorry if I sound stupid, but what does the above mean? When I initialize the disk in the Windows setup, is there a cluster size?

I'll try it sometime this weekend. I just need to back up my VMs first :)
You can choose a cluster size while formatting a disk. I'm not sure if the Windows installer allows you to change that; I can't remember ever having seen it.
 
Thanks for the hint about the SMART wearout status. I'll keep an eye on it.
I have an Ubuntu server running 24/7 with a handful of Docker containers. Some of them can be pretty active, though.
The other VMs are a couple of Windows machines that I only turn on when I need them. One of them is a development machine that feels incredibly sluggish because of this SSD issue.

Besides the much higher TBW on the enterprise-grade SSDs you have, are you experiencing a "much higher" speed compared to your previous consumer SSDs?
Not sure, I didn't do benchmarks back then. But in general consumer SSDs are designed more for reads and not with writes in mind. And they should be optimized for high, short bursts of IO instead of medium IO 24/7. So on paper enterprise SSDs might look slower, with lower bandwidth and IOPS, but a consumer SSD might only deliver that high speed for some minutes or seconds until the performance crashes because the RAM cache and SLC cache get full and performance drops to terrible values. An enterprise SSD's performance shouldn't drop that low.
Right now I have 10 enterprise SSDs in my homeserver (paid 10-30€ per SSD), and each SSD has a 4 GB DDR3 RAM chip for caching. So all SSDs together have more RAM than your complete server. And those are just the small 100 and 200 GB SSDs; the bigger models with more capacity have even more RAM.

And a big difference is sync writes. Enterprise SSDs have power-loss protection (a built-in backup "battery"), so they can quickly save cached data from the volatile RAM into the non-volatile NAND if a power outage occurs. Consumer/prosumer SSDs don't have such a backup "battery", and all data in the SSD's internal RAM cache would be lost. If an application needs to make sure that important data is really safely stored, it will do a sync instead of an async write. A consumer SSD knows that it would lose all data if a power outage occurs, so it can't cache the data in RAM and will directly write it to the NAND cells without any caching, so the data can't be lost.

But now remember what I wrote earlier about how SSDs work.

Let's say the SSD reports that it uses 4K blocks, but internally uses 16K blocks for reads/writes and a 128K row for erasing.

Now you want to sync write 32x 4K blocks. An enterprise SSD can use the RAM cache and will store these 32x 4K blocks in RAM and immediately report back as "securely written", even though it hasn't done a single write to the NAND yet. Then it will merge these 32x 4K blocks in RAM into 8x 16K blocks. Then it will erase a 128K row and write these 8x 16K blocks all at once. So in total it erased 128K, wrote 128K and read 0K to store a sum of 128K (32x 4K) of data.

Here is how a consumer SSD handles this: because it can't cache stuff in RAM, it will write each of the 32 4K blocks one after another.
Read 128K from NAND to RAM, erase 128K, write 128K. All that to write a single 4K block. Now it will report back "I saved the first block, send me another one". The host sends the second 4K block. The SSD will again read, erase and write 128K and report back... this happens 32 times until all 32 blocks have been written. So in total the consumer SSD will read 4M (32x 128K), erase 4M and write 4M to store only 128K (32x 4K) of data.
So you got a reaaaaally bad read and write amplification here.

That's why consumer SSDs are so terrible as DB storage, because DBs mainly do small 8K/16K/32K sync writes.
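If you ever want to see that difference yourself, a small 4K sync-write test with fio is a decent way to torture a drive like a DB would (again just a sketch; run it on the host against the pool's mountpoint, here assumed to be /rpool, and delete the file afterwards):

fio --name=syncwrite4k --filename=/rpool/fio-sync.file --rw=randwrite --bs=4k --size=1G --ioengine=psync --fsync=1 --runtime=60 --time_based
rm /rpool/fio-sync.file

The --fsync=1 forces a flush after every single 4K write, so an SSD without power-loss protection has to hit the NAND for each block, which is exactly the worst case described above.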
 
Last edited:
  • Like
Reactions: erikk
I apologise; having now posted my test data, 4k seems to be better for writes on NTFS with SSDs. But bear in mind my results are from a single-drive ZFS pool. Volumes do seem much more capable than datasets+raw.

If using writeback, performance goes through the roof.

So I would go with either nocache, 4k, volumes or writeback, 64k, volumes.

On 64k volumes I get the same speed with both cluster sizes, but the I/O delay is about 1/7th with 64k clusters.

I will do forced flush testing at some point to ensure writeback honours them.
 
Last edited:
Yes, but your virtual disk, not your physical one. And restoring a VM from backup will also recreate the virtual disk, so you don't need to start your VM from scratch... except if you also want to change your NTFS cluster size.
Just to make sure that I'm getting this: under my PVE node and Disks, I have two physical disks. One is used for the PVE OS and the other is the Samsung NVMe. The Samsung is used for my VMs and has the ZFS partition. Should I highlight the device itself and then do a "Wipe Disk", or are you talking about something else?
 
Here is how a consumer SSD handles this: because it can't cache stuff in RAM, it will write each of the 32 4K blocks one after another.
Read 128K from NAND to RAM, erase 128K, write 128K. All that to write a single 4K block. Now it will report back "I saved the first block, send me another one". The host sends the second 4K block. The SSD will again read, erase and write 128K and report back... this happens 32 times until all 32 blocks have been written. So in total the consumer SSD will read 4M (32x 128K), erase 4M and write 4M to store only 128K (32x 4K) of data.
So you got a reaaaaally bad read and write amplification here.

That's why consumer SSDs are so terrible as DB storage, because DBs mainly do small 8K/16K/32K sync writes.
Thanks a billion for that explanation. It makes perfect sense, and I should invest in an enterprise-grade SSD when I see wearout!
 
I apologise; having now posted my test data, 4k seems to be better for writes on NTFS with SSDs. But bear in mind my results are from a single-drive ZFS pool. Volumes do seem much more capable than datasets+raw.
I have a single-drive ZFS pool as well. Can you explain what you mean by volumes? Can you spell it out for me, please? :)
When I initially installed PVE, I created my single-drive disk as ZFS, named it, and went on to create VMs with their hard disks on that named ZFS disk.
When I recreate it, what should I do differently?
Thanks!