Windows guest with very low SSD performance

If you followed what I said when adding ZFS, you are using volumes already; it's actually more work to use a dataset, as that's not the method supported by Proxmox. :)

I have since moved the OS drive from a dataset to a volume on the Windows VM I am testing, and it's definitely faster: it boots in under 2 seconds now instead of about 5, and feels almost like bare metal in use. It was only raw in the first place because I had migrated it from ESXi and hadn't bothered to convert it over.

You can confirm this when you add a disk in the hardware config of the VM: if all of the raw, qcow2, etc. format options are greyed out, then you are using a volume.
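If you'd rather check from the shell than the GUI, here's a rough Python sketch (my own, not a Proxmox tool) that looks for a VM's disks under /dev/zvol, where Proxmox places zvol-backed disks. The pool name and VM ID are assumptions you'd replace with your own.

```python
#!/usr/bin/env python3
"""Rough check whether a Proxmox VM's disks are ZFS volumes (zvols).

Assumptions: the pool is named "rpool" and the VM ID is 100 -- adjust both.
Proxmox names zvol-backed disks vm-<vmid>-disk-<n> under /dev/zvol/<pool>/.
"""
import glob

POOL = "rpool"   # assumed pool name
VMID = 100       # assumed VM ID

zvols = glob.glob(f"/dev/zvol/{POOL}/vm-{VMID}-disk-*")
if zvols:
    print("Volume-backed (zvol) disks found:")
    for path in sorted(zvols):
        print(" ", path)
else:
    print("No zvols found for this VM -- its disks are probably "
          "raw/qcow2 files on a dataset or another storage type.")
```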
 
Not sure, I didn't run benchmarks back then. But in general consumer SSDs are designed more for reads than for writes, and they are optimized for short, high bursts of IO rather than sustained medium IO 24/7. So on paper an enterprise SSD might look slower, with lower bandwidth and IOPS, but a consumer SSD may only deliver its high speed for some seconds or minutes until the RAM cache and SLC cache fill up and performance drops to terrible values. An enterprise SSD's performance shouldn't drop that low.
Right now I have 10 enterprise SSDs in my home server (paid 10-30€ per SSD) and each SSD has a 4GB DDR3 RAM chip for caching. So all the SSDs together have more RAM than your complete server, and those are just the small 100 and 200GB models; the bigger-capacity models have even more RAM.

And a big difference is sync writes. Enterprise SSDs have power-loss protection (a built-in backup "battery") so they can quickly save cached data from the volatile RAM to the non-volatile NAND if a power outage occurs. Consumer/prosumer SSDs don't have such a backup "battery", so all data in the SSD's internal RAM cache would be lost. If an application needs to make sure that important data is really safely stored, it does a sync write instead of an async write. A consumer SSD knows it would lose that data if a power outage occurred, so it can't cache sync writes in RAM and has to write them directly to the NAND cells without any caching, so the data can't be lost.
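To make the sync vs. async difference concrete, here's a small Python sketch of what an application does when it insists data reach stable storage: it calls fsync() after each block, which is exactly the operation a drive without power-loss protection cannot answer from its RAM cache. The file path and block count are just example values.

```python
import os
import time

PATH = "/tmp/sync-demo.bin"   # example path; put it on the SSD you care about
BLOCK = os.urandom(4096)      # one 4K block of random data

def write_blocks(count, sync):
    """Write `count` 4K blocks, optionally fsync()ing after each one."""
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    start = time.perf_counter()
    for _ in range(count):
        os.write(fd, BLOCK)
        if sync:
            os.fsync(fd)      # force the block to stable storage, like a DB would
    os.close(fd)
    return time.perf_counter() - start

print("async:", write_blocks(1000, sync=False), "s")
print("sync :", write_blocks(1000, sync=True), "s")
os.unlink(PATH)
```

On a consumer SSD the gap between the two numbers is usually dramatic; on a drive with power-loss protection it is much smaller.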

But now remember what I wrote earlier about how SSDs work.

Let's say the SSD reports 4K blocks, but internally it uses 16K blocks for reads/writes and a 128K row for erasing.

Now you want to sync write 32x 4K blocks. An enterprise SSD can use its RAM cache: it stores the 32x 4K blocks in RAM and immediately reports them back as "securely written", even though it hasn't done a single write to the NAND yet. It then merges those 32x 4K blocks in RAM into 8x 16K blocks, erases one 128K row and writes the 8x 16K blocks all at once. So in total it erased 128K, wrote 128K and read 0K to store 128K (32x 4K) of data.

Here is how a consumer SSD handles this: because it can't cache the data in RAM, it writes each of the 32x 4K blocks one after another.
Read 128K from NAND into RAM, erase 128K, write 128K. All that to write a single 4K block. Now it reports back "I saved the first block, send me another one". The host sends the second 4K block, the SSD again reads, erases and writes 128K and reports back... this happens 32 times until all 32 blocks have been written. So in total the consumer SSD reads 4M (32x 128K), erases 4M and writes 4M to store only 128K (32x 4K) of data.
So you get really bad read and write amplification here.
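The arithmetic above, written out. The 4K/16K/128K sizes are the made-up example numbers from this post, not the specs of any particular SSD:

```python
# Example numbers from the post above -- not real SSD specs.
logical_block = 4      # KiB, the block size the SSD reports
erase_row     = 128    # KiB, internal erase unit
blocks        = 32     # 32x 4K sync writes = 128 KiB of payload

payload = blocks * logical_block                 # 128 KiB

# Enterprise SSD: collect everything in RAM cache, then one erase + one write.
ent_read, ent_erase, ent_write = 0, erase_row, erase_row

# Consumer SSD: every 4K sync write triggers a full read/erase/write of a row.
con_read  = blocks * erase_row                   # 4096 KiB
con_erase = blocks * erase_row
con_write = blocks * erase_row

print(f"payload written:      {payload} KiB")
print(f"enterprise write amp: {ent_write / payload:.0f}x")
print(f"consumer write amp:   {con_write / payload:.0f}x "
      f"(plus {con_read} KiB read and {con_erase} KiB erased)")
```

With these example numbers the consumer drive ends up with 32x write amplification for the same 128K of payload.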

That's why consumer SSDs are so terrible as DB storage: DBs mainly do small 8K/16K/32K sync writes.

Fair points on the SLC cache. Luckily I am using MLC drives (2x 830 and 1x 850 Pro; that model got tested to oblivion and back on writes by some reviewers), and I have also manually over-provisioned them to compensate. It's only a personal project, but if I were setting something up for work, then given all the write amplification issues and the power-loss protection of enterprise drives, I would probably go that route. Use case is important; for a personal project I don't mind, at least on MLC drives.
 
About consumer SSDs: they are pretty bad with ZFS or Ceph, because both need fast sync writes for the ZFS/Ceph journal.
Datacenter SSDs have a supercapacitor to protect writes held in the internal buffer, so they are a lot faster with sync writes.
(Some consumer SSDs are as slow as an HDD, around 200-400 IOPS for 4K sync writes, vs 10,000-20,000 IOPS for a datacenter SSD.)


(Be careful with Samsung SSDs sold as "enterprise", like the 970/980: they are consumer drives too, it's just marketing.)


https://www.sebastien-han.fr/blog/2...-if-your-ssd-is-suitable-as-a-journal-device/
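The linked post uses fio for the journal test. If you just want a quick-and-dirty Python version of the same idea (4K writes with every write forced to be synchronous, then dividing by elapsed time to get IOPS), something like the sketch below works. The test file path is an assumption; point it at the pool/SSD you want to measure, and treat the result as a rough number, not a replacement for fio.

```python
import os
import time

PATH = "/tank/sync-write-test.bin"   # assumed path on the pool/SSD under test
BLOCK = os.urandom(4096)
COUNT = 2000

# O_DSYNC turns every write into a sync write, roughly the behaviour
# the journal/SLOG workload imposes on the drive.
fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DSYNC, 0o644)
start = time.perf_counter()
for _ in range(COUNT):
    os.write(fd, BLOCK)
elapsed = time.perf_counter() - start
os.close(fd)
os.unlink(PATH)

print(f"{COUNT} sync 4K writes in {elapsed:.2f}s "
      f"-> {COUNT / elapsed:.0f} IOPS")
```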
 
I have a single-drive ZFS pool as well. Can you explain what you mean by volumes? Can you spell it out for me, please? :)
When I initially installed PVE, I set up my single drive as ZFS, named the pool and went on to create VMs with their hard disks on that named ZFS storage.
When I recreate it, what should I do differently?
Thanks!
"Volumes" here means zvols.
Just to make sure that I'm getting this: Under my PVE node and Disks I have two physical disks. One is used for the PVE OS and the other is the Samsung NVME. The Samsung is used for my VMs and has the ZFS partition. Should I highlight the device itself and then do a "Wipe disk", or are you talking about something else?
No, that would erase your real/physical SSD. If you only want to change the volblocksize and not the ashift, you just need to recreate the virtual disks (in other words the zvols/volumes your guests are using, not the real disk).
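If it helps, here's a rough Python sketch that lists the existing zvols and their volblocksize by shelling out to the standard `zfs get` command, so you can see which virtual disks would need to be recreated. Nothing Proxmox-specific is assumed beyond the zvol naming.

```python
import subprocess

# List every zvol on the system with its volblocksize.
# -H (scripted output), -t volume and -o name,value are standard OpenZFS options.
out = subprocess.run(
    ["zfs", "get", "-H", "-t", "volume", "-o", "name,value", "volblocksize"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    name, volblocksize = line.split("\t")
    print(f"{name}: volblocksize={volblocksize}")
```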
 
Hey @Dunuin @chrcoluk.
First of all, thanks a lot for all your help and insights. I tried every trick you gave, but nothing worked and the SSD still had some serious bottlenecks.

I wiped my drive, installed Windows on bare metal and ran tests there. The speed was pretty good and the drive didn't seem to have any problems.
I reinstalled PVE (I didn't have to, but I did it anyway) and recreated the ZFS pool with 4K blocks. I installed Windows in a VM and the SSD speed is now both acceptable and stable, so it no longer throttles and leaves the OS hanging.

I honestly have no idea what was wrong before or what fixed it now, but I'm satisfied with the result :)
 