[SOLVED] Windows VM I/O problems only with ZFS

RAIDz works but has terrible write amplification, padding overhead and low performance. There are tons of threads about this, e.g.:

https://forum.proxmox.com/threads/about-zfs-raidz1-and-disk-space.110478/post-475702
https://forum.proxmox.com/threads/raidz-out-of-space.112822/post-487183

It's just the way RAIDz + zvols work. It can be mitigated with a bigger volblocksize and more disks in the RAIDz, but IMHO it isn't worth it: just buy more/bigger drives and stick to RAID10, especially when the total space needed is low and you are using consumer drives, since adding more/bigger drives is cheap.
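
If you do stay on RAIDz and want to experiment with a bigger volblocksize, a minimal sketch of where that is usually set on PVE (the storage ID and zvol name below are placeholders; the change only affects newly created disks):

Code:
# Raise the default volblocksize used for new zvols on a ZFS storage
pvesm set local-zfs --blocksize 16k

# Check what an existing VM disk actually uses (not changed retroactively)
zfs get volblocksize rpool/data/vm-100-disk-0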

Even when using LZ4 compression?
I've read many posts arguing that with compression on ZFS, everything changes on this subject.
(Even the commonly suggested volblocksize tuning based on the ashift and the number of disks in the pool minus parity seems to lose its meaning with compression on.)

On these topics I have read a lot of information, tutorials, tuning guides and well-detailed posts on this forum... often in conflict with each other.

It's really difficult to understand which is actually the best choice, considering also that it depends a lot on the load, the type of I/O and the configuration of the VMs... which will certainly vary a lot within a farm, with different VMs for different purposes...
 
Not a real expert here, but AFAIK compression doesn't change the padding issue, as padding is applied after the data is compressed. Compression will potentially make the data to be written smaller, so ZFS applies padding to the amount of data you actually write. Even if that write needs a few extra sectors of padding that the original, uncompressed write would not have needed, you still use less disk space with compression because you need fewer sectors in total.

I've only had a couple of VMs in production for a few months using RAIDZ1 + compression, with 6 drives and a 16k volblocksize, and the theory proved right: the overhead was around 30%, comparing du -sh inside the VMs with the amount of space used in the zpool.
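
If anyone wants to reproduce that comparison on the host side, a minimal sketch of the properties I'd look at (pool/zvol names are placeholders):

Code:
# Logical data written by the guest vs. space actually allocated (including padding)
zfs get -p logicalused,used,compressratio tank/vm-100-disk-0

# Same view across the whole pool
zfs list -o name,used,referenced,compressratio -r tank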

You've probably found this already, but I found this explanation quite helpful:

https://www.reddit.com/r/zfs/comments/ujba3i/does_allocation_overhead_still_exist_with/
 
  • Like
Reactions: EdoFede
It's just a lab, but for some machines we are going this way because of customers' budget constraints.
(I think it's better than spinning SATA drives anyway.)

We are testing the "worst config" now

Thanks
ZFS cannot be tested without proper disks, even in a lab; the experience will be misleading.
ZFS requires datacenter SSDs (with PLP/capacitors and plenty of TBW endurance), or many HDDs and a lot of RAM.
With "budget constraints" you can't use ZFS and should go with the default ext4/LVM-Thin plus the fast daily backups that PBS provides.
 

I don't totally agree.

ZFS was created to work with spinning disks that are much slower than the SSDs used in my tests, and in the past I've used it many times even with 5.4k rpm drives without issues.
After reducing "zfs_dirty_data_max" the issue is gone and the system simply runs as fast as the disks permit, without trouble.
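
For reference, this is roughly the kind of check/change involved; a sketch, where the 256 MiB value is just an example, not a recommendation:

Code:
# Current limit in bytes
cat /sys/module/zfs/parameters/zfs_dirty_data_max

# Lower it for the running session only (example: 256 MiB)
echo 268435456 > /sys/module/zfs/parameters/zfs_dirty_data_max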

I can agree that on an enterprise-grade production VM system it is not a suitable solution, but saying that you cannot even experiment with normal consumer-grade SSDs in a test lab seems exaggerated to me.

For "experiment" I mean testing the PVE/PBS products, not doing some kind of comparison or performance benchmarks. (This was never the point of my thread).
 
Agree, but I'm curious too as I don't know if that would change anything in the original problem.

I've done some tests, reverting the server to the original config (no ZFS tuning parameters) and playing with the ZFS sync parameter.

Here are the results.

sync=standard (default setting)
ZFS Sync test - Sync standard.png
sync=always (all writes treated as sync writes)
ZFS Sync test - Sync always.png

sync=disabled (all writes treated as async writes)
ZFS Sync test - Sync disabled.png

With the last option, the process gets notified that the write is done even when it isn't (so it is very unsafe).
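
For anyone repeating the test, a sketch of how the sync property can be switched between runs (the dataset path is a placeholder; apply it to the zvol backing the test disk):

Code:
# Check the current setting
zfs get sync ZFS-Lab2/vm-104-disk-0

# Switch between the three modes between benchmark runs
zfs set sync=always ZFS-Lab2/vm-104-disk-0
zfs set sync=disabled ZFS-Lab2/vm-104-disk-0
zfs set sync=standard ZFS-Lab2/vm-104-disk-0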

Bye,
Edoardo
 
  • Like
Reactions: VictorSTS
You're welcome!

Yes, interesting, even if the numbers aren't exactly great in terms of performance.
Some testing with "fio" would probably give slightly more accurate results.
In any case, they should be taken as a comparative test within my setup, not as absolute results.
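
In case it helps, a minimal fio sketch that roughly mimics the sequential-write and 4K random-write cases on the host (the file path and sizes are just example placeholders):

Code:
# Sequential write, 1 MiB blocks
fio --name=seqwrite --filename=/ZFS-Lab2/fio-test --size=4G --rw=write \
    --bs=1M --ioengine=libaio --iodepth=8 --end_fsync=1

# Random 4K sync writes (closer to the worst case seen inside the VM)
fio --name=randwrite --filename=/ZFS-Lab2/fio-test --size=4G --rw=randwrite \
    --bs=4k --ioengine=libaio --iodepth=32 --sync=1 --runtime=60 --time_based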
 
I just wanted to thank the OP and others with the zfs_dirty_data_max suggestion.

I have been working on stabilizing Windows VMs on AMD "Zen 4" EPYC 9004 on:
- Dell R7625
- 2x Zen AMD EPYC 9554p (128 physical cores 3.1-3.75 GHz)
- 16x PCIe Gen 3 enterprise SSDs in ZFS soft RAID
- 20x128 GB DDR5
- Hosting 35-60 VMs with lots of data, mostly Windows and Linux build VMs that can merge TBs of qcow2 disks with Packer

We've had recurring freezes caused by a number of issues, and the last one seems to be correlated with disk I/O.

While I'm not sure whether zfs_dirty_data_max fixes it, it brought an enormous improvement in our disk throughput and responsiveness. We used to have AzDo agents disconnect, brick builds and reconnect under high load; I just ran 2x the Packer builds that would previously have brought down agents, without a hitch, and it built 2x faster despite running concurrently.

I don't know why; I guess ZFS must hit some sort of bottleneck when it decides to flush the dirty writes if the size is too large in some configurations.

Ours was defaulting to about 4 GB. I changed it to about 1 GB after the initially discussed 50 MB value worked but dramatically reduced disk I/O throughput.

Permanent:
Code:
# /etc/modprobe.d/zfs.conf
options zfs zfs_dirty_data_max=1073741824

Code:
update-initramfs -u -k all
reboot now

Current OS session only:
Code:
echo "1073741824" > /sys/module/zfs/parameters/zfs_dirty_data_max

I also had to disable ZFS compression to make this happen: qemu-img merging TBs of data with light ZFS compression consumes all of the host's CPU, yes, all 128 physical high-frequency latest-gen cores. So don't let people hand-wave that light ZFS compression has no major impact on performance; it is not true for all use cases.
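
For reference, a sketch of how compression can be checked and turned off on the relevant dataset (the dataset name is a placeholder; existing data stays compressed until rewritten):

Code:
# See what is currently in effect
zfs get compression,compressratio rpool/data

# Disable for new writes only (does not decompress existing blocks)
zfs set compression=off rpool/data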
 
  • Like
Reactions: _gabriel
Hi,
I'm a new Proxmox user.

I've built a small lab to evaluate PVE and PBS, with the intention of replacing our Hyper-V infrastructure (90 VMs across 3 sites, many in replica).

We got two Dell R640 servers for PVE and another one for PBS.
For this testing purpose we are using 4x WD Blue 1TB SSDs (model WDS100T2B0A) that were lying around.
Two per server (+ separate OS disks) in a ZFS mirror config, with ashift=12.
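
In case it's useful to anyone reproducing the setup, a sketch of how an equivalent mirror could be created from the CLI (the device IDs below are placeholders):

Code:
# Two-disk mirror with 4K sectors forced via ashift=12
zpool create -o ashift=12 ZFS-Lab2 mirror \
    /dev/disk/by-id/ata-WDC_WDS100T2B0A_EXAMPLE1 \
    /dev/disk/by-id/ata-WDC_WDS100T2B0A_EXAMPLE2

# Confirm the sector-size setting took effect
zpool get ashift ZFS-Lab2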

HW config (one server)
Dell R640
2x Intel Xeon Gold 5120
64GB RAM (planned to be expanded to 256GB)
Dell PERC controller in passthrough mode


The idea is to run two PVE nodes in an HA cluster with ZFS replication between nodes, a remote replica for disaster recovery purposes (for critical VMs), and a third local server with PBS for backups (also used as a QDevice for HA quorum).

I configured a network ring (with one 2x10G Ethernet card per server), with RSTP over Open vSwitch between the 2 PVE nodes and the PBS server.
Everything works as expected, to our great satisfaction.

But...
While testing with some VMs (for the most part they will be Windows servers) I ran into a major stability issue during high I/O.

Write I/O performance is very poor and the VM becomes very (very) slow to user interaction and other actions during a default CrystalDiskMark test.
The benchmark results also drop to 0.00 in one or more write tests (not always repeatable) at the end.

The CrystalDiskMark benchmark was run after noticing anomalous behaviour while duplicating a simple 1GB file inside the VM (the guest operating system froze a few seconds after starting the copy and remained unresponsive until it finished).

This behaviour happens only if I use ZFS as the storage engine, with any combination of storage parameters except the "Writeback (unsafe)" cache.
And only with the Windows write cache active and buffer flushing enabled (flag unchecked), which is the "standard" Windows configuration.

Tests I've made to figure out the problem:
- Every combination of Windows write-caching settings inside the VM (problems with the write cache active, as described)
- Every combination of VM cache settings for the virtual disks (impact on results, but same behaviour, EXCEPT for "Writeback (unsafe)")
- Separating the test disk from the OS disk inside the VM (no difference)
- Creating a separate disk for the paging file inside the VM (no difference)
- Playing with ZFS ashift, volblocksize and the VM's NTFS allocation size (very light impact on results, same behaviour)
- Enabling/disabling the ZFS cache on the zvol during the test (huge impact on read results, but same behaviour)
- Enabling/disabling ZFS compression (impact on results, but same behaviour)
- Changing the zpool from mirror to single disk (nearly the same behaviour)
- Changing the storage engine from ZFS to ext4 (PROBLEM SOLVED using ext4 instead of ZFS)

(Of course I reinstalled the VM for every ZFS-layer modification such as compression and ashift/volblocksize changes; a sketch of the kind of commands involved is shown below.)
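
For context, these are the kinds of host-side knobs those tests touched; a rough sketch using names from this setup as placeholders (most of them only affect newly written data or newly created zvols):

Code:
# Compression on/off for the dataset backing the VM disks
zfs set compression=lz4 ZFS-Lab2
zfs set compression=off ZFS-Lab2

# ZFS (ARC) caching of the zvol - "metadata" effectively disables data caching
zfs set primarycache=all ZFS-Lab2/vm-104-disk-0
zfs set primarycache=metadata ZFS-Lab2/vm-104-disk-0

# Default volblocksize for newly created VM disks on a storage
pvesm set Test2 --blocksize 16k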

It seems like a problem specific to ZFS in my setup.
I've searched around for days and found posts like this one (https://forum.proxmox.com/threads/p...ndows-server-2022-et-write-back-disks.127580/) with nearly the same issue, but found no practical info apart from suggestions to use enterprise SSDs.


I know that I'm using consumer-grade drives for this test, but since the issue is huge and only present with a certain combination of settings, I'm looking for help to figure out the real source of the problem.

Some results from the last tests I ran
Result for ZFS on single disk, with Win cache ON and buffer flush ON
View attachment 58006

Result for ZFS on single disk, with Win cache ON and buffer flush OFF (unsafe)
View attachment 58007

Result for ZFS on single disk, with Win cache OFF
View attachment 58008

Similar behaviour with the ZFS mirror on two disks.

Result using ext4 instead of ZFS on same hardware (and single disk)
No problem at all in this case
View attachment 58009


I hope someone can help me understand where the problem is and how to solve it.

Thanks in advance!
Edoardo





pveversion
Code:
proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.2
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.5
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.3
libpve-rs-perl: 0.8.3
libpve-storage-perl: 8.0.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
openvswitch-switch: 3.1.0-2
proxmox-backup-client: 2.99.0-1
proxmox-backup-file-restore: 2.99.0-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.5
pve-cluster: 8.0.1
pve-container: 5.0.3
pve-docs: 8.0.3
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.2
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.4
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1


zpool status
Code:
pool: ZFS-Lab2
 state: ONLINE
  scan: scrub repaired 0B in 00:03:57 with 0 errors on Sun Nov 12 00:27:59 2023
config:


    NAME                                         STATE     READ WRITE CKSUM
    ZFS-Lab2                                     ONLINE       0     0     0
      mirror-0                                   ONLINE       0     0     0
        ata-WDC_WDS100T2B0A-00SM50_183602A01791  ONLINE       0     0     0
        ata-WDC_WDS100T2B0A_1849AC802510         ONLINE       0     0     0


errors: No known data errors


  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:03:17 with 0 errors on Sun Nov 12 00:27:20 2023
config:


    NAME                                        STATE     READ WRITE CKSUM
    rpool                                       ONLINE       0     0     0
      mirror-0                                  ONLINE       0     0     0
        ata-TOSHIBA_MQ01ABF050_863LT034T-part3  ONLINE       0     0     0
        ata-TOSHIBA_MQ01ABF050_27MDSVHVS-part3  ONLINE       0     0     0


errors: No known data errors


VM config
Code:
agent: 1
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 8
cpu: x86-64-v4
efidisk0: Test:104/vm-104-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: none,media=cdrom
machine: pc-i440fx-8.0
memory: 8192
meta: creation-qemu=8.0.2,ctime=1699873712
name: Testzzz
net0: virtio=8E:46:68:39:1E:CA,bridge=vmbr0,firewall=1
numa: 0
ostype: win11
scsi0: Test2:vm-104-disk-0,discard=on,iothread=1,size=100G
scsihw: virtio-scsi-single
smbios1: uuid=b65697c3-3c86-4c72-86a4-92b05fc8f241
sockets: 1
tpmstate0: Test:104/vm-104-disk-2.raw,size=4M,version=v2.0
unused0: Test:104/vm-104-disk-1.raw
vmgenid: 0186fd26-8c5d-4ef0-a8bc-d0ff738bef43
Hi,

EdoFede

Use the "virtio" driver for disks, you have scsi. Do everything according to the instructions https://pve.proxmox.com/wiki/Windows_10_guest_best_practices
 
5/7 days of uptime without freezes with 1 GB dirty cache.


It has also resolved other problems:
  1. I no longer see our disk I/O alternate between max speed and 0 over and over again. I believe this is all related and is what the OP saw in his benchmarks.
  2. I no longer get bogus errors when using Ansible to recreate VMs, from source we've had for years. Under heavy load before, I'd get errors attaching an EFI or TPM disk. I believe that if these commands run on the host while disk I/O writes are seized up, they return these errors.
  3. Throughput increased by over 2x. A giant list of Packer templates we use to define our Windows build image, around 600 GB but in 20 pieces, used to take qemu-img rebase/commit 4.6 hours. Now it takes 2.2 hours (a rough sketch of those qemu-img steps is shown below).
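
For anyone unfamiliar with the workflow mentioned in point 3, a rough sketch of the kind of qemu-img operations involved (the file names are hypothetical):

Code:
# Re-parent an overlay onto a new backing image
qemu-img rebase -b new-base.qcow2 -F qcow2 overlay.qcow2

# Merge an overlay's changes back into its backing file
qemu-img commit overlay.qcow2
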
I have a ticket open with PVE support in which I'll recommend that they explore the default calculation of zfs_dirty_data_max and why 4 GB seems to cause this behavior. Maybe on some systems too large a buffer congests the PCIe lanes between multiple CPU sockets and causes the NUMA load balancer on the host to halt I/O, which can cause all of the issues above and make Windows NT kernel panic and freeze without a trace?

2 more days and I'll be confident.
 
It looks like this immunizes us against 99% of freezes without having to cap disk I/O per VM, which is huge. We did have 1 freeze, but it was under an absurd amount of I/O: essentially qemu-img was saturating bandwidth while 35 VMs were building in production and I was hammering the host by bootstrapping 10 large Windows VMs, which performs a lot of small writes and saturates IOPS.

Before, we'd have 3 freezes per day per host under light load. I just spent a week torture-testing our server with the ZFS dirty-write limit set to 1 GB and had 1 freeze. In normal production conditions we probably won't see it more than a few times per year. Previously I was only able to get over 10 days of uptime without freezes by severely restricting I/O to 150 MB/s read and write per VM. Now it's uncapped.

This is the mail thread of the NUMA load balancer bug that seems to be the real issue. High disk IO seems to be a catalyst in triggering it.
https://lists.proxmox.com/pipermail/pve-devel/2024-January/061399.html

While I made this edit on one host, I disabled the NUMA load balancer on another, without I/O caps or the zfs_dirty_data_max change, torture-tested it as well, and had 0 freezes. I do not recommend disabling the NUMA load balancer as a fix, because it degrades server performance severely, to the point that a lot of SSH and Ansible proxmox_kvm tasks error out as unreachable or time out. The web dashboard will also error a lot with "too many redirections", or simply take a long time to update. My understanding is that while disabling the NUMA load balancer bypasses the issue, it also causes the host to treat the multi-socket system as one node, so threads running on one socket end up using RAM channels belonging to the other physical socket, congesting the inter-socket and PCIe traffic.
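
For reference (and explicitly not as a recommendation, per the above), the kernel's automatic NUMA balancing is typically toggled like this; a sketch, assuming the standard Linux sysctl interface:

Code:
# Check whether automatic NUMA balancing is active (1 = on)
cat /proc/sys/kernel/numa_balancing

# Turn it off for the running system only
sysctl -w kernel.numa_balancing=0

# Re-enable it
sysctl -w kernel.numa_balancing=1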
 
Last edited:
  • Like
Reactions: EdoFede

I am late to this party!

For a ZFS-backed VM disk, select the following:
  • "VirtIO SCSI single" as the controller
  • SCSI as the bus/device
  • SSD emulation - yes (even if ZFS is on HDDs; this is mainly to disable the guest's disk-defragment scheduler, and you should still manually disable the defragment scheduler in the guest OS too)
  • Cache - No cache
  • Discard - yes (even if ZFS is on HDDs; this way ZFS knows which portions of the virtual disk are no longer in use. I am still a bit unsure about this and may update this part with a definite answer later, after tests.)
  • IO thread - yes
  • Backup - yes/no (as per your need)

---- now very important -----
  • Just try selecting "native" as Async IO during guest disk creation (see the command-line sketch below).
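
To make that concrete, a sketch of how those options map onto a VM disk from the CLI (the VM ID, storage and disk names are placeholders taken from earlier in this thread; adjust to your setup):

Code:
# Apply SSD emulation, discard, IO thread, no cache and native async I/O to an existing disk
qm set 104 --scsi0 Test2:vm-104-disk-0,ssd=1,discard=on,iothread=1,cache=none,aio=native

# Confirm the resulting disk line
qm config 104 | grep scsi0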
 
Last edited:

If you read here: https://forum.proxmox.com/threads/async-io-io_uring-native-or-threads.139849/
then, as far as I understand:
"native" should not be used on ZFS, as metadata writes will block.
 
