[SOLVED] Windows VM I/O problems only with ZFS

EdoFede

Nov 10, 2023
Hi,
I'm a new Proxmox user.

I've built a small lab to evaluate PVE and PBS, with the intention of replacing our Hyper-V infrastructure (90 VMs across 3 sites, many in replica).

We got two Dell R640 servers for PVE and another one for PBS.
For this testing we are using 4x WD Blue 1TB SSDs (model WDS100T2B0A) that were lying around.
Two per server (plus separate OS disks) in a ZFS mirror configuration, with ashift=12.
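
For reference, creating a mirror like this by hand would look roughly as follows (a sketch only; the pool may well have been created from the GUI, and the by-id paths are the ones from the zpool status output further down, so they will differ on other systems):

Code:
zpool create -o ashift=12 ZFS-Lab2 mirror \
    /dev/disk/by-id/ata-WDC_WDS100T2B0A-00SM50_183602A01791 \
    /dev/disk/by-id/ata-WDC_WDS100T2B0A_1849AC802510
zpool get ashift ZFS-Lab2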

HW config (one server)
Dell R640
2x Intel Xeon Gold 5120
64GB RAM (planned to be expanded to 256GB)
Dell PERC controller in passthrough mode


The idea is to run two PVE nodes in an HA cluster with ZFS replication between the nodes, a remote replica for disaster-recovery purposes (for critical VMs), and a third local server running PBS for backups (also used as a QDevice for HA quorum).

I configured a network ring (one dual-port 10G Ethernet card per server) with RSTP over Open vSwitch between the two PVE nodes and the PBS server.
Everything works as expected and we're very satisfied so far.
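
For context, the ring is configured along these lines on each node (a sketch only; the bridge name, NIC names and address are placeholders, and the RSTP enable command follows the usual Open vSwitch pattern):

Code:
# /etc/network/interfaces (excerpt; eno1/eno2 and the IP are placeholders)
auto vmbr1
iface vmbr1 inet static
    address 10.10.10.1/24
    ovs_type OVSBridge
    ovs_ports eno1 eno2
    up ovs-vsctl set Bridge ${IFACE} rstp_enable=true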

But...
While testing with some VMs (most of them will be Windows servers) I ran into a major stability issue under high I/O.

Write I/O performance is very poor and the VM becomes very (very) slow to respond to user interaction and other actions during a default CrystalDiskMark run.
The benchmark result also drops to 0.00 in one or more write tests (not always reproducible) at the end.

The CrystalDiskMark benchmark was run after noticing anomalous behaviour while duplicating a simple 1GB file inside the VM (the guest operating system froze a few seconds after the copy started and remained unresponsive until it finished).

This behaviour happens only if I use ZFS as the storage backend, with any combination of storage parameters except the "Writeback (unsafe)" cache mode.
And only with the Windows write cache active and buffer flushing enabled (the "turn off write-cache buffer flushing" flag unchecked), which is the standard Windows configuration.

Tests I've run to narrow down the problem:
- Every combination of the Windows write-caching settings inside the VM (problems with the write cache active, as described)
- Every combination of VM cache settings for the virtual disks (impact on results, but same behaviour, EXCEPT for "Writeback (unsafe)")
- Separating the test disk from the OS disk inside the VM (no difference)
- Creating a separate disk for the paging file inside the VM (no difference)
- Playing with ZFS ashift, volblocksize and the VM's NTFS allocation size (very slight impact on results, same behaviour)
- Enabling/disabling the ZFS cache on the zvol during the test (huge impact on read results, but same behaviour)
- Enabling/disabling ZFS compression (impact on results, but same behaviour)
- Changing the zpool from a mirror to a single disk (nearly the same behaviour)
- Changing the storage backend from ZFS to ext4 (PROBLEM SOLVED using ext4 instead of ZFS)

(Of course, I reinstalled the VM after every ZFS-layer modification such as compression or a change of ashift/volblocksize.)
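
To take the VM layer out of the equation, a host-side sync-write test on the pool gives an idea of what the drives themselves can sustain (a sketch; it assumes fio is installed and uses the pool's mountpoint as the target):

Code:
# 4k synchronous writes: the pattern consumer SSDs without power-loss protection handle worst
fio --name=syncwrite --filename=/ZFS-Lab2/fio-test.bin --size=1G \
    --ioengine=psync --rw=write --bs=4k --sync=1 \
    --numjobs=1 --runtime=60 --time_based
rm /ZFS-Lab2/fio-test.bin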

It seems like a problem specific to ZFS in my setup.
I've searched around for days and found posts like this one (https://forum.proxmox.com/threads/p...ndows-server-2022-et-write-back-disks.127580/) with nearly the same issue, but no practical information except suggestions to use enterprise SSDs.


I know I'm using consumer-grade drives for this test, but since the issue is severe and only appears with a specific combination of settings, I'm looking for help to figure out the real source of the problem.

Some results from the last tests I ran:
Result for ZFS on single disk, with Win cache ON and buffer flush ON
Screenshot 2023-11-13 at 17.33.15.png

Result for ZFS on single disk, with Win cache ON and buffer flush OFF (unsafe)
Screenshot 2023-11-13 at 18.09.32.png

Result for ZFS on single disk, with Win cache OFF
Screenshot 2023-11-13 at 18.04.21.png

Similar behaviour with the ZFS mirror on two disks.

Result using ext4 instead of ZFS on the same hardware (and a single disk)
No problem at all in this case
Win cache ON - VM No-cache - Storage ext4 (single disk).png


I hope someone can help me understand where the problem is and how to solve it.

Thanks in advance!
Edoardo





pveversion
Code:
proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.2
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.5
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.3
libpve-rs-perl: 0.8.3
libpve-storage-perl: 8.0.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
openvswitch-switch: 3.1.0-2
proxmox-backup-client: 2.99.0-1
proxmox-backup-file-restore: 2.99.0-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.5
pve-cluster: 8.0.1
pve-container: 5.0.3
pve-docs: 8.0.3
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.2
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.4
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1


zpool status
Code:
pool: ZFS-Lab2
 state: ONLINE
  scan: scrub repaired 0B in 00:03:57 with 0 errors on Sun Nov 12 00:27:59 2023
config:


    NAME                                         STATE     READ WRITE CKSUM
    ZFS-Lab2                                     ONLINE       0     0     0
      mirror-0                                   ONLINE       0     0     0
        ata-WDC_WDS100T2B0A-00SM50_183602A01791  ONLINE       0     0     0
        ata-WDC_WDS100T2B0A_1849AC802510         ONLINE       0     0     0


errors: No known data errors


  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:03:17 with 0 errors on Sun Nov 12 00:27:20 2023
config:


    NAME                                        STATE     READ WRITE CKSUM
    rpool                                       ONLINE       0     0     0
      mirror-0                                  ONLINE       0     0     0
        ata-TOSHIBA_MQ01ABF050_863LT034T-part3  ONLINE       0     0     0
        ata-TOSHIBA_MQ01ABF050_27MDSVHVS-part3  ONLINE       0     0     0


errors: No known data errors


VM config
Code:
agent: 1
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 8
cpu: x86-64-v4
efidisk0: Test:104/vm-104-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
ide2: none,media=cdrom
machine: pc-i440fx-8.0
memory: 8192
meta: creation-qemu=8.0.2,ctime=1699873712
name: Testzzz
net0: virtio=8E:46:68:39:1E:CA,bridge=vmbr0,firewall=1
numa: 0
ostype: win11
scsi0: Test2:vm-104-disk-0,discard=on,iothread=1,size=100G
scsihw: virtio-scsi-single
smbios1: uuid=b65697c3-3c86-4c72-86a4-92b05fc8f241
sockets: 1
tpmstate0: Test:104/vm-104-disk-2.raw,size=4M,version=v2.0
unused0: Test:104/vm-104-disk-1.raw
vmgenid: 0186fd26-8c5d-4ef0-a8bc-d0ff738bef43
 
Small update.
I've tried adding another SSD as a log device, just to see.
Same behaviour.

A simple copy of a large file inside the VM freezes the VM for the whole duration of the copy, after the first few seconds.

I can't understand why it doesn't just deliver lower performance instead of freezing the virtual machine... and only with the Windows write cache active (the standard setting).
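
(For reference, pool and disk activity during the freeze can be watched from the host with standard tools:)

Code:
# on the PVE host, while the copy runs inside the VM
zpool iostat -v ZFS-Lab2 1
iostat -x 1    # per-device utilisation, from the sysstat package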

I hope some of you can give me some ideas.

Thanks in advance
Edoardo
 
Sorry to say, but I believe the problem is your drives: when their cache is full, writes slow to a near stall. Writeback mode sends async writes to the Proxmox host's page cache, which then flushes them to storage asynchronously. Sync writes are sent straight to the disk. I'm not sure what kind of writes your VMs are sending to the storage given the different cache settings in the guest OS.

It probably won't help, but you can try disabling sync writes on ZFS (zfs set sync=disabled) and check whether there's any difference in the results. Beware that this is completely unsafe: ZFS will acknowledge writes to the application as soon as the data is in RAM, before it has actually been committed to disk. Data can be lost at any point between the app and the drive's chips.
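
For reference, the test (and the revert afterwards) would look like this, using the pool name from the first post:

Code:
zfs set sync=disabled ZFS-Lab2    # UNSAFE: for testing only
zfs get sync ZFS-Lab2
zfs set sync=standard ZFS-Lab2    # back to the default behaviour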
 
Hi Victor,
thank you for your explanation.

Sure, the disks are not ideal for an enterprise solution, but they are still SSDs with decent performance, and all tests were conducted with a single VM, not under heavy load (after all, this is a lab for evaluating the product, not a production farm).

A simple duplication of a file inside the VM blocks it completely.
However, this absolutely does not happen with the same configuration when ext4 is used instead of ZFS on PVE.

Rather, it looks like something is wrong in how the virtual SCSI controller and ZFS interact.
If I hit these problems with a simple file copy on SSD storage, how do users running spinning-disk arrays cope?

I think something is wrong, because the behaviour is really anomalous. It's not a simple slowdown (which can happen), but an actual freeze of the entire VM during the copy (or the disk benchmark).

I'll do some more testing tonight.
I hope to solve this, because "consumer" SSDs are actually going to be used on some machines due to customer budget constraints, and I would hate to have to leave everything on Hyper-V (which doesn't show this kind of problem) because of this issue.

Thanks,
Edoardo
 
I have many systems with ZFS (RAID1, RAID10; I don't use RAIDz for VMs) + Windows VMs (Win7 to Win2022 and everything in between except Win8) using the VirtIO SCSI Single controller and a variety of drives, from spinning rust to high-end NVMe. Everything is smooth, so I would rule out a general issue.

IMHO this is something with ZFS and those drives.

Please post the VM configuration to make sure that end is ok (qm config VMID)

EDIT: you should also check if there's any firmware update for those WD SA510
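
A quick way to read the installed firmware version from the host (the device path is just one of the mirror members from the zpool status above):

Code:
smartctl -i /dev/disk/by-id/ata-WDC_WDS100T2B0A-00SM50_183602A01791 | grep -i firmware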
 
I have many systems with ZFS (RAID1, RAID10; I don't use RAIDz for VMs) + Windows VMs (Win7 to Win2022 and everything in between except Win8) using the VirtIO SCSI Single controller and a variety of drives, from spinning rust to high-end NVMe. Everything is smooth, so I would rule out a general issue.
I've used ZFS occasionally in the past (first on Solaris 10), but never had similar issues.
I've never run VMs on ZFS before either; I had only used it on storage servers.

IMHO this is something with ZFS and those drives.
Agreed, but I also have a pair of other similar SSDs lying around, so I'll try with those as well.


Please post the VM configuration to make sure that end is ok (qm config VMID)

EDIT: you should also check if there's any firmware update for those WD SA510

It's attached to the first post, but here it is again:

Code:
agent: 1
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 8
cpu: x86-64-v4
description: ### Template Windows Server 2022
efidisk0: ZFS-Lab2:vm-103-disk-0,efitype=4m,format=raw,pre-enrolled-keys=1,size=528K
ide2: none,media=cdrom
machine: pc-i440fx-8.0
memory: 8192
meta: creation-qemu=8.0.2,ctime=1699644077
name: DaTemplate
net0: virtio=A2:B8:BC:D2:2B:65,bridge=vmbr0,firewall=1
numa: 0
ostype: win11
scsi0: ZFS-Lab2:vm-103-disk-1,discard=on,format=raw,iothread=1,size=120G
scsi1: ZFS-Lab2:vm-103-disk-4,iothread=1,size=16G
scsihw: virtio-scsi-single
smbios1: uuid=57ec82d1-b9f9-48bd-9a69-3b58933d973c
sockets: 1
tpmstate0: ZFS-Lab2:vm-103-disk-2,size=4M,version=v2.0
unused0: ZFS-Lab2:vm-103-disk-3
vmgenid: 9d26c942-b373-4128-b78c-72ffc9edd39c

Thanks!
 
Please post your /etc/pve/storage.cfg.
Code:
scsi0: ZFS-Lab2:vm-103-disk-1,discard=on,format=raw,iothread=1,size=120G

I'm asking because I don't see that format=raw parameter in any of my VMs using ZFS storage.
 
I had a similar issue with my ZFS SSDs; it even caused my Proxmox host to crash, because the host's system SSD was also on ZFS. This happened when a VM produced high write rates that couldn't be synced because the SSD cache was full.

Under Windows I observed a loop-like pattern (3000 MB/s for 30 seconds, then 0 MB/s for 60 seconds, and so on). These zero-transfer stalls on the host SSD are most likely what brought the entire system down.

What I changed on my Proxmox Host:

Permanently:

In /etc/modprobe.d/zfs.conf, I added:
Code:
options zfs zfs_dirty_data_max=67108864

Or temporarily:

Code:
echo $((64*1024*1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max

This limits the amount of dirty data ZFS holds in RAM, encouraging it to write to the SSD sooner. It probably prevents complete I/O blocking on SSDs that can't sustain the speed, since ZFS no longer tries to flush a large amount, like 8GB, all at once.

Now it works really well, even if a QLC SSD's speed drops from 3000 MB/s to 60 MB/s.
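
For reference, the currently active value can be read back at runtime, and on hosts that load ZFS from the initramfs (e.g. root on ZFS) the modprobe.d change may also need an initramfs rebuild to apply at boot:

Code:
cat /sys/module/zfs/parameters/zfs_dirty_data_max
# after editing /etc/modprobe.d/zfs.conf:
update-initramfs -u -k all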
 
Please post your /etc/pve/storage.cfg.

Sure!
Code:
root@proxlab1:~# cat /etc/pve/storage.cfg
dir: local
    path /var/lib/vz
    content backup,vztmpl,iso
    shared 1


pbs: backlab1
    datastore ZFSLab1
    server backlab1
    content backup
    fingerprint cc:7f:44:6d:0e:39:a9:3f:9c:67:5c:7a:ac:84:4e:26:0e:62:c6:9b:2f:12:24:7b:71:ae:a2:13:d5:b1:28:98
    prune-backups keep-all=1
    username root@pam


zfspool: ZFS-Lab2
    pool ZFS-Lab2
    content rootdir,images
    mountpoint /ZFS-Lab2
    nodes proxlab1
    sparse 0

Asking because I don't see that format=raw parameter in any VM using zfs storage.

From what I understand, in the latest versions of PVE, VM disks on ZFS storage use zvols instead of qcow2 files by default.

All my VMs' disks are zvols on ZFS:

Code:
root@proxlab1:~# zfs list
NAME                       USED  AVAIL     REFER  MOUNTPOINT
Test                       124G   107G       96K  /Test
Test/vm-105-disk-0           3M   107G      592K  -
Test/vm-105-disk-1         124G   216G     14.7G  -
Test/vm-105-disk-2           6M   107G       72K  -
ZFS-Lab2                   799G   100G       96K  /ZFS-Lab2
ZFS-Lab2/Test               96K   100G       96K  /ZFS-Lab2/Test
ZFS-Lab2/base-201-disk-0  3.11M   100G      116K  -
ZFS-Lab2/base-201-disk-1   136G   224G     12.6G  -
ZFS-Lab2/base-201-disk-2  6.07M   100G       68K  -
ZFS-Lab2/vm-100-disk-0    3.11M   100G      132K  -
ZFS-Lab2/vm-100-disk-2    6.08M   100G       80K  -
ZFS-Lab2/vm-100-disk-3    33.0G   133G       56K  -
ZFS-Lab2/vm-100-disk-4    33.0G   133G       56K  -
ZFS-Lab2/vm-102-disk-0    3.11M   100G      116K  -
ZFS-Lab2/vm-102-disk-1     112G   200G     9.05G  -
ZFS-Lab2/vm-102-disk-2    6.07M   100G       68K  -
ZFS-Lab2/vm-103-disk-0       3M   100G      184K  -
ZFS-Lab2/vm-103-disk-1     124G   200G     24.1G  -
ZFS-Lab2/vm-103-disk-2       6M   100G       68K  -
ZFS-Lab2/vm-103-disk-3    10.1G   110G     2.36M  -
ZFS-Lab2/vm-103-disk-4    16.5G   117G      157M  -
ZFS-Lab2/vm-103-disk-5     149G   197G     45.0G  -
ZFS-Lab2/vm-106-disk-0    61.9G   154G     8.04G  -
ZFS-Lab2/vm-107-disk-0    61.9G   154G     8.37G  -
ZFS-Lab2/vm-108-disk-0    61.9G   155G     7.43G  -
rpool                     27.7G   422G      104K  /rpool
rpool/ROOT                27.7G   422G       96K  /rpool/ROOT
rpool/ROOT/pve-1          27.7G   422G     27.7G  /
rpool/data                  96K   422G       96K  /rpool/data
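
For what it's worth, the block size of an individual zvol (relevant to the volblocksize testing mentioned earlier) can be read with, for example:

Code:
zfs get volblocksize ZFS-Lab2/vm-103-disk-1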


Bye
 
Under Windows, I observed a pattern like a loop (3000 MB/s for 30 seconds, then 0 MB/s for 60 seconds), etc. These zero-transfer periods likely caused the Proxmox host SSD to crash the entire system.
Very similar behaviour! But without any host crash.

This limits the amount ZFS holds in RAM, encouraging it to write to the SSD sooner. It probably prevents complete I/O blocking on SSDs that can't maintain the speed, as it doesn't insist on writing a large amount, like 8GB, all at once.

Now it works really well, even if a QLC SSD's speed drops from 3000 MB/s to 60 MB/s.

Thank you so much!
I'll try in the next hours and post the results!
 
In the meantime I've finished three other tests.

I created/installed three new VMs, identically configured except for the storage controller.

IDE Controller (slow, but no issue):
Test new VM - IDE Controller.png

SATA Controller (slow, but no issue):
Test new VM - SATA Controller.png

SCSI Controller but with older Windows drivers (fast but with the issue):
Test new VM - SCSI Controller (OLD driver).png

In conclusion, it seems that when a "non-ideal" controller limits the VM's I/O, the issue does not appear, because the disks never get saturated.

The hypothesis and suggestion from eclipse10000 in post #8 look more and more likely. I'll test it soon and let you know.

Thanks!
 
Don't use consumer SSDs with ZFS. You need SSDs with supercapacitors (power-loss protection) to handle ZFS journal (ZIL) sync writes.

(Or, at a minimum, use a small datacenter SSD for the log device.)

It's just a lab, but for some machines we are going this way because of customers' budget constraints.
(I think it's still better than spinning SATA drives, anyway.)

We are testing the "worst config" now

Thanks
 
THANKS eclipse10000!!

It worked like a charm!

Screenshot 2023-11-14 at 20.47.09.png

Performance isn't great, but I no longer get VM freezes or GUI slowdowns!

Now I'll move on to some fine-tuning and further testing.

Thanks a lot!
 
From what I understand, in the latest versions of PVE, VM disks on ZFS storage use zvols instead of qcow2 files by default.
As it should be, that's why that format=raw looked weird in there :)

The thing is that I never stumbled upon these consumer SSD + ZFS problems myself, not even with the very basic NUCs I use to test lots of things, so I never got a chance to troubleshoot it. All my production systems have enterprise drives, both spinning rust and solid state. No matter how low the budget is, I won't risk production data in any case.

By chance, did you test with zfs set sync=disabled?

Thanks to @eclipse10000 for the tip on zfs_dirty_data_max!
 
I am glad that I could help.

I don't remember which settings I've tried, but a few months ago, I spent two evenings testing all kinds of suggestions in Proxmox forums and other recommendations. It's possible that I also tried 'zfs set sync=disabled', but I'm not sure, as the problem is probably related to a mechanism at the zpool level.

For me, the only sensible option was the setting 'zfs_dirty_data_max'. I probably tried about 20-30 other settings.


From ChatGPT

With Synchronous Writes Disabled: Data is not written to the disk immediately. Instead, it's initially stored in the ZFS write cache (in RAM). This can significantly speed up write operations because it's faster to write to RAM than to disk. The data in the cache is eventually written to the disk, but this happens in the background and is managed by ZFS.

But personally, my understanding is that the problem comes from the fact that the transferred data is first written to RAM and then to disk, and when there are large fluctuations in speed, the entire I/O path stalls.

Even with zfs set sync=disabled, if we imagine a file transfer that exceeds the amount of RAM available to ZFS, we are back to the original problem.
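
For reference, the related dirty-data tunables can be inspected at runtime (standard OpenZFS module parameters):

Code:
grep . /sys/module/zfs/parameters/zfs_dirty_data_max \
       /sys/module/zfs/parameters/zfs_dirty_data_max_percent \
       /sys/module/zfs/parameters/zfs_txg_timeout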


In general, though, I would like to have some feedback from the Proxmox people on this, or someone who knows a bit more about ZFS.
 
As it should be, that's why that format=raw looked weird in there :)
Understood :)
That's not a modification of mine to the VM config; it's stored that way by the PVE GUI.

The thing is that I never stumbled upon these consumer SSD + ZFS problems myself, not even with the very basic NUCs I use to test lots of things, so I never got a chance to troubleshoot it. All my production systems have enterprise drives, both spinning rust and solid state. No matter how low the budget is, I won't risk production data in any case.
I agree with you, but it's the customer's choice, and it's for some not-very-critical systems for which backups are probably sufficient protection.

By chance, did you test with zfs set sync=disabled?
No, I haven't tested it because I think it's very risky, since it "converts" all synchronous writes into asynchronous ones and acknowledges writes to processes as completed when they are not.

I may give it a try when I have some time, but just out of curiosity... I would never use it in production.

Thanks!
 
Even with zfs set sync=disabled, if we imagine a file transfer that exceeds the amount of RAM available to ZFS, we are back to the original problem.

I think so too.

Let me ask a provocative question:
Assuming a small number of VMs that are not very I/O intensive (with an average R/W pattern of 70/30 or 80/20),
could a RAIDz(1-2) zpool made of consumer-grade SSDs, with a SLOG on small enterprise-grade SSDs, be a good compromise?
:)
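
For context, adding (and later removing) a dedicated log device would look roughly like this; the device path is a placeholder for an enterprise SSD, and note that a SLOG only absorbs sync writes, async writes still land on the pool vdevs:

Code:
zpool add ZFS-Lab2 log /dev/disk/by-id/ata-EXAMPLE_ENTERPRISE_SSD   # placeholder path
zpool status ZFS-Lab2
zpool remove ZFS-Lab2 ata-EXAMPLE_ENTERPRISE_SSD                    # to take it out again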
 
I may give it a try when I have some time, but just out of curiosity... I would never use it in production.
Agreed, but I'm curious too, as I don't know whether it would change anything in the original problem.

Even with zfs set sync=disabled, if we imagine a file transfer that exceeds the amount of RAM available to ZFS, we are back to the original problem.
In theory you are right; I would just like to test it somehow.

Assuming a small number of VMs that are not very I/O intensive (with an average R/W pattern of 70/30 or 80/20),
could a RAIDz(1-2) zpool made of consumer-grade SSDs, with a SLOG on small enterprise-grade SSDs, be a good compromise?
:)
RAIDz works, but it has terrible write amplification and padding overhead plus low performance. There are tons of threads about this, e.g.:

https://forum.proxmox.com/threads/about-zfs-raidz1-and-disk-space.110478/post-475702
https://forum.proxmox.com/threads/raidz-out-of-space.112822/post-487183

It's just the way RAIDz + zvols work. It can be mitigated with a bigger volblocksize and more disks in the RAIDz, but IMHO it isn't worth it: just buy more/bigger drives and stick to RAID10, especially when the total space needed is low and you are using consumer drives, since more/bigger drives are cheap.
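
If someone does go the RAIDz route anyway, the block size used for newly created zvols can be raised at the storage level (a sketch; the blocksize option only affects disks created after the change, existing zvols keep their value):

Code:
pvesm set ZFS-Lab2 --blocksize 16k
zfs get volblocksize ZFS-Lab2/vm-103-disk-1   # existing zvols are unchanged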
 
