Sync writes to ZFS zvol disk are NOT sync under PVE?

sevimo

New Member
Mar 2, 2025
I was benchmarking/tuning a new Proxmox-on-ZFS install and got weird results with sync writes from a VM, which I am struggling to explain. I am testing the storage stack from the physical disks all the way up to the zvol as seen from inside the VM. This is the fio command I am using:

Code:
fio --filename=/dev/sda --ioengine=libaio --loops=1 --size=10G --time_based --runtime=60 --group_reporting --stonewall --name=cc1 --description="CC1" --rw=write --bs=4k --direct=1 --iodepth=1 --numjobs=1 --sync=1

Basically, I am testing block devices with 4k writes at QD1, and the important part is '--sync=1', i.e. these should be sync writes, which are slow. However, when done from within the VM these writes are unreasonably fast! While one could argue that this is a good problem to have :), it is an indication that even though the app in the VM asks for sync writes, they are not actually happening, so consistency is potentially compromised. Or I just don't quite understand what is going on.
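
One way to cross-check whether these "sync" writes actually end up as ZIL commits on the ZFS side is to watch the global ZIL kstats on the host while fio runs in the guest. This is only a sketch, but the kstat file below is the standard OpenZFS one:

Code:
# snapshot the global ZIL counters before the run
grep zil_commit /proc/spl/kstat/zfs/zil
# ... run fio with --sync=1 in the guest ...
# snapshot again afterwards; sync writes that are honored end-to-end
# should show up as a corresponding increase in the commit counters
grep zil_commit /proc/spl/kstat/zfs/zil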

So here are the details. I have 2 SSD drives (host:/dev/sda and host:/dev/sdb), with one partition on each combined into a ZFS mirrored pool 'ztest'. ztest has a zvol /dev/zd0 = 'vm-101-disk-0', which can be given to fio directly on the host, and can also be passed to a VM (Debian Live) that runs fio on it in the guest (guest:/dev/sda).
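
For reference, the layout was created roughly like this (a sketch only: the partition numbers are just examples, and the zvol itself is normally created by PVE when the 55G disk is added to VM 101):

Code:
# mirrored pool from one partition on each SSD (partition numbers here are examples)
zpool create ztest mirror /dev/sda4 /dev/sdb4
# the VM disk is a 55G zvol on that pool; created by hand it would be roughly
zfs create -V 55G ztest/vm-101-disk-0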

To establish a baseline, I run the fio command above on a raw physical disk partition (host:/dev/sdb5) and get the following IOPS (rounded for easier reading):


host, sdb5, sync=0: 15,000 IOPS
host, sdb5, sync=1: 1,000 IOPS

So this is what these disks are capable of for the sync vs async workload above. These are test drives, so the IOPS are terrible, but that also makes the differences easy to notice.

Now the same workload, still on the host, but against /dev/zd0 and with different settings of the sync attribute on the zvol:

host, zd0/standard, sync=0: 17,000 IOPS
host, zd0/standard, sync=1: 700 IOPS
host, zd0/always, sync=0: 700 IOPS
host, zd0/always, sync=1: 700 IOPS

Nothing particularly unusual here: sync=standard honors explicit syncs and skips them when not asked, and sync=always syncs everything no matter what. The only surprise is that sync writes to the zvol are notably slower than sync writes to the slowest underlying drive, and quite a bit slower at that (I have retested multiple times, this is consistent). Is this expected, and what could be the reason?
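
(For completeness, switching the sync behavior between runs is just the usual property toggle; shown here on the zvol, but it can equally be set on the pool and inherited:)

Code:
zfs get sync ztest/vm-101-disk-0
zfs set sync=always ztest/vm-101-disk-0    # force every write through the ZIL
zfs set sync=standard ztest/vm-101-disk-0  # only honor explicit flushes/O_SYNC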

But the results get really weird when I bench the exact same zvol from inside the VM:

guest, sda/standard, sync=0, cache=none: 20,000 IOPS
guest, sda/standard, sync=1, cache=none: 10,000 IOPS
guest, sda/standard, sync=0, cache=directsync: 15,000 IOPS
guest, sda/standard, sync=1, cache=directsync: 15,000 IOPS
guest, sda/always, sync=0, cache=none: 700 IOPS
guest, sda/always, sync=1, cache=none: 700 IOPS

OK, this table doesn't make much sense to me at all (well, except the last two lines, I suppose). With cache=none I expected the guest zvol to behave the same as in the host testing, but even with sync=1 it is clear that the writes are not actually sync, they are way too fast. Worse, there IS a difference between sync=0 and sync=1 in this case, with both being much higher than physical sync writes. Even worse, these results (for cache=none) are basically the same as for cache=writeback (wtf?!).

Directsync mode also changes the numbers, but is still way too fast (and now sync=0 vs sync=1 makes NO difference). And to round this out, to make sure I am not testing the wrong disk or something ( :) ): as soon as I set sync=always on the original zvol, the guest finally shows the expected terrible IOPS no matter what sync is set to in fio.
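
(In case it helps narrow this down, the QEMU command line that PVE generates can be dumped without starting the VM, to see how the configured cache mode actually lands on the drive options; a sketch for VMID 101:)

Code:
# print the generated QEMU command line for VM 101 and look at the -drive
# options for the scsi0 disk; the cache mode (none/directsync/writeback)
# should be visible there
qm showcmd 101 --pretty | grep -iE 'cache|scsi0'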

So, what is happening? I feel that I am missing something obvious here...

P.S. I tried other combinations of cache, io_uring/native, etc.; these other settings did not make a material difference. The controller is VirtIO SCSI single with IOThread on.
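
For what it's worth, one can also check from inside the guest whether the virtual disk advertises a volatile write cache at all, since that determines whether the guest even sends flushes for its sync writes (a sketch; the device name in the guest is assumed to be /dev/sda):

Code:
# "write back"    -> guest assumes a volatile cache and must issue flushes for sync writes
# "write through" -> guest assumes writes are already stable and may skip flushes
cat /sys/block/sda/queue/write_cache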
 
I can't confirm this (for me, with sync=standard, a sync fio benchmark is orders of magnitude slower than a non-sync one).

can you please post the exact
- pveversion -v
- VM config and details about the guest OS
- ZFS dataset properties
- fio commandline

that shows you the "wrong" results?
 

The fio command line is posted above; the guest OS is Debian Live:
Linux debian 6.1.0-29-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.123-1 (2025-01-02) x86_64 GNU/Linux

pveversion -v:
Code:
proxmox-ve: 8.3.0 (running kernel: 6.8.12-8-pve)
pve-manager: 8.3.3 (running version: 8.3.3/f157a38b211595d6)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-8
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph-fuse: 18.2.4-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.2.0
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.4
pve-cluster: 8.0.10
pve-container: 5.2.3
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-3
pve-ha-manager: 4.0.6
pve-i18n: 3.3.3
pve-qemu-kvm: 9.0.2-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.7
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve1

zfs properties:
Code:
NAME   PROPERTY              VALUE                  SOURCE
ztest  type                  filesystem             -
ztest  creation              Thu Feb 27 21:44 2025  -
ztest  used                  974M                   -
ztest  available             62.5G                  -
ztest  referenced            96K                    -
ztest  compressratio         10.61x                 -
ztest  mounted               yes                    -
ztest  quota                 none                   default
ztest  reservation           none                   default
ztest  recordsize            128K                   default
ztest  mountpoint            /ztest                 default
ztest  sharenfs              off                    default
ztest  checksum              on                     default
ztest  compression           on                     local
ztest  atime                 on                     default
ztest  devices               on                     default
ztest  exec                  on                     default
ztest  setuid                on                     default
ztest  readonly              off                    default
ztest  zoned                 off                    default
ztest  snapdir               hidden                 default
ztest  aclmode               discard                default
ztest  aclinherit            restricted             default
ztest  createtxg             1                      -
ztest  canmount              on                     default
ztest  xattr                 on                     default
ztest  copies                1                      default
ztest  version               5                      -
ztest  utf8only              off                    -
ztest  normalization         none                   -
ztest  casesensitivity       sensitive              -
ztest  vscan                 off                    default
ztest  nbmand                off                    default
ztest  sharesmb              off                    default
ztest  refquota              none                   default
ztest  refreservation        none                   default
ztest  guid                  6386616471437502632    -
ztest  primarycache          all                    local
ztest  secondarycache        all                    default
ztest  usedbysnapshots       0B                     -
ztest  usedbydataset         96K                    -
ztest  usedbychildren        973M                   -
ztest  usedbyrefreservation  0B                     -
ztest  logbias               latency                default
ztest  objsetid              54                     -
ztest  dedup                 off                    default
ztest  mlslabel              none                   default
ztest  sync                  standard               local
ztest  dnodesize             legacy                 default
ztest  refcompressratio      1.00x                  -
ztest  written               96K                    -
ztest  logicalused           10.0G                  -
ztest  logicalreferenced     42K                    -
ztest  volmode               default                default
ztest  filesystem_limit      none                   default
ztest  snapshot_limit        none                   default
ztest  filesystem_count      none                   default
ztest  snapshot_count        none                   default
ztest  snapdev               hidden                 default
ztest  acltype               off                    default
ztest  context               none                   default
ztest  fscontext             none                   default
ztest  defcontext            none                   default
ztest  rootcontext           none                   default
ztest  relatime              on                     default
ztest  redundant_metadata    all                    default
ztest  overlay               on                     default
ztest  encryption            off                    default
ztest  keylocation           none                   default
ztest  keyformat             none                   default
ztest  pbkdf2iters           0                      default
ztest  special_small_blocks  0                      default
ztest  prefetch              all                    default

NAME                 PROPERTY              VALUE                  SOURCE
ztest/vm-101-disk-0  type                  volume                 -
ztest/vm-101-disk-0  creation              Sat Mar  1 16:39 2025  -
ztest/vm-101-disk-0  used                  967M                   -
ztest/vm-101-disk-0  available             62.5G                  -
ztest/vm-101-disk-0  referenced            967M                   -
ztest/vm-101-disk-0  compressratio         10.63x                 -
ztest/vm-101-disk-0  reservation           none                   default
ztest/vm-101-disk-0  volsize               55G                    local
ztest/vm-101-disk-0  volblocksize          128K                   -
ztest/vm-101-disk-0  checksum              on                     default
ztest/vm-101-disk-0  compression           on                     inherited from ztest
ztest/vm-101-disk-0  readonly              off                    default
ztest/vm-101-disk-0  createtxg             30526                  -
ztest/vm-101-disk-0  copies                1                      default
ztest/vm-101-disk-0  refreservation        none                   default
ztest/vm-101-disk-0  guid                  4528998075837681602    -
ztest/vm-101-disk-0  primarycache          all                    inherited from ztest
ztest/vm-101-disk-0  secondarycache        all                    default
ztest/vm-101-disk-0  usedbysnapshots       0B                     -
ztest/vm-101-disk-0  usedbydataset         967M                   -
ztest/vm-101-disk-0  usedbychildren        0B                     -
ztest/vm-101-disk-0  usedbyrefreservation  0B                     -
ztest/vm-101-disk-0  logbias               latency                default
ztest/vm-101-disk-0  objsetid              13976                  -
ztest/vm-101-disk-0  dedup                 off                    default
ztest/vm-101-disk-0  mlslabel              none                   default
ztest/vm-101-disk-0  sync                  standard               inherited from ztest
ztest/vm-101-disk-0  refcompressratio      10.63x                 -
ztest/vm-101-disk-0  written               967M                   -
ztest/vm-101-disk-0  logicalused           10.0G                  -
ztest/vm-101-disk-0  logicalreferenced     10.0G                  -
ztest/vm-101-disk-0  volmode               default                default
ztest/vm-101-disk-0  snapshot_limit        none                   default
ztest/vm-101-disk-0  snapshot_count        none                   default
ztest/vm-101-disk-0  snapdev               hidden                 default
ztest/vm-101-disk-0  context               none                   default
ztest/vm-101-disk-0  fscontext             none                   default
ztest/vm-101-disk-0  defcontext            none                   default
ztest/vm-101-disk-0  rootcontext           none                   default
ztest/vm-101-disk-0  redundant_metadata    all                    default
ztest/vm-101-disk-0  encryption            off                    default
ztest/vm-101-disk-0  keylocation           none                   default
ztest/vm-101-disk-0  keyformat             none                   default
ztest/vm-101-disk-0  pbkdf2iters           0                      default
ztest/vm-101-disk-0  prefetch              all                    default

volblocksize is 128k here, but I started with 16k and the pattern was the same.
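
(volblocksize is fixed at creation time, so testing 16k meant recreating the zvol. Checking it is just a property read; creating a zvol with a different block size by hand would be roughly as below, though in practice PVE does this when the disk is added. The -disk-1 name is just an example:)

Code:
zfs get volblocksize ztest/vm-101-disk-0
zfs create -V 55G -b 16k ztest/vm-101-disk-1   # example name, not an existing dataset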

VM definition:
Code:
boot: order=ide2;net0
cores: 4
cpu: x86-64-v2-AES
ide2: local:iso/debian-live-12.9.0-amd64-standard.iso,media=cdrom,size=1499968K
memory: 2048
meta: creation-qemu=9.0.2,ctime=1740711583
name: fiotest
net0: virtio=BC:24:11:0F:05:39,bridge=vmbr0,firewall=1,tag=11
numa: 0
ostype: l26
scsi0: ztest:vm-101-disk-0,aio=io_uring,cache=none,discard=on,iothread=1,size=55G
scsihw: virtio-scsi-single
smbios1: uuid=7e99887f-d252-43b2-9d91-7027cb2f84c8
sockets: 1
vmgenid: 7b4eb2f9-ab24-4264-ac43-d9602341249b

Thanks!
 
Trying to mimic your setup (my test VM yesterday was beefier ;)), I now get:

16k volblocksize:

sync=1: 17k iops, 70MB/s bw
sync=0: 30k iops, 123MB/s bw

128k volblocksize:

sync=1: 18k iops, 77MB/s
sync=0: 33k iops, 137MB/s

if I change iodepth to 4, the difference is more pronounced:

sync=1: 57k iops, 237MB/s
sync=0: 90k iops, 365MB/s
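
(for reference, that is just the fio command line from the first post with the queue depth bumped:)

Code:
fio --filename=/dev/sda --ioengine=libaio --loops=1 --size=10G --time_based --runtime=60 --group_reporting --stonewall --name=cc1 --description="CC1" --rw=write --bs=4k --direct=1 --iodepth=4 --numjobs=1 --sync=1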

on the host itself:
sync=1: 500 iops, 2157kB/s
sync=0: 92k iops, 378MB/s

so I think I can reproduce this! will keep you posted!
 