I was benchmarking/tuning a new Proxmox-on-ZFS install and got weird results with sync writes from a VM, which I am struggling to explain. I am testing the storage stack from the physical disks all the way to a zvol inside the VM. This is the fio command I am using:
Code:
fio --filename=/dev/sda --ioengine=libaio --loops=1 --size=10G --time_based --runtime=60 --group_reporting --stonewall --name=cc1 --description="CC1" --rw=write --bs=4k --direct=1 --iodepth=1 --numjobs=1 --sync=1
Basically, I am testing block devices with 4k writes at QD1, and the important part is '--sync=1', i.e. these should be sync writes, which are slow. However, when done from within the VM, these writes are unreasonably fast! While one could argue that this is a good problem to have, it is an indication that even though the app in the VM is asking for sync writes, they are not actually happening, so consistency is potentially compromised. Or I just don't quite understand what is happening here.
So here are the details. I have two SSD drives (host:/dev/sda and host:/dev/sdb), with a partition on each combined into a ZFS mirrored pool 'ztest', and ztest has a zvol /dev/zd0 = 'vm-101-disk-0'. That zvol can be handed to fio directly on the host, and it can also be passed to a VM (Debian live), which can then run fio against it in the guest (guest:/dev/sda).
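For anyone wanting to reproduce the layout, it boils down to something like this (partition numbers and zvol size are placeholders; in practice Proxmox creates the vm-101-disk-0 zvol itself when the disk is added on a ZFS storage):
Code:
# placeholder partition numbers and zvol size
zpool create ztest mirror /dev/sda4 /dev/sdb4
zfs create -V 32G ztest/vm-101-disk-0   # shows up on the host as /dev/zd0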
So to establish a baseline, I run the fio command above on a raw physical disk partition (host:/dev/sdb5) and get the following IOPS (rounded for easier reading):
Test case | IOPS
host, sdb5, sync=0 | 15,000
host, sdb5, sync=1 | 1,000
So this is what these disks are capable of for sync vs. async workloads. These are test drives, so the IOPS are terrible, but that also makes the differences easy to notice.
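Between the runs below, the sync property is switched on the zvol itself, i.e. something like:
Code:
zfs set sync=standard ztest/vm-101-disk-0
zfs set sync=always ztest/vm-101-disk-0
zfs get sync ztest/vm-101-disk-0   # confirm the current setting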
Now the same workload, still on the host, but against /dev/zd0 and with different settings of the sync attribute on the zvol:
Test case | IOPS
host, zd0/standard, sync=0 | 17,000
host, zd0/standard, sync=1 | 700
host, zd0/always, sync=0 | 700
host, zd0/always, sync=1 | 700
Nothing particularly unusual here: sync=standard honors explicit syncs and doesn't sync when not asked to, while sync=always syncs every write, no matter what. The only thing is, I did not expect sync writes to the zvol to be notably slower than sync writes to the slowest underlying drive, yet they are (700 vs. 1,000 IOPS), and I have retested multiple times, so this is consistent. Is this expected, and what could be the reason for it?
But the results get really weird when I bench the exact same zvol from inside the VM (the cache mode is the Proxmox disk cache setting, switched between runs as shown right after the table):
Test case | IOPS
guest, sda/standard, sync=0, cache=none | 20,000
guest, sda/standard, sync=1, cache=none | 10,000
guest, sda/standard, sync=0, cache=directsync | 15,000
guest, sda/standard, sync=1, cache=directsync | 15,000
guest, sda/always, sync=0, cache=none | 700
guest, sda/always, sync=1, cache=none | 700
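The cache= values above are the Proxmox disk cache modes; between runs they are switched on the VM's disk with the equivalent of the command below (the storage ID is a placeholder for whatever the pool is added as, and the VM gets a stop/start so the change takes effect):
Code:
qm set 101 --scsi0 ztest:vm-101-disk-0,cache=directsync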
OK, now this table doesn't make much sense to me at all (well, except the last two lines, I suppose). With cache=none, I expected the zvol in the guest to behave the same as in the host testing, but even with sync=1 the writes are clearly not really sync; they are way too fast. Worse, there IS a difference between sync=0 and sync=1 in this case, with both being much higher than physical sync writes. Even worse, these cache=none results are basically the same as for cache=writeback (wtf?!)
The directsync mode also makes a difference, but it is also way too fast (and now sync=0 vs. sync=1 does NOT make a difference). And to round this up, to make sure I am not accidentally testing the wrong disk or something: as soon as I set sync=always on the original zvol, the guest finally shows the expected terrible IOPS, no matter what sync is set to in fio.
So, what is happening? I feel that I am missing something obvious here...
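One more cross-check I can do from the host side is to watch per-vdev write activity on the pool while fio runs in the guest; if the guest's sync writes were really hitting stable storage one by one, the pool should show a write ops rate in the same ballpark as what fio reports:
Code:
# per-second, per-vdev stats on ztest while the guest benchmark runs
zpool iostat -v ztest 1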
P.S. I tried other combinations of cache modes, io_uring/native, etc.; these other settings did not make a material difference. The controller is VirtIO SCSI single with IOThread on.
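For completeness, the relevant part of the VM config (/etc/pve/qemu-server/101.conf) looks roughly like this, with cache and aio being the values that were varied (the storage ID here is a placeholder):
Code:
scsihw: virtio-scsi-single
scsi0: ztest:vm-101-disk-0,cache=none,iothread=1,aio=io_uring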