RaidZ1 performance: ZFS on host vs. VM

privnote

Dec 11, 2023
Hi,

I am currently testing my RaidZ1 setup.

The plan was to create the ZFS pool (4x 1TB PCIe NVMe SSDs) with Proxmox and then pass it as a disk to a VM (among others).
In my benchmarks with fio, however, I noticed that the performance on the host was significantly higher (approx. 50%) than in the VM.

Host (PVE 7.4):
write_throughput: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=libaio, iodepth=16

write: IOPS=243, BW=976MiB/s (1023MB/s)(858GiB/900075msec); 0 zone resets
WRITE: bw=976MiB/s (1023MB/s), 976MiB/s-976MiB/s (1023MB/s-1023MB/s), io=858GiB (921GB), run=900075-900075msec

VM (Debian 12):
I can't find my results right now, but I did the same fio benchmark here and the write was around 500 MB/s.
I then tried a lot of optimizations, such as setting the CPU type to host and disabling AIO threads and Spectre mitigations, but none of that helped much.
The performance only improved slightly when I changed the block size in the PVE UI to 1MB before creating the VM disk, but even then it was only around 650 MB/s.
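For reference, a minimal sketch of how that block size can be checked and changed from the CLI; the storage name "local-zfs" and the zvol name are just examples for whatever the setup actually uses, and only newly created disks pick up the new value:

Code:
# Change the "Block size" of a zfspool-type storage (example storage name)
pvesm set local-zfs --blocksize 1M

# Check which volblocksize an existing VM disk actually got (example zvol name)
zfs get volblocksize rpool/data/vm-100-disk-0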


This is the script for my fio benchmark.
The benchmark ran for 15 minutes:

Code:
IODEPTH=16
NUMJOBS=1
BLOCKSIZE=4M
RUNTIME=900


#TEST_DIR=/mnt/testLVM3/fiotest
TEST_DIR=/testZFS/fiotest
fio --name=write_throughput --directory=$TEST_DIR --numjobs=$NUMJOBS \
--size=1200G --time_based --runtime=$RUNTIME --ramp_time=2s --ioengine=libaio \
--direct=1 --bs=$BLOCKSIZE --iodepth=$IODEPTH --rw=randwrite \
--group_reporting=1 --iodepth_batch_submit=$IODEPTH \
--iodepth_batch_complete_max=$IODEPTH
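Inside the VM the same invocation was used, just pointed at a directory on the virtual disk's filesystem; the mount point and size below are only placeholders:

Code:
# Same fio benchmark inside the VM (mount point is a placeholder;
# adjust --size to the virtual disk size)
TEST_DIR=/mnt/fiotest
fio --name=write_throughput --directory=$TEST_DIR --numjobs=1 \
--size=100G --time_based --runtime=900 --ramp_time=2s --ioengine=libaio \
--direct=1 --bs=4M --iodepth=16 --rw=randwrite \
--group_reporting=1 --iodepth_batch_submit=16 --iodepth_batch_complete_max=16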


I created the ZFS pool and the VM with default settings

zpool create -fo 'ashift=12' testZFS raidz /dev/disk/by-id/nvme-Lexar_SSD_NM790_1TB_NLD648R000186P2202 /dev/disk/by-id/nvme-KIOXIA-EXCERIA_PLUS_G3_SSD_8DSKF3M9Z0E9 /dev/disk/by-id/nvme-KIOXIA-EXCERIA_PLUS_G3_SSD_8DSKF3SLZ0E9 /dev/disk/by-id/nvme-Lexar_SSD_NM790_1TB_NLD648R000184P2202
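To double-check the layout and ashift after creation, something like this can be used:

Code:
zpool status testZFS       # should show one raidz1 vdev with the 4 NVMe disks
zpool get ashift testZFS   # should report 12
zfs get recordsize,compression testZFS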

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Does anyone have an idea where the big performance difference between host and VM comes from?
I expected some performance degradation with the VM, but I was thinking around 10%, not 30% or even 50%

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In addition, the performance of the RaidZ1 also seems very poor to me: when testing the individual disks (only tested on the host), I had about 50% more performance in the same test.

But according to this formula, I should have had a factor of 3 higher performance
Streaming write speed: (N - p) * Streaming write speed of single drive
https://static.ixsystems.co/uploads/2020/09/ZFS_Storage_Pool_Layout_White_Paper_2020_WEB.pdf
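Roughly plugging in my numbers (assuming the single-disk result was about 1.5 GB/s, i.e. ~50% above the pool result):

Code:
# Streaming write speed ≈ (N - p) * streaming write speed of a single drive
# N = 4 disks, p = 1 parity, single drive ≈ 1.5 GB/s (rough figure from the single-disk test)
# expected: (4 - 1) * 1.5 GB/s ≈ 4.5 GB/s
# measured on the host: ~976 MiB/s ≈ 1.0 GB/s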
 
And my question was more about why I have better performance with a single disk than with 4 disks in RaidZ1
Raidz1(/2/3) has to wait for the slowest drive (because it has to wait for all drives and since they are different makes and models the "slowest one" might change during the work-load) and has additional write multiplication due to padding.
 
Raidz1(/2/3) has to wait for the slowest drive (because it has to wait for all drives and since they are different makes and models the "slowest one" might change during the work-load) and has additional write multiplication due to padding.
The speed should still be about 3 x the slowest SSD, shouldn't it?

According to this formula
Streaming write speed: (N - p) * Streaming write speed of single drive
 
ZFS flushes data in sync mode. It will wait until the slowest disk finishes its write and then push another batch to write. And of course all metadata / transaction updates as well.
 
And as already said... there is padding overhead. PVE 7 uses an 8K volblocksize by default. Combined with a 4-disk raidz1 and ashift=12, this means you lose 50% and not 25% of the raw capacity, as everything written to a zvol will be 150% in size. So on top of your data blocks there will be an additional 50% in padding blocks. With that in mind, you lose 33% of read/write bandwidth because of this overhead.
You probably want to increase the ZFS storage's "Block size" from 8K to 16K or even 64K and then destroy and recreate every VM's virtual disk.
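Rough arithmetic behind those numbers, assuming the usual raidz allocation rules (one parity sector per stripe row, plus padding up to a multiple of p+1 sectors):

Code:
# 4-disk raidz1, ashift=12 (4K sectors), volblocksize=8K
# 8K block  = 2 data sectors
# parity    = 1 sector                      -> 3 sectors
# padding   = round up to a multiple of 2   -> 4 sectors = 16K on disk
# padding is 4K = 50% on top of the 8K of data; usable space = 8K/16K = 50% of raw
#
# same pool with volblocksize=16K
# 16K block = 4 data sectors
# parity    = 2 sectors (one per stripe row) -> 6 sectors, already a multiple of 2
# usable space = 16K/24K ≈ 67% of raw (closer to the expected 75%)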
 
ZFS flushes data in sync mode. It will wait until the slowest disk finishes its write and then push another batch to write. And of course all metadata / transaction updates as well.
But it should still be much faster, shouldn't it?
Or is the formula not correct? We could also adjust the wording there to:
Streaming write speed: (N - p) * Streaming write speed of the slowest single drive

With that in mind, you lose 33% of read/write bandwidth because of this overhead.
I was aware that I was losing raw capacity.
But bandwidth too?
I actually thought that the padding area would then simply be "empty".

But anyway, as already described, I did the same test again with a block size of 1024KB.
And even with that, I only achieved 650 MB/s instead of 1000 MB/s.
 
If you want to speed up writes you can set sync=disabled. But keep in mind that you can lose some data in a power outage.
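A minimal sketch of that setting, using the pool name from earlier in the thread (a child dataset or single zvol can be targeted instead of the whole pool):

Code:
# Disable sync writes for the whole pool (risk: recently acknowledged writes can be lost on power failure)
zfs set sync=disabled testZFS
# Revert to the default behaviour
zfs set sync=standard testZFS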


As for the VM, look here - https://kb.blockbridge.com/technote/proxmox-optimizing-windows-server/part-2.html
Or simply use an NVMe as a ZIL.
Even with some partitioning (see the sketch below), where:
slice 1 = ZIL - 8GB
slice 2 = L2ARC - Remaining drive.
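A rough sketch of that layout; the device path is hypothetical and the partition sizes are just the ones named above:

Code:
# Partition a spare NVMe: 8G for the SLOG ("ZIL"), the rest for L2ARC (example device path)
sgdisk -n 1:0:+8G -n 2:0:0 /dev/disk/by-id/nvme-EXAMPLE_SPARE_SSD

# Attach the partitions to the pool
zpool add testZFS log   /dev/disk/by-id/nvme-EXAMPLE_SPARE_SSD-part1
zpool add testZFS cache /dev/disk/by-id/nvme-EXAMPLE_SPARE_SSD-part2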

"Raidz1(/2/3) has to wait for the slowest drive (because it has to wait for all drives and since they are different makes and models the "slowest one" might change during the work-load)" - only when a ZIL is not in use by the way I've described here, that will circumvent this behaviour in a major way.

When you're talking of the ZFS storage's "Block size" - do you mean recordsize or volblocksize? Those will be quite different. Presuming volblocksize.

My per-VM defaults on the latest Proxmox 8.x are:
volsize 1M
volblocksize 8K

I'm not sure this really matters as such.
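For reference, both properties can be checked side by side to see which applies where (zvols carry a volblocksize, datasets a recordsize); the names below are only examples:

Code:
zfs get recordsize,volblocksize rpool/data/vm-100-disk-0   # zvol: volblocksize set, recordsize "-"
zfs get recordsize,volblocksize rpool/data                 # dataset: recordsize set, volblocksize "-"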
 
"Raidz1(/2/3) has to wait for the slowest drive (because it has to wait for all drives and since they are different makes and models the "slowest one" might change during the work-load)" - only when a ZIL is not in use by the way I've described here, that will circumvent this behaviour in a major way.
But then the formula can't really be right, can it?
I would rather expect what you describe from a mirror.

The 2 SSD types are not so different in terms of speed.
And even if they were exactly the same, you can only achieve higher performance if writing is parallelized, can't you?

When you're talking of the ZFS storage's "Block size" - do you mean recordsize or volblocksize? Those will be quite different. Presuming volblocksize.
I mean the setting in the Proxmox UI / storage -> volblocksize
 
But then the formula can't really be right, can it?
I would rather expect what you describe from a mirror.
Actually, I said that. Every write to a Raidz is spread over every drive, so I claim that it needs to wait for each drive (when it's a sync write) and therefore the slowest one. ZFS updates metadata and waits for it to complete before doing the next change to guarantee data consistency.
The 2 SSD types are not so different in terms of speed.
Maybe, but they are different brands, using different firmware and different wear-leveling algorithms. One might be slow when the other is fast and the other way around. Both of them are consumer SSDs without PLP and cannot cache sync writes, which slows ZFS down (when doing sync writes or metadata updates).
And even if they were exactly the same, you can only achieve higher performance if writing is parallelized, can't you?
I think so, but Raidz requires all writes to go to all drives, while a stripe of mirrors can do them independently (within metadata limits).
I mean the setting in the Proxmox UI / storage -> volblocksize
Yes, that has a major impact on padding and write multiplication (and often also requires reads for each write, which limits bandwidth and latency).
 
When you're talking of the ZFS storage's "Block size" - do you mean recordsize or volblocksize? Those will be quite different. Presuming volblocksize.

My per-VM defaults on the latest Proxmox 8.x are:
volsize 1M
volblocksize 8K
PVE 7 default:
volblocksize = 8K
recordsize = 128K

PVE 8 default:
volblocksize = 16K
recordsize = 128K

What you define in the "Block size" field of a zfspool-type storage in PVE only defines which "volblocksize" will be used for newly created zvols (so virtual disks of VMs). LXCs always use datasets, where the recordsize is used instead of the "volblocksize", and there is no way via the web UI to tell PVE what recordsize to use when creating new datasets. This has to be set via the CLI by setting the recordsize of the parent dataset/pool so it will be inherited by the LXCs' virtual disks.
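A minimal sketch of both settings from the CLI; the storage and pool names are examples:

Code:
# "Block size" of the zfspool storage = volblocksize for newly created zvols (VM disks)
pvesm set local-zfs --blocksize 16k

# recordsize for newly created datasets (LXC disks) is set on the parent dataset so it is inherited
zfs set recordsize=64K rpool/data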
 
Maybe, but they are different brands, using different firmware and different wear-leveling algorithms. One might be slow when the other is fast and the other way around. Both of them are consumer SSDs without PLP and cannot cache sync writes, which slows ZFS down (when doing sync writes or metadata updates).
I understand the arguments in general, but I don't understand why the RaidZ1 is even significantly slower than the slowest disk.

And again the question, I should get about 3 x the write speed of the slowest disk, shouldn't I?

What you define in the "Block size" field of a zfspool-type storage in PVE only defines which "volblocksize" will be used for newly created zvols (so virtual disks of VMs).
I know.
As described above.
With the default of 8K I reached about 500 MB/s in the fio benchmark, and after I changed it to 1024KB (of course I recreated the disk) I reached 650 MB/s.

But on the host I achieve about 1000 MB/s with the same benchmark.
 
ZFS is a very complex system. It is a COW (copy-on-write) system. If you don't care about compression, encryption, snapshots, or data integrity, then use older file systems.

Use atop to see CPU and disk usage. Maybe it will show something interesting.
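For example (atop is not installed by default on PVE; zpool iostat gives a complementary per-vdev view):

Code:
apt install atop
atop 2                      # refresh every 2 seconds, watch CPU and per-disk busy%
zpool iostat -v testZFS 2   # per-vdev / per-disk bandwidth of the pool every 2 seconds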
 
If your VM writes in blocks of 12K, then the raidz1 could write 16K in parallel to the drives (assuming ZFS is that simple). But the VM probably writes in power-of-two block sizes, which with 1 drive of redundancy (+33%) need to be split over multiple records, which takes more time. That's why you have less space and speed than you expect. Using one more drive might help.

I did not see your VM configuration, but did you add something like args: -global scsi-hd.physical_block_size=16k -global scsi-hd.logical_block_size=4k to inform the operating system inside the VM of the optimal read/write size (when using VirtIO SCSI)? I hope that setting is optimal for ashift=12 and volblocksize=16K when not using Raidz, but someone please correct me if I'm wrong. Maybe the physical_block_size should be 12k for Raidz1, but I don't think that's supported.
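If someone wants to try it, this is roughly how it could be set (VM ID 100 and the values are just an example; it ends up as an args: line in the VM config):

Code:
qm set 100 --args '-global scsi-hd.physical_block_size=16k -global scsi-hd.logical_block_size=4k'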
 
I did not see your VM configuration, but did you add something like args: -global scsi-hd.physical_block_size=16k -global scsi-hd.logical_block_size=4k to inform the operating system inside the VM of the optimal read/write size (when using VirtIO SCSI)?
In theory yes, but I was never able to see any performance or wear improvement in fio benchmarks when setting it from the default 512B/512B to 4K/4K.
Not sure what black magic virtio is doing there to perform so well with the 512B defaults on an 8K/16K/32K zvol.

Did you see any improvements?
 
Did you see any improvements?
Not that I have noticed. I have been running only inside VMs on ZFS (mirrors) for so long that I don't know if I'm missing out on performance (as I seldom upgrade drives). Even the huge jump in fsync/sec with enterprise NVMe (measured with pveperf) does not feel faster, just less wear and a feeling of safety due to the PLP. The biggest improvement was adding a cheap TLC NVMe to my mirror of two HDDs that I use with PBS. The random reads "go to the SSD" and it becomes much more responsive (until it wears out, of course), without the need for a redundant special device.
 
Guys, I have a question here regarding the volblocksize.
When I look at my disks I get this:

Code:
root@pve:~# zfs get volblocksize
NAME                                        PROPERTY      VALUE     SOURCE
rpool                                       volblocksize  -         -
rpool/ROOT                                  volblocksize  -         -
rpool/ROOT/pve-1                            volblocksize  -         -
rpool/data                                  volblocksize  -         -
rpool/data/subvol-241-disk-0                volblocksize  -         -
rpool/data/subvol-242-disk-0                volblocksize  -         -
rpool/data/subvol-242-disk-0@fresh-install  volblocksize  -         -
rpool/data/subvol-243-disk-0                volblocksize  -         -
rpool/data/subvol-244-disk-0                volblocksize  -         -
rpool/data/subvol-245-disk-0                volblocksize  -         -
rpool/data/subvol-247-disk-0                volblocksize  -         -
rpool/data/subvol-249-disk-0                volblocksize  -         -
rpool/data/vm-100-disk-0                    volblocksize  8K        -
rpool/data/vm-100-disk-1                    volblocksize  8K        -
rpool/data/vm-101-disk-0                    volblocksize  8K        -
rpool/data/vm-101-disk-1                    volblocksize  8K        -
rpool/data/vm-101-disk-2                    volblocksize  8K        -
rpool/data/vm-102-disk-0                    volblocksize  8K        -
rpool/data/vm-102-disk-1                    volblocksize  8K        -
rpool/data/vm-102-disk-2                    volblocksize  8K        -
rpool/data/vm-103-disk-0                    volblocksize  8K        -
rpool/data/vm-104-disk-0                    volblocksize  8K        -
rpool/data/vm-248-disk-0                    volblocksize  8K        -
rpool/data/vm-248-disk-1                    volblocksize  8K        -
rpool/data/vm-248-disk-2                    volblocksize  8K        -
rpool/data/vm-252-disk-0                    volblocksize  16K       default
rpool/data/vm-252-disk-1                    volblocksize  16K       default
rpool/data/vm-253-disk-0                    volblocksize  8K        -
rpool/data/vm-253-disk-1                    volblocksize  8K        -

I assume the 8K on the majority of disks is because those machines came from a backup taken on PVE 7 and restored on PVE 8, while vm-252 was freshly installed on PVE 8.

The pool is a mirror of Intel S3610 SATA drives and is generally performing very well.

I guess I could improve performance by getting the volblocksize of all VM disks to 16K, I'm just unsure how.
I definitely want to avoid reinstalling the VMs, as this would be a lot of work.

If someone could give me some tips, it would be nice.
 
I guess I could improve performance by getting the volblocksize of all VM disks to 16K, I'm just unsure how.
I definitely want to avoid reinstalling the VMs, as this would be a lot of work.
Do something that creates a new virtual disk and copies the content: clone the VM (and delete the original) or backup and then restore (if all disks are backed up).
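Sketched out with example IDs and storage names (set the storage's "Block size" first, so the newly created disks pick up the new volblocksize):

Code:
pvesm set local-zfs --blocksize 16k

# Option 1: full clone (creates new zvols), then remove the original
qm clone 101 201 --full --storage local-zfs
qm destroy 101

# Option 2: backup and restore over the existing VM
vzdump 101 --storage backup-store --mode stop
qmrestore /var/lib/vz/dump/vzdump-qemu-101-<timestamp>.vma.zst 101 --storage local-zfs --force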
 
