RaidZ1 performance ZFS on host vs VM

Restoring the backups is what I did (backed it up on PBS under PVE 7 and restored it on PVE 8).
That's where the volblocksize of 8k comes from, I think.
Cloning might be the solution; I will look into it.
 
ZFS is a very complex system. It is a COW (copy-on-write) system. If you don't care about compression, encryption, snapshots, or data integrity, then use the old file systems.

Use #atop to see CPU and disk usage. Maybe it will show something interesting.
Yet the code is probably 50% of others, like UFS :)

"improve performance by getting the volblocksize for all vm-disks to 16k" - I wonder, has anybody done performance tests changing only the value for volblocksize?
 
"improve performance by getting the volblocksize for all vm-disks to 16k" - I wonder, has anybody done performance tests changing only the value for volblocksize?
Sure, otherwise OpenZFS wouldn't have switched from 8K to 16K as the new default. It will allow for better compression ratios and better performance doing big IO but the performance of very small IO (512B to 8K) will suffer. So its like always...it depends...on your hardware, pool layout and average workload.
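For anyone who wants to check or change this on their own system: volblocksize is fixed when a zvol is created, so existing VM disks keep their old value until they are recreated (e.g. by moving the disk to the storage again). A minimal sketch, with made-up dataset names:

Code:
# show the volblocksize of an existing VM disk (zvol)
zfs get volblocksize rpool/data/vm-100-disk-0
# a new zvol created with a 16K volblocksize (name and size are only examples)
zfs create -s -V 32G -o volblocksize=16k rpool/data/vm-100-disk-1

On PVE the "Block Size" field of the ZFS storage definition is what newly created or moved disks should pick up, if I remember correctly.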
 
I start this software: # atop -f -F 1
Then with "L" I switch the view to show all disks and CPUs. While the benchmark runs you can watch the disk activity and compare whether all disks are equally busy and what the CPU activity is.
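If atop is not installed, a similar per-disk view is also available with iostat from the sysstat package (just an alternative, not what was used above):

Code:
# extended per-device statistics in MB/s, refreshed every second
iostat -xm 1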
 
I start this software: # atop -f -F 1
Then with "L" I switch the view to show all disks and CPUs. While the benchmark runs you can watch the disk activity and compare whether all disks are equally busy and what the CPU activity is.
I haven't tested this yet, but I went back a step and benchmarked the individual disks again.

This is the result of the fio benchmark with the same settings as above.
The disks here are formatted with ext4.
nvme2 and nvme3 are the two Kioxia, and nvme0 and nvme1 are the Lexar.

[screenshot: fio results of the individual disks formatted with ext4]

And here is the result with ZFS.
The first block is the Lexar, the second the Kioxia, and the third the Kioxia again, but this time with 1 MB / 4 MB recordsize.

[screenshot: fio results of the individual disks on ZFS, including the 1 MB / 4 MB recordsize runs]
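(In case someone wants to reproduce the third block: changing the recordsize of a ZFS dataset would look roughly like this. The dataset name is taken from the fio script further down, and values above 1M may need the module limit raised first, so treat it as a sketch.)

Code:
# recordsize only affects data written after the change
zfs set recordsize=1M testZFS
# values above 1M (e.g. 4M) may require raising the zfs_max_recordsize module parameter
echo 16777216 > /sys/module/zfs/parameters/zfs_max_recordsize
zfs set recordsize=4M testZFS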


On the one hand, I'm very surprised that the Kioxia can sustain approx. 3400 MB/s permanently (over 15 minutes) when formatted with ext4.
And on the other hand, with ZFS the performance then drops very sharply to approx. 1000 MB/s with these drives
(also with the Lexar, but not as much there, because they already had significantly less performance in the ext4 benchmark than the Kioxia).

Does anyone have an explanation for this, especially for the big drop with the Kioxia devices?
 
What is your fio command?
Advertised performance is only reached within the cache.
It's the same as on the first page, but here it is again

Code:
# fio parameters
IODEPTH=16
NUMJOBS=1
BLOCKSIZE=4M
RUNTIME=900


#TEST_DIR=/mnt/testLVM3/fiotest
TEST_DIR=/testZFS/fiotest

# 15-minute random-write throughput run with 4M blocks
fio --name=write_throughput --directory=$TEST_DIR --numjobs=$NUMJOBS \
--size=1200G --time_based --runtime=$RUNTIME --ramp_time=2s --ioengine=libaio \
--direct=1 --bs=$BLOCKSIZE --iodepth=$IODEPTH --rw=randwrite \
--group_reporting=1 --iodepth_batch_submit=$IODEPTH \
--iodepth_batch_complete_max=$IODEPTH
 
The drop only exists on consumer / regular drives when their cache is full.
The 2 TB version slows down to 450 MB/s after writing 490 GB ... tested here at the end of the article with AIDA
I cannot confirm that.
As you can see from the graphic, the two Kioxia did not have a drop over the entire 15-minute runtime.
Approx. 3 TB of data was written during the benchmark
(I am using the 1 TB version).

And I didn't mean a general drop, but the drop if you use ZFS instead of ext4.
Is it really normal that you then only have 30% of the performance?
 
Last edited:
And I didn't mean a general drop, but the drop if you use ZFS instead of ext4.
Is it really normal that you then only have 30% of the performance?
There is the additional (sync) metadata and checksums and therefore write amplification. Also, raidz1/2/3 does not gain better IOPS or bandwidth like mirrors or stripes. I'm not surprised about this.
 
There is the additional (sync) metadata and checksums and therefore write amplification. Also, raidz1/2/3 does not gain better IOPS or bandwidth like mirrors or stripes. I'm not surprised about this.
Do you think that a 70% performance loss with ZFS is "normal"?
(By the way, my last screenshots were single disk benchmarks and not from a RaidZ1, in case you didn't see that)

According to this formula, I should have had a factor of 3 higher performance with RaidZ1
Streaming write speed: (N - p) * Streaming write speed of single drive
https://static.ixsystems.co/uploads/2020/09/ZFS_Storage_Pool_Layout_White_Paper_2020_WEB.pdf

Or is this formula wrong?
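To put numbers on it: assuming the pool is the four NVMe drives in a single raidz1 vdev (N = 4, p = 1) and taking the approx. 3400 MB/s the Kioxia reached on ext4 as the single-drive streaming speed, the formula gives

Code:
(N - p) * streaming write speed of a single drive
= (4 - 1) * ~3400 MB/s
= ~10200 MB/s   (theoretical ceiling for large sequential writes, before any metadata/parity overhead)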
 
For big sequential writes it's correct.
For write IOPS, only the number of top-level vdevs counts, not the number of disks.
At what size do you define it as big?

And in my benchmark I use random writes, but with relatively large 4 MB blocks, and the IOPS of a single disk should be sufficient for this?
 
At what size do you define it as big?

And in my benchmark I use random writes, but with relatively large 4 MB blocks, and the IOPS of a single disk should be sufficient for this?
For sequential writes 4M would be fine, but you then want to use "write" and not "randwrite".
For IOPS, something small like 4K and then with "randwrite" and a high queue depth or multiple instances.
For latency, 4K "randwrite" with queue depth 1, a single instance, and sync enabled.
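Rough sketches of what those three runs could look like with fio (directory, sizes and runtimes are placeholders, adjust to your setup):

Code:
# sequential throughput: large blocks, sequential writes
fio --name=seq_write --directory=/testZFS/fiotest --size=100G --time_based --runtime=300 \
  --ioengine=libaio --direct=1 --bs=4M --iodepth=16 --rw=write --group_reporting=1

# write IOPS: small blocks, random writes, high queue depth and several jobs
fio --name=rand_iops --directory=/testZFS/fiotest --size=20G --time_based --runtime=300 \
  --ioengine=libaio --direct=1 --bs=4k --iodepth=32 --numjobs=4 --rw=randwrite --group_reporting=1

# latency: 4k sync random writes, queue depth 1, single job
fio --name=sync_lat --directory=/testZFS/fiotest --size=10G --time_based --runtime=300 \
  --ioengine=libaio --direct=1 --bs=4k --iodepth=1 --numjobs=1 --rw=randwrite --sync=1 --group_reporting=1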
 
Do you think that a 70% performance loss with ZFS is "normal"?
Think of it this way: there is zero performance loss on the hardware of the SSD. The SSD always has the same capabilities regardless of the filesystem used.

ZFS does more than, let's say, ext4. ZFS works very hard to guarantee the integrity of any data written to disk for future reads. And it works hard to guarantee that the metadata is valid. And it adds redundancy to be able to repair data. It reaches these goals by adding a lot of check-summing and handling metadata differently than other filesystems.

When ZFS needs to write much more metadata, and more often, in a sequential, verified and reliable way, the user simply sees that writing data takes longer --> write amplification.

Other filesystems perform better while using cheap SSDs just because they do not care.

So yes, ZFS is slower. A lot slower on cheap SSDs and perhaps a little slower on good ones. Personally I am fine with this tradeoff; the benefits are worth it. My data is important. When I read files now or in a few years, I appreciate the guarantee that the data I get is the very same and unmodified data that I wrote to the storage yesterday or years ago.

:)
 
ZFS does more than, let's say, ext4. ZFS works very hard to guarantee the integrity of any data written to disk for future reads. And it works hard to guarantee that the metadata is valid. And it adds redundancy to be able to repair data. It reaches these goals by adding a lot of check-summing and handling metadata differently than other filesystems.
I know that there must be losses due to the additional features, but I think 70% is quite a lot.

Whether it is significantly less with enterprise SSDs, I would have to test first.
Maybe I'll get a chance to do that at work

By the way, I did another test with btrfs.
There the performance was twice as high as with ZFS.
Unfortunately, this is still about 50% less than ext4.

This is not meant to be ZFS bashing, I just wanted to get some information.
Of course, it would be nice if ZFS documented somewhere that the performance loss is this great.
 
No... something must be quite wrong here.
Worth doing a zpool iostat with -v or -x during those writes.
Checking over your zpool + zfs get all options on the FS too.
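For reference, something along these lines while the benchmark is running (the pool/dataset name is a placeholder):

Code:
# per-vdev / per-disk bandwidth and IOPS, refreshed every second
zpool iostat -v testZFS 1
# the same view including average latencies per device
zpool iostat -vl testZFS 1
# dump pool and dataset properties (ashift, recordsize, compression, sync, ...)
zpool get all testZFS
zfs get all testZFS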

"ZFS needs to write much more metadata" - compared with something like ext4, I would find this to be subjectively unlikely.
Ergo, it is maybe possible that more metadata is written in total, but how that data is written out will be handled much better. Even this will be done using COW as well. Also with using a block freelist instead of a block allocator as your data usage increases, the efficency vs ext4 only increases.
 
