Slow disk writes with ZFS RAID 10

Vedeyn

Member
Dec 30, 2018
Good Afternoon,

I'm a bit new to ZFS, so please bear with me. I recently configured two identical servers (Ryzen 7 3700X, 96GB RAM, 500GB NVMe OS drive, 4x 8TB HDDs, 2.5Gb NIC) with Proxmox. On each node I set up the four 8TB HDDs in a RAID-Z1 pool through the Proxmox GUI. I wasn't particularly happy with the performance in some cases, so I decided to destroy the zpool on one node and set it up as RAID-10 using the steps provided here: https://pve.proxmox.com/wiki/ZFS_on_Linux
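(For reference, the striped-mirror setup from that wiki page boils down to a single zpool create along these lines; the pool name and device paths below are just placeholders, not my exact ones.)

Code:
# Striped mirror ("RAID10") over four disks; ashift=12 assumes 4K-sector drives
zpool create -f -o ashift=12 tank \
    mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
    mirror /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4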

The problem is that I'm seeing some odd behavior when running rsyncs between the nodes. When doing an rsync (~25GB movie) from the RAID-Z1 to the RAID-10 array, speeds start off around 260MB/s but quickly drop to about 95MB/s and stay there. Syncs from the RAID-10 to the RAID-Z1, however, start at about 160MB/s and pretty much stay there the entire time. I get the same results inside and outside of a VM. I'd expect to see the exact opposite numbers.

I'm at a bit of a loss as to why this would happen, so any help or suggestions would be greatly appreciated.


Thank you for your time.
 
When doing an rsync (~25GB movie) from the RAID-Z1 to the RAID-10 array, speeds start off around 260MB/s but pretty quickly drop down to about 95MB/s and stay there.
This sounds like a cache is involved. Once it is full, the speed drops to the actual speed that can be sustained. It might also be something CPU- or IO-bound in rsync itself.
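One way to check that theory is to watch the actual on-disk throughput of the receiving pool while the rsync runs; once the cache fills up, the numbers should settle at the sustained write speed. Something like this (pool name is a placeholder):

Code:
# Per-vdev read/write throughput of the receiving pool, refreshed every second
zpool iostat -v tank 1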

Syncs from the RAID-10 to the RAID-Z1 however start and pretty much stay at about 160MB/s the entire time.
A RAID10-like pool and a raidz1 pool have quite different performance characteristics depending on the kind of IO that is happening. What you are doing here is basically a bandwidth test.

For a bandwidth test, a raidz pool will be better at writing, since the writes can be spread out over all disks minus the parity data. On a RAID10-like pool with 4 disks (2 mirrors), writes get the bandwidth of roughly 2 disks, while reads get the performance of all 4.

Therefore I assume that in the raidz -> mirror direction the bottleneck is on the write side of the mirror pool, while in the mirror -> raidz direction you benefit from the data being read from all 4 disks and from the raidz1 pool being better at writing a single data stream.

Be aware, though, that the other metric, how many operations can be performed per second (IOPS), strongly favors mirrored pools. If you store VMs on that pool, IOPS is usually the more important metric.

Our documentation has a chapter about this: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_zfs_raid_considerations
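If you want to measure the two metrics separately instead of inferring them from rsync, fio runs along these lines on each pool would do; the directory, sizes and block sizes are only illustrative:

Code:
# Sequential write bandwidth: large blocks, single stream
fio --name=seqwrite --directory=/tank/fiotest --rw=write --bs=1M --size=4G \
    --ioengine=libaio --end_fsync=1 --group_reporting

# Random write IOPS: small blocks, deeper queue
fio --name=randwrite --directory=/tank/fiotest --rw=randwrite --bs=4k --size=4G \
    --ioengine=libaio --iodepth=32 --end_fsync=1 --group_reporting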
 

Thanks for the info. I am still a bit new to ZFS, so I was expecting things to work more like traditional RAID, but it sounds like that is not the case.

Do you have any recommendations as to what "Block Size" (Datacenter -> Storage -> Edit) should be used as a decent middle ground between performance and storage efficiency? I've seen a lot of different answers to that question. The default value causes quite a lot of extra space to be consumed on my ZFS pools.
 
You can use this table for a rough estimate, but keep in mind that it doesn't take compression and so on into account.
Example: you have a pool of four 8TB disks in raidz1 with a sector size of 4K (ashift=12). With 3/6/9/12 data sectors per block you only lose the optimal 25% of capacity to parity, so with 4K sectors a volblocksize of 12K/24K/36K/48K would be the theoretical optimum. In general you want the smallest possible volblocksize so you don't get too much overhead, which makes 12K a good value to start with. With the default of 8K (2 sectors) you lose 50% of the capacity: in both cases the pool will show 24 of 32TB as usable, but with 8K everything you write needs 50% more space, so 16TB of data will consume 24TB on the storage.

If you use a striped mirror (raid10) with 4 disks, 8K should be the best value (in the case of 4K sectors).
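If you change it, the new block size only applies to newly created disks; something along these lines sets it on the storage and shows the inflation of an existing zvol (the storage ID and dataset name are placeholders):

Code:
# Set the volblocksize used for newly created VM disks on a ZFS storage
pvesm set local-zfs --blocksize 12k

# Existing zvols keep their old volblocksize; compare logical vs. allocated size
zfs get volblocksize rpool/data/vm-100-disk-0
zfs list -o name,volsize,logicalused,used rpool/data/vm-100-disk-0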
 
If you use a striped mirror (raid10) with 4 disks, 8K should be the best value (in the case of 4K sectors).
So even now that PVE defaults to 16k for zvols, a similar scenario (4 drives (512e) / ashift=12) would still have 8k as the optimal value, right?
 
So even now that PVE defaults to 16k for zvols, a similar scenario (4 drives (512e) / ashift=12) would still have 8k as the optimal value, right?
I don't think so, because almost all ZFS pools have compression enabled, and depending on the data, every block is individually compressed before it is written to disk; don't forget the additionally generated checksums either. So in reality I strongly assume all those "ideally optimal disks per vdev" suggestions are totally nonsense, as they would rarely fit in practice.
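If you want to see how much compression actually changes the picture for your data, the achieved ratio is easy to check (pool name is a placeholder):

Code:
# Compression algorithm and achieved ratio for a pool or dataset
zfs get compression,compressratio rpool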
 
I don't think so, because almost all ZFS pools have compression enabled, and depending on the data, every block is individually compressed before it is written to disk; don't forget the additionally generated checksums either. So in reality I strongly assume all those "ideally optimal disks per vdev" suggestions are totally nonsense, as they would rarely fit in practice.
So what is your rule of thumb, then? All defaults?
Also check my post here: https://forum.proxmox.com/threads/blocksize-recordsize-thin-provision-options.155553/#post-710296 (specifically the part with my examples).
If you have any insights on my assumptions in those examples, feel free to join in and speak your mind.
Finally, maybe they are "totally nonsense" as you say, but you still have to put a value in there.
 