How do I increase the size of a RAID 10 ZFS pool in Proxmox?

And that's it? No further configuration is required in Proxmox?
That's basically it.

But...
1.) Depending on the number of disks you add, you may want to increase the volblocksize for better performance.
2.) I would personally add the disks by "/dev/disk/by-id/[yourdisk]" instead of "/dev/sd[ef]", so it's easier to identify a disk with problems.
3.) Make sure you really add a mirror, e.g. zpool add rpool mirror /dev/sde /dev/sdf
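Putting points 2 and 3 together, a sketch of what that could look like with stable by-id paths (the disk identifiers below are placeholders for your actual devices):

```shell
# Find the stable by-id names of the new disks (names below are placeholders):
ls -l /dev/disk/by-id/

# Dry run first (-n shows the resulting layout without changing the pool),
# then add the two disks as a new mirror vdev:
zpool add -n rpool mirror /dev/disk/by-id/nvme-DISK1 /dev/disk/by-id/nvme-DISK2
zpool add rpool mirror /dev/disk/by-id/nvme-DISK1 /dev/disk/by-id/nvme-DISK2

# Verify the new mirror vdev shows up:
zpool status rpool
```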
 
And in case it is your boot pool, you might want to make the new disks also bootable by adapting the instructions here:
https://pve.proxmox.com/wiki/ZFS_on_Linux#_zfs_administration -> "Changing a failed bootable device"
Yup, but not really needed. By default, when creating a RAID 10 via the web UI, it will also only create the boot partitions on the first mirror, but not on the second/third/... mirror. Once both disks of the first mirror fail at the same time, you wouldn't be able to boot anymore, as the other disks aren't bootable... but on the other hand, all data on that pool is then lost anyway... ;)

But it would indeed be an important step when adding more disks to an existing bootable vdev (like when turning a single disk into a mirror, turning a 2-disk mirror into a 3-disk mirror, or later when extending a raidz1/2/3, once that feature is added).
But it won't hurt to add the bootloaders if for some reason you'd like to.
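For reference, the procedure from the linked wiki page boils down to something like the following (device names are placeholders; on a default Proxmox layout, partition 2 is the ESP, assuming the new disk is partitioned like the existing boot disks):

```shell
# Copy the partition table from an existing boot disk (sda) to the new
# disk (sdf) and randomize the GUIDs on the copy:
sgdisk /dev/sda -R /dev/sdf
sgdisk -G /dev/sdf

# Format and initialize the new ESP so the disk becomes bootable:
proxmox-boot-tool format /dev/sdf2
proxmox-boot-tool init /dev/sdf2

# Check which ESPs are now configured and synced:
proxmox-boot-tool status
```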
 
Once both disks of the first mirror fail at the same time, you wouldn't be able to boot anymore, as the other disks aren't bootable... but on the other hand, all data on that pool is then lost anyway... ;)

You are absolutely right, of course. I did not think far enough. :eek:
 
That's basically it.

But...
1.) Depending on the number of disks you add, you may want to increase the volblocksize for better performance.
[..]

Just to clarify with the following setup:
- 10 nvme disks (RAID 10, striped & mirrored)
- ashift 12 (4K sectors)

5 mirrors * 4k = 20k

is 16k or 20k recommended for volblocksize?
 
Just to clarify with the following setup:
- 10 nvme disks (RAID 10, striped & mirrored)
- ashift 12 (4K sectors)

5 mirrors * 4k = 20k

is 16k or 20k recommended for volblocksize?
AFAIK, the volblocksize does not matter as much with a striped mirror setup as it does with RAIDz*.
The default volblocksize in OpenZFS 2.2 is 16 KB, maybe you should just stick with that.
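Note that volblocksize is fixed at zvol creation time, so for new VM disks in Proxmox you would set it on the storage. A sketch (the storage name local-zfs is a placeholder for your actual ZFS storage):

```shell
# Set the blocksize used for newly created zvols on this storage;
# existing zvols keep whatever volblocksize they were created with:
pvesm set local-zfs --blocksize 16k

# Check the volblocksize of an existing zvol:
zfs get volblocksize rpool/data/vm-100-disk-0
```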
 
Test Results:

PVE 8.2 - Fresh install:
- 2x Intel Xeon Platinum 5th Gen, 60 cores, 120 threads, 300 MB L3 cache
- 32 * 64 GiB ECC DDR5-5800
- 12x KIOXIA CM7-V 3.2 TiB

KIOXIA NVMe:
- 128 KiB read: 14,000 MB/s
- 128 KiB write: 6,750 MB/s
- 4 KiB rand read: 2,700,000 IOPS
- 4 KiB rand write: 600,000 IOPS

PVE ZFS Configuration:
- 10 disks RAID 10 (ashift=12)
- 2 disks spare
- Recordsize=128K (I guess only important for datasets, not ZVOLs?)

Theoretically, the following should be possible:
Read gain: 10x
Write gain: 5x

Read: 140 GiB/s
Write: 33.75 GiB/s

ZFS Settings for test env:
Code:
zfs set redundant_metadata=most rpool
zfs set xattr=sa rpool
zfs set atime=off rpool
zfs set compression=lz4 rpool
zfs set primarycache=metadata rpool ← only for testing, to not benchmark the RAM :)

VM:
- Debian 12 clean install
- V-CPU: 12
- Memory: 12 GiB
- Cache: None
- Controller: VirtIO SCSI
- Async IO: io_uring

The following fio tests are executed inside the described VM:

ZVOL Block Size  SEQ READ    RAND READ  SEQ WRITE   RAND WRITE
16K              14.6 GiB/s  140k       13.4 GiB/s  59.5k
32K              21.6 GiB/s  144k       24.0 GiB/s  42.0k
64K              41.0 GiB/s  146k       27.7 GiB/s  42.4k
128K             59.0 GiB/s  132k       30.6 GiB/s  40.6k
256K             69.3 GiB/s  130k       29.2 GiB/s  41.7k
512K             58.1 GiB/s  124k       27.9 GiB/s  40.9k
1M               60.4 GiB/s  69.9k      26.3 GiB/s  38.5k


Seq Read:
Code:
fio --ioengine=libaio --direct=1 --name=test --filename=seq_read.fio \
--bs=1M --iodepth=16 --size=1G --rw=read --numjobs=8 --refill_buffers \
--time_based --runtime=30

Seq Write:
Code:
fio --ioengine=libaio --direct=1 --name=test --filename=seq_write.fio \
--bs=1M --iodepth=16 --size=1G --rw=write --numjobs=8 --refill_buffers \
--time_based --runtime=30

Rand Read:
Code:
fio --ioengine=libaio --direct=1 --name=test --filename=rand_read.fio \
--bs=4K --iodepth=16 --size=1G --rw=randread --numjobs=8 --group_reporting \
--refill_buffers --time_based --runtime=30

Rand Write:
Code:
fio --ioengine=libaio --direct=1 --name=test --filename=rand_write.fio \
--bs=4K --iodepth=16 --size=1G --rw=randwrite --numjobs=8  --group_reporting \
--refill_buffers --time_based --runtime=30

Based on these results, it seems the best performance is reached with a block size of 128K-256K, which is far from the Proxmox default of 8K (or the ZFS default of 16K).

Do you really recommend setting the block size to such a high value of 128K/256K?

IOPS seem to be capped at ~140k - benchmarking a single disk (without ZFS) results in ~2,000,000 IOPS.
Are there other ZFS options which have to be tuned?

This hypervisor will run around 300 Debian Linux VMs with web servers and databases. Debian/ext4 uses 4K block sizes and InnoDB uses 16K pages, so is there anything about read/write amplification that speaks against a 256K volblocksize?

I would really appreciate any thoughts or advice on this setup :)

Cheers
 
Based on these results, it seems the best performance is reached with a block size of 128K-256K, which is far from the Proxmox default of 8K (or the ZFS default of 16K).
Yes, for a synthetic benchmark this is normal, and you can see that 4K random reads/writes are much better with smaller volblocksizes, and would be better still if you benchmarked with volblocksizes down to 4K.


Do you really recommend setting the block size to such a high value of 128K/256K?
That depends heavily on the data inside the VM. You also need to tune the filesystem in the guest to have aligned and proper block sizes in order not to have huge read and/or write amplification. If you e.g. store a lot of large files, a bigger volblocksize is better; yet if you store a lot of small files smaller (or much smaller) than your volblocksize, you will have tremendous write amplification. If they're from a database, you will have a lot of sync writes of large blocks and could potentially have a lot of read amplification if the blocks are not currently in memory.
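To check the guest side for alignment, a small sketch (assuming ext4 on /dev/sda1 and MySQL/InnoDB inside the VM; both are placeholders for your actual setup):

```shell
# ext4 block size of the guest filesystem (typically 4096 bytes):
tune2fs -l /dev/sda1 | grep 'Block size'

# InnoDB page size (16K by default):
mysql -e "SHOW VARIABLES LIKE 'innodb_page_size';"

# Logical block size the virtual disk reports to the guest:
blockdev --getbsz /dev/sda
```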

There is not one number that fits all. It's as simple as that. You may need to tune each volume separately. OpenZFS - as previously stated - set the default to 16K after years of study, as a tradeoff for the best overall VM performance they observed. This number is probably also only valid for ashift=12 (or higher) and on striped mirrors. You're lucky you don't also have to optimize for RAIDz*.

If you want to do further tests:
  • Compare your fio results with tests on only one or two threads. It's uncommon to have 8 threads hammering concurrently. Most of the time you have fewer IO-heavy workloads, at least in my experience, YMMV.
  • Install the OS (preferably automatically, for reproducibility) on zvols with different volblocksizes and measure the real-world performance
  • Install one OS and dd it to different VMs with varying volblocksizes, then do backups and potentially restores; note that you need to change the storage blocksize for each test so the destroyed and recreated volumes get the correct volblocksize
  • For databases, use a database benchmark or known workload with data on different volblocksizes
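Per-volume tuning could look like this (dataset name and size are illustrative): since volblocksize can only be set at creation time, you create the zvol with the desired value and then copy the data over.

```shell
# Create a zvol with a specific volblocksize for a database-heavy VM:
zfs create -V 100G -o volblocksize=16k rpool/data/vm-101-disk-1

# volblocksize is read-only after creation; verify it:
zfs get volblocksize rpool/data/vm-101-disk-1
```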
 
Hey LnxBil, thank you so much for your great answer :)
I will do some more benchmarks as suggested and will share the results
 
Code:
zpool add rpool mirror /dev/sd[ef]
Just be aware that this will NOT rebalance existing data in your pool. That means further writes may have inconsistent and often "poor" performance due to the lack of vdev availability for full-stripe writes. In an ideal world, you would export all data out to a temporary space, and then zfs receive it back.
 
in an ideal world, you would export all data out to a temporary space, and then zfs receive it back.
Just send/receive the data from and to the same pool, delete the original data and rename the dataset to the previous name. If you have a fine granular setup regarding datasets, it's no problem.
 
Just send/receive the data from and to the same pool, delete the original data and rename the dataset to the previous name. If you have a fine-granular setup regarding datasets, it's no problem.
That assumes the pool is less than half full; chances are that's not the case for a request to increase the pool size. Moreover, the resulting free space would probably not be distributed across all vdevs anyway, since the pool copy has to reside on the free space - which would be mostly on the added vdev.
 
That assumes the pool is less than half full; chances are that's not the case for a request to increase the pool size. Moreover, the resulting free space would probably not be distributed across all vdevs anyway, since the pool copy has to reside on the free space - which would be mostly on the added vdev.
Yes, it depends on a few assumptions, yet you can always run multiple rebalancing loops. The upside is that you can do that almost online or just with a very small downtime window for each dataset and you don't need double the space overall to copy everything off. If you have the space and a downtime window ... do a send/receive to another pool.
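A sketch of one such rebalancing loop on the same pool (dataset names are placeholders; for zvols attached to running VMs, stop the VM for the switchover):

```shell
# Snapshot and copy the dataset within the same pool; the copy lands
# mostly on the newly added vdev:
zfs snapshot rpool/data/mydataset@rebalance
zfs send rpool/data/mydataset@rebalance | zfs receive rpool/data/mydataset-new

# Short downtime window: destroy the original and rename the copy back:
zfs destroy -r rpool/data/mydataset
zfs rename rpool/data/mydataset-new rpool/data/mydataset
zfs destroy rpool/data/mydataset@rebalance
```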
 
