Of course there is. Just add another vdev via
Code:
zpool add rpool mirror /dev/sd[ef]
For more information, please refer to the manpage. That's basically it.
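You can then verify the new layout with a quick check (nothing Proxmox-specific here, and device names are just examples):
Code:
zpool status rpool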
And that's it? No further configuration is required in Proxmox?
Code:
zpool add rpool mirror /dev/sde /dev/sdf
Jup. In case it is your boot pool, you might want to make the new disks also bootable by adapting the instructions here:
https://pve.proxmox.com/wiki/ZFS_on_Linux#_zfs_administration -> "Changing a failed bootable device"
Not really needed, though: by default, when creating a raid10 via the webUI, it will also only create the boot partitions on the first mirror, but not on the second/third/... mirror. Once both disks of the first mirror fail at the same time, you wouldn't be able to boot anymore, as the other disks aren't bootable... but on the other hand, all data on that pool is then lost anyway...
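Adapting those instructions, the steps would look roughly like this (disk names and the ESP partition number are assumptions based on the usual PVE installer layout; UEFI case):
Code:
# hypothetical: replicate the partition layout from an existing bootable disk (sda) to a new one (sde)
sgdisk /dev/sda -R /dev/sde
sgdisk -G /dev/sde
# format and initialize the new ESP (partition 2 in the default PVE layout)
proxmox-boot-tool format /dev/sde2
proxmox-boot-tool init /dev/sde2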
That's basically it.
But...
1.) depending on the number of disks you add, you may want to increase the volblocksize for better performance (see the sketch after this list)
[..]
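For reference, on PVE the volblocksize of newly created zvols comes from the blocksize option of the ZFS storage, so something like this should change the default (the storage name is just an example; existing zvols keep their old volblocksize):
Code:
pvesm set local-zfs --blocksize 16k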
AFAIK, the volblocksize does not matter with a striped mirror setup as much as it matters with RAIDz*.
Just to clarify, with the following setup:
- 10 NVMe disks (RAID 10, striped & mirrored)
- ashift 12 (4k sectors)
5 mirrors * 4k = 20k
Is 16k or 20k recommended for volblocksize?
Code:
zfs set redundant_metadata=most rpool
zfs set xattr=sa rpool
zfs set atime=off rpool
zfs set compression=lz4 rpool
zfs set primarycache=metadata rpool   # only for testing, to not benchmark the RAM :)
ZVOL Block Size | SEQ READ   | RAND READ (IOPS) | SEQ WRITE  | RAND WRITE (IOPS)
16k             | 14.6 GiB/s | 140k             | 13.4 GiB/s | 59.5k
32k             | 21.6 GiB/s | 144k             | 24.0 GiB/s | 42.0k
64k             | 41.0 GiB/s | 146k             | 27.7 GiB/s | 42.4k
128k            | 59.0 GiB/s | 132k             | 30.6 GiB/s | 40.6k
256k            | 69.3 GiB/s | 130k             | 29.2 GiB/s | 41.7k
512k            | 58.1 GiB/s | 124k             | 27.9 GiB/s | 40.9k
1M              | 60.4 GiB/s | 69.9k            | 26.3 GiB/s | 38.5k
Code:
# sequential read/write: 1M blocks, 8 jobs, iodepth 16
fio --ioengine=libaio --direct=1 --name=test --filename=seq_read.fio \
--bs=1M --iodepth=16 --size=1G --rw=read --numjobs=8 --refill_buffers \
--time_based --runtime=30
fio --ioengine=libaio --direct=1 --name=test --filename=seq_write.fio \
--bs=1M --iodepth=16 --size=1G --rw=write --numjobs=8 --refill_buffers \
--time_based --runtime=30
# random read/write: 4k blocks, aggregate IOPS via --group_reporting
fio --ioengine=libaio --direct=1 --name=test --filename=rand_read.fio \
--bs=4K --iodepth=16 --size=1G --rw=randread --numjobs=8 --group_reporting \
--refill_buffers --time_based --runtime=30
fio --ioengine=libaio --direct=1 --name=test --filename=rand_write.fio \
--bs=4K --iodepth=16 --size=1G --rw=randwrite --numjobs=8 --group_reporting \
--refill_buffers --time_based --runtime=30
Based on those results, it seems the best performance could be reached with a block size of 128k - 256k, which is far away from the default of 8k (or the ZFS default of 16k).

Yes, for a synthetic benchmark this is normal: you can see that 4k random read/write cycles are much better with smaller volblocksizes, and they would be better still if you benchmarked with even lower volblocksizes, down to 4k.
Do you really recommend setting the block size to such a high value of 128k / 256k?

That depends heavily on the data used inside the VM. You also need to tune the filesystem in the guest to have aligned and proper blocksizes in order not to get huge read and/or write amplification. If you e.g. store a lot of large files, a bigger volblocksize is better, yet if you store a lot of small files < and << your volblocksize, you will get tremendous write amplification. If the writes come from a database, you will have a lot of sync writes of huge blocks and could potentially get a lot of read amplification if the blocks are not currently in memory.
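As a hedged illustration of what guest-side alignment can look like (device name and sizes are assumptions, not from this thread): ext4 can't use blocks larger than 4k on x86, but its allocator can be nudged to align to the volblocksize via the RAID-stripe options, e.g. for a 16k volblocksize:
Code:
# hypothetical: align ext4 allocations to 16k (4 x 4k filesystem blocks) inside the guest
mkfs.ext4 -b 4096 -E stride=4,stripe-width=4 /dev/vda1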
Code:
zpool add rpool mirror /dev/sd[ef]

Just be aware that this will NOT rebalance existing data in your pool. That means that further writes may have inconsistent and often "poor" performance due to the lack of vdev availability for full stroke writes. In an ideal world, you would export all data out to a temporary space, and then zfs receive it back.
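You can see how unevenly the data ends up spread after adding a vdev by checking the per-vdev allocation:
Code:
zpool list -v rpool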
Just send/receive the data from and to the same pool, delete the original data, and rename the dataset to the previous name. If you have a fine-granular setup regarding datasets, it's no problem.
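A minimal sketch of one such rebalancing pass for a single zvol (dataset names are hypothetical; the guest using the zvol has to be stopped for the switch):
Code:
# snapshot and copy the zvol within the same pool
zfs snapshot rpool/data/vm-100-disk-0@rebalance
zfs send rpool/data/vm-100-disk-0@rebalance | zfs receive rpool/data/vm-100-disk-0-new
# drop the original, give the copy its old name, clean up the snapshot
zfs destroy -r rpool/data/vm-100-disk-0
zfs rename rpool/data/vm-100-disk-0-new rpool/data/vm-100-disk-0
zfs destroy rpool/data/vm-100-disk-0@rebalance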
That assumes that the pool is less than half full; chances are that's not the case for a request to increase the pool size. Moreover, the resulting free space would probably not be distributed across all vdevs anyway, since the pool copy has to reside on the free space, which would be mostly on the added vdev.
Yes, it depends on a few assumptions, yet you can always run multiple rebalancing loops. The upside is that you can do that almost online, or with just a very small downtime window for each dataset, and you don't need double the space overall to copy everything off. If you have the space and a downtime window... do a send/receive to another pool.