Expand PBS storage

NdK73

Hello all.

I'm going to configure a new PBS server.
It currently hosts 8 x 16TB disks, but "soon" (during the year, not next week) we'll need to add another 16 x 16TB (for a total of 24 x 16TB).
I was thinking of using RAIDZ3, starting with the current 8 disks and expanding later with the new ones, but it seems "zpool attach" is not yet supported for RAIDZ3 :(
Any hints? The system is currently blank, but I'd need to start using it ASAP...

Tks,
Diego
 
You cannot add a disk to an existing raidz3 vdev, but you can add another raidz3 vdev to an existing pool ("tank" in the following example), e.g.:
Code:
zpool add tank raidz3 /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr
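
Applied to your case, once the 16 new disks arrive they could be added as two further 8-disk raidz3 vdevs. A minimal sketch, assuming the pool is called "tank" and the new disks show up as /dev/sdi ... /dev/sdx (in practice, use the stable /dev/disk/by-id paths instead):
Code:
# assumption: the 16 new disks appear as sdi..sdx; add them as two 8-wide raidz3 vdevs
zpool add tank raidz3 /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp
zpool add tank raidz3 /dev/sdq /dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx
Note that ZFS does not rebalance existing data onto the new vdevs; only new writes are spread across them.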
 
And just a reminder:
It's not recommended to use HDDs, because PBS needs IOPS performance. Using raidz makes it even worse, as IOPS performance only scales with the number of vdevs, not with the number of disks.
A striped mirror might be a better choice, so you get as many vdevs as possible (see the sketch below).
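
A minimal sketch of such a striped mirror ("RAID10-like") pool built from the current 8 disks; the pool name "tank" and the /dev/sdX names are only placeholders:
Code:
# assumption: 8 existing disks as 4 mirrored pairs -> 4 vdevs worth of IOPS
zpool create -f -o ashift=12 tank \
    mirror /dev/sda /dev/sdb \
    mirror /dev/sdc /dev/sdd \
    mirror /dev/sde /dev/sdf \
    mirror /dev/sdg /dev/sdh
Further mirror pairs can be appended later with "zpool add tank mirror <disk1> <disk2>" as the new disks arrive.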
 
@Dunuin Tks, but SSDs are way more expensive and offer at most 1/4 of the space (4TB vs 16TB). A striped mirror of HDDs gives half the space.
@Richard Nearly the same issue as above: I need space (just one of the backups is 32TB! and requires more than 2 days to be verified). I can lose 4 disks out of 24 to have some redundancy, but 9 is "a bit" too many!
Does the ZFS version included in PBS support tiering? If so, I could add a couple of M.2 cards to use as 'cache'... but how do I configure it?
Tks!
 
See the manual: https://pbs.proxmox.com/docs/sysadmin.html#zfs-administration
Especially:
If you use a dedicated cache and/or log disk, you should use an enterprise class SSD (for example, Intel SSD DC S3700 Series). This can increase the overall performance significantly.

Create a new pool with cache (L2ARC)

It is possible to use a dedicated cache drive partition to increase the performance (use SSD).
For <device>, you can use multiple devices, as is shown in "Create a new pool with RAID*".
# zpool create -f -o ashift=12 <pool> <device> cache <cache_device>

Create a new pool with log (ZIL)

It is possible to use a dedicated drive partition as log device to increase the performance (use an SSD).
For <device>, you can use multiple devices, as is shown in "Create a new pool with RAID*".
# zpool create -f -o ashift=12 <pool> <device> log <log_device>

Add cache and log to an existing pool

You can add cache and log devices to a pool after its creation. In this example, we will use a single drive for both cache and log. First, you need to create 2 partitions on the SSD with parted or gdisk
Important
Always use GPT partition tables.
The maximum size of a log device should be about half the size of physical memory, so this is usually quite small. The rest of the SSD can be used as cache.
# zpool add -f <pool> log <device-part1> cache <device-part2>
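
The manual mentions creating the two partitions with parted or gdisk but does not show that step. A minimal sketch, assuming the SSD appears as /dev/nvme0n1, the host has 32GB of RAM (so a ~16GB log partition, the remainder as cache), and "tank" stands in for the pool name:
Code:
# assumptions: SSD is /dev/nvme0n1; log partition sized at about half of physical RAM
sgdisk --zap-all /dev/nvme0n1          # fresh GPT label (always use GPT, see above)
sgdisk -n1:0:+16G /dev/nvme0n1         # partition 1: log (SLOG)
sgdisk -n2:0:0 /dev/nvme0n1            # partition 2: cache (L2ARC), remaining space
zpool add -f tank log /dev/nvme0n1p1 cache /dev/nvme0n1p2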

ZFS special device

Since version 0.8.0, ZFS supports special devices. A special device in a pool is used to store metadata, deduplication tables, and optionally small file blocks.

A special device can improve the speed of a pool consisting of slow spinning hard disks with a lot of metadata changes. For example, workloads that involve creating, updating or deleting a large number of files will benefit from the presence of a special device. ZFS datasets can also be configured to store small files on the special device, which can further improve the performance. Use fast SSDs for the special device.

Important
The redundancy of the special device should match the one of the pool, since the special device is a point of failure for the entire pool.
Warning
Adding a special device to a pool cannot be undone!
To create a pool with special device and RAID-1:

# zpool create -f -o ashift=12 <pool> mirror <device1> <device2> special mirror <device3> <device4>

Adding a special device to an existing pool with RAID-1:

# zpool add <pool> special mirror <device1> <device2>

ZFS datasets expose the special_small_blocks=<size> property. size can be 0 to disable storing small file blocks on the special device, or a power of two in the range between 512B to 128K. After setting this property, new file blocks smaller than size will be allocated on the special device.

Important
If the value for special_small_blocks is greater than or equal to the recordsize (default 128K) of the dataset, all data will be written to the special device, so be careful!
Setting the special_small_blocks property on a pool will change the default value of that property for all child ZFS datasets (for example, all containers in the pool will opt in for small file blocks).

Opt in for all files smaller than 4K-blocks pool-wide:

# zfs set special_small_blocks=4K <pool>

Opt in for small file blocks for a single dataset:

# zfs set special_small_blocks=4K <pool>/<filesystem>

Opt out from small file blocks for a single dataset:

# zfs set special_small_blocks=0 <pool>/<filesystem>
But caching won't help you much with verify tasks. If your single backup is 32TB, it won't be faster to verify unless you've got an L2ARC that can fit 32TB. It primarily helps with metadata, so the GC task will be way faster. You won't get anywhere close to overall SSD performance with L2ARC, SLOG or special metadata devices. If you want SSD performance, build an SSD-only pool.
 
Tks. The all-SSD pool is "impossible" (not enough space: the server only holds 24 disks).

What about ditching ZFS completely and using MDRAID (the CPU has 64 cores, so the extra load should not be an issue) or HW RAID (changing the controller)? Is verify a sequential read?

The big backup is 32TB, but it's not the only one (some VMs have up to 4TB disks). I'd "just" need to be able to verify it in 12h, which means reading at about 3TB/h over 20 disks, roughly 150GB/disk/h (~40MB/s per disk)... even an old IDE drive should manage that!
But currently it takes more than 2 days to verify 32TB over 5+3 disks, which means just 32*1024*1024/(5*48*3600) ~= 39MB/s/disk. What slows it down so much?
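
For reference, the same arithmetic as a quick shell check (figures taken from the post above, TiB vs. TB rounding ignored):
Code:
# target: 32 TiB verified in 12 h spread over 20 data disks
awk 'BEGIN { printf "%.1f MB/s per disk\n", 32*1024*1024 / (20*12*3600) }'   # ~38.8
# current: 32 TiB verified in 48 h over the 5 data disks of an 8-disk raidz3
awk 'BEGIN { printf "%.1f MB/s per disk\n", 32*1024*1024 / (5*48*3600) }'    # ~38.8
Both work out to the same per-disk rate, since 20 disks x 12 h covers the same disk-hours as 5 disks x 48 h; the question is whether that rate is achievable for random rather than sequential reads.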
 
Verification is not sequential, but basically random access (each chunk is < 4MB, usually in the 1-2MB range).
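
To get a feel for what a verify task has to read, you can count and size the chunk files in the datastore's chunk store (the datastore path below is just an example; PBS keeps the chunks in a hidden .chunks directory fanned out over many subdirectories):
Code:
# assumption: the datastore lives at /mnt/datastore/backup
find /mnt/datastore/backup/.chunks -type f | wc -l     # number of chunk files
du -sh /mnt/datastore/backup/.chunks                   # total size of the chunk store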
 
Yup, so think of verifying a 32TB virtual disk as randomly reading 16,000,000 x 2MB files. HDDs are just technically bad at small random IO. MDRAID or HW RAID won't help you either with randomly reading millions upon millions of small chunk files.
 