ZPool Reconfiguration

praeluceo
Feb 20, 2024
So currently I have a ZFS "RAIDZ10" array (striped raidz1 vdevs) that is just ridiculously complex to manage:
[screenshot: current pool layout]

Unfortunately, I nearly had 2 simultaneous disk failures in the same raidz1 vdev, which would have resulted in the loss of the entire 55TB array. So I backed up my files to a different system and am going to rebuild the array. On top of that, it had pretty poor performance:

[screenshot: pool performance results]

As I've studied this, I've found that I should have used 6 drives per vdev instead. I thought sets of 5 disks per vdev would be sufficient (apparently 5 disks is also space-inefficient for raidz1?), but I guess the whole pool takes the performance hit when you stripe across the vdevs, even though the striping itself is effectively a raid0.

I'm debating breaking up the pool into either two raidz2 pools with 8 disks each or three raidz1 pools with 6 disks each. I'm also going to take additional steps to optimize the layout this time (originally I made no real optimizations), since these are mostly jpg/raw photos, videos, and backup files: ashift=12, recordsize=1M, atime=off, and xattr=sa.
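Roughly what I have in mind for one of the 6-disk raidz1 pools, if it helps to see it spelled out (the pool name "tank" and the sdX device names are just placeholders, and I'd use /dev/disk/by-id paths in practice):

# hypothetical 6-disk raidz1 pool with the tuning mentioned above
zpool create -o ashift=12 \
  -O recordsize=1M -O atime=off -O xattr=sa \
  tank raidz1 sda sdb sdc sdd sde sdf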

Does anyone have any recommendations on how best to organize the pool? They're 5TB spinning rust drives, and I have a separate pair of SSDs for ZIL and L2ARC. I don't plan on retaining the hotspare in this setup, since the loss of one array won't result in the loss of all my data, and the scare made me get a USB backup drive to ensure my critical data doesn't die. I'm also going to set up smartd email notifications so I'm notified of failures sooner, and resilvering a 6-8 disk array will be substantially faster than resilvering a 15-disk array.
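For the smartd notifications I'm thinking of something like this in smartd.conf (the address is a placeholder and the config path can vary by distribution):

# /etc/smartd.conf: monitor all disks, mail on SMART problems, send a test mail on startup
DEVICESCAN -a -m admin@example.com -M test
# then restart the daemon, e.g.: systemctl restart smartd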

So yes, recommendations, suggestions, and thoughts would be appreciated!
 
IOPS performance scales with the number of vdevs, not the number of disks. So in your case you should get roughly 3x the IOPS of a single disk. If you want IOPS performance, use fewer disks per raidz1 vdev and more vdevs. Or, even better, use striped mirrors, which give you the most vdevs and skip all the parity overhead.
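For example, a striped mirror is just a list of mirror vdevs in a single pool (placeholder names):

# RAID10-style pool: three 2-disk mirror vdevs striped together
zpool create tank mirror sda sdb mirror sdc sdd mirror sde sdf
# 3 vdevs, so roughly 3x the IOPS of a single disk; add more mirrors for more IOPS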

In theory, odd disk counts should perform best with raidz1 and would also allow smaller block sizes without wasting additional space.

Does anyone have any recommendations on how best to organize the pool? They're 5TB spinning rust drives, and I have a separate pair of SSDs for ZIL and L2ARC.
I would skip the SLOG and L2ARC and use special devices instead. A SLOG is totally useless unless you run something like DBs that cause a lot of sync writes. L2ARC is rarely useful, and you sacrifice some of the much faster ARC for more of the much slower L2ARC.
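Adding a mirrored special vdev to an existing pool would roughly look like this (device names are placeholders):

# add a mirrored special vdev to hold the pool's metadata
zpool add tank special mirror /dev/sdx /dev/sdy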

I don't plan on retaining the hotspare in this setup, since the loss of one array won't result in the loss of all my data, and the scare made me get a USB backup drive to ensure my critical data doesn't die.
Better to realize that "RAID is not a backup" before the disaster happens than after it.
 
> L2ARC is rarely useful, and you sacrifice some of the much faster ARC for more of the much slower L2ARC.

Respectfully disagree. I've started using inexpensive PNY USB3 thumbdrives for L2ARC on limited-RAM systems (16GB or less), and they can make a difference with spinning-drive pools. Test with ' time find /zpoolname >/dev/null ', running it twice with an L2ARC device attached and the ZFS ARC limited to 1-1.5GB. The second run should be faster.

Same with ' time ls -lahR /zpoolname >/dev/null ' twice (or redirect to a file if you want a grep-able directory tree.)

You can see how much L2ARC is being used with ' zpool iostat -v '

Also - L2ARC survives a reboot these days -- ARC, not so much ;-) -- and a dead L2ARC device won't kill your pool. (I found out cache devices can't be mirrored, but you can add more of them.)
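Rough recipe for the experiment, if anyone wants to try it (pool name and device are placeholders):

# limit the ARC to ~1.5 GiB for the test (value in bytes, not persistent across reboots)
echo 1610612736 > /sys/module/zfs/parameters/zfs_arc_max
# add the thumbdrive as an L2ARC (cache) device; it can be removed again with 'zpool remove'
zpool add zpoolname cache /dev/sdz
# run twice: the first run warms the cache, the second should be faster
time find /zpoolname >/dev/null
time find /zpoolname >/dev/null
# check how much of the L2ARC device is in use
zpool iostat -v zpoolname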
 
Respectfully disagree.
I'm with @Dunuin on this, and we're all probably biased toward the systems we use. I tried L2ARC on different big pools, and the biggest speed impact for spinning-rust pools comes from special devices on SSDs. Even for your constructed example, the special devices would hold the metadata you're trying to benchmark and therefore be blazing fast. That applies to ANY metadata, not just metadata that has already been read and might therefore be in the L2ARC.
 
And buying more RAM for a bigger ARC will always be faster than buying more SSDs for more L2ARC. The only exception would be a case where you, for example, have a 200GB DB that is read all the time but your hardware platform only supports a maximum of 128GB RAM. In that case it would be better to have an L2ARC, since the whole DB simply can't fit in the ARC. But even then it would be better to buy the hardware with the intended workload in mind and get a server that can handle 256+GB of RAM in the first place.
 
I'm with @Dunuin on this, and we're all probably biased toward the systems we use. I tried L2ARC on different big pools, and the biggest speed impact for spinning-rust pools comes from special devices on SSDs. Even for your constructed example, the special devices would hold the metadata you're trying to benchmark and therefore be blazing fast. That applies to ANY metadata, not just metadata that has already been read and might therefore be in the L2ARC.
Fair point. But Special devices need to be mirrored, and if both die at the same time then your pool is dead. Most people would recommend Enterprise SSDs for that as well. L2ARC devices are practically disposable. Also keep in mind that Special allocation is per-dataset and has to be set with an additional command, whereas L2 just starts caching the whole pool.

If you have a low-RAM system that can't be upgraded - like a laptop, or in my case, an old 6-core Phenom II HP Pavilion and a 2011 iMac (with FireWire 800 and ~70MB/s disk speed limits + USB2) - you may see some cheap benefits just by adding an L2ARC device and doing some experiments. Very inexpensive investment for a homelab or portable ZFS.
 
Fair point. But Special devices need to be mirrored, and if both die at the same time then your pool is dead.
Then you use a 3-way mirror, so any two of them could fail without affecting the pool.

Also keep in mind that Special allocation is per-dataset and has to be set with an additional command, whereas L2 just starts caching the whole pool.
A special device will store ALL NEW METADATA for the whole pool until the special devices are full and spill over to the normal data vdevs. No command or additional config is required. I guess what you mean is the "special_small_blocks" option you can set on any dataset or zvol. But that is optional and only defines whether, and which, DATA blocks get stored on the special devices.
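For example (dataset name and threshold are just placeholders):

# metadata already goes to the special vdev automatically; this additionally
# stores data blocks of 64K or smaller from this dataset on the special vdev
zfs set special_small_blocks=64K tank/photos
# the default of 0 means "metadata only"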
 
And buying more RAM for a bigger ARC will always be faster than buying more SSDs for more L2ARC. The only exception would be a case where you, for example, have a 200GB DB that is read all the time but your hardware platform only supports a maximum of 128GB RAM. In that case it would be better to have an L2ARC, since the whole DB simply can't fit in the ARC. But even then it would be better to buy the hardware with the intended workload in mind and get a server that can handle 256+GB of RAM in the first place.
For database workloads, you normally don't cache data, just metadata (primarycache=metadata), because the database caches what it wants itself, and having the same data in the ARC would be counterproductive (double caching).
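For example (dataset name is a placeholder):

# let the database cache its own data; the ARC then only holds ZFS metadata
zfs set primarycache=metadata tank/db
# the equivalent property for L2ARC is secondarycache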
 
