Opinions | ZFS On Proxmox

velocity08
May 25, 2019
Hi Team

Have been reading over the docs as my first port of call, but haven't been able to find a definitive answer, so I thought I would crowd-source.

There are mixed opinions on here re: ZFS setup, so I'm looking for a little more clarity.

What is the best use of ZFS when you have 8 drive bays?
2 drives can be allocated to a mirrored ZIL & L2ARC.
If we are looking to use 1.2 TB 7.2k drives, would the best read/write config be 3 x mirrors, 6 drives in total?

OS/root and data to be spread across all 6 drives.

Have toyed with the possibility of RAIDZ or RAIDZ2, but I'm concerned that write IO could still be an issue even with a ZIL.

Thoughts and opinions welcome.

Cheers
G
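For reference, a layout like the one described (three two-way mirrors striped into one pool) could be created roughly as follows. This is a sketch only: the pool name and the device paths are hypothetical placeholders, not from this thread.

```shell
# Sketch only -- device names are hypothetical placeholders.
# ashift=12 aligns allocations to 4k physical sectors.
zpool create -o ashift=12 tank \
    mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
    mirror /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4 \
    mirror /dev/disk/by-id/ata-DISK5 /dev/disk/by-id/ata-DISK6

# Verify the resulting layout:
zpool status tank
```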
 
Hi,

Three striped mirrors are not so good at all. Think about the block size, which will be split across 3 vdevs. It is best to use a power of 2, so 2 striped raidz1 vdevs would be better.

Good luck / Bafta.
 
Regarding ZIL/ARC:

The ZIL can usually live on a single very fast drive (think fast SSDs like Intel Optane). It really does not need to be large; a few GB is usually plenty.

The L2ARC is the second-level cache, which resides on a disk that should be faster than the spinning rust -> an SSD. In most scenarios, though, it is better to increase the RAM so the first-level ARC has more space. The L2ARC itself needs some RAM to hold its index.

Another option to speed up the operation of the cluster is to use the new vdev class "special device"; see section 3.8.10: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_zfs
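The three vdev classes mentioned above attach to an existing pool roughly like this. Pool and device names are hypothetical placeholders; the commands are the standard OpenZFS `zpool add` forms.

```shell
# Separate ZIL (SLOG) -- mirrored, since losing a lone SLOG can lose
# in-flight synchronous writes:
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

# L2ARC cache device -- a single device is fine here; losing it only
# costs cache warmth, not data:
zpool add tank cache /dev/nvme2n1

# Special device for metadata (and optionally small blocks) -- this
# MUST be as redundant as the pool, because losing it loses the pool:
zpool add tank special mirror /dev/nvme3n1 /dev/nvme4n1
```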
 
Sorry, I may not have worded my info properly.

3 x mirrors (2 + 2 + 2);
that way we have the write ingest of 3 drives.
Hope that makes better sense.

ta
 
Good info.
Additional info:

256 GB RAM
2 x 480 GB enterprise SSDs with capacitors for power-loss protection.

Wanted to run with the mirrored SSDs partitioned for ZIL and L2ARC.

Looking at restricting ZFS's ARC to 64 GB.
And 3 pairs of mirrored drives for data, including root/data.

Hopefully the above makes more sense.

Cheers
G
 
Looking at restricting ZFS's ARC to 64 GB.
Get some monitoring up and running that also tracks the ARC size and hit rates. Only limit the ARC once you run into problems where not enough free RAM is available fast enough.
Until then I would let it be. Ideally, during normal operation the ARC hit rate will be close to 100%, meaning that almost all read operations can be satisfied from RAM.

In my personal production cluster I have a hit rate of >99.5% most of the time.
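The hit rate being discussed is simply hits / (hits + misses). On a live system the counters come from /proc/spl/kstat/zfs/arcstats (or the arcstat tool); the values below are made-up examples for illustration.

```shell
# ARC hit rate = hits / (hits + misses).
# On a live system, read the "hits" and "misses" counters from
# /proc/spl/kstat/zfs/arcstats; these values are examples only.
hits=995000
misses=5000
awk -v h="$hits" -v m="$misses" \
    'BEGIN { printf "ARC hit rate: %.1f%%\n", 100 * h / (h + m) }'
```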
 
If the ARC can potentially grow that large, what's left for the VMs?

They will both be competing for the same memory space.

Sorry, what you're saying is probably factually correct, but it doesn't make complete sense to me.

Any clarity would be appreciated.

ta
 
ZFS will release RAM if requested. Sometimes that might take too long for operations that need the memory, though, and that is when you should probably limit the RAM usage.
Until then I would suggest just letting it be and monitoring it to get an idea of how the system behaves.

The ARC hit rate is a measure of how many read operations can be fulfilled from RAM and thus don't need to be passed down to the slow disks.

It is also possible, depending on your use case, that the ARC size will never reach 64 GB anyway. But if you have a monitoring system set up you will be able to see that and then make informed decisions :)
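If the decision is later made to cap the ARC at 64 GB, the standard OpenZFS knob is the zfs_arc_max module parameter, which takes a value in bytes. A sketch, assuming the usual OpenZFS sysfs/modprobe paths (verify against your Proxmox version):

```shell
# 64 GiB expressed in bytes -- the unit zfs_arc_max expects:
echo $((64 * 1024 * 1024 * 1024))    # prints 68719476736

# Apply at runtime (root required; takes effect immediately):
#   echo 68719476736 > /sys/module/zfs/parameters/zfs_arc_max
# Persist across reboots:
#   echo "options zfs zfs_arc_max=68719476736" > /etc/modprobe.d/zfs.conf
#   (then refresh the initramfs, e.g. update-initramfs -u on Debian-based systems)
```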
 

Ok cool, that makes more sense.

I was thinking the SSD mirror could be split: 32 GB for the ZIL and the remaining 440 GB for L2ARC.
For a production environment I wouldn't want a single drive for the cache & ZIL only; if the drive dies we could end up with an offline host while we force a re-import of the zpool without the cache, or fit a cold spare SSD to get it back online.

We want to minimise potential downtime; with the mirror we can have a cold spare ready to swap in while the host stays online.

Also looking for some feedback on RAIDZ vs mirrors from real-world experience.

Thanks.
 
A failed ZIL, as well as a failed L2ARC, will cost you performance but should not take the host offline.
 
I was thinking the SSD mirror could be split: 32 GB for the ZIL and the remaining 440 GB for L2ARC.

As @aaron pointed out repeatedly, you need a monitoring solution to monitor the ARC and L2ARC. If you're lucky, your L2ARC will be useful; it probably won't be. You will only know after you've run with it at least until it is full.

3 x mirrors (2 + 2 + 2);
that way we have the write ingest of 3 drives.
Hope that makes better sense.

@guletz is still right: 3 is not divisible by 2, so you will lose performance there.

Depending on your circumstances, you may get a faster setup with 4x mirrored 10k drives without a ZIL and L2ARC. You will only know if you set it all up, run your usual workload and find out for yourself. ZFS's performance is hugely dependent on your workload.
 

@LnxBil have a few questions if i may.

1. On FreeNAS we pulled a ZIL drive intentionally for testing; this action dropped the pool and we needed to re-import it. Maybe the workflow we used was the problem (power off > remove drive > power on > pool gone); we will try again with the host powered on and running.

Have you tested pulling a ZIL or L2ARC drive while everything is running in Proxmox?

2. Can you please go into more detail on why running 3 x 2 mirrors and striping across all 3 will deliver bad performance?

From my understanding, striping across mirrors like a RAID 10 would deliver the best performance, as you are writing across 3 vdevs, delivering 3x the write capacity of a single drive; when reading, it's 6x the read throughput.

With a RAIDZ or Z2 we are only going to see the write speed of a single drive, and reads will be across all drives.

I've been reading this great article: https://calomel.org/zfs_raid_speed_capacity.html

They have done a fair bit of testing with all available RAID groups.

Maybe I'm not communicating correctly, or my understanding may be limited; would you mind going into more detail on why 3 x mirrors striped will deliver poor performance?

thank you.
G
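The throughput reasoning above can be written out as simple arithmetic. The per-drive figure is a hypothetical round number (~200 random IOPS for a 7.2k drive), used only to illustrate the scaling claim.

```shell
# Rough IOPS model for N striped two-way mirrors.
# per_drive_iops is an illustrative assumption, not a measured value.
per_drive_iops=200
vdevs=3
# Writes: each mirror vdev ingests at roughly one drive's speed.
echo "write IOPS ~ $((per_drive_iops * vdevs))"
# Reads: both sides of each mirror can serve reads independently.
echo "read IOPS ~ $((per_drive_iops * vdevs * 2))"
```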
 
2. Can you please go into more detail on why running 3 x 2 mirrors and striping across all 3 will deliver bad performance?

Let's say your VM uses the default 8k volblocksize.

3 striped mirrors => 8k/3 ≈ 2.66k, so each mirror would receive 2.66k; but if your pool has ashift=12, the block written to each disk will be 4k.

It is worse when you need to modify a block => read, modify, write (2 IOPS instead of 1), aka RMW. You will also waste storage capacity: for each 8k volblock you will write 3 x 4k = 12k (around 33% wasted space).

If you use 2 striped mirrors, each hdd gets exactly 4k: no RMW, no wasted storage space.


With a RAIDZ or Z2 we are only going to see the write speed of a single drive

Only IOPS, not write speed.

Good luck/ Bafta!
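The arithmetic in the post above can be checked directly: with an 8k volblocksize, an ashift=12 pool (4k minimum allocation per disk), and 3 data stripes, each stripe gets ~2.66k, which is rounded up to a whole 4k sector.

```shell
# Check the arithmetic: an 8k volblock striped across 3 mirror vdevs
# on an ashift=12 pool (4k minimum allocation per disk).
volblock=8192; vdevs=3; sector=4096
per_vdev_data=$(( volblock / vdevs ))                                    # 2730 bytes (~2.66k)
per_vdev_written=$(( (per_vdev_data + sector - 1) / sector * sector ))   # rounded up to 4096
total_written=$(( per_vdev_written * vdevs ))                            # 3 x 4k = 12288
awk -v t="$total_written" -v v="$volblock" \
    'BEGIN { printf "%dk written for %dk of data, %.0f%% wasted\n", t/1024, v/1024, 100*(t-v)/t }'
```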
 

Hi @guletz

Thanks for the additional info.

From what I've been reading, having compression turned on makes the above statement redundant, as compression will allocate whatever block size fits best, reducing the used space and requiring fewer platter rotations to reach the data.

From my research and the previously posted link, mirrors are the fastest ZFS option for both reads and writes and the best choice for virtual environments. Yes, you lose half the storage capacity, but if IOPS and MB/s are being chased, then striped mirrors are the way to go for spinning rust.

If you're using SSDs all round, then a RAIDZ or RAIDZ2 of SSDs, even though its write IOPS are still those of a single drive (copy-on-write), will give plenty of performance for many use cases. But for production shared-hosting environments with lots of random IO, having more vdevs in a pool (3 mirrored pairs = 3 vdevs) will deliver the best read and write performance, SSD or spinning disk, plus compression.

I've cross-referenced this information with the FreeNAS forums, Proxmox and other testing sites.

From my understanding, striping across all disks spreads the data across all drives, i.e. like RAID 10; happy to be corrected, but it will be a hard sell.

Cheers
G
 
Another thing to take into consideration when choosing between (striped) mirrors and (striped) raidz[x] is the overhead of calculating parity for raidz[x], especially when resilvering the pool. Parity calculation tends to require a CPU with higher clock speeds, since it is CPU-bound, while a mirror is more or less only IO-bound. So the performance of a mirror is a function of IO capability only, while the performance of a raidz[x] is a function of both IO capability and CPU clock speed. Resilvering is also an order of magnitude faster with mirrors than with raidz[x].
 
Ok, could you be so kind as to explain how it works, in your opinion?

I think you are confusing stripes of mirrored vdevs with raidz. raidz has overhead if the ratio of written block size to minimum block size is bad, since that means the overhead of parity writes is high. Non-raidz pool setups don't have parity blocks. In your case, the 8k data block can just go to any top-level vdev (and then be mirrored, if it is a mirror); the associated metadata can go to another vdev or the same one, it doesn't matter.

If every write were simply divided equally among the top-level vdevs, you could not have a stripe with differently sized or differently filled vdevs. It might look that way, since if you start a pool with equally sized top-level vdevs, the writes will be distributed to fill them up approximately evenly (because that is of course best for performance). That does not mean that every individual write is split up, which would be bad for performance ;)
 
Hi,

Thanks @fabian for your response, really appreciated!

Basically, my 8k example was comparing only the striped-mirror cases (2 striped mirrors versus 3 striped mirrors), not striped mirrors versus striped raidz. I can only tell you what I see with my own eyes, using

zpool iostat -v 2:

- if you have a striped pool and the component vdevs are not equal in size, then most of the time the write IO goes to the bigger vdev
- if your vdev members are the same size, most of the time (I would say > 80% in the worst cases) the write IO is spread equally across all vdevs


Good luck/Bafta !
 
