Opinions | ZFS On Proxmox

velocity08 · Apr 1, 2020

Hi Team

Have been reading over the docs as my 1st port of call but haven't been able to find a definitive answer so through i would crowd source.

Mixed opinions on here RE ZFS setup so im looking for a little more clarity.

Best use of ZFS with you have 8 drive bays
2 drives can be allocated to Mirror Zil & Arc
If we are looking to use 1.2 TB 7.2k drives would the best Read/ Write config be 3 x Mirrors total 6 drives.

OS/ Root and Data to be spread across all 6 drives.

Have toyed with the possibility of RaidZ or Z2 but concerned with the write IO even with a Zil it could still be an issue.

thoughts and opinions welcome.

""Cheers
G

guletz · Apr 1, 2020

Hi,

3 striped mirror are not so good at all. Think at block size that will be split to 3. Best is to use power of 2. So 2 stripped raidz1 will be better.

Good luck / Bafta.

aaron · Apr 1, 2020

Regarding ZIL/ARC:

The ZIL usually can be on a single very fast drive (think fast SSDs like Intel Optane). It really does not need to be large, usually a few GB is plenty.

The L2ARC is the second level cache which resides on a disk which should be faster than the spinning rust -> SSD. In most scenarios, it is better though to increase the RAM so the first level ARC has more space. The L2ARC does need some RAM itself to hold the index.

Another option to speed up the operation of the Cluster is to use the new vdev class of "special device", See 3.8.10: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_zfs

velocity08 · Apr 1, 2020

guletz said:
Hi,

3 striped mirror are not so good at all. Think at block size that will be split to 3. Best is to use power of 2. So 2 stripped raidz1 will be better.

Good luck / Bafta.

Sorry I may not have worded my info properly.

3 x mirrors (2 +2 + 2)
that way we have the write ingest of 3 drives.
Hope that makes better sense.

ta

velocity08 · Apr 1, 2020

aaron said:
Regarding ZIL/ARC:

The ZIL usually can be on a single very fast drive (think fast SSDs like Intel Optane). It really does not need to be large, usually a few GB is plenty.

The L2ARC is the second level cache which resides on a disk which should be faster than the spinning rust -> SSD. In most scenarios, it is better though to increase the RAM so the first level ARC has more space. The L2ARC does need some RAM itself to hold the index.

Another option to speed up the operation of the Cluster is to use the new vdev class of "special device", See 3.8.10: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_zfs

Good info.
Additional info:

256 GB Ram
2 x 480 GB enterprise SSD with capacitor for power outages.

wanted to run with mirrored ssd partitioned for zip and arc.

looking at restricting zfs ram to 64GB.
And 3 pairs of mirror drives for data including root/ data

hopefully the above makes more sense.

“”Cheers
G

aaron · Apr 1, 2020

velocity08 said:
looking at restricting zfs ram to 64GB.

Get some monitoring up and running that will also monitor the ARC size and hit rates. Once you run into problems of not enough free RAM, fast enough available, limit it.
Until then I would let it be. Ideally, during normal operation the ARC hit rate will be close to 100%. Meaning that almost all read operations can be satisfied from RAM.

In my personal productive cluster I have a hit rate of >99.5% most of the time.

velocity08 · Apr 1, 2020

aaron said:
Get some monitoring up and running that will also monitor the ARC size and hit rates. Once you run into problems of not enough free RAM, fast enough available, limit it.
Until then I would let it be. Ideally, during normal operation the ARC hit rate will be close to 100%. Meaning that almost all read operations can be satisfied from RAM.

In my personal productive cluster I have a hit rate of >99.5% most of the time.

If arc is going to potentially hit what’s left for VM’s?

they will both be competing for the same memory space.

sorry what your saying is probably factually correct but it doesn’t make complete sense to me.

any clarity would be appreciated.

ta

aaron · Apr 1, 2020

ZFS will release RAM if requested. Sometimes that might take too long though for some operations that need it and that is when you probably should limit the RAM usage.
Until then I would suggest to just let it be and monitor it to get an idea of how the system behaves.

The ARC hit rate is a measure of how many read operations can be fulfilled from the RAM and thus don't need to be passed down to the slow disks.

It is also possible, depending on your use case, that the ARC size will never reach 64GB anyway. But if you have a monitoring system set up you will be able to see that and then make informed decisions

velocity08 · Apr 1, 2020

aaron said:
ZFS will release RAM if requested. Sometimes that might take too long though for some operations that need it and that is when you probably should limit the RAM usage.
Until then I would suggest to just let it be and monitor it to get an idea of how the system behaves.

The ARC hit rate is a measure of how many read operations can be fulfilled from the RAM and thus don't need to be passed down to the slow disks.

It is also possible, depending on your use case, that the ARC size will never reach 64GB anyway. But if you have a monitoring system set up you will be able to see that and then make informed decisions

Ok cool that makes more sense.

I was thinking the ssd mirrors could be split 32 GB zil and the remaining 440 GB for arc cache.
for a production envoroment I wouldn’t want to have a single drive for cache & zil only, if the drive dies then we will end up with an offline host while we need to force reimport the zpool without the cache or get cold ssd drive to get it back on line.

wish to minimise potential downtime, with the mirror we can have a cold spare ready to replace while the host is online.

also looking for some feedback on RaidZ vs mirror from real world experiance.

thanks.

aaron · Apr 1, 2020

a failed ZIL as well as L2ARC will cost you performance but should not take the host offline

LnxBil · Apr 1, 2020

velocity08 said:
I was thinking the ssd mirrors could be split 32 GB zil and the remaining 440 GB for arc cache.

As @aaron pointed out repeatedly, you need a monitoring solution to monitor ARC and L2ARC. If you're lucky, your L2ARC is useful, it probably won't. You will only know after you're running with it at least until it is full.

velocity08 said:
3 x mirrors (2 +2 + 2)
that way we have the write ingest of 3 drives.
Hope that makes better sense.

@guletz is still right. 3 is not divisible by 2, so you will loose performance there.

Depending on your circumstances, you may have a faster setup with 4x mirrored 10k drives without ZIL and L2ARC. You will only know if you set it all up, run your usual stuff and find out for yourself. ZFS's performance is hugely depended on your workload.

velocity08 · Apr 1, 2020

LnxBil said:
As @aaron pointed out repeatedly, you need a monitoring solution to monitor ARC and L2ARC. If you're lucky, your L2ARC is useful, it probably won't. You will only know after you're running with it at least until it is full.

@guletz is still right. 3 is not divisible by 2, so you will loose performance there.

Depending on your circumstances, you may have a faster setup with 4x mirrored 10k drives without ZIL and L2ARC. You will only know if you set it all up, run your usual stuff and find out for yourself. ZFS's performance is hugely depended on your workload.

@LnxBil have a few questions if i may.

1. on FreeNass we pulled a Zil drive intentionally for testing this action dropped the pool and we needed to re import, maybe the workflow we used may have been different (Power off > remove drive > power on > pool gone) we will try again with power on and running

Have you tested pulling a Zil or Larc while everything is running in ProxMox?

2. Can you please go into more detail why running 3 x 2 mirrors and striping across all 3 will deliver bad performance?

From my understanding striping across mirrors like a Raid 10 would deliver the best performance as you are writing across 3 vDevs delivering 3 x the write capacity of a single drive, when reading it's 6 x the read throughput.

With a Raidz or Z2 we are only going to see the write speed of a single drive and reads will be across all drives.

I've been reading this great article https://calomel.org/zfs_raid_speed_capacity.html

they have done a bit of testing with all available Raid groups.

Maybe i'm not communicating correctly or my understanding may be limited, would you mind going into more detail why 3 x mirrors striped will deliver poor performance?

thank you.
G

guletz · Apr 2, 2020

velocity08 said:
2. Can you please go into more detail why running 3 x 2 mirrors and striping across all 3 will deliver bad performance?

Let say your VM use the default 8k volbloksize (vsz).

3 stripped => 8k/3=2.66 so on each hdd from a mirror will be 2.66, but if you have ashift pool 12, then the block on disk will be 4k

It will be worst when you will need to modify a block => read modify data write (2 iops, insted of 1), aka RMW. And you will also waste your storage capacity. For each 8k vsz you will write 3x4k = 12k(around 33% wasted space)

Of you using 2 stripped, on hdd you wil have exact 4k, and no RMW, no wasted space storage.

velocity08 said:
With a Raidz or Z2 we are only going to see the write speed of a single drive

Only iops, not write speed.

Good luck/ Bafta!

fabian · Apr 3, 2020

@guletz that is not how stripes work in ZFS..

velocity08 · Apr 3, 2020

guletz said:
Let say your VM use the default 8k volbloksize (vsz).

3 stripped => 8k/3=2.66 so on each hdd from a mirror will be 2.66, but if you have ashift pool 12, then the block on disk will be 4k

It will be worst when you will need to modify a block => read modify data write (2 iops, insted of 1), aka RMW. And you will also waste your storage capacity. For each 8k vsz you will write 3x4k = 12k(around 33% wasted space)

Of you using 2 stripped, on hdd you wil have exact 4k, and no RMW, no wasted space storage.

Only iops, not write speed.

Good luck/ Bafta!

Hi @guletz

thanks for the additional info.

From what ive been reading having compression turned on makes the above statement redundant as compression will allocate the block to what ever fits best reducing the used space and delivering less platter spins to get to the data.

From my research and the previous link posted mirrors are the fastest option for ZFS for writes and reads and best for virtual environments, yes you loose 1/2 the storage capacity but if IOP's and MB/s are being chased then striped mirrors are the way to go for spinning rust.

If your using SSD all round then the write of a RaidZ or RaidZ2 SSD still the same as a single drive for writes (Copy On Write) will still give plenty of performance for many use cases, for production shared hosting environments which have lots of random IO having more vDevs making up a pool (3 mirrored drives = 3 vDevs) will deliver the best performance SSD or Spinning disk for reads and writes + compression.

I've corss referenced this informaiton with FreeNas forums, ProxMox and other testing sites.

from my understanding striping across all disks spreads the data across all drives i.e. like Raid 10, happy to be corrected but it will be a hard sell.

""Cheers
G

guletz · Apr 3, 2020

fabian said:
@guletz that is not how stripes work in ZFS..

Ok, could you be kind and explain how it works in your opinion?

mir · Apr 3, 2020

Another thing to take into consideration when choosing between (stripped) mirrors or (stripped) raidz[x] is the overhead in calculating parity for raidz[x] and especially when doing resilvering of the pool. Calculating parity tend to require a CPU with higher clock speeds since parity calculation is CPU bound and mirror is more or less only IO bound so the performance of a mirror is a function of the IO capability only while performance of a raidz[x] is a function of IO capability and CPU clock speed. Resilvering is also an order of magnitude higher with mirrors compared to raidz[x]

fabian · Apr 3, 2020

guletz said:
Ok, could you be kind and explain how it works in your opinion?

I think you are confusing stripes of mirrored vdevs with raidz. raidz has a overhead if the ratio of written block size to minimum block size is bad, since it means the overhead for parity writes is high. non-raidz pool setups don't have parity blocks. in your case, the 8k data block can just go to any top level vdev (and then be mirrored, if it is a mirror), the associated metadata can go to another or the same, it doesn't matter.

if every write would just be divided equally among the top-level vdevs, you could not have a stripe with differently sized or filled vdevs. it might look like it, since if you start a pool with equally sized top-level vdevs, the writes will be distributed to fill them up approximately evenly (because that is of course best for performance). that does not mean that every individual write is split up, which would be bad for performance

guletz · Apr 4, 2020

Hi,

Thx. @fabian for your response, really apreciated!

Basicaly I was explain with 8k example only striped mirror case(2 striped mirror versus 3 striped mirror) and not striped mirror versus striped raidz. I can tell only what I see with my only eyes, using

zpool iostat -v 2:

- if you have a striped pool, athe the vdevs components are not equal as size, then most of the time the write IO are done on the bigger vdev
- if your vdev members have the same size, most of the time(I would say > 80% in the worth cases) the write IO are done equally on all vdevs

Good luck/Bafta !

guletz · Apr 4, 2020

mir said:
Calculating parity tend to require a CPU with higher clock speeds

Not at all. Even my 10 year old server do not have any problem with parity calculation(zfs ver 0.7).

Opinions | ZFS On Proxmox

Well-Known Member

Distinguished Member

Proxmox Staff Member

Well-Known Member

Well-Known Member

Proxmox Staff Member

Well-Known Member

Proxmox Staff Member

Well-Known Member

Proxmox Staff Member

Distinguished Member

Well-Known Member

Distinguished Member

Proxmox Staff Member

Well-Known Member

Distinguished Member

Famous Member

Proxmox Staff Member

Distinguished Member

Distinguished Member

We value your privacy