Turning on ZFS compression on pool

DynFi User · Jun 22, 2021

Just a quick question: my PBS is configured using ZFS, and compression has been left at the default, which is "on" and "local", which stands for "lz4".

Should this be left at the default "on" value?
1. Is there any benefit in using compression with PBS? Isn't PBS using its own compression, in which case ZFS compression would simply waste processor cycles for nothing?
If we have different datastores for different types of backup:
1. Does it have to be turned on for VM backups?
2. Does it have to be turned on for file backups?
3. If compression is worthwhile, which algorithm should we prefer (lz4 or the more efficient zstd)?

This might have quite a significant impact on processor usage, and using a newer algorithm such as zstd might also have a positive impact.
Hence my reason for asking these questions.

Dunuin · Jun 23, 2021

I also would like to know that. Right now I got lz4 enabled, atime disabled, deduplication disabled and sync=standard.

I just installed a PBS VM on my TrueNAS server, mounted a dataset on that TrueNAS server using NFS into the VM and added the PBS as a storage to my PVE.

The ZFS pool on TrueNAS is also encrypted, so I don't need to activate the PBS encryption. But is the communication between PVE and PBS encrypted if PBS encryption is disabled?

t.lamprecht · Jun 23, 2021

DynFi User said:
Shall this be left to the default "on" value ?

Is there any interest in using compression with PBS (= isn't PBS using its own compression - in which case ZFS compression will simply be lost processor cycles for nothing).

Yes, Proxmox Backup Server already uses zstd compression for the blocks, but ZFS compression has some heuristics to detect compressed streams, or streams which cannot really benefit from compression, early and avoid re-compression. As this heuristic is relatively cheap, the performance penalty is small, so in practice this won't matter too much.

DynFi User said:
If we have different datastore for different types of backup :

Does it have to be turned on for VM backups ?

Does it have to be turned on for file backups ?

IIRC, ZFS compression works on records (but tbh, I'm not 100% sure off the top of my head), which are somewhere between 4 KiB and 128 KiB on a default system (ashift 12 and default (max) record size). As dynamic chunks are 64 KiB to 4 MiB by default and fixed chunks are a static 4 MiB by default, they are similar enough that one may not want to invest a lot of time coming up with different fine-tuning parameters for each.

The answer, even if it won't make you all too happy, is: it just won't matter much. You can assume that ZFS-level compression won't gain much in terms of used space either way, so it's not worth enabling. But if you do enable it, the impact is pretty much negligible, so I'd not sweat it too much.
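
To illustrate why a second compression layer gains so little, here is a minimal Python sketch. It is only an analogy: it uses `zlib` from the standard library as a stand-in for both layers (PBS actually uses zstd and the pool here uses lz4), and the sample data is made up.

```python
import random
import zlib

# Build compressible sample data: pseudo-random text from a tiny vocabulary.
# (Hypothetical data, only meant to behave like "something that compresses well".)
rng = random.Random(0)
words = [b"backup", b"chunk", b"datastore", b"snapshot",
         b"index", b"verify", b"prune", b"restore"]
raw = b" ".join(rng.choice(words) for _ in range(50_000))

# First pass: stands in for the backup software compressing a chunk.
first_pass = zlib.compress(raw, 6)
# Second pass: stands in for the filesystem compressing the stored chunk again.
second_pass = zlib.compress(first_pass, 6)

print(f"raw:          {len(raw)} bytes")
print(f"compressed:   {len(first_pass)} bytes")
print(f"recompressed: {len(second_pass)} bytes")
```

The first pass shrinks the data considerably, while the second pass ends up at roughly the same size as its input, because already-compressed output has little redundancy left to remove.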

t.lamprecht · Jun 23, 2021

Dunuin said:
The ZFS pool on TrueNAS is also encrypted, so I don't need to activate the PBS encryption. But is the communication between PVE and PBS encrypted if PBS encryption is disabled?

Transport encryption and encryption at rest are two different things. The communication between the client and the server always goes through TLS. There is never an unencrypted communication channel for any backup or API data, otherwise one could snoop on the authentication etc.

Also note that while TrueNAS encryption can be a valid replacement for your use case, it is not equivalent to the Proxmox Backup Server one. The PBS one is client-side encryption, which means the server (and others with access to it) can never snoop on the actual backup data sent by the client. As a result, the server, or at least physical access to it, does not have to be 100% trusted.

As you probably enter the decryption password for the whole pool on boot on the TrueNAS, it also has good protection for when the server is powered off. But with PBS the server neither has nor needs unencrypted data access, even when online.

leesteken · Jun 23, 2021

I thought the PBS Datastore needed atime=on, or at least relatime=on (which also requires atime=on). Is this true or can it be turned off completely on ZFS?

t.lamprecht · Jun 23, 2021

Support for atime (access time) is relevant for Proxmox Backup Server functionality; one should use relatime to get some performance improvements.

Edited: confused mtime/atime importance previously

Dunuin · Jun 23, 2021

And what about the recordsize? Is it fine to use the default 128K, or would it be beneficial to use, for example, 1M or 32K if the underlying pool would allow that without too much padding overhead?

t.lamprecht · Jun 24, 2021

avw said:
I thought the PBS Datastore needed atime=on, or at least relatime=on (which also requires atime=on). Is this true or can it be turned off completely on ZFS?

A colleague asked me to recheck, and you're actually right.

It's the atime, and that needs to be at least relatime (on naturally works too). Sorry for the confusion, one should not post off the top of one's head after certain hours. I'm going to edit the above comment to keep people from coming to possibly bad conclusions from its wrong info.

We're using atime to actually benefit from the relatime performance optimization. We could use mtime too, and that would allow making the "trailing window" for GC smaller, but performance was deemed to be more important.

The above drawbacks from my other post may be true but should in practice not be a real issue: verify goes over existing chunks only anyway, and an admin-triggered find or grep may want to exclude the backup directory to drastically improve performance for their command too.

Anyhow, sorry for any potential confusion caused.

Tmanok · Feb 21, 2022

t.lamprecht said:
We're using atime to actually benefit from the relatime performance optimization. We could use mtime too, and that would allow making the "trailing window" for GC smaller, but performance was deemed to be more important.

@Dunuin I found it, the reason why GC is 24Hr minimum. Thanks Thomas!

On another note, more related to this thread and to clear my head after reading conflicting information about ZFS atime from another thread:

PVE VM Storage can safely be configured with relatime instead of atime using ZFS.
PBS Datastores need atime? Or can they also use relatime?

Thanks Again Thomas,

Tmanok

t.lamprecht · Feb 21, 2022

Tmanok said:
PBS Datastores need atime? Or can they also use relatime?

relatime is fine.

leesteken · Feb 21, 2022

Don't forget that a ZFS pool needs atime=on for relatime=on to work, in contrast to filesystems like ext4.

EDIT for clarification: I don't mean that atime appears on when you enable relatime on a ZFS pool. I mean that you need to explicitly turn on both atime and relatime for relatime to work. In other words: if atime is off then relatime is not working (even when relatime is set to on).
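
The interaction described above can be sketched as a small Python model. This is not ZFS code: the function names are hypothetical, and the relatime rule is the usual Linux one (refresh atime only if it is not newer than mtime/ctime, or if it is at least 24 hours old), which is assumed here to apply to ZFS as well.

```python
DAY = 24 * 60 * 60  # seconds

def relatime_wants_update(atime, mtime, ctime, now):
    """Usual relatime rule: refresh atime only if it is not newer than
    mtime or ctime, or if it is at least 24 hours old."""
    return atime <= mtime or atime <= ctime or now - atime >= DAY

def zfs_updates_atime(atime_on, relatime_on, atime, mtime, ctime, now):
    """Hypothetical model of the two dataset properties combined:
    atime=off disables updates no matter what relatime is set to."""
    if not atime_on:
        return False          # relatime=on has no effect without atime=on
    if not relatime_on:
        return True           # plain atime: update on every access
    return relatime_wants_update(atime, mtime, ctime, now)

# A chunk written at t=50, last read at t=100, read again a bit later:
print(zfs_updates_atime(True, True, 100, 50, 50, 200))        # recent atime, no update
print(zfs_updates_atime(True, True, 100, 50, 50, 100 + DAY))  # a day old, update
print(zfs_updates_atime(False, True, 100, 50, 50, 100 + DAY)) # atime=off, never
```

The 24-hour threshold in this rule is also what Tmanok refers to above as the reason for the 24-hour minimum in GC.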

Tmanok · Feb 23, 2022

avw said:
Don't forget that a ZFS pool needs atime=on for relatime=on to work, in contrast to filesystems like ext4.

Thank you, that's good to know for diagnostic purposes, I had overlooked that. Doesn't mean that relatime doesn't do "the same" (similar) thing to EXT4 just because atime appears to be "on" while it is using relatime of course.

t.lamprecht said:
relatime is fine.

Relatime it is, thank you Thomas.

Tmanok

gneto · Jul 5, 2022

Good morning! I couldn't work out which suggestion applies best to my case. I have 12 VMs that occupy 4TB of disk on 4 SSDs in ZFS without compression.
I want to replicate to another identical server. What is the best solution? It is in raidz/raidz1.
Should I put another SSD in each or use compression?
The VMs run PHP/MySQL and hold about 2TB of images and photos.
The backup server is separate.
Thanks

Dunuin · Jul 5, 2022

So you are talking about ZFS as a storage for VMs/LXCs using replication between two PVE nodes and not as a storage for a PBS datastore synced between two PBS servers?

A 4-disk raidz1 is bad for MySQL, as you would need to use a volblocksize of at least 32K (in case of ashift=12) to not lose a lot of capacity to padding overhead. And MySQL does sync writes with 16K blocks, and writing with a smaller blocksize onto a larger blocksize is always problematic.

gneto · Jul 5, 2022

Do you think I should use raid0 without compression and synchronize?
Or should I create this storage within the two servers? I already discarded Ceph because it is very complex.
Should the storage solution be a separate one?

Sorry for the beginner's doubt.
And Thanks

Dunuin · Jul 5, 2022

ZFS with replication is not a real shared storage; it's two local storages that get synced by replication. So you need similar ZFS pools with an identical name on both nodes of your cluster, and both nodes need 4 SSDs each. And I wouldn't use raid0: you would lose all the reliability and bit rot protection that way. And because of replication, if a file gets corrupted on one node, it will also be corrupted on the other node within a few minutes. If you care about your data and performance, then a striped mirror (raid10) would be a better choice. But then you might need to buy even more disks.

gneto · Jul 5, 2022

In this case, the best thing is to use the zfs z1 pool to maintain the integrity and have the backup well adjusted, is that right?
Thank you for your help

Do you do consulting to make a nice setup for me or recommend someone?

Dunuin · Jul 5, 2022

Raidz1 is still not a great option because you either:
1.) use the default 8K volblocksize, where you lose 50% of your raw capacity, even if you don't see it. It will show you everywhere that you have 75% of your raw capacity as usable space, but that is wrong, as everything you write to a virtual disk will consume 150% space. Let's say you have 4x 2TB SSDs in a raidz1 with ashift=12. When writing 4TB to zvols (your VMs' virtual disks) it will also write 2TB of parity data and 2TB of padding overhead. So only 50% of your disks are usable, and that is the same as using a striped mirror. And a striped mirror of 4 disks would be a better option, as you would get the same capacity but double the IOPS performance.
2.) use a 32K volblocksize. Then you could fit 6TB of data on that raidz1 pool because the padding overhead would be negligible, so you could store 6TB of data + 2TB of parity. But the downside would be that performance and SSD wear would be terrible for everything using a blocksize below 32K. So it's bad for running a MySQL DB.
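
The padding arithmetic behind these two options can be sketched in Python. This is a simplified model, not ZFS code: it assumes the commonly described raidz allocation rule (one parity sector per row of data sectors, with the total rounded up to a multiple of parity + 1), and the function name is hypothetical.

```python
import math

def raidz_sectors(block_bytes, disks, parity=1, ashift=12):
    """Return (data, parity, padding) sector counts for one block,
    under the simplified raidz allocation rule described above."""
    sector = 1 << ashift                      # ashift=12 means 4 KiB sectors
    data = math.ceil(block_bytes / sector)
    # One parity sector per row of up to (disks - parity) data sectors.
    par = parity * math.ceil(data / (disks - parity))
    total = data + par
    # Allocations are rounded up to a multiple of (parity + 1) sectors.
    padded = math.ceil(total / (parity + 1)) * (parity + 1)
    return data, par, padded - total

for kib in (8, 16, 32, 64):
    data, par, pad = raidz_sectors(kib * 1024, disks=4)
    efficiency = data / (data + par + pad)
    print(f"{kib:>2}K volblocksize: {data} data + {par} parity + {pad} padding "
          f"sectors, {efficiency:.0%} of raw space holds data")
```

For the 8K case this model gives 2 data sectors plus 1 parity and 1 padding sector, which matches the 4TB data + 2TB parity + 2TB padding split above. For 32K the model still shows some overhead beyond the ideal 75%, so the 6TB figure is best read as an approximate upper bound.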

And by the way, a ZFS pool should always have 20% free space or it will fragment faster and become slow. So even if you have 4TB of usable capacity, you shouldn't use more than 3.2TB. It would be best to also set a 90% quota so that you can't screw up your pool by accidentally writing it completely full, at which point it would become unrecoverable: ZFS is a copy-on-write filesystem, where free space is required to actually delete or edit data. It gets really bad when you fill it up completely, because you would need to delete stuff to recover from the read-only state, but nothing can be deleted because that would require free space to write new data first.

Now let's say you want to store 4TB of VMs, these should be part of an HA cluster that uses ZFS replication, and you already have 4x 2TB SSDs. Then I would put 6x 2TB SSDs in the first PVE node + 6x 2TB SSDs in the second PVE node and set up both nodes with a 6-disk striped mirror (raid10) with a 16K volblocksize, an ashift of 12 and the same pool name on both nodes. That way you would get 4.8TB of HA storage to store VMs and still get good performance, even when running MySQL DBs. So in total 12x 2TB to be able to store 4.8TB of virtual disks.
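
The capacity figures in this proposal can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch only: the function name is hypothetical, 2TB is treated as 2000GB, and filesystem metadata overhead is ignored.

```python
def striped_mirror_capacity_gb(disks, disk_gb, max_fill_percent=80):
    """Return (usable_gb, recommended_max_gb) for a striped mirror (raid10).

    Each 2-disk mirror contributes the capacity of one disk; the
    recommended maximum applies the 20%-free rule of thumb from above.
    """
    if disks < 2 or disks % 2:
        raise ValueError("a striped mirror needs an even number of disks")
    usable = disks // 2 * disk_gb
    return usable, usable * max_fill_percent // 100

usable, recommended = striped_mirror_capacity_gb(disks=6, disk_gb=2000)
print(f"per node: {usable} GB usable, keep data below {recommended} GB")
# per node: 6000 GB usable, keep data below 4800 GB
```

Since replication keeps a full copy on each node, the 4.8TB applies to the cluster as a whole, not per node added together.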

gneto · Jul 6, 2022

Friend, you helped me a lot. I was seeing on one server here that only 50% was left, as you said, and did not understand why.
On the other one I made a raid0 and even set up replication, but this congests the network quite a bit.
I appreciate the help, and congratulations on your knowledge!
I will try to apply what you said here.

gneto · Jul 6, 2022

Last question, just to settle a beginner's doubt.
For my use case, would lz4 compression be a better alternative for these 4 SSDs in raidz1, or do you have any other suggestions?
Or would setting up a TrueNAS and centralizing everything on it be a better alternative?
Thanks
