[SOLVED] Server config: ZFS + encryption + CPU load questions

Hello there, I'll be setting up a new Proxmox server in the near future.

Supermicro mainboard H11SSL-i
AMD EPYC 7282 (2.80 GHz, 16-core, 64 MB)
128 GB (4x 32 GB) ECC Reg DDR4 RAM 2 Rank
4x960 GB SATA III Intel SSD 3D-NAND TLC 2.5" (D3 S4510) => RaidZ1 vmdata
2x6 TB SATA III Western Digital Ultrastar 3.5" 7.2k (512e) => zfs mirror vzdumps and data storage

2x 240 GB enterprise SSD mirror for PVE (to be chosen)

The server will be hosting Windows AD, an ELK stack for NetFlow, an ELK stack for log aggregation (Winlogbeat etc.), possibly a Cisco FMC KVM (also "huge" database operations), plus tons of other small VMs that are normally not IO heavy.

I've been looking into OpenZFS basics and how to create encrypted datasets.
I set up a PVE host on a Ryzen 3700X (8 cores) with 32 GB RAM and a ton of consumer SSDs for learning purposes.

I created an encrypted dataset via:
zfs create -o encryption=on -o keyformat=passphrase -o reservation=none ssdpool/encrypt
and added it to storage.cfg.
The VM disks (zvols?) now appear as ssdpool/encrypt/****
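For reference, the storage.cfg entry ended up looking roughly like this (just a sketch of the zfspool plugin syntax; "encrypted-ssd" is only an example storage ID):

zfspool: encrypted-ssd
        pool ssdpool/encrypt
        content images,rootdir
        sparse 1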

I also followed the zfs-mount-generator manpage to unlock it via passphrase at boot.
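In case it helps anyone: the manual unlock boils down to something like the following; the generator only automates the equivalent at boot (a sketch, not the exact unit it generates):

zfs load-key ssdpool/encrypt    # prompts for the passphrase (keylocation=prompt)
zfs mount ssdpool/encrypt       # zvols below it only need the key loaded, not a mount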

I have some questions, since this will be a core machine that is soon needed in production, and it will be hard to change the storage config once it's up and running.

The 4x 1 TB enterprise SSDs will host VM data; the 2x 6 TB HDDs are meant to act as backup targets for vzdumps and (possibly) storage space for Windows shares/home folders etc.
Are there huge risks in running a ZFS mirror on those HDDs, which do not feature power loss protection?
Am I correct in understanding that in this kind of copy-on-write file system a power loss should not corrupt data?
There will be weekly backups to another location.

Performance:
When I read about ZFS lz4 compression (and before noticing that there is a compression=on default), I did some small performance tests
by using "Move Disk" on a VM disk in the Proxmox GUI, from one (encrypted) ZFS dataset to another, and by doing an FTP transfer to the Ubuntu VM on this disk.

While the FTP transfer created some noticeable CPU load, moving the disk pinned the 8-core CPU to 90%+ load on all 16 threads for some seconds.
The once-deemed-overkill 16c/32t processor now suddenly worries me a bit, imagining that a VM could max out 8 cores and cause disk usage at the same time (even if just for a minute).
Would you consider disabling compression when using ZFS encryption, or is ZFS "smart" enough not to suffocate my virtual machines' CPU/IO resources?
I'm afraid of killing/corrupting critical VMs because of a backup task or a disk move.
Encryption is critical, compression would just be a nice-to-have at the moment.

I've read that swap on ZFS can be problematic.
At least if the host system starts swapping while the swap is located on the ZFS pool, the swap itself needs RAM and computation ... crash.
Would you disable swap on the PM host? Should I disable swapping in VMs (where possible) and over-allocate more RAM instead?
How much RAM would you leave untouched for ZFS? The current plan is to not use more than ~80 GB for VMs and to leave more than enough memory free.
The plan might change, though. We needed more than 64 GB, so we went with 128 GB.

Without a UPS, I should use Default (no cache) for virtual disks, even if they are located on the RAIDZ1 SSDs with power loss protection, right? What about the SSD emulation option?

I've seen some discussions about blocksize, should I worry? It won't be an IO-heavy server outside of backup tasks.

Am I correct that an encrypted ZFS mirror for the host root partition is currently not supported? (I did that with a manual Debian config + LUKS back then for my other servers.)


Coming from an old HP DL380 G7 with HDDs, this server should have no problem handling our load, but I'm afraid of creating bottlenecks or IO issues, as this server will be business critical.

That's a lot of questions; apologies for the wall of text, and thanks for your input in advance.
 
Hey, some thoughts from my side. I wasn't able to give a confident answer to each question though.

4x960 GB SATA III Intel SSD 3D-NAND TLC 2.5" (D3 S4510) => RaidZ1 vmdata
If it's 4 disks, arrange them in a RAID 10 pattern -> 2 mirror VDEVs. Using any RAIDZ level for VMs might surprise you with how much space is lost to parity data. See this thread and @fabian's answer https://forum.proxmox.com/threads/zfs-counts-double-the-space.71536/#post-320919
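For illustration, creating such a striped-mirror ("RAID10") layout would look something like this (disk paths are placeholders, not a recommendation for specific devices or options):

zpool create vmdata \
    mirror /dev/disk/by-id/ata-SSD_1 /dev/disk/by-id/ata-SSD_2 \
    mirror /dev/disk/by-id/ata-SSD_3 /dev/disk/by-id/ata-SSD_4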

Am I correct in understanding that in this kind of copy-on-write file system a power loss should not corrupt data?
Pretty much yes; if you search around you will find this discussed and explained quite a bit on the internet. The worst case is that you might lose some data from the incomplete write, but the existing data should be okay.

Would you consider disabling compression when using ZFS encryption, or is ZFS "smart" enough not to suffocate my virtual machines' CPU/IO resources?
Did you test this with other load on the server, and did you actually see CPU starvation there? I don't have much experience myself in that regard, but without other things running that cause a base load, it is hard to tell whether the move disk command will take resources away from them or just use what is available.

Would you disable swap on the PM host?
The PVE installer does not set up any swap when ZFS is used as root FS.

What about the SSD emulation option?
This tells the guest OS that the disk is an SSD and supports TRIM. This is useful so that the automatic TRIMs inside the guest OS get passed through to the physical storage. Should you back up with the included VZDump tool, this will help you a lot: as VZDump backs up the full disk, it benefits from areas that are trimmed.
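As a concrete sketch for a hypothetical VM 100 (the volume name is just an example): discard=on is what actually passes the guest's TRIMs through to the storage, while ssd=1 only advertises the virtual disk as an SSD.

qm set 100 --scsi0 vmdata:vm-100-disk-0,discard=on,ssd=1
# then trim inside a Linux guest with, e.g.:
fstrim -av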
I've seen some discussions about blocksize, should I worry? It won't be an IO-heavy server outside of backup tasks.
If you avoid any RAID-Z vdevs you should be fine. Blocksize optimizations can help a lot though, depending on the DB used. But that then involves the storage settings and the FS inside the VM.
Am I correct that an encrypted ZFS mirror for the host root partition is currently not supported? (I did that with a manual Debian config + LUKS back then for my other servers.)
The PVE installer does not support this option right now. If you want to encrypt the OS itself too, you can install a regular Debian the way you prefer and then install PVE on top of it. See https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_install_proxmox_ve_on_debian
 
Would you consider disabling compression when using ZFS encryption, or is ZFS "smart" enough not to suffocate my virtual machines' CPU/IO resources?

Never disable compression. It is proven that you get better performance with compression enabled than without. The I/O stack compresses the data first and then encrypts it; the other way around would not make any sense. Compression is transparent (speaking of the default compression), so you will write faster, not slower. Encryption itself is, if run on any modern (less than 10 years old) server hardware, a no-brainer thanks to the CPU-internal AES-NI extension, which encrypts gigabytes per second per core.
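If you want to convince yourself on your own hardware, a quick sketch (ballpark numbers only, not a proper benchmark; dataset name taken from earlier in the thread):

grep -m1 -o aes /proc/cpuinfo           # confirm AES-NI is exposed by the CPU
openssl speed -evp aes-256-gcm          # raw userspace AES throughput per core
zfs get compression ssdpool/encrypt     # check the current dataset property
zfs set compression=lz4 ssdpool/encrypt # explicitly set lz4 if you want to be sure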
 
Thanks for all the replies!

If it's 4 disks, arrange them in a RAID 10 pattern -> 2 mirror VDEVs. Using any RAIDZ level for VMs might surprise you with how much space is lost to parity data. See this thread and @fabian's answer https://forum.proxmox.com/threads/zfs-counts-double-the-space.71536/#post-320919
This is a big one, thanks for the info! Will read into it.

Did you test this with other load on the server, and did you actually see CPU starvation there? I don't have much experience myself in that regard, but without other things running that cause a base load, it is hard to tell whether the move disk command will take resources away from them or just use what is available.

The PVE installer does not support this option right now. If you want to encrypt the OS itself too, you can install a regular Debian the way you prefer and then install PVE on top of it. See https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_install_proxmox_ve_on_debian
@1: I assume the move disk command is just very aggressive. I misconfigured my first PM server with RAID5 HDDs and heavy IO limitations, so I might be over-cautious here. I can imagine that IO waits + swapping are far more problematic than having the CPU at high load from time to time.
@2: I did do that back then with LUKS and a RAID controller; I will look into whether Debian can be set up with a ZFS mirror + encryption ^^

Never disable compression. It is proven that you get better performance with compression enabled than without. The I/O stack compresses the data first and then encrypts it; the other way around would not make any sense. Compression is transparent (speaking of the default compression), so you will write faster, not slower. Encryption itself is, if run on any modern (less than 10 years old) server hardware, a no-brainer thanks to the CPU-internal AES-NI extension, which encrypts gigabytes per second per core.
I thought so too, thanks for the confirmation! I know those basic encryption benchmarks of VeraCrypt or TrueCrypt that far exceed the IO capabilities even with the "basic" encryption settings.

As you know, any encrypted data is almost incompressible, so your best choice is to disable compression on any encrypted dataset or zvol.
I think it's always compression before encryption. If encrypted data were compressible, it would mean there are patterns to be found, and it would therefore not be well encrypted.


Thanks again for your inputs, will report back when it's up and running ^^
 
I've been reading into ashift and its purpose for HDDs, and I meant to set up a mirrored RAID to compare performance, but quickly learned that:

You can't really use dd if=/dev/zero with compression, since it will not actually write anything :D .
Also, repeated hdparm runs benefit greatly from the ZFS cache, so the output is not very useful for comparison.
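Something like fio should give a fairer picture, since by default it fills its write buffers with random (incompressible) data; a sketch, where the filename is just a placeholder for a dataset mountpoint:

fio --name=seqwrite --filename=/ssdpool/encrypt/fio.test --rw=write --bs=1M --size=4G --end_fsync=1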

@aaron, after 1-2 hours of Google research into raidz1 + vmdata vs raid10, I'm still not quite there.
Some articles/forum posts suggest that raidz1 only makes sense starting at 5 disks (maybe because of extra parity data beyond the 1 extra drive? Or is it the general % of drives lost to parity?).
Then again, there are discussions saying I should not use ashift=12, because after compression most blocks will be smaller...
Then there are Google Docs talking about just some small % of allocation overhead.
Also, there is often talk about raidz2, and I'm not sure if this "total write 8k+4k+4k = 16k" effect is also present when "only" using single parity.
In short: information overload.

I don't mean to be lazy, and I'm curious to understand the details about ZFS, ashift and all the ZFS RAID levels (since I will be using them often in the future), but may I ask:

Would raidz1 for vmdata just waste some percentage of the storage beyond the one parity SSD, or might it even end up similar to raid10, which is why you suggest going with the faster mirror right away?

I'd prefer having more than ~2 TB of storage, but I cannot yet estimate whether raidz1 for vmdata would give me much more space versus the speed benefits. (Not neglecting that the answer would totally change depending on what is stored.)

I planned for a scenario similar to RAID5 with ~3 TB of storage and 1 TB lost to parity, but then again I also didn't factor in ZFS compression, so 2 TB of storage might be OK.

I'm looking for something like:
"Don't worry, you might lose 20% extra storage space with raidz1, but it's not a misconfiguration; raid10 is just faster."
Not sure if that is close to reality :)

thanks for your help!
 
Also, there is often talk about raidz2, and I'm not sure if this "total write 8k+4k+4k = 16k" effect is also present when "only" using single parity.
In short: information overload.

In my understanding, it happens like this:

At the zvol level you have, let's say, the default 8k volblocksize. This block is split between the N total disks of the raidzX vdev minus P (the number of parity disks). A simple example for a raidz1 (3 HDDs): 4k of data on HDD 1, another 4k on HDD 2, and parity data on HDD 3. But you must take into account that compression can change this a lot. An 8k volblock can, after compression, result in let's say 6k of data. So, using the 3x HDD raidz1 example => 6k over 2 data HDDs + 3k of parity. But ... you can write a minimum of only 4k (ashift=12), so in the end you will write 3k + 1k of padding (= 4k) to each of your HDDs.
For these reasons, you can design your zpool taking these facts into account. As a rough idea: if you use a bigger volblocksize, the performance will be better (excluding the compression effect).

And to make it more complicated, in the case of SSD/NVMe ... in most cases the internal block size is 16k, so if that is your case, then instead of any 4k of data/parity you will write ... 16k -> RMW (read 16k, modify 4k in RAM and write the whole 16k block = 4k + the remaining 12k). This will wear out your SSD/NVMe much faster. And that is without taking compression into account.

Let's say you have managed to find the lucky volblocksize. You are still not done ... because any OS (Linux/Windows) will by default use 512 b at its FS level. In this case your OS will mostly write 512 b (1/8 * 4k), but for ZFS the minimum will be 8k (the default volblocksize). So ... for one 512 b write, ZFS will write 8k ... and so on across the whole NVMe/SSD, as I illustrated before. For this reason you will also need to tune the FS inside the guest OS (to use at least 8k, matching your volblocksize, for example).

But you are still not finished ... if you have several VMs, it is possible that they need different optimum volblocksizes (16k and 32k)!
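If you want to experiment with this per VM, note that volblocksize is fixed at zvol creation time, so it has to come either from the Proxmox storage definition (the blocksize option of a zfspool storage) or from creating the zvol by hand; a sketch with a made-up zvol name:

zfs create -V 32G -o volblocksize=16k vmdata/encrypted/vm-999-disk-0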

And I am only talking about raidzX here!!!!

Good luck / Bafta!
 
"Don't worry, you might lose 20% extra storage space in raidz1 but it's not a misconfiguration, just raid10 is faster"

Yes, for sure, raid10 is faster (better IOPS) than any raidzX.

... and you will lose at least 10% in the best case (3x HDD raidz1), or most likely much more!

Also note that a raid10 can be extended with new disks at any time, but this is not the case for any raidzX. So, looking to the future, use raid10, and for the same reason also use ashift=13.
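To illustrate the extension point: a mirror pool can later be grown simply by adding one more mirror vdev (disk IDs below are placeholders):

zpool add vmdata mirror /dev/disk/by-id/ata-NEW_SSD_1 /dev/disk/by-id/ata-NEW_SSD_2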

Good luck /Bafta!
 
It is proven that you get better performance with compression enabled than without.

Hi @LnxBil

Not for SSD/NVMe, which already compress data using their own firmware. At least that is what I have read in many places on the Internet (blogs, the zfsonlinux mailing list). I also have not read any opinion saying this is not true! Maybe you have more knowledge about this subject, or perhaps I did not understand the context correctly. Or simply I am wrong. We all learn new things, including from this forum (me included; I learn a lot from your posts, thanks for that, and thanks to many other forum members/developers)!

Good luck /Bafta!
 
Not for SSD/NVMe, which already compress data using their own firmware.

Yes, this can be true and introduces another layer of nightmare in the storage stack.

I concur that if you have a non-perfect setup (like you described above), you lose the "compression is always better" guarantee. Let's state it as: "compression is always better than or equal to not using compression in a best-case scenario, and it will not harm". If you use ashift=9 and the underlying storage is also 512n-based, you will always gain throughput and compressibility; otherwise it depends.
 
Hello,
reporting back and thinking of redoing the SSD mirror.

After a bit of research I created the "hddmirror" pool with ashift=12 and the 4x SSD "vmdata" pool with ashift=13:

zpool create -f -o ashift=13 vmdata mirror sdb sdc mirror sdd sde

I think the ashift=13 was a major rookie mistake; maybe I misunderstood and the suggestion by @guletz was only meant for HDDs?

Nevertheless, here is why I think I did it wrong:

Windows Server installation on a "scsi" disk, on datasets with compression=on (copy-paste of the relevant outputs):

hddmirror ashift 12 local
hddmirror/encrypted/vm-701-disk-0 compression on inherited from hddmirror
hddmirror/encrypted/vm-701-disk-0 compressratio 1.16x -
hddmirror/encrypted/vm-701-disk-0 20.9G 5.24T 20.9G -


vmdata ashift 13 local
vmdata/encrypted/vm-701-disk-0 compression on inherited from vmdata
vmdata/encrypted/vm-701-disk-0 compressratio 1.00x -
vmdata/encrypted/vm-701-disk-0 24.3G 1.61T 24.3G -

I only enabled compression on the dataset after creating the VM.
Therefore I moved the "HDD" via the PM GUI to a dataset with compression on and deleted the source.
Then I moved it back without deleting the source and was shocked that it takes 16% more space in the SSD ashift=13 pool.


Is this because of wrong ashift settings, or possibly because the dataset is not being compressed?
The dataset vmdata/encrypted/vm-701-disk-0 was deleted at some point, so when moving from "hddmirror" I assumed it would be compressed, but maybe there is some trickery and the old data is still present?


Any ideas? I'm still at the point where I could just redo the zpools with different ashifts.


Also a kind of obvious question:
this "zpool create -f -o ashift=13 vmdata mirror sdb sdc mirror sdd sde" will still use UUIDs or something of the partitions internally, meaning it doesn't matter if sde becomes sdd some day, right?

Thanks, appreciate your help!
 
Hi again,

It is very wise to test your config before you go into production with your data.

Yes, if the compression ratio is important to you, you can go with the default ashift=12. I would also do some other tests, like:

For both cases (ashift 12 and 13):

1. Copy some of your real data and see the time it takes to complete in each case.
2. Use the script from here https://askubuntu.com/questions/865792/how-can-i-monitor-the-tbw-on-my-samsung-ssd (method 2) and check how much data is written to the SSDs (less data is better, the SSD life will be longer); see the sketch below.
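A sketch of the smartctl part (attribute names differ between vendors, so treat the grep patterns and device names as examples only):

smartctl -A /dev/sda | grep -Ei 'total_lbas_written|host_writes'   # SATA SSDs
smartctl -A /dev/nvme0 | grep -i 'data units written'              # NVMe (1 unit = 512,000 bytes)

Run it before and after the copy and compare the difference between the ashift=12 and ashift=13 pools.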

After you finish, take all these results into account and use the OPTIMUM ashift for your case.

Good luck
 
Also a kind of obvious question:
this "zpool create -f -o ashift=13 vmdata mirror sdb sdc mirror sdd sde" will still use UUIDs or something of the partitions internally, meaning it doesn't matter if sde becomes sdd some day, right?


The best approach when you create a new pool is to use by-id (see the output of ls -l /dev/disk/by-id):


zpool create ..... mirror /dev/disk/by-id/ata-HGST_HDN724040ALE640_xxxxxxxxx /dev/disk/by-id/ata-HGST_HDN724040ALE640_yyyyyyyy

because xxxxxxxxx or yyyyyyyy is printed on the label of your HDD/SSD, so it will be easy to replace if that is ever needed!
 
The best approach when you create a new pool is to use by-id (see the output of ls -l /dev/disk/by-id):


zpool create ..... mirror /dev/disk/by-id/ata-HGST_HDN724040ALE640_xxxxxxxxx /dev/disk/by-id/ata-HGST_HDN724040ALE640_yyyyyyyy

because xxxxxxxxx or yyyyyyyy is printed on the label of your HDD/SSD, so it will be easy to replace if that is ever needed!
cool tip, will do!

@testing: real-data copy times seem hard to measure, since with those fast SSDs there will be a lot of other bottlenecks, I suppose, but copying + comparing the smartctl written-data counters seems cool, I will do that.
thx for the tips!

Edit: on different SSDs (Micron 5300) on the root pool with ashift=12 the size is also ~21 GB.
 
Yup! With ashift=12 I get better compression ratios and smaller disk usage:

Also a comparison between the unencrypted "rpool" and the encrypted "vmdata"; not sure if the size difference can be attributed to encryption or to the different SSD models:

rpool/data/vm-222-disk-0 2.05G 369G 2.05G -
rpool/data/vm-308-disk-0 12.4G 369G 12.4G -
rpool/data/vm-701-disk-0 20.7G 369G 20.7G -

vmdata/encrypted 56.6G 1.62T 192K /vmdata/encrypted
vmdata/encrypted/vm-222-disk-0 2.07G 1.62T 2.07G -
vmdata/encrypted/vm-308-disk-0 12.7G 1.62T 12.7G -
vmdata/encrypted/vm-701-disk-0 20.9G 1.62T 20.9G -

+ sweet compression ratios, especially on the 308 VM, the NetFlow ELK machine.

vmdata/encrypted/vm-222-disk-0 compressratio 1.27x -
vmdata/encrypted/vm-308-disk-0 compressratio 1.61x -
vmdata/encrypted/vm-701-disk-0 compressratio 1.16x -

I would be curious whether the Elasticsearch shards compress well.
In my understanding they are designed for fast queries, not efficient storage.


I'm happy with this!
 
Yup! With ashift=12 I get better compression ratios and smaller disk usage:

That is easy to explain: if you use the default 8K volblocksize, ashift=13 also means 8K per block stored on ZFS. Any block written to this pool cannot be compressed into anything smaller, because you will always allocate a single 8K block. With ashift=12, if you only write a bit into an 8K volblock, that block can be compressed from 8K down to 4K, so you actually save space.
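If you want to double-check this on the pools from this thread, the relevant properties are quick to query:

zpool get ashift vmdata hddmirror
zfs get volblocksize,compression,compressratio,used vmdata/encrypted/vm-701-disk-0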

this "zpool create -f -o ashift=13 vmdata mirror sdb sdc mirror sdd sde" will still use UUIDs or something of the partitions internally, meaning it doesn't matter if sde becomes sdd some day, right?

ZFS detects its disks automatically, but the suggestion from @guletz is spot on. Identifying your disks is crucial if you have many. We have a lot of external shelves and use udev to rename the disks according to their physical placement. We also have the serials written on the drive caddies for identification.
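A minimal sketch of such a udev rule, keyed on the disk serial (the serial and symlink name are placeholders, not our actual naming scheme):

# /etc/udev/rules.d/99-disk-locations.rules
KERNEL=="sd?", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="XXXXXXXX", SYMLINK+="shelf1-bay3"

After a udevadm trigger, the disk is then also reachable as /dev/shelf1-bay3.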
 
