Proxmox problem with memory limits in ARC (ZFS)

RobertWojtowicz

New Member
May 13, 2024
Hi,

1. I am using a new installation of Proxmox 8.2.2. As described in the wiki:
ZFS uses 50% of the host memory for the Adaptive Replacement Cache (ARC) by default. For new installations starting with Proxmox VE 8.1, the ARC usage limit will be set to 10% of the installed physical memory, clamped to a maximum of 16 GiB. This value is written to /etc/modprobe.d/zfs.conf.
These entries are of course not there; you really need to configure it yourself (I already have 74/128 GB of RAM eaten up).
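For reference, whether any limit is currently applied can be checked directly; a minimal check, assuming the standard ZFS utilities shipped with Proxmox (a value of 0 means no explicit limit, i.e. the 50% default):
Code:
# current ARC maximum in bytes; 0 means no explicit limit is set
cat /sys/module/zfs/parameters/zfs_arc_max
# summary of current ARC usage (arc_summary comes with zfsutils-linux)
arc_summary | head -n 25
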
What would be the optimal limits for a ZFS RAIDZ2 configuration with 8x 22 TB drives (~125 TB usable), default compression (LZ4), and 16k block size?

2. Another thing is how the space is calculated with RAIDZ2. With 2 disks going to parity there should be 132 TB / 120.05 TiB of space, while the space shown in Proxmox (GUI/SSH) is 124.98 TB / 114 TiB. Can I fill the entire logical pool space shown in the GUI/SSH with data, or do I still need to keep about 20% free space in reserve?


BR,
Robert
 
These entries are of course not there; you really need to configure it yourself (I already have 74 GB of RAM eaten up).
Could you please explain how you installed your PVE instance? This limit should get set if you selected ZFS in the installer. Otherwise, no limit will be set.

What would be the optimal limits for a ZFS RAIDZ2 configuration with 8x 22 TB drives (~125 TB usable), default compression (LZ4), and 16k block size?
We usually recommend a base of 2 GiB plus 1 GiB per TiB of storage. So the ARC size in your case would be about 116 GiB. However, that is almost all the memory you have available. You may want to limit it to 64 GiB, depending on your use case. Performance might not be quite what you would otherwise expect, but everything should still work.
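As a quick worked example of that rule of thumb, using roughly 114 TiB of usable capacity from this thread (the numbers are only illustrative):
Code:
# rule of thumb: 2 GiB base + 1 GiB per TiB of usable storage
echo $((2 + 114))   # = 116 GiB recommended ARC; capped at e.g. 64 GiB given 128 GiB total RAM
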

With 2 disks going to parity there should be 132 TB / 120.05 TiB of space, while the space shown in Proxmox (GUI/SSH) is 124.98 TB / 114 TiB.
There is always some extra space taken up by metadata and things like slop space, so you never get the full capacity you might expect. For best performance you should still keep about 20% of the pool free.
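A rough sketch of where the difference comes from and how to compare the two views (replace <pool> with the actual pool name):
Code:
# raw RAIDZ2 data capacity: (8 - 2 parity) * 22 TB = 132 TB ≈ 120 TiB
# metadata reservations and slop space reduce what is actually usable
zpool list <pool>   # raw size, counting all disks including parity
zfs list <pool>     # usable space after parity and reservations
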
 
It was a new installation without ZFS; I only added the disks and created the ZFS pool afterwards.
I was thinking of setting it to 32 GB (there will also be virtual machines); the pool is mainly meant to hold an archive accessed in read mode.

Does it make much difference whether the virtual machine has, for example, six disks (raw files) of 20 TB each, or one large 120 TB disk?


BR,
Robert
 
120 TB is a lot; I don't know of any downsides, but personally I wouldn't do it.
Don't get me wrong, it will likely be just fine.

However, I would prefer using an LXC container if possible and mounting the storage directly into the container (primarily to avoid using zvols).
Otherwise, 20 TB is a lot as well, so I don't think there is much difference between one 120 TB disk and 6x 20 TB disks.
One 120 TB disk should actually have benefits, because you wouldn't need a RAID inside your VM, which adds overhead.

Cheers
 
I want to use ZFS for compression and for replication of a virtual machine with these disks to a second server (a mirror of two RAIDZ2 pools).

I want to create a file archive accessible from Nextcloud (a virtual machine), with a replica on a second server (two-disk and server-level fault tolerance).

Do you see any big problems with such a configuration?

BR,
Robert
 
Then you have no other way than using a VM.

But replication is not live. What I mean is: if one server goes down and the VM gets started on the other one, you lose up to 2 hours of data if you set it to sync every 2 hours, for example.
Just as a side note.
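For reference, a minimal sketch of how such a replication job is set up with the pvesr CLI; the job ID 100-0, the target node name pve2, and the 15-minute schedule are only placeholders:
Code:
# list existing replication jobs and their last sync status
pvesr status
# replicate VM 100 to node pve2 every 15 minutes (ID and node name are examples)
pvesr create-local-job 100-0 pve2 --schedule '*/15'
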
 
Thanks for the information. How much does performance drop if we reduce the amount of RAM from the recommended amount?

E.g. in this case:
116 GB vs. 64 GB
116 GB vs. 32 GB

So hypothetically, what should I expect?
What is the largest RAIDZ2 pool (ZFS pool) you have dealt with?

Or maybe it is better to split the storage into several smaller ZFS pools instead of one big one?

BR,
Robert
 
You wrote you'll be using 16k blocksize; do you really mean volblocksize?
I'm not sure if it has downsides for VMs (I don't think so), but it should help with the space needed for metadata.
Same for recordsize: usually the larger the recordsize, the less metadata you need. But 128k (the default) is usually a good middle ground.

However, with the limited RAM you have, I would make a dataset just for the database with 64k recordsize and set everything else to 1M recordsize.
Sure, you'll need a separate LXC container or to additionally mount a disk from that dataset in your Nextcloud VM, but that should give you three benefits: more speed for the database (64k), more speed for the Nextcloud data (1M), and less metadata.
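A minimal sketch of that dataset layout; the pool name tank and the dataset names are placeholders, not taken from this thread:
Code:
# smaller records for the database, larger records for bulk Nextcloud data
zfs create -o recordsize=64k tank/nextcloud-db
zfs create -o recordsize=1M  tank/nextcloud-data
# verify the resulting properties
zfs get recordsize,compression tank/nextcloud-db tank/nextcloud-data
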

You could go totally insane and set primarycache=metadata; that saves you a huge amount of memory (almost all of it), but with a relatively big performance degradation on spinning disks.
But that depends highly on the use case; for Nextcloud data only, it's probably worth it to make a dataset with primarycache=metadata.

Side note: you can set primarycache for the disk itself, for example:
zfs set primarycache=metadata STORAGE-POOL/vm-138-disk-1
and probably recordsize too, though I'm not completely sure whether zvols honor recordsize (that would need to be checked; zvols use volblocksize instead, which is fixed at creation).
What I want to say is simply that you don't necessarily need extra datasets; zvols (basically the VM disks directly) support most of these properties as well.
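For reference, the properties a zvol actually carries can be checked directly, note that zvols expose volblocksize rather than recordsize (using the disk name from the example above):
Code:
# zvols have a fixed volblocksize instead of recordsize
zfs get volblocksize,primarycache,compression STORAGE-POOL/vm-138-disk-1
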

To reduce the space used by the ARC:
/etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=68719476736
options zfs zfs_arc_min=17179869184

That's a minimum of 16 GB and a maximum of 64 GB, so you basically let ZFS decide the ARC size adaptively within that range.
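For reference, those numbers are just GiB converted to bytes, and the new limit can be applied without waiting for a reboot; a sketch assuming a default Proxmox/ZFS setup:
Code:
echo $((64 * 1024**3))   # 68719476736 = 64 GiB for zfs_arc_max
echo $((16 * 1024**3))   # 17179869184 = 16 GiB for zfs_arc_min
# apply the new maximum at runtime
echo 68719476736 > /sys/module/zfs/parameters/zfs_arc_max
# make the modprobe options stick across reboots (needed when root is on ZFS)
update-initramfs -u -k all
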

Several smaller pools instead of one big one has only downsides in my opinion. A big pool is usually faster and more convenient to handle.
But you could, for example, do a stripe of two RAIDZ1 vdevs for more speed (with the risk that two disks dying in the same group kills the pool).
 
Thank you very much for the comprehensive explanation.

EDIT 16.05.2024 22:00
I finally read up and checked:
https://openzfs.github.io/openzfs-docs/Performance and Tuning/Workload Tuning.html#zvol-volblocksize
https://ibug.io/blog/2023/10/zfs-block-size
https://forum.proxmox.com/threads/zfs-is-block-size-actually-record-size.55897

I have the default values set:
recordsize = 128k
volblocksize = block size in GUI = 16k
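Those values can be confirmed per dataset and per VM disk, e.g. (pool and disk names here are only placeholders):
Code:
zfs get recordsize tank
zfs get volblocksize tank/vm-100-disk-0
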

If I were to simplify it to a classification of disk pools by purpose:
- The first for virtual systems (2 TB NVMe disk, currently without a mirror, the target is a mirrored setup)
- The second for archive data: a raw virtual disk attached to the VM as its second disk (8x 22 TB SATA DC drives, RAIDZ2)
What recordsize/volblocksize would you suggest?

I will change the parameters as you wrote: min 16 GB, max 64 GB.

BR,
Robert
 
Thanks for the information. How much does performance drop if we reduce the amount of RAM from the recommended amount?

E.g. in this case:
116 GB vs. 64 GB
116 GB vs. 32 GB

So hypothetically, what should I expect?
What is the largest RAIDZ2 pool (ZFS pool) you have dealt with?

Or maybe it is better to split the storage into several smaller ZFS pools instead of one big one?

BR,
Robert
Code:
zpool status
  pool: HDD_Z2
 state: ONLINE
  scan: scrub repaired 0B in 06:17:32 with 0 errors on Sun May 12 06:41:33 2024
config:

    NAME                                                    STATE     READ WRITE CKSUM
    HDD_Z2                                                  ONLINE       0     0     0
      raidz2-0                                              ONLINE       0     0     0
        ata-WDC_WUH722020BLE6L4_8LGxxx                    ONLINE       0     0     0
        ata-WDC_WUH722020BLE6L4_8LGxxx                    ONLINE       0     0     0
        ata-WDC_WUH722020BLE6L4_8LGxxx                    ONLINE       0     0     0
        ata-WDC_WUH722020BLE6L4_8LGxxx                    ONLINE       0     0     0
        ata-WDC_WUH722020BLE6L4_8LGxxx                    ONLINE       0     0     0
        ata-WDC_WUH722020BLE6L4_8LGxxx                    ONLINE       0     0     0
        ata-WDC_WUH722020BLE6L4_8LGxxx                    ONLINE       0     0     0
        ata-WDC_WUH722020BLE6L4_8LGxxx                    ONLINE       0     0     0
    special
      mirror-1                                              ONLINE       0     0     0
        nvme-eui.002538483xxx-part5                     ONLINE       0     0     0
        nvme-eui.002538443xxx-part5                     ONLINE       0     0     0
      mirror-2                                              ONLINE       0     0     0
        nvme-eui.00253844314xxx-part5                     ONLINE       0     0     0
        nvme-Samsung_SSD_990_PRO_2TB_S7DNNxxx-part5  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:03:47 with 0 errors on Sun May 12 00:27:49 2024
config:

    NAME                                                    STATE     READ WRITE CKSUM
    rpool                                                   ONLINE       0     0     0
      mirror-0                                              ONLINE       0     0     0
        nvme-eui.0025384831xxx-part3                     ONLINE       0     0     0
        nvme-eui.0025384431xxx-part3                     ONLINE       0     0     0
      mirror-1                                              ONLINE       0     0     0
        nvme-eui.002538443xxx-part3                     ONLINE       0     0     0
        nvme-Samsung_SSD_990_PRO_2TB_S7DNNxxx-part3  ONLINE       0     0     0

errors: No known data errors
Code:
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
HDD_Z2   146T  28.1T   118T        -         -     0%    19%  1.19x    ONLINE  -

Very similar to your pool, almost identical; it's a home-server pool for media and Samba shares,
with the special vdev and special_small_blocks etc., to speed up those slow spinning drives for Samba searches from Windows.

I have a lot of other pools at the company, but there I never needed such a big pool; the backup server has a 70 TB pool in RAID10, also with a special vdev.
That 120 TB home pool, I honestly don't know why I made it; just for the sake of it, I guess. I can't get it full and there is already a ton of media on it xD

Cheers
 
@Ramalama

If I add primarycache on SSD disks (a mirror of two disks, or is a single one enough?), what happens when the primarycache fails?

BR,
Robert
 
