ARC size suggestions

wbarnard81

Member
Sep 2, 2022
Hi All,

I am planning to set up an HA Proxmox cluster and I am wondering what the best ARC size setting would be for my setup.

Hardware I am using:
AMD EPYC 7702P 64-Core Processor
512GB DDR4 3200MHz
2x Micron 7300 480GB (RAID1 ZFS for boot)
6x Kioxia KCD6XLUL960G 960GB NVMe
Mellanox ConnectX-4 (25Gbps) NICs

The idea is to run 6x Ubuntu 22.04 LTS VMs per server (32 vCores and 64GB RAM each) with as much disk space as I can get.

I have already set up one server with Proxmox and configured a RAID10 ZFS pool, and all is great, but RAM usage is at 92% most of the time and I saw that ZFS is taking half of the system memory.
Code:
ARC size (current):                                    99.6 %  250.7 GiB
        Target size (adaptive):                       100.0 %  251.8 GiB
        Min size (hard limit):                          6.2 %   15.7 GiB
        Max size (high water):                           16:1  251.8 GiB
        Most Frequently Used (MFU) cache size:         15.9 %   37.7 GiB
        Most Recently Used (MRU) cache size:           84.1 %  199.0 GiB
        Metadata cache size (hard limit):              75.0 %  188.8 GiB
        Metadata cache size (current):                  8.4 %   15.9 GiB
        Dnode cache size (hard limit):                 10.0 %   18.9 GiB
        Dnode cache size (current):                     0.1 %   27.7 MiB

I have read that the ARC will free RAM for VMs, but I have had VMs crash or run very slowly on this server, and I actually had to move VMs away from it just to get the others back into a working state. Since these setups are for production use, I cannot struggle with this.

So the question then is: would it be an issue if I set zfs_arc_max to, let's say, 64GB? That should leave enough room for the server as well. Also, is this value set in bits? If so, would that number then be 549755813888?

I am open to other suggestions as well.
 
The value is in bytes (use 64*1024^3=68719476736 for 64GB). I would set it as low as it will go without noticeably impacting performance. I don't really notice a difference between 5% and 10% of total memory. It really depends on the workload generated by your VMs.
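For reference, a minimal sketch of what that looks like as a module option (the 64 GiB figure is just the example value from above):
Code:
# 64 GiB in bytes: 64 * 1024^3 = 68719476736
# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=68719476736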
 
Yup, the more RAM for the ARC the better. But at some point you won't see much improvement when increasing the ARC size. Best to start with a big ARC, then lower it step by step and keep an eye on values like the hit rates you see when running arc_summary. As soon as you see a big performance hit (hit rates dropping or something similar), go back to something a bit bigger.

You've got fast SSDs, so I think even an 8 or 16 GB ARC should be fine. But it's best you really benchmark it yourself.
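If you want the hit rate as a single number while benchmarking, it can also be computed straight from the kernel counters; a rough sketch (arc_summary reports the same values in its hit-ratio section):
Code:
# overall ARC hit ratio, computed from /proc/spl/kstat/zfs/arcstats
awk '/^hits|^misses/ { v[$1]=$3 }
     END { printf "ARC hit ratio: %.2f%%\n", 100 * v["hits"] / (v["hits"] + v["misses"]) }' \
    /proc/spl/kstat/zfs/arcstats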
 
Do I need to reboot the server for the ARC size to take effect?

As for the raid setup:
The current server has RAID10, which gives me 2.87TB of space to use. Maybe I am just being stupid, but currently I have 13 VMs with a combined disk size of 1384GB, yet only 727GB is allocated according to Disks -> ZFS. I can only assume this is because of thin provisioning... But if I click on the "volume" (on the left) it shows: Usage 55.33% (1.53 TB of 2.77 TB). I do not know where the other 200-odd GB comes from. I have had it happen before that I ran out of space on the server and that brought all the VMs down.

I want the most available space for the VM Disks, but have some redundancy as well.
 
Do I need to reboot the server for the ARC size to take effect?
Yes.
As for the raid setup:
With ZFS you have to keep some things in mind:
1.) you should always keep 20% of the space free
2.) you shouldn't keep snapshots for too long, as these will grow over time and prevent ZFS from freeing up deleted/edited data
3.) you need a complete TRIM/discard chain from the guest OS, through the virtual disk, the virtual disk controller and protocol, and the physical disk controller, down to the physical disks. Otherwise deleted data can't be freed up
4.) if you use a raidz1/2/3 and you didn't change the default volblocksize, you are probably wasting a lot of capacity due to padding overhead. This will result in everything written to a zvol being way bigger than needed

For points 2 and 3 you can run zfs list -o space -r YourPoolName. If "USEDREFRESERV" is high, your discard isn't working. If "USEDSNAP" is high, you should remove your old snapshots.

For point 4 it depends on a lot of factors. To answer that, information like zpool status YourPoolName, zpool get ashift YourPoolName and zfs get volblocksize YourPoolName is needed.
 
Where can I set the snapshot settings, or is this something I have to do manually?
TRIM/discard: do you mean I just need to tick "Discard" under Advanced in the hard disk settings?
It looks like the default volblocksize is 8K. Should I go for 1M or smaller?
 
Where can I set the snapshot settings, or is this something I have to do manually?
Manually. If you want automated snapshots + automated pruning (so you can't forget to delete them after a few days), have a look at this script: https://github.com/Corsinvest/cv4pve-autosnap
For long-term backups you should use a PBS instead of snapshots: https://www.proxmox.com/en/proxmox-backup-server
TRIM/discard: do you mean I just need to tick "Discard" under Advanced in the hard disk settings?
That's one part of it. But your HBA/RAID card also needs to support TRIM. And you need to set up discard in every guest OS. And you need to use something like "VirtIO SCSI" with "SCSI" as the virtual disk controller, as protocols like "IDE" or "VirtIO Block" won't support TRIM commands.
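Just as an illustration of the PVE CLI side, a sketch (VMID 100 and the disk name are placeholders for your own VM, and the VM needs to be powered off/on for the controller change to apply):
Code:
# use the VirtIO SCSI controller and re-attach the disk with discard enabled
qm set 100 --scsihw virtio-scsi-pci
qm set 100 --scsi0 Raid10:vm-100-disk-0,discard=on,ssd=1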
It looks like the default volblocksize is 8K. Should I go for 1M or smaller?
Keep in mind that you will get terrible performance if you try to read/write a block that is smaller than your volblocksize. Do an 8K sync write (like a Postgres DB would do) against a 1M volblocksize and you would only see 1/128th of the performance (because 1M/8K = 128), and your SSDs would wear 128 times faster. So you might want the volblocksize as small as possible. But the smaller you choose it, the more space you will waste due to padding overhead when choosing a raidz1/2/3 (this is no problem with striped mirrors). So if, for example, you have a lot of MySQL in your workload with its 16K writes, you shouldn't use a raidz1 or raidz3, because then the 16K operations would be smaller than the reasonable volblocksize.

Layout | Usable capacity of raw capacity | Disks may fail | IOPS performance | Throughput performance (read/write) | Reasonable blocksize
6 disk raidz3 | 35% | 3 | 1x | 3x / 3x | 64K
6 disk raidz2 | 53% | 2 | 1x | 4x / 4x | 16K
6 disk raidz1 | 64% | 1 | 1x | 5x / 5x | 32K
6 disk striped 2-way mirror | 40% | 1 (up to 3) | 3x | 6x / 3x | 8K or 16K
6 disk striped 3-way mirror | 27% | 2 (up to 4) | 2x | 6x / 2x | 8K
The above is only valid for ashift=12 and 6 disks. As soon as you use another ashift or a different number of disks, it will look different.
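Note that in PVE the volblocksize of newly created disks comes from the ZFS storage's "Block Size" setting (existing zvols keep the value they were created with, so they would have to be moved or recreated). A sketch of changing it on the CLI, assuming your storage entry is also called Raid10:
Code:
# set the default volblocksize for newly created disks on this storage
pvesm set Raid10 --blocksize 16k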
 
Thank you for all the information. As per your previous post:
zfs list -o space -r Raid10
Code:
NAME                  AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
Raid10                1.13T  1.39T        0B     96K             0B      1.39T
Raid10/vm-106-disk-0  1.22T   124G        0B   24.6G          99.2G         0B
Raid10/vm-110-disk-0  1.13T  33.0G        0B   32.3G           716M         0B
Raid10/vm-114-disk-0  1.13T  33.0G        0B   32.3G           717M         0B
Raid10/vm-115-disk-0  1.24T   124G        0B   7.63G           116G         0B
Raid10/vm-116-disk-0  1.23T   124G        0B   14.5G           109G         0B
Raid10/vm-117-disk-0  1.24T   124G        0B   8.73G           115G         0B
Raid10/vm-146-disk-0  1.16T   124G        0B   85.1G          38.7G         0B
Raid10/vm-147-disk-0  1.17T   124G        0B   83.3G          40.5G         0B
Raid10/vm-148-disk-0  1.17T   124G        0B   74.3G          49.5G         0B
Raid10/vm-149-disk-0  1.17T   124G        0B   82.7G          41.0G         0B
Raid10/vm-150-disk-0  1.17T   124G        0B   78.8G          45.0G         0B
Raid10/vm-151-disk-0  1.16T   124G        0B   87.9G          35.9G         0B
Raid10/vm-152-disk-0  1.18T   124G        0B   63.9G          59.9G         0B
zpool status Raid10
Code:
  pool: Raid10
 state: ONLINE
  scan: scrub repaired 0B in 00:02:29 with 0 errors on Sun Aug 14 00:26:30 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        Raid10                                    ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            nvme-KCD6XLUL960G_12N0A0ZXT5M8-part1  ONLINE       0     0     0
            nvme-KCD6XLUL960G_12N0A0YVT5M8-part1  ONLINE       0     0     0
          mirror-1                                ONLINE       0     0     0
            nvme-KCD6XLUL960G_12N0A0YYT5M8-part1  ONLINE       0     0     0
            nvme-KCD6XLUL960G_12N0A0ZDT5M8-part1  ONLINE       0     0     0
          mirror-2                                ONLINE       0     0     0
            nvme-KCD6XLUL960G_12N0A0YZT5M8-part1  ONLINE       0     0     0
            nvme-KCD6XLUL960G_12N0A0ZRT5M8-part1  ONLINE       0     0     0

errors: No known data errors
zpool get ashift Raid10
Code:
NAME    PROPERTY  VALUE   SOURCE
Raid10  ashift    12      local
zfs get volblocksize Raid10
Code:
NAME    PROPERTY      VALUE     SOURCE
Raid10  volblocksize  -         -
 
It's a striped mirror so padding overhead isn't the problem.
And no snapshots are used, so this also isn't a problem.
But there is a lot of refreservation, so your discard/TRIM isn't working, and ZFS won't free up space when your guest OS deletes or overwrites something.

After choosing a protocol that supports TRIM commands and enabling the "Discard" checkbox for each virtual disk, you can run a single manual trim inside your guests to instantly free up the space. For Linux as the guest OS you could run fstrim -a, and for Windows as the guest OS Optimize-Volume -DriveLetter YourDriveLetter -ReTrim -Verbose.
But don't forget to mount your virtual disks with the discard option or set up a daily fstrim -a so discarding is automated.
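Inside an Ubuntu 22.04 guest, periodic trimming already ships as a systemd timer (weekly by default), so a sketch of automating it would simply be:
Code:
# inside the guest OS: make sure the periodic fstrim run is active
systemctl enable --now fstrim.timer
systemctl status fstrim.timer    # shows the next scheduled run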
 
I never ticked the discard option when I created the VMs. I didn't know what it does. :oops:

I saw this page, but I do not have this file on my server: /etc/modprobe.d/zfs.conf
Can I create it and add those entries, save and reboot, and it will take effect?
 
Can I create it and add those entries, save and reboot, and it will take effect?
Yes, you have to create that file yourself.
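Since your root pool is also on ZFS, the initramfs carries a copy of the setting, so a sketch of applying the change would be:
Code:
# after creating /etc/modprobe.d/zfs.conf with the zfs_arc_max line:
update-initramfs -u -k all    # rebuild the initramfs because the root filesystem is ZFS
reboot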
 
Okay, then last question, I guess.

When I create a ZFS pool through the GUI, it seems I cannot choose whether I want a striped 2-way mirror or a striped 3-way mirror. Can this only be done through the terminal then?
 
PVE's GUI only supports the most common pool layouts, so it can do a normal striped (2-way) mirror but not a striped 3-way mirror. Creating one is just a single line in the CLI though.

zpool create -f -o ashift=12 YourPoolName mirror /dev/YourDisk1 /dev/YourDisk2 /dev/YourDisk3 mirror /dev/YourDisk4 /dev/YourDisk5 /dev/YourDisk6 and then add a new ZFS storage in the webUI at "Datacenter -> Storage -> Add -> ZFS" pointing to the freshly created pool (and don't forget to set the "Thin Provision" checkbox).
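If you want to script that last step as well, the storage can also be registered from the CLI; a sketch ("nvme3way" is just a placeholder storage ID):
Code:
# register the new pool as a thin-provisioned storage for VM and CT disks
pvesm add zfspool nvme3way --pool YourPoolName --content images,rootdir --sparse 1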
 