ZFS ARC compared to the Linux page cache

chrcoluk

Hi guys, I have decided to make this thread as there are threads popping up related to the ARC.

In my experience ZFS ARC is a lot more polite than the old 'host page cache', and it is also much more configurable. However, its defaults are not necessarily suited to Proxmox, because the Linux kernel does not count ARC usage as available memory.

If you use ext4 on the host, then the operating system uses the page cache. This cache can consume almost all of your free RAM, has a dumb eviction mechanism, is barely configurable, and if swap is enabled it will usually cause the system to start swapping out (like Windows, the Linux page cache and swap behaviour is dumb). You can drop it via a command, but this only has a temporary effect until it fills up again. It is used for both read caching and dirty write caching. The write caching is a little bit configurable, but for the read caching it is a case of 'the devs know better' and you let it do its thing.
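For anyone who wants to poke at it, these are the usual knobs and the drop command (run as root; the values shown will just be whatever your system currently has):

Code:
# how much RAM is currently buffers / page cache
free -m
# the page cache's dirty write behaviour is controlled by these sysctls
sysctl vm.dirty_background_ratio vm.dirty_ratio vm.swappiness
# drop the clean page cache - temporary, it fills up again
sync; echo 3 > /proc/sys/vm/drop_caches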

For VM data, the behaviour is controlled by the per-virtual-disk cache settings documented here. This applies to both ext4 and ZFS.

https://pve.proxmox.com/wiki/Performance_Tweaks
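For reference, the cache mode is set per virtual disk, either in the GUI (Hardware -> Hard Disk -> Cache) or via the CLI. A rough example, with the VMID, storage name and volume name made up, so adjust to your own setup:

Code:
# set the existing scsi0 disk of VM 100 to cache=none
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=none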

If you use the default 'none' cache setting and are on ZFS, VM storage will not use the page cache at all; instead it uses the ARC (ZFS's read cache) and a dedicated ZFS dirty write cache.
The ARC has a configurable min size and max size; the min is very low and the max much higher, but actually reaching the max depends on having available memory as well as enough data to cache. As a rule of thumb it shrinks much better than the old Linux page cache, and the ARC doesn't usually cause OOMs or swapping.
The ZFS dirty cache is not quite as polite and is less well known; most discussions are about the ARC, but ZFS write caching is not inside the ARC. By default async writes start to be flushed within about 5 seconds, and how quickly this dirty cache grows depends on how write-heavy the workload is and how fast the disks are. Like the ARC, though, it is configurable and has a cap.
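If you want to see where you currently stand, something like this should show it (paths are the standard OpenZFS ones):

Code:
# current, minimum and maximum ARC size in bytes
awk '$1 == "size" || $1 == "c_min" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats
# cap on dirty (not yet written) ZFS data, in bytes
cat /sys/module/zfs/parameters/zfs_dirty_data_max
# txg timeout - the ~5 second flush interval mentioned above
cat /sys/module/zfs/parameters/zfs_txg_timeout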

If you use writethrough, writeback, or writeback (unsafe), then even with ZFS you will go through the page cache system and will see much higher cache usage; the kernel won't think twice about filling all free RAM with cached data and then starting to swap out, even with vm.swappiness set to 1. With ZFS this also causes double caching, which is of course a bad idea.
The main difference between directsync and 'none' for ZFS is that 'none' still lets the drives buffer async writes, whilst directsync forces flushes to disk storage for 'everything', and as such is much slower.

Let's say you have 64 GB of RAM and you allocate a low amount to VMs, e.g. 4 GB each for 3 VMs, so 12 GB utilised, all with cache set to 'none'. With the ARC configured to e.g. 50% of RAM, and a bit more for the ZFS dirty cache, this is a totally safe configuration. The memory graph in Proxmox will show the ARC usage, which makes it feel worse than the host page cache, but it isn't; the host page cache is invisible usage that doesn't show on the graph.
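Setting the ARC cap to 50% of 64 GB at runtime would look roughly like this (takes effect immediately, lost on reboot):

Code:
# 32 GiB = 50% of 64 GiB total RAM in this example
echo $((32 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max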

So when should the ZFS caches be tuned?

If you have 32 GB of RAM or less, it might be an idea to reduce the ZFS dirty cache limit, because that is unwritten data: the cache cannot be quickly reduced when memory is needed, and as such it is far less polite than the ARC. Especially with only 16 GB.
Otherwise, only be concerned if you are actually getting OOMs or swap usage. When there is a choice between swap and cache, having no swap utilisation should always be the priority.

For those who don't know, the relevant tunables are:

/sys/module/zfs/parameters/zfs_arc_max - max ARC size in bytes (set this to 10% of your total RAM if you want to match the new Proxmox defaults)
/sys/module/zfs/parameters/zfs_arc_min - min ARC size in bytes
/sys/module/zfs/parameters/zfs_dirty_data_max - max ZFS dirty cache in bytes (tune this before the ARC if you have less than 40 GB of total RAM)
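To make them persistent, my understanding is you put them in a modprobe config; the byte values below are only examples, pick your own:

Code:
# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=8589934592        # 8 GiB ARC cap
options zfs zfs_dirty_data_max=2147483648 # 2 GiB dirty data cap

Then run update-initramfs -u -k all (and reboot), which as far as I know is needed with root on ZFS so the values apply early at boot.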

I don't know the current defaults in Proxmox, but I know it used to be the case that the ZFS dirty cache defaulted to either a percentage of RAM or a fixed value, whichever was higher, and if you didn't have much RAM that meant a higher percentage of RAM was used by default. So this is far less likely to be an issue with 64 GB of RAM or more.

--edit: looks like the default is either 10% of RAM for the dirty cache or 4 GB, whichever is higher, so with 40 GB of RAM or more it will default to 10%; lower than that, it will use 4 GB, which is a higher percentage of RAM. With 16 GB of RAM, 25% of RAM can be used for dirty data.
From 8.1 onwards, on new installs, ARC usage is capped to 10% of RAM or 16 GB, whichever is lower.
In both of these cases the defaults can be overridden.
--edit

So remember: ARC usage shows on the graph; the ZFS dirty cache and the host page cache do not. This is because the ARC counts as used memory.

Example below of the nasty host page cache.

Code:
# free -m
               total        used        free      shared  buff/cache   available
Mem:           64194       42510        2354          54       20666       21684
Swap:           2300        2156         144

The memory graph showed the 20666 MB as not utilised. Note the swap usage. The server was sluggish to use.

Same machine with only ZFS caching for VMs (before I force-flushed the swap).

Code:
# free -m
               total        used        free      shared  buff/cache   available
Mem:           64194       30411       34250          50         369       33783
Swap:           2300        1434         866
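The swap flush mentioned above was nothing special, just cycling swap off and on so the pages get pulled back into RAM; this needs enough free RAM to hold them:

Code:
swapoff -a && swapon -a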

And currently, now with only zram configured for swap as well.

Code:
# free -m
               total        used        free      shared  buff/cache   available
Mem:           64194       43202       19627          55        2522       20992
Swap:            255           0         255
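For anyone wanting to try zram swap, a bare-bones manual sketch; Debian's zram-tools or systemd's zram-generator do the same thing persistently. This assumes the zram module and zstd are available in your kernel, and the size is just an example:

Code:
modprobe zram
echo zstd > /sys/block/zram0/comp_algorithm
echo 256M > /sys/block/zram0/disksize
mkswap /dev/zram0
swapon -p 100 /dev/zram0   # higher priority than any disk swap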

I do agree with the new 10% default in Proxmox 8.1, as things can be awkward if you want large amounts of available memory for firing up new VMs. But on the flip side, if you are like me and use less than 50% of RAM for VM allocation, then a higher ARC cap should be fine and helps a ton to speed up HDD reads. It will shrink when it detects available memory is low.
 
Interesting article, thank you!
In my experience ZFS ARC is a lot more polite
Yes, probably. But to my (very limited!) understanding it is slow. (Edit: I mean specifically shrinking the size, not its function.)

When a single process (one VM, one KVM process) requests a large block of memory, the kernel can NOT force the ARC to shrink during this call. It will run OOM. The kernel's internal buffers and/or(?) caches can be evacuated much faster --> OOM avoided.

That's my possibly wrong understanding. How can we prove it?
 
Interesting article, thank you!

Yes, probably. But to my (very limited!) understanding it is slow. (Edit: I mean specifically shrinking the size, not its function.)

When a single process (one VM, one KVM process) requests a large block of memory, the kernel can NOT force the ARC to shrink during this call. It will run OOM. The kernel's internal buffers and/or(?) caches can be evacuated much faster --> OOM avoided.

That's my possibly wrong understanding. How can we prove it?
This is why I support the new default of max 10%; for people like me it can still be tuned to a higher amount, but the new default behaviour is more VM-friendly.

The ARC is better and worse at the same time; I might edit my post and reword it a bit more.

The page cache is managed internally by the kernel and counts as available memory, so you can fire up new processes and they will be able to start, whilst with the ARC this may not be possible because the kernel sees it as used memory.

But when it comes to things like swapping out, I think the ARC is better than the host page cache. So on something like a web server or file server the ARC behaves very well, but not so well on Proxmox, where you might suddenly, out of nowhere, want lots of memory free to start up a new VM.

I feel like OpenZFS needs an extra tunable where you configure a minimum amount of available memory: if free memory falls below that, the ARC shrinks. Then you could have the best of both worlds, a high max size that is only used when there is a large amount of free memory. I will do a feature request on the ZFS GitHub.
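(For what it's worth, there already seems to be a tunable in this direction, zfs_arc_sys_free, which is supposed to be the amount of memory the ARC tries to keep free on the system. How quickly it reacts I don't know, so treat this as something to test rather than a fix:)

Code:
# target free memory (bytes) the ARC tries to leave; 0 means use the built-in default
cat /sys/module/zfs/parameters/zfs_arc_sys_free
# e.g. ask the ARC to try to keep ~8 GiB free (runtime change, not persistent)
echo $((8 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_sys_free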
 
There has been quite a lot of effort put into ZFS 2.3.* for fast cache eviction, so the ARC shrinks when another application needs to allocate memory; previously that could easily lead to an OOM for the application (which was never a problem for the "old host page cache" in the kernel).
Swapping itself is ... what should I say ... it's nice to have some configured and available just in case, but if you actually need it, you have too little RAM.
 
Nice writeup, but I am still uncertain about some parts.
When there is a choice between swap and cache, having no swap utilisation should always be the priority.
By default, Proxmox has no swap when installed on ZFS.
To me, this makes sense, since swapping to a CoW filesystem probably isn't great: slow, unnecessary TBW.

Since the swap disk of a VM is on a raw disk, which in turn is on the host's ZFS, I guess it is probably also best to disable swap for VMs?

Since ZFS TXG writes can't easily be shrunk, you could run into OOM.
So for example, if you have a 1 GBit NIC, there could be roughly (1 GBit - overhead = 930 Mbit) / 8 (to get MB) * 5 (because of the 5 s TXG goal of ZFS) ≈ 580 MB sitting in the dirty write cache. For 10 GBit that would be around 5.8 GB.
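The same back-of-the-envelope numbers as a quick sketch:

Code:
# 1 GBit: ~930 Mbit/s usable, 8 bits per byte, 5 s txg interval
echo $((930 / 8 * 5))     # ~580 MB of dirty data
# 10 GBit
echo $((9300 / 8 * 5))    # ~5810 MB, i.e. ~5.8 GB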

Making up OOM scenarios:
So assuming a 10 GBit hypervisor, at worst ZFS is sitting on 5.8 GB of dirty data that can't be shrunk. Let's say another thing that can't be shrunk is Proxmox itself, at 4.2 GB. So 10 GB unshrinkable in total.

  1. Assuming I have 64 GB RAM, my VMs use 58 GB and my ZFS limit is 10%: OOM could happen. The 10% limit (6.4 GB) isn't restrictive enough.
  2. Assuming I have 64 GB RAM, my VMs use 50 GB and my ZFS limit is 10%: OOM could not happen. The 10% limit (6.4 GB) is restrictive enough. But since the NIC limits our write cache to 5.8 GB anyway, the ARC limit is not a factor here.
  3. Assuming I have 64 GB RAM, my VMs use 50 GB and my ZFS limit is 50%: OOM could not happen. The 50% limit (32 GB) is not restrictive enough, but since the NIC limits our write cache to 5.8 GB anyway, the ARC limit is not a factor here.

So that is why I have a hard time understanding what the new default of 10% instead of the old 50% brings to the table.
Unless you have 100 GBit/s NICs, I don't see any realistic advantage. Am I missing something, or am I completely wrong?
 
By default, Proxmox has no swap when installed on ZFS.

Yes, because swapfiles on ZFS datasets or ZFS volumes are problematic, see https://pve.proxmox.com/wiki/ZFS_on_Linux#zfs_swap

If somebody wants to have swap with ZFS, they should use a dedicated partition or use zramswap (the swap then won't be on a real disc but in a part of the RAM, which in effect compresses parts of the RAM).
To me, this makes sense, since swapping to a CoW filesystem probably isn't great: slow, unnecessary TBW.

Exactly.

Since the swap disk of a VM is on a raw disk, which in turn is on the host's ZFS, I guess it is probably also best to disable swap for VMs?

In theory yes, but in edge cases like https://forum.proxmox.com/threads/n...ovisioning-8vms-64-gb-ram.168237/#post-782559 it can be useful: if your current hardware or budget constraints don't allow you to add more memory, a swap partition inside the VM might allow you to run a workload (although more slowly) that you couldn't run otherwise.
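If someone does want to turn swap off inside a Linux guest, a minimal sketch (check your own /etc/fstab entries before editing; this just comments out the usual 'swap' lines):

Code:
# inside the VM: disable swap now
swapoff -a
# keep it off after reboot by commenting out swap entries (backup kept as fstab.bak)
sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab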