Use case: Better understanding how ARC, SWAP and RAM usage is expected to behave

keson

Dear Proxmox gurus!
First, please accept my deep respect - the solution is absolutely amazing. I use it for a non-profit organization to help them cope with their IT in an affordable way. It has been running for 5+ years, I am on version 6.4-13 at the moment, and I plan to go to 7 by Christmas. When in doubt I study the documentation and read the forums, but there is one thing I would like someone to interpret for me, as I am afraid my understanding is not correct: I am struggling to understand how RAM is allocated and how swap is used. The reason is that over the past 3-5 months I have seen nearly regular bi-weekly VM restarts, most likely caused by hitting the RAM limit, since RAM usage started to climb above its average of the past 5 years.

The setup is an HP ML server with an HBA, a 6-core Xeon and 86 GB RAM. ZFS is used with SSDs for log and cache (mirrored to make sure a defective SSD does not corrupt the data, although the server is on a UPS) and four 7.2K RPM drives in striped mirrors.

Code:
zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool  3.62T   510G  3.13T        -         -    41%    13%  1.00x    ONLINE  -

Code:
zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 09:30:35 with 0 errors on Sun Oct 10 09:54:38 2021
config:


    NAME                              STATE     READ WRITE CKSUM
    rpool                             ONLINE       0     0     0
      mirror-0                        ONLINE       0     0     0
        wwn-0x50014ee004593460-part2  ONLINE       0     0     0
        wwn-0x50014ee00459337b-part2  ONLINE       0     0     0
      mirror-1                        ONLINE       0     0     0
        wwn-0x50014ee059ae513a        ONLINE       0     0     0
        wwn-0x50014ee00459347d        ONLINE       0     0     0
    logs   
      mirror-2                        ONLINE       0     0     0
        wwn-0x55cd2e414dddc1ca-part1  ONLINE       0     0     0
        wwn-0x55cd2e414dddf7cc-part1  ONLINE       0     0     0
        wwn-0x55cd2e414dddc34e-part1  ONLINE       0     0     0
        wwn-0x55cd2e414dddf74c-part1  ONLINE       0     0     0
    cache
      wwn-0x55cd2e414dddc1ca-part2    ONLINE       0     0     0
      wwn-0x55cd2e414dddf7cc-part2    ONLINE       0     0     0
      wwn-0x55cd2e414dddc34e-part2    ONLINE       0     0     0
      wwn-0x55cd2e414dddf74c-part2    ONLINE       0     0     0

There is also 12 GB of zram configured for swap:

Code:
zramctl
NAME       ALGORITHM DISKSIZE DATA COMPR TOTAL STREAMS MOUNTPOINT
/dev/zram5 lzo-rle         2G   4K   74B   12K       6 [SWAP]
/dev/zram4 lzo-rle         2G   4K   74B   12K       6 [SWAP]
/dev/zram3 lzo-rle         2G   4K   74B   12K       6 [SWAP]
/dev/zram2 lzo-rle         2G   4K   74B   12K       6 [SWAP]
/dev/zram1 lzo-rle         2G   4K   74B   12K       6 [SWAP]
/dev/zram0 lzo-rle         2G   7M    2M  2.4M       6 [SWAP]

There are 4 VMs, all Windows 2012 R2 64-bit, with 32+3+3+4 GB = 42 GB of RAM allocated in total. All machines have ballooning enabled and working.

Now comes my confusion.
For many years I had swappiness set to 1 and the system was rock stable. zfs_arc_min and zfs_arc_max were not set, thus = 0 (the defaults). RAM usage was always around 75% for many years, with the same VMs running all the time. Most likely after some Proxmox update in May/June I started to experience random VM restarts, which I concluded were due to high RAM usage, because since about June I see RAM usage of 92-96% = 82 to 84 GB out of 86 GB. So I started to read more about RAM usage, SWAP and ARC size, and did a test 3 weeks back when I committed the following changes:

1) I have limited the ARC min and max size (followed by update-initramfs -u):
Code:
/etc/modprobe.d/zfs.conf
options zfs zfs_arc_min=4294967296
options zfs zfs_arc_max=8589934592

2) I have modified the swappiness (to 10, 20, 60 and 100, one by one, for a couple of days each):
sysctl -w vm.swappiness=60
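
For completeness, this is how I checked that the new values actually took effect; I believe these are the standard paths for OpenZFS on Linux and plain sysctl, but please correct me if that assumption is wrong:
Code:
# current ARC limits in bytes (0 means the built-in default is used)
cat /sys/module/zfs/parameters/zfs_arc_min
cat /sys/module/zfs/parameters/zfs_arc_max

# zfs_arc_max can also be changed at runtime without a reboot, e.g. 8 GiB:
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max

# current swappiness; "sysctl -w" does not survive a reboot,
# so I also put "vm.swappiness = 60" into /etc/sysctl.conf
cat /proc/sys/vm/swappiness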

My experience was that the total RAM usage was suddenly, for the whole 2 weeks, only about 57 GB, which could theoretically be all the RAM allocated to the VMs plus the zram for SWAP.
The SWAP usage was still 0 all the time, but what went up drastically were the IO delays, causing many replications to time out; the backups with PBS also took much longer.
I concluded that the tuning of zfs_arc_max was not a good step and I reverted it (deleted the content of the /etc/modprobe.d/zfs.conf file).
Since the revert the system is again at 94-96% RAM usage, there are no timeouts on replications/backups and the system runs stable. Only I am not sure whether I will get some random VM crash/restart with such high RAM utilization.

And now I would like fellow gurus to clarify my following questions/assumptions:

1) ZFS/Proxmox by design uses about 50% of the available RAM for FS caching (the ARC) - is that right? Was zfs_arc_max expected to limit the maximum amount of RAM used for this cache?
2) SWAP is only used when RAM runs out, so is it correct to assume that in my case the SWAP usage is 0 because the system barely ever runs out of RAM? And is there any difference (apart from performance) in the logic of using SWAP between SWAP on HDD and SWAP in RAM (zram)? (The commands I use to watch this are listed right after the questions.)
3) If ZFS is expected to use 50% of RAM for its cache, is it correct not to allocate more than 50% of the physical RAM to all virtual machines together, to be on the safe side?
4) Is it expected behavior, in case RAM runs out, not to crash the VMs but instead to utilize SWAP? Can the VM crashes be somehow prevented in this particular case?
5) What is the exact logic behind swappiness set to 10 vs. 90 in the situation described here, where the virtual machines are assigned about 45% of the physical RAM in total and the cache uses another 50%? Will a change between 10 and 60 or 90 be visible at all?
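
These are the commands I use to watch the swap and ARC usage while testing, so that we are talking about the same numbers:
Code:
free -h                   # overall RAM and swap usage
swapon --show             # which swap devices exist and how full they are
arc_summary | head -n 25  # current ARC size, target and min/max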

I know that the reddish RAM usage in the Proxmox GUI is more a psychological thing than anything else, but I would like to keep RAM usage below, say, 85-90% to "be on the safe side" and at the same time give the system most of the available resources without endangering stability.

My plan is to set zfs_arc_max to 32 GB so that the performance during backups and replications is still OK, but at the same time to use at most ~85% of the RAM to prevent VM crashes. However, I am not sure whether this is the best approach.
I also plan to disable the SWAP completely, as I see it as 12 GB of wasted RAM, or to limit it to say 6 GB to free 6 more GB of RAM for the cache....
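
If I go that way, I assume the config would simply be the same file again with a larger value (32 GiB expressed in bytes), followed by update-initramfs -u and a reboot - planned only, not applied yet:
Code:
# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=34359738368   # 32 GiB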

I have no other test environment where I could observe the behavior over a long time and during backups and replications, so I had better ask before I do something really bad. I really do like the system and do not want to play with it beyond the configuration limits, as I admire its stability and do not want to break it just because of a lack of understanding.

P.S. If there are any config details you would like to see, please let me know.


Code:
arc_summary
------------------------------------------------------------------------
ZFS Subsystem Report                            Mon Oct 18 14:27:22 2021
Linux 5.4.140-1-pve                                   2.0.5-pve1~bpo10+1
Machine: pve01 (x86_64)                               2.0.5-pve1~bpo10+1


ARC status:                                                      HEALTHY
        Memory throttle count:                                         0


ARC size (current):                                   100.0 %   43.2 GiB
        Target size (adaptive):                       100.0 %   43.2 GiB
        Min size (hard limit):                          6.2 %    2.7 GiB
        Max size (high water):                           16:1   43.2 GiB
        Most Frequently Used (MFU) cache size:         68.5 %   20.8 GiB
        Most Recently Used (MRU) cache size:           31.5 %    9.6 GiB
        Metadata cache size (hard limit):              75.0 %   32.4 GiB
        Metadata cache size (current):                 47.6 %   15.4 GiB
        Dnode cache size (hard limit):                 10.0 %    3.2 GiB
        Dnode cache size (current):                     0.7 %   22.7 MiB
 
ZFS will use up to 50% of the RAM for its ARC if it is available. It might be a bit slow freeing it up again if other processes need it. This is where limiting the ARC size can be useful, if adding more RAM is not an option.

Swap is much more than just an extension to RAM, see https://chrisdown.name/2018/01/02/in-defence-of-swap.html
But it is not necessary to have any swap. If you run out of memory, a few more GB of swap usually won't save you.

An L2ARC (cache device) can actually be counterproductive regarding RAM usage, especially if you are already low on memory. ZFS now also needs to keep that additional index for it in memory. I would try it without it and see how the system behaves.
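
You can see how much ARC memory is currently spent just on indexing the L2ARC in the kernel stats, for example:
Code:
# header (index) size and total size of the L2ARC, in bytes
grep -E '^l2_(hdr_size|size|asize)' /proc/spl/kstat/zfs/arcstats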

You mentioned memory ballooning. How over provisioned is your system if you calculate the max possible RAM that each VM could take if it actually needed it all?

Your setup overall is quite complicated, with many knobs and dials that can cause unexpected behavior. I would try running it without swap and zram and without the L2ARC (cache vdev). Then all that needs RAM is the system itself (OS, ZFS) plus the VMs, which should make it quite a bit simpler. If you can avoid memory ballooning, even better. Everything that is over-provisioned can become a problem if enough VMs actually need their over-provisioned resources.
 
Thanks Aaron for the reply.
I follow your explanations and they make sense to me. I can try to disable the L2ARC, although I am a bit of an amateur, so I want to make sure that I don't screw the server up by disabling it. The server is several hundred km from me, so it can be a nightmare to fix my own stupid errors. But I will give it a try.

Regarding ballooning, it was perhaps not expressed in the right words, but I did not over-provision the server at all. The total amount of RAM on the physical bare-metal server is 86 GB and the 4 VMs are set up to use a maximum of 32 + 3 + 3 + 4 GB = 42 GB, which is slightly under 50%. One is a now nearly unused SQL server using about 1 GB most of the time, two are DCs each using about 1.5 GB, and the only real working server is the TS, which is assigned 32 GB; even during high use I still see 18-24 GB in use. So to answer: no over-provisioning and no real 100% use of what is provisioned - I guess around 60% in peaks.

I set it up 5 years back with zram and L2ARC following the best practices, or at least the best experiences, of the day. My intention was to have it as the designers expected, to prevent unexpected behavior. I agree that whatever makes it less complex is good, so I will start with this, in order:
1) disable ZRAM / SWAP
2) remove L2ARC

Regarding the "avoid memory ballooning" - I would like to ask what is the purpose of this when the system is not overprovaisoned? My understanding was that it is a good practice to keep the memory "organised". But of course it can be also disabled on the guest level by disabling the service I suppose.

Once again, thanks a lot for your input. I will do "tests" over the next few weeks, observing the performance results after each single change (I prefer to do one change at a time to see the real impact).
 
Maybe you are using your zram wrong. The idea of zram is to compress your RAM, not to swap RAM out to disks when you run out of it (the normal use case of swap). Your normal RAM is uncompressed and uses more space; your zram swap space in RAM is compressed. So you want a high swappiness, so that as much data as possible gets swapped out into the zram in RAM, where it gets compressed and needs less space but still resides in fast RAM. But I think in that case you don't want swap on your HDDs/SSDs, so your RAM isn't swapped out to the slow SSD/HDD when your zram swap gets full. So yes, zram is swap, but that isn't the real intention - it is just used as a method to compress your RAM.

Meanwhile there is also the ZFS "special device" class. You can use an SSD mirror as a special device for your HDD pool, so that metadata, deduplication tables and small files are stored on the fast SSDs instead of the slow HDDs. That way you move a lot of IOPS from your HDDs to the SSDs, and things like directory listings should be way faster. So maybe you want to try this instead of the L2ARC.
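
Just as a rough sketch of how that would look (the device names are placeholders, and note that the special vdev should be at least as redundant as the rest of the pool, because losing it means losing the pool):
Code:
# add a mirrored SSD special vdev to the existing pool (placeholder device names!)
zpool add rpool special mirror /dev/disk/by-id/<ssd1> /dev/disk/by-id/<ssd2>

# optionally also store small blocks (here: up to 4K) on the SSDs, set per dataset
zfs set special_small_blocks=4K rpool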
 
Thanks Dunuin, now that I read your first sentence out loud it makes more sense. I set up zram as SWAP in order to have fast swapping available when the system runs out of RAM, but I actually took RAM away for that purpose, so basically I just take part of my precious RAM and serve it as a "second-level SWAP" to provide "more RAM". Removing SWAP and zram makes more sense to me now.

Regarding the special device - I am glad you mentioned it. I have been reading a lot about it recently, but I was never sure whether it is the right use case. So please correct me if I am wrong: I can take my super-durable Intel SSDs and 1) stop using them as L2ARC and instead 2) assign those partitions as a special device for ZFS, so that metadata and small data chunks are stored there and are available faster than from a standard spinning HDD. Is there some good real-world example you know of for how to proceed (the steps taken) to use the existing SSD partitions as a special device (or to repartition them if needed)? I have read the documentation, which clarifies the theory for me, but some personal experience would really make me more confident.

Both the steps proposed by Aaron and Dunuin are now much clearer to me: simplification, better use of RAM and better use of the fast and reliable SSDs.
Thanks!
 
P.S. When SWAP is disabled completely, do I assume correctly that the swappiness level plays no role, or shall it also be set to 0 just to make sure the system does not expect swapping as a backup plan?
 
And there is also the KSMtuned package. This allows RAM deduplication and can save a lot of RAM if you run multiple VMs that are quite similar, because identical data only needs to be stored once in RAM. So if you run 4x Windows VMs it is very likely that many of the Windows processes store identical data in RAM that can be deduplicated. Here I have a lot of Debian VMs and deduplication is saving me 9-11 GB of my 64 GB RAM. But KSM RAM deduplication doesn't work for zram swap space, so you need to decide whether you want uncompressed RAM that gets deduplicated, or non-deduplicated but compressed RAM.
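
As a side note, you can see how much KSM is currently sharing directly in sysfs, roughly like this:
Code:
# number of 4K pages that are currently backed by a shared (deduplicated) page
cat /sys/kernel/mm/ksm/pages_sharing

# rough saving in GiB: pages_sharing * 4096 bytes
echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4096 / 1024^3" | bc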

Do any of the ZFS experts know whether the additional ~5 GB of RAM per 1 TB of storage is still needed for deduplication if an SSD special device is set up to store the deduplication tables of an HDD pool? Or will it be fine without the additional RAM because reading the deduplication tables from the SSD is fast enough for an HDD pool?
 
Regarding the special device: removing the L2ARC and/or SLOG should be easy. That can be done with the zpool remove command without losing any data.
But if you want to just test special devices first, you should do that with another pool, because a special device can't be removed once added without destroying the complete pool, including all your data on the HDDs. And after adding special devices, ZFS will only use them for new data and won't move old data from the HDDs to your SSDs. For that you would need to move your old data around so it gets written again (for example, moving data from one dataset to another dataset and back).
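
To make the removal concrete, it would look roughly like this with the partitions from your zpool status output (please double-check the names before running anything):
Code:
# remove the L2ARC partitions - cache vdevs can be removed at any time
zpool remove rpool wwn-0x55cd2e414dddc1ca-part2 wwn-0x55cd2e414dddf7cc-part2 \
                   wwn-0x55cd2e414dddc34e-part2 wwn-0x55cd2e414dddf74c-part2

# remove the SLOG mirror as well, if you also want to drop the log device
zpool remove rpool mirror-2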
Regarding your P.S. about swappiness with swap disabled: if there is no SWAP you can ignore it.
 
Hi everyone,
just sharing my experience, in case someone else goes down the same path. I have disabled the SWAP and then removed the zram entirely. Perhaps an obvious thing for gurus, but I had to dig into this a bit deeper, as the Proxmox server was installed 5 years back and at that time zram was not part of Proxmox, so it was installed as a standalone package. I was wondering where the configuration was, because all the current zram how-tos in the Proxmox documentation describe another way of configuring it. I had to disable the zram with
Code:
modprobe -r zram
I was also wondering where the configuration was, and it is in /usr/bin/init-zram-swapping - there is logic in there that determines the total RAM and the number of cores and creates one device per core; that is why mine had 43 GB in total (6 x 7.2 GB)...
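
A note in case anyone repeats this: as far as I understand, the module cannot be unloaded while the zram devices are still active as swap, so something like this is needed first (device names as in the zramctl output above):
Code:
# stop swapping on the zram devices first
swapoff /dev/zram0 /dev/zram1 /dev/zram2 /dev/zram3 /dev/zram4 /dev/zram5

# then the module can be removed
modprobe -r zram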
After disabling zram/SWAP, my IO delays went down visibly (the last 2.5 days on the graph below)...

[Attachment: screenshot of the IO delay graph]
I am now preparing for the L2ARC / SLOG changes and will watch it for another few days.

As I have 4 SSD drives for SLOG/L2ARC, I will put them into a mirror to prevent data loss in case a special device disk fails; I hope that will be redundant enough.
Thanks to all of you for your contributions and for sharing your experience!
 
Thank you for keeping us updated. The reduction in the IO delay spikes is quite impressive and shows that there was some unfortunate interaction costing performance.
 
