Too much RAM usage, and it doesn't appear to be caching

BJ Quinn

Member
Mar 1, 2018
I'm having an issue with a few Proxmox 6.x hosts where the total *used* RAM as shown in top and free is roughly double (or more) the amount of RAM used by the VMs themselves. There's a mixture of LXC and KVM, and in some cases I believe the Windows VMs do not have the balloon driver installed, but even if I add up the total allocated RAM (not just used RAM) of all the containers/VMs, I still get a number roughly half of what's actually showing as used.

I'm well aware of zfs caching and how you can trick yourself into thinking you don't actually have any free RAM, but in this case I'm pretty confident this isn't that -- for one thing, I'd expect zfs cache to show up as buff/cache in top and free. But what I'm seeing is way too much RAM under "used" without large amounts in free, buff/cache, or avail. Additionally htop shows it as green memory usage (rather than yellow for cache), and I'm actually waking up the OOM killer, which is killing some of my VMs.

Any suggestions? It doesn't seem to be caching, and the amount of RAM allocated to the containers/VMs doesn't seem to be nearly enough to account for the used RAM. These machines have ~128GB RAM, so what I'm seeing is something like 110GB used (not including buff/cache) out of 128GB, but less than half of that actually allocated to containers or VMs, and maybe 1/4 of that if you only counted RAM actually used within the containers/VMs.
 
Hi,

Can you post the output of 'free -m' and maybe 'top' (or htop) sorted by memory usage?
The VM and storage config would also be good.
 
Here's one 64GB machine.

Code:
top - 09:43:59 up 258 days, 18:34,  9 users,  load average: 2.08, 2.96, 3.46
Tasks: 1722 total,   1 running, 1721 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.3 us,  0.8 sy,  0.0 ni, 97.8 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  64312.6 total,   7878.7 free,  51379.5 used,   5054.5 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   8254.3 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                             
61633 27        20   0   13.7g   7.4g  17800 S   8.2  11.8  15808:42 mysqld                                                                                             
44006 root      20   0 5322524   4.1g   6908 S  22.7   6.5  16441:56 kvm                                                                                                 
39768 root      20   0 3919476   3.0g   4668 S   3.9   4.8  20496:38 kvm                                                                                                 
46952 496       20   0   17.1g   3.0g  29440 S   2.0   4.8   1035:42 java                                                                                               
17332 496       20   0 3157376   2.6g   6044 S   0.3   4.2   1847:54 mysqld                                                                                             
10773 496       20   0 2088328   1.2g   2636 S   2.3   2.0   6250:11 clamd                                                                                               
38250 496       30  10 4180416 314388  23984 S   0.0   0.5   2:11.31 java                                                                                               
 3361 48        20   0 1279116 173736  10908 S   0.0   0.3  30:52.33 httpd                                                                                               
 3360 48        20   0 1210508 166600  10884 S   0.0   0.3  31:09.62 httpd                                                                                               
42751 root      20   0  267832 165172 164884 S   0.0   0.3  15:12.29 systemd-journal                                                                                     
 3366 48        20   0 1149324 163372  10924 S   0.0   0.2  30:37.81 httpd                                                                                               
 3359 48        20   0 1085580 162056  10876 S   0.0   0.2  30:30.23 httpd                                                                                               
 3369 48        20   0 1217932 162048  10936 S   0.0   0.2  31:04.86 httpd                                                                                               
 3367 48        20   0 1148300 156248  10924 S   8.6   0.2  30:26.53 httpd                                                                                               
 3353 48        20   0 1214348 155784  10912 S   0.0   0.2  31:05.23 httpd                                                                                               
 3368 48        20   0 1081996 153948  10920 S   0.0   0.2  30:45.57 httpd                                                                                               
 2310 496       20   0  443896 149196   7728 S   0.0   0.2   0:15.05 amavisd                                                                                             
 3351 48        20   0 1013900 148164  10920 S   0.0   0.2  30:13.72 httpd                                                                                               
 3362 48        20   0 1078668 147632  10884 S  11.8   0.2  30:43.30 httpd                                                                                               
 3358 48        20   0 1078412 147516  10928 S   0.0   0.2  30:16.47 httpd                                                                                               
 3354 48        20   0 1142924 146680  10908 S   0.0   0.2  30:44.10 httpd                                                                                               
 3364 48        20   0 1018508 145324  10928 S   0.0   0.2  30:35.53 httpd                                                                                               
55135 496       20   0  440228 145288   7476 S   0.0   0.2   0:03.05 amavisd                                                                                             
 3372 48        20   0 1015436 143572  10896 S   0.0   0.2  30:20.79 httpd                                                                                               
29793 496       20   0  438260 143464   7620 S   0.0   0.2   0:03.90 amavisd                                                                                             
 3355 48        20   0 1015948 142624  10896 S   0.0   0.2  31:13.46 httpd                                                                                               
51183 496       20   0  436880 141716   7588 S  17.1   0.2   0:02.17 amavisd                                                                                             
 3363 48        20   0 1012364 141484  10920 S   0.0   0.2  30:45.40 httpd                                                                                               
 3352 48        20   0 1014668 141300  10912 S   0.0   0.2  30:42.32 httpd                                                                                               
 3357 48        20   0 1012876 137476  10904 S   0.0   0.2  30:48.05 httpd                                                                                               
56805 496       20   0  431532 136536   7588 S   0.0   0.2   0:01.81 amavisd                                                                                             
63213 www-data  20   0  367700 136396   9312 S   0.0   0.2   0:01.52 pveproxy worker                                                                                     
 7198 496       20   0  430712 135696   7588 S   0.0   0.2   0:00.69 amavisd                                                                                             
63214 www-data  20   0  367064 135352   9088 S   0.0   0.2   0:01.31 pveproxy worker                                                                                     
 4754 496       20   0  431160 134748   6260 S   0.0   0.2   0:00.39 amavisd                                                                                             
 3365 48        20   0  949900 132808  10924 S   6.6   0.2  30:10.60 httpd                                                                                               
63215 www-data  20   0  363312 132104   9072 S   0.0   0.2   0:01.07 pveproxy worker                                                                                     
 6892 www-data  20   0  350440 131344  13908 S   0.0   0.2   9:04.41 pveproxy                                                                                           
 3356 48        20   0  948108 130908  10884 S   5.9   0.2  30:59.11 httpd

Code:
              total        used        free      shared  buff/cache   available
Mem:          64312       51405        7852        3961        5054        8228
Swap:             0           0           0

Thanks!
 
The VM and storage config would also be good.
Please post that info too...

Also, I think the ZFS cache may not show up in the buff/cache section (it lives in a different part of the kernel),
so the output of 'arc_summary' would also be good.
 
Container RAM config: 8GB, 1GB, 9GB, 8GB, 0.5GB, 2GB, 1GB, 8GB, 4GB, 8GB = 49.5GB
Windows VM (with balloon driver) RAM config: 3GB, 4GB = 7GB

So if you add up total allocated RAM, it makes sense. But I thought that it would only count RAM actually used by the VMs/containers.

Container/VM RAM usage: 2GB, 0.5GB, 8.5GB, 7.5GB, 0.1GB, 0.2GB, 0.1GB, 3GB, 0.1GB, 0.6GB, 1.5GB, 3.5GB = 27.6GB
 
Code:
ZFS Subsystem Report                            Tue Nov 02 09:57:28 2021
Linux 5.4.34-1-pve                                            0.8.3-pve1
Machine: pve8 (x86_64)                                        0.8.3-pve1

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                    40.8 %   12.8 GiB
        Target size (adaptive):                        40.9 %   12.8 GiB
        Min size (hard limit):                          6.2 %    2.0 GiB
        Max size (high water):                           16:1   31.4 GiB
        Most Frequently Used (MFU) cache size:         74.6 %    8.4 GiB
        Most Recently Used (MRU) cache size:           25.4 %    2.9 GiB
        Metadata cache size (hard limit):              75.0 %   23.6 GiB
        Metadata cache size (current):                 14.6 %    3.5 GiB
        Dnode cache size (hard limit):                 10.0 %    2.4 GiB
        Dnode cache size (current):                    32.9 %  793.0 MiB

ARC hash breakdown:
        Elements max:                                               4.8M
        Elements current:                              12.2 %     586.7k
        Collisions:                                               357.9M
        Chain max:                                                     8
        Chains:                                                    19.5k

ARC misc:
        Deleted:                                                  545.9M
        Mutex misses:                                               2.2M
        Eviction skips:                                             6.0G

ARC total accesses (hits + misses):                                43.0G
        Cache hit ratio:                               98.7 %      42.5G
        Cache miss ratio:                               1.3 %     547.8M
        Actual hit ratio (MFU + MRU hits):             98.5 %      42.4G
        Data demand efficiency:                        99.0 %      18.2G
        Data prefetch efficiency:                      71.5 %     529.2M

Cache hits by cache type:
        Most frequently used (MFU):                    93.8 %      39.8G
        Most recently used (MRU):                       6.0 %       2.5G
        Most frequently used (MFU) ghost:               0.1 %      36.7M
        Most recently used (MRU) ghost:                 0.2 %      81.9M

Cache hits by data type:
        Demand data:                                   42.4 %      18.0G
        Demand prefetch data:                           0.9 %     378.6M
        Demand metadata:                               56.2 %      23.9G
        Demand prefetch metadata:                       0.5 %     222.3M

Cache misses by data type:
        Demand data:                                   31.6 %     173.2M
        Demand prefetch data:                          27.5 %     150.6M
        Demand metadata:                                4.1 %      22.3M
        Demand prefetch metadata:                      36.8 %     201.6M

DMU prefetch efficiency:                                           31.3G
        Hit ratio:                                      1.1 %     339.0M
        Miss ratio:                                    98.9 %      30.9G
 
Here's the same info from another machine with 128GB RAM; this one is actively OOM-killing VMs.

Code:
Tasks: 945 total,   2 running, 943 sleeping,   0 stopped,   0 zombie
%Cpu(s): 14.8 us,  2.0 sy,  0.0 ni, 82.8 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
MiB Mem : 128542.3 total,  22820.0 free, 103553.7 used,   2168.6 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  22405.6 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                 
19665 root      20   0   16.3g  15.7g   7256 S  61.1  12.5  80:55.59 kvm                                                                     
28461 root      20   0 9837404   8.1g   7380 S 490.2   6.4 136122:47 kvm                                                                     
28099 root      20   0 9135256   8.0g   7104 S   3.0   6.4  93:54.07 kvm                                                                     
13596 root      20   0 4835680   4.0g   7220 S  16.6   3.2  46:15.31 kvm                                                                     
11906 root      20   0 4913820   4.0g   7016 S   1.9   3.2  31:26.30 kvm                                                                     
33669 root      20   0 4107100   3.1g   6840 S   2.3   2.4   8255:50 kvm                                                                     
18801 root      20   0 2650644   1.9g   7084 S   3.0   1.5  54:36.83 kvm                                                                     
 3432 www-data  20   0  351876 135292  16376 S   0.0   0.1   6:02.03 pveproxy                                                                 
30815 www-data  20   0  364716 134532  10168 S   0.4   0.1   0:01.76 pveproxy worker                                                         
34179 www-data  20   0  360584 130476  10056 S   0.4   0.1   0:02.05 pveproxy worker                                                         
10989 www-data  20   0  359780 128732   9128 S   0.0   0.1   0:00.02 pveproxy worker                                                         
16186 root      20   0  358956 128284   9380 S   0.0   0.1   0:03.97 pvedaemon worke                                                         
22084 root      20   0  358980 128200   9328 S   0.0   0.1   0:04.29 pvedaemon worke                                                         
 3536 root      20   0  358956 127824   8920 S   0.8   0.1   0:01.26 pvedaemon worke                                                         
14296 root      20   0  358956 122156   3252 S   0.0   0.1   0:00.00 task UPID:lanep                                                         
 3422 root      20   0  350336 120012   2548 S   0.0   0.1   3:38.52 pvedaemon                                                               
 3431 root      20   0  331368  95732   6676 S   0.0   0.1  21:36.21 pve-ha-crm                                                               
 3441 root      20   0  330912  95452   6748 S   0.0   0.1  33:25.64 pve-ha-lrm                                                               
 3397 root      20   0  299556  85004   6980 S   0.0   0.1 399:55.78 pve-firewall                                                             
 3399 root      20   0  298148  84916   8336 S   0.0   0.1   1631:20 pvestatd                                                                 
30239 100048    20   0  491788  82828  12928 S   0.0   0.1   0:06.27 httpd                                                                   
19788 100048    20   0  490176  81248  12900 S   0.0   0.1   0:43.15 httpd                                                                   
30238 100048    20   0  489808  80952  12916 S   0.0   0.1   0:06.48 httpd                                                                   
35184 100048    20   0  489500  80584  12944 S   0.0   0.1   0:28.17 httpd                                                                   
17987 100048    20   0  489420  80516  12928 S   0.0   0.1   0:27.74 httpd                                                                   
 8465 100048    20   0  489452  80424  12876 S   0.0   0.1   0:19.41 httpd                                                                   
17353 100048    20   0  489420  80380  12872 S   0.0   0.1   0:12.02 httpd                                                                   
17352 100048    20   0  489408  80316  12868 S   0.0   0.1   0:13.63 httpd                                                                   
 5692 100048    20   0  489100  80100  12832 S   0.0   0.1   0:06.84 httpd                                                                   
30222 100048    20   0  489140  79960  12828 S   0.0   0.1   0:06.57 httpd                                                                   
 3178 root      20   0  670076  61684  46884 S   0.0   0.0 241:07.97 pmxcfs

Code:
              total        used        free      shared  buff/cache   available
Mem:         128542      103767       22606        1416        2168       22191
Swap:             0           0           0

Code:
ZFS Subsystem Report                            Tue Nov 02 10:05:13 2021
Linux 5.4.60-1-pve                                            0.8.4-pve1
Machine: lanepve2 (x86_64)                                    0.8.4-pve1

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                   100.1 %   62.9 GiB
        Target size (adaptive):                       100.0 %   62.8 GiB
        Min size (hard limit):                          6.2 %    3.9 GiB
        Max size (high water):                           16:1   62.8 GiB
        Most Frequently Used (MFU) cache size:         93.8 %   54.7 GiB
        Most Recently Used (MRU) cache size:            6.2 %    3.6 GiB
        Metadata cache size (hard limit):              75.0 %   47.1 GiB
        Metadata cache size (current):                 14.4 %    6.8 GiB
        Dnode cache size (hard limit):                 10.0 %    4.7 GiB
        Dnode cache size (current):                    20.7 %  997.9 MiB

ARC hash breakdown:
        Elements max:                                              13.2M
        Elements current:                              56.2 %       7.4M
        Collisions:                                               763.2M
        Chain max:                                                     9
        Chains:                                                     1.2M

ARC misc:
        Deleted:                                                  548.4M
        Mutex misses:                                              34.8k
        Eviction skips:                                           134.0M

ARC total accesses (hits + misses):                                 7.0G
        Cache hit ratio:                               92.3 %       6.5G
        Cache miss ratio:                               7.7 %     540.0M
        Actual hit ratio (MFU + MRU hits):             91.3 %       6.4G
        Data demand efficiency:                        95.5 %       3.3G
        Data prefetch efficiency:                      52.6 %     767.7M

Cache hits by cache type:
        Most frequently used (MFU):                    78.6 %       5.1G
        Most recently used (MRU):                      20.3 %       1.3G
        Most frequently used (MFU) ghost:               0.3 %      18.4M
        Most recently used (MRU) ghost:                 3.5 %     229.2M

Cache hits by data type:
        Demand data:                                   48.9 %       3.2G
        Demand prefetch data:                           6.2 %     403.6M
        Demand metadata:                               44.7 %       2.9G
        Demand prefetch metadata:                       0.2 %      13.8M

Cache misses by data type:
        Demand data:                                   27.9 %     150.9M
        Demand prefetch data:                          67.4 %     364.1M
        Demand metadata:                                1.0 %       5.2M
        Demand prefetch metadata:                       3.7 %      19.8M

DMU prefetch efficiency:                                            1.9G
        Hit ratio:                                      7.6 %     141.6M
        Miss ratio:                                    92.4 %       1.7G
 
On the 2nd (128GB) machine, here's the RAM allocation:

Containers: 4GB, 4GB, 4GB, 4GB, 4GB = 20GB
VMs: 3GB, 4GB, 16GB, 2GB, 8GB, 4GB, 8GB = 45GB

Used RAM: 0.2GB, 2GB, 0.1GB, 0.7GB, 0.1GB, 2.5GB, 3.5GB, 4.5GB, 1.5GB, 4.5GB, 1.5GB, 3.5GB = 24.6GB

This machine is the one that really confuses me. Even if Proxmox is reserving all allocated RAM (which is not how I thought that worked), I can't account for the used/not-cached RAM. And it does not appear to be caching, as the OOM killer is waking up and killing containers, which is not what I'd expect in a situation where the RAM is just being used for caching.

Please let me know if there is any other info I can send over. Thanks for taking a look at this!
 
Yes, from the info you can clearly see that the ARC from ZFS is counted in the 'used' memory and not in buff/cache.
On the 128G server you have ~63G of ARC; subtract that and the top kvm processes and you're left with about 20GiB, which is consistent with the htop/free output.
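
As a rough rule of thumb (not an official metric), the ARC above its minimum size should be reclaimable, so a quick sketch like the one below can approximate how much RAM is really still available for guests; it assumes the arcstats interface of ZFS 0.8.x:

Code:
# rough estimate: MemAvailable plus the part of the ARC that can still shrink (size - c_min)
arc_size=$(awk '$1 == "size" {print $3}' /proc/spl/kstat/zfs/arcstats)
arc_min=$(awk '$1 == "c_min" {print $3}' /proc/spl/kstat/zfs/arcstats)
mem_avail_kb=$(awk '$1 == "MemAvailable:" {print $2}' /proc/meminfo)
echo "$(( (mem_avail_kb * 1024 + arc_size - arc_min) / 1048576 )) MiB roughly available for guests"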
 
Yes, from the info you can clearly see that the ARC from ZFS is counted in the 'used' memory and not in buff/cache.
On the 128G server you have ~63G of ARC; subtract that and the top kvm processes and you're left with about 20GiB, which is consistent with the htop/free output.
Thank you, that is helpful. I incorrectly expected zfs cache to show up in buff/cache. What's the proper metric to use in order to know whether it's safe to allocate additional memory to VMs/containers? Is avail the right one? That seems odd, though, as it implies that I can't reclaim some of the zfs cache, which I ought to be able to do. I'm aware that I can hard wire the zfs config to use less cache, but that doesn't seem like the right answer.

Also, the issue that triggered me to look into this is that one of my containers was simply not running one morning. After investigating in dmesg, I see lots of OOM activity, where the container methodically has all of its processes killed (smb, then nmb, then sshd, etc.), though that wouldn't explain why the container was stopped. Perhaps this does, however:

Code:
[Sun Oct 31 04:55:14 2021] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=ns,mems_allowed=0,oom_memcg=/lxc/300,task_memcg=/lxc/300/ns,task=systemd,pid=30624,uid=0
[Sun Oct 31 04:55:14 2021] Memory cgroup out of memory: Killed process 30624 (systemd) total-vm:190812kB, anon-rss:1224kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:136kB oom_score_adj:0
[Sun Oct 31 04:55:14 2021] oom_reaper: reaped process 30624 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[Sun Oct 31 04:55:14 2021] vmbr0: port 3(veth300i0) entered disabled state
[Sun Oct 31 04:55:14 2021] device veth300i0 left promiscuous mode
[Sun Oct 31 04:55:14 2021] vmbr0: port 3(veth300i0) entered disabled state

I suppose if you kill systemd, there's not really anything left. I can't explain why this container went haywire -- it's only using 0.2GB of 4GB RAM allocated to it now that it's been back and running again for a few days, but I suppose I can look at that further.

Initially I read the above as the host being low on RAM and the OOM killer started killing things indiscriminately, and this container just happened to be the unlucky target of the OOM killer. But is what I'm seeing here instead that the container itself triggered the OOM killer based on its cgroup RAM limits? I suppose that makes more sense, but beyond the fact that I can't explain why the container went haywire, it seems really odd that the OOM killer felt the need to kill literally everything on the container, and then kill systemd too. Wouldn't the container have gotten back under control sometime before then, ending the need for the OOM killer to finally get around to killing systemd?
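
If I'm reading the CONSTRAINT_MEMCG and oom_memcg=/lxc/300 parts right, that points at the container's own cgroup limit rather than the host, so I'll also look at the container's memory counters; a sketch assuming the cgroup-v1 layout that PVE 6.x uses by default and container ID 300:

Code:
# configured limit, peak usage, and number of times the limit was hit for container 300
cat /sys/fs/cgroup/memory/lxc/300/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/lxc/300/memory.max_usage_in_bytes
cat /sys/fs/cgroup/memory/lxc/300/memory.failcnt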
 
Does anyone have a suggestion especially on the question about what the proper metric is in order to know whether it's safe to allocate additional memory to VMs/containers?

I'll figure out what went wrong with my container, but at a minimum I need to know when to stop allocating additional RAM to VMs/containers. If zfs cache shows up in used, and assuming I'm right that zfs should evict cache to make room for VMs/containers, I don't know how to tell how much RAM is actually available to allocate to them. I suppose I could either artificially limit the zfs cache or run arc_summary to see what it's doing, but it seems I'm missing something.
 
I personally set my ARC to a fixed size: 8GB as ARC max and 7.99GB as ARC min, so it will always use 8GB of RAM. Just make sure your ARC min is always at least 1 byte smaller than the max, or ZFS will ignore your custom limits. You could also lower the ARC limit in steps and watch how the hit rates in arc_summary change, to find the sweet spot where you get the best hit rates with the least ARC.
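
On Proxmox this is usually done via the ZFS kernel module options; a minimal sketch (8 GiB used here purely as an example value):

Code:
# /etc/modprobe.d/zfs.conf -- persistent ARC limits (example: 8 GiB max, one byte less as min)
options zfs zfs_arc_max=8589934592
options zfs zfs_arc_min=8589934591

# if the root pool is on ZFS, refresh the initramfs afterwards and reboot:
#   update-initramfs -u
# the same values can also be applied at runtime:
#   echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
#   echo 8589934591 > /sys/module/zfs/parameters/zfs_arc_min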
 
I personally set my ARC to a fixed size: 8GB as ARC max and 7.99GB as ARC min, so it will always use 8GB of RAM. Just make sure your ARC min is always at least 1 byte smaller than the max, or ZFS will ignore your custom limits. You could also lower the ARC limit in steps and watch how the hit rates in arc_summary change, to find the sweet spot where you get the best hit rates with the least ARC.
Thanks for the suggestion, I may end up handling it that way. Is this actually best practice for Proxmox? I certainly know how to set a fixed ARC size, but it would surprise me if that's what Proxmox expects us to do when we want to allocate >50% of RAM to VMs/containers.
 
Ideally ZFS should be allowed to use as much RAM as possible so no RAM sits there unused. But in some situations ZFS can't free up RAM fast enough and the OOM killer kills processes. In such a case it can make sense to limit your ARC so that can't happen anymore.
 
A few things.

ZFS doesn't use the normal OS page cache, so it won't show up in the buff/cache columns.

You have no swap. Even with plenty of RAM, on Proxmox you can get OOMs due to memory fragmentation; disable transparent huge pages and add some swap. You should only see minimal swap usage, but it will be enough to prevent the OOMs.
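
Roughly what that looks like in practice (a sketch; it assumes a GRUB-booted host, and /dev/sdX9 is just a placeholder for a real partition):

Code:
# disable transparent huge pages for the running system
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# to make it persistent, add transparent_hugepage=never to
# GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and run update-grub

# add swap on a dedicated partition
mkswap /dev/sdX9
swapon /dev/sdX9
echo '/dev/sdX9 none swap sw 0 0' >> /etc/fstab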

Finally, ZFS has a dirty (write) cache which doesn't show up in the ARC summary.

I think the Proxmox installer needs to enable swap by default again, probably via a dedicated swap partition, given that zvol-based swap isn't fully stable.
 
A few things.

ZFS doesn't use the normal OS page cache, so it won't show up in the buff/cache columns.

You have no swap. Even with plenty of RAM, on Proxmox you can get OOMs due to memory fragmentation; disable transparent huge pages and add some swap. You should only see minimal swap usage, but it will be enough to prevent the OOMs.

Finally, ZFS has a dirty (write) cache which doesn't show up in the ARC summary.

I think the Proxmox installer needs to enable swap by default again, probably via a dedicated swap partition, given that zvol-based swap isn't fully stable.
Thanks for the advice, I may try that as well.

Do you have any idea what the best way is to judge how much more RAM can be allocated to VMs? If I can't look at used or buff/cache, what do I look at? Do I just add up allocated RAM and leave some nominal amount (say, 8GB) for the host? There's got to be some way to judge the right balance there.

Thanks!
 
I don't know, sorry, but personally I have no hosts with more than 60% of system RAM allocated to VMs. I also don't use dynamic allocations, so min and max are set to the same value.

This is going to depend on your ZFS configuration. If you want to be more aggressive, cap the ARC and cap the dirty cache. Whenever I have triggered OOMs, it's always been during heavy ZFS writes, so the dirty cache seems to be a bigger risk than the read ARC.
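
If you want to try that, the dirty cache is controlled by the zfs_dirty_data_max module parameter (by default a fraction of RAM); a sketch with 1 GiB as an example value:

Code:
# runtime: cap ZFS dirty (not-yet-written) data at 1 GiB
echo 1073741824 > /sys/module/zfs/parameters/zfs_dirty_data_max
# persistent, alongside any ARC limits in /etc/modprobe.d/zfs.conf:
#   options zfs zfs_dirty_data_max=1073741824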
 
