Continuously increasing memory usage until the oom-killer kills processes

We are currently experiencing a problem on our main server, which runs Proxmox 6.1. It seems to have started when we updated that node from 5.4 to 6.1 on 2020-03-17.

Our system mostly uses containers on local ZFS storage. There is also one old KVM guest running. Backups of the containers are stored on NFS storage (Synology) using vzdump.

Yesterday we first noticed that processes were getting killed. Since we are running some Java applications, they get killed first (Jira, Confluence). The Proxmox UI just reported nearly full memory usage on the node. That is strange, because the processes in these containers normally use nowhere near the 96 GB of that machine, and in the UI overview of all containers the combined used memory was much lower.

First: It's not ZFS.

Code:
root@amor:~# arcstat 
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
16:32:08    40     0      0     0    0     0    0     0    0   3.3G  3.3G

The processes on the system don't use that much memory (this matches the per-container usage shown in the Proxmox UI):
Code:
root@amor:~# ps -ax -o rss  | tail -n +2 | paste -s -d+ - | bc
19807784

slabtop also seemed OK (around 13 GB usage). I only have a copy of /proc/slabinfo from that time (attached).

meminfo showed how little memory was left:
Code:
root@amor:~# cat /proc/meminfo 
MemTotal:       98885928 kB
MemFree:         1138156 kB
MemAvailable:    1664236 kB
Buffers:               0 kB
Cached:          1378444 kB
SwapCached:       192508 kB
Active:         12403076 kB
Inactive:        6676824 kB
Active(anon):   11869160 kB
Inactive(anon):  6370316 kB
Active(file):     533916 kB
Inactive(file):   306508 kB
Unevictable:      159248 kB
Mlocked:          159248 kB
SwapTotal:       8388604 kB
SwapFree:           1680 kB
Dirty:             16552 kB
Writeback:          1228 kB
AnonPages:      17710588 kB
Mapped:           704340 kB
Shmem:            539972 kB
KReclaimable:    1010396 kB
Slab:           13705372 kB
SReclaimable:    1010396 kB
SUnreclaim:     12694976 kB
KernelStack:       89024 kB
PageTables:       251980 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    57831568 kB
Committed_AS:   52493672 kB
VmallocTotal:   34359738367 kB
VmallocUsed:     1688564 kB
VmallocChunk:          0 kB
Percpu:         60249024 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:    82923100 kB
DirectMap2M:    17659904 kB
DirectMap1G:     2097152 kB

I wasn't able to figure out what was allocating all the memory. The processes, ARC, and slab caches together shouldn't even add up to 40 GB.

So I started shutting down the VM and the containers one after another and checked the memory consumption each time. But that only freed the memory used by the processes in each container. In the end I got:
Code:
# no running processes
root@amor:~# pct list | grep running | wc -l
0
# and little available memory
root@amor:~# free -m
              total        used        free      shared  buff/cache   available
Mem:          96568       78852       16076          75        1639       16107
Swap:          8191        1065        7126

root@amor:~# arcstat 
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
20:42:15    38     0      0     0    0     0    0     0    0   4.0G  4.0G  

root@amor:~# cat /proc/meminfo 
MemTotal:       98885928 kB
MemFree:        16451828 kB
MemAvailable:   16485748 kB
Buffers:               0 kB
Cached:          1042656 kB
SwapCached:       124856 kB
Active:           947648 kB
Inactive:         431952 kB
Active(anon):     302944 kB
Inactive(anon):   123776 kB
Active(file):     644704 kB
Inactive(file):   308176 kB
Unevictable:      159244 kB
Mlocked:          159244 kB
SwapTotal:       8388604 kB
SwapFree:        7297552 kB
Dirty:               248 kB
Writeback:             0 kB
AnonPages:        410996 kB
Mapped:           100072 kB
Shmem:             77512 kB
KReclaimable:     638892 kB
Slab:           12985512 kB
SReclaimable:     638892 kB
SUnreclaim:     12346620 kB
KernelStack:       13152 kB
PageTables:         8508 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    57831568 kB
Committed_AS:    4248996 kB
VmallocTotal:   34359738367 kB
VmallocUsed:     1639340 kB
VmallocChunk:          0 kB
Percpu:         62615808 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:    85710428 kB
DirectMap2M:    14872576 kB
DirectMap1G:     2097152 kB

The slab sizes also didn't change significantly.

The number that got my attention is the Percpu row. That is roughly the "missing" memory, but if that is the problem I'm not sure how to proceed or how to get detailed usage information for that memory. I just can't find useful information about it.
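
A simple way to keep an eye on that counter would be a loop like this on the host (the log path is just an example):
Code:
# log the Percpu line from /proc/meminfo once a minute to see whether it keeps growing
while true; do
    echo "$(date '+%F %T') $(grep '^Percpu:' /proc/meminfo)" >> /var/log/percpu-watch.log
    sleep 60
done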

To get the machine usable again I rebooted it, and the memory is available again, but we would like a better solution:

Code:
root@amor:~# arcstat 
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
21:19:13     0     0      0     0    0     0    0     0    0   7.1G   47G  

root@amor:~# cat /proc/meminfo  
MemTotal:       98885928 kB
MemFree:        94535824 kB
MemAvailable:   94246216 kB
Buffers:            8356 kB
Cached:           391696 kB
SwapCached:            0 kB
Active:          1184412 kB
Inactive:         245936 kB
Active(anon):    1064432 kB
Inactive(anon):   101452 kB
Active(file):     119980 kB
Inactive(file):   144484 kB
Unevictable:      159248 kB
Mlocked:          159248 kB
SwapTotal:       8388604 kB
SwapFree:        8388604 kB
Dirty:               676 kB
Writeback:           136 kB
AnonPages:       1189644 kB
Mapped:           247864 kB
Shmem:            123324 kB
KReclaimable:     125632 kB
Slab:             723044 kB
SReclaimable:     125632 kB
SUnreclaim:       597412 kB
KernelStack:       15360 kB
PageTables:        29104 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    57831568 kB
Committed_AS:    5086664 kB
VmallocTotal:   34359738367 kB
VmallocUsed:     1391068 kB
VmallocChunk:          0 kB
Percpu:           138240 kB
HardwareCorrupted:     0 kB
AnonHugePages:      6144 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      530012 kB
DirectMap2M:     7778304 kB
DirectMap1G:    94371840 kB

I later updated to the newest packages (Proxmox 6.1-8), still with the same kernel, and also limited the maximum ARC size for ZFS, though only because I don't think the 47 GB are necessary there.
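
For reference, this is roughly how such an ARC limit can be set; the 16 GiB value here is only an example, not necessarily what we used:
Code:
# /etc/modprobe.d/zfs.conf - cap the ZFS ARC at 16 GiB (16 * 1024^3 bytes)
options zfs zfs_arc_max=17179869184

# apply the new limit at runtime without a reboot
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
# rebuild the initramfs so the setting also applies at early boot
update-initramfs -u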

We are now running the following versions:
Code:
root@amor:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.18-2-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-7
pve-kernel-5.3: 6.1-5
pve-kernel-4.15: 5.4-6
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.4.117-2-pve: 4.4.117-110
pve-kernel-4.4.117-1-pve: 4.4.117-109
pve-kernel-4.4.98-5-pve: 4.4.98-105
pve-kernel-4.4.98-3-pve: 4.4.98-103
pve-kernel-4.4.62-1-pve: 4.4.62-88
pve-kernel-4.4.35-1-pve: 4.4.35-77
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-22
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-6
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1

And the problem seems to persist:

Code:
root@amor:~# date && cat /proc/meminfo 
Fri 27 Mar 2020 01:56:54 PM CET
MemTotal:       98885928 kB
MemFree:        10125640 kB
MemAvailable:   57181032 kB
...
Percpu:          4957056 kB
...

I'm looking forward to any help or suggestions!

Thanks,
Stephan
 

Attachments

  • slabinfo-before-stopping-all-containers.txt (22.9 KB)
  • slabtop-after-stopping-all-containers.txt (17.1 KB)
The number that got my attention was the Percpu-row. That is roughly the "missing" memory, but if that is the problem I'm not sure how to proceed and get detailed usage information for that memory. I just don't find useful information about it.
I don't know how much memory to expect to be allocated there. From the kernel documentation:
Percpu: Memory allocated to the percpu allocator used to back percpu allocations. This stat excludes the cost of metadata.
https://www.kernel.org/doc/html/latest/filesystems/proc.html?highlight=meminfo

Some time back there was a memory leak with check_mk. It doesn't seem to be what is happening at your end, but I wanted to post it anyway. Maybe it gives you some ideas.
https://forum.proxmox.com/threads/pve6-slab-cache-grows-until-vms-start-to-crash.58307/

You could try installing pve-kernel-5.4 and see if that resolves it.
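
Roughly like this, assuming the standard PVE 6.x repositories are already configured:
Code:
apt update
apt install pve-kernel-5.4
# the new kernel is only picked up after a reboot
reboot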
 
There is a very serious issue with LXC and RAM usage under ZFS. The OOM killer runs constantly on containers whose RAM allocation from 5.x was just fine for them forever. Suddenly I need to add 1, 2 or even 4 GB to all containers.

I think caches are being accounted to the CTs, and the OOM killer comes when they overrun. (I see this on ancient Jessie containers running just a couple of Apaches and a MySQL, with maybe 150-200 MB of RAM in total in use by the processes, sitting in a 2 GB container - then MySQL is down, because it got killed once the cache grew.)

I think the containers are not allowed to evict cached RAM (or there's no mechanism for it), and just overflow until the OOM killer comes. And no amount of RAM assigned to the container will help, because the caches just grow until they overflow.

However, if you drop your ZFS caches, suddenly the CT's RAM usage goes back to nearly nothing beyond the actual processes.

Code:
echo 3 > /proc/sys/vm/drop_caches

Then check RAM usage in your container.
 
This is ridiculous: this container is running a simple Apache and Mailman with a gig of RAM:

Code:
             total       used       free     shared    buffers     cached
Mem:       1048576    1048464        112     920736          0     921504
-/+ buffers/cache:     126960     921616
Swap:      1572864     524348    1048516

# ps auxwwf
bash: fork: Cannot allocate memory

A very serious issue. Adding more RAM won't help; eventually the cache will fill. Meanwhile the OOM killer took out part of Mailman, which wedged it.

This is on PVE 6.3 with a Debian 8.11 guest, but it happens with all versions of Debian, Ubuntu or Red Hat on any PVE 6.x or 5.x.

After a restart, no issues:

Code:
             total       used       free     shared    buffers     cached
Mem:       1048576     294692     753884       8392          0      27400
-/+ buffers/cache:     267292     781284
Swap:      1572864          0    1572864

This seems to be a known issue. Is there some lxc.conf option we can tweak to change this behaviour?

https://discuss.linuxcontainers.org/t/container-sees-buffer-cached-memory-as-used/10699
 
From the linked post, did you check whether you might be bitten by this? (One usual culprit is a non-persistent journal, but given the distro versions you use that doesn't seem too likely ;))

Remember that tmpfs usage will count as memory usage within a cgroup, so a system with no obvious memory usage through processes but high amount of used memory can sometimes be caused by tmpfs filesystems consuming the memory.
 
Not using tmpfs much at all in this situation, though I'll keep an eye on the systemd journal in /run.

Code:
# df -k `grep tmpfs /proc/mounts | awk '{print $2}'`
Filesystem     1K-blocks  Used Available Use% Mounted on
none                 492     0       492   0% /dev
tmpfs           65997564     0  65997564   0% /dev/shm
tmpfs           65997564 90236  65907328   1% /run
tmpfs               5120     4      5116   1% /run/lock
tmpfs           65997564     0  65997564   0% /sys/fs/cgroup
 
More examples:

Code:
             total        used        free      shared  buff/cache   available
Mem:       4194304     1883736          36     2305568     2310532     2310568
Swap:      5242880     1048532     4194348


This container is really irritating to use because it pauses all the time, doing GC or whatever to get free RAM.

I have 128 GB on the PVE host with about 40 free, but I don't see any reason why more RAM wouldn't eventually be eaten up by buff/cache as well, so no amount will help.

Is there some tuning somewhere I can use to modify the kernel behaviour so buffers are freed earlier?

This is basically fatal for running any LXC instances at all - Linux-wide, if it's not just PVE suffering from this. Random OOM kills are not production-safe.
 
Could you give a summary of what your system looks like? E.g., the following:

  • pveversion -v
  • storage.cfg
  • container config
  • /proc/meminfo on the host
  • /proc/meminfo in the container
  • anything else that is special about your setup
(FWIW, I don't see this behaviour on any of my containers - and given that we don't have any other reports about widespread unexpected OOM kills, I suspect it is something specific to your setup.)
 
But it's easy to see the behaviour regardless: just watch buffers in the container, and once they grow to a similar size as the free RAM (~70-80%) you run into problems if your container's processes use more than ~20% of its RAM. Something's gotta give, and buffers aren't evicted; the OOM killer is awakened instead.
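
For example, something as simple as this inside the container is enough to watch it happen:
Code:
# refresh the memory summary every 15 seconds
watch -n 15 free -m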

I will collect some stats, but I'll see if 7.1 has the same issue first.
 
systemd-journal could also be a culprit.

Code:
   50 root      20   0  284232 181392 178368 S  0.0 11.5   1:46.72 systemd-journal

Huge, and the container was just restarted a few days ago. It runs a few screen sessions and irssi for users. When it was created under Debian 6 it had 256 MB of RAM and lived a happy life until a recent upgrade into systemd hell. I eventually increased it to 2 GB, and still, systemd just eats up RAM forever as usual.

I tried MaxSize and MaxFileSize, but no help. Restarted yesterday, and yet:

Code:
54286 root      20   0  259668 163616 161616 S  0.0 10.4   0:32.34 systemd-journal

Meanwhile, a daily restart via crontab as a temporary fix for now. :(
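
For example, a crontab entry along these lines (assuming it is journald inside the container being restarted; the time of day is arbitrary):
Code:
# /etc/cron.d/restart-journald - restart journald every night at 04:00
0 4 * * * root systemctl restart systemd-journald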
 

Maybe it writes into a tmpfs somewhere? It doesn't use as much memory here, even on long-running containers.
 
OK! Finally caught this happening. It's NOT systemd-journal.*
*EDIT: (Actually, it is, but in a non-obvious way. See further down in this thread, post-472649.)

I started logging mem and top in the container continuously, every 15 seconds or so, and this is the last top output just before the CT got OOM-killed. It's absolutely buffers being charged to container RAM and OOM-killing it. This makes it pretty much impossible to run anything production on these containers, because no matter how much RAM you add, you'll always fill the buffers and OOM.
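
A simple loop like the following is enough to produce such a log (run inside the container; the log path is just an example):
Code:
# snapshot free and top every 15 seconds
while true; do
    { date; free -m; top -b -n 1 | head -n 60; echo; } >> /var/log/mem-watch.log
    sleep 15
done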

I could dump caches on the host with echo 3 > /proc/sys/vm/drop_caches, but that also kills all read buffers for the whole machine (including the ZFS ARC), which is a brutal performance hit for reads. I could perhaps do it at 3 am and slowly rebuild the caches during low traffic, but that's not an elegant solution.

Note the 1487.6 'avail Mem' in the output below, which assumes that memory is evictable - but it's not.

Code:
MiB Mem :   1536.0 total,      0.6 free,     48.4 used,   1487.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   1487.6 avail Mem

top - 21:25:01 up 13 days,  9:29,  0 users,  load average: 8.07, 8.11, 6.00
Tasks:  44 total,   1 running,  43 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  3.1 sy,  0.0 ni, 96.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    128 oident    20   0    8776    160      0 S   0.0   0.0   0:00.55 oidentd
    147 root      20   0    7448    184     60 S   0.0   0.0   0:00.00 agetty
    146 root      20   0    7448    196     68 S   0.0   0.0   0:00.00 agetty
    148 root      20   0    7448    200     72 S   0.0   0.0   0:00.00 agetty
     89 root      20   0   10540    472    204 S   0.0   0.0   0:03.53 cron
 238799 root      20   0    2300    544    484 S   0.0   0.0   0:00.00 sleep
    467 dircpro+  20   0    2704    760    340 S   0.0   0.0   1:08.40 dircproxy
    371 root      20   0    4128    780    144 S   0.0   0.0   0:00.22 bash
   2075 root      20   0    4128    948    380 S   0.0   0.1   0:00.30 bash
  89038 ryan      20   0    9484    952     80 S   0.0   0.1   0:00.98 screen
 236038 postfix   20   0   43504   1044    236 S   0.0   0.1   0:00.02 pickup
    352 root      20   0   43476   1048    212 S   0.0   0.1   0:06.52 master
 115827 justin    20   0    6244   1068    172 S   0.0   0.1   0:02.13 screen
    358 postfix   20   0   43624   1132    276 S   0.0   0.1   0:04.52 qmgr
  89039 ryan      20   0    8072   1400    144 S   0.0   0.1   0:00.03 bash
 115828 justin    20   0    4796   1520    144 S   0.0   0.1   0:00.07 bash
    142 root      20   0   15856   1540    684 S   0.0   0.1   2:15.79 sshd
    188 root      20   0    6032   1556    304 S   0.0   0.1   1:05.57 apache2
 216633 www-data  20   0    6112   1624    300 S   0.0   0.1   0:00.08 apache2
 217282 www-data  20   0    6112   1636    316 S   0.0   0.1   0:00.07 apache2
 216630 www-data  20   0    6120   1644    316 S   0.0   0.1   0:00.08 apache2
 216667 www-data  20   0    6120   1644    312 S   0.0   0.1   0:00.39 apache2
 216628 www-data  20   0    6120   1656    328 S   0.0   0.1   0:00.07 apache2
 216631 www-data  20   0    6120   1660    332 S   0.0   0.1   0:00.10 apache2
 223929 www-data  20   0    6120   1664    336 S   0.0   0.1   0:00.04 apache2
 223930 www-data  20   0    6120   1664    336 S   0.0   0.1   0:00.03 apache2
     94 root      20   0  225828   1740    216 S   0.0   0.1   2:47.78 rsyslogd
 238957 root      20   0    8012   1896   1536 R   0.0   0.1   0:00.01 top
    160 irc       20   0  292712   1940    276 S   0.0   0.1   1:49.63 ircd-hybrid
 238955 sshd      20   0   15856   2264   1404 S   6.2   0.1   0:00.01 sshd
 238959 sshd      20   0   15856   2316   1456 S   0.0   0.1   0:00.00 sshd
 216629 www-data  20   0    6120   2356   1028 S   0.0   0.1   0:00.07 apache2
 223928 www-data  20   0    6120   2404   1076 S   0.0   0.2   0:00.04 apache2
      1 root      20   0  169348   2952    800 S   0.0   0.2   0:38.41 systemd
  89044 ryan      20   0   92696   3224    852 S   0.0   0.2   2:38.36 irssi
 115832 justin    20   0   89520   3380    912 S   0.0   0.2   2:11.31 irssi
 238951 sshd      20   0   15856   3392   2508 S   0.0   0.2   0:00.01 sshd
 238872 root      20   0   15856   4248   3404 S   0.0   0.3   0:00.00 sshd
 238956 root      20   0   15856   4312   3468 S   0.0   0.3   0:00.02 sshd
 238912 root      20   0   15856   4316   3476 S   0.0   0.3   0:00.01 sshd
 238949 root      20   0   15856   4320   3472 S   0.0   0.3   0:00.01 sshd
 238954 root      20   0   15856   4340   3492 S   0.0   0.3   0:00.01 sshd
 238950 root      20   0   16248   4812   3788 S   0.0   0.3   0:00.01 sshd
 217114 root      20   0  292552 184764 182576 S   0.0  11.7   0:46.59 systemd-journal

pveversion -v:

Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-6-pve)
pve-manager: 7.1-12 (running version: 7.1-12/b3c09de3)
pve-kernel-helper: 7.1-14
pve-kernel-5.13: 7.1-9
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-7
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-5
libpve-guest-common-perl: 4.1-1
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.1-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-2
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-7
pve-cluster: 7.1-3
pve-container: 4.1-4
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-6
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.1-2
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
 
Same problem here: all my containers are silently taking more and more memory as cache until the OOM killer kicks in. This happens in all my containers, but only on PVE 6. In my case, restarting systemd-journald did not help; however, reducing the journal's log size increased the available memory by the same amount I reduced the log by. (But the available memory keeps decreasing, so it is just a matter of time - I have only gained time.)
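
For reference, one way to shrink the journal and check its size from inside the container:
Code:
journalctl --vacuum-size=100M   # shrink archived journal files to roughly 100 MB
journalctl --disk-usage         # check how much the journal uses now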
 
@mathx Thanks very much for the workaround. I am a little reluctant to do this on a production server, but to be honest having to restart containers on a weekly basis is really a problem, so I will check this out.
 
Note that doing so will clear the caches/buffers for the entire host, and dump the ARC if you're using ZFS. This will spike read load on your disks, since there's no read cache available anymore, and it only repopulates as blocks are read. It's a horrible solution, honestly. This is a major bug; there must be some traction on it somewhere in the Linux kernel community, since it's widespread and not specifically a Proxmox issue.

https://discuss.linuxcontainers.org/t/container-sees-buffer-cached-memory-as-used/10699

https://discuss.linuxcontainers.org/t/lxc-eating-memory/2325

https://github.com/lxc/lxd/issues/3337#issuecomment-303593876 - lots of good info in here, still reading.

This is an interesting comment in that last thread:

"LXD 2.13 will automatically set soft_limit_in_bytes to be 90% of limit_in_bytes if memory limits are set in hard mode. My understanding is that this creates some amount of memory pressure at the kernel level which will cause things like buffers to be flushed at that point rather than when hitting 100%." Can this be tuned in proxmox?
 
OK, after more reading, I was wrong. While the OOM killer may be held at bay by the LXC/LXD default of a 90% RAM-use soft limit, the kernel cannot evict tmpfs pages.

How is buffers/cache related to tmpfs? There's a misleading 'feature' in LXCFS reporting tmpfs as 'buffers/cache' RAM. So no amount of researching 'how to evict buffers/cache' will clarify what's going on until you realize this: you must reduce your tmpfs usage manually.

(I am wondering: if tmpfs is using up all the buffers/cache for the container, does the container then get no buffers/cache at all? Or is it all in the host's ARC anyway?)

The question then becomes (and goes back to) "what is eating so much tmpfs?", and again, it's systemd-journald.

Code:
# du -s /run
1556544 /run

#free                                                                                                        
              total        used        free      shared  buff/cache   available                                                                                                             
Mem:        4194304       56596     2536308     1560880     1601400     4137708

#journalctl --vacuum-time=2days >& /dev/null
#free
              total        used        free      shared  buff/cache   available                                                                                                             
Mem:        4194304       53384     3923956      176432      216964     4140920

#du -s /run
172108   /run

The simple solution is then to set RuntimeMaxUse=100M in /etc/systemd/journald.conf (and restart journald).

journalctl --disk-usage will report the usage as well.
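
Concretely, something like this inside the container:
Code:
# /etc/systemd/journald.conf
[Journal]
RuntimeMaxUse=100M

# then restart journald and check the result
systemctl restart systemd-journald
journalctl --disk-usage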

I am saddened that there is no sane default in systemd-journald to vacuum logfiles - at all. They just grow forever. Systemd strikes again!
 
That said, is there a way to restrict the tmpfs size in a container directly? A sane default of half the container's RAM unless overridden would be nice.
 
How is buffers/cache related to tmpfs? There's a misleading 'feature' in LXCFS reporting tmpfs as 'buffers/cache' RAM. So no amount of researching 'how to evict buffers/cache' will clarify what's going on until you realize this: you must reduce your tmpfs usage manually.

(I am wondering: if tmpfs is using up all the buffers/cache for the container, does the container then get no buffers/cache at all? Or is it all in the host's ARC anyway?)

This is not a misleading feature of lxcfs at all - a tmpfs is memory, so writing to a tmpfs counts as using memory (from the kernel's perspective, which enforces the memory limits you have configured). lxcfs just exposes the kernel's view using the interface userspace expects. It would not help you at all if lxcfs did not take that into account - it would look like you have free memory, but all your stuff would get OOM-killed anyway ;)

The default of half the host's memory happens in the kernel, and the kernel doesn't have a concept of containers - it just has namespaces for various things on one side, and resource limits for various resources on the other. So you basically have to set an explicit limit where possible if you want something else (e.g., /etc/fstab for things mounted by systemd in the container, logind/journald settings for the tmpfs instances set up by those services, and so on).
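
For example, one way to cap /dev/shm explicitly from inside the container (the size is just an illustration):
Code:
# inside the container's /etc/fstab - cap /dev/shm at 256 MiB
tmpfs  /dev/shm  tmpfs  defaults,size=256M  0  0

# or change it on the fly:
mount -o remount,size=256M /dev/shm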
 
This is not a misleading feature of lxcfs at all - a tmpfs is memory, so writing to a tmpfs counts as using memory (from the kernel's perspective, which enforces the memory limits you have configured).

The misleading part is reporting it as buffers/cache instead of as "used". If you google this, there are dozens of posts by others who were misled by it and tried to flush their caches, which will not help at all. "Used" would be a fairer category to put it in, but I'm guessing that because of the way the kernel structures these things, that may be harder to code. Nonetheless, it's misleading.

At any rate, now that we know, we know.

Is there a way to restrict tmpfs size in containers?
 