[SOLVED] Small LXC + tmpfs + journald with defaults => LXC dies with OutOfMemory

Alexey Pavlyuts

Hello All!

I have been experiencing random, silent stops of LXC containers with small memory allocations, with no trace of the cause. Watching them over time, I found that the most probable reason is out of memory. Allocating more memory lets a container run longer, but it fails anyway. While trying to find where the memory went, I found that "shared" grows and grows until it reaches the OOM condition.

The result of my investigation is quite curious; here is the sequence:
1. When a container starts, it has a tmpfs mounted at /run, and the size of that tmpfs is set to half of the server's physical memory, not to the container's allocated memory. In my case the tmpfs size is 63 GB (the server has 128 GB) while the container has a 2 GB allocation (see the quick check after this list).
2. systemd-journald puts its journal files into /run/log/journal (tmpfs!), with the default journal size limit of 10% of the filesystem size (in my case, a 6.3 GB limit).
3. Journal growth depends on the app and its logging mode, so an LXC running something that never uses journald extensively may run without problems. But if the app logs through journald, tmpfs usage grows constantly.
4. However, the tmpfs is actually allocated within the container's memory limit!
5. So an OOM condition is just a matter of time if the LXC memory allocation is less than 5% of the physical memory size and journald is used extensively inside the LXC.
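You can see the mismatch from inside the container; a quick check, assuming a cgroup v2 setup (the exact cgroup path may differ):
Code:
# inside the container: compare the /run tmpfs size with the container's memory limit
df -h /run                        # shows roughly half of the host's RAM
free -m                           # "total" reflects the LXC allocation
cat /sys/fs/cgroup/memory.max     # cgroup v2 memory limit in bytes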

So I have some LXCs with a 1-2 GB allocation for lightweight apps that really use 150-200 MB of memory, but these apps log through journald. With the ~6 GB default journald log limit, reaching the OOM condition is just a matter of time.

This issue has a very simple workaround: set the journald runtime log size limit in /etc/systemd/journald.conf
Code:
RuntimeMaxUse=128M
and restart systemd-journald

This should force journald to keep tmpfs usage within 128 MB (or whatever you configure) and flush logs to disk, where a separate limit of 10% of the filesystem size applies.
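For what it's worth, a drop-in file works as well and avoids editing the packaged journald.conf; a minimal sketch (the drop-in file name is just an example):
Code:
mkdir -p /etc/systemd/journald.conf.d
cat > /etc/systemd/journald.conf.d/runtime-limit.conf <<'EOF'
[Journal]
RuntimeMaxUse=128M
EOF
systemctl restart systemd-journald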

Questions: is this a bug or just behaviour by design? Should PVE set the tmpfs size according to the LXC memory allocation? Is there a way to manage this globally, without having to take care of the journald config in each LXC?
 
Update: I reviewed my historical Zabbix records of LXC available memory and found that the memory leak intensified after the upgrade from 7.4.18 to 8.3.1.
So it is possible that my long-term problem with stopping LXCs has no relation to this issue.
However, the issue does appear in 8.3.1.
[chart: LXC available memory over time, leak intensifies after the 7.4.18 -> 8.3.1 upgrade]
I still have one host on 7.4.18 and an LXC which ran for some time on 8.3.1 and was then moved back to 7.4; its available-memory chart confirms that this comes from the version change:
[chart: available memory of the LXC that was moved from 8.3.1 back to 7.4]
Memory was leaking while the container was on a node with 8.3 and healed when it went back to 7.4.

And finally, the conservation law holds all the time: before the upgrade I saw the PVE nodes' available memory decrease over time to about 20% and then stay at that level, and the first thing I noticed after the upgrade to 8.3 is that memory consumption no longer rises so aggressively over time:
[chart: PVE node available memory before and after the upgrade]
The upgrade happened on 12-11..16.

My conclusion on this:
1. 7.4 allocates tmpfs from the node's memory space, sets its limit to 50% of the node's physical memory, and does not count it against the container's memory limit. In my case this led to a slow drain of the node's available memory, while the container's available memory was not drained. Probably very heavy tmpfs usage could still crash an LXC, but I have no way to prove this.
2. 8.3 allocates tmpfs inside the container's memory limit but still sets the tmpfs size to 50% of the node's physical memory. So it drains quite fast, and I hit the OOM crash much sooner.
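One way to see where the tmpfs pages are charged is to look at the container's cgroup on the node. This is only a sketch for a cgroup v2 layout; the path and the container ID 101 are placeholders from my setup and may differ:
Code:
# on the PVE node: anon/file/shmem (tmpfs-backed) bytes charged to container 101
grep -E '^(anon|file|shmem) ' /sys/fs/cgroup/lxc/101/memory.stat
cat /sys/fs/cgroup/lxc/101/memory.current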

Both cases come down to journald runtime logs and the /run tmpfs.

It would be very nice if someone could point me to any source of information on PVE memory management. Perhaps some LXC docs?

Should I open a ticket for this somewhere?
 
FINALLY:
It is not a PVE issue/bug. It is not even an LXC or systemd issue. It is a kernel issue.

The core reason:
Any tmpfs mounted without an explicitly given size gets the kernel-default tmpfs size, which is usually half of physical RAM.
Even when mount is run inside a container, the kernel applies this default without any respect to cgroup memory limits; it is the same default in every case.
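A quick way to see this from inside a container (the sizes below are from my setup, a 2 GB container on a 128 GB host):
Code:
# mount a tmpfs without a size option and check what the kernel gives it
mount -t tmpfs tmpfs /mnt
df -h /mnt     # ~63G, i.e. half of the host's RAM, not half of the container's 2G
umount /mnt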

This leads to negative effects:
1. API filesystems like /run will have the kernel-default half-RAM size unless explicitly remounted with another size.
2. Any app which creates a tmpfs for its own needs will get a half-RAM-sized tmpfs unless the app manages the size itself.
3. Writing more than the LXC's memory+swap to any tmpfs will lead to an LXC OOM crash, and that is possible whenever the tmpfs size exceeds the LXC's memory+swap (in fact, most small LXCs). A reproduction sketch follows this list.
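To illustrate point 3, this is a deliberate way to trigger the OOM kill in a throwaway container with a 2 GB limit; do not run it on anything you care about:
Code:
# fill the /run tmpfs past the container's memory+swap limit
dd if=/dev/zero of=/run/fill bs=1M count=4096
# the container's OOM killer fires long before the tmpfs itself is full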

Workaround:
1. Remount the tmpfs with a smaller size by putting remount lines into /etc/fstab, for example:
Code:
none /run none remount,size=128M 0 0
2. journald may lead to trouble; limit its tmpfs usage with the RuntimeMaxUse= option.
3. Watch the "shared" figure of free for growth, and mind that PVE 7 does not subtract it from "available", which may lead to OOM without any visible "available memory" drain (see below). It looks like a healthy LXC just stops. A couple of commands for this are sketched after the list.
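For reference, the remount can also be applied immediately without a reboot, and the growth is easy to watch; a sketch:
Code:
# apply the smaller /run size right away (same effect as the fstab line above)
mount -o remount,size=128M /run
# keep an eye on tmpfs-backed memory inside the container
watch -n 60 'free -m; df -h /run'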

Side findings:
Monitoring of available memory shows that under PVE 7.4, memory allocated to tmpfs shows up as "shared" but does not decrease the "available" amount. So when an LXC runs out of memory, you cannot see it in the web interface or in Zabbix charts, and it finally crashes with OOM while seemingly having available memory (!!!). However, the PVE node's available memory decreases over time, and I blame container tmpfs for this. In PVE 8.2, "shared" does decrease "available", so you can watch the available memory drain to zero and then the OOM crash happen.