[SOLVED] LXC OOM when low usage

ca_maer

Dec 5, 2017
Hello,

I was wondering if maybe the memory reported in the webUI for LXC containers is wrong.

Here's the output of 'free -m' from a container with 2 GB of RAM and no swap:
Code:
              total        used        free      shared  buff/cache   available
Mem:           2048         978           9        2757        1060           9
Swap:             0           0           0

You can see that the shared value is 2757 MB, which is more than the total memory available to this container. I'm not sure if this is normal.

This also causes issues in htop:
Screen Shot 2017-12-15 at 10.17.33 AM.png

This value seems to be from /proc/meminfo under Shmem
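
A quick way to confirm which fields look wrong is to compare them against MemTotal from inside the container. The sketch below is only an illustration: the function names and the list of fields checked are my own choices, not anything lxcfs or Proxmox provides.

```python
import os

# Fields that should never exceed MemTotal in a consistent view.
# This selection is an assumption made for the example.
BOUNDED_FIELDS = ("MemFree", "MemAvailable", "Buffers", "Cached", "Shmem")


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: value_in_kB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def fields_exceeding_total(fields):
    """Return the bounded fields whose value is larger than MemTotal."""
    total = fields.get("MemTotal")
    if total is None:
        return []
    return [name for name in BOUNDED_FIELDS if fields.get(name, 0) > total]


if __name__ == "__main__":
    # Only runs where /proc/meminfo exists (i.e. on Linux).
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as fh:
            suspicious = fields_exceeding_total(parse_meminfo(fh.read()))
        print("fields larger than MemTotal:", suspicious or "none")
```

With the values from the 'free -m' output above (2048 MB total, 2757 MB shared), this check would flag Shmem.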

Therefore, monitoring software with graphs, such as LibreNMS, constantly reports RAM usage at 100%, while the web UI reports less than 5%.

Any ideas?

Code:
proxmox-ve: 5.1-30 (running kernel: 4.13.8-2-pve)
pve-manager: 5.1-38 (running version: 5.1-38/1e9bc777)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.13.8-2-pve: 4.13.8-28
pve-kernel-4.13.8-3-pve: 4.13.8-30
libpve-http-server-perl: 2.0-7
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-22
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-3
pve-container: 2.0-17
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
libpve-apiclient-perl: 2.0-2
 
Also, when using lxc-top I'm getting a different value than in the web UI.

Here's the lxc-top output for container 105, which shows 10 GB of RAM in use:
Code:
Container                   CPU          CPU          CPU          BlkIO        Mem
Name                       Used          Sys         User          Total       Used
105                   168820.39     35720.76    131137.77      292.00 KB   10.30 GB

Yet in the web UI I get a completely different value:
Screen Shot 2017-12-19 at 11.25.16 AM.png

At this point I have no idea where to look to get the exact memory usage of my LXC container.
 
The view from inside the container is a virtual one provided by lxcfs, which tries to calculate meaningful values for the fields in /proc/meminfo. Unfortunately this is not possible for all of them, so for some fields you see wrong values or simply those of the host. The view from outside is just what the kernel thinks/knows the cgroup of the container is using.
 
Thanks for your input. So if I understand correctly, I could read the memory limit from /sys/fs/cgroup/memory/lxc/id/, subtract the RSS memory used by the container, and that should give me the container's RAM usage? How does the web UI calculate RAM usage?
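
For anyone wanting to check the numbers from the host side, here is a rough sketch that reads the cgroup v1 memory files under that directory. It is not how the web UI does it (that is still the open question here); `summarize` and `container_memory` are names made up for this example, and treating "cache minus shmem" as droppable is only an estimate, since dirty or mapped file pages are not always reclaimable either.

```python
from pathlib import Path


def parse_memory_stat(text):
    """Parse cgroup v1 memory.stat text ("key value" per line, in bytes)."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def summarize(limit, usage, stat):
    """Split usage into droppable file cache and memory that stays charged."""
    # Prefer the hierarchical total_* counters when present.
    cache = stat.get("total_cache", stat.get("cache", 0))
    shmem = stat.get("total_shmem", stat.get("shmem", 0))
    # shmem (tmpfs) is counted inside cache but cannot simply be dropped.
    droppable = max(cache - shmem, 0)
    pinned = usage - droppable
    percent = 100.0 * pinned / limit if limit else 0.0
    return {
        "limit": limit,
        "usage": usage,
        "droppable_cache": droppable,
        "pinned": pinned,
        "pinned_percent": percent,
    }


def container_memory(cgroup_dir):
    """Read limit, usage and stats from a cgroup v1 memory directory."""
    base = Path(cgroup_dir)
    limit = int((base / "memory.limit_in_bytes").read_text())
    usage = int((base / "memory.usage_in_bytes").read_text())
    stat = parse_memory_stat((base / "memory.stat").read_text())
    return summarize(limit, usage, stat)
```

On the host this could be called as `container_memory("/sys/fs/cgroup/memory/lxc/105")`, using the path form mentioned above with the container ID filled in.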
 
Sorry for reopening this thread but it seems related to this: https://github.com/lxc/lxcfs/issues/175

It seems to have been fixed 18 days ago.

This would also explain why my containers would start swapping even when plenty of ram was available.

Also, it seems someone else opened an issue with the exact same problem on GitHub: https://github.com/lxc/lxcfs/issues/222 so we will see how they respond.
 
There is definitely a problem with the way free RAM is calculated. One of our containers was reporting low RAM usage in the web UI, and our Nagios plugin, which is based on the same calculation, never reported anything critical, yet the container was hit by the OOM killer, which crashed the whole thing. These are production servers, so we want to be able to monitor them correctly. So far the solution provided is inaccurate, and hitting the OOM killer is a critical issue. Sure, we can add more RAM, but right now we have no way of knowing whether it will be enough, since we can't monitor RAM usage correctly.

Code:
[2092606.150856] Memory cgroup out of memory: Kill process 7052 (systemd) score 2 or sacrifice child
[2092606.157133] Killed process 7052 (systemd) total-vm:37428kB, anon-rss:1512kB, file-rss:0kB, shmem-rss:0kB

Thanks
 
I think the swapping is more likely due to a full tmpfs inside the container, but we will review including the fix you linked as well.
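
To check the tmpfs theory from inside a container, something like the following lists every tmpfs mount and how full it is. This is a minimal sketch using only /proc/mounts and `os.statvfs`; the helper names are invented for the example.

```python
import os


def tmpfs_mount_points(mounts_text):
    """Return mount points of type tmpfs from /proc/mounts-style text."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "tmpfs":
            # /proc/mounts escapes spaces in paths as \040.
            points.append(fields[1].replace("\\040", " "))
    return points


def usage_bytes(mount_point):
    """Return (used, total) bytes for the filesystem at mount_point."""
    st = os.statvfs(mount_point)
    total = st.f_blocks * st.f_frsize
    used = (st.f_blocks - st.f_bfree) * st.f_frsize
    return used, total


if __name__ == "__main__":
    # Only runs where /proc/mounts exists (i.e. on Linux).
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as fh:
            mounts = tmpfs_mount_points(fh.read())
        for mp in mounts:
            try:
                used, total = usage_bytes(mp)
            except OSError:
                continue  # mount not accessible from here
            print(f"{mp}: {used // 1024} kB used of {total // 1024} kB")
```

Files stored on tmpfs are charged to the container's memory cgroup as shmem, so a large "used" figure here counts against the RAM limit.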
 
Got the issue again on 2 other containers. It seems to only happen on containers that have a low RAM limit.

Both machines have 1 GB of RAM and run Ubuntu 16.04.

Here's one of them

Code:
[3174928.549171] Memory cgroup out of memory: Kill process 2663 (mysqld) score 112 or sacrifice child
[3174928.555861] Killed process 2663 (mysqld) total-vm:1234292kB, anon-rss:117464kB, file-rss:0kB, shmem-rss:0kB

Code:
              total        used        free      shared  buff/cache   available
Mem:           1024          67         116       15151         840         116
Swap:             0           0           0


You'll find the whole dmesg output attached

Thanks
 


Dumping more logs from another instance that happened this morning, in case it helps.

Code:
[2018-01-19 00:01:55]  Process accounting resumed
[2018-01-19 11:13:54]  apache2 invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null),  order=0, oom_score_adj=0
[2018-01-19 11:13:54]  apache2 cpuset=ns mems_allowed=0-1
[2018-01-19 11:13:54]  CPU: 1 PID: 4272 Comm: apache2 Tainted: P          IO    4.13.8-2-pve #1
[2018-01-19 11:13:54]  Hardware name: HP ProLiant DL380 G6, BIOS P62 05/05/2011
[2018-01-19 11:13:54]  Call Trace:
[2018-01-19 11:13:54]   dump_stack+0x63/0x8b
[2018-01-19 11:13:54]   dump_header+0x97/0x225
[2018-01-19 11:13:54]   ? mem_cgroup_scan_tasks+0xc4/0xf0
[2018-01-19 11:13:54]   oom_kill_process+0x208/0x410
[2018-01-19 11:13:54]   out_of_memory+0x11d/0x4c0
[2018-01-19 11:13:54]   mem_cgroup_out_of_memory+0x4b/0x80
[2018-01-19 11:13:54]   mem_cgroup_oom_synchronize+0x31e/0x340
[2018-01-19 11:13:54]   ? get_mem_cgroup_from_mm+0x90/0x90
[2018-01-19 11:13:54]   pagefault_out_of_memory+0x36/0x7b
[2018-01-19 11:13:54]   mm_fault_error+0x8f/0x190
[2018-01-19 11:13:54]   __do_page_fault+0x4be/0x4f0
[2018-01-19 11:13:54]   do_page_fault+0x22/0x30
[2018-01-19 11:13:54]   page_fault+0x28/0x30
[2018-01-19 11:13:54]  RIP: 0033:0x7fbb16c3b820
[2018-01-19 11:13:54]  RSP: 002b:00007fbb01feac48 EFLAGS: 00010206
[2018-01-19 11:13:54]  RAX: 0000000000000030 RBX: 0000000000000039 RCX: 0000000000000a6c
[2018-01-19 11:13:54]  RDX: 00007fbb01feacd8 RSI: 0000000000000000 RDI: 00007fbb173ca621
[2018-01-19 11:13:54]  RBP: 00007fbb01feacb0 R08: 000000000000fa01 R09: 00007fbb173ca621
[2018-01-19 11:13:54]  R10: 00007fbb01feb6b0 R11: 00007fbb173ca5e8 R12: 00007fbb173c9028
[2018-01-19 11:13:54]  R13: 0000000000000004 R14: 00007fbb173ca588 R15: 00007fbb173ca629
[2018-01-19 11:13:54]  Task in /lxc/129/ns killed as a result of limit of /lxc/129
[2018-01-19 11:13:54]  memory: usage 2621440kB, limit 2621440kB, failcnt 0
[2018-01-19 11:13:54]  memory+swap: usage 2621440kB, limit 2621440kB, failcnt 2352493
[2018-01-19 11:13:54]  kmem: usage 24916kB, limit 9007199254740988kB, failcnt 0
[2018-01-19 11:13:54]  Memory cgroup stats for /lxc/129: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[2018-01-19 11:13:54]  Memory cgroup stats for /lxc/129/ns: cache:1272596KB rss:1323928KB rss_huge:0KB shmem:1272580KB mapped_file:15728KB dirty:0KB writeback:0KB swap:0KB inactive_anon:701924KB active_anon:1894584KB inactive_file:0KB active_file:0KB unevictable:0KB
[2018-01-19 11:13:54]  [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[2018-01-19 11:13:54]  [11828]     0 11828     9388      414      24       3        0             0 systemd
[2018-01-19 11:13:54]  [11992]     0 11992    16024     6327      36       3        0             0 systemd-journal
[2018-01-19 11:13:54]  [12109]   110 12109    33082      924      53       3        0             0 freshclam
[2018-01-19 11:13:54]  [12111]     0 12111     7202      147      19       3        0             0 systemd-logind
[2018-01-19 11:13:54]  [12112]     0 12112    68267      229      36       3        0             0 accounts-daemon
[2018-01-19 11:13:54]  [12115]     0 12115     6517       67      18       3        0             0 cron
[2018-01-19 11:13:54]  [12118]   109 12118    10723      129      26       3        0          -900 dbus-daemon
[2018-01-19 11:13:54]  [12174]   104 12174    64099      370      27       3        0             0 rsyslogd
[2018-01-19 11:13:54]  [12332]     0 12332    16380      183      36       3        0         -1000 sshd
[2018-01-19 11:13:54]  [12356]     0 12356     3211       34      12       3        0             0 agetty
[2018-01-19 11:13:54]  [12357]     0 12357     3211       35      12       3        0             0 agetty
[2018-01-19 11:13:54]  [12358]     0 12358     3211       34      12       3        0             0 agetty
[2018-01-19 11:13:54]  [12565]     0 12565    18967      376      41       3        0             0 apache2
[2018-01-19 11:13:54]  [13457]     0 13457    16352      119      23       3        0             0 master
[2018-01-19 11:13:54]  [13467]   106 13467    16881      114      25       3        0             0 qmgr
[2018-01-19 11:13:54]  [20048]  1005 20048    11278      162      26       3        0             0 systemd
[2018-01-19 11:13:54]  [20049]  1005 20049    15190      381      31       3        0             0 (sd-pam)
[2018-01-19 11:13:54]  [32193]   112 32193    19010     5918      39       3        0             0 snmpd
[2018-01-19 11:13:54]  [31613]     0 31613   521476   243324     568       5        0             0 clamd
[2018-01-19 11:13:54]  [ 4218]    33  4218   502348      681      99       5        0             0 apache2
[2018-01-19 11:13:54]  [ 4219]    33  4219   502354      695      99       5        0             0 apache2
[2018-01-19 11:13:54]  [ 1624]  1003  1624    11279      162      27       3        0             0 systemd
[2018-01-19 11:13:54]  [ 1625]  1003  1625    15251      441      32       3        0             0 (sd-pam)
[2018-01-19 11:13:54]  [ 4376]     0  4376    23842      236      49       3        0             0 sshd
[2018-01-19 11:13:54]  [ 4392]  1003  4392    23842      243      48       3        0             0 sshd
[2018-01-19 11:13:54]  [ 4393]  1003  4393     5575      417      17       3        0             0 bash
[2018-01-19 11:13:54]  [18358]   106 18358    16869      112      24       3        0             0 pickup
[2018-01-19 11:13:54]  [23436]   110 23436   113662    76900     196       3        0             0 freshclam
[2018-01-19 11:13:54]  Memory cgroup out of memory: Kill process 31613 (clamd) score 360 or sacrifice child
[2018-01-19 11:13:54]  Killed process 31613 (clamd) total-vm:2085904kB, anon-rss:973292kB, file-rss:4kB, shmem-rss:0kB

LXC 129 has 2.5 GB available and was showing 40% usage in the web UI yet 90%+ in LibreNMS, so I'm assuming LibreNMS was right and not the web UI, since the OOM killer was invoked.
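
Adding up the figures in the OOM report above shows where the memory went: cache + rss + kmem comes to exactly the 2621440 kB limit, and nearly all of the cache is shmem (tmpfs), which cannot be dropped when there is no swap. The sketch below redoes that arithmetic from the log line. Which of these figures the web UI actually bases its percentage on is not known here, so the "rss only" number is shown purely for comparison.

```python
import re

# Figures copied from the OOM report above (all in kB).
LIMIT_KB = 2621440
KMEM_KB = 24916
STATS_LINE = (
    "Memory cgroup stats for /lxc/129/ns: cache:1272596KB rss:1323928KB "
    "rss_huge:0KB shmem:1272580KB mapped_file:15728KB"
)


def parse_stats(line):
    """Extract the key:valueKB pairs from a 'Memory cgroup stats' line."""
    return {key: int(value) for key, value in re.findall(r"(\w+):(\d+)KB", line)}


def breakdown(stats, kmem_kb, limit_kb):
    """Relate the cgroup counters to the limit."""
    charged = stats["cache"] + stats["rss"] + kmem_kb
    droppable = stats["cache"] - stats["shmem"]  # real file cache only
    return {
        "charged_kb": charged,
        "droppable_cache_kb": droppable,
        "rss_only_percent": 100.0 * stats["rss"] / limit_kb,
        "pinned_percent": 100.0 * (charged - droppable) / limit_kb,
    }


if __name__ == "__main__":
    for name, value in breakdown(parse_stats(STATS_LINE), KMEM_KB, LIMIT_KB).items():
        print(name, value)
```

Only 16 kB of the cache is ordinary file cache; the rest is tmpfs data, so a tool that subtracts all cache from usage will report far less than what the kernel has charged to the container.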
 
