Container process being OOM killed

otbutz

I'm running Proxmox 5.1 with ZFS as the storage backend, and I can't wrap my head around the memory usage reported by the OOM killer:

Code:
Nov 02 07:19:54 srv-01-1 kernel: Task in /lxc/111 killed as a result of limit of /lxc/111
Nov 02 07:19:54 srv-01-1 kernel: memory: usage 1048576kB, limit 1048576kB, failcnt 2777408
Nov 02 07:19:54 srv-01-1 kernel: memory+swap: usage 1048576kB, limit 2097152kB, failcnt 0
Nov 02 07:19:54 srv-01-1 kernel: kmem: usage 17044kB, limit 9007199254740988kB, failcnt 0
Nov 02 07:19:54 srv-01-1 kernel: Memory cgroup stats for /lxc/111: cache:1008064KB rss:23468KB rss_huge:0KB shmem:1008048KB mapped_file:156KB dirty:12KB writeback:0KB swap:0KB inactive_anon:497020KB active_anon:534496KB inactive_file:12KB active_file:0KB unevictable:0KB
Nov 02 07:19:54 srv-01-1 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Nov 02 07:19:54 srv-01-1 kernel: [ 4253] 100000  4253     9340      365      21       3        0             0 systemd
Nov 02 07:19:54 srv-01-1 kernel: [ 4557] 100109  4557    11763      121      27       3        0             0 dbus-daemon
Nov 02 07:19:54 srv-01-1 kernel: [ 4605] 100000  4605     8176       84      21       3        0             0 systemd-logind
Nov 02 07:19:54 srv-01-1 kernel: [ 4607] 100104  4607    46705      360      27       4        0             0 rsyslogd
Nov 02 07:19:54 srv-01-1 kernel: [ 4609] 100000  4609    69282      197      39       3        0             0 accounts-daemon
Nov 02 07:19:54 srv-01-1 kernel: [ 4612] 100000  4612     6518       65      18       3        0             0 cron
Nov 02 07:19:54 srv-01-1 kernel: [ 4721] 100110  4721    12497      106      27       3        0             0 dnsmasq
Nov 02 07:19:54 srv-01-1 kernel: [ 4740] 100000  4740    11716      170      24       3        0             0 monit
Nov 02 07:19:54 srv-01-1 kernel: [ 4799] 100000  4799    16381      178      35       3        0             0 sshd
Nov 02 07:19:54 srv-01-1 kernel: [ 4821] 100000  4821     3212       35      12       3        0             0 agetty
Nov 02 07:19:54 srv-01-1 kernel: [ 4822] 100000  4822    24582      189      52       3        0             0 login
Nov 02 07:19:54 srv-01-1 kernel: [ 4825] 100000  4825     3212       35      12       3        0             0 agetty
Nov 02 07:19:54 srv-01-1 kernel: [ 5186] 100000  5186    60321      465     116       3        0             0 nmbd
Nov 02 07:19:54 srv-01-1 kernel: [ 5213] 100000  5213    84411      681     163       3        0             0 smbd
Nov 02 07:19:54 srv-01-1 kernel: [ 5215] 100000  5215    81734      611     152       3        0             0 smbd
Nov 02 07:19:54 srv-01-1 kernel: [ 7259] 100000  7259    84411      647     157       3        0             0 smbd
Nov 02 07:19:54 srv-01-1 kernel: [26505] 100000 26505    16169     1970      35       3        0             0 check-new-relea
Nov 02 07:19:54 srv-01-1 kernel: [26507] 111106 26507    12318      172      30       3        0             0 systemd
Nov 02 07:19:54 srv-01-1 kernel: [26514] 111106 26514    23289      441      47       3        0             0 (sd-pam)
Nov 02 07:19:54 srv-01-1 kernel: [26528] 111106 26528     6410      393      17       3        0             0 bash
Nov 02 07:19:54 srv-01-1 kernel: Memory cgroup out of memory: Kill process 26505 (check-new-relea) score 7 or sacrifice child
Nov 02 07:19:54 srv-01-1 kernel: Killed process 26505 (check-new-relea) total-vm:64676kB, anon-rss:7880kB, file-rss:0kB, shmem-rss:0kB
Nov 02 07:19:54 srv-01-1 kernel: oom_reaper: reaped process 26505 (check-new-relea), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Code:
root@dnsmasq:~# free -m
              total        used        free      shared  buff/cache   available
Mem:           1024          38           0        1268         985           0
Swap:             0           0           0

Code:
root@dnsmasq:~# cat /proc/meminfo
MemTotal:        1048576 kB
MemFree:             244 kB
MemAvailable:        244 kB
Buffers:               0 kB
Cached:          1008336 kB
SwapCached:            0 kB
Active:           528740 kB
Inactive:         492724 kB
Active(anon):     528636 kB
Inactive(anon):   492488 kB
Active(file):        104 kB
Inactive(file):      236 kB
Unevictable:           0 kB
Mlocked:           80240 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:               200 kB
Writeback:            28 kB
AnonPages:      11042224 kB
Mapped:           215760 kB
Shmem:           1299336 kB
Slab:               0 kB
SReclaimable:          0 kB
SUnreclaim:            0 kB
KernelStack:       12000 kB
PageTables:        64364 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    32927204 kB
Committed_AS:   21973272 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:   9613312 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:     8494268 kB
DirectMap2M:    58499072 kB
DirectMap1G:     2097152 kB

Code:
root@dnsmasq:~# ps -aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.2  37360  2212 ?        Ss   Oct27   0:01 /sbin/init
message+    75  0.0  0.0  47052   484 ?        Ss   Oct27   0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root       107  0.0  0.0  32704   416 ?        Ss   Oct27   0:00 /lib/systemd/systemd-logind
syslog     109  0.0  0.1 186820  1440 ?        Ssl  Oct27   0:34 /usr/sbin/rsyslogd -n
root       111  0.0  0.1 277128  1112 ?        Ssl  Oct27   0:14 /usr/lib/accountsservice/accounts-daemon
root       114  0.0  0.0  26072   420 ?        Ss   Oct27   0:00 /usr/sbin/cron -f
dnsmasq    165  0.0  0.1  49988  1072 ?        S    Oct27   1:12 /usr/sbin/dnsmasq -x /var/run/dnsmasq/dnsmasq.pid -u dnsmasq -r /var/run/dnsmasq/resolv.conf -7 /etc/dnsmasq.d,.dpkg-dist,.dpkg-old,.dpkg-new --local-service --trust-anchor=.,19036,8,2,49AAC11D7B6F6446702E5
root       179  0.0  0.1  46864  1416 ?        Sl   Oct27   1:56 /usr/bin/monit -c /etc/monit/monitrc
root       236  0.0  0.0  65524   724 ?        Ss   Oct27   0:00 /usr/sbin/sshd -D
root       245  0.0  0.0  12848   140 pts/1    Ss+  Oct27   0:00 /sbin/agetty --noclear --keep-baud pts/1 115200 38400 9600 vt220
root       247  0.0  0.0  12848   140 console  Ss+  Oct27   0:00 /sbin/agetty --noclear --keep-baud console 115200 38400 9600 vt220
root       294  0.0  0.2 241284  2496 ?        Ss   Oct27   0:28 /usr/sbin/nmbd -D
root       311  0.0  0.2 337644  2892 ?        Ss   Oct27   0:00 /usr/sbin/smbd -D
root       312  0.0  0.2 326936  2444 ?        S    Oct27   0:00 /usr/sbin/smbd -D
root       340  0.0  0.2 337644  2756 ?        S    Oct27   0:01 /usr/sbin/smbd -D
root      3396  0.0  0.0  12848   144 pts/0    Ss+  06:21   0:00 /sbin/agetty --noclear --keep-baud pts/0 115200 38400 9600 vt220
root      3638  0.0  0.1 114228  1888 ?        Ss   15:36   0:00 sshd: root@pts/2
root      3640  0.0  0.0  36660   936 ?        Ss   15:36   0:00 /lib/systemd/systemd --user
root      3641  0.0  0.1  91068  1744 ?        S    15:36   0:00 (sd-pam)
root      3658  0.0  0.2  19272  3068 pts/2    Ss   15:36   0:00 -bash
root      3680  0.0  0.1  38580  1844 pts/2    R+   15:38   0:00 ps -aux

Code:
proxmox-ve: 5.1-25 (running kernel: 4.13.4-1-pve)
pve-manager: 5.1-36 (running version: 5.1-36/131401db)
pve-kernel-4.13.4-1-pve: 4.13.4-25
pve-kernel-4.10.17-3-pve: 4.10.17-23
pve-kernel-4.10.17-1-pve: 4.10.17-18
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: not correctly installed
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.2-pve1~bpo90

What is causing the high buffer usage?
 
Seems like journald was the culprit. It filled `/run/log/journal` with its logs, which caused the tmpfs mounted at /run to consume the whole memory allowance of the container.

Code:
root@dnsmasq:~# journalctl --disk-usage
Archived and active journals take up 144.0M on disk.
root@dnsmasq:~# df -h
Filesystem                    Size  Used Avail Use% Mounted on
vm_storage/subvol-111-disk-1   20G  555M   20G   3% /
none                          492K     0  492K   0% /dev
udev                           32G     0   32G   0% /dev/tty
tmpfs                          32G     0   32G   0% /dev/shm
tmpfs                          32G  145M   32G   1% /run
tmpfs                         5.0M     0  5.0M   0% /run/lock
tmpfs                          32G     0   32G   0% /sys/fs/cgroup
tmpfs                         6.3G     0  6.3G   0% /run/user/0
root@dnsmasq:~# free -m
              total        used        free      shared  buff/cache   available
Mem:           1024          92         760         434         171         760
Swap:             0           0           0
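To double-check how much of that "cache" is really the runtime journal sitting in tmpfs, something like this should work inside the CT (a quick sketch; `/run/log/journal` is journald's standard runtime location, and the fallback message is just mine so the command is runnable everywhere):

```shell
# Show how much space the runtime journal occupies under /run.
# Anything stored there lives in tmpfs and counts against the CT's
# memory limit, even though it is reported as cache.
du -sh /run/log/journal 2>/dev/null || echo "no runtime journal at /run/log/journal"
```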

Notice that the size of the tmpfs filesystems is not restricted by the memory constraints of the container.
Is this a problem with the Ubuntu 16.04 template?
 
Yep, I can confirm this issue with Ubuntu 16.04 CTs that generate a lot of logs. My environment differs slightly: Proxmox 4.4.

This was a real head scratcher until I came across this thread. Capping the journal size inside the CT has successfully worked around the issue.

Does this need a bug opened?
 
I would think so. The overall problem is that the tmpfs percentage limit isn't working properly.
 
No, I didn't, because I'm not quite sure who's in charge of the standard Ubuntu template. Is it maintained by Proxmox?
 
How can we manually fix this for a running template?
Or asked differently, how can we reduce the available tmpfs in cases like this?
 

I used the following approach:

In `/etc/systemd/journald.conf`, set:

Code:
SystemMaxUse=128M
RuntimeMaxUse=128M

followed by:

Code:
journalctl --vacuum-size=256M; systemctl force-reload systemd-journald
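Capping the journal fixes the main offender, but if you also want to shrink the /run tmpfs itself, so that no runaway writer can eat the whole memory limit, an explicit size cap is an option. This is my own addition rather than part of the fix above, and I haven't verified that systemd honors an fstab override for /run in every template, so treat it as a sketch (256m is an example value, not a recommendation):

```
# /etc/fstab inside the container -- cap the /run tmpfs at 256 MiB
tmpfs  /run  tmpfs  rw,nosuid,nodev,size=256m  0  0
```

For a running container, `mount -o remount,size=256m /run` should apply the same cap immediately, without a reboot.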
 
