I am running Proxmox in a cluster with HA enabled. From time to time (at 2-5 day intervals), every LXC container gets killed for no apparent reason. The containers seem to be restarted without leaving any logs; the only evidence of the restarts is their uptimes.
Looking at the dmesg output, I found that some containers run out of memory. It is understandable that a container in that state gets killed, but why are all of them killed?
One container in particular runs out of memory most often. It runs the ELK stack and has 2G of memory, which should be plenty.
I've included my pve version and the relevant dmesg logs below. Any insight as to what might be going wrong would be highly appreciated.
Code:
proxmox-ve: 4.2-56 (running kernel: 4.4.13-1-pve)
pve-manager: 4.2-15 (running version: 4.2-2/725d76f0)
pve-kernel-4.4.13-1-pve: 4.4.13-56
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.8-1-pve: 4.4.8-52
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-42
qemu-server: 4.0-83
pve-firmware: 1.1-8
libpve-common-perl: 4.0-70
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-55
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-19
pve-container: 1.0-70
pve-firewall: 2.0-29
pve-ha-manager: 1.0-32
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5.7-pve10~bpo80
Code:
[1489174.926977] java invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=0
[1489174.926981] java cpuset=110 mems_allowed=0
[1489174.926987] CPU: 3 PID: 1749 Comm: java Tainted: P O 4.4.13-1-pve #1
[1489174.926989] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[1489174.926991] 0000000000000286 00000000fda67529 ffff88019b4cbc70 ffffffff813ed3f3
[1489174.926994] ffff88019b4cbd48 ffff8801f088a800 ffff88019b4cbcd8 ffffffff8120942b
[1489174.926996] ffff88019b4cbca8 ffffffff81190c2b ffff88008f0f0000 ffff88008f0f0000
[1489174.926999] Call Trace:
[1489174.927007] [<ffffffff813ed3f3>] dump_stack+0x63/0x90
[1489174.927011] [<ffffffff8120942b>] dump_header+0x67/0x1d5
[1489174.927015] [<ffffffff81190c2b>] ? find_lock_task_mm+0x3b/0x80
[1489174.927017] [<ffffffff811911f5>] oom_kill_process+0x205/0x3c0
[1489174.927021] [<ffffffff811fd1a0>] ? mem_cgroup_iter+0x1d0/0x380
[1489174.927024] [<ffffffff811ff158>] mem_cgroup_out_of_memory+0x2a8/0x2f0
[1489174.927027] [<ffffffff811ffef7>] mem_cgroup_oom_synchronize+0x347/0x360
[1489174.927047] [<ffffffff811fb230>] ? mem_cgroup_css_online+0x240/0x240
[1489174.927050] [<ffffffff811918f4>] pagefault_out_of_memory+0x44/0xc0
[1489174.927054] [<ffffffff8106af2f>] mm_fault_error+0x7f/0x160
[1489174.927056] [<ffffffff8106b733>] __do_page_fault+0x3e3/0x410
[1489174.927058] [<ffffffff8106b7c7>] trace_do_page_fault+0x37/0xe0
[1489174.927064] [<ffffffff81063f49>] do_async_page_fault+0x19/0x70
[1489174.927069] [<ffffffff8184d2a8>] async_page_fault+0x28/0x30
[1489174.927071] Task in /lxc/110 killed as a result of limit of /lxc/110
[1489174.927075] memory: usage 1046764kB, limit 1048576kB, failcnt 13595223
[1489174.927077] memory+swap: usage 1572864kB, limit 1572864kB, failcnt 132729882
[1489174.927078] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[1489174.927079] Memory cgroup stats for /lxc/110: cache:6464KB rss:1040300KB rss_huge:0KB mapped_file:2816KB dirty:0KB writeback:0KB swap:526100KB inactive_anon:521764KB active_anon:521360KB inactive_file:1852KB active_file:1576KB unevictable:0KB
[1489174.927090] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[1489174.927281] [ 4182] 0 4182 7078 271 19 3 91 0 systemd
[1489174.927284] [ 4539] 0 4539 14478 769 32 3 17 0 systemd-journal
[1489174.927287] [ 4847] 0 4847 6350 58 17 3 1672 0 dhclient
[1489174.927290] [ 5224] 0 5224 9270 15 23 3 84 0 rpcbind
[1489174.927292] [ 5333] 0 5333 4756 0 15 3 46 0 atd
[1489174.927295] [ 5350] 0 5350 6869 16 18 3 45 0 cron
[1489174.927297] [ 5372] 0 5372 13796 28 32 3 139 -1000 sshd
[1489174.927300] [ 5417] 0 5417 4964 23 15 3 38 0 systemd-logind
[1489174.927302] [ 5498] 102 5498 10558 53 25 3 61 -900 dbus-daemon
[1489174.927305] [ 5857] 0 5857 64668 47 29 3 159 0 rsyslogd
[1489174.927308] [ 5983] 0 5983 3559 1 12 3 36 0 agetty
[1489174.927310] [ 6003] 0 6003 3559 1 12 3 38 0 agetty
[1489174.927312] [ 6555] 0 6555 9042 21 23 3 121 0 master
[1489174.927315] [ 6580] 100 6580 9570 22 23 3 114 0 qmgr
[1489174.927444] [10071] 0 10071 54528 0 36 4 302 0 bacula-fd
[1489174.927490] [31528] 999 31528 525345 110987 952 860 118421 0 node
[1489174.927548] [14452] 100 14452 9558 18 24 3 116 0 pickup
[1489174.927563] [32337] 0 32337 151273 3237 42 6 83 0 filebeat
[1489174.927575] [ 1641] 107 1641 1016796 91057 414 8 6995 0 java
[1489174.927583] [20950] 998 20950 894071 52342 245 7 0 0 java
[1489174.927587] [25666] 0 25666 12229 157 27 3 0 0 sshd
[1489174.927591] [25980] 0 25980 12229 158 27 3 0 0 sshd
[1489174.927593] [26454] 0 26454 12199 56 26 3 0 0 sshd
[1489174.927595] Memory cgroup out of memory: Kill process 31528 (node) score 588 or sacrifice child
[1489174.928829] Killed process 31528 (node) total-vm:2101380kB, anon-rss:443948kB, file-rss:0kB
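To make the cgroup lines in the log above easier to read, here is a minimal Python sketch that pulls out which cgroup hit its limit and converts the reported usage and limit to MiB. The names `SAMPLE`, `CGROUP_RE`, `LIMIT_RE`, and `summarize_oom` are hypothetical helpers invented for this illustration; the sample lines are copied from the dmesg excerpt. It only parses what is already in the log and does not explain why the other containers restart.

```python
import re

# Three lines copied verbatim from the dmesg excerpt above.
SAMPLE = """\
[1489174.927071] Task in /lxc/110 killed as a result of limit of /lxc/110
[1489174.927075] memory: usage 1046764kB, limit 1048576kB, failcnt 13595223
[1489174.927077] memory+swap: usage 1572864kB, limit 1572864kB, failcnt 132729882
"""

# Hypothetical patterns written for this log format only.
CGROUP_RE = re.compile(r"Task in (\S+) killed as a result of limit of (\S+)")
LIMIT_RE = re.compile(r"\] (memory(?:\+swap)?): usage (\d+)kB, limit (\d+)kB")


def summarize_oom(text):
    """Return the cgroup whose limit was hit and its usage/limit in MiB."""
    summary = {}
    for line in text.splitlines():
        m = CGROUP_RE.search(line)
        if m:
            summary["task_cgroup"] = m.group(1)
            summary["limit_cgroup"] = m.group(2)
            continue
        m = LIMIT_RE.search(line)
        if m:
            kind = m.group(1)
            usage_kb, limit_kb = int(m.group(2)), int(m.group(3))
            summary[kind] = {
                "usage_mib": usage_kb / 1024,
                "limit_mib": limit_kb / 1024,
            }
    return summary


if __name__ == "__main__":
    for key, value in summarize_oom(SAMPLE).items():
        print(key, value)
```

Run against these lines, the limit for /lxc/110 comes out as 1024 MiB of memory and 1536 MiB of memory+swap, which may be worth comparing with the 2G the container is expected to have.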