[SOLVED] Host OOM killing containers and KVM, but plenty of RAM available

Feb 27, 2020
Hi,
I have a 3-node cluster as a development environment; the nodes are Dell R620s with roughly 200GB RAM and 48 cores each. I recently found that on one of the nodes, OOM keeps killing the same daemon running inside an LXC container, and also a VM.

The host appears to have plenty of free RAM: looking at the memory summary graph in the web UI, usage spikes up to about 100GB while nearly 80GB remain free.
The memory section of the OOM report is pasted below, but I can't make much of it apart from seeing that the host has run out of swap.

Can anyone give me a hint? It does not look like a lack of free RAM to me, but free -hm reports most of the memory as shared:

Code:
root@proxmox-1:~# free -hm
              total        used        free      shared  buff/cache   available
Mem:          173Gi        73Gi       961Mi        90Gi        98Gi       7.9Gi
Swap:         8.0Gi       3.2Gi       4.8Gi
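
(For what it's worth, as far as I can tell the "shared" column in free is taken from Shmem and "available" from MemAvailable in /proc/meminfo, so those 90Gi are shmem-backed memory rather than reclaimable cache. That can be checked directly with:

Code:
grep -E 'MemAvailable|Shmem:' /proc/meminfo
)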

The OOM report:

Code:
Jan 20 04:13:50 proxmox-1 kernel: [86074.425454] Mem-Info:
Jan 20 04:13:50 proxmox-1 kernel: [86074.425466] active_anon:29519471 inactive_anon:14942755 isolated_anon:0
Jan 20 04:13:50 proxmox-1 kernel: [86074.425466] active_file:532 inactive_file:880 isolated_file:0
Jan 20 04:13:50 proxmox-1 kernel: [86074.425466] unevictable:46700 dirty:0 writeback:0 unstable:0
Jan 20 04:13:50 proxmox-1 kernel: [86074.425466] slab_reclaimable:229939 slab_unreclaimable:152941
Jan 20 04:13:50 proxmox-1 kernel: [86074.425466] mapped:15175352 shmem:22540992 pagetables:105914 bounce:0
Jan 20 04:13:50 proxmox-1 kernel: [86074.425466] free:121030 free_pcp:1231 free_cma:0
Jan 20 04:13:50 proxmox-1 kernel: [86074.425471] Node 0 active_anon:44693200kB inactive_anon:51957492kB active_file:316kB inactive_file:528kB unevictable:176632kB isolated(anon):0kB isolated(file):0kB mapped:47389748kB dirty:0kB writeback:0kB shmem:47394372kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 22093824kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
Jan 20 04:13:50 proxmox-1 kernel: [86074.425476] Node 1 active_anon:73384684kB inactive_anon:7813528kB active_file:1812kB inactive_file:2992kB unevictable:10168kB isolated(anon):0kB isolated(file):0kB mapped:13311660kB dirty:0kB writeback:0kB shmem:42769596kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1910784kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Jan 20 04:13:50 proxmox-1 kernel: [86074.425478] Node 0 DMA free:15896kB min:4kB low:16kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15980kB managed:15896kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Jan 20 04:13:50 proxmox-1 kernel: [86074.425483] lowmem_reserve[]: 0 1882 96596 96596 96596
Jan 20 04:13:50 proxmox-1 kernel: [86074.425488] Node 0 DMA32 free:379812kB min:956kB low:2880kB high:4804kB active_anon:1571916kB inactive_anon:5156kB active_file:4kB inactive_file:0kB unevictable:0kB writepending:0kB present:2034624kB managed:1969088kB mlocked:0kB kernel_stack:0kB pagetables:2536kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Jan 20 04:13:50 proxmox-1 kernel: [86074.425493] lowmem_reserve[]: 0 0 94714 94714 94714
Jan 20 04:13:50 proxmox-1 kernel: [86074.425497] Node 0 Normal free:47548kB min:48160kB low:145144kB high:242128kB active_anon:43121284kB inactive_anon:51952336kB active_file:312kB inactive_file:528kB unevictable:176632kB writepending:0kB present:98566144kB managed:96987768kB mlocked:176632kB kernel_stack:9112kB pagetables:208888kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Jan 20 04:13:50 proxmox-1 kernel: [86074.425502] lowmem_reserve[]: 0 0 0 0 0
Jan 20 04:13:50 proxmox-1 kernel: [86074.425506] Node 1 Normal free:40864kB min:40984kB low:123520kB high:206056kB active_anon:73384684kB inactive_anon:7813528kB active_file:1812kB inactive_file:2992kB unevictable:10168kB writepending:0kB present:83886080kB managed:82544048kB mlocked:10168kB kernel_stack:9480kB pagetables:212232kB bounce:0kB free_pcp:4928kB local_pcp:252kB free_cma:0kB
Jan 20 04:13:50 proxmox-1 kernel: [86074.425511] lowmem_reserve[]: 0 0 0 0 0
Jan 20 04:13:50 proxmox-1 kernel: [86074.425514] Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15896kB
Jan 20 04:13:50 proxmox-1 kernel: [86074.425526] Node 0 DMA32: 105*4kB (UME) 112*8kB (UMH) 124*16kB (UEH) 134*32kB (UE) 138*64kB (UMEH) 145*128kB (UME) 125*256kB (UEH) 117*512kB (UMEH) 75*1024kB (UEH) 30*2048kB (ME) 28*4096kB (M) = 379812kB
Jan 20 04:13:50 proxmox-1 kernel: [86074.425540] Node 0 Normal: 641*4kB (UM) 395*8kB (UMEH) 1676*16kB (UMEH) 469*32kB (UME) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 47548kB
Jan 20 04:13:50 proxmox-1 kernel: [86074.425549] Node 1 Normal: 149*4kB (UMEH) 1899*8kB (UEH) 950*16kB (UEH) 276*32kB (UEH) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 39820kB
Jan 20 04:13:50 proxmox-1 kernel: [86074.425560] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jan 20 04:13:50 proxmox-1 kernel: [86074.425562] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jan 20 04:13:50 proxmox-1 kernel: [86074.425564] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jan 20 04:13:50 proxmox-1 kernel: [86074.425565] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jan 20 04:13:50 proxmox-1 kernel: [86074.425566] 22575802 total pagecache pages
Jan 20 04:13:50 proxmox-1 kernel: [86074.425570] 29334 pages in swap cache
Jan 20 04:13:50 proxmox-1 kernel: [86074.425571] Swap cache stats: add 3142908, delete 3113488, find 2096445/2333307
Jan 20 04:13:50 proxmox-1 kernel: [86074.425572] Free swap = 0kB
Jan 20 04:13:50 proxmox-1 kernel: [86074.425573] Total swap = 8388604kB
Jan 20 04:13:50 proxmox-1 kernel: [86074.425574] 46125707 pages RAM
Jan 20 04:13:50 proxmox-1 kernel: [86074.425575] 0 pages HighMem/MovableOnly
Jan 20 04:13:50 proxmox-1 kernel: [86074.425576] 746507 pages reserved
Jan 20 04:13:50 proxmox-1 kernel: [86074.425576] 0 pages cma reserved
Jan 20 04:13:50 proxmox-1 kernel: [86074.425577] 0 pages hwpoisoned
 
Shared memory isn't empty and can't be freed easily, so it's no wonder the OOM killer kicked in with only ~1GiB free and just 7.9GiB available. The question is: why is so much memory shared?
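
A few places to look for where the shared memory is going (just a rough checklist; tmpfs mounts and SysV shared memory are the usual suspects):

Code:
# tmpfs mounts - everything stored in them lives in RAM/swap and is counted as shared
df -h -t tmpfs
# SysV shared memory segments and their owners
ipcs -m
# POSIX shared memory files
ls -lh /dev/shm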
 
Hi,

I can't give you technical details, but maybe it has to do with fragmentation.
As I understand it, not only the total amount of free RAM matters, but also whether that free RAM sits in the right zone and is available in the right block sizes.


Code:
Jan 20 04:13:50 proxmox-1 kernel: [86074.425514] Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15896kB
Jan 20 04:13:50 proxmox-1 kernel: [86074.425526] Node 0 DMA32: 105*4kB (UME) 112*8kB (UMH) 124*16kB (UEH) 134*32kB (UE) 138*64kB (UMEH) 145*128kB (UME) 125*256kB (UEH) 117*512kB (UMEH) 75*1024kB (UEH) 30*2048kB (ME) 28*4096kB (M) = 379812kB
Jan 20 04:13:50 proxmox-1 kernel: [86074.425540] Node 0 Normal: 641*4kB (UM) 395*8kB (UMEH) 1676*16kB (UMEH) 469*32kB (UME) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 47548kB
Jan 20 04:13:50 proxmox-1 kernel: [86074.425549] Node 1 Normal: 149*4kB (UMEH) 1899*8kB (UEH) 950*16kB (UEH) 276*32kB (UEH) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 39820kB

Here you can see that some of the larger block pools are completely exhausted:

Code:
 ..... 0*2048kB 0*4096kB ....

and so on.

you can check with:

Code:
cat /proc/buddyinfo

Example Output:

Code:
Node 0, zone      DMA      0      1      0      1      1      1      1      0      1      1      3
Node 0, zone    DMA32  16157   6249   3454    866    965    401    238    143     68      0      0
Node 0, zone   Normal  17673  12838   5634    420      9      0      0      0      0      1      0
Node 1, zone   Normal 270429  11838    730    389    159     50     22      0      0      0      0


I think this information will help you dig further.
 
Thanks both.
I have been doing a lot of digging and testing on this:
  1. The first conclusion is that the memory usage chart in the Proxmox web UI is misleading, because it does not reflect shared memory. As @Dunuin mentioned, shared memory is not free memory, but since it isn't shown in the chart I was heavily overcommitting memory on the host, so the OOM kills of both the VM and the LXC were my own fault. Once I addressed that and left enough headroom (see the sketch below for how I now tally allocated memory against host RAM), the VM is no longer being OOM-killed; the LXC still is.
  2. This LXC had been recreated with Ansible on the other nodes of the cluster: exactly the same config, the same LXC base image, and the other LXCs run exactly the same process with the same load, yet they do not get OOM-killed. I have also migrated containers that work fine to this node, and they do not get OOM-killed either.
  3. I have migrated all the LXCs to a different node: same behaviour. The original LXC gets OOM-killed, the others do not.
In the end I simply deleted and recreated the original LXC and everything is fine, no OOM. It is weird and I cannot find a reason for it, but everything has now been working fine for at least three weeks.
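
For reference, this is roughly how I now add up the memory configured for all guests on a node and compare it against the host's physical RAM (a quick sketch; it only reads the static memory: settings, not ballooning minimums or actual usage):

Code:
# sum the configured memory (in MiB) of all VMs and containers on this node
{ for id in $(qm list | awk 'NR>1 {print $1}'); do qm config "$id"; done; \
  for id in $(pct list | awk 'NR>1 {print $1}'); do pct config "$id"; done; } \
  | awk '/^memory:/ {sum += $2} END {print sum " MiB allocated to guests"}'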
 
