Q: OOM Killer > VM lockup/killed / RAM use confusion - any comments?

fortechitsolutions

Renowned Member
Jun 4, 2008
Hi, I wonder if someone can help me understand this situation. I've got a Proxmox node with a pretty classic config, i.e.:
OVH hardware hosting environment, MD RAID storage, 2 x 4 TB NVMe SSDs in a mirror as my primary Proxmox datastore
Host is a 6-core / 12-thread Xeon with 128 GB of physical RAM
I've got 7 VMs running on here, with a total of 108 GB of RAM allocated across all the VMs combined. The RAM is a static allocation, not a min-max range, and I have no swap set up at the Proxmox level. The Proxmox status view shows ~112 GB of RAM used on the server, as a flatline graph for the last day/week.

There is one VM with 64 GB of RAM allocated; all the others have smaller allocations (roughly 4-8 GB each).

So, the weird thing: a few hours ago, the one VM with 64 GB allocated locked up. I can see traces in the dmesg output on the Proxmox host showing that the OOM killer came in, did its thing, and killed off a process to free up RAM.

But I am puzzled why it had to do this. We have not added more VMs, changed RAM allocations, or touched any other config. Looking at the Proxmox main status, it tells me our KSM sharing is basically zero. I do have the guest agent installed in all these VMs (most are Windows Server 2016, a few Linux as well), so in theory there is possible balloon driver activity within some of the VMs.

I guess, at a high level, I am wondering:
  • Does it make sense for some situation to arise where Proxmox demands significantly more RAM, cannot get it, and so kills off a guest to free up RAM? Note that, in theory, we have 12 GB free on the physical host as our baseline.
  • Note we don't have ZFS operating here for the datastore, i.e., no weird hidden RAM requirement for our storage - just vanilla ext4 filesystems and Linux software MD RAID.
  • I am not sure whether having swap available to Proxmox might help it buffer against this kind of thing. I am generally not a fan of having swap available, since it tends to get randomly used for reasons I don't understand (i.e., even though RAM never goes above 95% allocated), so my baseline is to NOT have swap available - but maybe that is in fact bad / part of the problem here.
  • So - any hints or comments are greatly appreciated.

Just in case it is helpful, I will paste below the dmesg capture that points to the OOM killer event.

thank you!


Tim

Code:
[2240751.813135] CPU 7/KVM invoked oom-killer: gfp_mask=0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO), order=0, oom_score_adj=0
[2240751.831793] CPU: 8 PID: 3295028 Comm: CPU 7/KVM Tainted: P           O       6.2.16-6-pve #1
[2240751.843698] Hardware name: GIGABYTE MX33-BS1-V1/MX33-BS1-V1, BIOS F09c 06/09/2023
[2240751.854975] Call Trace:
[2240751.861854]  <TASK>
[2240751.868379]  dump_stack_lvl+0x48/0x70
[2240751.875430]  dump_stack+0x10/0x20
[2240751.881978]  dump_header+0x50/0x290
[2240751.888616]  oom_kill_process+0x10d/0x1c0
[2240751.895571]  out_of_memory+0x23c/0x570
[2240751.902210]  __alloc_pages+0x1180/0x13a0
[2240751.909083]  __folio_alloc+0x1d/0x60
[2240751.915748]  ? policy_node+0x69/0x80
[2240751.922546]  vma_alloc_folio+0x9f/0x3d0
[2240751.929624]  __handle_mm_fault+0x9c9/0x1070
[2240751.937000]  handle_mm_fault+0x119/0x330
[2240751.944191]  ? check_vma_flags+0xb4/0x190
[2240751.951352]  __get_user_pages+0x20c/0x6b0
[2240751.958401]  __gup_longterm_locked+0xc6/0xcc0
[2240751.965830]  get_user_pages_unlocked+0x76/0x100
[2240751.973520]  hva_to_pfn+0xb5/0x4d0 [kvm]
[2240751.980737]  __gfn_to_pfn_memslot+0xb5/0x150 [kvm]
[2240751.988777]  kvm_faultin_pfn+0xab/0x360 [kvm]
[2240751.996434]  direct_page_fault+0x331/0xa00 [kvm]
[2240752.004298]  ? kvm_hv_vapic_msr_write+0x33/0xf0 [kvm]
[2240752.012686]  ? kvm_hv_set_msr_common+0x7af/0x11c0 [kvm]
[2240752.021230]  kvm_tdp_page_fault+0x2d/0xb0 [kvm]
[2240752.029714]  kvm_mmu_page_fault+0x28a/0xb40 [kvm]
[2240752.038357]  ? kvm_set_msr_common+0x39a/0x11c0 [kvm]
[2240752.047293]  ? vmx_vmexit+0x6c/0xa5d [kvm_intel]
[2240752.055803]  ? vmx_vmexit+0x9a/0xa5d [kvm_intel]
[2240752.064311]  ? __pfx_handle_ept_violation+0x10/0x10 [kvm_intel]
[2240752.074166]  handle_ept_violation+0xcd/0x400 [kvm_intel]
[2240752.083516]  vmx_handle_exit+0x204/0xa40 [kvm_intel]
[2240752.092525]  kvm_arch_vcpu_ioctl_run+0xe02/0x1740 [kvm]
[2240752.101855]  ? _copy_to_user+0x25/0x60
[2240752.109683]  kvm_vcpu_ioctl+0x297/0x7c0 [kvm]
[2240752.118119]  ? kvm_on_user_return+0x89/0x100 [kvm]
[2240752.127017]  ? kvm_on_user_return+0x89/0x100 [kvm]
[2240752.135914]  ? __fget_light+0xa5/0x120
[2240752.143779]  __x64_sys_ioctl+0x9d/0xe0
[2240752.151641]  do_syscall_64+0x58/0x90
[2240752.159249]  ? do_syscall_64+0x67/0x90
[2240752.166995]  ? exit_to_user_mode_prepare+0x39/0x190
[2240752.175970]  ? syscall_exit_to_user_mode+0x29/0x50
[2240752.184893]  ? do_syscall_64+0x67/0x90
[2240752.192778]  ? do_syscall_64+0x67/0x90
[2240752.200453]  ? do_syscall_64+0x67/0x90
[2240752.207896]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[2240752.216646] RIP: 0033:0x7f3abcc3db3b
[2240752.223781] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[2240752.249977] RSP: 002b:00007f2a90bfa130 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[2240752.261783] RAX: ffffffffffffffda RBX: 000055f274c5edb0 RCX: 00007f3abcc3db3b
[2240752.272932] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000026
[2240752.284229] RBP: 000000000000ae80 R08: 000055f271d83e00 R09: 0000000000000000
[2240752.295439] R10: 0000000000000007 R11: 0000000000000246 R12: 0000000000000000
[2240752.306678] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
[2240752.317907]  </TASK>
[2240752.324153] Mem-Info:
[2240752.329558] active_anon:26656716 inactive_anon:5870219 isolated_anon:0
                  active_file:480 inactive_file:0 isolated_file:84
                  unevictable:36076 dirty:0 writeback:0
                  slab_reclaimable:15596 slab_unreclaimable:31805
                  mapped:20068 shmem:17965 pagetables:78864
                  sec_pagetables:2023 bounce:0
                  kernel_misc_reclaimable:0
                  free:145899 free_pcp:2 free_cma:0
[2240752.396326] Node 0 active_anon:106626864kB inactive_anon:23480876kB active_file:1708kB inactive_file:660kB unevictable:144304kB isolated(anon):0kB isolated(file):0kB mapped:80236kB dirty:0kB writeback:0kB shmem:71860kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 123996160kB writeback_tmp:0kB kernel_stack:7440kB pagetables:315456kB sec_pagetables:8092kB all_unreclaimable? no
[2240752.440925] Node 0 DMA free:11264kB boost:0kB min:4kB low:16kB high:28kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[2240752.478932] lowmem_reserve[]: 0 2372 128685 128685 128685
[2240752.488702] Node 0 DMA32 free:506452kB boost:0kB min:1212kB low:3580kB high:5948kB reserved_highatomic:0KB active_anon:1881180kB inactive_anon:33840kB active_file:0kB inactive_file:276kB unevictable:0kB writepending:0kB present:2495148kB managed:2429608kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[2240752.530412] lowmem_reserve[]: 0 0 126312 126312 126312
[2240752.540388] Node 0 Normal free:65996kB boost:0kB min:66360kB low:195692kB high:325024kB reserved_highatomic:0KB active_anon:104745468kB inactive_anon:23447036kB active_file:0kB inactive_file:1956kB unevictable:144304kB writepending:0kB present:131596288kB managed:129343848kB mlocked:144304kB bounce:0kB free_pcp:4kB local_pcp:0kB free_cma:0kB
[2240752.585436] lowmem_reserve[]: 0 0 0 0 0
[2240752.594471] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
[2240752.616571] Node 0 DMA32: 18*4kB (UME) 26*8kB (UME) 179*16kB (UME) 254*32kB (UME) 233*64kB (UME) 194*128kB (UME) 164*256kB (UME) 138*512kB (UME) 59*1024kB (UME) 6*2048kB (E) 66*4096kB (ME) = 506696kB
[2240752.645535] Node 0 Normal: 187*4kB (UME) 321*8kB (UME) 939*16kB (UME) 663*32kB (UME) 196*64kB (UME) 74*128kB (UME) 14*256kB (UM) 0*512kB 1*1024kB (M) 0*2048kB 0*4096kB = 66180kB
[2240752.673168] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[2240752.688115] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[2240752.702696] 21550 total pagecache pages
[2240752.712747] 0 pages in swap cache
[2240752.722197] Free swap  = 0kB
[2240752.731222] Total swap = 0kB
[2240752.740282] 33526857 pages RAM
[2240752.749636] 0 pages HighMem/MovableOnly
[2240752.759769] 579653 pages reserved
[2240752.769211] 0 pages hwpoisoned
[2240752.778248] Tasks state (memory values in pages):
[2240752.789120] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[2240752.803791] [    414]     0   414    35989      998   282624        0          -250 systemd-journal
[2240752.818981] [    461]     0   461     6656      907    73728        0         -1000 systemd-udevd
[2240752.834088] [    624]     0   624     1883      928    53248        0             0 mdadm
[2240752.848500] [    714]     0   714      615      288    45056        0             0 bpfilter_umh
[2240752.863437] [    749]   105   749     1997      864    57344        0             0 rpcbind
[2240752.878203] [    773]   101   773     2035      928    49152        0          -900 dbus-daemon
[2240752.893030] [    778]     0   778    38186      512    65536        0         -1000 lxcfs
[2240752.907319] [    796]     0   796    69539      864    81920        0             0 pve-lxc-syscall
[2240752.922474] [    805]     0   805     1328      384    49152        0             0 qmeventd
[2240752.937103] [    807]     0   807     2797     1056    61440        0             0 smartd
[2240752.951536] [    808]     0   808     4267     1088    69632        0             0 systemd-logind
[2240752.966581] [    809]     0   809      583      288    36864        0         -1000 watchdog-mux
[2240752.981414] [    813]     0   813    25347      800    77824        0             0 zed
[2240752.995483] [    868]   104   868     4715      810    57344        0             0 chronyd
[2240753.009889] [    869]   104   869     2633      562    57344        0             0 chronyd
[2240753.024248] [    871]     0   871   160247     8660   458752        0             0 fail2ban-server
[2240753.039121] [    872]     0   872     1256      544    49152        0             0 lxc-monitord
[2240753.053669] [    905]     0   905     7173     2784   102400        0             0 unattended-upgr
[2240753.068224] [    967]     0   967     3851     1056    73728        0         -1000 sshd
[2240753.081691] [   1181]     0  1181    10664      635    77824        0             0 master
[2240753.095245] [   1201]     0  1201      723      448    40960        0             0 agetty
[2240753.108832] [   1203]     0  1203      734      448    45056        0             0 agetty
[2240753.122149] [   1206]     0  1206   181841      869   167936        0             0 rrdcached
[2240753.135635] [   1215]     0  1215   159402    17917   479232        0             0 pmxcfs
[2240753.148698] [   1222]     0  1222   139329    41176   393216        0             0 corosync
[2240753.161872] [   1238]     0  1238    70165    23447   311296        0             0 pvestatd
[2240753.174920] [   1239]     0  1239    70234    23104   274432        0             0 pve-firewall
[2240753.188222] [   1263]     0  1263    88582    32048   376832        0             0 pvedaemon
[2240753.201126] [   1270]     0  1270    85640    26440   339968        0             0 pve-ha-crm
[2240753.213973] [   1298]    33  1298    88995    32850   434176        0             0 pveproxy
[2240753.226520] [   1305]    33  1305    19357    13312   196608        0             0 spiceproxy
[2240753.239142] [   1307]     0  1307    85507    26532   348160        0             0 pve-ha-lrm
[2240753.251616] [   1314]     0  1314    84419    26598   335872       32             0 pvescheduler
[2240753.264125] [  56552]     0 56552  4116449  2152349 31404032      224             0 kvm
[2240753.275797] [ 374534]  1000 374534     4812     1248    73728        0           100 systemd
[2240753.287740] [ 374535]  1000 374535    42345     1330    90112        0           100 (sd-pam)
[2240753.299739] [ 374666]     0 374666     1086      513    45056        0             0 screen
[2240753.311580] [ 374667]     0 374667     1152      672    45056        0             0 bash
[2240753.323114] [ 374718]     0 374718     2306      476    45056        0             0 rsync
[2240753.334767] [1780816]     0 1780816  3034689  2112726 22724608        0             0 kvm
[2240753.346371] [ 112854]     0 112854     1005      480    45056        0             0 cron
[2240753.358032] [ 999902]     0 999902     4413     1312    77824        0             0 sshd
[2240753.369701] [ 999905]     0 999905     4837     1280    77824        0           100 systemd
[2240753.381561] [ 999906]     0 999906    42709     1716    94208        0           100 (sd-pam)
[2240753.393520] [ 999927]     0 999927     1667      896    49152        0             0 bash
[2240753.405187] [1000658]   106 1000658    10684      832    77824        0             0 qmgr
[2240753.416926] [1005341]     0 1005341     4413     1280    77824        0             0 sshd
[2240753.428592] [1005349]     0 1005349     1390      928    45056        0             0 bash
[2240753.440133] [1066291]   106 1066291    12104     1056    86016        0             0 tlsmgr
[2240753.451882] [1098115]     0 1098115  2864374  2164728 21618688        0             0 kvm
[2240753.463405] [1098312]     0 1098312  2953060  1531121 17444864        0             0 kvm
[2240753.474908] [1098543]     0 1098543  2894262  2164229 21766144        0             0 kvm
[2240753.486404] [1098753]     0 1098753  5843613  4214546 44752896        0             0 kvm
[2240753.497914] [1224248]     0 1224248    90831    32561   421888        0             0 pvedaemon worke
[2240753.510735] [1226480]     0 1226480    90838    32657   421888        0             0 pvedaemon worke
[2240753.523285] [1419897]     0 1419897  3420482  2125884 25452544        0             0 kvm
[2240753.534788] [1425711]     0 1425711    90798    32529   421888        0             0 pvedaemon worke
[2240753.547347] [3085073]    33 3085073    21362    13332   196608        0             0 spiceproxy work
[2240753.559971] [3085074]    33 3085074    91182    32914   430080        0             0 pveproxy worker
[2240753.572631] [3085075]    33 3085075    91082    32722   421888        0             0 pveproxy worker
[2240753.585175] [3085076]    33 3085076    91187    32882   421888        0             0 pveproxy worker
[2240753.597750] [3085078]     0 3085078    19796      384    49152        0             0 pvefw-logger
[2240753.610178] [3266653]   106 3266653    10672      832    77824        0             0 pickup
[2240753.621942] [3294990]     0 3294990 17418716 15797859 128987136        0             0 kvm
[2240753.633573] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=qemu.slice,mems_allowed=0,global_oom,task_memcg=/qemu.slice/107.scope,task=kvm,pid=3294990,uid=0
[2240753.655631] Out of memory: Killed process 3294990 (kvm) total-vm:69674864kB, anon-rss:63189260kB, file-rss:2176kB, shmem-rss:0kB, UID:0 pgtables:125964kB oom_score_adj:0
 
I am not sure whether having swap available to Proxmox might help it buffer against this kind of thing. I am generally not a fan of having swap available, since it tends to get randomly used for reasons I don't understand (i.e., even though RAM never goes above 95% allocated), so my baseline is to NOT have swap available - but maybe that is in fact bad / part of the problem here.
There is also RAM fragmentation, where you have free RAM that can't actually be used, so having 16 GB of free RAM doesn't mean PVE can actually use all of it. I'm no RAM expert, but I've heard swap would help there too, to clean up the RAM.

I am generally not a fan of having swap available, since it tends to get randomly used for reasons I don't understand
If you just want swap to be present to prevent OOM but you don't want PVE to actually use it, you could set the swappiness to 0. Then no swap will be used unless it is unavoidable.
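For reference, a minimal sketch of setting that persistently on the PVE host - this is the standard Linux sysctl mechanism, nothing Proxmox-specific, and the file name below is just an example:

Code:
# check the current value
sysctl vm.swappiness

# change it at runtime
sysctl -w vm.swappiness=0

# make it persistent across reboots (example file name)
echo 'vm.swappiness = 0' > /etc/sysctl.d/99-swappiness.conf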
 
Having some swap usage isn't bad, even if you don't understand why it happens. What @Dunuin said is one reason why you should have some, even if it is just a couple of GiB.

What is bad is swapping things in and out - that hurts performance. But just having little-used data sit in swap doesn't hurt anything, and in some circumstances it is helpful.
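A quick way to tell the two apart with standard tools (a sketch: persistent nonzero si/so columns mean active swapping, while a static "used" figure with si/so at zero is just parked data):

Code:
# watch swap-in (si) and swap-out (so), one sample per second
vmstat 1

# overall swap used vs. available
free -h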
 
In those "i don't want swap on disk" situations, just install zram, which is compressed swap in RAM so that you can acutally swap out some things and get the OOM later.
 
Hi everyone, thank you for all the great replies. I must admit I had forgotten about the obvious baseline problem of "what if RAM is heavily sliced up" - this was definitely a factor here, i.e.:

  • the server had been online for ~90 days
  • initially all VMs were spun up, with less than 64 GB allocated to my 'big RAM' VM
  • over a 3-week period I tweaked RAM settings on some of the other VMs, for example dropping one VM's allocation from 16 GB to 8 GB because it only appeared to be using approx 4-6 GB; rinse and repeat for many of the 'smaller RAM need' VMs
  • this "apparently" freed up some RAM in Proxmox, so I allocated more to my big-RAM VM (a SQL server with a very big DB)
  • that seemed to go OK
  • but then the RAM drama / OOM fun began

So, what I've done in the last 24 hours:
- scheduled and did a reboot of Proxmox, after first reviewing and confirming my backups are good
- set up swap so we have a 64 GB swapfile on the NVMe-SSD-backed RAID storage (see the sketch after this list)
- also first applied all pending Proxmox updates prior to the reboot, so those are all active and in place
- then did the reboot
- after doing this, my baseline RAM usage with all VMs spun up 'as before' was about 8-10 GB lower, which is nice / interesting
- I also realized, my oversight, that KSM is not enabled by default (in the past I think it was turned on by default?), so I re-read the docs on KSM and why it might be undesirable (i.e., security implications in some use cases). For this use case I believe KSM is a no-brainer good thing (I have 6-8 Windows Server VMs running here, most on the same version of Windows, so good candidates for KSM dedup)
- after turning KSM on and letting it do its page monitoring/dedup work for an hour or two, it made great progress and freed up approx 40-50 GB of RAM
- after the SQL server was put back into full service, the KSM savings shrank, since we have a large RAM demand from that one SQL server VM - makes sense - but I am still seeing approx 24 GB of RAM saved by KSM here now. So that is still a win, I think
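For completeness, here is roughly what that swapfile setup looks like - a sketch only; the /swapfile path is just an example, 64 GB matches what I described above, and fallocate is fine for swapfiles on ext4:

Code:
# create and enable a 64 GiB swapfile on the ext4 datastore
fallocate -l 64G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# make it persistent across reboots
echo '/swapfile none swap sw 0 0' >> /etc/fstab

# verify
swapon --show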

Endgame for the moment - reboot was done last night - things are running smoothly this morning. Everything seems great. I will monitor and post back to the thread next week to 'check in'.

In general, I am assuming that KSM has some 'clever' way to avoid causing memory fragmentation on Proxmox as part of its operation (i.e., since it inherently plays with available vs. allocated RAM, I am assuming it does so in a way that does not cause 'infinite fragmentation of allocated RAM over time' on Proxmox?) - but ultimately, if it works, I am not too fussed.
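Side note, in case anyone wants to watch what KSM is actually doing: a quick sketch using the standard sysfs counters KSM exposes (nothing Proxmox-specific):

Code:
# 1 = KSM running, 0 = stopped
cat /sys/kernel/mm/ksm/run

# pages being shared, and how many duplicate pages they stand in for
cat /sys/kernel/mm/ksm/pages_shared
cat /sys/kernel/mm/ksm/pages_sharing

# rough RAM saved = pages_sharing * 4 KiB page size
echo "$(( $(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024 )) MiB saved"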

thank you again everyone for the help - it is greatly appreciated!

Tim
 
In general, I am assuming that KSM has some 'clever' way to avoid causing memory fragmentation on Proxmox as part of its operation (i.e., since it inherently plays with available vs. allocated RAM, I am assuming it does so in a way that does not cause 'infinite fragmentation of allocated RAM over time' on Proxmox?) - but ultimately, if it works, I am not too fussed.
No, fragmentation is omnipresent. Please read about the slab allocator and how free slabs are freed / combined to get bigger chunks. I can also recommend looking into hugepages, which are PERFECT for getting rid of exactly the kind of problem you have, yet they can be a bit harder to set up and maintain over time. They also only make sense in certain situations, and you need to evaluate their use carefully, because you can easily run into similar problems - just different ones than before.
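For what it's worth, a quick sketch of how to actually see that fragmentation using standard kernel interfaces (nothing Proxmox-specific):

Code:
# free blocks per order (columns go from 4K up to 4M blocks);
# lots of small entries and empty high-order columns = fragmented
cat /proc/buddyinfo

# ask the kernel to compact memory (best effort, may take a while)
echo 1 > /proc/sys/vm/compact_memory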
 
Hi LnxBil, thank you for this added detail. I just did some digging/reading online; so far the clearest thing I could find about HugePages and their possible impact on KSM was this one > here in redhat docs

If I am reading it correctly, enabling HugePages will likely give better performance for VMs with large RAM needs, but it may also reduce KSM's efficiency at reclaiming memory.

For now I will wait and see with my current config before rocking the boat with more changes. So far, at 1.5 days of uptime since my reboot/config adjustments, RAM use has reached a pretty steady state around 90 GB 'in use' according to the Proxmox graph (down from ~111 GB used before turning KSM on, and up from ~80 GB, which is where things sat shortly after the 'enable KSM and reboot for a clean slate'). I.e., things are feeling pretty good right now.

Additionally, I don't specifically have a desire to squeeze more VMs onto this host - this is not a situation of "try to fit in more and more VMs" so much as "fit the defined pool of VMs onto the host and have it run smoothly and entirely without drama". I think we were just a bit too close to 'full' on RAM in my old pre-KSM config, and now, with KSM turned on (and also some swap available), the RAM situation is simply less pressured. I think it will become clearer after I wait a week to be sure it is good now, but so far - seems good.

Anyhoo - thanks for the extra info on this. And I will post a follow-up status update to the thread in the coming week, after more time elapses, in case it is of interest.

Tim
 
Hi LnxBil, thank you for this added detail. I just did some digging/reading online; so far the clearest thing I could find about HugePages and their possible impact on KSM was this one > here in redhat docs

If I am reading it correctly, enabling HugePages will likely give better performance for VMs with large RAM needs, but it may also reduce KSM's efficiency at reclaiming memory.
Yes, those two optimize different things and are mutually exclusive. Hugepages are a CPU feature, and KSM can only merge hugepages (default 2M or 1G) if they are fully identical, which is almost never the case - unlike with 4K pages. Hugepages do not fragment and need less space for the page table, yet they have to match the VMs in play. Hugepages are (or used to be) also non-swappable, so that can be a problem in low-memory conditions as well.

Running e.g. 1G hugepages is awesome if you only have big VMs.
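For context, a rough sketch of what a 1G-hugepages setup can look like on a PVE host - the page count and VMID below are example values only, and the hugepages VM option should be checked against the current Proxmox docs before relying on it:

Code:
# 1) Reserve 1G hugepages at boot via the kernel command line,
#    e.g. in /etc/default/grub, then run update-grub and reboot:
GRUB_CMDLINE_LINUX_DEFAULT="quiet default_hugepagesz=1G hugepagesz=1G hugepages=64"

# 2) Back a VM's memory with those hugepages (value is the page size in MiB):
qm set 100 --hugepages 1024

# 3) Check reservation and usage:
grep Huge /proc/meminfo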
 
