OOM errors after upgrading to Proxmox 8

moxmox

Active Member
Aug 14, 2019
I had been running Proxmox 7 and never had an issue. Since upgrading to Proxmox 8, I have had OOM errors twice in the last two weeks, and it ends up killing the VM that's assigned the most memory.

Can anyone explain why this is?

I have 64 GiB of RAM.

I have four containers

103 2.00GiB assigned
104 2.00GiB assigned
166 2.00GiB assigned
167 1.00GiB assigned

and three vms

100 16.00GiB assigned
101 8.00GiB assigned
102 6.00GiB assigned

= 37GiB total

summary.png
 
Attached is the full syslog from when it happened (I would have pasted it all in, but it was too large).

Code:
Sep 09 18:40:27 nuc11propve kernel: systemd-journal invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=-250

Sep 09 18:40:27 nuc11propve kernel: CPU: 5 PID: 1856820 Comm: systemd-journal Tainted: P           O       6.2.16-10-pve #1

Sep 09 18:40:27 nuc11propve kernel: Hardware name: Intel(R) Client Systems NUC11TNHv5/NUC11TNBv5, BIOS TNTGLV57.0074.2023.0705.1856 07/05/2023

Sep 09 18:40:27 nuc11propve kernel: Call Trace:

Sep 09 18:40:27 nuc11propve kernel:  <TASK>

Sep 09 18:40:27 nuc11propve kernel:  dump_stack_lvl+0x48/0x70

Sep 09 18:40:27 nuc11propve kernel:  dump_stack+0x10/0x20

Sep 09 18:40:27 nuc11propve kernel:  dump_header+0x50/0x290

Sep 09 18:40:27 nuc11propve kernel:  oom_kill_process+0x10d/0x1c0

Sep 09 18:40:27 nuc11propve kernel:  out_of_memory+0x23c/0x570

Sep 09 18:40:27 nuc11propve kernel:  __alloc_pages+0x1180/0x13a0

Sep 09 18:40:28 nuc11propve kernel:  ? rrw_exit+0x72/0x170 [zfs]

Sep 09 18:40:28 nuc11propve kernel:  alloc_pages+0x90/0x1a0

Sep 09 18:40:28 nuc11propve kernel:  folio_alloc+0x1d/0x60

Sep 09 18:40:28 nuc11propve kernel:  filemap_alloc_folio+0xfd/0x110

Sep 09 18:40:28 nuc11propve kernel:  __filemap_get_folio+0x1d4/0x3c0

Sep 09 18:40:28 nuc11propve kernel:  filemap_fault+0x14a/0x940

Sep 09 18:40:28 nuc11propve kernel:  ? filemap_map_pages+0x14b/0x6f0

Sep 09 18:40:28 nuc11propve kernel:  __do_fault+0x36/0x150

Sep 09 18:40:28 nuc11propve kernel:  do_fault+0x1c7/0x430

Sep 09 18:40:28 nuc11propve kernel:  __handle_mm_fault+0x6d9/0x1070

Sep 09 18:40:28 nuc11propve kernel:  handle_mm_fault+0x119/0x330

Sep 09 18:40:28 nuc11propve kernel:  ? lock_mm_and_find_vma+0x43/0x230

Sep 09 18:40:28 nuc11propve kernel:  do_user_addr_fault+0x194/0x620

Sep 09 18:40:28 nuc11propve kernel:  exc_page_fault+0x80/0x1b0

Sep 09 18:40:28 nuc11propve kernel:  asm_exc_page_fault+0x27/0x30

Sep 09 18:40:28 nuc11propve kernel: RIP: 0033:0x7fadf5cf23a0

Sep 09 18:40:28 nuc11propve kernel: Code: Unable to access opcode bytes at 0x7fadf5cf2376.

Sep 09 18:40:28 nuc11propve kernel: RSP: 002b:00007ffd93e22cd8 EFLAGS: 00010202

Sep 09 18:40:28 nuc11propve kernel: RAX: 0000000000000007 RBX: 000055a0aef8dd20 RCX: 00007fadf5d69780

Sep 09 18:40:28 nuc11propve kernel: RDX: 0000000000000001 RSI: 0000000000000007 RDI: 000055a0aef8dd20

Sep 09 18:40:28 nuc11propve kernel: RBP: 000055a0aef7dbc0 R08: 00000000000000a7 R09: 0000000006758900

Sep 09 18:40:29 nuc11propve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000007

Sep 09 18:40:29 nuc11propve kernel: R13: 0000000000218420 R14: 0000000000000000 R15: 00007ffd93e22d00

Sep 09 18:40:29 nuc11propve kernel:  </TASK>

Sep 09 18:40:29 nuc11propve kernel: Mem-Info:

Sep 09 18:40:29 nuc11propve kernel: active_anon:6812440 inactive_anon:736168 isolated_anon:0

 active_file:2170 inactive_file:2100 isolated_file:3

 unevictable:46 dirty:132 writeback:360

 slab_reclaimable:111802 slab_unreclaimable:2808331

 mapped:26809 shmem:92999 pagetables:21159

 sec_pagetables:14533 bounce:0

 kernel_misc_reclaimable:0

 free:123875 free_pcp:251 free_cma:0
 

Attachments

  • nuc11pve oom error 10-09-23.txt
    26 KB
The memory usage seems to creep up over the days - surely, if this is the ZFS cache, it should be released in a low-memory situation?
 
root@nuc11propve:~# arc_summary | grep "ARC size (current)"
ARC size (current): 93.7 % 29.3 GiB
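For reference, the currently configured ARC ceiling can be checked via the OpenZFS module parameter (a value of 0 should mean the default, i.e. up to 50% of the installed RAM):
Code:
root@nuc11propve:~# cat /sys/module/zfs/parameters/zfs_arc_max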
 
Do you have any more info re these issues? I may try disabling ksm for a bit
Code:
$> apt-get install sysfsutils

# /etc/sysfs.d/ksm.conf
kernel/mm/ksm/run = 0

$> systemctl restart sysfsutils

# reboot, then check the setting:
$> cat /sys/kernel/mm/ksm/run

# 0 -> ksm is disabled (reads 0 while ksm is disabled)
# 1 -> ksm is running (reads 1 while ksm is running)
# 2 -> disable ksm and unmerge all its pages
 
The memory usage seems to creep up over the days - surely, if this is the ZFS cache, it should be released in a low-memory situation?

It should be, yes; but often it is simply too slow in doing so.
I would suggest limiting the ARC (as a first, and maybe even the only, step):
https://pve.proxmox.com/wiki/ZFS_on_Linux#sysadmin_zfs_limit_memory_usage
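A minimal sketch of what that wiki section describes, assuming a 16 GiB cap (16 * 1024^3 = 17179869184 bytes); adjust the value to your workload:
Code:
# /etc/modprobe.d/zfs.conf - persistent limit, picked up on module load
options zfs zfs_arc_max=17179869184

# apply immediately on the running system, without a reboot
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max

# if the root filesystem is on ZFS, refresh the initramfs so the
# setting also applies at boot
update-initramfs -u -k all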

Do you have any more info re these issues? I may try disabling ksm for a bit

https://pve.proxmox.com/wiki/Kernel_Samepage_Merging_(KSM) -> "Disabling KSM"
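If I read the linked page correctly, disabling KSM there boils down to roughly this (stop the tuning daemon, then unmerge the already-shared pages):
Code:
# stop and disable the KSM tuning service
systemctl disable --now ksmtuned

# unmerge all currently merged pages
echo 2 > /sys/kernel/mm/ksm/run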
 
What exactly are the problems with KSM, and is it known upstream? I cannot think of anything but lower memory usage with KSM.
 
You can turn down the ARC size.
IIRC, PVE 8 has a problem with KSM.
I mean, KSM works in PVE 7, so more RAM gets merged; in PVE 8 (with the 6.2 kernel) it cannot achieve the same result, and then you get OOM.
The issue only shows up in specific cases.
Reduce the ARC and add swap, as LnxBill said.
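For the swap part, a plain swap file is the usual quick option; note that swap on ZFS itself (file or zvol) is generally discouraged, so this sketch assumes a non-ZFS filesystem with enough free space (path and size are just examples):
Code:
# create and enable an 8 GiB swap file
fallocate -l 8G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# make it persistent across reboots
echo '/swapfile none swap sw 0 0' >> /etc/fstab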
 
Thanks for all the help. I have reduced the ARC cache to 16 GiB and it seems to be all good so far; I will update here if it crashes again.

It still seems like an issue that the ARC cache does not free up automatically when memory is low, though.
 
What exactly are the problems with KSM, and is it known upstream? I cannot think of anything but lower memory usage with KSM.
the problem becomes quite apparent if you think about what KSM does - if two processes (and process here usually means VM) happen to have the same memory page mapped, they get merged. the two processes/VMs might not be related at all (e.g., one might be user A, and the other, attacker B). this means that the attacker has a side channel that can leak the existence of memory pages in other contexts (there are multiple ways, one is to load a page and measure the cached access time, flush the page out of the caches, wait, then reload the page and compare against the previous measurement - you now know whether somebody else accessed the page in the meantime or not). depending on how data is laid out in memory, this might make it trivial to leak sensitive information (the usual target would be crypto keys that are currently in use).

"the" famous paper regarding this is "FLUSH+RELOAD: a High Resolution, Low Noise,L3 Cache Side-Channel Attack" (PDF), ending with:

Preventing page sharing also blocks the FLUSH+RELOAD technique. Given the strength of the attack, we believe that the memory saved by sharing pages in a virtualised environment does not justify the breach in the isolation between guests. We, therefore, recommend that memory de-duplication be switched off.
 
the problem becomes quite apparent ...
Thank you @fabian, I already knew this and it was not my intended question. My question was targeted at the actual problem in this thread, or more precisely the problem in the linked thread (as the OP pointed out), in which memory is really not merged as well as before, and that is the problem. Still, it is good to shed light on the security problems of KSM, like you did.
 
Thank you @fabian, I already knew this and it was not my intended question. My question was targeted at the actual problem in this thread, or more precisely the problem in the linked thread (as the OP pointed out), in which memory is really not merged as well as before, and that is the problem. Still, it is good to shed light on the security problems of KSM, like you did.
ha - sorry. yeah, see the linked thread - a likely upstream fix was identified and should trickle down into PVE kernels via the stable trees.
 
