[PARTIALLY SOLVED] Following suggestions of hugeadm --explain breaks oom watchdog, and system crashes when starting qemu vms w/ iommu

FuriousGeorge · Aug 2, 2023

UPDATE

My main problem was that I was setting vm.nr_hugepages to 1024*2 in sysctl, without understanding that this was overriding my cmdline argument of hugepages=1. I thought I was defining the huge page size. As a result, 118 or 119 huge pages of 1 GB each would be allocated, which the system clearly did not like.
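For a rough sense of scale: vm.nr_hugepages is a count of pages at the default huge page size, not a size. A minimal Python sketch of the arithmetic, assuming the 1 GiB default page size from the cmdline in the original post and the total memory hugeadm reports there (the function name is made up for illustration):

Code:

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def pool_request_bytes(nr_hugepages, page_size_bytes=GIB):
    """Memory asked for by vm.nr_hugepages at the given huge page size."""
    return nr_hugepages * page_size_bytes


# "Total System Memory: 128741 MB" from hugeadm --explain (assumed to mean MiB).
total_ram = 128741 * MIB
requested = pool_request_bytes(1024 * 2)

print(requested // GIB, "GiB requested")
print(total_ram // GIB, "GiB of RAM (rounded down)")
print("request exceeds RAM:", requested > total_ram)
```

2048 pages of 1 GiB is a 2 TiB request, so the pool simply grows until almost nothing else is left, which fits the 118 or 119 pages that ended up allocated.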

Shortly after, I realized that with proxmox 8 my vms start fine on boot without hugepages at all (I'm not sure if this was the case in proxmox 6 or 7, since I'm pretty sure I set up hugepages under proxmox 5).

The stress tester (mprime) would still crash my system, as opposed to being killed by the oom watchdog. I tested with an Ubuntu live CD and it would get killed by oom. I tested with an Arch Linux live USB, and it would just run as normal.

Over in the #proxmox IRC channel on Libera Chat, a person who goes by alian5687 recommended a short C program which would spam memory allocation (via malloc) to see if that would also crash my system, and it did (as suspected).
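The C program itself was not posted. As a rough, capped analogue in Python (not the program alian5687 suggested; the function name, chunk size and 64 MiB cap are arbitrary choices for illustration), the idea is to keep allocating memory and writing to every page so the kernel has to back it with real RAM:

Code:

```python
def allocate_until(limit_bytes, chunk_bytes=16 * 1024 * 1024, page=4096):
    """Allocate and touch memory in chunks until limit_bytes is held.

    Returns the list of chunks so the memory stays allocated until the
    caller drops it. Stops early if Python raises MemoryError.
    """
    chunks = []
    held = 0
    while held < limit_bytes:
        size = min(chunk_bytes, limit_bytes - held)
        try:
            chunk = bytearray(size)
        except MemoryError:
            break
        # Write one byte per page so no page can stay lazily unbacked.
        for offset in range(0, size, page):
            chunk[offset] = 1
        chunks.append(chunk)
        held += size
    return chunks


if __name__ == "__main__":
    # Deliberately small cap; this will not trouble a healthy system.
    held = allocate_until(64 * 1024 * 1024)
    print(sum(len(c) for c in held) // (1024 * 1024), "MiB held")
```

With the cap left small this is harmless. Raising it toward total RAM recreates the memory pressure, with the same risk of hanging the host instead of triggering the oom watchdog.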

I then noticed that /proc/sys/kernel/shmmax was still giving me a value in the tens of exabytes:

Code:

# cat /proc/sys/kernel/shmmax
18446744073692774399

(hugeadm --explain was still telling me the value was only 9,223,372,036,854,775,807, so I uninstalled libhugetlbfs)
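Both numbers sit right at 64-bit limits, which a few lines of Python can confirm. The value from /proc equals ULONG_MAX minus 16 MiB, which matches what recent kernels define as the default SHMMAX, so it may just be the default rather than a leftover setting. The value hugeadm printed is the largest signed 64-bit integer, which would be consistent with (but does not prove) hugeadm clamping the number when it parses it:

Code:

```python
ULONG_MAX = 2 ** 64 - 1   # unsigned long on a 64-bit kernel
LONG_MAX = 2 ** 63 - 1    # signed long on a 64-bit kernel

shmmax_from_proc = 18446744073692774399
shmmax_from_hugeadm = 9223372036854775807

print("proc value is ULONG_MAX - 16 MiB:", shmmax_from_proc == ULONG_MAX - (1 << 24))
print("hugeadm value is LONG_MAX:", shmmax_from_hugeadm == LONG_MAX)
print(round(shmmax_from_proc / 10 ** 18, 2), "EB")
```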

I manually set it to 16 GB and set min_free_kbytes to ~1.28 GB.

Now mprime runs without triggering oom killing, and doesn't always immediately crash the system, although the system becomes totally unresponsive when it doesn't.

It would be nice to go back to a time when the oom watchdog worked, but since this is just my personal PC, I think I'll leave well enough for the moment.

What follows is the original post:

My PC runs two VMs, to which I've allocated 96 GB of ram combined. Due to the amount of ram they use, I enabled hugepages, as otherwise they would time out on startup (under a previous version of proxmox). At first all I did was edit /etc/kernel/cmdline as follows:

Code:

root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet amd_iommu=on hugepagesz=1G default_hugepagesz=1G

As well as adding the following line to /etc/sysctl.conf:

Code:

vm.hugetlb_shm_group = 0

Then I could start both vms simultaneously on boot.

Many months later, I tried to run prime95 on the host in order to verify system stability, and it quickly crashed during the memory intensive test, due to the oom watchdog. I came to learn that prime95 does not understand hugepages, and that could cause this.

Instead of leaving well enough alone, I ran hugeadm --explain, and followed the suggestions it gave me, to see if I could get prime95 to play nicer. Some of the values made no sense, such as setting min_free_kbytes to an amount larger than my total ram, which I did blindly, thinking that the value was in bytes. This of course led to a system that would hang on boot. I reverted that change and put in a number that made more sense, but it still complained that I should set the value to a number larger than my total ram.

My sysctl.conf now looked like this:

Code:

kernel.shmmax = 120000000
vm.nr_hugepages = 1048576
vm.hugetlb_shm_group = 102
vm.min_free_kbytes = 3145728
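Each of these settings uses a different unit, which is easy to miss. A small Python sketch that converts the values above to bytes, assuming the 1 GiB default huge page size from the cmdline (the UNIT_BYTES table and the function name are illustrative, not part of any tool):

Code:

```python
GIB = 1024 ** 3
KIB = 1024

HUGE_PAGE = GIB  # assumes default_hugepagesz=1G

# Bytes represented by one unit of each size-like setting.
UNIT_BYTES = {
    "kernel.shmmax": 1,            # plain bytes
    "vm.nr_hugepages": HUGE_PAGE,  # count of default-size huge pages
    "vm.min_free_kbytes": KIB,     # kilobytes (1024 bytes each)
}

SYSCTL_CONF = """\
kernel.shmmax = 120000000
vm.nr_hugepages = 1048576
vm.hugetlb_shm_group = 102
vm.min_free_kbytes = 3145728
"""


def sizes_in_bytes(conf_text):
    """Map each size-like sysctl key to the number of bytes it stands for."""
    result = {}
    for line in conf_text.splitlines():
        key, _, value = (part.strip() for part in line.partition("="))
        if key in UNIT_BYTES:  # skips vm.hugetlb_shm_group, which is a group ID
            result[key] = int(value) * UNIT_BYTES[key]
    return result


sizes = sizes_in_bytes(SYSCTL_CONF)
for key, size in sizes.items():
    print(f"{key}: {size / GIB:,.2f} GiB")
```

Read this way, shmmax is only about 114 MiB, min_free_kbytes reserves 3 GiB, and nr_hugepages asks for 1,048,576 pages of 1 GiB each, i.e. 1 PiB.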

At this point, prime95 would no longer get killed by the oom watchdog. Instead, the system would crash within a few seconds. I reverted the changes, but for some reason some seem to have become persistent, even though they are no longer in sysctl.conf.

Worse still, starting either vm on its own will crash the system in the same way, although they are not using nearly enough ram to do so on that basis alone.

On top of that, hugeadm --explain is now recommending values that are not sane:

Code:

# hugeadm --explain
Total System Memory: 128741 MB


Mount Point          Options
/dev/hugepages       rw,relatime,pagesize=1024M


Huge page pools:
      Size  Minimum  Current  Maximum  Default
   2097152        0        0        0
1073741824      118      118      118        *


Huge page sizes with configured pools:
1073741824


The /proc/sys/vm/min_free_kbytes of 90112 is too small. To maximiuse efficiency
of fragmentation avoidance, there should be at least one huge page free per zone
in the system which minimally requires a min_free_kbytes value of 230686720


A /proc/sys/kernel/shmmax value of 9223372036854775807 bytes may be sub-optimal. To maximise
shared memory usage, this should be set to the size of the largest shared memory
segment size you want to be able to use. Alternatively, set it to a size matching
the maximum possible allocation size of all huge pages. This can be done
automatically, using the --set-recommended-shmmax option.


The recommended shmmax for your currently allocated huge pages is 126701535232 bytes.
To make shmmax settings persistent, add the following line to /etc/sysctl.conf:
  kernel.shmmax = 126701535232


To make your hugetlb_shm_group settings persistent, add the following line to /etc/sysctl.conf:
  vm.hugetlb_shm_group = 102


Note: Permanent swap space should be preferred when dynamic huge page pools are used.
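The figures in this output can be cross-checked with a little arithmetic. A Python sketch, assuming the "MB" in "Total System Memory" means MiB and using only numbers copied from the output above:

Code:

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
KIB = 1024

total_ram = 128741 * MIB                 # "Total System Memory: 128741 MB"
pool = 118 * 1073741824                  # 118 huge pages of 1073741824 bytes
recommended_shmmax = 126701535232        # bytes
recommended_min_free = 230686720 * KIB   # min_free_kbytes is in kB

print("shmmax equals the pool size:", pool == recommended_shmmax)
print(recommended_min_free // GIB, "GiB recommended as min_free_kbytes")
print("recommendation exceeds RAM:", recommended_min_free > total_ram)
print(round((total_ram - pool) / GIB, 1), "GiB left outside the huge page pool")
```

So the recommended shmmax is exactly the size of the 118-page pool, the recommended min_free_kbytes works out to 220 GiB on a machine with roughly 125.7 GiB, and only about 7.7 GiB remains for everything that does not use huge pages.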

EDIT: I didn't even notice it was allocating 119 huge pages until after I wrote this post.

So now I have set shmmax in sysctl.conf again, because the value above makes no sense. It is once again recommending a min_free_kbytes larger than my total ram.

I am running the latest version of proxmox:

Code:

# pveversion
pve-manager/8.0.3/bbf3993334bfa916 (running kernel: 6.2.16-5-pve)

Is there anything I can do, short of reinstalling the OS, to get my vms and the oom monitor working again?
