OK, I think I've diagnosed it. I'm not 100% sure on this yet, but I have gone from an OOM on every single run to no OOM at all.
Summary of the issue again.
Host machine:
- Proxmox 6.4
- 5.11 kernel (also happened on the 5.4 kernel when retested)
- 32 GB RAM
- No other running VMs
- zvol-backed storage
- sync=standard for writes
- QEMU cache mode set to none (no cache)
- OS type set to "other"

Guest:
- Windows 10 21H2
- 40 GB virtual disk
- Default write cache enabled, with flushes for sync writes
- 8 GB memory

ARC:
- Capped to 8 GB
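For reference, the ARC cap here is the usual zfs_arc_max module parameter; roughly how an 8 GB cap is applied on a Proxmox host (value is 8 GiB expressed in bytes):

    # runtime change, takes effect immediately
    echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max

    # persist across reboots
    echo "options zfs zfs_arc_max=8589934592" >> /etc/modprobe.d/zfs.conf
    update-initramfs -u -k all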
So, as above, I was getting an OOM every time the write test in CrystalDiskMark ran. Oddly this doesn't happen on smaller virtual disks, only the boot disk.
When I changed the OS type in the machine config to Windows 10, it got even worse: instead of just the guest being killed, the entire Proxmox server went down with a kernel panic / memory deadlock. (This makes me wonder how much the OS type setting adjusts things on the host to cause this behaviour.)
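If anyone wants to confirm the same symptom on their host, the guest OOM shows up in the host kernel log as the oom-killer taking out the kvm process; something like this during/after the benchmark is enough to spot it:

    # check the host kernel log for oom-killer activity
    dmesg -T | grep -i -E 'out of memory|oom'
    journalctl -k -b | grep -i oom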
I decided this had to be down to memory fragmentation, as there were tens of gigs of free RAM when this was occurring.
cat /proc/buddyinfo showed there were very few free contiguous regions of memory while the write part of CrystalDiskMark was running. I also observed the host was using transparent hugepages, so I added 'transparent_hugepage=never' to the kernel cmdline.
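Roughly the commands involved, for anyone who wants to check the same thing (the cmdline part depends on whether the host boots with GRUB or systemd-boot):

    # watch low-order vs high-order free pages while the benchmark runs
    watch -n 1 cat /proc/buddyinfo

    # GRUB-booted hosts: add transparent_hugepage=never to GRUB_CMDLINE_LINUX_DEFAULT
    nano /etc/default/grub
    update-grub

    # systemd-boot hosts (e.g. ZFS root): append it to /etc/kernel/cmdline instead
    nano /etc/kernel/cmdline
    proxmox-boot-tool refresh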
After the reboot I retested with both OS type "other" and OS type "Windows", and the problem is gone.
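You can confirm THP is actually off after the reboot with:

    cat /sys/kernel/mm/transparent_hugepage/enabled
    # expected output: always madvise [never]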
I also discovered that capping the maximum size of the ZFS dirty data cache (which drives txg sizing) also prevents the problem, even with transparent hugepages enabled. By default on this machine it was set to 4 gig; I was able to go as big as 2 gig without issues.
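For reference, the knob behind that 4 gig default is zfs_dirty_data_max; a 2 gig cap looks roughly like this (value in bytes):

    # runtime change
    echo 2147483648 > /sys/module/zfs/parameters/zfs_dirty_data_max

    # persist across reboots
    echo "options zfs zfs_dirty_data_max=2147483648" >> /etc/modprobe.d/zfs.conf
    update-initramfs -u -k all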
There is serious write amplification, as has been shown in the SSD benchmark thread, and that amplification may well be contributing to the memory fragmentation. The volblocksize on the guest drive is 4k, since NTFS uses 4k clusters. I don't know whether a bigger volblocksize would have eased the issue; I may do more testing in the future.
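If anyone wants to experiment with that, volblocksize can only be set at zvol creation time, so it means creating a new disk and migrating the data onto it; the pool/dataset names below are just placeholders:

    # check the current block size of the guest disk
    zfs get volblocksize rpool/data/vm-100-disk-0

    # volblocksize is fixed at creation, so a larger one needs a new zvol
    zfs create -s -V 40G -o volblocksize=16k rpool/data/vm-100-disk-1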
Note the kernel documentation says the following, i.e. when a hugepage allocation fails it is supposed to fall back to standard pages. However, something isn't right given the issues I diagnosed, so I might raise this as a kernel bug later today. I might also raise it on the OpenZFS GitHub.
if a hugepage allocation fails because of memory fragmentation,
regular pages should be gracefully allocated instead and mixed in
the same vma without any failure or significant delay and without
userland noticing