Proxmox 4.4.5 kernel: Out of memory: Kill process 8543 (kvm) score or sacrifice child

ozgurerdogan

Renowned Member
May 2, 2010
Bursa, Turkey
I have enough RAM, but one KVM guest stops and I see this in the syslog:

Code:
Jan 01 01:34:01 vztlfr6 kernel: sh invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
Jan 01 01:34:01 vztlfr6 kernel: sh cpuset=/ mems_allowed=0
Jan 01 01:34:01 vztlfr6 kernel: CPU: 3 PID: 4117 Comm: sh Tainted: G IO 4.4.35-1-pve #1
Jan 01 01:34:01 vztlfr6 kernel: Hardware name: Supermicro X8STi/X8STi, BIOS 2.0 09/17/10
Jan 01 01:34:01 vztlfr6 kernel: 0000000000000286 000000004afdee85 ffff88000489fb50 ffffffff813f9743
Jan 01 01:34:01 vztlfr6 kernel: ffff88000489fd40 0000000000000000 ffff88000489fbb8 ffffffff8120adcb
Jan 01 01:34:01 vztlfr6 kernel: ffff88040f2dada0 ffffea0004f99300 0000000100000001 0000000000000000
Jan 01 01:34:01 vztlfr6 kernel: Call Trace:
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff813f9743>] dump_stack+0x63/0x90
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff8120adcb>] dump_header+0x67/0x1d5
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff811925c5>] oom_kill_process+0x205/0x3c0
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff81192a17>] out_of_memory+0x237/0x4a0
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff81198d0e>] __alloc_pages_nodemask+0xcee/0xe20
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff81198e8b>] alloc_kmem_pages_node+0x4b/0xd0
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff8107f053>] copy_process+0x1c3/0x1c00
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff813941b0>] ? apparmor_file_alloc_security+0x60/0x240
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff813494b3>] ? security_file_alloc+0x33/0x50
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff81080c20>] _do_fork+0x80/0x360
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff810917ff>] ? sigprocmask+0x6f/0xa0
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff81080fa9>] SyS_clone+0x19/0x20
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff8185c276>] entry_SYSCALL_64_fastpath+0x16/0x75
Jan 01 01:34:01 vztlfr6 kernel: Mem-Info:
Jan 01 01:34:01 vztlfr6 kernel: active_anon:2535826 inactive_anon:377038 isolated_anon:0
active_file:444477 inactive_file:444280 isolated_file:0
unevictable:880 dirty:17 writeback:0 unstable:0
slab_reclaimable:162931 slab_unreclaimable:58813
mapped:20826 shmem:21040 pagetables:10173 bounce:0
free:38866 free_pcp:111 free_cma:0
Jan 01 01:34:01 vztlfr6 kernel: Node 0 DMA free:15852kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15968kB managed:15884kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jan 01 01:34:01 vztlfr6 kernel: lowmem_reserve[]: 0 3454 15995 15995 15995
Jan 01 01:34:01 vztlfr6 kernel: Node 0 DMA32 free:107940kB min:3492kB low:4364kB high:5236kB active_anon:1922068kB inactive_anon:480552kB active_file:383152kB inactive_file:382624kB unevictable:780kB isolated(anon):0kB isolated(file):0kB present:3644928kB managed:3564040kB mlocked:780kB dirty:8kB writeback:0kB mapped:20576kB shmem:21772kB slab_reclaimable:219488kB slab_unreclaimable:38272kB kernel_stack:528kB pagetables:8100kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jan 01 01:34:01 vztlfr6 kernel: lowmem_reserve[]: 0 0 12541 12541 12541
Jan 01 01:34:01 vztlfr6 kernel: Node 0 Normal free:31672kB min:12684kB low:15852kB high:19024kB active_anon:8221236kB inactive_anon:1027600kB active_file:1394756kB inactive_file:1394496kB unevictable:2740kB isolated(anon):0kB isolated(file):0kB present:13107200kB managed:12842072kB mlocked:2740kB dirty:60kB writeback:0kB mapped:62728kB shmem:62388kB slab_reclaimable:432236kB slab_unreclaimable:196980kB kernel_stack:4016kB pagetables:32592kB unstable:0kB bounce:0kB free_pcp:428kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:84 all_unreclaimable? no
Jan 01 01:34:01 vztlfr6 kernel: lowmem_reserve[]: 0 0 0 0 0
Jan 01 01:34:01 vztlfr6 kernel: Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15852kB
Jan 01 01:34:01 vztlfr6 kernel: Node 0 DMA32: 826*4kB (UME) 13128*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 108328kB
Jan 01 01:34:01 vztlfr6 kernel: Node 0 Normal: 7676*4kB (UMEH) 86*8kB (UMEH) 5*16kB (H) 1*32kB (H) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 31504kB
Jan 01 01:34:01 vztlfr6 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jan 01 01:34:01 vztlfr6 kernel: 910214 total pagecache pages
Jan 01 01:34:01 vztlfr6 kernel: 0 pages in swap cache
Jan 01 01:34:01 vztlfr6 kernel: Swap cache stats: add 376, delete 376, find 0/0
Jan 01 01:34:01 vztlfr6 kernel: Free swap = 1046044kB
Jan 01 01:34:01 vztlfr6 kernel: Total swap = 1047548kB
Jan 01 01:34:01 vztlfr6 kernel: 4192024 pages RAM
Jan 01 01:34:01 vztlfr6 kernel: 0 pages HighMem/MovableOnly
Jan 01 01:34:01 vztlfr6 kernel: 86525 pages reserved
Jan 01 01:34:01 vztlfr6 kernel: 0 pages cma reserved
Jan 01 01:34:01 vztlfr6 kernel: 0 pages hwpoisoned
Jan 01 01:34:01 vztlfr6 kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Jan 01 01:34:01 vztlfr6 kernel: [ 297] 0 297 12434 4291 30 3 0 0 systemd-journal
Jan 01 01:34:01 vztlfr6 kernel: [ 300] 0 300 10391 862 23 3 0 -1000 systemd-udevd
Jan 01 01:34:01 vztlfr6 kernel: [ 570] 0 570 2511 29 9 3 0 0 rdnssd
Jan 01 01:34:01 vztlfr6 kernel: [ 571] 104 571 4614 373 14 3 0 0 rdnssd
Jan 01 01:34:01 vztlfr6 kernel: [ 576] 100 576 25011 594 20 3 0 0 systemd-timesyn
Jan 01 01:34:01 vztlfr6 kernel: [ 1011] 0 1011 9270 666 23 3 0 0 rpcbind
Jan 01 01:34:01 vztlfr6 kernel: [ 1028] 0 1028 1272 374 8 3 0 0 iscsid
Jan 01 01:34:01 vztlfr6 kernel: [ 1029] 0 1029 1397 881 8 3 0 -17 iscsid
Jan 01 01:34:01 vztlfr6 kernel: [ 1036] 107 1036 9320 721 22 3 0 0 rpc.statd
Jan 01 01:34:01 vztlfr6 kernel: [ 1050] 0 1050 5839 49 16 3 0 0 rpc.idmapd
Jan 01 01:34:01 vztlfr6 kernel: [ 1207] 0 1207 13796 1320 31 3 0 -1000 sshd
Jan 01 01:34:01 vztlfr6 kernel: [ 1212] 0 1212 6146 916 17 3 0 0 smartd
Jan 01 01:34:01 vztlfr6 kernel: [ 1214] 109 1214 191484 9777 72 3 0 0 named
Jan 01 01:34:01 vztlfr6 kernel: [ 1216] 0 1216 58709 460 17 4 0 0 lxcfs
Jan 01 01:34:01 vztlfr6 kernel: [ 1218] 0 1218 1022 161 7 3 0 -1000 watchdog-mux
Jan 01 01:34:01 vztlfr6 kernel: [ 1219] 0 1219 4756 418 14 3 0 0 atd
Jan 01 01:34:01 vztlfr6 kernel: [ 1222] 0 1222 5459 649 13 3 0 0 ksmtuned
Jan 01 01:34:01 vztlfr6 kernel: [ 1227] 0 1227 4964 596 15 3 0 0 systemd-logind
Jan 01 01:34:01 vztlfr6 kernel: [ 1235] 106 1235 10558 825 27 3 0 -900 dbus-daemon
Jan 01 01:34:01 vztlfr6 kernel: [ 1271] 0 1271 206547 749 63 4 0 0 rrdcached
Jan 01 01:34:01 vztlfr6 kernel: [ 1287] 0 1287 64668 822 28 3 0 0 rsyslogd
Jan 01 01:34:01 vztlfr6 kernel: [ 1312] 0 1312 1064 386 8 3 0 0 acpid

Jan 01 01:34:01 vztlfr6 kernel: Out of memory: Kill process 8543 (kvm) score 279 or sacrifice child
Jan 01 01:34:01 vztlfr6 kernel: Killed process 8543 (kvm) total-vm:5808216kB, anon-rss:5007352kB, file-rss:10792kB
Jan 01 01:34:01 vztlfr6 CRON[4094]: pam_unix(cron:session): session closed for user root
Jan 01 01:34:02 vztlfr6 kernel: vmbr0: port 3(tap112i0) entered disabled state
 
I do use ZFS, but I also have the ARC limited to 2 GB or 4 GB (on the 16 GB and 28 GB servers respectively; I haven't seen the error on any of the 48 GB nodes yet).
I have been seriously suspicious of ZFS lately: its performance under heavy write conditions is abysmal no matter what tuning I do. I can get it to go fast by disabling the write throttle, but then the kernel crashes under heavy writes, so that's no better.
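For reference, this is roughly how the ARC cap is applied (a sketch with a 2 GB example value, not necessarily the exact figures in use here):

Code:
# /etc/modprobe.d/zfs.conf - cap the ZFS ARC at 2 GB (value in bytes)
options zfs zfs_arc_max=2147483648

# apply immediately at runtime (module must already be loaded)
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max

# if root is on ZFS, refresh the initramfs so the boot-time value matches
update-initramfs -u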

In any case, this appears to be a regression, since it wasn't happening previously.
 
I recently upgraded the kernel from 4.4.13-2-pve to 4.4.35-1-pve.
After the upgrade, Ceph OSDs would randomly get killed by the OOM killer even when there was plenty of RAM available.

Typically, nearly all of the free RAM had been consumed by cache when the OOM event occurred.

So far, since making the following change, everything has been running stable:
Code:
echo 262144 > /proc/sys/vm/min_free_kbytes
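For reference, the setting can also be made persistent across reboots with a sysctl drop-in (the file name below is just an example):

Code:
# /etc/sysctl.d/90-min-free.conf  (example file name)
vm.min_free_kbytes = 262144

# load all sysctl configuration immediately
sysctl --system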
 
Hello,
I have had the same problem for a few days: a KVM guest (Windows 2012 R2) is killed by the OOM killer.
The Proxmox host has 32 GB of RAM, with 20 GB free.
Storage is DRBD 8.4 (compiled as described at http://coolsoft.altervista.org/it/b...rnel-panic-downgrade-drbd-resources-drbd-9-84).
Kernel 4.4.35-1-pve.

I have 3 sites; the difference between them is memory:
32 GB (with the problem), 64 GB and 128 GB (without the problem).
Same kernel.
I don't use ZFS or Ceph.
I have an NFS storage for backups.

I have seen a VM killed once when a backup started, and another time during the day with no particular activity.

I have had this problem since I updated Proxmox to this kernel.
Today I migrated the VMs to another host to see what happens.
Maybe I will test with a kernel version older than 4.4.35-1-pve.
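For anyone wanting to try the same, older PVE kernels are separate packages, so a rough sketch (assuming the 4.4.13 kernel package mentioned earlier in the thread is still available) would be to install it and pick it from the GRUB menu at the next boot:

Code:
apt-get install pve-kernel-4.4.13-2-pve
reboot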

Regarding
Code:
echo 262144 > /proc/sys/vm/min_free_kbytes
does this have to be done at every boot?
 
This seems to fix the OOM issue for now. But one of my nodes is having a similar problem: during or right after a backup of all VMs, a KVM guest loses its disk connection and I have to drop the cache with echo 1 > /proc/sys/vm/drop_caches. So how can I keep the cache usage under control?
If I back up that VM alone, it does not lose the connection to its disk; it only happens when all 4 VMs are backed up.
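For completeness, the drop-caches workaround, with the sync that the kernel documentation recommends running first so dirty pages are written out before the cache is dropped:

Code:
sync
echo 1 > /proc/sys/vm/drop_caches   # 1 = free page cache only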
 
I encountered the problem last night on 2 of my servers; it was also during a backup. I do not use ZFS or Ceph.

One of those servers worked perfectly with kernel 4.4.35 from 2016-12-20 until this minor upgrade:

Code:
Start-Date: 2017-01-03 08:14:02
Commandline: apt-get dist-upgrade
Upgrade: libpve-common-perl:amd64 (4.0-84, 4.0-85), pve-kernel-4.4.35-1-pve:amd64 (4.4.35-76, 4.4.35-77), libpve-storage-perl:amd64 (4.0-70, 4.0-71), pve-manager:amd64 (4.4-2, 4.4-5), libgd3:amd64 (2.1.0-5+deb8u7, 2.1.0-5+deb8u8), lxcfs:amd64 (2.0.5-pve1, 2.0.5-pve2), pve-qemu-kvm:amd64 (2.7.0-9, 2.7.0-10), pve-container:amd64 (1.0-89, 1.0-90), lxc-pve:amd64 (2.0.6-2, 2.0.6-5), proxmox-ve:amd64 (4.4-76, 4.4-77)
End-Date: 2017-01-03 08:15:08

Please note the 4.4.35-76 to 4.4.35-77 kernel package upgrade. Since I did not see any mention of OOM changes in the kernel.org changelogs, is that a custom Proxmox patch? Or is it related to a change in backup behaviour? Also note that the OOM kill happened while backing up an LXC container.
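(That upgrade record is what apt writes to /var/log/apt/history.log on Debian; the entry can be pulled back out later with something like the following.)

Code:
grep -A 3 "Start-Date: 2017-01-03" /var/log/apt/history.log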
 
There have been some OOM-related cherry-picks from 4.7 into the Ubuntu kernel to fix https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1647400 ; those might be at fault:
https://git.kernel.org/cgit/linux/k.../?id=0a0337e0d1d134465778a16f5cbea95086e8e9e0
https://git.kernel.org/cgit/linux/k.../?id=ede37713737834d98ec72ed299a305d53e909f73
 
I can't reproduce this issue so far (even with very high memory pressure and load), so any more information to narrow down the contributing factors would help:
  • used hardware
  • used storage plugins
  • memory and swap sizes
  • circumstances triggering the OOM, ideally together with system logs and fine-grained atop or similar data (a collection sketch follows below)

edit: I can trigger the OOM killer and produce the stack trace mentioned earlier in this thread, but only when disabling swap and having less than a few hundred MB of actual free memory - i.e., the very situation where the OOM killer has to act to prevent a total system crash. Are you sure that you are not simply running out of memory?
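For the fine-grained data, a minimal sketch of how it could be collected, assuming atop is installed (interval and output path are only examples):

Code:
# record a sample every 10 seconds into a raw file until interrupted
atop -w /var/log/atop_oom.raw 10

# replay later: 'm' shows memory-related info, 't'/'T' step forward/back through samples
atop -r /var/log/atop_oom.raw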
 
Hello Fabian. For me:
Motherboard: Supermicro X9DR3-F.
Storage: DRBD v8.4 (compiled as per the link above) for the VMs.
NFS on a Synology RS2212 for backups.
Memory & swap size: 32 GB & 31 GB. Memtest OK.

Here are some files attached.

Only the VM that was killed was running on the node concerned.
 

Hi,
how looks "cat /proc/sys/vm/swappiness" on the effected systems? Perhaps 0 instead of 1?

Udo
 
