Proxmox 4.4.5 kernel: Out of memory: Kill process 8543 (kvm) score or sacrifice child

ozgurerdogan · Jan 1, 2017

O have enough amount of ram. But one kvm stops and I see these in syslog.:

Code:

Jan 01 01:34:01 vztlfr6 kernel: sh invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
Jan 01 01:34:01 vztlfr6 kernel: sh cpuset=/ mems_allowed=0
Jan 01 01:34:01 vztlfr6 kernel: CPU: 3 PID: 4117 Comm: sh Tainted: G IO 4.4.35-1-pve #1
Jan 01 01:34:01 vztlfr6 kernel: Hardware name: Supermicro X8STi/X8STi, BIOS 2.0 09/17/10
Jan 01 01:34:01 vztlfr6 kernel: 0000000000000286 000000004afdee85 ffff88000489fb50 ffffffff813f9743
Jan 01 01:34:01 vztlfr6 kernel: ffff88000489fd40 0000000000000000 ffff88000489fbb8 ffffffff8120adcb
Jan 01 01:34:01 vztlfr6 kernel: ffff88040f2dada0 ffffea0004f99300 0000000100000001 0000000000000000
Jan 01 01:34:01 vztlfr6 kernel: Call Trace:
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff813f9743>] dump_stack+0x63/0x90
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff8120adcb>] dump_header+0x67/0x1d5
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff811925c5>] oom_kill_process+0x205/0x3c0
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff81192a17>] out_of_memory+0x237/0x4a0
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff81198d0e>] __alloc_pages_nodemask+0xcee/0xe20
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff81198e8b>] alloc_kmem_pages_node+0x4b/0xd0
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff8107f053>] copy_process+0x1c3/0x1c00
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff813941b0>] ? apparmor_file_alloc_security+0x60/0x240
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff813494b3>] ? security_file_alloc+0x33/0x50
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff81080c20>] _do_fork+0x80/0x360
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff810917ff>] ? sigprocmask+0x6f/0xa0
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff81080fa9>] SyS_clone+0x19/0x20
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff8185c276>] entry_SYSCALL_64_fastpath+0x16/0x75
Jan 01 01:34:01 vztlfr6 kernel: Mem-Info:
Jan 01 01:34:01 vztlfr6 kernel: active_anon:2535826 inactive_anon:377038 isolated_anon:0
active_file:444477 inactive_file:444280 isolated_file:0
unevictable:880 dirty:17 writeback:0 unstable:0
slab_reclaimable:162931 slab_unreclaimable:58813
mapped:20826 shmem:21040 pagetables:10173 bounce:0
free:38866 free_pcp:111 free_cma:0
Jan 01 01:34:01 vztlfr6 kernel: Node 0 DMA free:15852kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15968kB managed:15884kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jan 01 01:34:01 vztlfr6 kernel: lowmem_reserve[]: 0 3454 15995 15995 15995
Jan 01 01:34:01 vztlfr6 kernel: Node 0 DMA32 free:107940kB min:3492kB low:4364kB high:5236kB active_anon:1922068kB inactive_anon:480552kB active_file:383152kB inactive_file:382624kB unevictable:780kB isolated(anon):0kB isolated(file):0kB present:3644928kB managed:3564040kB mlocked:780kB dirty:8kB writeback:0kB mapped:20576kB shmem:21772kB slab_reclaimable:219488kB slab_unreclaimable:38272kB kernel_stack:528kB pagetables:8100kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jan 01 01:34:01 vztlfr6 kernel: lowmem_reserve[]: 0 0 12541 12541 12541
Jan 01 01:34:01 vztlfr6 kernel: Node 0 Normal free:31672kB min:12684kB low:15852kB high:19024kB active_anon:8221236kB inactive_anon:1027600kB active_file:1394756kB inactive_file:1394496kB unevictable:2740kB isolated(anon):0kB isolated(file):0kB present:13107200kB managed:12842072kB mlocked:2740kB dirty:60kB writeback:0kB mapped:62728kB shmem:62388kB slab_reclaimable:432236kB slab_unreclaimable:196980kB kernel_stack:4016kB pagetables:32592kB unstable:0kB bounce:0kB free_pcp:428kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:84 all_unreclaimable? no
Jan 01 01:34:01 vztlfr6 kernel: lowmem_reserve[]: 0 0 0 0 0
Jan 01 01:34:01 vztlfr6 kernel: Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15852kB
Jan 01 01:34:01 vztlfr6 kernel: Node 0 DMA32: 826*4kB (UME) 13128*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 108328kB
Jan 01 01:34:01 vztlfr6 kernel: Node 0 Normal: 7676*4kB (UMEH) 86*8kB (UMEH) 5*16kB (H) 1*32kB (H) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 31504kB
Jan 01 01:34:01 vztlfr6 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jan 01 01:34:01 vztlfr6 kernel: 910214 total pagecache pages
Jan 01 01:34:01 vztlfr6 kernel: 0 pages in swap cache
Jan 01 01:34:01 vztlfr6 kernel: Swap cache stats: add 376, delete 376, find 0/0
Jan 01 01:34:01 vztlfr6 kernel: Free swap = 1046044kB
Jan 01 01:34:01 vztlfr6 kernel: Total swap = 1047548kB
Jan 01 01:34:01 vztlfr6 kernel: 4192024 pages RAM
Jan 01 01:34:01 vztlfr6 kernel: 0 pages HighMem/MovableOnly
Jan 01 01:34:01 vztlfr6 kernel: 86525 pages reserved
Jan 01 01:34:01 vztlfr6 kernel: 0 pages cma reserved
Jan 01 01:34:01 vztlfr6 kernel: 0 pages hwpoisoned
Jan 01 01:34:01 vztlfr6 kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Jan 01 01:34:01 vztlfr6 kernel: [ 297] 0 297 12434 4291 30 3 0 0 systemd-journal
Jan 01 01:34:01 vztlfr6 kernel: [ 300] 0 300 10391 862 23 3 0 -1000 systemd-udevd
Jan 01 01:34:01 vztlfr6 kernel: [ 570] 0 570 2511 29 9 3 0 0 rdnssd
Jan 01 01:34:01 vztlfr6 kernel: [ 571] 104 571 4614 373 14 3 0 0 rdnssd
Jan 01 01:34:01 vztlfr6 kernel: [ 576] 100 576 25011 594 20 3 0 0 systemd-timesyn
Jan 01 01:34:01 vztlfr6 kernel: [ 1011] 0 1011 9270 666 23 3 0 0 rpcbind
Jan 01 01:34:01 vztlfr6 kernel: [ 1028] 0 1028 1272 374 8 3 0 0 iscsid
Jan 01 01:34:01 vztlfr6 kernel: [ 1029] 0 1029 1397 881 8 3 0 -17 iscsid
Jan 01 01:34:01 vztlfr6 kernel: [ 1036] 107 1036 9320 721 22 3 0 0 rpc.statd
Jan 01 01:34:01 vztlfr6 kernel: [ 1050] 0 1050 5839 49 16 3 0 0 rpc.idmapd
Jan 01 01:34:01 vztlfr6 kernel: [ 1207] 0 1207 13796 1320 31 3 0 -1000 sshd
Jan 01 01:34:01 vztlfr6 kernel: [ 1212] 0 1212 6146 916 17 3 0 0 smartd
Jan 01 01:34:01 vztlfr6 kernel: [ 1214] 109 1214 191484 9777 72 3 0 0 named
Jan 01 01:34:01 vztlfr6 kernel: [ 1216] 0 1216 58709 460 17 4 0 0 lxcfs
Jan 01 01:34:01 vztlfr6 kernel: [ 1218] 0 1218 1022 161 7 3 0 -1000 watchdog-mux
Jan 01 01:34:01 vztlfr6 kernel: [ 1219] 0 1219 4756 418 14 3 0 0 atd
Jan 01 01:34:01 vztlfr6 kernel: [ 1222] 0 1222 5459 649 13 3 0 0 ksmtuned
Jan 01 01:34:01 vztlfr6 kernel: [ 1227] 0 1227 4964 596 15 3 0 0 systemd-logind
Jan 01 01:34:01 vztlfr6 kernel: [ 1235] 106 1235 10558 825 27 3 0 -900 dbus-daemon
Jan 01 01:34:01 vztlfr6 kernel: [ 1271] 0 1271 206547 749 63 4 0 0 rrdcached
Jan 01 01:34:01 vztlfr6 kernel: [ 1287] 0 1287 64668 822 28 3 0 0 rsyslogd
Jan 01 01:34:01 vztlfr6 kernel: [ 1312] 0 1312 1064 386 8 3 0 0 acpid

Jan 01 01:34:01 vztlfr6 kernel: Out of memory: Kill process 8543 (kvm) score 279 or sacrifice child
Jan 01 01:34:01 vztlfr6 kernel: Killed process 8543 (kvm) total-vm:5808216kB, anon-rss:5007352kB, file-rss:10792kB
Jan 01 01:34:01 vztlfr6 CRON[4094]: pam_unix(cron:session): session closed for user root
Jan 01 01:34:02 vztlfr6 kernel: vmbr0: port 3(tap112i0) entered disabled state

athompso · Jan 1, 2017

+1 : I've been seeing this on a nightly basis, too, recently. Only since 4.4.x.

spirit · Jan 1, 2017

do you use zfs ? if yes, it can eat half of memory by default
https://pve.proxmox.com/wiki/ZFS_on_Linux#_limit_zfs_memory_usage

mir · Jan 1, 2017

spirit said:
do you use zfs ? if yes, it can eat half of memory by default
https://pve.proxmox.com/wiki/ZFS_on_Linux#_limit_zfs_memory_usage

I do not see this here and I don't use ZFS on my proxmox nodes so this seems like a plausible explanation.

athompso · Jan 1, 2017

I do use ZFS, but I also have the ARC limited to 2GB or 4GB (on 16GB and 28GB servers respectively - I haven't seen the error on any of the 48G nodes yet).
I have been seriously suspicious of ZFS lately, its performance under heavy write conditions is utterly abysmal no matter what tweaking I do... Actually, I can get it to go fast by disabling the write throttle, but then the kernel crashes under heavy write, so that's no better.

In any case, this appears to be a regression, since it wasn't happening previously.

ozgurerdogan · Jan 1, 2017

I do not use ZFS also. It is an issue with OOM Killer

spirit · Jan 1, 2017

ozgurerdogan said:
I do not use ZFS also. It is an issue with OOM Killer

yes, sure, so what's is eating your memory ? do you have a process list with memory usage before oom occur ?

absent · Jan 2, 2017

ozgurerdogan, try set:
vm.min_free_kbytes = 131072

ozgurerdogan · Jan 3, 2017

thank you I will give it a try. It mostly kill kvm when it is backing up. But even during backup, system has at leat %10 free memory..

e100 · Jan 3, 2017

I recently upgraded kernel from 4.4.13-2-pve to 4.4.35-1-pve
After the upgrade CEPH osds would randomly get killed by OOM even when there was plenty of RAM available.

Typically nearly all of the free ram was consumed by cache when the OOM event occurs

So far since doing this everything has been running stable:

Code:

echo 262144 > /proc/sys/vm/min_free_kbytes

whitewater · Jan 3, 2017

Hello,
i have the same problem for a few days. A kvm guest (Windows 2012r2) is killed by OOM.
Proxmox host have 32 Go ram. 20 Go free ram.
Storage is drbd 8.4 (compiled with http://coolsoft.altervista.org/it/b...rnel-panic-downgrade-drbd-resources-drbd-9-84).
Kernel 4.4.35-1.pve.

I have 3 agency. Difference is memory :
32 Go (with the problem) 64 and 128 Go (without problem).
Same kernel.
I don't use ZFS or Ceph.
I had a NFS storage for backup.

I encounter VM killed when backup starting one time and other during the day with no particular activity.

I encounter this problem since i updated proxmox with this kernel.
Today, i had migrated VM on other host and see what happen.
Maybie i will test with kernel version before 4.4.35-1.pve.

For this

Code:

echo 262144 > /proc/sys/vm/min_free_kbytes

is it must be done every boot ?

udo · Jan 3, 2017

whitewater said:
For this

Code:

echo 262144 > /proc/sys/vm/min_free_kbytes

is it must be done every boot ?

Hi,
you can put this in /etc/sysctl.d/pve.conf (or /etc/sysctl.d/90-my.conf) like:

Code:

vm.swappiness = 1
vm.min_free_kbytes = 262144

Udo

whitewater · Jan 3, 2017

ok, thank you Udo. I will test if i need, after testing other kernel version (i think)

ozgurerdogan · Jan 4, 2017

This seems to fix the issue for now about OOM. But one of my nodes is having similar problem. During or right after backup of all vms, kvm loose disk connection and I have to drop cache with echo 1 > /proc/sys/vm/drop_caches so how can I heal the cache usage?
If I backup that vm only, it does not loose connection with kvm disk. Only if all 4 vms are backed up this is happening.

opty · Jan 4, 2017

I encountered the problem this night on 2 of my servers, it was also during backup, I do not use ZFS or CEPH

One of those server worked perfectly with kernel 4.4.35 from 2016-12-20 until that minor upgrade :

Start-Date: 2017-01-03 08:14:02
Commandline: apt-get dist-upgrade
Upgrade: libpve-common-perl:amd64 (4.0-84, 4.0-85), pve-kernel-4.4.35-1-pve:amd64 (4.4.35-76, 4.4.35-77), libpve-storage-perl:amd64 (4.0-70, 4.0-71), pve-manager:amd64 (4.4-2, 4.4-5), libgd3:amd64 (2.1.0-5+deb8u7, 2.1.0-5+deb8u8), lxcfs:amd64 (2.0.5-pve1, 2.0.5-pve2), pve-qemu-kvm:amd64 (2.7.0-9, 2.7.0-10), pve-container:amd64 (1.0-89, 1.0-90), lxc-pve:amd64 (2.0.6-2, 2.0.6-5), proxmox-ve:amd64 (4.4-76, 4.4-77)
End-Date: 2017-01-03 08:15:08

please note the 4.4.35-76 to 4.4.35-77 kernel upgrade, and since I did not see any mention about oom modification in kernel.org changelogs, is that a custom proxmox patch? Or is that something related to backup behaviour change? please note that oom kill happened while backuping lxc container

fabian · Jan 4, 2017

opty said:
I encountered the problem this night on 2 of my servers, it was also during backup, I do not use ZFS or CEPH

One of those server worked perfectly with kernel 4.4.35 from 2016-12-20 until that minor upgrade :

Start-Date: 2017-01-03 08:14:02
Commandline: apt-get dist-upgrade
Upgrade: libpve-common-perl:amd64 (4.0-84, 4.0-85), pve-kernel-4.4.35-1-pve:amd64 (4.4.35-76, 4.4.35-77), libpve-storage-perl:amd64 (4.0-70, 4.0-71), pve-manager:amd64 (4.4-2, 4.4-5), libgd3:amd64 (2.1.0-5+deb8u7, 2.1.0-5+deb8u8), lxcfs:amd64 (2.0.5-pve1, 2.0.5-pve2), pve-qemu-kvm:amd64 (2.7.0-9, 2.7.0-10), pve-container:amd64 (1.0-89, 1.0-90), lxc-pve:amd64 (2.0.6-2, 2.0.6-5), proxmox-ve:amd64 (4.4-76, 4.4-77)
End-Date: 2017-01-03 08:15:08

please note the 4.4.35-76 to 4.4.35-77 kernel upgrade, and since I did not see any mention about oom modification in kernel.org changelogs, is that a custom proxmox patch? Or is that something related to backup behaviour change? please note that oom kill happened while backuping lxc container

there have been some OOM related cherry-picks from 4.7 into the Ubuntu kernel to fix https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1647400 , those might be at fault:
https://git.kernel.org/cgit/linux/k.../?id=0a0337e0d1d134465778a16f5cbea95086e8e9e0
https://git.kernel.org/cgit/linux/k.../?id=ede37713737834d98ec72ed299a305d53e909f73

fabian · Jan 4, 2017

can't reproduce this issue so far (even with very high memory pressure and load) - so any more information to narrow down the contributing factors would help:

used hardware
used storage plugins
memory and swap sizes
circumstances triggering the OOM, ideally together with system logs and fine-grained atop or similar data

edit: I can trigger the OOM-killer and produce the stacktrace mentioned earlier in this thread, but only when disabling swap and having less than a few hundred MB of actual free memory - i.e., the very situation where the OOM-killer has to act to prevent a total system crash.. are you sure that you are not simply running out of memory?

whitewater · Jan 4, 2017

Hello Fabian. For me :
Motherboard : Supermicro X9DR3-F.
Storage : DRBD v8.4 (compiled with the link said above) for VM.
NFS on a synology RS2212 for backup.
Memory & Swap size : 32 Go & 31 Go. Memtest OK.

Here some files in attachment.

Only the VM killed on the node concerned was running.

udo · Jan 4, 2017

fabian said:
can't reproduce this issue so far (even with very high memory pressure and load) - so any more information to narrow down the contributing factors would help:

used hardware

used storage plugins

memory and swap sizes

circumstances triggering the OOM, ideally together with system logs and fine-grained atop or similar data

edit: I can trigger the OOM-killer and produce the stacktrace mentioned earlier in this thread, but only when disabling swap and having less than a few hundred MB of actual free memory - i.e., the very situation where the OOM-killer has to act to prevent a total system crash.. are you sure that you are not simply running out of memory?

Hi,
how looks "cat /proc/sys/vm/swappiness" on the effected systems? Perhaps 0 instead of 1?

Udo

whitewater · Jan 4, 2017

Hi Udo, 60 :

Code:

root@mtp-prox02:~# cat /proc/sys/vm/swappiness
60

I had done this on several proxmox host. All 60.

Proxmox 4.4.5 kernel: Out of memory: Kill process 8543 (kvm) score or sacrifice child

Renowned Member

Renowned Member

Distinguished Member

Famous Member

Renowned Member

Renowned Member

Distinguished Member

New Member

Renowned Member

Famous Member

Member

Distinguished Member

Member

Renowned Member

Renowned Member

Proxmox Staff Member

Proxmox Staff Member

Member

Attachments

Distinguished Member

Member

We value your privacy