I have enough RAM, but one KVM guest stops and I see this in the syslog:
Code:
Jan 01 01:34:01 vztlfr6 kernel: sh invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
Jan 01 01:34:01 vztlfr6 kernel: sh cpuset=/ mems_allowed=0
Jan 01 01:34:01 vztlfr6 kernel: CPU: 3 PID: 4117 Comm: sh Tainted: G IO 4.4.35-1-pve #1
Jan 01 01:34:01 vztlfr6 kernel: Hardware name: Supermicro X8STi/X8STi, BIOS 2.0 09/17/10
Jan 01 01:34:01 vztlfr6 kernel: 0000000000000286 000000004afdee85 ffff88000489fb50 ffffffff813f9743
Jan 01 01:34:01 vztlfr6 kernel: ffff88000489fd40 0000000000000000 ffff88000489fbb8 ffffffff8120adcb
Jan 01 01:34:01 vztlfr6 kernel: ffff88040f2dada0 ffffea0004f99300 0000000100000001 0000000000000000
Jan 01 01:34:01 vztlfr6 kernel: Call Trace:
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff813f9743>] dump_stack+0x63/0x90
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff8120adcb>] dump_header+0x67/0x1d5
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff811925c5>] oom_kill_process+0x205/0x3c0
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff81192a17>] out_of_memory+0x237/0x4a0
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff81198d0e>] __alloc_pages_nodemask+0xcee/0xe20
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff81198e8b>] alloc_kmem_pages_node+0x4b/0xd0
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff8107f053>] copy_process+0x1c3/0x1c00
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff813941b0>] ? apparmor_file_alloc_security+0x60/0x240
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff813494b3>] ? security_file_alloc+0x33/0x50
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff81080c20>] _do_fork+0x80/0x360
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff810917ff>] ? sigprocmask+0x6f/0xa0
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff81080fa9>] SyS_clone+0x19/0x20
Jan 01 01:34:01 vztlfr6 kernel: [<ffffffff8185c276>] entry_SYSCALL_64_fastpath+0x16/0x75
Jan 01 01:34:01 vztlfr6 kernel: Mem-Info:
Jan 01 01:34:01 vztlfr6 kernel: active_anon:2535826 inactive_anon:377038 isolated_anon:0 active_file:444477 inactive_file:444280 isolated_file:0 unevictable:880 dirty:17 writeback:0 unstable:0 slab_reclaimable:162931 slab_unreclaimable:58813 mapped:20826 shmem:21040 pagetables:10173 bounce:0 free:38866 free_pcp:111 free_cma:0
Jan 01 01:34:01 vztlfr6 kernel: Node 0 DMA free:15852kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15968kB managed:15884kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jan 01 01:34:01 vztlfr6 kernel: lowmem_reserve[]: 0 3454 15995 15995 15995
Jan 01 01:34:01 vztlfr6 kernel: Node 0 DMA32 free:107940kB min:3492kB low:4364kB high:5236kB active_anon:1922068kB inactive_anon:480552kB active_file:383152kB inactive_file:382624kB unevictable:780kB isolated(anon):0kB isolated(file):0kB present:3644928kB managed:3564040kB mlocked:780kB dirty:8kB writeback:0kB mapped:20576kB shmem:21772kB slab_reclaimable:219488kB slab_unreclaimable:38272kB kernel_stack:528kB pagetables:8100kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jan 01 01:34:01 vztlfr6 kernel: lowmem_reserve[]: 0 0 12541 12541 12541
Jan 01 01:34:01 vztlfr6 kernel: Node 0 Normal free:31672kB min:12684kB low:15852kB high:19024kB active_anon:8221236kB inactive_anon:1027600kB active_file:1394756kB inactive_file:1394496kB unevictable:2740kB isolated(anon):0kB isolated(file):0kB present:13107200kB managed:12842072kB mlocked:2740kB dirty:60kB writeback:0kB mapped:62728kB shmem:62388kB slab_reclaimable:432236kB slab_unreclaimable:196980kB kernel_stack:4016kB pagetables:32592kB unstable:0kB bounce:0kB free_pcp:428kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:84 all_unreclaimable? no
Jan 01 01:34:01 vztlfr6 kernel: lowmem_reserve[]: 0 0 0 0 0
Jan 01 01:34:01 vztlfr6 kernel: Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15852kB
Jan 01 01:34:01 vztlfr6 kernel: Node 0 DMA32: 826*4kB (UME) 13128*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 108328kB
Jan 01 01:34:01 vztlfr6 kernel: Node 0 Normal: 7676*4kB (UMEH) 86*8kB (UMEH) 5*16kB (H) 1*32kB (H) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 31504kB
Jan 01 01:34:01 vztlfr6 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jan 01 01:34:01 vztlfr6 kernel: 910214 total pagecache pages
Jan 01 01:34:01 vztlfr6 kernel: 0 pages in swap cache
Jan 01 01:34:01 vztlfr6 kernel: Swap cache stats: add 376, delete 376, find 0/0
Jan 01 01:34:01 vztlfr6 kernel: Free swap = 1046044kB
Jan 01 01:34:01 vztlfr6 kernel: Total swap = 1047548kB
Jan 01 01:34:01 vztlfr6 kernel: 4192024 pages RAM
Jan 01 01:34:01 vztlfr6 kernel: 0 pages HighMem/MovableOnly
Jan 01 01:34:01 vztlfr6 kernel: 86525 pages reserved
Jan 01 01:34:01 vztlfr6 kernel: 0 pages cma reserved
Jan 01 01:34:01 vztlfr6 kernel: 0 pages hwpoisoned
Jan 01 01:34:01 vztlfr6 kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Jan 01 01:34:01 vztlfr6 kernel: [ 297] 0 297 12434 4291 30 3 0 0 systemd-journal
Jan 01 01:34:01 vztlfr6 kernel: [ 300] 0 300 10391 862 23 3 0 -1000 systemd-udevd
Jan 01 01:34:01 vztlfr6 kernel: [ 570] 0 570 2511 29 9 3 0 0 rdnssd
Jan 01 01:34:01 vztlfr6 kernel: [ 571] 104 571 4614 373 14 3 0 0 rdnssd
Jan 01 01:34:01 vztlfr6 kernel: [ 576] 100 576 25011 594 20 3 0 0 systemd-timesyn
Jan 01 01:34:01 vztlfr6 kernel: [ 1011] 0 1011 9270 666 23 3 0 0 rpcbind
Jan 01 01:34:01 vztlfr6 kernel: [ 1028] 0 1028 1272 374 8 3 0 0 iscsid
Jan 01 01:34:01 vztlfr6 kernel: [ 1029] 0 1029 1397 881 8 3 0 -17 iscsid
Jan 01 01:34:01 vztlfr6 kernel: [ 1036] 107 1036 9320 721 22 3 0 0 rpc.statd
Jan 01 01:34:01 vztlfr6 kernel: [ 1050] 0 1050 5839 49 16 3 0 0 rpc.idmapd
Jan 01 01:34:01 vztlfr6 kernel: [ 1207] 0 1207 13796 1320 31 3 0 -1000 sshd
Jan 01 01:34:01 vztlfr6 kernel: [ 1212] 0 1212 6146 916 17 3 0 0 smartd
Jan 01 01:34:01 vztlfr6 kernel: [ 1214] 109 1214 191484 9777 72 3 0 0 named
Jan 01 01:34:01 vztlfr6 kernel: [ 1216] 0 1216 58709 460 17 4 0 0 lxcfs
Jan 01 01:34:01 vztlfr6 kernel: [ 1218] 0 1218 1022 161 7 3 0 -1000 watchdog-mux
Jan 01 01:34:01 vztlfr6 kernel: [ 1219] 0 1219 4756 418 14 3 0 0 atd
Jan 01 01:34:01 vztlfr6 kernel: [ 1222] 0 1222 5459 649 13 3 0 0 ksmtuned
Jan 01 01:34:01 vztlfr6 kernel: [ 1227] 0 1227 4964 596 15 3 0 0 systemd-logind
Jan 01 01:34:01 vztlfr6 kernel: [ 1235] 106 1235 10558 825 27 3 0 -900 dbus-daemon
Jan 01 01:34:01 vztlfr6 kernel: [ 1271] 0 1271 206547 749 63 4 0 0 rrdcached
Jan 01 01:34:01 vztlfr6 kernel: [ 1287] 0 1287 64668 822 28 3 0 0 rsyslogd
Jan 01 01:34:01 vztlfr6 kernel: [ 1312] 0 1312 1064 386 8 3 0 0 acpid
Jan 01 01:34:01 vztlfr6 kernel: Out of memory: Kill process 8543 (kvm) score 279 or sacrifice child
Jan 01 01:34:01 vztlfr6 kernel: Killed process 8543 (kvm) total-vm:5808216kB, anon-rss:5007352kB, file-rss:10792kB
Jan 01 01:34:01 vztlfr6 CRON[4094]: pam_unix(cron:session): session closed for user root
Jan 01 01:34:02 vztlfr6 kernel: vmbr0: port 3(tap112i0) entered disabled state
Do you use ZFS? If yes, it can eat half of the memory by default: https://pve.proxmox.com/wiki/ZFS_on_Linux#_limit_zfs_memory_usage
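For reference, the limit from that wiki page is just a ZFS module option. A minimal example capping the ARC at 4 GiB would look like this (pick a value that fits your host; the second command only matters if ZFS is loaded from the initramfs, e.g. root on ZFS):
Code:
# /etc/modprobe.d/zfs.conf - cap the ZFS ARC at 4 GiB (value in bytes)
options zfs zfs_arc_max=4294967296
# regenerate the initramfs if ZFS is loaded from it, then reboot
update-initramfs -u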
I do not see this problem here, and I don't use ZFS on my Proxmox nodes, so ZFS seems like a plausible explanation.
I do use ZFS, but I also have the ARC limited to 2GB or 4GB (on 16GB and 28GB servers respectively; I haven't seen the error on any of the 48GB nodes yet). I have been seriously suspicious of ZFS lately: its performance under heavy write conditions is utterly abysmal no matter what tweaking I do. Actually, I can get it to go fast by disabling the write throttle, but then the kernel crashes under heavy writes, so that's no better. In any case, this appears to be a regression, since it wasn't happening previously.
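To double-check that such a cap is actually being honoured, the live ARC size and its configured maximum can be read straight from the kernel, e.g.:
Code:
# current ARC size and configured maximum, in bytes
awk '$1 == "size" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats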
Yes, sure. So what is eating your memory? Do you have a process list with memory usage from before the OOM occurs?
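If that is hard to catch by hand, a one-line cron job that snapshots the biggest consumers is usually enough to see what grew just before the kill; a rough sketch (the file names are only examples):
Code:
# /etc/cron.d/memlog - log the top memory consumers once a minute
* * * * * root (date; ps aux --sort=-rss | head -n 15) >> /var/log/memlog.txt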
Thank you, I will give it a try. It mostly kills kvm when it is backing up, but even during the backup the system has at least 10% free memory.
I recently upgraded the kernel from 4.4.13-2-pve to 4.4.35-1-pve. After the upgrade, Ceph OSDs would randomly get killed by the OOM killer even when there was plenty of RAM available; typically nearly all of the free RAM was consumed by cache when the OOM event occurred. Since raising min_free_kbytes, everything has been running stable:
Code:
echo 262144 > /proc/sys/vm/min_free_kbytes
Hello, I have had the same problem for a few days. A KVM guest (Windows 2012r2) is killed by the OOM killer. The Proxmox host has 32 GB of RAM, with 20 GB free. Storage is DRBD 8.4 (compiled with http://coolsoft.altervista.org/it/b...rnel-panic-downgrade-drbd-resources-drbd-9-84). Kernel 4.4.35-1-pve. I have three sites; the only difference is memory: 32 GB (with the problem), 64 and 128 GB (without the problem), all on the same kernel. I don't use ZFS or Ceph. I have an NFS storage for backups. I saw a VM killed once when a backup started, and other times during the day with no particular activity. I have encountered this problem since I updated Proxmox to this kernel. Today I migrated the VMs to another host to see what happens. Maybe I will test with a kernel version before 4.4.35-1-pve. For this:
Code:
echo 262144 > /proc/sys/vm/min_free_kbytes
does it have to be done on every boot?
Hi, you can put this in /etc/sysctl.d/pve.conf (or /etc/sysctl.d/90-my.conf), like:
Code:
vm.swappiness = 1
vm.min_free_kbytes = 262144
Udo
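PS: no reboot needed to activate the values; something like this should re-read /etc/sysctl.conf and everything under /etc/sysctl.d/ right away:
Code:
# reload all sysctl configuration files immediately
sysctl --system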
This seems to fix the OOM issue for now. But one of my nodes has a similar problem: during or right after the backup of all VMs, the KVM guest loses its disk connection and I have to drop the cache with
Code:
echo 1 > /proc/sys/vm/drop_caches
So how can I tame the cache usage? If I back up only that VM, it does not lose the connection to its KVM disk; it only happens when all 4 VMs are backed up.
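In case it helps to see what the page cache and dirty pages are doing while the backups run, a simple watch is enough, e.g.:
Code:
# refresh free/cache/dirty figures every 10 seconds during the backup
watch -n 10 'free -m; grep -E "^(Cached|Dirty|Writeback):" /proc/meminfo'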
I encountered the problem this night on 2 of my servers; it was also during a backup, and I do not use ZFS or Ceph. One of those servers worked perfectly with kernel 4.4.35 from 2016-12-20 until this minor upgrade:
Start-Date: 2017-01-03 08:14:02
Commandline: apt-get dist-upgrade
Upgrade: libpve-common-perl:amd64 (4.0-84, 4.0-85), pve-kernel-4.4.35-1-pve:amd64 (4.4.35-76, 4.4.35-77), libpve-storage-perl:amd64 (4.0-70, 4.0-71), pve-manager:amd64 (4.4-2, 4.4-5), libgd3:amd64 (2.1.0-5+deb8u7, 2.1.0-5+deb8u8), lxcfs:amd64 (2.0.5-pve1, 2.0.5-pve2), pve-qemu-kvm:amd64 (2.7.0-9, 2.7.0-10), pve-container:amd64 (1.0-89, 1.0-90), lxc-pve:amd64 (2.0.6-2, 2.0.6-5), proxmox-ve:amd64 (4.4-76, 4.4-77)
End-Date: 2017-01-03 08:15:08
Please note the 4.4.35-76 to 4.4.35-77 kernel upgrade. Since I did not see any mention of OOM changes in the kernel.org changelogs, is that a custom Proxmox patch? Or is it related to a change in backup behaviour? Please also note that the OOM kill happened while backing up an LXC container.
There have been some OOM-related cherry-picks from 4.7 into the Ubuntu kernel to fix https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1647400 , those might be at fault: https://git.kernel.org/cgit/linux/k.../?id=0a0337e0d1d134465778a16f5cbea95086e8e9e0 https://git.kernel.org/cgit/linux/k.../?id=ede37713737834d98ec72ed299a305d53e909f73
Can't reproduce this issue so far (even with very high memory pressure and load), so any more information to narrow down the contributing factors would help:
- used hardware
- used storage plugins
- memory and swap sizes
- circumstances triggering the OOM, ideally together with system logs and fine-grained atop or similar data
Edit: I can trigger the OOM killer and produce the stack trace mentioned earlier in this thread, but only when disabling swap and having less than a few hundred MB of actual free memory, i.e. the very situation where the OOM killer has to act to prevent a total system crash. Are you sure that you are not simply running out of memory?
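For anyone posting that information, a rough starting point for collecting it (adapt to your setup) could be:
Code:
# package/kernel versions, memory and swap, configured storages, recent OOM messages
pveversion -v
free -m
swapon -s
cat /etc/pve/storage.cfg
grep -i oom /var/log/syslog | tail -n 20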
Hello Fabian. For me: Motherboard: Supermicro X9DR3-F. Storage: DRBD v8.4 (compiled with the link mentioned above) for the VMs, and NFS on a Synology RS2212 for backups. Memory & swap size: 32 GB & 31 GB. Memtest OK. Some files are attached. Only the killed VM was running on the node concerned.
Hi Udo, it is 60:
Code:
root@mtp-prox02:~# cat /proc/sys/vm/swappiness
60
I checked this on several Proxmox hosts; all are at 60.