Proxmox 6.1 OOM killed a VM

Thomas Plant

Hello,

Friday evening one of the four hosts in our PVE cluster ran out of memory. It killed the qemu process of one of our VMs; fortunately there was no apparent damage, and it restarted without problems (apart from the obligatory disk check). The killed VM is a fully patched 64-bit CentOS 6. At the moment the host is using 74 of 128 available GB of memory, so there should be plenty of free space left.
I do not know if this is of interest, but the VM was live-migrated from another host a few weeks ago.

How can we diagnose what happened? I attach the OOM message from the logs and the package versions the host is currently running.
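(In case it is useful, the OOM messages can be pulled from the kernel log e.g. like this; the amount of context is arbitrary:)
Code:
journalctl -k | grep -i -B 5 -A 60 "invoked oom-killer"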

Regards,
Thomas
 

Attachments

  • oom.txt
    17.1 KB
  • packages.txt
    1.1 KB
It does look like you were OOM, but there are no other heavy memory users listed besides your VMs. Are you using ZFS on this system? If so, you probably want to limit its cache (ARC).
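For reference, the ARC limit is usually set via a module parameter; a minimal sketch (the 8 GiB value is only an example, size it for your workload):
Code:
# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=8589934592

# make it effective at the next boot, or set it on the running system:
update-initramfs -u
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max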
 
Found another VM on this host whose kvm process is using way too much memory. It is configured with 8 GB of RAM, but according to 'htop' it is using 32 GB...

(htop screenshot attached)
 
At the moment the host is using 74 of 128 available GB of memory
At the moment = at the time of writing? Because when your VM was killed, it alone consumed about 64 GB of memory (the rss column in the OOM output below is in 4 KiB pages: 16921602 pages x 4 KiB ≈ 64.5 GiB):
pve7 kernel: [2692473.134678] [ 1578] 0 1578 31575117 16921602 224509952 130079 0 kvm




so there should be plenty of free space left.
Swap was full:
pve7 kernel: [2692473.134511] Free swap = 0kB
and your memory was fragmented; no higher-order pages were available any more:
pve7 kernel: [2692473.134483] Node 0 Normal: 4448*4kB (UMEH) 3655*8kB (UMEH) 900*16kB (UEH) 203*32kB (UEH) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 67928kB
pve7 kernel: [2692473.134491] Node 1 Normal: 7012*4kB (UMEH) 2353*8kB (UMEH) 1210*16kB (UMEH) 160*32kB (UE) 6*64kB (U) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 71736kB
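You can check how fragmented the host's memory currently is via the buddy allocator statistics, and trigger a compaction manually if needed (this only treats the symptom, not the underlying memory pressure):
Code:
cat /proc/buddyinfo
echo 1 > /proc/sys/vm/compact_memory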

Can you post your VM configs?
Code:
cat /etc/pve/qemu-server/147.conf
 
Yes, the memory consumption I wrote about was at the time of writing.

Here is the config of the VM:
Code:
root@pve7:~# cat /etc/pve/qemu-server/147.conf
agent: 1
boot: cdn
bootdisk: scsi0
cores: 2
cpu: Broadwell
ide2: none,media=cdrom
memory: 8192
name: VS04
net0: virtio=FE:B8:3F:6C:90:35,bridge=vmbr0
net1: virtio=86:D2:BE:60:3F:23,bridge=vmbr1
numa: 1
onboot: 1
ostype: win10
protection: 1
scsi0: NFS01:147/vm-147-disk-0.qcow2,discard=on,size=64G
scsi1: NFS01:147/vm-147-disk-1.qcow2,discard=on,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=a00bb877-3e48-4621-a511-4cfa1030ce63
sockets: 2
vmgenid: 31000dd5-312a-4562-a3c1-aee6d9ad111c
 
Does this happen with any VMs that do not use NFS? Could you temporarily move (or clone) your VM to another storage?
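A full clone to a different storage should be enough for such a test, for example (the new VMID and the target storage name are just placeholders):
Code:
qm clone 147 9147 --full --storage local-lvm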
 
All our VMs are on NFS, and all are in production, so fiddling around a lot is a problem. I could clone it to local storage and start it without network, is this sufficient for a test?
 
Live-migrated two VMs with very high memory consumption to another host; now (obviously) RAM usage is in line with the configured RAM. Will keep an eye on them. How much memory overhead is normal?
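For reference, I compare the configured memory with the resident memory of the kvm process on the host roughly like this (147 is just an example VMID; on PVE the QEMU processes show up under the name kvm):
Code:
qm config 147 | grep '^memory'
ps -o pid,rss,args -C kvm | grep -- "-id 147 "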
 
The VM we had to restart last Friday now has 6.4 GB of RAM in use again, see the attached screenshot. Are all those threads that show up in the screenshot (made with htop) normal?
 

Attachments

  • 2020-01-13_14h04_33.png
    184.6 KB
I could clone it to local storage and start it without network, is this sufficient for a test?
I think it could be worth a try. See here for some examples of other NFS problems.

Are all those threads that show up in the screenshot (made with htop) normal?
Yes. You can use F5 in htop for a tree view.

now has 6.4 GB of RAM in use again
I assume this VM has the same amount of memory set as VM 147? In that case it is still below the 8 GB.
 
Sorry, I wrote too fast about local storage; it is not possible, there is not enough space on the local disks, the VMs are all too big.
The VM from the screenshot has 4 GB of RAM configured. It has now risen to 7.3 GB of used RAM and is still rising slowly.

The VM I migrated to another host has risen to 8.6 GB of used RAM on the host and has 8 GB configured. I will wait a little longer and report if it rises further.
 
Could it be some remnant from XenServer, from which the VMs were imported? But other VMs which are working normally were also imported from XenServer...
 
Installed the latest updates for PVE as I saw that qemu was updated, hoping that it would help.
Added a little script which writes the 'rss' column of the 'ps -eo rss,command' output to a file every second. I made a primitive graph, attached to this post, where we can clearly see that around 4 or 5 o'clock in the morning RAM usage jumps up by ~150 MB. This is approximately the time we do the backups of this machine. On the other hand, there are 50+ VMs which do not show this behaviour.
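The script is nothing fancy, roughly this (the VMID and the output path are placeholders):
Code:
#!/bin/bash
# append a timestamp and the RSS (in kB) of one VM's kvm process, once per second
VMID=147
OUT=/root/vm${VMID}-rss.log
while true; do
    RSS=$(ps -eo rss,command | grep "[-]id ${VMID} " | awk '{print $1}')
    echo "$(date '+%F %T') ${RSS}" >> "${OUT}"
    sleep 1
done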

How can we debug this issue? Moving to another storage is not an option, as for us the only other method of accessing our SAN would be iSCSI, and iSCSI on LVM on Proxmox does not support snapshots.
 

Attachments

  • 2020-01-22_09h16_14.png
    20 KB
At first glance that would point at a memory leak somewhere in our backup code (or somewhere related in QEMU). Has this system been running stable before? If so, it would be interesting to see whether the change in stability happened with an upgrade, and which packages got upgraded (/var/log/apt/history.log).
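For example, to get a quick overview of when which packages were upgraded (including the rotated logs):
Code:
zgrep -h -E '^(Start-Date|Upgrade):' /var/log/apt/history.log*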
 
Ah, no, we are not backing up with Proxmox. We do a backup internally inside the VM at that time. Sorry for the confusion.
 
Have you been able to track the problem down further in the meantime? A user in the German forum was able to solve a similar problem by (among other things) using CIFS instead of NFS.
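If you want to test that, a CIFS storage can be added much like an NFS one, roughly as follows (storage ID, server, share and credentials are placeholders):
Code:
pvesm add cifs nas-cifs --server 192.168.1.10 --share vmstore --username backup --password 'secret' --content images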
 
I'm having the same problem. A certain CT (CT only, not a VM!) gets killed periodically on a Proxmox server, with no server logs, only a dmesg message! The server is used very, very lightly.
I have 128 GB RAM and 1 TB ZFS NVMe (ARC limited to 24 GB), do periodic memory defragmentation, and even enabled swap to test if it helps, etc.
NOTHING HELPS.

One of the VMs is periodically killed even though it has no real load. It's only a router, i.e. mostly kernel work, and even that VM gets killed.

I have never had any OOM either in LXC/LXD (on ZFS too) or in the old OpenVZ, under whatever server/memory load. This issue exists on Proxmox only; you guys are doing something custom, and this "something" is really nasty. And it has existed at least since version 5, which I stopped using just because of this issue. I hoped that 6 and different kernels would cure the issue, but it is not cured.

Now set up a 3-node Proxmox HA cluster, with 2x 10-core Xeon CPUs, 128 GB RAM and 1 TB mirrored NVMe.

For now, HA recovers the killed VM, but that is kind of a nasty workaround.

- cron
30 2 * * * root echo 1 > /proc/sys/vm/compact_memory

- sysctl.conf
vm.swappiness = 10
vm.max_map_count=262144

Code:
Aug 13 16:41:01 fr2 kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=ns,mems_allowed=0-3,oom_memcg=/lxc/114,task_memcg=/lxc/114/ns,task=systemd,pid=12720,uid=100
Aug 13 16:41:01 fr2 kernel: Memory cgroup out of memory: Killed process 12720 (systemd) total-vm:169288kB, anon-rss:2120kB, file-rss:0kB, shmem-rss:0kB, UID:100000 pgtables:100kB o
Aug 13 16:41:01 fr2 kernel: oom_reaper: reaped process 12720 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Aug 13 16:41:01 fr2 kernel: Memory cgroup out of memory: Killed process 21813 (rsyslogd) total-vm:156184kB, anon-rss:1464kB, file-rss:0kB, shmem-rss:0kB, UID:100000 pgtables:80kB o
Aug 13 16:41:01 fr2 kernel: oom_reaper: reaped process 21813 (rsyslogd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
 
Please also post
Code:
pct config 114
journalctl -u pve-container@114
cat /var/lib/lxc/114/config
from the host and syslog from within the container. Is there maybe a huge amount of logging going on in the container? Would it be possible to use a virtual machine instead of the container?
 
Please also post
Code:
pct config 114
journalctl -u pve-container@114
cat /var/lib/lxc/114/config
from the host and syslog from within the container. Is there maybe a huge amount of logging going on in the container? Would it be possible to use a virtual machine instead of the container?

arch: amd64
cores: 2
hostname: turnserver1
memory: 16384
net0: name=eth0,bridge=vmbr1,gw=HIDDEN.251,hwaddr=E6:D1:C8:2F:2E:F5,ip=HIDDEN/27,tag=1,type=veth
net1: name=eth1,bridge=vmbr0,gw6=HIDDEN,hwaddr=B2:1C:57:1B:2D:89,ip6=HIDDEN/64,type=veth
onboot: 1
ostype: debian
rootfs: spindle-containers:subvol-114-disk-0,size=5G
swap: 512
unprivileged: 1

lxc.cgroup.relative = 0
lxc.cgroup.dir.monitor = lxc.monitor/114
lxc.cgroup.dir.container = lxc/114
lxc.cgroup.dir.container.inner = ns
lxc.arch = amd64
lxc.include = /usr/share/lxc/config/debian.common.conf
lxc.include = /usr/share/lxc/config/debian.userns.conf
lxc.seccomp.profile = /usr/share/lxc/config/pve-userns.seccomp
lxc.apparmor.profile = generated
lxc.apparmor.raw = deny mount -> /proc/,
lxc.apparmor.raw = deny mount -> /sys/,
lxc.mount.auto = sys:mixed
lxc.monitor.unshare = 1
lxc.idmap = u 0 100000 65536
lxc.idmap = g 0 100000 65536
lxc.tty.max = 2
lxc.environment = TERM=linux
lxc.uts.name = turnserver1
lxc.cgroup.memory.limit_in_bytes = 17179869184
lxc.cgroup.memory.memsw.limit_in_bytes = 17716740096
lxc.cgroup.cpu.shares = 1024
lxc.rootfs.path = /var/lib/lxc/114/rootfs
lxc.net.0.type = veth
lxc.net.0.veth.pair = veth114i0
lxc.net.0.hwaddr = E6:D1:C8:2F:2E:F5
lxc.net.0.name = eth0
lxc.net.0.script.up = /usr/share/lxc/lxcnetaddbr
lxc.net.1.type = veth
lxc.net.1.veth.pair = veth114i1
lxc.net.1.hwaddr = B2:1C:57:1B:2D:89
lxc.net.1.name = eth1
lxc.net.1.script.up = /usr/share/lxc/lxcnetaddbr
lxc.cgroup.cpuset.cpus = 6-7


Logging - there is indeed quite a bit of logging via syslog, say 10-30 log lines per sec. Why should that matter?
 
