[SOLVED] OOMs and soft lockups while enough RAM and CPU are available

Problem description

Tonight, one of my VMs started having issues:

  • %sys increased
  • CPU steal increased (see the monitoring sketch below this list)
  • Finally, the VM froze with this message:
    Code:
    Aug  8 03:37:51 x kernel: [28047588.295379] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [lvectl:1511420]
    (CPU# would differ - not just CPU#1)
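For reference, this is roughly how I was watching %sys and steal from inside the VM; nothing exotic, and the sample interval is arbitrary:

Code:
# Inside the VM: 1-second samples; 'sy' is system time, 'st' is steal
vmstat 1 5

# Per-core breakdown (requires the sysstat package)
mpstat -P ALL 1 5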

[Screenshot: CPU and load statistics]

The VM froze completely, so no logs are available. I thought the hypervisor might not have enough CPU, but it was using just 60% of 48 threads at the time of the freeze. The VM had been assigned 35 cores (1 socket).
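(For completeness, a sketch of how I checked the hypervisor's capacity and load:)

Code:
# On the hypervisor: thread count plus a load/CPU snapshot
nproc
top -bn1 | head -n 5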

Then, I live-migrated the VM to another hypervisor that was using just 14% of 40 threads. After migration finished, %sys did not decrease, which makes a hardware issue very unlikely. So I force-stopped the VM and tried to start it again, but now it started OOMing immediately at boot, as the screenshots below show.
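The migration and restart were done with the usual qm commands; a sketch (the target node name here is made up):

Code:
# Live migration to the less loaded node
qm migrate 138 proxmox07 --online

# Force stop and start again after %sys stayed high
qm stop 138
qm start 138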

[Screenshots: OOM messages on the VM console]

I tried decreasing the VM's specs to 12 GB RAM and 8 CPU cores, but kept getting OOMs, although the messages no longer appeared immediately at boot: it now took about 30 seconds for the VM to show these errors and freeze again, as opposed to ~5 seconds with more RAM and CPU cores assigned.
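For reference, the resizing was a plain qm set (memory is given in MiB):

Code:
# Shrink to 12 GB RAM and 8 cores (12 * 1024 = 12288 MiB)
qm set 138 --memory 12288 --cores 8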

I then restored the original specs and disabled ballooning. That alone did not help. However, disabling ballooning and changing the CPU type to 'host' (instead of kvm64) seems to have helped: the VM has now been running for about two hours, while before these two changes it would freeze and OOM within seconds. I'm confident this is not a hardware problem, as the same issue occurred on two hypervisors.
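Expressed as qm commands, the changes that finally helped were along these lines (matching the config further down):

Code:
# Disable ballooning and expose the host CPU model instead of kvm64
qm set 138 --balloon 0 --cpu host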

Environment details

Versions

  • The VM is running CloudLinux OS with kernel 3.10.0-962.3.2.lve1.5.38.el7.x86_64
  • Proxmox cluster nodes are running Proxmox VE 6.2-6
  • The Proxmox cluster was last upgraded several weeks ago
  • The VM was last updated on August 3 (ClamAV updates only)

One of the hypervisors:

Code:
root@proxmox03:~# pveversion
pve-manager/6.2-6/ee1d7754 (running kernel: 5.4.44-1-pve)

root@proxmox03:~# uname -a
Linux proxmox03 5.4.44-1-pve #1 SMP PVE 5.4.44-1 (Fri, 12 Jun 2020 08:18:46 +0200) x86_64 GNU/Linux


VM's current config file

Code:
root@proxmox03:~# cat /etc/pve/qemu-server/138.conf
agent: 1
balloon: 0
bootdisk: scsi0
cores: 30
cpu: host
cpuunits: 2048
ide2: none,media=cdrom
memory: 49152
name: xxx
net0: virtio=6E:4A:F3:A2:9D:3A,bridge=vmbr172,firewall=1
net1: virtio=46:CB:96:E2:BC:02,bridge=vmbr999,link_down=1,tag=69
numa: 0
ostype: l26
protection: 1
scsi0: rbd_vm:vm-138-disk-1,size=1T
scsihw: virtio-scsi-pci
smbios1: uuid=xxx
sockets: 1


Question

This issue started occurring seemingly out of the blue.

I understand that ballooning on a VM with lots of RAM may be problematic. I also know OOMs at boot may occur when not enough RAM is allocated; I've seen that before.
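In case it helps anyone diagnosing something similar: while ballooning is still enabled, the balloon state can be inspected from the host via the QEMU monitor; a sketch:

Code:
# On the hypervisor: query the balloon device of VM 138
qm monitor 138
qm> info balloon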

But something else seems to be wrong: the earlier soft lockups make no sense, as enough CPU was (and is) available, and disabling ballooning alone did not solve the issue. Only the combination of disabling ballooning and setting the CPU type to 'host' seems to have helped.
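To verify that the guest actually sees the host CPU model after the change, something like this inside the VM suffices:

Code:
# Inside the VM: the model should now match the physical CPU,
# not the 'Common KVM processor' that kvm64 reports
grep -m1 'model name' /proc/cpuinfo
grep -m1 flags /proc/cpuinfo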

Question: what is wrong here?
 