[SOLVED] OOMs and soft lockups while enough RAM and CPU are available

Problem description

Tonight, one of my VMs started having issues:

  • %sys increased
  • CPU steal increased (see the monitoring sketch below this list)
  • Finally, the VM froze with this message:
    Code:
    Aug  8 03:37:51 x kernel: [28047588.295379] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [lvectl:1511420]
    (CPU# would differ - not just CPU#1)
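For reference, this is roughly how I was watching %sys and steal from inside the VM; nothing exotic, and the sample interval is arbitrary:

Code:
# Inside the VM: 1-second samples; 'sy' is system time, 'st' is steal
vmstat 1 5

# Per-core breakdown (requires the sysstat package)
mpstat -P ALL 1 5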

[Screenshot: CPU and load statistics]

The VM froze completely, so no logs are available. I thought the hypervisor might not have enough CPU, but it was using just 60% of 48 threads at the time of the freeze. The VM had been assigned 35 cores (1 socket).
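(For completeness, a sketch of how I checked the hypervisor's capacity and load:)

Code:
# On the hypervisor: thread count plus a load/CPU snapshot
nproc
top -bn1 | head -n 5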

Then, I live-migrated the VM to another hypervisor that was using just 14% of 40 threads. After migration finished, %sys did not decrease, which makes a hardware issue very unlikely. So I force-stopped the VM and tried to start it again, but now it started OOMing immediately at boot, as the screenshots below show.
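The migration and restart were done with the usual qm commands; a sketch (the target node name here is made up):

Code:
# Live migration to the less loaded node
qm migrate 138 proxmox07 --online

# Force stop and start again after %sys stayed high
qm stop 138
qm start 138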

[Screenshots: OOM messages on the VM console]

I tried decreasing the VM's specs to 12 GB RAM and 8 CPU cores, but kept getting OOMs, although the messages no longer appeared immediately at boot: it now took about 30 seconds for the VM to show these errors and freeze again, as opposed to ~5 seconds with more RAM and CPU cores assigned.
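For reference, the resizing was a plain qm set (memory is given in MiB):

Code:
# Shrink to 12 GB RAM and 8 cores (12 * 1024 = 12288 MiB)
qm set 138 --memory 12288 --cores 8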

I then restored the original specs and disabled ballooning. That alone did not help. However, disabling ballooning and changing the CPU type to 'host' (instead of kvm64) seems to have helped: the VM has now been running for about two hours, while before these two changes it would freeze and OOM within seconds. I'm confident this is not a hardware problem, as the same issue occurred on two hypervisors.
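Expressed as qm commands, the changes that finally helped were along these lines (matching the config further down):

Code:
# Disable ballooning and expose the host CPU model instead of kvm64
qm set 138 --balloon 0 --cpu host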

Environment details

Versions

  • The VM is running CloudLinux OS with kernel 3.10.0-962.3.2.lve1.5.38.el7.x86_64
  • Proxmox cluster nodes are running Proxmox VE 6.2-6
  • The Proxmox cluster was last upgraded several weeks ago
  • The VM was last updated on August 3 (ClamAV updates only)

One of the hypervisors:

Code:
root@proxmox03:~# pveversion
pve-manager/6.2-6/ee1d7754 (running kernel: 5.4.44-1-pve)

root@proxmox03:~# uname -a
Linux proxmox03 5.4.44-1-pve #1 SMP PVE 5.4.44-1 (Fri, 12 Jun 2020 08:18:46 +0200) x86_64 GNU/Linux


VM's current config file

Code:
root@proxmox03:~# cat /etc/pve/qemu-server/138.conf
agent: 1
balloon: 0
bootdisk: scsi0
cores: 30
cpu: host
cpuunits: 2048
ide2: none,media=cdrom
memory: 49152
name: xxx
net0: virtio=6E:4A:F3:A2:9D:3A,bridge=vmbr172,firewall=1
net1: virtio=46:CB:96:E2:BC:02,bridge=vmbr999,link_down=1,tag=69
numa: 0
ostype: l26
protection: 1
scsi0: rbd_vm:vm-138-disk-1,size=1T
scsihw: virtio-scsi-pci
smbios1: uuid=xxx
sockets: 1


Question

This issue started occurring seemingly out of the blue.

I understand that ballooning on a VM with lots of RAM may be problematic. I also know OOMs at boot may occur when not enough RAM is allocated; I've seen that before.
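In case it helps anyone diagnosing something similar: while ballooning is still enabled, the balloon state can be inspected from the host via the QEMU monitor; a sketch:

Code:
# On the hypervisor: query the balloon device of VM 138
qm monitor 138
qm> info balloon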

But something else seems to be wrong: the earlier soft lockups make no sense, as enough CPU was (and is) available, and disabling ballooning alone did not solve the issue. Only the combination of disabling ballooning and setting the CPU type to 'host' seems to have helped.
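To verify that the guest actually sees the host CPU model after the change, something like this inside the VM suffices:

Code:
# Inside the VM: the model should now match the physical CPU,
# not the 'Common KVM processor' that kvm64 reports
grep -m1 'model name' /proc/cpuinfo
grep -m1 flags /proc/cpuinfo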

Question: what is wrong here?
 