Problem description
Tonight, one of my VMs started having issues:
- %sys increased
- CPU steal increased
- Finally, the VM froze with this message:
Code:
Aug 8 03:37:51 x kernel: [28047588.295379] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [lvectl:1511420]
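For context, the %sys and steal figures come from standard in-guest tooling; this is just how one would observe them (sar needs the sysstat package installed):
Code:
# inside the VM: CPU columns include 'sy' (system time) and 'st' (steal time)
vmstat 1 5
# per-interval %system and %steal (requires sysstat)
sar -u 1 5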
(Screenshot attached: CPU and load statistics)
The VM froze completely, so no logs are available. I thought the hypervisor might not have had enough CPU, but it was using just 60% of its 48 threads at the time of the freeze. The VM had been assigned 35 cores (1 socket).
Then, I live-migrated the VM to another hypervisor that was using just 14% of 40 threads. After the migration finished, %sys did not decrease, making it almost impossible for this to be a hardware issue. So I force-stopped the VM and tried to start it again (rough CLI equivalents of these steps are sketched below), but now started getting OOMs directly on boot:
(Screenshots attached: OOM messages)
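For reference, the migration and restart were done through the Proxmox GUI; the rough CLI equivalents would be (VM ID 138 from the config below, the target node name is just an example):
Code:
# live-migrate VM 138 to another node while it keeps running
qm migrate 138 proxmox03 --online
# force-stop the frozen VM, then start it again
qm stop 138
qm start 138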
I tried decreasing the VM's specs to 12 GB RAM and 8 CPU cores, but kept getting OOMs, although the messages no longer appeared directly on boot. It would take about 30 seconds for the VM to show these errors and freeze again, as opposed to ~5 seconds when the VM had more RAM and CPU cores assigned.
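The downsizing was also done in the GUI; as a sketch, the equivalent CLI change (memory in MiB) would be:
Code:
# temporarily shrink the VM to 12 GB RAM and 8 cores
qm set 138 --memory 12288 --cores 8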
I then restored the original specs and disabled ballooning. This alone did not help. However, disabling ballooning and changing the CPU type to 'host' (instead of kvm64) seems to have helped: the VM has now been running for about two hours, whereas without both changes it would freeze and OOM after seconds. I'm sure this is not a hardware problem, as the same problem occurred on two hypervisors.
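The combination that appears to work, expressed as a rough CLI equivalent of the GUI changes:
Code:
# disable memory ballooning and expose the host CPU model to the guest
qm set 138 --balloon 0 --cpu host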
Environment details
Versions
- The VM is running CloudLinux OS with kernel 3.10.0-962.3.2.lve1.5.38.el7.x86_64
- Proxmox cluster nodes are running Proxmox VE 6.2-6
- The Proxmox cluster was last upgraded several weeks ago
- The VM was last updated on August 3 (just ClamAV updates)
One of the hypervisors:
Code:
root@proxmox03:~# pveversion
pve-manager/6.2-6/ee1d7754 (running kernel: 5.4.44-1-pve)
root@proxmox03:~# uname -a
Linux proxmox03 5.4.44-1-pve #1 SMP PVE 5.4.44-1 (Fri, 12 Jun 2020 08:18:46 +0200) x86_64 GNU/Linux
VM's current config file
Code:
root@proxmox03:~# cat /etc/pve/qemu-server/138.conf
agent: 1
balloon: 0
bootdisk: scsi0
cores: 30
cpu: host
cpuunits: 2048
ide2: none,media=cdrom
memory: 49152
name: xxx
net0: virtio=6E:4A:F3:A2:9D:3A,bridge=vmbr172,firewall=1
net1: virtio=46:CB:96:E2:BC:02,bridge=vmbr999,link_down=1,tag=69
numa: 0
ostype: l26
protection: 1
scsi0: rbd_vm:vm-138-disk-1,size=1T
scsihw: virtio-scsi-pci
smbios1: uuid=xxx
sockets: 1
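To sanity-check that the 'host' CPU type and disabled ballooning actually end up in the generated QEMU command line (just a verification step, not part of the fix), this should print it:
Code:
root@proxmox03:~# qm showcmd 138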
Question
This issue started occurring seemingly out of the blue.
I understand that a VM with lots of RAM and ballooning enabled may be problematic. I know OOMs at boot may occur because not enough RAM is allocated; I've seen this before.
But something else seems to be wrong, because the earlier 'soft lockups' do not make sense (enough CPU was and is available). And, of course, disabling ballooning alone did not solve the issue; only the combination of disabling ballooning and setting the CPU type to 'host' seems to have helped.
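In case it matters, this is how I would check from inside the guest whether the balloon device is still exposed (names assume the standard virtio_balloon driver):
Code:
# inside the VM: is the balloon driver loaded and the device present?
lsmod | grep virtio_balloon
lspci | grep -i balloon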
Question: what is wrong here?