VM CPU issues: watchdog: BUG: soft lockup - CPU#7 stuck for 22s!

shalak

Member
May 9, 2021
Hello!

Every couple of weeks various VMs tend to become problematic and their consoles show the following errors:

[Attachment: vm_cpu_issues.png]


Rebooting the VM fixes the issue.

How do I diagnose what's going wrong? In that state of the VM, I'm unable to SSH into it.

Proxmox is installed on an HP ProLiant DL380e Gen8. I'm running pve-manager/7.1-11/8d529482 (running kernel: 5.13.19-6-pve).
 
What type of storage are you using for that VM? We're seeing similar issues when there is heavy IO on a NFS storage. Machines will lock up and go unresponsive.
 
Seen that here too when using USB HDDs and/or a faulty USB SSD. I guess the CPU locks up when a process is waiting for IO but there is a problem with the storage.

Is your PVE up to date? There was a bug last year causing these errors too, but that got fixed, so an up-to-date PVE 7.X shouldn't be affected.
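If you want to double-check, something like this on the PVE host shows the installed package versions and pulls in any pending updates; just the standard route, assuming your repositories are already set up for your subscription type:

Code:
# show installed Proxmox VE and kernel package versions
pveversion -v

# fetch and apply pending updates from the configured repositories
apt update && apt dist-upgrade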
 
I'm on Proxmox 7.1. Funnily enough, on the same hardware we had zero issues on Proxmox 6.X. It was only after the 7 upgrade that these issues started.
 
What type of storage are you using for that VM? We're seeing similar issues when there is heavy IO on a NFS storage. Machines will lock up and go unresponsive.
I have one VM that has a RAID volume passed through to it, and that VM creates CIFS shares to be used by another VM.

Both VMs have similar CPU issues.

My PVE is up to date.
 
On another Proxmox installation I have this issue:

[Attachment: Screenshot 2022-04-06 at 00.08.58.png]
Again, rebooting the VM solves the issue.

How can I fix this?
 
I am getting this on a host with a single VM using local SSD storage. I don't think network load has anything to do with it at all. This only started happening to our cluster with the upgrade to 7.
I am regretting doing that now.
 
I am getting this on a host with a single VM using local SSD storage. I don't think network load has anything to do with it at all. This only started happening to our cluster with the upgrade to 7.
I am regretting doing that now.
I don't think it is network related either. Any high disk IO causes havoc. I've noticed it with both NFS-based storage and local ZFS. I'm sure adding faster storage would mask the issue, but it would only mask it, not fix it.
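If you want to confirm that the freezes line up with IO stalls on the host, watching something like this while the problem is happening can help; it assumes a kernel with pressure stall information enabled (recent PVE kernels have it) and the sysstat package for iostat:

Code:
# rising "full" percentages mean tasks are completely stalled waiting on IO
cat /proc/pressure/io

# per-device utilisation and wait times, refreshed every 2 seconds
iostat -x 2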
 
What type of storage are you using for that VM? We're seeing similar issues when there is heavy IO on a NFS storage. Machines will lock up and go unresponsive.
I'm using Proxmox 8 (no-subscription repos, up to date as of Oct 19, 2023) and I see this issue in a VM that is on NFS storage (10Gb connection to a TrueNAS Scale NFS share), and also when I deleted a snapshot.

The server I'm running is:
Dell R620, 24 x Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz (2 Sockets) with 256GB RAM.
CPU load is low and RAM usage is only 30GB.

I think this is I/O load on the NFS.
I have 6 hypervisors using this datastore, with 3 VMs each, and each VM has 3 disks totaling 500GB.
2 of those VMs are using qcow and 1 is using raw as the disk format.

I wanted it on NFS because of the quick migration times when I do an update on one of the nodes in the cluster.
However, I'm thinking I'll move all the VMs with big disk usage to the ZFS datastore and keep the small VMs on the NFS.

[Attachment: 1697779502967.png]

If anybody has any suggestions on how I can optimize this, I'd be glad to hear them.
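In case it helps anyone doing the same, moving a disk to another storage can also be done from the CLI. A rough sketch, assuming VM ID 101, the disk attached as scsi0, and a ZFS storage named local-zfs (on newer PVE versions the command may also be spelled qm disk move):

Code:
# move scsi0 of VM 101 to the local-zfs storage and remove the source copy
qm move-disk 101 scsi0 local-zfs --delete 1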
 
I saw the same errors in dmesg, so I found the following:
https://ubuntu.com/blog/real-time-kernel-tuning
https://access.redhat.com/sites/def...-perf-brief-low-latency-tuning-rhel7-v1.1.pdf
https://h50146.www5.hpe.com/product.../support/whitepaper/pdfs/emr_na-c01804533.pdf

nosoftlockup - disables logging of backtraces when a process executes on a CPU for longer than the softlockup threshold (default 120 seconds).
mce=ignore_ce - ignores corrected errors and associated scans that can cause periodic latency spikes.

Code:
# Add the kernel parameters on the Proxmox host OS (or inside the VM):
# edit /etc/default/grub and append them to GRUB_CMDLINE_LINUX
GRUB_CMDLINE_LINUX="... nosoftlockup mce=ignore_ce"
Code:
update-grub

Then reboot.
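After the reboot you can check that the parameters were actually applied by reading the command line the running kernel was booted with:

Code:
# should now list nosoftlockup and mce=ignore_ce among the parameters
cat /proc/cmdline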
 
I saw the same errors in dmesg, so I found the following:
https://ubuntu.com/blog/real-time-kernel-tuning
https://access.redhat.com/sites/def...-perf-brief-low-latency-tuning-rhel7-v1.1.pdf
https://h50146.www5.hpe.com/product.../support/whitepaper/pdfs/emr_na-c01804533.pdf

nosoftlockup - disables logging of backtraces when a process executes on a CPU for longer than the softlockup threshold (default 120 seconds).
mce=ignore_ce - ignores corrected errors and associated scans that can cause periodic latency spikes.

Code:
# Add the kernel parameters on the Proxmox host OS (or inside the VM):
# edit /etc/default/grub and append them to GRUB_CMDLINE_LINUX
GRUB_CMDLINE_LINUX="... nosoftlockup mce=ignore_ce"
Code:
update-grub

Then reboot.
I don't think masking this is what most people are after though.
 
Same here. After the update from 6 to 7 the problems started; I moved to 8 and the problem persists, and I don't know what more to do. Before, there were zero problems. My system runs ZFS, and it's the same whenever a backup or a transfer causes high IO: soft lockups.

Does anybody know the last kernel without this problem?
 
The problem also shows up whenever you have any sort of high IO, it seems. I just transferred 60 GB between two VMs on the same server and it immediately hung the entire machine until one of the VMs crashed.

Is there any sort of mitigation for this? The VMs are already using VirtIO SCSI single and async IO...
 
The problem also shows up whenever you have any sort of high IO, it seems. I just transferred 60 GB between two VMs on the same server and it immediately hung the entire machine until one of the VMs crashed.

Is there any sort of mitigation for this? The VMs are already using VirtIO SCSI single and async IO...

Use SAN storage ("enterprise grade" = hardware offload supported): InfiniBand, FC, iSCSI.
Forget NAS storage (SMB/CIFS, NFS, ...).
 
This is a problem I've seen with other KVM-based virtualization, and iothreads for the IO and network should solve this. The issue is that the main qemu thread holds some mutexes/semaphores while dealing with IO, which is sometimes slow/blocking.
(if it's the same issue, but hopefully it's easy to test)
 
This is a problem I've seen with other KVM-based virtualization, and iothreads for the IO and network should solve this. The issue is that the main qemu thread holds some mutexes/semaphores while dealing with IO, which is sometimes slow/blocking.
(if it's the same issue, but hopefully it's easy to test)
Yes.

https://bugzilla.kernel.org/show_bug.cgi?id=199727

"i tried "virtio scsi single" with "aio=threads" and "iothread=1" in proxmox, and after that, even with totally heavy read/write io inside 2 VMs (located on the same spinning hdd on top of zfs lz4 + zstd dataset and qcow) and severe write starvation (some ioping >>30s), even while live migrating both vm disks in parallel to another zfs dataset on the same hdd, i get absolutely NO jitter in ping anymore. ping to both VMs is constantly at <0.2ms"


Oh, I just saw that I already mentioned this a while ago.
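For reference, those settings can be applied from the PVE host CLI as well as the GUI. A minimal sketch, assuming VM ID 101 and a disk volume local-zfs:vm-101-disk-0 attached as scsi0 (substitute your own VM ID and volume); the disk change needs a full stop/start of the VM to take effect:

Code:
# use the single VirtIO SCSI controller so each disk can get its own iothread
qm set 101 --scsihw virtio-scsi-single

# enable a dedicated iothread and threaded async IO on the disk
qm set 101 --scsi0 local-zfs:vm-101-disk-0,iothread=1,aio=threads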
 
I had similar issues in about half of my 30 VMs:
- CPU stuck messages
- NMI received for unknown reason
- VMs are lagging
- VMs are freezing where only a power-off is possible
The host is a Dell R630, 56 x Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz (2 sockets), with 768 GB RAM.

The system profile was "Performance" with C-states disabled, and different CPU types in the VMs.

For me, disabling Hyper-Threading (or in Dell terms "Logical Processor") in the BIOS did the trick. So now it's just "28 x Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz (2 Sockets)", but that is enough for the workload in my home lab. Since doing this I have never had such a message, lags, or VM hang-ups again.

Fun fact on the side: after disabling HT and reducing the vCPUs in one VM from 20 to 16, a regularly-run CPU-intensive task in that VM went from 12 minutes to 8 minutes.

I do not know for sure, but to me it looks like Dell's BIOS implementation of HT (Logical Processor) in some models can cause these problems.

I also have a Dell T1700 with a Xeon processor; not a real server, but a powerful PC. There HT is enabled, C-states are enabled, and the VMs have never had such problems at all.
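If you want to verify from the running host that HT/SMT is really off without going back into the BIOS, the kernel exposes it (generic Linux, nothing Dell- or Proxmox-specific):

Code:
# "Thread(s) per core: 1" means HT/SMT is effectively disabled
lscpu | grep -i 'thread(s) per core'

# reports on / off / forceoff / notsupported
cat /sys/devices/system/cpu/smt/control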
 
