VM CPU issues: watchdog: BUG: soft lockup - CPU#7 stuck for 22s!

Hello!

Every couple of weeks various VMs tend to become problematic and their consoles show the following errors:

[Screenshot: VM console filled with "watchdog: BUG: soft lockup - CPU#7 stuck for 22s!" errors]


Rebooting the VM fixes the issue.

How do I diagnose what's going wrong? In that state of the VM, I'm unable to SSH into it.

Proxmox is installed on an HP ProLiant DL380e Gen8. I'm running pve-manager/7.1-11/8d529482 (running kernel: 5.13.19-6-pve).
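
Here's what I plan to collect the next time it happens, just as a rough sketch (VM ID 100 is a placeholder, and iostat needs the sysstat package):

Code:
# On the PVE host while the guest is unresponsive:
qm status 100                                   # is the QEMU process still alive?
dmesg -T | tail -n 50                           # host-side storage/controller errors around the lockup
iostat -x 1 5                                   # high await / %util points at the backing storage

# Inside the VM after the reboot (needs a persistent journal):
journalctl -k -b -1 | grep -iA20 "soft lockup"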
 
What type of storage are you using for that VM? We're seeing similar issues when there is heavy IO on NFS storage. Machines lock up and go unresponsive.
 
I've seen that here too when using USB HDDs and/or a faulty USB SSD. I guess the CPU locks up when a process is waiting for IO but there is a problem with the storage.

Is your PVE up to date? There was a bug last year causing these errors too, but that got fixed, so a PVE 7.x installation shouldn't be affected.
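
If it helps, a quick way to check both points at once (nothing specific to that old bug, just generic commands):

Code:
pveversion -v | head -n 5                                  # confirm pve-manager and kernel package versions
dmesg -T | grep -iE "soft lockup|blocked for more than"    # lockups that come with blocked tasks usually point at storage IO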
 
I'm on Proxmox 7.1. Funnily enough, on the same hardware we had zero issues on Proxmox 6.x. It was only after the upgrade to 7 that these issues started.
 
What type of storage are you using for that VM? We're seeing similar issues when there is heavy IO on NFS storage. Machines lock up and go unresponsive.
I have one VM that has a RAID volume passed through to it, and that VM exposes CIFS shares that are used by another VM.

Both VMs have similar CPU issues.

My PVE is up to date.
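
Next time it locks up I'll also check the passed-through disks from the host side; a rough sketch (/dev/sdb is a placeholder, smartctl comes from the smartmontools package):

Code:
smartctl -a /dev/sdb | grep -iE "reallocated|pending|uncorrectable|error"   # disk health counters
dmesg -T | grep -iE "reset|I/O error|timeout"                               # controller/disk resets during the hang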
 
On another Proxmox installation I have this issue:

[Screenshot: VM console showing the same soft lockup errors]
Again, rebooting the VM solves the issue.

How can I fix this?
 
I am getting this on a host with a single VM using local SSD storage. I don't think network load has anything to do with it at all. This only started happening to our cluster with the upgrade to 7.
I am regretting doing that now.
 
I am getting this on a host with a single VM using local SSD storage. I don't think network load has anything to do with it at all. This only started happening to our cluster with the upgrade to 7.
I am regretting doing that now.
I don't think it is network related either. Any high disk IO causes havoc. I've noticed it with both NFS-based storage and local ZFS. I'm sure adding faster storage would hide the problem, but it would only mask the issue, not fix it.
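
If anyone wants to trigger it deliberately, something like this inside a test VM generates the kind of sustained IO I'm talking about (fio and the test path are just my own choices):

Code:
# Two minutes of sustained sequential writes; watch the guest console/dmesg for new soft-lockup lines.
fio --name=hammer --filename=/tmp/fio.test --size=4G --bs=1M --rw=write \
    --ioengine=libaio --direct=1 --iodepth=16 --runtime=120 --time_based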
 
What type of storage are you using for that VM? We're seeing similar issues when there is heavy IO on NFS storage. Machines lock up and go unresponsive.
I'm using Proxmox 8 (no-subscription repos, up to date as of Oct 19, 2023) and I see this issue in a VM that is on NFS storage (10Gb connection to a TrueNAS Scale NFS share); it also happened when I deleted a snapshot.

The server I'm running is:
Dell R620, 24 x Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz (2 sockets) with 256GB RAM.
CPU load is low and RAM usage is only 30GB.

I think this is I/O load on the NFS.
I have 6 hypervisors using this datastore, with 3 VMs each, and each VM has 3 disks totaling 500GB.
Two of those VMs use qcow2 and one uses raw as the disk format.

I wanted it on NFS because of the quick migration times when I update one of the nodes in the cluster.
However, I'm thinking that I'll move the VMs with big disks to the ZFS datastore and keep the small VMs on NFS.
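
For the move itself I'd do something like this (VM ID, disk name and storage IDs are placeholders from my own setup):

Code:
# Move a disk from the NFS datastore to local ZFS while the VM keeps running;
# on a ZFS pool the image ends up as a raw zvol regardless of the source format.
qm disk move 101 scsi0 local-zfs --delete 1
# (Older releases spell this "qm move-disk" / "qm move_disk".)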


If anybody has any suggestions on how I can optimize this, I'd be glad to hear them.
 
I saw the same errors in dmesg, so I found the following:
https://ubuntu.com/blog/real-time-kernel-tuning
https://access.redhat.com/sites/def...-perf-brief-low-latency-tuning-rhel7-v1.1.pdf
https://h50146.www5.hpe.com/product.../support/whitepaper/pdfs/emr_na-c01804533.pdf

nosoftlockup - disables logging of backtraces when a process executes on a CPU for longer than the softlockup threshold (default 120 seconds).
mce=ignore_ce - ignores corrected errors and associated scans that can cause periodic latency spikes.

Code:
# Add the kernel parameters to the Proxmox host OS or to the VM:
# edit /etc/default/grub and append them to GRUB_CMDLINE_LINUX
GRUB_CMDLINE_LINUX="... nosoftlockup mce=ignore_ce"
Code:
update-grub

Reboot
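
After the reboot you can confirm the parameters were actually picked up; note that hosts booting via systemd-boot (e.g. ZFS root on UEFI) take their parameters from /etc/kernel/cmdline followed by proxmox-boot-tool refresh instead of update-grub:

Code:
cat /proc/cmdline | tr ' ' '\n' | grep -E "nosoftlockup|mce"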
 
I saw the same errors in dmesg, so I found the following: [...] add the kernel parameters "nosoftlockup mce=ignore_ce" and run update-grub.
I don't think masking this is what most people are after though.
 
Same here: the problem started after the update from 6 to 7, and after moving to 8 it persists. I don't know what more to do; before, there were zero problems. My system runs ZFS, and it's the same whenever a backup or transfer causes high IO and a soft lockup follows.

Does anybody know the last kernel without this problem?
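
In the meantime I'm thinking of throttling the backups so they can't starve the guests, something like this in /etc/vzdump.conf (the value is only an example, not a recommendation):

Code:
# /etc/vzdump.conf: cap backup IO bandwidth
bwlimit: 100000     # KiB/s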
 
The problem also shows up whenever you have any sort of high IO, it seems. I just transferred 60GB between two VMs on the same server and it immediately hung the entire machine until one of the VMs crashed.

Is there any sort of mitigation for this? The VMs are already using VirtIO SCSI single and async IO...
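
One more experiment on my list (the VM ID and volume name below are placeholders, copy your own line from qm config first): pinning the disk's async IO mode to native explicitly, with a dedicated IO thread.

Code:
qm config 102 | grep scsi0                                        # note the current disk definition
qm set 102 --scsi0 local-zfs:vm-102-disk-0,aio=native,iothread=1
# A full stop/start of the VM (not just a reboot inside the guest) is needed for the new aio setting to apply.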
 
The problem also shows up whenever you have any sort of high IO, it seems. I just transferred 60GB between two VMs on the same server and it immediately hung the entire machine until one of the VMs crashed.

Is there any sort of mitigation for this? The VMs are already using VirtIO SCSI single and async IO...

Use SAN storage ("enterprise grade" = hardware offload supported): InfiniBand, FC, iSCSI.
Forget NAS storage (SMB/CIFS, NFS, ...).
 