CPU soft lockup: watchdog: BUG: soft lockup - CPU#0 stuck for 24s!
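for context, the kernel's soft lockup detector fires when a CPU spends roughly twice `kernel.watchdog_thresh` seconds (default 10) without scheduling. a quick way to inspect or raise the threshold on the guest while debugging (the value below is only an example, and raising it hides the symptom rather than fixing the cause):

```shell
# current threshold in seconds (lockup warning fires at ~2x this value)
cat /proc/sys/kernel/watchdog_thresh
# example: raise it temporarily while investigating
sysctl -w kernel.watchdog_thresh=20
```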

you need to use virtio-scsi-single for iothread=1 to have any effect at all. iothread=1 without virtio-scsi-single is meaningless (though it's possible to configure it in the GUI)

using virtio-scsi-single does not mean that you use a single virtual disk

https://qemu-devel.nongnu.narkive.com/I59Sm5TH/lock-contention-in-qemu
<snip>
I find the timeslice of vCPU thread in QEMU/KVM is unstable when there
are lots of read requests (for example, read 4KB each time (8GB in
total) from one file) from Guest OS. I also find that this phenomenon
may be caused by lock contention in QEMU layer. I find this problem
under following workload.
<snip>
Yes, there is a way to reduce jitter caused by the QEMU global mutex:

Code:
qemu -object iothread,id=iothread0 \
     -drive if=none,id=drive0,file=test.img,format=raw,cache=none \
     -device virtio-blk-pci,iothread=iothread0,drive=drive0

Now the ioeventfd and thread pool completions will be processed in
iothread0 instead of the QEMU main loop thread. This thread does not
take the QEMU global mutex so vcpu execution is not hindered.

This feature is called virtio-blk dataplane.
<snip>

https://forum.proxmox.com/threads/virtio-scsi-vs-virtio-scsi-single.28426/
Ahhh, I guess I only halfway understood. This helps. Thank you!
 
apparently, setting virtio-scsi-single & iothread & aio=threads cured all our vm freeze & hiccup issues.

i added this information to:

https://bugzilla.kernel.org/show_bug.cgi?id=199727#c8
https://bugzilla.proxmox.com/show_bug.cgi?id=1453

apparently, in ordinary/default qemu io processing, there are chances of running into larger locking conditions which block the entire vm execution and thus entirely freeze the guest cpu for a while. this also explains why ping jitters that much.

with virtio-scsi-single & iothread & aio=native, the ping jitter gets cured too, but the jitter/freeze moves into the iothread instead, and i'm still getting kernel traces/oopses regarding stuck processes/cpus.

adding aio=threads solves this entirely.
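for reference, the combination above can also be applied from the proxmox CLI instead of the GUI; a minimal sketch, where the VM ID (100), disk slot (scsi0) and storage/volume names are just placeholders for your own:

```shell
# switch the VM's controller so each disk gets its own controller instance
qm set 100 --scsihw virtio-scsi-single
# re-attach the disk with iothread and aio=threads enabled
# (storage/volume name is an example, keep your existing one)
qm set 100 --scsi0 local-lvm:vm-100-disk-0,iothread=1,aio=threads
```

the settings only take effect after a full VM stop/start, not a reboot from inside the guest.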

the following information sheds some light on the whole picture. apparently the "qemu_global_mutex" can slam hard in your face, and this seems to be very little known:

https://docs.openeuler.org/en/docs/.../best-practices.html#i-o-thread-configuration

"The QEMU global lock (qemu_global_mutex) is used when VM I/O requests are processed by the QEMU main thread. If the I/O processing takes a long time, the QEMU main thread will occupy the global lock for a long time. As a result, the VM vCPU cannot be scheduled properly, affecting the overall VM performance and user experience."
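to double-check which of your VMs actually run with these settings, the current configuration can be inspected per VM (the VM ID 100 is just an example):

```shell
# show the controller type and the per-disk options (iothread, aio, ...)
qm config 100 | grep -E '^(scsihw|scsi[0-9]+):'
```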


i have never seen the problem again with virtio-scsi-single & iothread & aio=threads. ping is absolutely stable with that, and ioping inside the VM during vm migration or virtual disk moves is also within a reasonable range. it's slow under high io pressure, but there are no errors in kernel dmesg inside the guests.

i'm really curious why this problem doesn't affect more people and why information is so hard to find that even the proxmox folks won't give a hint in this direction (at least i didn't find one, and i searched really long and hard)

I'm still searching for some deeper information/knowledge on what exactly happens in qemu/kvm, and what goes on in detail when freezes of several tens of seconds occur. even in the qemu project, detailed information along the lines of "virtio dataplane cures vm hiccups/freezes and removes the big qemu lock problem" is close to nonexistent. the main framing is "it improves performance and user experience".

anyway, i consider this finding important enough to be added to the docs/faqs. for us, this finding is sort of essential for survival; our whole xen-to-proxmox migration was delayed for months because of those vm hiccup/freeze issues.

what do you think @proxmox-team ?
Finally!!! Thank you!
 
Are you guys setting *all* of your VMs with virtio-scsi-single/iothread/threads or just the problem machines?
 
i set all VMs with these params
Thank you to all involved with this thread and the research work. We have seen freezes on three systems in recent days (on PVE 7.2) which were all rock solid before. We have now applied the combination of settings that RolandK suggests (virtio-scsi-single & iothread & aio=threads) and hope for the best.
An excellent, helpful community here.
 
I know it is an old thread, but seeing a fresh posting just days ago, I decided to pitch in.

I will open a fresh thread at a later time with what has been happening in our environment. Just wanted to say that I too have this VM lock-up issue. Applying virtio-scsi-single, iothread and aio=threads did not solve the problem. In my case, multiple VMs (CentOS/Debian) get locked up during backup to a PBS server.

I did not test backup using a local disk, because for me it must work using shared storage. After the lock-up takes place, the backup does finish after an extended period of time. Once the backup is finished, the VM needs to be rebooted to be fully functional again. I am not certain exactly which update caused the issue to start occurring.
 
Hi,
do you have the 5.19 kernel running? In the case of the migration issues, this helped for me. Not sure if it's the same on your side.

Udo
 
This normally just means that the PVE host had a (way) too high load, CPU- or IO-wise.

I can observe such things when running very intensive compile tasks on an already loaded host, there's not much one can do besides increasing resources available or limiting the high load.
Is this a known issue? I am trying to rsync around 1TB of data files from one DAS HDD to another. Keep in mind they are Direct Attached Storage, on a mostly idle home PC with a Xeon CPU, 8GB RAM in the VM and 32GB on the host. What constitutes high load?

As far as I can see, this is a showstopper even for home lab like mine. Any other suggestions to improve the situation? I am currently on PVE 7.4-13.
 
I'm also having issues on a home lab trying to use all cores in a VM. I've tried the scsi/threads hack as mentioned above, but I'm getting soft lockups every time I compile the kernel with all threads in a VM (88 threads total, 2 CPUs). It doesn't matter how I configure the processor, either: NUMA on or off, CPU type selection, etc.

I agree this is a showstopper. I see no reason why I shouldn't be able to compile a kernel with all cores in a vm. It's not even doing a lot of IO.

Running on a Dell R730, 2x Xeon-2699v4, 256G ECC (memtest86 passed).
 
I should add I have no problem compiling the kernel on the host, using all cores. Takes about 64 seconds (expected). No errors or other glitches.
 
Hi,
please share the output of pveversion -v and the VM configuration. Did the issue start happening after a certain kernel update or has it been there for longer? Is there anything in the host's system logs/journal?

(88 threads total, 2 cpus)
Is there a typo? How do you get to 88 threads/how are you invoking kernel compilation?
 

Thanks for responding, fiona!

I just purchased a dual Xeon 2699v4 Dell R730 system (22 cores / 44 threads each, so 88 threads total). Here are the commands I use for compiling:
Code:
make defconfig
make clean
time make -s -j88

I've just started using proxmox. I used zfs as the root fs at first and am currently using btrfs as root, with a zfs pool for VMs. The same issue happens each time. I've also tried disabling primarycache on the zfs volumes, to no avail.

The only thing that allows me to compile the kernel with 88 threads in a VM with no soft lockups is limiting my VM memory to 32GB or less. I have no trouble compiling the kernel on the proxmox host with 88 threads, as given above. My system also passes memtest86 and Dell's iDRAC memory tester.

I'm kind of at a loss how to handle this. I was hoping to use this setup for AI research, which requires a ton of RAM, so I'm not sure how I should proceed. My only hunch right now is to try to remove ZFS from the picture completely and see if that helps. Any other thoughts would be appreciated!

Code:
root@pve:~# pveversion -v
proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.2
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.5
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.3
libpve-rs-perl: 0.8.3
libpve-storage-perl: 8.0.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 2.99.0-1
proxmox-backup-file-restore: 2.99.0-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.5
pve-cluster: 8.0.1
pve-container: 5.0.3
pve-docs: 8.0.3
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.2
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.4
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1
 
The only thing that allows me to compile the kernel with 88 threads in a VM with no soft lockups is limiting my VM memory to 32GB or less. I have no trouble compiling the kernel on the proxmox host with 88 threads, as given above. My system also passes memtest86 and Dell's iDRAC memory tester.
How does the memory/CPU load on the host look like when the issue occurs? Can you share the VM configuration?

I'm kind of at a loss how to handle this. I was hoping to use this setup for AI research, which requires a ton of RAM, so I'm not sure how I should proceed. My only hunch right now is to try to remove ZFS from the picture completely and see if that helps. Any other thoughts would be appreciated!
ZFS can need a lot of memory; you can try to limit its usage: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_zfs_limit_memory_usage
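the linked docs boil down to capping the ZFS ARC via a module parameter; a minimal sketch, where the 16 GiB value is only an example (pick roughly "total RAM minus what the VMs and host need"):

```shell
# cap the ZFS ARC at 16 GiB (16 * 1024^3 = 17179869184 bytes; value is an example)
echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf
# rebuild the initramfs so the limit already applies at boot
update-initramfs -u
```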
 
How does the memory/CPU load on the host look like when the issue occurs? Can you share the VM configuration?

The memory/CPU load is fairly close to 0 in all my testing. This is a new system, so there's not a lot going on. I have 256GB of ECC RAM.

I could send you the VM configuration, but I'm constantly changing it to see if something works. So far, the only thing that seems to allow it to work is reducing memory (ballooning doesn't seem to matter). One thing I haven't tried yet is CPU affinity.

The guest VM is a Debian 12 minimal install (no graphical shell). This is what my current failed test looks like.

[two screenshots attached: 1694522758622.png and 1694522791276.png]
I will try to remove ZFS from the picture entirely and see if that matters.
 
I am unable to upload a 5.5GB Windows 11 ISO to Proxmox. There's absolutely no load at the time of file transfer.
I tried three times, afterwards, I used FileZilla to transfer file to /var/lib/vz/template/iso

This is really a show stopper.

Code:
proxmox-ve: 7.4-1 (running kernel: 5.15.116-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-6
pve-kernel-5.15.116-1-pve: 5.15.116-1
pve-kernel-5.15.111-1-pve: 5.15.111-1
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.3-1
proxmox-backup-file-restore: 2.4.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
 
I am unable to upload a 5.5GB Windows 11 ISO to Proxmox. There's absolutely no load at the time of file transfer.
I tried three times, afterwards, I used FileZilla to transfer file to /var/lib/vz/template/iso

This is really a show stopper.
you can reproducibly soft lockup your cpu by transferring an ISO file to proxmox?
 
Just to update everyone: I successfully stopped the soft lockups by disabling CPU vulnerability mitigations on the host and in the guest VM.

I added this to the kernel command line (GRUB_CMDLINE_LINUX_DEFAULT) in /etc/default/grub
Code:
mitigations=off
then ran
Code:
update-grub
Thanks for the help,
-Schmerbs
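a quick way to verify after the reboot that the flag actually took effect (keep in mind mitigations=off trades security for stability/performance, so weigh that for your environment):

```shell
# the running kernel's command line should contain the flag
grep -o 'mitigations=off' /proc/cmdline
# per-vulnerability status; with mitigations disabled, most entries
# will typically report "Vulnerable"
grep . /sys/devices/system/cpu/vulnerabilities/*
```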
 
