[SOLVED] VMs freeze with 100% CPU

Tested PVE 7.4 with kernel 6.2.16-11-bpo11-pve and the VMs did not freeze, even though mmu_invalidate_seq was over 2815072170. At last, a light at the end of the tunnel!

I have no PVE 8 system that I can test with at the moment, but will upgrade the test system ASAP.
 
Tested PVE 7.4 with kernel 6.2.16-11-bpo11-pve and the VMs did not freeze, even though mmu_invalidate_seq was over 2815072170. At last, a light at the end of the tunnel!

I have no PVE 8 system that I can test with at the moment, but will upgrade the test system ASAP.
Same here.
Tests on a 7.4 cluster run without trouble with 6.2.16-11-bpo11-pve.

Udo
 
I can't seem to upgrade past 6.2.16-4-bpo11.
How do you manually install 6.2.16-11-bpo11-pve on 7.4?
 
I can't seem to upgrade past 6.2.16-4-bpo11.
How do you manually install 6.2.16-11-bpo11-pve on 7.4?
It definitely is among the available packages, so probably just run apt update first.

root@VSAMD1:~# apt-cache search kernel-6.2
pve-kernel-6.2.11-1-pve - Proxmox Kernel Image
pve-kernel-6.2.11-2-pve - Proxmox Kernel Image
pve-kernel-6.2.16-11-bpo11-pve - Proxmox Kernel Image
pve-kernel-6.2.16-4-bpo11-pve - Proxmox Kernel Image
pve-kernel-6.2.2-1-pve - Proxmox Kernel Image
pve-kernel-6.2.6-1-pve - Proxmox Kernel Image
pve-kernel-6.2.9-1-pve - Proxmox Kernel Image
pve-kernel-6.2 - Latest Proxmox VE Kernel Image
 
I can't seem to upgrade past 6.2.16-4-bpo11.
How do you manually install 6.2.16-11-bpo11-pve on 7.4?
The day before yesterday, the package was in pvetest only.
Download and install:
Code:
wget http://download.proxmox.com/debian/pve/dists/bullseye/pvetest/binary-amd64/pve-kernel-6.2.16-11-bpo11-pve_6.2.16-11~bpo11%2B2_amd64.deb
dpkg -i pve-kernel-6.2.16-11-bpo11-pve_6.2.16-11~bpo11+2_amd64.deb
Udo
 
This kernel is in the pve-no-subscription repo; enable it and you are ready to go (to install this kernel). I'm really impressed by the job done, and I'm waiting for this kernel to land in the Proxmox repo for PVE 7.4.

Good job and kudos to you all!
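For reference, the repository mentioned here corresponds to the following sources entry on PVE 7/Bullseye, per the Package Repositories wiki (the file name below is just a common convention):
Code:
# e.g. /etc/apt/sources.list.d/pve-no-subscription.list
deb http://download.proxmox.com/debian/pve bullseye pve-no-subscription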
 
I can't seem to upgrade past 6.2.16-4-bpo11.
How do you manually install 6.2.16-11-bpo11-pve on 7.4?
I'm traveling and don't have access to my lab. Still, AFAIK, the patched kernel is still in the "testing" repository only (info about Proxmox repositories [0]). I suggest that you:

- Enable the testing repository from the webUI. You can do it manually, of course.
- Click "update" to refresh the package list or apt update from CLI.
- Install the updated kernel *only* from CLI: apt install 6.2.16-11-bpo11-pve
- Disable the testing repository from the webUI so you don't get testing packages.

As suggested, downloading the .deb file and installing it with dpkg is a good option too. I just do it the other way to help me remember that the host had some "testing" package installed at some point, in case that helps in the future.

[0] https://pve.proxmox.com/wiki/Package_Repositories
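For completeness, a CLI-only sketch of the same steps, assuming PVE 7/Bullseye and that the kernel is still in pvetest (double-check the repository/suite names against [0] before using this):
Code:
# 1. Temporarily enable the testing repository (suite name per [0]; adjust if needed)
echo "deb http://download.proxmox.com/debian/pve bullseye pvetest" > /etc/apt/sources.list.d/pvetest.list
# 2. Refresh the package lists
apt update
# 3. Install only the updated kernel
apt install pve-kernel-6.2.16-11-bpo11-pve
# 4. Disable the testing repository again and refresh, so no other testing packages come in
rm /etc/apt/sources.list.d/pvetest.list
apt update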
 
I'm traveling and don't have access to my lab. Still, AFAIK, the patched kernel is still in the "testing" repository only (info about Proxmox repositories [0]). I suggest that you:
No, those kernels have already been on no-subscription, i.e., the repo in between testing and the one recommended for production, for both Proxmox VE 8 and Proxmox VE 7, since Tuesday, 2023-09-05, roughly 17:00 CEST.

The rest of your suggestion of how to get that update is sound though!

FYI: We plan to move those packages to the enterprise repo at the start of next week, if nothing comes up, that is.
 
I am running the test to see if I can reproduce the freeze as well. It will take a while, I think, since running it for 2 hours only got the mmu_invalidate_seq value to ~350M. I also found that, for some reason, I have to repeat the "balloon $amount" command in every loop iteration, otherwise the memory usage does not go down. I suppose that in the background Proxmox keeps informing the VM that it does *not* have to balloon? Is there some setting I can disable to prevent this behavior?
 
I am running the test to see if I can reproduce the freeze as well. It will take a while, I think, since running it for 2 hours only got the mmu_invalidate_seq value to ~350M. I also found that, for some reason, I have to repeat the "balloon $amount" command in every loop iteration, otherwise the memory usage does not go down. I suppose that in the background Proxmox keeps informing the VM that it does *not* have to balloon? Is there some setting I can disable to prevent this behavior?
Does the VM have ballooning enabled (meaning: are "minimum memory" and "memory" set to different values)? If yes, you can disable it (by setting them to the same value). The "ballooning device" has to stay enabled. These settings will prevent PVE (more specifically: pvestatd) from issuing its own balloon commands to the QEMU monitor, but will still allow the reproducer to perform ballooning.
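To illustrate, a rough sketch of the setup described above (VM ID 100 and the memory sizes are hypothetical; it also assumes qm monitor accepts commands on stdin, otherwise issue the balloon commands interactively):
Code:
# Set "memory" and "minimum memory" to the same value so that pvestatd
# stops issuing its own balloon commands (the balloon device itself stays enabled)
qm set 100 --memory 8192 --balloon 8192

# Reproducer-style loop: repeatedly deflate and inflate the balloon by hand
# (assumes qm monitor reads commands from stdin; otherwise run it interactively)
while true; do
        echo "balloon 4096" | qm monitor 100   # shrink guest memory to ~4 GiB
        sleep 60
        echo "balloon 8192" | qm monitor 100   # let it grow back to ~8 GiB
        sleep 60
done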
 
I can confirm the bug on my cluster as well with kernel 6.2.11-1; not sure exactly how long it took to trigger it, but it was less than 7 hours. That is, for the Debian VMs, when I keep switching ballooning between 4 GB and 100 GB of RAM.
 
I can't see 6.2.16-11-bpo11-pve.

Is this all in this kernel now?
proxmox-kernel-6.2 (6.2.16-12)

EDIT: I read over the last few pages, looks like yep
 
I can't see 6.2.16-11-bpo11-pve.

Is this all in this kernel now?
proxmox-kernel-6.2 (6.2.16-12)

EDIT: I read over the last few pages, looks like yep

The fix is in:
  • 6.2.16-11~bpo11+2 of the 6.2 opt-in kernel on PVE 7/Bullseye: [1]
  • 6.2.16-12 of the 6.2 (current) default kernel on PVE 8/Bookworm: [2]

Both are currently available in their corresponding pve-no-subscription repository.

[1] https://git.proxmox.com/?p=pve-kern...;hpb=44f3d669a57e2f0ae2a1835f48ee37bab7e06d21
[2] https://git.proxmox.com/?p=pve-kern...;hpb=6810c247a180f3bb1492873cc571c3edd517d8a3
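In practice, pulling in the fixed kernel should boil down to something like the following (metapackage names as they appear in this thread; assumes the pve-no-subscription repository is enabled and a reboot afterwards):
Code:
# PVE 8 / Bookworm (current default kernel series)
apt update && apt install proxmox-kernel-6.2
# PVE 7 / Bullseye (opt-in 6.2 kernel)
apt update && apt install pve-kernel-6.2
# Verify which version was pulled in, e.g.:
apt-cache policy proxmox-kernel-6.2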
 
Thanks for the testing! It seems the ballooning reproducer can trigger freezes fairly reliably within a couple of hours in the unpatched kernel. However, the ballooning performed by the reproducer is very extreme, so the setup is not realistic. I tried to find out more about how the freezes could trigger in more realistic setups (with the unpatched kernel), and there seem to be at least the following factors that increase the likelihood of freezes:
  • Normal ballooning: If there are VMs with ballooning enabled, pvestatd runs the ballooning algorithm [1] every 10 seconds, and the algorithm determines which VM balloons should be inflated or deflated. Inflates/deflates are done in 100MiB steps. On my machine, reclaiming 100MiB from a Linux VM increments the mmu_invalidate_seq counter by ~25600. So if we make a guess and assume that pvestatd inflates the VM balloon by 100MiB every 20 seconds, it only takes ~20 days until the counter reaches 2^31 (see the quick sanity check after this list). And interestingly, reclaiming 100MiB from a Windows VM increments the counter by double that amount (~51200), which means that a Windows VM could freeze after ~10 days already. This is consistent with the observations by @VictorSTS. So it seems plausible that normal ballooning might cause freezes with the unpatched kernel in real-world setups.
  • Kernel Samepage Merging (KSM [3]): As already hinted in the KVM mailing list thread [2], KSM activity also increases the counter. I set up two identical (idle) Windows VMs, and the following script made one of them freeze with a counter > 2^31 after ~16 hours (no ballooning involved):
    Code:
    # Make ksmd scan pages aggressively
    echo 8192 > /sys/kernel/mm/ksm/pages_to_scan
    while true; do
            # run=2: unmerge all currently merged pages
            echo 2 > /sys/kernel/mm/ksm/run
            sleep 120
            # run=1: resume merging
            echo 1 > /sys/kernel/mm/ksm/run
            sleep 600
    done
    This is of course also an unrealistic setup, but since the amount of KSM activity can be expected to vary depending on the guest workloads, it seems plausible to me that KSM could cause freezes in a real-world setup with the unpatched kernel after a couple of days/weeks.
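As a quick sanity check of the ~20-day estimate in the first point (using the rough numbers quoted above):
Code:
# 100 MiB reclaimed = 25600 pages of 4 KiB  ->  counter += ~25600 per inflate
# one 100 MiB inflate every 20 s            ->  ~1280 increments per second
echo $(( 2**31 / 1280 / 86400 ))   # ~19-20 days until the counter reaches 2^31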
So it seems like either ballooning or KSM could trigger the freezes within a couple of days/weeks on an unpatched kernel, and Windows VMs seem to be more impacted. Enabling both ballooning and KSM probably increases the likelihood of freezes even more. Of course, there may also be other factors that accelerate the counter increase.

Anyway, the fix is available in the kernels mentioned by @Neobin [4], and it now seems very likely it fixes the freeze issues.

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#qm_memory
[2] https://lore.kernel.org/all/f023d927-52aa-7e08-2ee5-59a2fbc65953@gameservers.com/T/
[3] https://pve.proxmox.com/wiki/Kernel_Samepage_Merging_(KSM)
[4] https://forum.proxmox.com/threads/vms-freeze-with-100-cpu.127459/post-587633
 
We are also having trouble with some VMs, seeing about 3 freezes a week, with roughly 50 VMs using memory ballooning.

We are now upgrading 2 nodes to the latest kernel and hope there will be no more freezes on those two nodes. We are planning to move the VMs with ballooning to the upgraded nodes.

Thanks a lot for all this work.
 
Hello everyone,

I am using Proxmox 8.0.4 with kernel 6.2.16-12-pve, and a VM keeps freezing with the errors below.
I tried to disable ballooning, but without success.
On Proxmox 7.4-16 with kernel 5.15.116-1-pve everything is OK.

Code:
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#23 stuck for 23s! [nginx:32986]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#24 stuck for 22s! [nscd:1582]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#25 stuck for 22s! [crond:9423]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#27 stuck for 23s! [php-fpm:32674]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#32 stuck for 22s! [kworker/32:0:28235]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#34 stuck for 22s! [kworker/34:1H:832]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#35 stuck for 23s! [nginx:32969]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#36 stuck for 22s! [cpsrvd (SSL) - :4009]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#40 stuck for 23s! [cpsrvd (SSL) - :32985]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#43 stuck for 22s! [cpaneld - servi:29987]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#45 stuck for 22s! [nginx:32973]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#48 stuck for 22s! [kworker/48:1:532]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#49 stuck for 22s! [nginx:32907]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#31 stuck for 22s! [imunify360-unif:1799]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [nginx:32964]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#13 stuck for 33s! [queueprocd - pr:32823]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#18 stuck for 23s! [imap-login:5436]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#41 stuck for 21s! [mysqld:29857]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#47 stuck for 39s! [nginx:32975]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#12 stuck for 39s! [imunify-realtim:6188]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#20 stuck for 21s! [nginx:32974]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#26 stuck for 39s! [imunify-auditd-:6240]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#46 stuck for 28s! [nginx:32836]
Sep 11 10:46:54 hosting2 kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 39s! [nginx:5238]
Sep 11 10:48:47 hosting2 kernel: watchdog: BUG: soft lockup - CPU#16 stuck for 22s! [lve_loadavg:4219]
Sep 11 10:48:55 hosting2 kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/u100:22:334]
 
Hi,
Hello everyone,

I am using Proxmox 8.0.4 with kernel 6.2.16-12-pve, and a VM keeps freezing with the errors below.
I tried to disable ballooning, but without success.
On Proxmox 7.4-16 with kernel 5.15.116-1-pve everything is OK.

that's not the same kind of freeze as described in this thread. The VM is still able to output those messages, after all. Please open a new thread and provide the VM configuration, the output of pveversion -v, and tell us what your host's CPU model is. What does the load in the VM and on the host look like?
On Proxmox 7.4-16 with kernel 5.15.116-1-pve everything is OK.
With the same VM on the same host?

EDIT: another important question: does the freeze happen after you do some special operation, e.g. a migration or snapshot?
 
Hi,

that's not the same kind of freeze as described in this thread. The VM is still able to output those messages, after all. Please open a new thread and provide the VM configuration, the output of pveversion -v, and tell us what your host's CPU model is. What does the load in the VM and on the host look like?

With the same VM on the same host?

I re-imaged the host to Proxmox 7.4-16, so I cannot provide the output of pveversion -v. :(
The CPU is an Intel(R) Xeon(R) CPU E5-2683 v4.

VM config
Code:
agent: 1
boot: cdn
bootdisk: virtio0
cores: 25
cpu: host
memory: 460800
name: hosting2
net0: virtio=72:1B:8F:22:D3:12,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=d951597e-0808-4cd6-8e9f-312af3ddcf80
sockets: 2
virtio0: datastore3:100/vm-100-disk-0.qcow2,size=1550G

What does the load in the VM and on the host look like?
The VM is under heavy load on Proxmox 8.0.4, has multiple freezes, and generates the errors posted above.
With the same VM on the same host?
Same VM, but on another host (both hosts have the same CPU model). We also checked all the hardware components.



PS: The VM is an important resource, so there isn't much possibility for many experiments :)
 
Well, you didn't open a new thread, so you're forcing everybody to follow this discussion.
Code:
cores: 25
That's an odd number of CPUs. While it shouldn't be an issue, I think guests are more comfortable with an even number.
Code:
memory: 460800
How much memory does the host have?
Code:
virtio0: datastore3:100/vm-100-disk-0.qcow2,size=1550G
What kind of storage is this? If it's network storage, is the connection stable?

Please also check the host's system log/journal when the issue occurs next time.
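For example, something along these lines (the timestamps are placeholders; adjust them to when the freeze happened):
Code:
# Host journal around the time of the incident
journalctl --since "2023-09-11 10:40" --until "2023-09-11 11:00"
# Or follow the current boot's log live
journalctl -b -f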
 
