[SOLVED] Proxmox 8.0 / Kernel 6.2.x 100%CPU issue with Windows Server 2019 VMs

As the main author of this thread / issue, I can also state that after updating to the latest Proxmox 8.2.x with kernel 6.8, I was able to re-enable KSM, and the Windows Terminal Server VM that previously ended up at 100% CPU usage is working nicely again on the same hardware/environment.

So, Proxmox 8.2.x with kernel 6.8 seems to have finally solved this issue. Thanks to everyone contributing to this thread, and especially to the Proxmox crew/devs for taking our issues here seriously and listening to us.
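For anyone wondering how to re-enable KSM after the upgrade: on a stock Proxmox VE install KSM is driven by the ksmtuned service, so something along these lines should do it (a minimal sketch; adjust if you manage KSM differently):

Bash:
# Re-enable and start the KSM tuning daemon that ships with Proxmox VE
systemctl enable --now ksmtuned
# Check whether KSM is merging pages again (0 right after enabling is normal, the value grows over time)
cat /sys/kernel/mm/ksm/pages_sharing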

And if a Proxmox staff member comes across this thread, I think you can finally flag it as solved.
 
Thank you all for testing the 6.8 kernel and reporting back. Judging from the positive feedback, it indeed seems very likely that the 6.8 kernel finally resolves the issue.

Thanks everyone, especially @jens-maus @Whatever @Jorge Teixeira @spirit @Ramalama, for your help tracking this down!

For anyone who finds this thread, a summary of the issue and resolution:

Symptoms: Proxmox VE running kernel 5.19, 6.2 or 6.5 on a host with multiple NUMA nodes (you can check this using lscpu). VMs frequently become unresponsive (freeze) with high CPU usage for a duration ranging from ~1 second to >60 seconds. During that time, the VMs do not respond to pings. After the freeze, the VM comes back on its own and continues to run without manual intervention. All guest OSes are affected in principle, though Windows VMs seem to be affected the most. On Windows VMs, the freezes are often long enough to provoke RDP session timeouts. On Linux VMs, the guest OS may report "watchdog: BUG: soft lockup". The freezes can happen regardless of whether KSM is enabled or disabled, but become more frequent if KSM is enabled.
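For reference, a quick way to check whether your host actually has multiple NUMA nodes (an illustrative sketch; the exact output depends on your CPUs and board):

Code:
# Show the NUMA topology of the host; more than one node means this issue can apply
lscpu | grep -i numa
# Example output on a dual-socket host (illustrative only):
#   NUMA node(s):        2
#   NUMA node0 CPU(s):   0-11,24-35
#   NUMA node1 CPU(s):   12-23,36-47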

Resolution:
  • Preferred solution on Proxmox VE 8.x: Upgrade to at least kernel 6.8, which includes an upstream patch [1] that appears to resolve the issue.
    • The easiest way is to upgrade to at least Proxmox VE 8.2, which includes kernel 6.8. Make sure to read the "known issues" section of the release notes [2] before you upgrade.
    • If you cannot upgrade to Proxmox VE 8.2 completely yet, you can install the opt-in kernel 6.8 [3].
  • Workaround if you cannot upgrade to kernel 6.8: In most cases, the freezes can be avoided by disabling the NUMA balancer [4]. You can disable the NUMA balancer for the current boot by running the following command:
    Code:
    echo 0 > /proc/sys/kernel/numa_balancing
    After a reboot, the NUMA balancer will be active again.

    If you want to disable the NUMA balancer permanently, you need to add numa_balancing=disable to the kernel command line and reboot. See the admin guide [5] for information on how to modify the kernel command line.
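As a concrete example, this is how you could check the balancer state and where the permanent setting would go on a GRUB-booted host; this is only a sketch, see the admin guide [5] for the authoritative steps:

Code:
# Check whether the NUMA balancer is currently active (1 = enabled, 0 = disabled)
cat /proc/sys/kernel/numa_balancing
# Confirm which kernel the host is running
uname -r
# Permanent workaround (GRUB-booted hosts): add numa_balancing=disable to
# GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then run update-grub and reboot.
# On systemd-boot installs (e.g. ZFS on UEFI), append it to /etc/kernel/cmdline
# and run proxmox-boot-tool refresh instead.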
[1] https://git.kernel.org/pub/scm/linu.../?id=d02c357e5bfa7dfd618b7b3015624beb71f58f1f
[2] https://pve.proxmox.com/wiki/Roadmap#Known_Issues_&_Breaking_Changes
[3] https://forum.proxmox.com/threads/144557/
[4] https://doc.opensuse.org/documentation/leap/tuning/html/book-tuning/cha-tuning-numactl.html
[5] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysboot_edit_kernel_cmdline
 
Confirmed. I had the same problem, and the following helped:
Bash:
echo 0 > /proc/sys/kernel/numa_balancing
Upgrade when you get a chance, especially if you're using something like ZFS RAID10 on multiple sockets. The NUMA load balancer helps a lot there; it just hits this bug on the older kernels. With the latest kernel it can stay enabled, and it's much more efficient too: my IO utilization dropped from 100% to 40%.
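If you had disabled the balancer with the workaround above, it comes back automatically on the next reboot; if you made the workaround permanent, remember to remove numa_balancing=disable from the kernel command line again. To turn it back on in a running system (a minimal sketch):

Bash:
# Re-enable automatic NUMA balancing for the current boot
echo 1 > /proc/sys/kernel/numa_balancing
# Verify (1 = enabled, 0 = disabled)
cat /proc/sys/kernel/numa_balancing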
 
I've been struggling with the same issues after upgrading to 8.4.1. I originally thought it was my HP P440 storage controller, but it turned out to be a combination of CPU and SCSI controller emulation. The setup ran fine for several months, then I decided to be a good admin and install the most recent updates.

After reading everything here and in other posts, I spent about 4 x 10-hour days trying different CPU, Machine and SCSI Controller combinations to find the ones that work (listed below). I even reloaded the physical host with Windows Server to confirm that the storage controller and other components were functioning normally. While I had Windows loaded, I updated the BIOS and firmware as an extra measure.

The settings that worked for me are (rough command-line equivalents are sketched after the list):
  1. Memory Ballooning = Off
  2. Processors = Host (tried lots but this performs best with current version)
  3. Display = VirtIO-GPU
  4. Machine = pc-i440fx-9.0
  5. SCSI Controller = VMware PVSCSI
  6. Hard Disk = Cache: Default (No cache), Discard = On, SSD Emulation = On
  7. Network Device = virtio
  8. KSMtuned = disabled
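For reference, roughly the same configuration can be applied from the host shell with qm set. This is only a sketch assuming a hypothetical VM ID 100, a single disk on scsi0 stored on local-lvm, and a vmbr0 bridge; adjust names and IDs to your environment (the GUI changes are equivalent):

Bash:
VMID=100                                  # hypothetical VM ID, adjust to yours

qm set $VMID --balloon 0                  # 1. memory ballooning off
qm set $VMID --cpu host                   # 2. CPU type Host
qm set $VMID --vga virtio                 # 3. VirtIO-GPU display
qm set $VMID --machine pc-i440fx-9.0      # 4. machine type
qm set $VMID --scsihw pvscsi              # 5. VMware PVSCSI controller
# 6. existing disk: default (no) cache, discard and SSD emulation on
qm set $VMID --scsi0 local-lvm:vm-100-disk-0,discard=on,ssd=1
# 7. VirtIO NIC (note: re-setting net0 without a macaddr generates a new MAC)
qm set $VMID --net0 virtio,bridge=vmbr0
# 8. disable the KSM tuning daemon on the host
systemctl disable --now ksmtuned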
More about my setup:
  1. HP ProLiant DL360 G9 with 24 x Intel Xeon E5-2620 v3 @ 2.40 GHz across 2 sockets
  2. Windows Server 2019 Standard VMs converted from Hyper-V.
  3. 3-node cluster, all with different CPUs, so live migration is my sacrifice at this point since I'm using Host CPU emulation.
I hope this helps someone.