[SOLVED] Proxmox 8.0 / Kernel 6.2.x 100%CPU issue with Windows Server 2019 VMs

But try it with mitigations off and KSM disabled.
By that I mean disabling the mitigations via the kernel command line; performance doesn't drop, but the host is then vulnerable. It's a personal choice.

Code:
Model name:                      Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
Stepping:                        4
CPU MHz:                         1197.122
CPU max MHz:                     2600.0000
CPU min MHz:                     1200.0000
BogoMIPS:                        4189.95
Virtualization:                  VT-x
L1d cache:                       192 KiB
L1i cache:                       192 KiB
L2 cache:                        1.5 MiB
L3 cache:                        15 MiB
NUMA node0 CPU(s):               0-11
Vulnerability Itlb multihit:     KVM: Vulnerable
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX vulnerable
Vulnerability Mds:               Vulnerable; SMT vulnerable
Vulnerability Meltdown:          Vulnerable
Vulnerability Mmio stale data:   Unknown: No mitigations
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:        Vulnerable, STIBP: disabled, PBRSB-eIBRS: Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
 
That's vulnerable to everything except Retbleed.
Srbds is not affected because it's already patched with microcode, and TSX is disabled anyway, so Tsx async abort shows as "Not affected".

However, there is absolutely zero difference between a Xeon v2 and a v3/v4.

So yeah you're right.
 
It's not about your words, I really meant the lscpu output...

Remember: Spectre v1/v2 is almost impossible to fix, because every CPU that has a cache is affected by it...

Everything else is fixable, I think. At least check my 13th gen; I mean, Intel probably did there what they could?
And for reference a Ryzen Zen 3...
I think I could even find an E5 v2 if I ssh into my company and check all the servers, but ugh :-(

- Intel(R) Xeon(R) CPU E3-1275 v5
Code:
Vulnerabilities:
  Itlb multihit:         KVM: Mitigation: Split huge pages
  L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                   Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Mitigation; IBRS
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Mitigation; Microcode
  Tsx async abort:       Mitigation; TSX disabled

- 13th Gen Intel(R) Core(TM) i3-1315U
Code:
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected

- AMD Ryzen 7 5800X 8-Core Processor
Code:
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

@emunt6
Your E5 v2 lscpu output pls.
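If it helps, a quick way to pull just that section (stock util-linux lscpu, or reading the sysfs flags directly):

Code:
lscpu | grep -i vulnerab
# or, without lscpu:
grep . /sys/devices/system/cpu/vulnerabilities/*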

https://forum.proxmox.com/threads/p...th-windows-server-2019-vms.130727/post-574852

SMT is disabled
 
Back to the original problem.

It seems this is a problem with the SeaBIOS build that is included with Proxmox PVE 8 under the /usr/share/kvm/ directory.
Example: /usr/share/kvm/bios.bin
(SeaBIOS version rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org)

I tested another machine (which has a different SeaBIOS version) and it doesn't have this lag problem.
(Booting from the Windows Server 2019 ISO: en_windows_server_2019_updated_jan_2021_x64_dvd_5ef22372.iso)
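For anyone who wants to double-check which SeaBIOS build their host ships, something like this should work (a sketch; the version string is normally embedded in the ROM image, and dpkg -S just tells you which package installed it):

Code:
# which package provides the ROM
dpkg -S /usr/share/kvm/bios.bin
# look for the embedded SeaBIOS version string
strings /usr/share/kvm/bios.bin | grep -i 'rel-'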
 

My VMs have "bios: ovmf", so they don't use SeaBIOS at all.
In my case setting "mitigations=off" did the trick, but it didn't make me happy.
 
It's the same for me.

HP Proliant DL380 G9 (2x Intel(R) Xeon(R) CPU E5-2630L v4 @ 1.80GHz, 128 GB RAM)

Windows Server 2012R2 + SQL Server 2008 R2 with 32 (2x16) CPUs and 64 GB RAM shows the same stalls as you mention.

Going back to the previous kernel solved the problem.

The dots in the pictures show the kernel changes.

The problem only occurs under load, but as the graphs show, in my case the server does not get much extra load by default.
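For anyone wanting to do the same rollback, roughly these steps pin the older kernel on a PVE host (a sketch; it assumes a reasonably current proxmox-boot-tool with the kernel pin subcommand, and your exact 5.15 package version may differ):

Code:
# list the installed kernels / boot entries
proxmox-boot-tool kernel list
# pin the known-good 5.15 kernel so it is booted by default
proxmox-boot-tool kernel pin 5.15.116-1-pve
reboot
# confirm after the reboot
uname -r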
 

Attachments
  • CPU week max.jpg (51.9 KB)
  • CPU Day.jpg (53.1 KB)
When I set up 1 VM per host, I used to disable KSM on the node following this guide: https://pve.proxmox.com/wiki/Kernel_Samepage_Merging_(KSM)
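For reference, the steps from that guide boil down to roughly this (a sketch; ksmtuned is the service PVE uses to drive KSM, and writing 2 to the run knob also unmerges pages that are already shared):

Code:
# stop the KSM tuning service permanently
systemctl disable --now ksmtuned
# stop KSM and unmerge all currently shared pages
echo 2 > /sys/kernel/mm/ksm/run
# verify that nothing is being shared anymore
cat /sys/kernel/mm/ksm/pages_sharing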

After having finally tested our Windows2019 RDS cluster with KSM sharing disabled and mitigations=off, I think I can state that our Windows2019 VMs are finally working smoothly again.

To summarize:
As stated in my initial post, we had quite severe 100% CPU spikes with temporary noVNC Proxmox console stalling and ICMP ping reply times of up to 100 seconds, which seemed to be related to increased load/memory pressure on these systems. These issues all began after updating our Proxmox cluster to Proxmox 8, which comes with kernel 6.2. After downgrading to kernel 5.15 (still Proxmox 8) the issues immediately vanished.

After some more investigation and tips from @Whatever we booted all our systems with mitigations=off and saw a slight improvement. However, we still saw sporadic 100% CPU spikes, noVNC console stalling and high ICMP ping times against all our VMs under high memory pressure on the vhosts. Only after further investigation did we notice that with kernel 5.15 booted, the KSM sharing statistics in the VM info display showed significant differences compared to kernel 6.2. This finally brought us to the idea that KSM sharing might play an additional role in the VM stalling.

So after switching off KSM sharing on all our nodes and booting kernel 6.2, the systems finally started to work as smoothly as with our old Proxmox 7 + kernel 5.15 environment. In our case we could therefore solve the issues by booting our vhosts with mitigations=off and, more importantly, with KSM sharing completely disabled. Why KSM sharing under kernel 6.2 seems to result in these 100% CPU spikes and temporarily unresponsive VMs, while under kernel 5.15 the issue vanishes right away, isn't clear yet. This might be something the Proxmox support/devs (@fiona) have to investigate, but in our case it is 100% reproducible: all I have to do is enable KSM sharing again, and the Windows2019 VMs with high memory pressure (so that KSM sharing starts to play a role) end up stalling temporarily, while with kernel 5.15 this isn't the case.

So thanks again to everyone, but especially to @Whatever, who pointed me to the mitigations=off and KSM-disabling idea, which finally seems to have solved our problems, so that we are now happily running a Proxmox 8 + kernel 6 environment without any performance impact anymore.
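For anyone who wants to double-check both settings on a node, a quick sanity check (plain procfs/sysfs reads, nothing Proxmox-specific):

Code:
# mitigations=off should show up in the boot command line
cat /proc/cmdline
# run should not be 1 (1 = ksmd actively merging), and the sharing counters should be 0
cat /sys/kernel/mm/ksm/run
cat /sys/kernel/mm/ksm/pages_sharing
cat /sys/kernel/mm/ksm/pages_shared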
 
Can you test without mitigations=off, with only KSM disabled, on the Epyc and Gold machines?
 
I can test that at a later point, yes. But right now we are happy that the systems work as expected, and we believe that in our context permanently switching off mitigations is fine and intended anyway.

But as I tried to outline in my summary, I also think that disabling KSM alone might be enough to resolve the issue, and that mitigations=off simply helped us a bit by freeing up resources to get the memory pressure down more quickly. Still, I think the Proxmox devs should definitely look into the fact that KSM sharing under kernel 6 seems to work differently than under kernel 5.15, and that this seems to be the main reason for the 100% CPU spike issues we had.
 
I am running Proxmox 8 with kernel 6.2.16-4 and had the exact same problem. For me, mitigations=off was what solved the issue. I first disabled KSM sharing per the directions and that did not make much of a difference. I then booted with mitigations=off and now my VM no longer hangs with the high CPU issue. My question is, though: I assume this is a problem with the 6.2.16-4 kernel that needs to be resolved? I am using a Xeon E5-4657L v2. Does anyone know if a resolution is in the works?
 
I ran across this after pulling my hair out for a couple of days with a W11 VM nested inside Proxmox 8 on 6.2.16-10 to run a junk docker container that requires WSL2 (with parts of Hyper-V, for a "great" experience) for a testnet. I thought it was a Windows 11 issue, so I installed 10 today, only to have the same sluggish VM as before. mitigations=off has made a giant improvement so far. Before, when it would finally load, my download speeds on the VM were fine, but upload was ~1 Mbps; with no docker container running, it would do 100+. I'll have to do some optimizing on this old R710 with twin L5640s. I have a handful of them running various servers at our office, but boy, Proxmox 8 does NOT like Windows guests, period. The various Ubuntu and Fedora VMs just run and run.

I appreciate you guys taking the time to post here, and I'm thankful I was able to catch it!
 

That's my conclusion, too.

I would have liked the Proxmox staff to have weighed in on the issue, but so far there has been no feedback at all. I find that very unusual.
Is there already any news on this issue?

My special thanks go to jens-maus who has been so persistent in dealing with the problem.
 
There are some problems with the 6.2 kernel.
Not only Windows VMs are affected; we also experienced some weird random behaviour spread across different locations:
KSM, QEMU, network-related things, etc.

Hopefully the 6.2 kernel matures as soon as possible.
 
I was having this same problem and posted a few months ago (https://forum.proxmox.com/threads/vms-freeze-with-100-cpu.127459/post-560869) about what I had done that somehow mitigated the problem. One of those things was disabling KSM; since then the problem has not recurred. I would like to try kernel 6.4 to see if the problem persists, so maybe the Proxmox team will publish an optional kernel 6.4 soon for us to test?

The problem that this thread refers to has actually been solved. It appears a bug relating to the virtualization process was found and fixed in the kernel. The fix for this particular bug is in:
  • 6.2.16-11~bpo11+2 of the 6.2 opt-in kernel on PVE 7/Bullseye
  • 6.2.16-12 of the 6.2 (current) default kernel on PVE 8/Bookworm
If you upgrade to those versions on the host, this 100% CPU freeze bug will probably be gone. As for the other mentioned behaviours, I don't think they will be affected.
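To check whether a host already has the fixed build, and to pull it in if not, roughly this should do (a sketch, assuming the standard PVE repositories are configured):

Code:
# running kernel and installed kernel package versions
uname -r
pveversion -v | grep -i kernel
# pull in the fixed kernel build, then reboot into it
apt update && apt dist-upgrade
reboot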
 
Thank you for your assessment.

I can't confirm that so far. Our hosts are running kernel 6.2.16-12, and the problem still occurs there when we boot with the current kernel version.

With kernel version 5.15.116-1 everything works as intended, so I consider the problem not fully solved yet.
 

Also still seeing the issue here. Guest is Debian 12.
 
