Proxmox 8.0 / Kernel 6.2.x 100%CPU issue with Windows Server 2019 VMs

I have a couple of AMD Zen 4 Dell R7625 servers with dual 9554P CPUs (128C physical total, base 3.1 GHz) that I reported in a support ticket, and I was referred to the thread mentioned here.

In my case, I was experiencing freezes on Windows VMs to a non-trivial degree. The servers have capacity for a lot of high-performance VMs, and at one point we had 45 Windows VMs with 8C each. We'd see up to 3 freezes per day, but sometimes only 1-2 a week.

I spent a couple of months chasing it, and there definitely seemed to be a correlation with disk IO load. I tweaked a bunch of bandwidth caps; I think 150 MB/s was mostly solid, but I would still see a few freezes per host per week, and the absurd caps would make a 3-hour large Packer qemu-img rebase/merge job take over a day on Linux. I ended up "resolving" it by moving all Windows VMs to Intel hosts, which never hit the issue, although the latest Intel chip we have is a 2019 core design. I have 40-45 Linux VMs (Ubuntu 20.04 LTS) constantly stressing the system and they have run for 3 weeks without issue.

I did disable the NUMA load balancer for a week and torture tested things, and it didn't freeze, but I've had periods of 1-2 weeks without freezes regardless, so I'm not 100% confident this is the same bug. Disabling the NUMA balancer isn't a solution for us because, while it stops the freezes, it degrades system performance so much that our VMs, which are Azure DevOps Pipelines agents, micro-disconnect long enough to abort jobs.

I've been trying to procure an expansion for H1 for months now; it might go through soon.

I expect to get 2 more AMD servers, but I've respecced because we missed the mark in general with disk IO. The older AMD Zen 4 servers have 16 PCIe Gen 3 NVMes in ZFS software RAID10. My best guess (admitting I don't fully understand how all of the components within PVE work) is that if this is NUMA load balancer related, our layout isn't helping: the ZFS RAID10 spans disks whose PCIe lanes are split between the two physical CPUs, 95% of our disk load is continuous writes at around 1.5 GB/s, and having that data stream between CPU sockets seems less than ideal.

The 2 new AMD servers will each be 2x 9374F, which is 64C total at 3.85 GHz base, with 24x PCIe Gen 4 SSDs. I got a PERC12 with them; I'll try CPU affinity with 2 RAID volumes so I can pair VMs to run on the CPU whose NVMes are local. If that doesn't help, I can flash it to passthrough and continue using ZFS.

I'm going to use the AMD servers off production to keep trying to resolve the Windows freezes, because performance-wise the Intel chips simply do not compare; for example, the 32C chips run nearly 1 GHz slower. Their fastest 32C chip requires liquid cooling... in 2024. I'm guessing they didn't spread out cores because they decided to compete with ARM/mobile 10 years ago and haven't had a chance to react to Zen. Once I have those servers, I can test things out.

Because we need to expand in production reliably, I also have 2 Intel servers on the way: Dell R760 with either 4th or 5th generation Xeon Scalable, probably 2x 32C at 2.8-3.1 GHz. What could be interesting here is that I see Intel chips mentioned in some of these posts, so maybe the generation of the core design is at fault, it's a general x86 thing, and I've been blaming AMD.

It also could just go away if it's NUMA related, because I think the 32C chips have fewer NUMA zones. I don't know; it's frustrating for sure.
 
Last edited:
@Whatever can try it, since he also reported on this kind of 100% CPU freeze very early and suggested solutions like disabling mitigations and KSM, which in fact worked for me; our 8.1.4 PVE production cluster has worked flawlessly since then.

Will do my best as soon as I get a chance and report back
 
Hi all, since kernel proxmox-kernel-6.5.13-1-pve with the scheduler patch [1] did not seem to fix the freezes reported in this thread, we decided to revert [2] the scheduler patch in proxmox-kernel-6.5.13-2-pve, which is now available in the pvetest repositories. So, for the time being, PVE kernel 6.5 does not include a fix for the freezes reported in this thread -- disabling the NUMA balancer still seems like the most viable workaround.

Fortunately, the "proper" KVM fix that was proposed upstream made it into mainline kernel release 6.8 [3] [4]. Hence, the fix will be part of the PVE kernel 6.8, which will become available in the next weeks to months. I'll let you know in this thread when this kernel is available for testing.

[1] https://git.proxmox.com/?p=pve-kernel.git;a=commit;h=29cb6fcbb78e0d2b0b585783031402cc8d4ca148
[2] https://git.proxmox.com/?p=pve-kernel.git;a=commit;h=46bc78011a4d369a8ea17ea25418af7efcb9ca68
[3] https://git.kernel.org/pub/scm/linu.../?id=d02c357e5bfa7dfd618b7b3015624beb71f58f1f
[4] https://lore.kernel.org/lkml/CAHk-=wiehc0DfPtL6fC2=bFuyzkTnuiuYSQrr6JTQxQao6pq1Q@mail.gmail.com/#t
 

I may not have new hardware to test until early to mid April, but when I do, I've explicitly set aside 1 server to test this, among other things.

Suppose I get the hardware in April and 6.8 isn't out yet: is there a way to set up repos to get a build with the KVM fix? Or is it inextricable from the kernel?

The motivation is performance. We hit freezes a lot with Windows VMs, supposedly due to this issue with AMD Zen 4 (see my post a bit above for more details). Zen 5 is coming out later this year and I think is on TSMC 3-4 nm, whereas Intel just released 5th generation Scalable and it's still "Intel 7" (10 nm with marketing). The lithography difference means that AMD's chips will continue to have even better core counts and/or base frequencies. For a 32C chip, compute speed per thread is about 33% faster on AMD vs Intel, and with the die shrink it will probably increase to 40% or beyond. We have a lot of builds with single-threaded portions that can take hours to days, and per-thread performance is crucial. But so is not freezing.
 
A 6.8 kernel right now is not that easy, and maybe not even possible right now.
ZFS isn't even ready for the 6.8 kernel at the moment; the 6.8 kernel was only released a few days ago.

Don't expect Lamprecht & others to put hours of work into fixing compilation, which can be extremely risky even if it works, for a testing kernel right now, while ZFS will support the 6.8 kernel in probably ~2 weeks anyway.
That's just wasted work, or at least not really worth it.

Especially because there are partially working workarounds for the issue, which allow bridging the time gap.

And tbh, I'm only talking about ZFS; they also have to collect/migrate their own Proxmox kernel patches to 6.8, which I believe they are doing right now, and there are probably some other hurdles that break compilation right now.

So there is no other option than to wait at least around 2 weeks until ZFS fully supports 6.8 with 2.2.4; after that we can start pestering Lamprecht xD
Cheers
 
Last edited:
Oh, I wasn't suggesting they work late to do anything.

My question probably highlighted my amateur understanding of how PVE packages Debian and its components. I thought I was asking a dumb question with an obvious answer like "of course you can just edit the repo lists to get this KVM fix", but if it's inextricable from the kernel, I'll wait. I too have plenty to work on.
 
I have a couple of AMD Zen 4 Dell R7625 servers with dual 9554P CPUs (128C physical total, base 3.1 GHz) that I reported in a support ticket, and I was referred to the thread mentioned here.

In my case, I was experiencing freezes on Windows VMs to a non-trivial degree. The servers have capacity for a lot of high-performance VMs, and at one point we had 45 Windows VMs with 8C each. We'd see up to 3 freezes per day, but sometimes only 1-2 a week.

I spent a couple of months chasing it, and there definitely seemed to be a correlation with disk IO load. I tweaked a bunch of bandwidth caps; I think 150 MB/s was mostly solid, but I would still see a few freezes per host per week, and the absurd caps would make a 3-hour large Packer qemu-img rebase/merge job take over a day on Linux. I ended up "resolving" it by moving all Windows VMs to Intel hosts, which never hit the issue, although the latest Intel chip we have is a 2019 core design. I have 40-45 Linux VMs (Ubuntu 20.04 LTS) constantly stressing the system and they have run for 3 weeks without issue.
This sounds like a very complex issue. Do I understand correctly that
  1. you have Windows VMs that did temporarily(!) freeze on AMD hosts, and do not freeze on Intel hosts (both have multiple NUMA nodes, I presume?)
  2. and Linux VMs never froze on the AMD hosts
Especially (1) makes me suspect that the issue discussed in this thread may not be the only factor at play here.
I did disable the NUMA load balancer for a week and torture tested things, and it didn't freeze, but I've had periods of 1-2 weeks without freezes regardless, so I'm not 100% confident this is the same bug. Disabling the NUMA balancer isn't a solution for us because, while it stops the freezes, it degrades system performance so much that our VMs, which are Azure DevOps Pipelines agents, micro-disconnect long enough to abort jobs.
By "disabling NUMA", do you mean "disabling the NUMA balancer"? And if so, do I understand correctly disabling the NUMA balancer noticeably degrades VM performance? if so, that would be an interesting data point.

If optimal NUMA placement is that critical in your usecase, one option would be to use the affinity settings [1] to pin VMs to the cores belonging to one NUMA node -- in that case, I presume the NUMA balancer wouldn't have much to do anyway.
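For illustration, pinning could look roughly like the following (an untested sketch; the VMID is a placeholder and the core range must be adjusted to the NUMA node CPU lists from your own lscpu output):
Code:
# see which host CPUs belong to which NUMA node
lscpu | grep -i numa
# pin a VM to the cores of a single NUMA node (example range only)
qm set <vmid> --affinity 0-63,128-191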
Suppose I get the hardware in April and 6.8 isn't out yet: is there a way to set up repos to get a build with the KVM fix? Or is it inextricable from the kernel?
There is no supported way to get the KVM fix [2] before our 6.8 kernel is available (of course technically nothing prevents you from compiling your own kernel or trying a mainline kernel, but I would generally recommend against doing that on a production system, and these setups are not supported by us). If it had been the case that the KVM fix can be easily applied on a 6.5 kernel, we would have done so to make the fix available as soon as possible -- but unfortunately it has a bunch of dependencies on other KVM changes [3] which are not straightforward to apply on a 6.5 kernel.

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#qm_cpu_resource_limits
[2] https://git.kernel.org/pub/scm/linu.../?id=d02c357e5bfa7dfd618b7b3015624beb71f58f1f
[3] https://lore.kernel.org/all/Zaa654hwFKba_7pf@google.com/
 
This sounds like a very complex issue. Do I understand correctly that
  1. you have Windows VMs that did temporarily(!) freeze on AMD hosts, and do not freeze on Intel hosts (both have multiple NUMA nodes, I presume?)
  2. and Linux VMs never froze on the AMD hosts
Especially (1) makes me suspect that the issue discussed in this thread may not be the only factor at play here.

I had a ticket with PVE support (7990138), but I realized I didn't have the appropriate license type for support, so after some initial investigation by staff linking to this thread and related bugs, I closed it.

We have 4 servers on the way:
- 2 Dell R7625 with 2x AMD 9374F
- 2 Dell R760 with 2x Intel 8562y+

Our 2 AMD servers that hit the bug in the last year are Dell R7625 with 2x AMD 9554P. Because we're targeting more disk IO per VM, we've decided to scale back to higher performance, lower core count chips.

I intend to resume investigating this with 1 of the AMD servers outside of production so that I don't impact real builds. Currently we lack the capacity to spare a server to continue investigation.

Another team within my branch of the company just hit the freeze on Windows with nearly the same CPU, but in a single-socket configuration (AMD 9554). That's probably relevant for NUMA boundaries.

So I wonder: does the fact that my other team hit the freeze with a 9554 on a single socket eliminate the NUMA load balancer as the culprit? Or are there other NUMA boundaries within a single chip?

The output from our 9554P x2 systems:
Code:
root@rnd-compute-30:~# lscpu | grep -i numa
NUMA node(s): 2
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255

The other team just informed me their hypervisor is freezing with IDE disk settings; I let them know about the CPU soft locks associated with IDE disk types.

We "resolved" the issue by rebalancing our VMs and putting all Windows VMs on older Intel hosts and put all Linux Ubuntu 20.04 LTS VMs on the AMD hosts. While I have seen 2-3 freezes of Linux VMs over the last 2 months on the AMD servers, there's some sort of general bug in the guest OS causing that as we've seen it on Intel hosts and on VMWare ESXi before we moved to proxmox. It's super rare, and while I suppose it is possible it's the same issue because I have no way of telling the difference between the freezes, it's orders of magnitudes more rare.

I tried to do my due diligence in approaching this bug. I updated the BIOS to get the latest AMD chipset/firmware updates. I updated virtio-win to the latest version, though it has since had another release. We've been configuring disks as VirtIO with aio=threads and iothread=1, plus the VirtIO SCSI single controller required to enable iothreads. I tried Windows 11 23H2, thinking that something in the NT kernel might not be handling C-states or the CPU correctly. Nothing changed the behavior, which is quite odd considering how common Zen 4 is given its strong performance advantage. Usually I would expect something that has been out for a year to have these sorts of issues resolved.
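For reference, the relevant disk options applied from the CLI look roughly like this (a sketch only; <vmid> and the volume name are placeholders, and our real configs are posted further down in the thread):
Code:
# VirtIO SCSI single controller, which we use together with iothreads
qm set <vmid> --scsihw virtio-scsi-single
# attach the disk with aio=threads and a dedicated iothread
qm set <vmid> --virtio0 local-zfs:vm-<vmid>-disk-0,aio=threads,discard=on,iothread=1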

Maybe it's something else.


By "disabling NUMA", do you mean "disabling the NUMA balancer"? And if so, do I understand correctly disabling the NUMA balancer noticeably degrades VM performance? if so, that would be an interesting data point.

If optimal NUMA placement is that critical in your usecase, one option would be to use the affinity settings [1] to pin VMs to the cores belonging to one NUMA node -- in that case, I presume the NUMA balancer wouldn't have much to do anyway.

Yes, I meant disabling the NUMA load balancer using the command:
echo 0 > /proc/sys/kernel/numa_balancing

My amateur understanding is that disabling the NUMA balancer tanks our performance because we have a single ZFS RAID10 on a dual-socket host and we're already saturating disk IO; my assumption is that with no NUMA balancer, nothing is contextually handling the traffic that crosses NUMA boundaries (including PCIe lane traffic), which makes things worse. It didn't freeze for the 7 days I had it disabled, but that may not be enough time to be confident it resolved the freezes. However, I don't really know how these components work; I could be misunderstanding what the NUMA load balancer is responsible for, and maybe it only covers RAM placement and not PCIe lanes too.

I had to re-enable numa_balancing because performance was so bad that the Azure DevOps agent service would time out connections and abort builds.

The freezes seem to involve the NT kernel panicking in some way. The most common form is that it just freezes; often we would see 2 VMs freeze at the same guest OS system clock time, suggesting the hypervisor is involved. Sometimes Windows recovers enough to complete a crash dump; other times it crashes and then freezes while dumping files.
- Complete freezes are by far the most common (we probably had 100+ over 3 months)
- Then completed crash dumps (20-30 instances)
- Then freezes while dumping (I believe just a few times)

I believe that while I was interacting with PVE support, BSODs were ongoing but I was unaware, because Windows was automatically restarting on completion of the crash dump. We've since changed that, and I scanned all Intel VMs for crash dumps and could not find a single crash over 5 months of uptime.

For what it's worth, most of the BSODs start with:
HYPERVISOR_ERROR (20001)
The hypervisor has encountered a fatal error.

Or:
BugCheck 20001, {11, 299f40, 1005, fffff870cd375c10}
Probably caused by : ntkrnlmp.exe ( nt!HvlSkCrashdumpCallbackRoutine+6b )

It isn't really feasible for me to continue the investigation until I get the new hardware, because we're overcommitted on capacity and I really need to be able to torture test things outside of production. The rate of the freezes is significantly correlated with disk IO and/or CPU load. I did not have a single freeze for 2-3 weeks during the holidays when load was really low, and I thought I had resolved the issue with my changes, but by January 10th it clearly was not resolved.

Interestingly, I had a group so dependent on single-threaded performance that I created 10 Windows VMs with 2C instead of 8C allocated, and those have been online for 2 weeks without freezes on the AMD hosts. Not enough time to call it a pass, but I wonder if the number of cores changes how this issue occurs. I'm assuming you test the default 2C setting more than other combinations.

(Lot of edits for clarity.)
 
Last edited:
Does disabling numa_balancing also help as a workaround on servers with 1 CPU, or only on dual-processor servers?

As for disabling KSM and mitigations, it was confusing whether that is necessary or not...
 
Last edited:
@trey.b
Thanks for the detailed write-up.

I think the Proxmox team or @Thomas Lamprecht will be able to provide the 6.8 kernel, or at least a testing kernel, around the following dates:
  • March 28, 2024 (UTC) : kernel feature freeze
  • April 1, 2024 (UTC) : beta freeze
  • April 11, 2024 (UTC) : kernel freeze
  • April 18, 2024 (UTC) : final freeze
  • April 25, 2024 (UTC) : final release
Starting from April 11th, in that timeframe, OpenZFS could release OpenZFS 2.2.4 ready for the 6.8 kernel.

But that's only an assumption.

I'm waiting for the 6.8 kernel as well, because I will surely have the same issues with my 2x Genoa 9274F servers.
Let's just hope it gets solved then, because if not, it will get much worse for all of us.

Cheers
 
I summarized this thread a little, to help keep track:

-- All issues start with kernels > 5.15
- On 5.15 there are no issues.
- On 6.5.x all the same issues are still present.
- Kernel 6.8 will probably fix all issues

jens-maus (23 nodes)
Issue: WS2019 100%CPU / Freeze
Hardware: AMD EPYC 7313 & Dual Socket Xeons 6240R
Kernel 5.15 -> no issues
Kernel 6.2 -> i440fx/q35, ballooning, CPU Type, different qemu-tools (nothing helps)
Notes:
- Assigning less than 64GB-Ram to VMs helps (fewer freezes)
- All Nodes with performance CPU governor, frequency scaling disabled
- mitigations=off helps a lot, but doesn't solve the issue entirely
- disabling KSM Sharing + mitigations=off, Solves the issue

andrewrf
Issue: WS2019 100%CPU Sometimes
Hardware: Dual Socket - e5-2699 v4
Kernel 5.15 -> no issues
Kernel 6.2 -> Disabling Numa doesn't help
50-64GB Memory -> Slowdowns, but not Freeze

emunt6
Issue: WS2019 Slowdowns
Hardware: Dual Socket - E5-2667 v2

der-berni (2 Nodes)
Issue 1: Ubuntu 20.04.1 100%CPU / Freeze
Issue 2: WS2022 100%CPU / Freeze
Hardware: Xeon E-2336 / Xeon E-2236

Whatever
Issue: WS2019 Above 64GB-Ram Slowdowns / Freeze
Hardware: Dual-Socket E5-2690 v4 + Dual-Socket E5-2697 v2
Notes:
- mitigations=off helps a lot, but doesn't solve the issue entirely
- Below 16GB-Ram & 8 vCPUs per VM, Solves the issue
- WS2019 (96GB-Ram + 42 cores) -> Slowdowns/Issues
- WS2019 (128GB-Ram + 32 cores) -> Slowdowns/Issues

croRF
Issue: VMs use 100%CPU every 2 sec
Hardware: Dual-Socket E5-2699 v4

szelezola
Issue: WS2012R2 Slowdowns
Hardware: DL380-G9 - Dual-Socket E5-2630L v4

pgcent2023
Issue: WS2019 Slowdowns/Freezes
Hardware: Xeon E5-4657L v2
Notes:
- Disabling KSM Sharing makes no difference
- mitigations=off Solves the Issue

Jorge Teixeira
Issue: WS2022 100%CPU / Freeze
Hardware: DL360-G9
Notes:
- Swappiness=0 + Disabling KSM + Limit Arc, Solves the Issue

67firebird455
Issue: W11 & W10 Slowdowns
Hardware: Dell R710 - Dual-Socket L5640
Notes:
- mitigations=off Helps a lot

mygeeknc
Issue: WS2019 Slowdowns

kobemtl
Issue: W11 Slowdowns
Notes:
- mitigations=off Solves the Issue

Sebi-S
Issue: WS2022 100%CPU / Slowdowns
Hardware: Dual-Socket - Xeon Gold 5317
Notes:
- WS2022 VM has 16 CPUs and 48GB-Ram
- mitigations=off Doesn't help

JL17
Issue: WS2019 100%CPU / Slowdowns
Hardware: Dual-Socket - Xeon E5-2696 v3
Notes:
- mitigations=off and still issues below:
- WS2019 (64GB-Ram + 32 cores) -> Slowdowns / 100%CPU
- WS2019 (32GB-Ram + 16 cores) -> No Issues
- WS2019 (16GB-Ram + 16 cores) -> No Issues
- WS2019 (24GB-Ram + 12 cores) -> No Issues

chr00t (Not Proxmox, Canonical MAAS with Kernel 5.19 & 6.2)
Issue: Ubuntu 22.04 Slowdowns
Notes:
- Just for reference, from another KVM based Hypervisor

------------------------------------------

What is known for now to fix the issue, or at least "help", for most people (see the sketch after this list for making the cmdline options permanent):
- Disabling the NUMA balancer:
-- Runtime/temporary: echo 0 > /proc/sys/kernel/numa_balancing
-- Permanent: add to the kernel cmdline: numa_balancing=disable
- Disabling mitigations against Spectre/Meltdown/etc.:
-- Permanent: add to the kernel cmdline: mitigations=off
- VMs with less than 32GB-Ram seem to be less affected
- A huge number of cores in a VM can make it more affected.
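If you want to make those cmdline options permanent on a PVE host, it looks roughly like this (just a sketch, adapt it to your boot setup; the Proxmox docs have the authoritative steps):
Code:
# GRUB-booted hosts: add the options to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, e.g.
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet numa_balancing=disable mitigations=off"
update-grub

# systemd-boot hosts (e.g. ZFS root): append the options to /etc/kernel/cmdline, then
proxmox-boot-tool refresh

# after a reboot, verify
cat /proc/cmdline
cat /proc/sys/kernel/numa_balancing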

Notes about disabling the NUMA balancer and mitigations:
- Disabling the NUMA balancer comes with a big performance penalty in terms of memory bandwidth/latency on dual-socket systems.
-- On single-socket systems the performance penalty should be minimal, even on high-core-count Genoa CPUs.
- On dual-socket systems this can be mitigated with CPU pinning, or at least by avoiding multiple sockets in the VM's configuration.

- Disabling Spectre/Meltdown mitigations with "mitigations=off" will allow side-channel attacks between VMs and between applications inside VMs. In short, the vulnerabilities allow access to protected memory areas where passwords and so on are stored, and allow for manipulation.
-- This shouldn't be very problematic if you don't expose your VMs to the public internet, or to customers, etc.
-- In short, there is a risk, but in general, if the VMs are only for internal use in your network, the risks aren't high. Malware/viruses could still infect the whole hypervisor in the worst case anyway, for example if some careless RDP user downloads something or opens the wrong email.
- On newer CPUs like Genoa / Sapphire Rapids there is no performance gain from disabling mitigations, since by default they aren't as vulnerable and most of the mitigations aren't in use anymore.
- On older CPUs there should be a 5-15% performance gain with mitigations=off

Keep in mind, disabling the NUMA balancer and mitigations=off are workarounds until we can test kernel 6.8.
-> Try one of them and check if it improves your situation, not both at once.
-- Lately the NUMA balancer seems to be the main issue, but only a few people have confirmed it, so the chances are maybe 80% that disabling it fixes the issue.
-- On the other hand, mitigations=off seemed to help a lot of people here, but it may just be a workaround for the NUMA issue.

Hope this summary of the thread helps.
You may also look into this thread: https://forum.proxmox.com/threads/vms-freeze-with-100-cpu.127459/
The issue there is almost the same / similar to this thread.

Cheers
 
Last edited:
@trey.b
Thanks for the detailed write-up.

I think the Proxmox team or @Thomas Lamprecht will be able to provide the 6.8 kernel, or at least a testing kernel, around the following dates:
  • March 28, 2024 (UTC) : kernel feature freeze
  • April 1, 2024 (UTC) : beta freeze
  • April 11, 2024 (UTC) : kernel freeze
  • April 18, 2024 (UTC) : final freeze
  • April 25, 2024 (UTC) : final release
Starting from April 11th, in that timeframe, OpenZFS could release OpenZFS 2.2.4 ready for the 6.8 kernel.

But that's only an assumption.

I'm waiting for the 6.8 kernel as well, because I will surely have the same issues with my 2x Genoa 9274F servers.
Let's just hope it gets solved then, because if not, it will get much worse for all of us.

Cheers
Thanks for the info. I'm adding some details below; I'll try some things out once we get new servers in the coming weeks.

Reading through the thread and your summary, it seems plausible they are the same issue. There's a lot of variation, and different settings seem to greatly reduce the rate, but disabling the NUMA load balancer appears to be the most complete workaround.

To add some additional details on how the freezes present for our Windows VMs:
  • When it freezes, the guest's CPU usage corresponds to 100% usage of 1-7 of the 8 allocated cores, fluctuating by up to 3% but otherwise very static. E.g. for 8 cores: 25%, 50%, 62.5%, 75%, 87.5%, 100%.
  • VM is not responsive
  • The VNC window can open but shows the render from the time of the freeze
  • Guest tools do not respond
  • Pings do not respond
  • No trace in PVE logs until later when pings fail
  • Nothing in Windows logs.

I see several people mention that lower core counts and less RAM improve things, and indeed our 10 experimental Windows VMs that have been up for over 2 weeks have not frozen; they are 2C and 16 GiB, down from 8C and 32 GiB.

I have not tried disabling mitigations (spectre meltdown disable in grub) or disabling KSM sharing.

We could be seeing it on Linux too; I'm not sure. Our Windows builds are much more strenuous than our Linux builds simply because our R&D has a lot more product history and features on Windows. We had rare freezes of Ubuntu 20.04 LTS on ESXi and when we first moved to Proxmox, so we automated restoring unresponsive VMs in our daily cleanup code. We had a pseudo-freeze last weekend on a Linux VM that responded to pings and very slowly updated VNC, but it disconnected from Azure DevOps as an agent and was using near 100% CPU. At the time of the instability, it was working on a CPU-intense job and using most of the 8C allocated. It continued to use near 100%, but with a lot more fluctuation than Windows shows when it freezes.
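(Conceptually, the daily cleanup check mentioned above is nothing fancy; something along the lines of the sketch below, which is not our literal code and only illustrates the guest-agent-based approach:)
Code:
#!/bin/bash
# rough sketch: power-cycle running VMs whose guest agent stops answering
for vmid in $(qm list | awk 'NR>1 && $3=="running" {print $1}'); do
    if ! qm agent "$vmid" ping >/dev/null 2>&1; then
        echo "VM $vmid unresponsive, restarting"
        qm stop "$vmid" && qm start "$vmid"
    fi
done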

Our freezes are much more frequent with higher disk and CPU utilization, so it could be we just see it more with Windows for that reason, but others seem to suggest Windows is more impacted.

Host:
Linux 6.5.11-7-pve (2023-12-05T09:44Z)

VM settings:
Windows 10 2021 LTS spec - freezes:
agent: 1
balloon: 0
boot: order=virtio0
cores: 8
cpu: host
machine: pc-i440fx-7.2
memory: 32768
meta: creation-qemu=7.2.0,ctime=1689404255
name: REDACTED
net0: virtio=5E:0E:DE:00:1C:05,bridge=vmbr0,firewall=1
onboot: 1
ostype: win10
scsihw: virtio-scsi-single
smbios1: uuid=f4ae0855-3c7a-4584-a9e1-770e903743b6
virtio0: local-zfs:base-128114553-disk-0/vm-22805-disk-0,aio=threads,discard=on,iothread=1,size=1500G
virtio1: local-zfs:vm-22805-disk-1,iothread=1,size=1G
vmgenid: 70c26a4f-4552-4056-a344-b78048f61498

Windows 11 23H2 spec - freezes:
agent: 1
balloon: 0
bios: ovmf
boot: order=virtio0
cores: 8
cpu: host
efidisk0: local-zfs:vm-22814-disk-efi,efitype=2m,pre-enrolled-keys=1,size=1M
machine: pc-q35-7.2
memory: 32768
meta: creation-qemu=8.1.2,ctime=1705600543
name: REDACTED
net0: virtio=5E:0E:DE:00:1C:0E,bridge=vmbr0,firewall=1
onboot: 1
ostype: win11
scsihw: virtio-scsi-single
smbios1: uuid=b65c7f56-9307-4c32-8a83-b43c824a87ab
tpmstate0: local-zfs:vm-22814-disk-tpm,size=4M,version=v2.0
virtio0: local-zfs:base-128785902-disk-0/vm-22814-disk-0,aio=threads,discard=on,iothread=1,size=1500G
vmgenid: 92e866d1-a463-48e6-82d5-31828df71f56

Windows 11 23H2 experimental spec - no freezes after 2 weeks:
agent: 1
balloon: 0
bios: ovmf
boot: order=virtio0
cores: 2
cpu: host
efidisk0: local-zfs:vm-23001-disk-efi,efitype=2m,pre-enrolled-keys=1,size=1M
machine: pc-q35-7.2
memory: 16384
meta: creation-qemu=8.1.2,ctime=1704632845
name: REDACTED
net0: virtio=5E:0E:DE:00:1E:01,bridge=vmbr0,firewall=1
onboot: 1
ostype: win11
scsihw: virtio-scsi-single
smbios1: uuid=96e0edfd-9379-4526-a1a6-2ff325160b04
tpmstate0: local-zfs:vm-23001-disk-tpm,size=4M,version=v2.0
virtio0: local-zfs:base-130785902-disk-0/vm-23001-disk-0,aio=threads,discard=on,iothread=1,size=1000G
vmgenid: cec19d07-9fb6-4ce3-9ca5-85dabca6e07d

Linux VMs that are significantly more stable:
agent: 1
balloon: 0
bios: seabios
boot: order=virtio0
cores: 8
cpu: host
machine: pc-i440fx-7.2
memory: 32768
meta: creation-qemu=7.2.0,ctime=1691796451
name: REDACTED
net0: virtio=5E:0E:DE:00:1D:96,bridge=vmbr0,firewall=1
onboot: 1
ostype: l26
scsihw: virtio-scsi-single
smbios1: uuid=97754d63-c807-4bc4-b1eb-ffbbb95d77cb
virtio0: local-zfs:base-129497837-disk-0/vm-32922-disk-0,aio=threads,discard=on,iothread=1,size=1500G
vmgenid: cdf46e5c-1bde-41ab-8f9b-04a929f4e62e
 
Last edited:
Also, this may or may not be related, but while I was trying dozens of things to resolve the freezes back in December-January, I observed an issue common on both of our current PVE8 configurations:
As I've mentioned, we're saturating disk IO. We have a build job that stresses disk IO uniquely: our Packer build of the Windows VM template that we deploy to PVE as Windows build machines. It has decades of required software to build anything supported in R&D, and we have a product catalogue dating back to the 1980s. So, for example, we install Visual Studio 2003-2022; yes, all of them. The template ends up being about 400 GB used in a 600 GB disk, and we have around 24 Packer templates. At the finalization of each build, Packer uses qemu-img commit to merge those pieces.

It was taking about 7 hours (with ZFS compression enabled, ~4.5 without), and when I did the math against the expected sustained average of 500 MB/s read/write IO, it was taking much longer than it should. We have metrics that report things like disk IO, and they were reporting something slower. It turns out the metrics were smoothing out a nice sinusoidal wave; or maybe it's more blocky on/off and the activity monitor on desktop Ubuntu also smooths things out.

I found a thread that described similar behavior, and a few techniques. I found that if I lowered our zfs_dirty_data_max value from the default of 4 GB to 1 GB, it eliminated the pulsing disk IO. Now this process takes just over 2 hours, which fits the math.
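For anyone who wants to try the same thing, the change is just a ZFS module parameter; roughly like this (a sketch from memory, double-check the value against your own RAM and pool before applying):
Code:
# runtime: lower zfs_dirty_data_max from the ~4 GB default to 1 GB
echo 1073741824 > /sys/module/zfs/parameters/zfs_dirty_data_max
# persist it across reboots
echo "options zfs zfs_dirty_data_max=1073741824" >> /etc/modprobe.d/zfs.conf
update-initramfs -u -k all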

I wonder if ZFS is hitting the same micro-freezes described here and my "optimization" was just working around them, much like disabling KSM, disabling mitigations, or disabling the NUMA load balancer.

It didn't resolve our Windows freezes, but I THINK it reduced the rate. It's really hard to make any absolute claims because I was scrambling to test things with production builds before deciding that I needed a more controlled way of testing this.

While we have yet to encounter a BSOD or freeze on Windows VMs on the R840 config, the same pulsing behavior was observed there. Very similar PCIe Gen 3 NVMe disks are used, but perhaps the differences between the 2019 Intel core and AMD Zen 4, or 4 sockets vs 2, or something else, keep the bug from expressing itself in a catastrophic way. Which, assuming everything is connected here (it very likely is not), is plausible. If this bug caused chaos in all configs, it would have been reported far more widely than in a 10-page, nearly year-old forum thread.
 
Last edited:
Confirming we experience the freeze on Linux too, just way less commonly. We need this host back online in production, and I'm assuming the investigation is complete, so I'm putting it back; but if there's a list of commands to run the next time it happens, let me know and I can run them.

This is our existing AMD hardware that we "resolved" by moving all Windows VMs off of it; the new hardware is still shipping.
freeze.png
 
Last edited:
Can any of you guys check if kernel 6.8 helps? It's available, and @t.lamprecht was a lot faster than I expected :)
Thank you a lot @t.lamprecht !!!
Cheers
Yes, indeed, kernel 6.8 is now available for testing. It includes the KVM patch [1] that intends to fix the temporary freezes on hosts with multiple NUMA nodes (in combination with KSM and/or the NUMA balancer). Anyone who has been affected by these freezes: It would be great if you could check out kernel 6.8 (see [2] for instructions) and report back whether it fixes the temporary freezes in your case. You should be able to re-enable KSM and the NUMA balancer.

When you report back, please attach the output of the following commands:
Code:
lscpu
uname -a
grep "" /proc/sys/kernel/numa_* /sys/kernel/debug/sched/preempt /sys/kernel/mm/ksm/*

Thanks everyone!

[1] https://git.kernel.org/pub/scm/linu.../?id=d02c357e5bfa7dfd618b7b3015624beb71f58f1f
[2] https://forum.proxmox.com/threads/144557/
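For convenience, on a test host the opt-in installation should boil down to roughly the following (see [2] for the authoritative instructions; the package name below is taken from that announcement, so double-check it there):
Code:
apt update
apt install proxmox-kernel-6.8
# reboot into the new kernel, then confirm the running version
uname -a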
 
Last edited:
I summarized this thread a little, to help keep track:
Thank you for your effort! Some remarks:
Notes about disabling the NUMA balancer and mitigations:
- Disabling the NUMA balancer comes with a big performance penalty in terms of memory bandwidth/latency on dual-socket systems.
-- On single-socket systems the performance penalty should be minimal, even on high-core-count Genoa CPUs.
It really depends on the number of NUMA nodes reported to the host (IIRC some/many AMD processors allow configuring how many NUMA nodes are presented to the host). You can check the number of NUMA nodes in the output of lscpu. If the host only sees a single NUMA node, it should not be affected by the temporary freezes reported in this thread. If anyone sees temporary freezes on a host with a single NUMA node, please let me know.
- On dual-socket systems this can be mitigated with CPU pinning, or at least by avoiding multiple sockets in the VM's configuration.
Currently I don't think disabling the NUMA balancer always comes with a big performance penalty (but I haven't done any extensive performance testing either). I can imagine it makes a difference for some workloads (e.g. apparently for @trey.b it does). But if you are affected by the temporary freezes reported here and can't test kernel 6.8 yet, disabling the NUMA balancer seems worth a try in my opinion.
You may also look into this thread: https://forum.proxmox.com/threads/vms-freeze-with-100-cpu.127459/
The issue there is almost the same / similar to this thread.
Just for posterity, while the two issues both involve VM freezes, they are quite different:
  • In the linked thread [0], VMs would freeze permanently after some days/weeks of uptime, and could only be woken up by stopping/starting or by suspending/resuming. In principle all hosts and VMs were affected, but there were some factors that made particular VMs more prone to freezing. The issue has been completely fixed since the latest 6.2 kernel (more specifically, since 6.2.16-12).
  • In this thread, VMs freeze temporarily (for a few milliseconds/seconds), and only hosts with multiple NUMA nodes seem to be affected.
Confirming we experience the freeze on Linux too, just way less commonly. We need this host back online in production, and I'm assuming the investigation is complete, so I'm putting it back; but if there's a list of commands to run the next time it happens, let me know and I can run them.
I suspect the freeze issues you're seeing are not only due to the issues reported in this thread, but also due to other factors. Hence, could you please open a new thread for them? Feel free to mention me there. When you do, please attach the output of lscpu from all affected hosts, in particular the one running the Linux VM that froze. Please also check the journal of the Linux VM for any messages during the freezes, and if you find some, please attach them.

[0] https://forum.proxmox.com/threads/vms-freeze-with-100-cpu.127459/
 
Currently I don't think disabling the NUMA balancer always comes with a big performance penalty (but I haven't done any extensive performance testing either). I can imagine it makes a difference for some workloads (e.g. apparently for @trey.b it does). But if you are affected by the temporary freezes reported here and can't test kernel 6.8 yet, disabling the NUMA balancer seems worth a try in my opinion.


I suspect the freeze issues you're seeing are not only due to the issues reported in this thread, but also due to other factors. Hence, could you please open a new thread for them? Feel free to mention me there. When you do, please attach the output of lscpu from all affected hosts, in particular the one running the Linux VM that froze. Please also check the journal of the Linux VM for any messages during the freezes, and if you find some, please attach them.

[0] https://forum.proxmox.com/threads/vms-freeze-with-100-cpu.127459/

Thanks for the patch. I'll try it out as soon as I can, and I'll open a new support case if there are still issues; if not, I'll reply here. Unfortunately our AMD servers are delayed, at best until late April. They're basically paused at Dell for assembly; I'm betting the CPU or the PERC12 RAID card is out of stock.

I think the NUMA load balancer performance hit is especially bad for us because we're already saturating disk and CPU. I try to target a maximum sustained average of 65% logical CPU load (after all, hyperthreading isn't alchemy), but we've been at 65% for a long time. We have a single 16-disk ZFS RAID10 on the 2-socket system, and I think that, along with the already saturating load, creates catastrophic overcommitting, to the point where the PVE dashboard doesn't always respond and throws a "too many redirections" error. The new systems with PERC12 will have a RAID localized per socket, with much faster disks and CPU affinity, so that should reduce the penalty, I think, to the point where it is a viable workaround should it be needed.

Thanks for getting this out.
 
