The reasons for poor Windows performance when the CPU type is host

Kobayashi_Bairuo

Hey guys, I did some experiments recently and I think I finally found out why Windows performs poorly when the CPU type is host. You can check the complete experiment process and conclusions on my blog, blog.bairuo.net (in Chinese, use Google Translate!).

In short, the experiments showed that the cause is the two flags md_clear and flush_l1d. They activate Windows' CPU-vulnerability mitigations, which significantly increase memory read latency and thus make Windows freeze.

The md_clear and flush_l1d flags are not passed to the virtual machine with traditional CPU types such as x86-64-v2-AES or IvyBridge-IBRS. This means Windows will not (and cannot) enable its CPU side-channel mitigations with these CPU types, so performance is not affected. That explains why Windows runs normally with these types but freezes with host, which in theory is the most capable type.
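If you want to verify this on your own setup, here is a quick way to do it (just a sketch, assuming a Linux shell on the Proxmox host; VM ID 100 is a placeholder): compare the flags your physical CPU exposes with the -cpu argument Proxmox actually generates for the VM.

# on the Proxmox host: does the physical CPU expose the two flags at all?
grep -oE 'md_clear|flush_l1d' /proc/cpuinfo | sort -u

# which -cpu argument Proxmox generates for a given VM (100 is a placeholder VM ID)
qm showcmd 100 --pretty | grep -- '-cpu'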

The good news is that it is not the Windows Hyper-V hypervisor launch (bcdedit /set hypervisorlaunchtype off) or VBS that causes the performance degradation. With the method in my blog, you can also use nested virtualization in Windows without the host CPU type.

None of this appears in the official Proxmox Windows best-practice guide, so many people are confused, and I have not seen anyone give a specific reason so far, which is why I came here. You can find an alternative to using host directly in my blog. ;)
 
Hi, thanks for sharing your insights.

That mitigations for CPU security bugs affect performance has been clear since Meltdown and Spectre were made public; that's why we added overrides for certain CPU flags in the CPU edit dialogue, so that one can enforce enabling or disabling specific CPU flags.
For Windows it's hard to tell, as it's basically a black box. In your case, with a 12+ year old E5-2667 v2 CPU, where the Meltdown/Spectre mitigations are particularly slow because there is no HW support for them in the CPU, and with that CPU not being officially supported by Windows 10 (ref), it might well be that Windows uses a less secure but faster code path if the flag is not present.
IIRC there were even some cases where highly optimized software mitigations could be a bit faster than plainly depending on the VERW instruction and the L1D_FLUSH command that md_clear and flush_l1d provide. While that was observed in the Linux world, Windows might have implemented a similar approach and simply favor the md_clear path if the flag is present, so it might also be just that (I'd need to recheck the sources to be more sure of that, though).

In any case: one should not blindly turn those flags off. Using a matching CPU type, or the newest x86-64-v* one that still works, will almost always be a better choice than type host for a variety of reasons, so adding that as generic advice to a best-practice guide might be a good idea.
For projects where a (commercial) entity relies on the underlying systems being secure, it might also be better to use a CPU that has not been EOL for about 5 years.
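For reference, a minimal example of such a flag override (only a sketch for testing, not a recommendation to run without the mitigation; 100 is a placeholder VM ID, and the exact flag syntax should be checked against the qm man page):

# keep the selected CPU type but explicitly hide md-clear from the guest
qm set 100 --cpu 'host,flags=-md-clear'

# the equivalent line in /etc/pve/qemu-server/100.conf would be:
# cpu: host,flags=-md-clear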
 
In fact, I don't see the md_clear and flush_l1d flags in the guest with either x86-64-v2 or x86-64-v3, so using x86-64-v* should not be any "safer" than my solution with respect to mitigating these two CPU attacks; the code path Windows takes should be the same. My approach is to enable additional flags on top of x86-64-v* to get (nearly) the same capabilities as host, especially nested virtualization. ;)
Based on what you said, I re-tested on my i9-13900K, and my conclusion still holds, although it is not as severe on the i9: DRAM latency only increased from about 30 ns to about 90 ns, instead of from 100 ns to 2000 ns on the E5 v2. L1/L2/L3 cache latency is also slightly affected. So even on such a modern CPU, which should have full hardware support for md_clear, the flag still reduces performance.

Here are two test runs with my custom CPU type:
[two memory-latency benchmark screenshots]

And these use CPU type host directly:
[two memory-latency benchmark screenshots]

This is my custom CPU type configuration, tailored to the i9-13900K:

cpu-model: windows-host
    flags +clflushopt;+clwb;+fsrm;+gfni;+movdir64b;+movdiri;+pdpe1gb;+pku;+rdpid;+serialize;+ss;+ssbd;+stibp;+tsc_adjust;+umip;+vaes;+vmx;+vpclmulqdq;+waitpkg
    phys-bits host
    hidden 0
    hv-vendor-id proxmox
    reported-model Skylake-Client-v4
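If anyone wants to reproduce this: as far as I know, a model defined in /etc/pve/virtual-guest/cpu-models.conf is referenced from the VM with a custom- prefix, roughly like this (100 is a placeholder VM ID):

qm set 100 --cpu custom-windows-host
# or in /etc/pve/qemu-server/100.conf:
# cpu: custom-windows-host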
Just an additional note here: it *seemed* to coincide with the 24H2 rollout as well. Performance with the host CPU type seemed to tank as people got that update.
 
CPU Type "host" also absolutely destroys our performance on Windows Servers.

AMD EPYC 9554 64-Core Processor (1 Socket)
768GB DDR5 (12*64GB)
Kioxia Performance NVMEs

If we run Windows Server 2022 with "host", certain services experience massive delays:
5 seconds instead of 0.1 seconds for each step in a multi-step service startup routine, for example, delaying startup by up to two minutes.
The performance of the application once the services were started was also FUBAR.
There were no errors; Procmon and other tools only showed "no errors, just nothing happening for a while".

Changing to x86_64-v2-AES improved performance to expected levels!

But then we had another server/software absolutely tank in performance because the AVX extensions aren't available in x86-64-v2-AES... but we could manually toggle that in the CPU selection screen, so that helped in that one case.


Our conclusion was also that some built-in MS security feature doesn't work well with QEMU's "host" setting on our quite modern hardware.
(I think things pointed in the direction of Windows Device Security / Core Isolation / Memory Integrity; changing to x86-64-v2-AES worked around it.)


I'd be very happy if the problematic CPU feature could be identified and disabled while still using most of the "host" features for added performance (e.g. AVX).
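One test we may try next, based on the md_clear suspicion earlier in this thread (only a sketch, not something we have verified yet, and not meant as advice to run without the mitigation; 100 is a placeholder VM ID):

# keep type host but hide the suspected flag, then re-run the startup measurement
qm set 100 --cpu 'host,flags=-md-clear'
qm stop 100 && qm start 100   # full stop/start so the guest re-reads CPUID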
 
But then we had another server/software absolutely tank in performance because the AVX extensions aren't available in x86-64-v2-AES... but we could manually toggle that in the CPU selection screen, so that helped in that one case.
Have you tried setting the CPU type to EPYC-Genoa?

[screenshot: CPU type selection showing EPYC-Genoa]

According to

a matching CPU type, or the newest x86-64-v* one that still works, will almost always be a better choice than type host for a variety of reasons

the best choice for you would be either EPYC-Genoa or x86-64-v4 (which enables AVX-512).
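For example (assuming VM ID 100 and a PVE/QEMU version that already ships the EPYC-Genoa model):

qm set 100 --cpu EPYC-Genoa
# or the generic baseline that still includes AVX-512:
qm set 100 --cpu x86-64-v4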
 
Have you tried setting the CPU type to EPYC-Genoa?
...
the best choice for you would be either EPYC-Genoa or x86-64-v4 (which enables AVX-512).

Yes, we tried those; even x86-64-v3 showed bad performance (but more randomly, sometimes the service sequence had a step at 5 seconds and the next at 0.1). Only -v2 showed consistently good performance.
 
Yes, we tried those; even x86-64-v3 showed bad performance (but more randomly, sometimes the service sequence had a step at 5 seconds and the next at 0.1). Only -v2 showed consistently good performance.
This is surprising, given that v3 and v4 only add instructions on top of v2/v2-AES. Here [1] is the definition of each x86-64 CPU type. Maybe you could try using a custom CPU model, manually add each flag, and bench your workload to find out which one introduces the performance loss you see.
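A rough sketch of that approach, reusing the /etc/pve/virtual-guest/cpu-models.conf mechanism shown earlier in this thread. The flag list below is only an illustration (roughly "v2-AES plus AVX"), not a verified copy of the definitions; take the authoritative per-level flag lists from [1] and add them one at a time:

cpu-model: bisect-test
    flags +pni;+ssse3;+sse4.1;+sse4.2;+popcnt;+aes;+avx
    phys-bits host
    hidden 0
    reported-model qemu64

Then point the VM at it with cpu: custom-bisect-test, re-run the service-startup benchmark, and repeat with the next flag added.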

I have not seen anything like that in any of my clusters, neither with older nor newer hardware, from either Intel or AMD.

[1] https://git.proxmox.com/?p=qemu-ser...96064435f04d0c37e61b03510f9a16e7c;hb=HEAD#l35