Optimizing Guest CPU Performance

louhy (New Member) · May 22, 2023
Running a Win10 guest with successful GPU passthrough; performance on that front seems decent. But I can't say the same for the CPU.


[Screenshots: PassMark CPU benchmark results from the Win10 guest]

Pretty disappointing. For comparison, these are the bare-metal averages the host's CPU should be near (Single Thread in particular):

CPU Test Suite Average Results for Intel Xeon E5-2690 @ 2.90GHz

Integer Math: 39,651 MOps/Sec
Floating Point Math: 14,699 MOps/Sec
Find Prime Numbers: 49 Million Primes/Sec
Random String Sorting: 25,426 Thousand Strings/Sec
Data Encryption: 3,061 MBytes/Sec
Data Compression: 170,967 KBytes/Sec
Physics: 725 Frames/Sec
Extended Instructions: 7,383 Million Matrices/Sec
Single Thread: 1,661 MOps/Sec


I realize most, or maybe all, of these scores other than Single Thread should be irrelevant due to the difference in core counts (the guest only has 4 cores; the host has 16 across its dual sockets). But I would still think the Single Thread performance should be closer to 1,661... shouldn't it?

Proxmox version is 7.4-3.

affinity: 0-3
agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0
cores: 4
cpu: host
cpuunits: 10000
efidisk0: local-lvm:vm-102-disk-0,size=4M
hostpci0: 0000:42:00,pcie=1,x-vga=1
machine: pc-q35-5.2
memory: 16128
name: win10vm
net0: virtio=E2:F4:FA:12:FE:35,bridge=vmbr0
numa: 1
ostype: win10
scsi0: local-lvm:vm-102-disk-1,cache=writeback,iothread=1,replicate=0,size=55G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=7aeaefca-7a7b-4378-a6a0-181357a63967
sockets: 1
vga: none
vmgenid: b9575c47-6a1b-4302-a9ac-9559dd1c5f02
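For what it's worth, with `affinity: 0-3` set you can double-check on the host where the QEMU threads actually land. A host-side sketch (VMID 102 taken from the config above):

```shell
# List the host CPUs each QEMU thread of VM 102 is allowed to run on.
# /run/qemu-server/<vmid>.pid is where Proxmox records the QEMU PID.
pid=$(cat /run/qemu-server/102.pid)
for tid in /proc/"$pid"/task/*; do
    taskset -cp "$(basename "$tid")"
done
```

If the vCPU threads show an allowed list other than 0-3, the affinity setting isn't being applied the way you expect.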


I found this somewhere else as a performance tip:
(3) Ensure each vCPU has two sibling cores isolated and dedicated.

Is that correct? Wouldn't that defeat the point of CPU pinning?

Looking for ideas to see if I can improve performance.
 
If your system uses NUMA, then you might get better (memory) performance if you give the VM the same number of virtual sockets as physical sockets/NUMA domains (according to the Proxmox manual). I hope this gets you enough performance, so you don't need the whole core-pinning hassle.

If you go through with CPU pinning, make sure not to pin half the VM cores to the hyper-thread siblings of the other pinned VM cores. When both siblings are busy at the same time, performance is a little higher than 1x but nowhere near 2x. But also make sure all pinned VM cores are on the same L3 cache (and NUMA domain), otherwise (memory) performance goes down again.
Note that some virtualization overhead and emulation (network, disks) also runs on a (separate, if you enable it) thread, which might land on a different NUMA domain and not be optimally fast.
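A quick way to see which logical CPUs are hyper-thread siblings and how they map to NUMA nodes (host-side sketch; the sysfs paths are standard Linux, and `numactl` may need installing):

```shell
# Print each CPU's sibling pair so you can avoid pinning both siblings.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    echo "$(basename "$cpu"): siblings=$(cat "$cpu"/topology/thread_siblings_list)"
done

# NUMA layout at a glance: which CPUs and how much memory per node.
numactl --hardware
```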
 
I have a very similar setup with 2x E5-2660. As described above, I set the VM up to match the host: 2 sockets, 16 cores.
Results, including single thread, are much better. So NUMA and matching cores seem to be the answer. It also stops Adobe software crashing when it does its checking thing.
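If you want to try the same matching-topology setup, it's a couple of qm settings. A sketch (VMID 102 is the OP's VM; shut the VM down first and adjust to your box):

```shell
# Mirror the host topology: 2 sockets x 8 cores, with NUMA enabled so
# guest memory is split across the virtual sockets like the physical ones.
qm set 102 --sockets 2 --cores 8 --numa 1
```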

[Screenshot: PassMark results from the 2-socket, 16-core VM]
 
@leesteken Wow, that's a lot to absorb. lol Yeah I guess I have some reading to do.

@toomanylogins Now those numbers are more like I would expect! I must have something seriously hosed up here.
...Wait so are you saying you basically over-provisioned the VM and gave it everything the host has (2 sockets + 16 cores)?
 
Code:
# lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          32
On-line CPU(s) list:             0-31
Thread(s) per core:              2
Core(s) per socket:              8
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           45
Model name:                      Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
Stepping:                        7
CPU MHz:                         3800.000
CPU max MHz:                     3800.0000
CPU min MHz:                     1200.0000
BogoMIPS:                        5800.47
Virtualization:                  VT-x
L1d cache:                       512 KiB
L1i cache:                       512 KiB
L2 cache:                        4 MiB
L3 cache:                        40 MiB
NUMA node0 CPU(s):               0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s):               1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Mmio stale data:   Unknown: No mitigations
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm pti ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts


Okay so two NUMA nodes, so if I understand correctly...

[Screenshots: NUMA-aware VM settings and the resulting benchmark scores]

:(

One funny symptom I see in this VM is how slow Task Manager is to switch tabs - it takes a few seconds just to show the CPU usage. Most anything else isn't nearly that slow; in fact it's almost fine. Is that Task Manager slowness expected, even if performance is as it should be? The benchmark is still pretty concerning.
 
@leesteken Wow, that's a lot to absorb. lol Yeah I guess I have some reading to do.

@toomanylogins Now those numbers are more like I would expect! I must have something seriously hosed up here.
...Wait so are you saying you basically over-provisioned the VM and gave it everything the host has (2 sockets + 16 cores)?
Yes. I did this to solve another problem relating to Adobe software and DragonDictate. Both of these use the Flexera common software manager, and since the VM was virtualized from my original workstation, the software would crash if you tried to use CPU type "host" with a virtual core count that didn't match. It was okay with kvm64, but that gives a performance penalty. So to get the software to work I have the VM set up to match the host hardware, which is probably why the performance is okay. I've still got a bunch of other virtual machines running on the server without problems, but it's my home office development machine so there's not a lot of traffic.

I notice you have quite an old machine type, 5.2? I am using 7.2.
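Bumping the machine version is a one-liner (VM shut down first; Windows may briefly re-detect some devices after the change):

```shell
# Move the virtual chipset from the pinned q35 5.2 to a current version.
qm set 102 --machine pc-q35-7.2
```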
 
Thanks for pointing out the machine type thing. As far as benchmarks, I did get it to give me a Single Thread score over 1,000 ONE time but I'm not able to repeat it. LOL But I do think that machine version setting was something I needed to change regardless.

Would you mind posting the configuration for that VM from /etc/pve/nodes/[node name]/qemu-server/[vm id].conf?

According to this, my CPU should perform a little better, but it's definitely worse, so this is still a mystery. (Ignore the missing L3 cache on mine; that's wrong, and lscpu clearly reports that it's the same at 40MB.)

Code:
                      Intel Xeon E5-2660 @ 2.20GHz    Intel Xeon E5-2690 @ 2.90GHz
Socket Type           LGA2011                         LGA2011
CPU Class             Server                          Server
Clockspeed            2.2 GHz                         2.9 GHz
Turbo Speed           Up to 3.0 GHz                   Up to 3.8 GHz
# of Physical Cores   8 (Threads: 16)                 8 (Threads: 16)
Cache                 L1: 128KB, L2: 0.5MB, L3: 40MB  L1: 512KB, L2: 2.0MB
Max TDP               95W                             135W
Yearly Running Cost   $17.34                          $24.64
First Seen on Chart   Q2 2012                         Q1 2012
# of Samples          186                             555
CPU Value             281.1                           251.0
Single Thread Rating  1400 (-15.8%)                   1661 (0.0%)
CPU Mark              8152 (-16.7%)                   9790 (0.0%)

I think I've seen a 600 single-thread score once or twice, but it's still disappointing. This was with 2 sockets, 2 cores each, no CPU pinning, CPU units 10,000, and processor type "host".


[Screenshot: benchmark with 2 sockets, 2 cores each]

And here's the crazy-mode with 2 sockets 16 cores (32 virtual). THIS was a shocker:

[Screenshot: benchmark with 2 sockets, 16 cores (32 vCPUs)]

That is frustrating... everything does well except the one number I really need! There's got to be something I'm missing...
 
Here you go.


agent: 1
balloon: 0
bios: seabios
boot: order=scsi0;ide2
cores: 8
cpu: host
cpuunits: 1000
hostpci0: 0000:03:00,pcie=1,x-vga=1
hostpci1: 0000:00:1d
ide2: none,media=cdrom
machine: pc-q35-7.2
memory: 16384
name: paul-win10-final
net0: e1000=AA:B9:25:32:F8:BA,bridge=vmbr0,firewall=1
net1: virtio=26:AB:1C:A2:17:F2,bridge=vmbr0
numa: 1
onboot: 1
ostype: win10
parent: qemu_cpu
scsi0: nvme:302/vm-109-disk-3.qcow2,backup=0,cache=writeback,discard=on,iothread=1,size=256059432448,ssd=1
scsi1: dev-sdd:302/vm-302-disk-0.qcow2,backup=0,cache=writethrough,discard=on,iothread=1,size=128G,ssd=1
scsi2: local-lvm:vm-302-disk-0,cache=writeback,discard=on,iothread=1,size=64G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=aed097aa-6856-400f-83e4-33fa2c9daadb
sockets: 2
startup: order=10,up=240
tablet: 0
vmgenid: b59e7ef9-ceb9-493c-b5f5-a73058b07e43
 
Thanks!
Really similar now, but not completely. It's a little ridiculous that we've got almost the exact same setup but my single-thread performance is way off!

The BIOS is different but I can't imagine why that'd matter. I'll match the cores and cpuunits for the heck of it:

agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0
cores: 8
cpu: host,flags=+pcid
cpuunits: 1000
efidisk0: local-lvm:vm-102-disk-0,size=4M
hostpci0: 0000:42:00,pcie=1,x-vga=1
machine: pc-q35-7.2
memory: 16128
name: win10vm
net0: virtio=E2:F4:FA:12:FE:35,bridge=vmbr0
numa: 1
ostype: win10
scsi0: local-lvm:vm-102-disk-1,cache=writeback,iothread=1,replicate=0,size=55G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=7aeaefca-7a7b-4378-a6a0-181357a63967
sockets: 2
vga: none
vmgenid: b9575c47-6a1b-4302-a9ac-9559dd1c5f02

---

parent: qemu_cpu

I wonder what this is in your config.

I just enabled 1GB hugepages on the host now, I doubt it'll matter for a CPU benchmark but let's try some updated tests...
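For reference, 1GB hugepages involve both a host and a VM setting. A sketch of the shape (the hugepage count of 20 is an example sized for ~16GB of guest RAM plus headroom; size it to your box):

```shell
# Host: reserve 1G pages at boot via the kernel command line, e.g. in
# /etc/default/grub, then run update-grub and reboot:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet default_hugepagesz=1G hugepagesz=1G hugepages=20"

# VM: tell Proxmox/QEMU to back guest memory with 1G pages (value in MB).
qm set 102 --hugepages 1024
```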
 
It makes no sense!!!! Every other number is about what you'd expect, but that single thread is junk.

[Screenshot: benchmark results after the config changes]

Only idea I have left to try is adding that "parent: qemu_cpu" thing.

Edit - No difference at all. Out of ideas...
 
Okay, I lied - I figured it out.

I did not think it would make such a difference, but I decided to try hooking a monitor up directly one more time, along with a passed through USB mouse.

Single threaded score went way up, I think it was 1700 (maybe I misremember and it was 1200's which makes more sense), but in any case it was way up and it worked perfectly fine.

So I guess it was the remote desktop connection... it's hard to believe it makes that much difference. I guess now the last thing I have to do is see if using "Looking Glass" is feasible here, which I think gets around this type of thing somehow.

Connecting a monitor directly defeats the purpose of doing this - I want a fast Windows VM running in a convenient Linux window, without a KVM or monitor input switching involved. I can already do the KVM thing now.
 
Morning. Not quite, although I don't know the answer. I just ran the test on a plain vanilla Win 10 VM via RDP (no passthrough on this one) on the same host, with 1 socket, 4 cores, and only 8GB RAM. Single thread is the same. So it's not the RDP or the host/cores/NUMA that is the problem.

[Screenshot: PassMark results from the plain Win10 VM over RDP]

agent: 1
boot: order=scsi0;ide2
cores: 4
cpu: host
ide2: none,media=cdrom
machine: pc-i440fx-7.1
memory: 8192
name: win10-base
net0: e1000=12:7C:C3:D9:F5:84,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
parent: node_installed
scsi0: local-lvm:vm-103-disk-0,discard=on,size=42G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=4ffee31e-c7e9-4a03-9dfe-3e4e200951ef
sockets: 1
unused0: dev-sdb:103/Win10Base.qcow2
vmgenid: 771104ae-23f2-44c9-9100-72f24831f206
 
Oooh, that is interesting. Maybe I shouldn't give up on the RDP method, but I'm not really sure what to try next either. I see you even changed the machine type from q35 too.

Hmm, I wonder if it could be the NIC? I'm not passing anything through for that, and obviously RDP is pretty dependent on it. What I'm using is supposed to be paravirtualized though... if this R720 allows it, maybe I could try passing through a network port to see what happens.
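If I do try the NIC passthrough experiment, the rough shape would be something like this (the PCI address 0000:01:00.0 is a placeholder; find the real one with lspci, and the VM must be off):

```shell
# Find the Ethernet controller's PCI address on the host.
lspci -nn | grep -i ethernet

# Pass one port through to the VM (placeholder address shown).
qm set 102 --hostpci1 0000:01:00.0,pcie=1
```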

Maybe I'll try kicking off the test and then disconnect real fast and reconnect a minute later. lol
 
Ha... the "run it and disconnect real fast" test actually had interesting results:

[Screenshot: benchmark result after disconnecting during the run]

Hmm... without disconnecting, using a smaller Remmina window:

[Screenshot: result with a smaller Remmina window]

Fullscreen RDP (test software window was also maximized):

[Screenshot: result with fullscreen RDP and a maximized test window]

And here's the one that really surprised me.
Fullscreen RDP via Remmina, and I just made the PassMark PerformanceTest window small:

[Screenshot: fullscreen RDP with a small PassMark PerformanceTest window]

Resize it small again and it went back up to 1566. Makes me question the value of this test software...
 
It looks like the size of the software-rendered (because of RDP) desktop/window influences your CPU score. Maybe that's not as strange as it seems, since software rendering takes CPU cycles (as well as memory) away from the benchmark. On bare metal, the GPU would do the window and desktop rendering.
 
Yeah I don't know what's involved in using RDP (well I say RDP, I really mean Remmina, not Windows remote desktop) vs pulling data right out of the GPU, but if this is normal that's a shocking amount of penalty. Even if it does take a few cores all their time to push that data, why is it slowing single threaded performance down for ALL of them?

So I don't get it really, with all the supposed "remote gaming" services out there where they're supposed to stream the visuals to you - shouldn't doing this over a local network be easy? I guess there's just no good way to get the data from the card to the screen on another PC.

I may try some alternate RDP software to see if it makes any difference, but I'll probably end up needing to lose the flexibility of remote access and just do a regular passthrough on my main Linux PC with Looking Glass.
 
Yeah I don't know what's involved in using RDP (well I say RDP, I really mean Remmina, not Windows remote desktop) vs pulling data right out of the GPU, but if this is normal that's a shocking amount of penalty. Even if it does take a few cores all their time to push that data, why is it slowing single threaded performance down for ALL of them?
All those pixels need to be computed by a CPU emulating a GPU. It does not really surprise me.
So I don't get it really, with all the supposed "remote gaming" services out there where they're supposed to stream the visuals to you - shouldn't doing this over a local network be easy? I guess there's just no good way to get the data from the card to the screen on another PC.
They use professional GPUs to encode it as compressed video, which is technically different from regular RDP. "Remote" can mean drawing the user interface element remotely (sending draw commands), or drawing pixels locally (with GPU or CPU) and sending parts of the screen as pictures remotely, or "recording a video" (with GPU or CPU) and sending that remotely. There are different techniques that suit different purposes.
I may try some alternate RDP software to see if it makes any difference, but I'll probably end up needing to lose the flexibility of remote access and just do a regular passthrough on my main Linux PC with Looking Glass.
Looking glass and Parsec (no experience with either) are more suitable for game streaming, yes. I have seen some threads about those on this forum.
 
The CPU shouldn't be emulating the GPU though, I'm passing the GPU through so all the CPU needs to do is grab whatever the GPU renders and push it over a network connection, right? (Probably this is an oversimplification...)

Yes, I was just looking at Parsec, but it's shady that they require you to log in to something even if you're just connecting over LAN. Especially since it's "free" (it's never free).
 
The CPU shouldn't be emulating the GPU though, I'm passing the GPU through so all the CPU needs to do is grab whatever the GPU renders and push it over a network connection, right? (Probably this is an oversimplification...)
That's not how some of the RDP protocols work, sorry. Some read the screen from the GPU, and others switch to software rendering on the CPU(s).
 
Makes sense... well, maybe I'll try this Parsec thing and just lock it down with a local firewall after the initial connect. It's still strange that it's hogging every core instead of just a few to do its thing.
 
