Correct guest NUMA & HT affinity

kiler129

Member
Oct 20, 2020
I recently started playing with NUMA in my homelab. My test system runs an AMD Ryzen 9 5900X (6 cores per CCX, 2x CCX, 2 threads per core) with properly reported NUMA nodes:

Code:
# lscpu
//...
Model name:                      AMD Ryzen 9 5900X 12-Core Processor
//...
NUMA node0 CPU(s):               0-5,12-17
NUMA node1 CPU(s):               6-11,18-23

New Proxmox (I think 7.3+?) now contains a convenient shortcut to "taskset" for setting CPU affinity. Assuming I would like to pin my VM to the 2nd CCX, I could simply set the affinity to "6-11". However, this seems like only half of the story: I believe it should be set to "6-11,18-23" so the SMT siblings are covered as well. Still, there's one puzzle piece missing in my book: how do I properly configure SMP to let the guest OS best utilize the host topology?
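
To make it concrete, here's roughly what I mean (VM id 100 is just an example, and I'm assuming the new affinity option simply accepts a plain cpulist):
Code:
# example only - assuming VM id 100 and that the new affinity option takes a cpulist
qm set 100 --affinity 6-11,18-23
# ...which should end up as an "affinity: 6-11,18-23" line in /etc/pve/qemu-server/100.conf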

In other words, I know I can set "-smp 12,cores=6,threads=2,sockets=1" as extra args and it will cause the VM to see a single 6-core CPU with 12 threads, mirroring the physical CCX layout. However, it seems like the guest scheduler's decisions will be disconnected from the host CPU affinity, since an affinity of "6-11,18-23" essentially mixes real cores with their HT siblings.
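
For reference, I mean something roughly like this in the VM config (VM id 100 is again just an example, and I'm assuming the documented args: option is the right place for it):
Code:
# /etc/pve/qemu-server/100.conf (excerpt) - hypothetical extra QEMU args mirroring one 6c/12t CCX
args: -smp 12,cores=6,threads=2,sockets=1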

How is such a situation usually handled? Should I be messing with guest NUMA when only a single CCX is pinned?
 
To my knowledge, there is no simple way to pass along the "sibling" relationships from the host to the guest. However, if you want to reach under the covers, here's an easy way to identify VCPUs and map them to specific cores:
https://kb.blockbridge.com/technote/proxmox-tuning-low-latency-storage/#qemu-vcpu-affinity

Also, here's a good reason to consider pinning: https://kb.blockbridge.com/technote/proxmox-tuning-low-latency-storage/#processor-topology.

In short, keeping a guest contained to a single CCD will reduce the effective cache size available to it, but can reduce inter-VCPU synchronization latencies. You might also consider disabling hyperthreading altogether.
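
As a starting point for mapping things out, the host's SMT sibling pairs are visible in sysfs (a quick sketch; output formatting may vary a bit between kernels):
Code:
# list each logical CPU's SMT sibling(s) on the host
grep . /sys/devices/system/cpu/cpu*/topology/thread_siblings_list

# or a per-CPU view with core and NUMA node columns
lscpu --extended=CPU,CORE,NODE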


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thank you for the great links and the amazing writeup on your website.

Oh yes, with chiplet designs that's not even a question - pinning is necessary to avoid cross-CCD latency. New Proxmox has a handy shortcut for that, but it still lets the vCPU workers roam between the allowed physical cores according to the Linux scheduler - that's less than ideal.


To my knowledge, there is no simple way to pass along the "sibling" relationships from the host to the guest. However, if you want to reach under the covers, here's an easy way to identify VCPUs and map them to specific cores:
https://kb.blockbridge.com/technote/proxmox-tuning-low-latency-storage/#qemu-vcpu-affinity
TIL - I had no idea I could do that. I need to experiment with identifying the vCPU threads and setting a proper SMP topology... and check whether that makes a real-world latency difference in the guest. While Win11's scheduler is smarter and actually takes the heterogeneous architecture of the CPU into account, I'm not sure Win10 even attempts to schedule processes based on SMT properties. And while the 5900X and 5950X are technically not fully NUMA systems, as they have a single IOD, they still behave like NUMA ones with respect to moving workloads between CCXes.

After some testing I arrived at a more-or-less three-pronged approach in the post-start phase (hook script):
Code:
      echo "Setting CPU governot to performance"
      echo -n "Current governor is set to: "
      cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
      for file in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo "performance" > $file; done

      echo "Constraining host to 1st CCX (NUMA0)"
      systemctl set-property --runtime -- user.slice AllowedCPUs=0-5,12-17
      systemctl set-property --runtime -- system.slice AllowedCPUs=0-5,12-17
      systemctl set-property --runtime -- init.scope AllowedCPUs=0-5,12-17

      vmPid="$(< /run/qemu-server/$vmid.pid)"
      echo "Pinning 2nd CCX (NUMA1) to VM id=$vmId pid=$vmPid"
      taskset --cpu-list --all-tasks --pid "6-11,18-23" $vmPid

That way I'm not only pinning the cores to the VM but also ensuring the host will not schedule its own userspace workload on the pinned ones (unbound kernel threads can still land there, though).
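
For completeness, that snippet lives in the post-start branch of a Proxmox hookscript; here's a rough skeleton of how I wired it up (file name and location are just my choice):
Code:
#!/bin/bash
# /var/lib/vz/snippets/pin-vm.sh - rough skeleton, registered with:
#   qm set <vmid> --hookscript local:snippets/pin-vm.sh
vmid="$1"
phase="$2"

if [ "$phase" = "post-start" ]; then
    # ...the governor / AllowedCPUs block from above goes here...
    vmPid="$(< /run/qemu-server/$vmid.pid)"
    taskset --cpu-list --all-tasks --pid "6-11,18-23" "$vmPid"
fi

exit 0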

In short, keeping a guest contained to a single CCD will reduce the effective cache size, but can reduce inter-VCPU synchronization latencies. You might also consider disabling hyperthreading all together.
That's definitely one way of handling it. However, from my other testing I found that leaving HT enabled benefits other workloads, like my virtualized NAS setup with ZFS.




** EDIT #1 **


I had some time to play with this and, welp... virtualized Windows sees the cores & threads configuration properly when using -smp 12,sockets=1,cores=6,threads=2,maxcpus=12. While Task Manager doesn't want to show threads when a VM is detected, wmic does:
Code:
C:\Users\User>wmic
wmic:root\cli>CPU Get NumberOfCores,NumberOfLogicalProcessors /Format:List

NumberOfCores=6
NumberOfLogicalProcessors=12


However, QEMU simply lumps all the vCPU threads together, so at first glance it seems impossible to tell which ones should be pinned to real cores and which to their HT siblings:
Code:
# ps -T -p 1610432
    PID    SPID TTY          TIME CMD
1610432 1610432 ?        00:00:11 kvm
1610432 1610433 ?        00:00:00 call_rcu
1610432 1610434 ?        00:00:01 kvm
1610432 1610627 ?        00:00:16 CPU 0/KVM
1610432 1610628 ?        00:00:06 CPU 1/KVM
1610432 1610629 ?        00:00:13 CPU 2/KVM
1610432 1610630 ?        00:00:07 CPU 3/KVM
1610432 1610631 ?        00:00:11 CPU 4/KVM
1610432 1610632 ?        00:00:08 CPU 5/KVM
1610432 1610633 ?        00:00:09 CPU 6/KVM
1610432 1610634 ?        00:00:08 CPU 7/KVM
1610432 1610635 ?        00:00:09 CPU 8/KVM
1610432 1610636 ?        00:00:06 CPU 9/KVM
1610432 1610637 ?        00:00:12 CPU 10/KVM
1610432 1610638 ?        00:00:12 CPU 11/KVM
1610432 1610685 ?        00:00:00 vnc_worker
1610432 1610846 ?        00:00:00 iou-wrk-1610434




** EDIT #2 **
Quick addition: QEMU assigns SMT threads sequentially. So in this case CPU 0/KVM corresponds to the first thread of the first virtual core, while CPU 1/KVM corresponds to the second (HT) thread of the same core. Thus, when pinning on a real 5900X (after checking lstopo), one should pin CPU 0/KVM to host thread 0 and CPU 1/KVM to its SMT sibling, host thread 12 (or to threads 6 and 18 respectively when using the 2nd CCX).
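
To make that concrete, here's a rough sketch of how the individual vCPU threads could be pinned based on the "CPU <n>/KVM" thread names from the ps output above. The vCPU-to-host mapping array is just my assumption for the 2nd CCX (pairs of SMT siblings, interleaved to match -smp cores=6,threads=2) - double-check it against lstopo before using it:
Code:
# rough sketch: pin each "CPU <n>/KVM" thread to a dedicated host thread
vmid=100                                   # example VM id (assumption)
vmPid="$(< /run/qemu-server/$vmid.pid)"

# vCPU index -> host logical CPU; 2nd CCX cores 6-11 with SMT siblings 18-23
hostCpus=(6 18 7 19 8 20 9 21 10 22 11 23)

for tdir in /proc/"$vmPid"/task/*; do
    comm="$(< "$tdir/comm")"
    # vCPU worker threads are named "CPU <n>/KVM"
    if [[ "$comm" =~ ^CPU\ ([0-9]+)/KVM$ ]]; then
        vcpu="${BASH_REMATCH[1]}"
        taskset --cpu-list --pid "${hostCpus[$vcpu]}" "${tdir##*/}"
    fi
done

With the array above, vCPU 0 and vCPU 1 (the two SMT threads of the guest's first core) end up on host threads 6 and 18, which should be SMT siblings of the same physical core on the 2nd CCX.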
 
Hello kiler129, I'm also investigating how NUMA works, especially with regard to maximizing CPU performance.

It all started with an old Opteron 6328: despite being a processor from late 2013, some tests on the latest Proxmox 7.4 showed it performing surprisingly badly.

In the end I installed Proxmox 6.4, because I had the feeling that CPU boost was never activated (this CPU can touch 3,800 MHz, while on Proxmox 7.4 it showed 3,200 MHz on all cores). While testing I also noticed that the VMs were constantly "moved" from core to core, giving bad results in some benchmarks.

I haven't finished my investigation yet, but I'll show some results and share my experience soon. Thank you for sharing.
 
