Correct guest NUMA & HT affinity

kiler129

I recently started playing with NUMA in my homelab. My test system runs an AMD Ryzen 9 5900X (2 CCXs, 6 cores per CCX, 2 threads per core) with properly reported NUMA nodes:

Code:
# lscpu
//...
Model name:                      AMD Ryzen 9 5900X 12-Core Processor
//...
NUMA node0 CPU(s):               0-5,12-17
NUMA node1 CPU(s):               6-11,18-23

New Proxmox (7.3+, I think?) now contains a convenient shortcut to "taskset" for setting CPU affinity. Assuming I'd like to pin my VM to the 2nd CCX, I can simply set the affinity to "6-11". However, that seems like only half of the story; I believe it should be set to "6-11,18-23". There's still one puzzle piece missing for me: how do I properly configure SMP so that the guest OS can best utilize the host topology?

In other words, I know I can set "-smp 12,cores=6,threads=2,sockets=1" as extra args and it will cause the VM to see a single 6-core CPU with 12 threads, mirroring the physical CCX layout. However, it seems like the guest scheduler's decisions will be disconnected from the host CPU affinity, since setting the affinity to "6-11,18-23" will essentially mix physical cores with their HT siblings.

How is such a situation usually handled? Should I be messing with guest NUMA when only a single CCX is pinned?
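For reference, this is roughly the config I'm talking about (a sketch only; VMID 100 and the exact values are placeholders, and I'm assuming the new affinity option and the extra args can be combined like this):

Code:
# /etc/pve/qemu-server/100.conf (relevant excerpt only)
cores: 12
affinity: 6-11,18-23
args: -smp 12,sockets=1,cores=6,threads=2,maxcpus=12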
 
To my knowledge, there is no simple way to pass along the "sibling" relationships from the host to the guest. However, if you want to reach under the covers, here's an easy way to identify VCPUs and map them to specific cores:
https://kb.blockbridge.com/technote/proxmox-tuning-low-latency-storage/#qemu-vcpu-affinity

Also, here's a good reason to consider pinning: https://kb.blockbridge.com/technote/proxmox-tuning-low-latency-storage/#processor-topology.

In short, keeping a guest contained to a single CCD will reduce the effective cache size, but can reduce inter-VCPU synchronization latencies. You might also consider disabling hyperthreading altogether.
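For example, roughly like this (VMID 100 is just a placeholder; adjust to your VM):

Code:
# list the per-vCPU worker threads ("CPU n/KVM") of a running VM
ps -T -p "$(cat /run/qemu-server/100.pid)" | grep /KVM

# show which host logical CPUs are SMT siblings of each other
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    echo "${cpu##*/}: $(cat "$cpu"/topology/thread_siblings_list)"
done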


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thank you for the great links and the amazing writeup on your website.

Oh yes, that's not even a question with chiplet designs - pinning is necessary to avoid cross-CCD latency. New Proxmox has a handy shortcut for that, but within the allowed set it still lets the vCPU workers roam between physical cores at the whim of the Linux scheduler - that's less than ideal.


To my knowledge, there is no simple way to pass along the "sibling" relationships from the host to the guest. However, if you want to reach under the covers, here's an easy way to identify VCPUs and map them to specific cores:
https://kb.blockbridge.com/technote/proxmox-tuning-low-latency-storage/#qemu-vcpu-affinity
TIL - I had no idea I could do that. I need to experiment with identifying the vCPUs and setting a proper SMP topology... and check whether it makes a real-world latency difference in the guest. While Win11's scheduler is smarter and actually takes the heterogeneous architecture of the CPU into account, I'm not sure Win10 even attempts to schedule processes with the SMT topology in mind. While the 5900X and 5950X are technically not fully NUMA systems, as they have a single IOD, they still behave like NUMA ones with respect to moving workloads between CCXes.
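A quick way to double-check the CCX boundaries on the host is to look at which logical CPUs share an L3 slice (a sketch; it assumes cache index3 maps to the L3, as it does on typical x86 systems):

Code:
# each Zen 3 CCX has its own L3, so the shared list marks the CCX boundary
for cpu in 0 6; do
    echo "cpu$cpu shares L3 with: $(cat /sys/devices/system/cpu/cpu$cpu/cache/index3/shared_cpu_list)"
done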

After some testing I arrived at a more-or-less three-pronged approach in the post-start phase of a hook script:
Code:
      echo "Setting CPU governot to performance"
      echo -n "Current governor is set to: "
      cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
      for file in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo "performance" > $file; done

      echo "Constraining host to 1st CCX (NUMA0)"
      systemctl set-property --runtime -- user.slice AllowedCPUs=0-5,12-17
      systemctl set-property --runtime -- system.slice AllowedCPUs=0-5,12-17
      systemctl set-property --runtime -- init.scope AllowedCPUs=0-5,12-17

      vmPid="$(< /run/qemu-server/$vmid.pid)"
      echo "Pinning 2nd CCX (NUMA1) to VM id=$vmId pid=$vmPid"
      taskset --cpu-list --all-tasks --pid "6-11,18-23" $vmPid

That way I'm not only pinning the cores to the VM but also ensuring the host won't schedule its own userspace workload on the pinned ones (kernel threads aren't covered by the slice cpusets, though, so those would need something like isolcpus).
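For completeness, a matching post-stop phase could release the host back to all threads (a sketch; 0-23 assumes the full thread list of this 5900X):

Code:
      echo "Releasing host back to all CPUs"
      systemctl set-property --runtime -- user.slice AllowedCPUs=0-23
      systemctl set-property --runtime -- system.slice AllowedCPUs=0-23
      systemctl set-property --runtime -- init.scope AllowedCPUs=0-23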

In short, keeping a guest contained to a single CCD will reduce the effective cache size, but can reduce inter-VCPU synchronization latencies. You might also consider disabling hyperthreading altogether.
That's definitely one way of handling it. However, from my other testing I found that leaving HT enabled benefits other workloads, like my virtualized NAS setup with ZFS.




** EDIT #1 **


I had some time to play with this and welp... virtualized Windows sees the cores & threads configuration properly when using -smp 12,sockets=1,cores=6,threads=2,maxcpus=12. While Task Manager doesn't want to show threads when a VM is detected, wmic does:
Code:
C:\Users\User>wmic
wmic:root\cli>CPU Get NumberOfCores,NumberOfLogicalProcessors /Format:List

NumberOfCores=6
NumberOfLogicalProcessors=12


However, in the process list QEMU simply lumps all vCPUs together, so at first glance there's no way to tell which ones should be treated as HT threads vs. real cores when pinning:
Code:
# ps -T -p 1610432
    PID    SPID TTY          TIME CMD
1610432 1610432 ?        00:00:11 kvm
1610432 1610433 ?        00:00:00 call_rcu
1610432 1610434 ?        00:00:01 kvm
1610432 1610627 ?        00:00:16 CPU 0/KVM
1610432 1610628 ?        00:00:06 CPU 1/KVM
1610432 1610629 ?        00:00:13 CPU 2/KVM
1610432 1610630 ?        00:00:07 CPU 3/KVM
1610432 1610631 ?        00:00:11 CPU 4/KVM
1610432 1610632 ?        00:00:08 CPU 5/KVM
1610432 1610633 ?        00:00:09 CPU 6/KVM
1610432 1610634 ?        00:00:08 CPU 7/KVM
1610432 1610635 ?        00:00:09 CPU 8/KVM
1610432 1610636 ?        00:00:06 CPU 9/KVM
1610432 1610637 ?        00:00:12 CPU 10/KVM
1610432 1610638 ?        00:00:12 CPU 11/KVM
1610432 1610685 ?        00:00:00 vnc_worker
1610432 1610846 ?        00:00:00 iou-wrk-1610434




** EDIT #2 **
Quick addition: QEMU assigns SMT threads sequentially. So in this case CPU 0/KVM corresponds to the first thread of the first virtual core, while CPU 1/KVM corresponds to the second (HT) thread of the first virtual core. Thus, when pinning on a real 5900X (after checking lstopo), one should pin CPU 0/KVM to host thread 0 and CPU 1/KVM to its SMT sibling, host thread 12.
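To make that concrete, here's a rough per-vCPU pinning sketch for the 2nd-CCX layout from my hook script above (the host CPU pairing is an assumption for my box; verify the sibling pairs with lstopo or thread_siblings_list first):

Code:
      vmPid="$(< /run/qemu-server/$vmid.pid)"
      # vCPU n -> host thread, pairing each virtual core's two threads with real SMT siblings:
      # vCPU0->6, vCPU1->18, vCPU2->7, vCPU3->19, ... (6 and 18 are siblings of one physical core)
      hostCpus=(6 18 7 19 8 20 9 21 10 22 11 23)
      # grab the "CPU n/KVM" worker threads and pin each one to its host thread
      ps -T -p "$vmPid" -o spid=,comm= | awk '$2 == "CPU" {gsub("/KVM", "", $3); print $1, $3}' |
      while read -r spid vcpu; do
          taskset --cpu-list --pid "${hostCpus[$vcpu]}" "$spid"
      done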
 
Hello kiler129, I'm also investigating how NUMA behaves, and especially how to get the most CPU performance.

It all started with an old Opteron 6328: despite it being a late-2013 processor, when I ran some tests on the latest Proxmox 7.4 I realized it performed much worse than it should.

In the end I installed Proxmox 6.4 because I had the feeling that CPU boost was never activated (this CPU can touch 3,800 MHz, while on Proxmox 7.4 it showed 3,200 MHz on all cores). While testing I also noticed that the VMs were constantly "moved" from core to core, giving bad results in some benchmarks.

I haven't finished my investigation yet, but I'll soon post some results and share my experience. Thank you for sharing.
 
