This post is going to be pretty long, too long to fit in a single post in fact, but it represents a summary of lessons learned over ~3 weeks of experiments. It is half RFC, so that maybe PVE can be improved, and half tutorial on how to actually achieve good results. The thread is split into multiple posts due to forum limits; maybe it can become a wiki one day? I will probably keep updating this post, as there are still some unexplored avenues.
TL;DR: PVE really needs more controls for resource guarantees in latency-critical or SLA-bound scenarios. Currently all vCPUs can only be pinned to a group of logical cores, which isn't much better than no pinning at all. Most of the tweaks and optimizations can be achieved with a script (example from my system), but it's full of hacks.
Background
This project stemmed from a real business scenario, but led me to experiments on one of my private systems. The goal was to create a VM on Proxmox stable enough to handle VR gaming, which could later be transplanted to a commercial arcade. This goal is very achievable on hypervisors like VMware, and, with some tinkering, even on KVM-based solutions that use libvirt. It proved quite hard on Proxmox PVE, though.
Hardware
The test system I used:
- CPU: AMD Ryzen 9 5900X
  - UMA RAM (unified memory controller)
  - 2x NUMA domains for L3 (one per CCX)
  - 2x CCDs with 6 cores each
  - it nicely mimics multi-CPU and current enterprise-grade servers
- RAM: 128GB (4x32GB, ECC)
- Storage: tested two local configurations
  1) SATA SSD dedicated for the VM
  2) NVMe M.2 SSD dedicated for the VM
- PCIe devices:
  - 10Gb NIC managed by PVE
  - NVIDIA GPU intended for passthrough
- Specific BIOS config (skipping over obvious things like IOMMU etc.):
  - NUMA: NPS2
  - L3-as-NUMA enabled
The Problem
By default, creating a VM puts it in a shared resource pool; this causes latency spikes, leads to unpredictable behavior, and makes the VM unsuitable for running games, esp. VR ones. Proxmox PVE doesn't expose enough configuration options to properly configure such a VM. Below I summarize the options which are needed.
Available (illustrated in the example config below):
- cores: number of threads exposed to the VM
- cpuunits: systemd policy on relative CPU sharing
- memory: total memory for the VM
- hugepages: enables hugepages and specifies the HP size used; not available in GUI
- NUMA:
  - numa: 1: enables NUMA (required for hugepages)
  - numa0: ensures the proper memory pool is used; not available in GUI
- affinity: pins all QEMU processes to a group of logical CPUs (more on why this is a big problem later)
- args: raw QEMU arguments, needed for:
  - SMP topology
  - CPU flags
    - PVE exposes only some flags in the UI (e.g. AES), but a lot of them are missing
    - advanced CPU options need to be passed manually via args: -cpu ...
    - one of the options which should almost certainly be exposed is L3 host topology passthrough
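To make the above concrete, here is an illustrative config excerpt using the available knobs; VM ID, sizes, and CPU lists are placeholders matching my 5900X layout, not a recommendation:
Code:
# /etc/pve/qemu-server/100.conf (excerpt; illustrative values)
cores: 12
cpuunits: 10000
memory: 16384
hugepages: 1024
numa: 1
numa0: cpus=0-11,hostnodes=1,memory=16384,policy=bind
affinity: 6-11,18-23
args: -cpu 'host,topoext=on' -smp '12,sockets=1,cores=6,threads=2,maxcpus=12'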
Missing:
- vCPU pinning: without it, L3 eviction alone creates large and random latency spikes
- SMP-aware pinning: without it, guest scheduling decisions are completely detached from the physical topology
- I/O thread control: lack of pinning/isolation from vCPUs creates a lot of context switching, host scheduling bottlenecks, and massive L3 eviction
- Guaranteed CPU resource isolation
- Pre-HP compacting: without it, VMs will often fail to start unless they are started at boot
Why does SMP matter?
Proxmox PVE seems to completely gloss over the existence of CPU topology. In my opinion this could have been a good-enough approach 10-15 years ago, but not anymore, for three reasons:
- SMT/HT: "fake" cores aren't equal to real ones. While SMT on AMD Zen scales better than Intel's HT, we still cannot treat SMT logical CPUs as equal to real cores. For that reason both the Linux and Windows schedulers distribute load with that information taken into account; for example, compute-intensive tasks with low iowait are usually scheduled on real cores first.
- Chiplets: depending on the configuration, latencies between different cores aren't equal. This stems from separate L3 caches for different core groups (CCDs/CCXes), as inter-chiplet bandwidth is smaller than on-die bandwidth. It gets even more complex with Zen 2, where L3 is exclusive per CCX, and the headache intensifies with the X3D variants of Zen 3 and Zen 4, where L3 size differs per CCD. The net effect: inadvertently moving/rescheduling a task from one CCD to another causes a complete loss of L3 contents, triggering a large transfer from RAM via the IOD. In practice such scheduling mistakes can cause latency spikes measured in hundreds of milliseconds.
- Heterogeneous CPUs: the simplest example is Intel's E-cores & P-cores, but it gets way more complex on ARM (which Proxmox seems to be exploring). Intel's implementation additionally falls into the HT pitfall, as E-cores don't support HT. With such CPUs blind, semi-random scheduling isn't an option; Microsoft struggling for months with scheduling on new Intel chips should be proof enough.
To keep this post educational, let's look at the often-feared lscpu -e output.
Code:
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ
0 0 0 0 0:0:0:0 yes 5073.0459 2200.0000
1 0 0 1 1:1:1:0 yes 5073.0459 2200.0000
2 0 0 2 2:2:2:0 yes 5073.0459 2200.0000
3 0 0 3 3:3:3:0 yes 5073.0459 2200.0000
4 0 0 4 4:4:4:0 yes 5073.0459 2200.0000
5 0 0 5 5:5:5:0 yes 5073.0459 2200.0000
6 1 0 6 6:6:6:1 yes 5073.0459 2200.0000
7 1 0 7 7:7:7:1 yes 5073.0459 2200.0000
8 1 0 8 8:8:8:1 yes 5073.0459 2200.0000
9 1 0 9 9:9:9:1 yes 5073.0459 2200.0000
10 1 0 10 10:10:10:1 yes 5073.0459 2200.0000
11 1 0 11 11:11:11:1 yes 5073.0459 2200.0000
12 0 0 0 0:0:0:0 yes 5073.0459 2200.0000
13 0 0 1 1:1:1:0 yes 5073.0459 2200.0000
14 0 0 2 2:2:2:0 yes 5073.0459 2200.0000
15 0 0 3 3:3:3:0 yes 5073.0459 2200.0000
16 0 0 4 4:4:4:0 yes 5073.0459 2200.0000
17 0 0 5 5:5:5:0 yes 5073.0459 2200.0000
18 1 0 6 6:6:6:1 yes 5073.0459 2200.0000
19 1 0 7 7:7:7:1 yes 5073.0459 2200.0000
20 1 0 8 8:8:8:1 yes 5073.0459 2200.0000
21 1 0 9 9:9:9:1 yes 5073.0459 2200.0000
22 1 0 10 10:10:10:1 yes 5073.0459 2200.0000
23 1 0 11 11:11:11:1 yes 5073.0459 2200.0000
A few things can be observed:
- CPU contains 24 usable threads ("CPU" column)
- CPU contains 12 physical cores ("CORE" column)
- 2 threads on each core (=SMT-enabled CPU)
- Each core has exactly 2 threads (=homogeneous CPU, i.e. no E/P-cores)
- On this particular CPU the topology lists all real threads first and then all SMT ones, e.g. the first core (CORE #0) exposes threads (CPU) 0 and 12. On other CPUs the topology often interleaves threads instead (i.e. 1st core = threads 0 and 1, 2nd core = threads 2 and 3, etc.).
- Each core has separate L1d, L1i, and L2 caches
- This is because each of these columns contains a unique number per core
- For example core 4 thread ("CPU") 4 has L1/L2 #4
- It's a NUMA system with respect to (at least) L3 caches
- Cores 0-5 contain L3 group #0, while cores 6-11 report L3 group #1
- lstopo can be used to see cache details, but it's not fully reliable on UMA RAM + NUMA L3 systems
- On non-NUMA systems a simple list of sibling threads (i.e. real core + HT "core") can also be obtained using cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | sort
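For the 5900X above, this yields one real+SMT pair per physical core; with the duplicate entries collapsed via sort -nu, it should look like:
Code:
# cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | sort -nu
0,12
1,13
2,14
...
11,23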
Conclusion: VMs that are performance (esp. latency) sensitive must be configured with SMP and a correct mapping of vCPUs. The real L3 topology should be exposed to the guest. There's practically no way around that now that CPUs are no longer physically uniform.
How to do it in Proxmox?
As mentioned before, args can be used in the VM config. Example:
Code:
args: -cpu 'host,topoext=on' -smp '12,sockets=1,cores=6,threads=2,maxcpus=12'
This will create a single six-core CPU with two threads per core, i.e. SMT/HT enabled. Note that this is effectively meaningless if pinning (next section) isn't used.
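From inside a Linux guest you can then sanity-check that the topology arrived as intended; a minimal example, assuming the -smp line above:
Code:
# expect: Thread(s) per core: 2, Core(s) per socket: 6, Socket(s): 1
lscpu | grep -E '^(Thread|Core|Socket)'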
Current CPU affinity isn't enough
I hope the section above sufficiently makes the point about the importance of topology-awareness. Proxmox PVE just got a shiny new affinity option. However, while it helps, it is much too simple. Each VM naturally runs as a collection of threads:
Code:
# ps -T -p $(cat /run/qemu-server/100.pid)
PID SPID TTY TIME CMD
7022 7022 ? 00:37:39 kvm
7022 7023 ? 00:00:00 call_rcu
7022 7152 ? 06:36:15 CPU 0/KVM
7022 7153 ? 07:05:50 CPU 1/KVM
7022 7154 ? 06:39:00 CPU 2/KVM
7022 7155 ? 06:31:37 CPU 3/KVM
7022 7156 ? 06:36:18 CPU 4/KVM
7022 7157 ? 06:35:13 CPU 5/KVM
7022 7158 ? 06:31:41 CPU 6/KVM
7022 7159 ? 06:31:21 CPU 7/KVM
7022 7290 ? 00:00:00 vnc_worker
7022 7918 ? 00:00:40 worker
7022 7922 ? 00:00:40 worker
//... more worker processes
Currently, Proxmox's affinity option simply pins the main kvm process, along with all its threads, to a group of host threads. This solves the issue of rescheduling across CCDs/heterogeneous core groups, but leaves several issues on the table:
- IO thread(s) are pinned to the same set of host threads as vCPUs
- Emulator threads also share threads with vCPUs
- All vCPUs are treated as equal, with no affinity to real vs. SMT/HT threads
- vCPUs are rescheduled by the host willy-nilly between different host threads
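The last point is easy to observe live: ps can print which host CPU each thread last ran on (the PSR column), and with only affinity set, repeated runs show vCPU threads drifting within the allowed group. A quick check, assuming VM ID 100:
Code:
# SPID = thread id, PSR = host CPU the thread is currently placed on
watch -n1 "ps -T -o spid,psr,comm -p \$(cat /run/qemu-server/100.pid)"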
LibVirt via CLI
Libvirt's XML configuration allows for easy vCPU <=> host-thread pinning. Combined with an SMP definition, it makes it straightforward to recreate the host topology for the guest. Below is an example for my CPU topology, assuming the VM is meant to stay on the 2nd CCD, with IO thread(s) and emulator processes directed to the 1st CCD.
XML:
<domain>
...
<cputune>
<vcpupin vcpu="0" cpuset="6"/>
<vcpupin vcpu="1" cpuset="18"/>
<vcpupin vcpu="2" cpuset="7"/>
<vcpupin vcpu="3" cpuset="19"/>
<emulatorpin cpuset="0-5,12-17"/>
<iothreadpin iothread="1" cpuset="0-5,12-17"/>
</cputune>
...
</domain>
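The same pinning can also be applied to a running domain from the CLI; a sketch, assuming the domain is named vr-vm:
Code:
virsh vcpupin vr-vm 0 6
virsh vcpupin vr-vm 1 18
virsh vcpupin vr-vm 2 7
virsh vcpupin vr-vm 3 19
virsh emulatorpin vr-vm 0-5,12-17
virsh iothreadpin vr-vm 1 0-5,12-17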
GUI solutions
unRaid offers basic hypervisor capabilities built on top of libvirt. Its approach to pinning seems similar to PVE's, i.e. a whole machine pinned to a group of cores, but respecting SMP; since I don't use unRAID I cannot fully confirm that. virt-manager, the de-facto standard for running VMs on a libvirt-backed desktop system, offers per-vCPU pinning and can generate the pinning config from NUMA topology (see sect. 3.3 in the docs).
How to do it in Proxmox PVE now?
This isn't as easy as it looks at first. Throughout this section I will be referencing a semi-universal hook script, which I share with the community on Gist. There are 3 levels of complexity here:
- Pinning vCPUs to correct threads
- Pinning all non-vCPU processes to non-vCPU threads
- Pinning emulator & io-threads to correct threads
/run/qemu-server/<VM_ID>.pid contains the PID of the main QEMU process for the VM. Threads can be found by looking up /proc/<VM_PID>/task/*/comm for the name of the thread. Each vCPU thread will be named "CPU #/KVM". QEMU assigns threads to cores in a sequential fashion [code citation needed here]. I.e. when emulating a 6c12t CPU: "CPU 0/KVM" = 1st thread of 1st core, "CPU 1/KVM" = 2nd thread of 1st core, "CPU 2/KVM" = 1st thread of 2nd core, etc. Pinning these is as simple as using taskset --cpu-list --pid <HOST_THREAD_NUM> <vCPU_Thread_PID>. In my script mentioned above there's a handy function called pinVCpu, accepting two arguments: vCPU # and host thread #.
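For illustration, a minimal standalone sketch of the same idea (assuming VM ID 100 and the vCPU -> host-thread mapping from the libvirt example above; the real Gist script is more robust):
Code:
#!/usr/bin/env bash
VMID=100
MAPPING=(6 18 7 19)   # vCPU number -> host thread; illustrative values

VM_PID=$(cat "/run/qemu-server/${VMID}.pid")
for TASK in /proc/"${VM_PID}"/task/*; do
    # vCPU threads are named "CPU <n>/KVM"
    if [[ "$(cat "${TASK}/comm")" =~ ^CPU\ ([0-9]+)/KVM$ ]]; then
        VCPU="${BASH_REMATCH[1]}"
        taskset --cpu-list --pid "${MAPPING[$VCPU]}" "$(basename "$TASK")"
    fi
done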
Doing #2 is moderately easy. In principle, every thread which is not a vCPU thread (see above) should be pinned to a host thread or group that isn't pinned to any vCPU; see the pinNonVCpuTasks function in my script.
Doing #3 is quite hard in bash. Proxmox QEMU doesn't populate thread names for iothreads (bug?), so the only way to identify them is to reach for the QEMU monitor. Due to no measurable benefit in my use case I didn't implement that in my script. However, for Proxmox itself, which already communicates with the monitor, it would be trivial.
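For completeness, a rough sketch of how #3 could be scripted via PVE's monitor passthrough (untested; the exact "info iothreads" output format differs between QEMU versions, so the thread_id parsing is an assumption):
Code:
#!/usr/bin/env bash
VMID=100
# query iothreads via the HMP monitor and pin each reported thread_id to CCD0
pvesh create "/nodes/$(hostname)/qemu/${VMID}/monitor" --command "info iothreads" \
  | grep -o 'thread_id=[0-9]*' | cut -d= -f2 \
  | while read -r TID; do
      taskset --cpu-list --pid 0-5,12-17 "$TID"
    done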
#cont. below#