[TUTORIAL] Hey Proxmox & Community - Let's talk about resources isolation

kiler129

Member
Oct 20, 2020
This write-up is going to be pretty long, too long to fit in a single post, but it represents a summary of lessons learned over ~3 weeks of experiments. It is half-tutorial on how to actually achieve good results and half-RFC on how PVE could be improved. The thread is split into multiple posts due to forum limits - maybe it can be a wiki one day? ;) I will probably be updating this post a bit as there are some unexplored avenues.

TL;DR: PVE really needs more controls for resource guarantees in latency-critical or SLA-bound scenarios. Currently all vCPUs can only be pinned to a group of logical cores, which isn't much better than no pinning at all. Most of the tweaks and optimizations can be achieved with a script (example from my system), but it's full of hacks.




Background
This project stemmed from a real business scenario, but led me to experiments on one of my private systems. The goal was to create a VM on Proxmox which is stable enough to handle VR gaming, and which could later be transplanted to a commercial arcade. This goal is very achievable on hypervisors like VMware and, with some tinkering, even on KVM-based solutions that use libvirt. It proved quite hard on Proxmox PVE though.


Hardware
The test system I used:
  • CPU: AMD Ryzen 5900x
    • UMA RAM (unified memory controller)
    • 2x NUMA domains for L3 (one per CCX)
    • 2x CCDs with 6 cores each
    • It nicely mimics multi-CPU and current enterprise-grade servers
  • RAM: 128GB (4x32GB, ECC)
  • Storage:
    • Tested two local configurations
    • 1) SATA SSD dedicated for the VM
    • 2) NVMe M2 SSD dedicated for the VM
  • PCIe devices:
    • 10Gb NIC managed by PVE
    • NVidia GPU intended for passthru
  • Specific BIOS config:
    • Skipping over obvious things like IOMMU etc
    • NUMA: NPS2
    • L3-as-NUMA enabled

The Problem
By default, creating a VM puts it in a shared resource pool, which causes latency spikes and unpredictable behavior, making the VM unsuitable for running games, esp. VR ones. Proxmox PVE doesn't expose enough configuration options to properly configure the VM. Below I summarize which options are available and which are missing.

Available:
  • cores: number of threads exposed to the VM
  • cpuunits: systemd policy on relative CPU sharing
  • memory: total memory for the VM
  • hugepages: enables hugepages and specifies the HP size used; not available in GUI
  • NUMA
    • numa: 1: enables NUMA (req. for hugepages)
    • numa0: ensure proper memory pool is used; not available in GUI
  • affinity: pin all QEMU processes to a group of logical CPUs (more on why this is a big problem later)
Available via args:
  • SMP topology
  • CPU flags
    • PVE exposes only some flags in the UI (e.g. AES) but a lot of them are missing
    • Advanced CPU options need to be passed manually in the args: -cpu...
    • One of the options which should probably be exposed is L3 host topology passthru
Missing options:
  • vCPU pinning: without it, vCPUs wander between host threads, causing L3 eviction and large, random latency spikes
  • SMP-aware pinning: without it, guest scheduling decisions become effectively meaningless
  • I/O thread control: the lack of pinning/isolation from vCPUs creates a lot of context switching, host scheduling bottlenecks, and massive L3 eviction
  • Guaranteed CPU resources isolation
  • Pre-HP compacting: VMs will often fail to start without it if not started on boot
The missing options are ostensibly easy to achieve, but as with everything it's a classic 80/20. With Proxmox's move to cgroups v2 the problem became even more complex. As PVE doesn't use libvirt, the solutions aren't available out of the box.


Why SMP matters?
Proxmox PVE seems to completely gloss over the existence of CPU topology. In my opinion, this could've been a good-enough approach 10-15 years ago, but not anymore for three reasons:
  1. SMT/HT: "fake" cores aren't equal to real ones. While AMD Zen scales better, we cannot treat SMT logical CPUs as equal to real cores. For that reason both the Linux and Windows schedulers distribute loads with that information taken into account. For example, compute-intensive tasks with low iowaits are usually scheduled on real cores first
  2. Chiplets: depending on the configuration latencies between different cores aren't equal. This phenomenon stems from separate L3 for different core groups (CCDs) that is shared between CCXes, as bandwidth between chiplets is smaller in comparison to on-die bandwidth. This gets even more complex with Zen2 where L3 is exclusive per CCX. ...and this isn't all, the headache intensifies with X3D for Zen3 and Zen4 where L3 size is different per CCD.
    The net effect is that inadvertently moving/rescheduling a task from one CCD to another causes a complete loss of L3, triggering a large transfer via IoD from RAM. In practice such scheduling mistakes can cause latency spikes measured in hundreds of milliseconds.
  3. Heterogeneous CPUs: the simplest example is Intel's E-cores & P-cores, but this gets way more complex on ARM (which Proxmox seems to be exploring). Intel's implementation in addition falls into the HT pitfall, as E-cores don't support HT. With such CPUs blind semi-random scheduling isn't an option, and MS struggling for months with new Intel chips should be enough of a proof ;)
SMP example of an AMD Ryzen 5900x
To keep this post educational let's look at the lscpu -e output, which is often seen as scary.
Code:
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ    MINMHZ
  0    0      0    0 0:0:0:0          yes 5073.0459 2200.0000
  1    0      0    1 1:1:1:0          yes 5073.0459 2200.0000
  2    0      0    2 2:2:2:0          yes 5073.0459 2200.0000
  3    0      0    3 3:3:3:0          yes 5073.0459 2200.0000
  4    0      0    4 4:4:4:0          yes 5073.0459 2200.0000
  5    0      0    5 5:5:5:0          yes 5073.0459 2200.0000
  6    1      0    6 6:6:6:1          yes 5073.0459 2200.0000
  7    1      0    7 7:7:7:1          yes 5073.0459 2200.0000
  8    1      0    8 8:8:8:1          yes 5073.0459 2200.0000
  9    1      0    9 9:9:9:1          yes 5073.0459 2200.0000
 10    1      0   10 10:10:10:1       yes 5073.0459 2200.0000
 11    1      0   11 11:11:11:1       yes 5073.0459 2200.0000
 12    0      0    0 0:0:0:0          yes 5073.0459 2200.0000
 13    0      0    1 1:1:1:0          yes 5073.0459 2200.0000
 14    0      0    2 2:2:2:0          yes 5073.0459 2200.0000
 15    0      0    3 3:3:3:0          yes 5073.0459 2200.0000
 16    0      0    4 4:4:4:0          yes 5073.0459 2200.0000
 17    0      0    5 5:5:5:0          yes 5073.0459 2200.0000
 18    1      0    6 6:6:6:1          yes 5073.0459 2200.0000
 19    1      0    7 7:7:7:1          yes 5073.0459 2200.0000
 20    1      0    8 8:8:8:1          yes 5073.0459 2200.0000
 21    1      0    9 9:9:9:1          yes 5073.0459 2200.0000
 22    1      0   10 10:10:10:1       yes 5073.0459 2200.0000
 23    1      0   11 11:11:11:1       yes 5073.0459 2200.0000

A few things can be observed:
  • CPU contains 24 usable threads ("CPU" column)
  • CPU contains 12 physical cores ("CORE" column)
    • 2 threads on each core (=SMT-enabled CPU)
    • Each core has exactly 2 threads (=homogeneous CPU, i.e. no e/p-cores)
    • On this particular CPU the topology first lists all real threads and then all SMT ones, e.g. the first core (CORE #0) exposes threads (CPU) 0 and 12. On other CPUs the topology often contains interleaved threads (i.e. 1st core = threads 0 and 1, 2nd core = threads 2 and 3, etc).
  • Each core has separate L1d, L1i, and L2 caches
    • This is because each of these columns contains a unique number per core
    • For example core 4 thread ("CPU") 4 has L1/L2 #4
  • It's a NUMA system with respect to (at least) L3 caches
  • On non-NUMA systems a simple list of sibling threads (i.e. real core + HT "core") can also be obtained using cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | sort
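The sibling lookup from the last bullet can be wrapped in a small helper that prints one line per physical core. This is only a sketch: the function name list_core_siblings and its optional root-directory argument (handy for testing against a copy of sysfs) are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print one line per physical core listing its sibling host threads.
# list_core_siblings and the optional sysfs root argument are illustrative names.
list_core_siblings() {
    local root="${1:-/sys/devices/system/cpu}"
    # Every sibling reports the same list, so de-duplicate first,
    # then order the groups by their first CPU number.
    cat "$root"/cpu[0-9]*/topology/thread_siblings_list 2>/dev/null \
        | sort -u \
        | sort -t, -k1,1n
}

# Usage: list_core_siblings
# On the 5900x from this post the first line would be "0,12".
```

Each output line is a ready-made pair for pinning "thread 1 of core N" and "thread 2 of core N" of the guest.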

Conclusion: VMs that are performance (esp. latency) sensitive must be configured with SMP and correct mapping of vCPUs. The real L3s topology should be exposed to the guest. There's practically no way around that if CPUs aren't uniform physically anymore.

How to do it in Proxmox?
As mentioned before args can be used in the VM config. Example: args: -cpu 'host,topoext=on' -smp '12,sockets=1,cores=6,threads=2,maxcpus=12' will create a single six-core CPU, with two threads per core, and SMT/HT enabled. Note that this is effectively meaningless if pinning (next section) isn't used.


Current CPU affinity isn't enough
I hope the section above sufficiently makes a point about the importance of topology awareness. Proxmox PVE just got a shiny new affinity option. However, while it helps, it is much too simple. Each VM naturally runs as a collection of threads:

Code:
# ps -T -p $(cat /run/qemu-server/100.pid)
    PID    SPID TTY          TIME CMD
   7022    7022 ?        00:37:39 kvm
   7022    7023 ?        00:00:00 call_rcu
   7022    7152 ?        06:36:15 CPU 0/KVM
   7022    7153 ?        07:05:50 CPU 1/KVM
   7022    7154 ?        06:39:00 CPU 2/KVM
   7022    7155 ?        06:31:37 CPU 3/KVM
   7022    7156 ?        06:36:18 CPU 4/KVM
   7022    7157 ?        06:35:13 CPU 5/KVM
   7022    7158 ?        06:31:41 CPU 6/KVM
   7022    7159 ?        06:31:21 CPU 7/KVM
   7022    7290 ?        00:00:00 vnc_worker
   7022    7918 ?        00:00:40 worker
   7022    7922 ?        00:00:40 worker
   //... more worker processes

Currently, Proxmox's affinity option simply pins the main kvm process along with all its threads to a group of host threads. This solves the issue of rescheduling across CCDs/heterogeneous groups, but leaves several issues on the table:
  • IO thread(s) are pinned to the same set of threads as vCPUs
  • Emulator threads also share threads with vCPUs
  • All vCPUs are treated as equal, with no affinity to real vs. SMT/HT threads
  • vCPUs are rescheduled by the host willy-nilly between different host threads
Effectively, the issues above not only negate the work the guest scheduler tries to do, but also couple I/O workload to compute activity. This is pretty bad for applications like games, which nowadays heavily utilize asset streaming and decompression. Red Hat also published a writeup on that in non-gaming scenarios. As nothing is created in a vacuum, I looked at how others do that.

LibVirt via CLI
The default XML configuration allows for easy vCPU<=>host thread pinning. Combined with SMP definition it is extremely easy to properly recreate host topology for the guest. Below is an example for my CPU topology, assuming that the VM is meant to stay on a 2nd CCD, with IO thread(s) and emulator processes directed to stay on 1st CCD.
XML:
<domain>
  ...
  <cputune>
    <vcpupin vcpu="0" cpuset="6"/>
    <vcpupin vcpu="1" cpuset="18"/>
    <vcpupin vcpu="2" cpuset="7"/>
    <vcpupin vcpu="3" cpuset="19"/>
    <emulatorpin cpuset="0-5,12-17"/>
    <iothreadpin iothread="1" cpuset="0-5,12-17"/>
  </cputune>
  ...
</domain>

GUI solutions
unRAID contains basic hypervisor capabilities built on top of libvirt. Its approach to pinning seems similar to PVE's, i.e. pinning a whole machine to a group of cores, but respecting SMP. Since I don't use unRAID I cannot fully confirm that. VirtManager, which is a de-facto standard for running VMs on a desktop system with libvirt backing, offers per-vCPU pinning and is also able to generate the pinning config from NUMA (see sect. 3.3 in docs).

How to do it in Proxmox PVE now?
This isn't as easy as it looks at first. In this section I will be referencing a semi-universal hook script, which I share with the community on Gist. There are 3 levels of complexity here:
  1. Pinning vCPUs to correct threads
  2. Pinning all non-vCPU processes to non-vCPU threads
  3. Pinning emulator & io-threads to correct threads
Doing #1 is quite simple in principle. The /run/qemu-server/<VM_ID>.pid contains the PID of the main QEMU process for the VM. Threads can be found by looking up /proc/<VM_PID>/task/*/comm for the name of the thread. Each vCPU thread will be named "CPU #/KVM". QEMU assigns threads on cores in a sequential fashion [code citation needed here ;)]. I.e.: when emulating a 6c12t CPU - "CPU 0/KVM" = 1st thread of 1st core, "CPU 1/KVM" = 2nd thread of 1st core, "CPU 2/KVM" = 1st thread of 2nd core etc.
Pinning these is as simple as using taskset --cpu-list --pid <HOST_THREAD_NUM> <vCPU_Thread_PID>. In my script mentioned above there's a handy function called pinVCpu, accepting two arguments - vCPU# and host thread #.
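To make step #1 concrete, here is a minimal sketch of resolving a vCPU thread by its name. It is not the pinVCpu function from my script; the name find_vcpu_tid and the optional proc_root argument (only there so it can be tried against a fake directory tree) are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: resolve the host thread ID (TID) of a given vCPU by its thread name.
# find_vcpu_tid and the proc_root argument are illustrative, not from any tool.
find_vcpu_tid() {
    local vm_pid="$1" vcpu="$2" proc_root="${3:-/proc}" comm
    for comm in "$proc_root/$vm_pid"/task/*/comm; do
        [ -r "$comm" ] || continue
        # vCPU threads are named "CPU <n>/KVM"
        if [ "$(cat "$comm")" = "CPU ${vcpu}/KVM" ]; then
            basename "$(dirname "$comm")"
            return 0
        fi
    done
    return 1
}

# Usage (as root, VM 100 running): pin vCPU 0 to host thread 6
#   tid=$(find_vcpu_tid "$(cat /run/qemu-server/100.pid)" 0)
#   taskset --cpu-list --pid 6 "$tid"
```

With the topology from this post, vCPU 0 and vCPU 1 would go to host threads 6 and 18 (both threads of one physical core), vCPU 2 and 3 to 7 and 19, and so on.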

Doing #2 is moderately easy. In principle, every thread which is not a vCPU thread (see above) should be pinned to a host thread or group that has no vCPU pinned to it. See the pinNonVCpuTasks function in my script.
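A rough sketch of step #2, under the assumption that anything not named "CPU <n>/KVM" is fair game. The helper list_non_vcpu_tids is a made-up name and a simplification, not the pinNonVCpuTasks function itself.

```shell
#!/usr/bin/env bash
# Sketch: list TIDs of all QEMU threads that are NOT vCPU threads.
# list_non_vcpu_tids and the proc_root argument are illustrative names.
list_non_vcpu_tids() {
    local vm_pid="$1" proc_root="${2:-/proc}" comm name
    for comm in "$proc_root/$vm_pid"/task/*/comm; do
        [ -r "$comm" ] || continue
        name=$(cat "$comm")
        case "$name" in
            "CPU "*/KVM) ;;                        # vCPU thread: skip
            *) basename "$(dirname "$comm")" ;;    # emulator, workers, etc.
        esac
    done
}

# Usage (as root): keep everything else on the 1st CCD
#   pid=$(cat /run/qemu-server/100.pid)
#   for tid in $(list_non_vcpu_tids "$pid"); do
#       taskset --cpu-list --pid 0-5,12-17 "$tid" || true
#   done
```

Note that worker threads come and go, so a one-shot loop like this only covers threads existing at the time it runs.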

Doing #3 is quite hard in bash. Proxmox QEMU doesn't populate thread names for iothreads (bug?), so the only way to do it is to reach into the QEMU monitor. Due to no measurable benefit in my use-case I didn't implement that in my script. However, for Proxmox, which already communicates with the monitor, that would be trivial.

#cont. below#
 
VM Resources Isolation
This is by far the hardest part to achieve, and the more I dug into it the harder it seemed. There are conceptually five categories of isolation, in order of increasing difficulty:
  1. Isolating vCPUs from other VM's loads (this is solved by CPU pinning, see above)
  2. Preventing IRQ handling on vCPU threads
  3. Isolating VM from the host userspace processes
  4. Isolating VM from the host kernel threads (so-called "kthreads")
  5. Isolating VM from other VMs on the host
Before describing any methods here, I feel I need to address an elephant in the room. There are tutorials suggesting to use taskset on "all but VM processes" recommending to set pinning of all other processes to something else than the VM. This method is flawed and should NOT be used, as it doesn't account for any threads & processes created while the VM is running!
All examples here use a 2nd CCD of 5900x and there's a reason for that described below.


2) Preventing IRQ handling on vCPU threads
Interrupts and their handling can be a source of frustration and a lot of inconsistency in a system's performance. By default Linux tries to balance most interrupt handling over all CPUs in the system, with some smartness included. You can easily see that by using watch -n0.5 -w -d cat /proc/interrupts while e.g. performing heavy network transfers.

Usually, with latency-sensitive workloads like gaming on a VM, the aim is to prevent interrupt handling on the same CPU thread as the non-interrupt-related workload. It's not hard to imagine network card interrupts flooding a CPU used for vCPU handling. There are two common ways of implementing isolation:

  1. Setting CPUs bitmask
    1. Never triggered: /proc/irq/default_smp_affinity
    2. Previously triggered: /proc/irq/<num>/smp_affinity
  2. Setting CPUs list via /proc/irq/<num>/smp_affinity_list

The first method has the distinct advantage of offering a way to set isolated CPUs for interrupts that weren't triggered (yet). However, this is a very rare case, unless you're modifying affinity early during the boot process. The second method doesn't require a calculator and is more human-friendly ;)
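If you do need the bitmask form, the "calculator" part is easy to script. A minimal sketch, assuming fewer than 64 host threads (bash integers are 64-bit, and on larger systems the kernel expects the mask split into comma-separated 32-bit groups, which this does not handle). The name cpulist_to_mask is made up.

```shell
#!/usr/bin/env bash
# Sketch: convert a CPU list such as "0-5,12-17" into a hex bitmask.
# cpulist_to_mask is an illustrative name; assumes CPU numbers below 63.
cpulist_to_mask() {
    local list="$1" mask=0 part lo hi cpu
    local IFS=,
    for part in $list; do
        lo="${part%-*}"     # "0-5" -> 0, "7" -> 7
        hi="${part#*-}"     # "0-5" -> 5, "7" -> 7
        for ((cpu = lo; cpu <= hi; cpu++)); do
            mask=$(( mask | (1 << cpu) ))
        done
    done
    printf '%x\n' "$mask"
}

# Usage:
#   cpulist_to_mask 0-5,12-17    # 1st CCD of the 5900x -> 3f03f
#   cpulist_to_mask 6-11,18-23   # 2nd CCD              -> fc0fc0
```

The result for the 1st CCD is what you would write to /proc/irq/default_smp_affinity to keep not-yet-triggered interrupts off the VM cores.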

To achieve IRQ isolation manually you need to write the desired CPU list to every interrupt (e.g. echo '0-5,12-17' > /proc/irq/1234/smp_affinity_list). Similarly to kernel threads (discussed below), some interrupts cannot be moved to a different CPU. This is normal and expected: attempting to move such a special interrupt handler raises an I/O error, which can be ignored in this case. My script mentioned above contains a premade function pinIrqs to handle that automatically. This SHOULD be reversed after VM is stopped.
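The manual procedure boils down to a loop that tolerates failures. The sketch below is a simplified stand-in for pinIrqs, not a copy of it; the name pin_all_irqs and the optional irq_root argument are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: write a CPU list to every IRQ's smp_affinity_list, ignoring
# interrupts that refuse to move. pin_all_irqs is an illustrative name.
pin_all_irqs() {
    local cpus="$1" irq_root="${2:-/proc/irq}" f moved=0 skipped=0
    for f in "$irq_root"/*/smp_affinity_list; do
        [ -e "$f" ] || continue
        # Non-movable interrupts fail the write (typically with an I/O error).
        if { echo "$cpus" > "$f"; } 2>/dev/null; then
            moved=$((moved + 1))
        else
            skipped=$((skipped + 1))
        fi
    done
    echo "moved=$moved skipped=$skipped"
}

# Usage (as root):
#   pin_all_irqs 0-5,12-17    # VM running: keep IRQs on the 1st CCD
#   pin_all_irqs 0-23         # VM stopped: allow all host threads again
```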

This is just scratching the surface. The art of optimizing IRQs could certainly be a book in itself, with just simple network cases requiring long articles. Fortunately the method above is good enough for the purpose here.


3) Isolating VM from the host userspace processes
For a long time the (now long-abandoned) cset tool was used to do this, but it doesn't work in new Proxmox versions. Fortunately, the same result is quite simple to achieve using systemd, by executing the following commands after starting your VM (e.g. in "post-start" of your hook script):

Code:
systemctl set-property --runtime -- user.slice AllowedCPUs=0-5,12-17
systemctl set-property --runtime -- system.slice AllowedCPUs=0-5,12-17
systemctl set-property --runtime -- init.scope AllowedCPUs=0-5,12-17

This SHOULD be reversed after VM is stopped. Simply use AllowedCPUs=0-23 (changing 23 to last thread number of course) in "post-stop" of your VM hook script. This is why the above command should be executed in "post-start" and not in "pre-start", as "*-stop" are not executed if VM start fails.
A more complete solution is available in my script mentioned above - look for setHostAllowedCpus function and its usage.
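For illustration, the three commands and their reversal can be folded into two tiny helpers. This is a sketch and not the setHostAllowedCpus function from my script; the names set_host_allowed_cpus and all_host_cpus, as well as the SYSTEMCTL override used for dry runs, are assumptions of this example.

```shell
#!/usr/bin/env bash
# Sketch: restrict (or restore) the CPUs available to host userspace.
# Function names and the SYSTEMCTL override are illustrative.
# Set SYSTEMCTL=echo to print the commands instead of running them.
set_host_allowed_cpus() {
    local cpus="$1" unit
    for unit in user.slice system.slice init.scope; do
        "${SYSTEMCTL:-systemctl}" set-property --runtime -- \
            "$unit" "AllowedCPUs=$cpus" || return 1
    done
}

# Builds "0-<last thread>" from the number of installed host threads.
all_host_cpus() {
    echo "0-$(( $(nproc --all) - 1 ))"
}

# Usage inside a PVE hook script (called as: <script> <vmid> <phase>):
#   case "$2" in
#       post-start) set_host_allowed_cpus 0-5,12-17 ;;
#       post-stop)  set_host_allowed_cpus "$(all_host_cpus)" ;;
#   esac
```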


4) Isolating VM from the host kernel threads
As mentioned above, my test VM is configured to pin to the 2nd CCD. The reason for that was partially due to kthreads isolation - some kthreads cannot be dynamically moved off the 1st thread of 1st core of 1st CPU. The easiest way to move kthreads off cores which are meant to be used for VM only is to turn them off and on, after starting the VM, but before the VM is pinned to these cores. While it sounds like a crazy thing to do, the Linux kernel is perfectly capable of putting cores into offline and online states dynamically.

In essence you should do:
Code:
# Take every host thread reserved for the VM offline (here: 2nd CCD, 6-11 and 18-23)...
for cpu in {6..11} {18..23}; do
    echo 0 > "/sys/devices/system/cpu/cpu${cpu}/online"
done
# ...and bring them back online
for cpu in {6..11} {18..23}; do
    echo 1 > "/sys/devices/system/cpu/cpu${cpu}/online"
done

This effectively moves busy kthreads (e.g. those dealing with filesystems) to other cores. This, however, doesn't prevent later reschedules of currently inactive kthreads to cores running vCPUs. This can be solved by pinning kthreads, similarly to user-space processes. There are two gotchas while doing that: 1) kthreads are not part of any cgroup, and 2) some may not be moved from their dedicated CPUs if they perform CPU-specific work. Fortunately the kernel is smart enough to simply not let user-space break kthreads, so in short, to pin kthreads you should:

  1. Find all kthreads
    1. You can check the PF_KTHREAD mask on all processes in the system
    2. If you're writing a script do not parse ps to find [ as this will lead to false positives!
    3. Checking for parent PID of 2 also isn't accurate, despite being suggested in some sources, as this is just a convention that doesn't need to be followed by all processes.
  2. Pin them to desired CPUs
    1. You can use standard taskset
    2. You must handle/ignore errors: many kthreads will not be movable
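The PF_KTHREAD check from the list above can be done straight from /proc/<pid>/stat, where the task flags are the 9th field. A minimal sketch (not the pinKthreads function from my script; is_kthread is an illustrative name):

```shell
#!/usr/bin/env bash
# Sketch: detect a kernel thread via the PF_KTHREAD bit (0x00200000)
# in the flags field of a /proc/<pid>/stat file. is_kthread is an
# illustrative name.
is_kthread() {
    local stat_file="$1" stat rest flags
    stat=$(cat "$stat_file" 2>/dev/null) || return 1
    # The 2nd field (comm) may contain spaces or parentheses,
    # so only parse what follows the LAST ") ".
    rest="${stat##*) }"
    # Remaining fields: state ppid pgrp session tty_nr tpgid flags ...
    read -r _ _ _ _ _ _ flags _ <<< "$rest"
    [ -n "$flags" ] && (( flags & 0x00200000 ))
}

# Usage (as root): pin every movable kthread to the 1st CCD
#   for s in /proc/[0-9]*/stat; do
#       if is_kthread "$s"; then
#           pid=${s#/proc/}; pid=${pid%/stat}
#           taskset --cpu-list --pid 0-5,12-17 "$pid" >/dev/null 2>&1 || true
#       fi
#   done
```

The `|| true` is the "handle/ignore errors" part: per-CPU kthreads will simply refuse to move.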

Even the combination of methods above doesn't account for new kthreads created after the VM starts. However, most of the time this shouldn't be a concern. There are additional kernel tunables for that, but this is out of scope even for this scenario. There doesn't seem to be a way to dynamically enforce affinity of new kthreads, as kthreads + cgroups still have some gotchas.

This SHOULD be reversed after VM is stopped. In my script mentioned above you can simply use setCpuStateRange function to off/on-line CPU cores on the fly, and pinKthreads function to pin all kthreads.


5) Isolating VM from other VMs on the host
This was by far the hardest thing to do. All VMs on Proxmox PVE run in a hardcoded qemu.slice cgroups branch. One cannot simply set "qemu.slice" to be bound to the 1st CCD and then pin vCPUs of one VM to the 2nd CCD. This is because such a configuration is illegal: a child group's AllowedCPUs must be a subset of its parent's. To achieve this, the scope of the VM being optimized must be moved to a separate root slice. However, this is not supported by systemd, as confirmed by one of the systemd authors, but cgroups can be used directly (albeit without warranties) to do that. This step shouldn't be followed blindly, as it slightly breaks accounting and goes around systemd. For me this is a totally acceptable tradeoff though, until Proxmox supports isolation natively:

  1. Create separate root slice: mkdir /sys/fs/cgroup/qemu-decoupled.slice (once per host reboot)
  2. Create scope for a VM: mkdir /sys/fs/cgroup/qemu-decoupled.slice/<VM_ID>.scope
  3. Migrate VM to a new scope: echo $(cat /run/qemu-server/<VM_ID>.pid) > /sys/fs/cgroup/qemu-decoupled.slice/<VM_ID>.scope/cgroup.procs
  4. Set all other VMs to only use 1st CCD: systemctl set-property --runtime -- qemu.slice AllowedCPUs=0-5,12-17 (once per host reboot)
  5. Now pin, as described in the CPU pinning section, your VM processes to the 2nd CCD/other cores
Note: the newly created qemu-decoupled.slice doesn't have any AllowedCPUs set, since it's assumed that every scope/VM inside of it will be pinned anyway.
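Steps 1-3 above can be sketched as a single helper. The function name decouple_vm and its optional cgroup-root and pid-file arguments are made up for this example (they default to the real paths, and exist so the logic can be tried against a scratch directory first).

```shell
#!/usr/bin/env bash
# Sketch of steps 1-3: create the decoupled slice/scope and move the VM's
# main process into it. decouple_vm and its optional arguments are illustrative.
decouple_vm() {
    local vmid="$1"
    local cg_root="${2:-/sys/fs/cgroup}"
    local pid_file="${3:-/run/qemu-server/${vmid}.pid}"
    local scope="$cg_root/qemu-decoupled.slice/${vmid}.scope"
    local pid
    pid=$(cat "$pid_file") || return 1
    mkdir -p "$scope" || return 1            # steps 1 and 2
    echo "$pid" > "$scope/cgroup.procs"      # step 3
}

# Usage (as root, VM 100 running):
#   decouple_vm 100
#   systemctl set-property --runtime -- qemu.slice AllowedCPUs=0-5,12-17   # step 4
```

Writing a PID to cgroup.procs migrates the whole process including all of its threads, so the vCPU threads follow along.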


Memory Optimization
Proxmox already contains most of the necessary options, but the docs are lacking. I've often seen users frustrated with SHP configuration, simply ignoring SHPs or relying on THPs (which are less performant) because their VMs aren't starting. Huge pages aren't hard but have a few gotchas, which I believe the Proxmox docs could explain better; PVE could also compact memory on VM pre-start to increase the likelihood of the VM starting.
To fully achieve this I had to manually ensure the following:
  1. Enable kernel huge pages
    1. Enable 1GB HPs: add hugepagesz=1G kernel parameter
    2. Enable 2MB HPs: add hugepagesz=2M kernel parameter
    3. Set 2MB HPs as default: default_hugepagesz=2M kernel param
    4. The full addition will be e.g. hugepagesz=1G hugepagesz=2M default_hugepagesz=2M
  2. Configure VM NUMA
    1. numa: 1 must be enabled in VM config
    2. numa0: ... option must be properly configured. For example my test VM uses numa0: cpus=0-11,memory=24576,hostnodes=1,policy=preferred. Normally the VM will not start even if there's plenty of memory if it tries to allocate it on host node which doesn't have enough memory. This can be checked using numactl -H. For example right now my system has 350MB of free memory on node0 and 27GB free on node1.
  3. Compact caches before starting the VM
    1. Proxmox doesn't do that automatically (maybe it should?), but after the system has been running for some time huge pages will be used up. One way to combat that is to set default_hugepagesz=2M (as described in #1). However, sometimes this isn't enough.
    2. Executing the following commands before VM starts (e.g. in "pre-start" phase of hook script) almost always helps:
      1. sync (flushes host I/O buffers from memory to disks)
      2. echo 3 > /proc/sys/vm/drop_caches (see kernel docs on drop_caches)
      3. echo 1 > /proc/sys/vm/compact_memory (tries to consolidate fragmented memory so that big contiguous regions become available)
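The three pre-start commands above fit naturally into one function called from the "pre-start" phase. A minimal sketch; the name prepare_memory_for_hugepages and the optional directory argument (for trying it against a scratch directory) are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: flush, drop caches and compact memory before a huge-page VM starts.
# prepare_memory_for_hugepages and its optional argument are illustrative.
prepare_memory_for_hugepages() {
    local vm_sysctl="${1:-/proc/sys/vm}"
    sync                                             # flush dirty buffers to disk
    echo 3 > "$vm_sysctl/drop_caches" || return 1    # drop page cache, dentries, inodes
    echo 1 > "$vm_sysctl/compact_memory" || return 1 # defragment free memory
}

# Usage inside a PVE hook script (called as: <script> <vmid> <phase>):
#   [ "$2" = "pre-start" ] && prepare_memory_for_hugepages
```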
Troubleshooting & limitations
When my VM doesn't magically start I usually do watch -n 0.5 'numactl -H ; grep -i huge /proc/meminfo'. It would be great if this was brought to Proxmox PVE UI. This will show allocations per NUMA and live preview where the VM tries to get the memory while starting and how HPs are reserved.
As for huge-pages, even gray-beards weren't able to tell me about one gotcha that wasted hours of my time: number of fast HPs seems to be limited by the CPU. So if magically a second VM isn't starting with huge pages your memory is either too fragmented or you're going over the HP TLB limit. On my system with 128GB of RAM I can only use 64GB with 1G pages:

Code:
# cpuid -1 | grep -A 4 '1G pages'
   L1 TLB information: 1G pages (0x80000019/eax):
      instruction # entries     = 0x40 (64)
      instruction associativity = full (15)
      data # entries            = 0x40 (64)                             <==========
      data associativity        = full (15)
   L2 TLB information: 1G pages (0x80000019/ebx):
      instruction # entries     = 0x0 (0)
      instruction associativity = L2 off (0)
      data # entries            = 0x40 (64)                             <==========
      data associativity        = full (15)

In theory using more HPs than TLB space should work perfectly fine, just slower. However, on my two separate AMD systems this doesn't work. To avoid fragmentation and other issues with HPs, the space can be pre-reserved. See "Troubleshooting" section at the end of this tutorial.


Other optimizations to consider
There are some small random things to consider, which aren't specific to just VM tuning:

CPU governor
By default, especially when running a homelab, people usually configure the CPU to run ondemand governor. It is usually configured using CPUFreqUtils. For latency-sensitive tasks it's recommended to use performance governor. You can set it per-core dynamically by doing echo performance > /sys/devices/system/cpu/cpu6/cpufreq/scaling_governor.
After the VM is stopped, the default governor, as set in /etc/default/cpufrequtils, can be restored with a simple service cpufrequtils restart.
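Switching the governor for all VM cores is a simple loop. A sketch, with the function name set_governor and the CPU_SYSFS_ROOT override being assumptions of this example:

```shell
#!/usr/bin/env bash
# Sketch: set a cpufreq governor on a list of host threads.
# set_governor and the CPU_SYSFS_ROOT override are illustrative names.
set_governor() {
    local governor="$1" cpu
    local cpu_root="${CPU_SYSFS_ROOT:-/sys/devices/system/cpu}"
    shift
    for cpu in "$@"; do
        echo "$governor" > "$cpu_root/cpu${cpu}/cpufreq/scaling_governor" || return 1
    done
}

# Usage (as root): 2nd CCD of the 5900x while the gaming VM runs
#   set_governor performance {6..11} {18..23}
```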

Tune-down Linux virtual memory stats
Linux internally generates some stats to manage virtual memory better. It does this pretty often (once a second). This can sometimes cause jitter in fast-paced games (esp. VR). This can be changed using sysctl vm.stat_interval=120 for the duration of the "gaming VM" running. It's recommended to restore that to 1 after VM is shut down.


#cont. below#
 
Use proper storage
Using a VirtIO SCSI single controller with a disk backed by a physical SATA drive is pretty much required to achieve predictable performance. Windows (at least 10) doesn't play well with full SCSI LUN passthru (i.e. with full SMART etc visible in the guest), but works with block device emulation described in the wiki.
The best results, however, can be achieved by passing a full M.2 NVMe SSD using PCIe passthrough.


Use idle Windows VM
Modern GPUs are extremely efficient in idle, but there's a catch: they need to be put into the idle state. My test 4070 Ti uses 2W(!) on idle. Neither the vfio-pci driver nor the Linux NVidia drivers are particularly good at keeping the card in a low-power state when the gaming VM is off. Simply create a separate Windows VM (1 core, 1GB ram, 20GB disk) with just the drivers installed and no auto-login. Then any time you turn off your gaming VM, start the "idle VM". It uses practically no CPU and realistically only around 70-80MB of memory. Starting/stopping of that VM can be easily automated with a hook script.



Ok, how can we get that into Proxmox? ;)
I am a man of solutions, not problems. While I don't know PVE code by heart I looked around how optimizations can be brought to PVE core.

SMP
Proxmox should definitely contain an option to specify sockets/cores/threads numbers. As this is an already available option in QEMU, I believe it should be trivial from the perspective of both .conf and GUI.

Proper CPU pinning
In the simplest version Proxmox should support:
  • Pinning vCPU to host thread 1:1
  • Pinning all IO threads to a group of host threads
  • Pinning "all other" / emulator threads to a group of host threads
As kind-of "v2" I would love to see granular pinning of IO threads and emulator threads 1:1, as it can have measurable benefits.

QEMU monitor contains information about some threads:
Code:
qm> info cpus
* CPU #0: thread_id=7152
  CPU #1: thread_id=7153
  CPU #2: thread_id=7154
  CPU #3: thread_id=7155
  CPU #4: thread_id=7156
  CPU #5: thread_id=7157
  CPU #6: thread_id=7158
  CPU #7: thread_id=7159

qm> info iothreads
iothread-virtio0:
  thread_id=108991
  poll-max-ns=32768
  poll-grow=0
  poll-shrink=0
  aio-max-batch=0

I don't see any way to get information about other emulator threads though. To avoid playing with manual tasksets left and right, Proxmox should probably just split threads properly into cgroup slices like libvirt does.
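To show how little is needed on top of the monitor, here is a sketch that turns the info cpus output shown above into "vCPU TID" pairs. How the monitor text gets captured is left open; the function name parse_info_cpus is made up.

```shell
#!/usr/bin/env bash
# Sketch: read "info cpus" monitor output on stdin and print one
# "<vcpu> <host thread id>" pair per line. parse_info_cpus is an
# illustrative name.
parse_info_cpus() {
    sed -n 's/^[* ]*CPU #\([0-9][0-9]*\): thread_id=\([0-9][0-9]*\).*$/\1 \2/p'
}

# Usage, with the monitor output saved to a file first:
#   parse_info_cpus < info_cpus.txt | while read -r vcpu tid; do
#       echo "vCPU $vcpu runs as host thread $tid"
#   done
```

Each pair can be fed directly into taskset, which removes the need to match thread names under /proc.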


Huge Pages Configuration
As I described above, the current HP config isn't very straightforward. Given the platform complexity it probably cannot be simplified, but bringing the NUMA config to the GUI would be a welcome change. The wiki is sparse on the subject and the admin guide doesn't really give much context either.

Isolation
I'm not sure what isolation can be brought into Proxmox natively, and how. For sure, making the cgroups slice configurable and maybe more granular would be a huge step forward.
As an example, unRAID offers some form of isolation between VMs but not a full-fledged solution. I believe that hiding at least some of the complexity would be great. I think it could be configured as "reserved resources" on the node level, where CPU threads can be configured. Threads configured as reserved would, under the hood, be isolated to the extent specified (checkbox table?), so that neither the host nor other VMs can touch them if desired. I think this goes beyond just reserving resources for one VM: it would also allow ensuring that e.g. no VM can use one of the CPU cores, so that it's always available for the host to prevent lockups.
The current script-based approach works-ish. However, it's a PITA to maintain and configure on multiple nodes. In addition, being able to visually see NUMA nodes, the split between HT/normal cores, and L3 caches would be a godsend. Currently these things are scattered all over multiple Linux tools and not really exposed by Proxmox.




Troubleshooting

VM doesn't start due to memory allocation failure
If you followed this tutorial you're most likely using huge pages. In some cases your VM may not start, with an error "hugepage allocation failed" or similar, even if the system's memory pressure is low and the system has plenty of free RAM. Without going into massive detail, this is usually caused by memory fragmentation. It's a problem that kernel developers have been trying to solve for at least 20 years. After the system has been running for some time it may not be possible to reserve multiple large chunks of memory (multiple 1GB contiguous chunks!) even if cache compacting is performed.

In such cases the method of last resort is to reserve the huge pages early on boot. This requires adding at least two parameters to the kernel boot line: hugepagesz=1G and hugepages=16. This will reserve sixteen blocks of 1G. While these huge pages can be claimed by things other than VMs, it shouldn't happen in PVE, as host shouldn't run processes requesting memory in such huge chunks.
To make the system perform optimally you can specify multiple types of HPs reserved on boot, as well as default size. For example my 128GB Proxmox host uses hugepagesz=1G hugepages=96 hugepagesz=2M default_hugepagesz=2M. The Linux documentation contains a pretty good explanation of tuning of these parameters.
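After rebooting with these parameters it's worth checking that the pages were actually reserved on the NUMA node your VM uses. A small sketch reading the per-node counters from sysfs; the function name hugepages_per_node and its optional arguments are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: print total/free huge pages of one size for every NUMA node.
# hugepages_per_node and its arguments are illustrative names.
# Arguments: page size in kB (default 1048576 = 1G), optional sysfs node root.
hugepages_per_node() {
    local size_kb="${1:-1048576}" node_root="${2:-/sys/devices/system/node}" d hp
    for d in "$node_root"/node[0-9]*; do
        hp="$d/hugepages/hugepages-${size_kb}kB"
        [ -r "$hp/nr_hugepages" ] || continue
        echo "$(basename "$d") total=$(cat "$hp/nr_hugepages") free=$(cat "$hp/free_hugepages")"
    done
}

# Usage:
#   hugepages_per_node           # 1G pages
#   hugepages_per_node 2048      # 2M pages
```

If the node given in hostnodes= of your numa0 line shows fewer free pages than the VM needs, the start will fail no matter how much memory the other node has.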

In order to add kernel command line parameters under Proxmox see wiki. The method will be different for hosts using legacy BIOS boot vs. UEFI boot.


VM starts only once; black screen on subsequent starts
This seems to be a rather new and peculiar bug. Personally, I wasn't able to narrow down its root cause. However, there's an easy workaround: reset PCI(e) devices passed from host to the VM before the VM is started. This can be achieved using:

Code:
// replace 0000:0b:00.0 with your device ID
# echo 1 > /sys/bus/pci/devices/0000\:0b\:00.0/remove
# echo 1 > /sys/bus/pci/rescan

Doing this automatically is slightly more complicated, as, to do it properly, ALL functions of a given device should be reset. Unfortunately Linux (as of 6.2) contains a race condition: the reset cannot be done too fast in a script, or otherwise the kernel will crash internally. See the resetVmPciDevices function in my hook script.
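A simplified sketch of the "all functions" variant follows. It is not the resetVmPciDevices function; the name reset_pci_device, the optional root argument and especially the fixed delay are assumptions of this example. The sleep is only a crude guess at avoiding the race, not a verified fix.

```shell
#!/usr/bin/env bash
# Sketch: remove every function of a PCI device, then rescan the bus.
# reset_pci_device, pci_root and the delay are illustrative; the delay is a
# guess at sidestepping the race described above, not a verified value.
reset_pci_device() {
    local dev="$1" pci_root="${2:-/sys/bus/pci}" delay="${3:-1}" fn
    # 0000:0b:00.0 -> 0000:0b:00.* (all functions in the same slot)
    for fn in "$pci_root/devices/${dev%.*}".*; do
        [ -e "$fn/remove" ] || continue
        echo 1 > "$fn/remove"
        sleep "$delay"
    done
    echo 1 > "$pci_root/rescan"
}

# Usage (as root), e.g. in the "pre-start" phase of a hook script:
#   reset_pci_device 0000:0b:00.0
```

For a GPU this typically covers both the video function (.0) and its audio function (.1).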


The future
This section was last updated in April 2023
While this tutorial is addressed more to people not involved in kernel development, I would like to leave a bit of exciting news for more technical folks. The kernel will hopefully soon gain support for BPF-controlled scheduling logic. With that in place many things described here will not be needed. Instead of manually pinning vCPUs, the eBPF code will be able to intelligently direct scheduling of non-vCPUs away from vCPUs. The mailing list exchange mentions not only benefits for VMs but also using machine learning to improve scheduling performance in general.





Changelog
  1. 2023/03/16: initial version
  2. 2023/04/11: fix last CPU calculation in the script; add PCIe reset workaround; add kthreads pinning; add IRQ isolation; describe SHPs reservation issue; add BPF scheduler note.
  3. Future work: explore vm_exit analysis; experiment with cpu-pm=on, test workqueue isolation
 
This is amazing work, you have my enduring gratitude for organizing this information.

Every time I tried digging into it myself I'd just get overwhelmed by the speculation and vaguely explained options because of poor documentation, obsolete methods, trying to figure out the domain that each solution covers, and further translating the instructions from other distros into what Proxmox does. The end result was better than doing nothing on a 4-NUMA-node Epyc chip, but still not completely addressed and suboptimal.
 
This is amazing work, you have my enduring gratitude for organizing this information.

Thank you! I added some more details and expanded the performance tweaks. (if you don't see 3 posts wait as I triggered moderator approval :D).


Every time I tried digging into it myself I'd just get overwhelmed by the speculation and vaguely explained options because of poor documentation, obsolete methods, trying to figure out the domain that each solution covers, and further translating the instructions from other distros into what Proxmox does. The end result was better than doing nothing on a 4-NUMA-node Epyc chip, but still not completely addressed and suboptimal.

Indeed, this is what motivated me to write and update this piece. It's extremely hard to truly understand, and even more so to implement, hypervisor tuning for low and predictable latency. In many cases it requires going directly to the Linux and QEMU source code.
This is why I wish some of these things could be brought natively into Proxmox PVE, to abstract the complexity.
 
thx mate, an interesting read and really well described.

proxmox is ok, but it's missing deep settings to get the last bit of performance out. i have seen this with ceph. but to be fair - proxmox is just a "generic" product. it should run anywhere with just standard settings. it would be good if they included some of the points you mentioned as an "advanced config".

i keep watching you.
 
Hey Kiler, awesome work!

Just stumbled on your post and it's very helpful. I'm trying to reduce the CPU latency for a Windows VM running a proprietary Video Management System, which is currently ingesting nearly 4000 frames per second using pure CPU power.

Your script, out of the box, only works for the Ryzen 9 CPU you described, right?
 
Nice posts and very detailed. Thank you! I also run into similar situations, so I can relate - although you dug deeper in some problems than I did, so thank you especially for that.

Back in ...hmm... 2016 or 2017, I also conducted hugepage experiments in PVE and yes, it speeds things up of course (around 10% last time I tried), yet in the end it wasn't worth the hassle in a clustered environment, because free hugepages are not considered when VMs are moved around after a node failure, resulting in VM failures. Hugepages are totally fine in a static environment and we use them exclusively on large Oracle databases (on bare metal as well as on VMs (host+guest)), yet you do not want frequent changes, and you will run into the same problems you described with memory fragmentation, up to the level where you need to reboot the system. A few years back, there was a shift from "I allocate the hugepages and everything in the way gets swapped out" to "I try to allocate and if it does not work, it will not allocate them". Sadly, neither is what I/you want in most cases.

I would really like to see more optimizations with respect to the HT architecture as you described. I will not try to optimize these things in a clustered environment, though. It would be a full-time job.
 
This guide is amazing, I have a NUMA-enabled DL580G7 with 4 sockets and weak CPUs. I've been investigating issues regarding virtualization in NUMA systems, but this guide is practically the answer to my prayers.

I will give this a go when I have a bit more time, but nevertheless thank you for your contribution - It would be amazing if Proxmox would actually integrate these findings.

Keep up the good work!
 
Our team is working on a version of kiler's script for multi-socket systems with homogeneous CPUs, which is our only use case for now.

I don't have any specific ETAs but we'll post it here when everything is battle tested, maybe in a couple of weeks from now.
 
interesting thread. very thorough! have you run any system benchmarks with your optimizations and without (proxmox stock) ?

it's no small amount of tweaking so very curious to see how the before and after effort shapes up
 
+
one thing kiler129 - in my testing it seems this script/guide will force existing processes off the VM-assigned cores, but there is little mechanism to later prevent the kernel from reassigning affinity, or even to stop processes launched in the future from sharing the VM cores? Maybe I'm missing a step.

The weird part here is the system monitor apps. For example, BTOP acts as expected and shows only the cores left over after assigning the VM blocks, i.e. user.slice, system.slice, init.scope... However, launching HTOP will show all cores, all processes, and their affinity (plot twist: they're still sharing VM cores).
Obviously I'm running as root on the Proxmox hypervisor, so it should not be a permissions issue or anything... odd!
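One way to cross-check what the monitoring tools show is to read the affinity that the kernel itself reports for a given process. A small sketch, assuming the Cpus_allowed_list field of /proc/PID/status; allowed_cpus is a hypothetical helper name, and its second argument exists only so it can be pointed at a test directory.

```bash
# Print the CPUs a process is allowed to run on, as reported by the kernel
# in the Cpus_allowed_list field of its status file.
# The optional second argument (proc directory) is only there for testing.
allowed_cpus() {
    local pid="$1"
    local root="${2:-/proc}"
    awk '/^Cpus_allowed_list:/ { print $2 }' "$root/$pid/status"
}
```

For example, allowed_cpus 1 prints the list for init. If a process that should be confined still reports the VM cores here, it really is allowed to run on them, regardless of what a particular monitor displays.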

Anyway, I've actually tried similar systemctl tweaks a while back, but sadly saw inconsistent results again and again. I mean, I'm expecting to see increased 3D performance here, with measurable A/B benchmark metrics.
Maybe I just run the unicorn Proxmox hardware rig where 99% of the defaults are optimal ;) Or are we optimizing hypervisor IO here beyond the point that synthetic benchmarks can represent?
 
interesting thread. very thorough! have you run any system benchmarks with your optimizations and without (proxmox stock) ?

it's no small amount of tweaking so very curious to see how the before and after effort shapes up
Testing a Video Monitoring System in stock Windows 10 Pro 22H2, we did only what is available within the Proxmox GUI, plus disabling Processor Vulnerability Mitigation and C-States 2 and 3, on a dual-processor system:
  • Before: 2000ish fps with 20% frames dropped
  • After: 3000ish fps with 0% frames dropped
We configured the cpu affinity option with all 16x physical cores across the 2 NUMA nodes.

I'll try to post more detailed benchmarks when we finish our testing, but yesterday we applied more changes using Windows Server 2022, basically doing everything we could to bind the 2nd processor exclusively to the Windows VM, and already passed 4000ish fps. That's basically double the VM performance of stock PVE.

BTW, our system is very common, it's a Dell R720xd with two Xeon E5-2667 v2.
 
Testing a Video Monitoring System in stock Windows 10 Pro 22H2, we did only what is available within the Proxmox GUI, plus disabling Processor Vulnerability Mitigation and C-States 2 and 3, on a dual-processor system:
  • Before: 2000ish fps with 20% frames dropped
  • After: 3000ish fps with 0% frames dropped
We configured the cpu affinity option with all 16x physical cores across the 2 NUMA nodes.
Hi. So what settings did you add or change in the VM config?
And while most of this script looks interesting for systems with 2 CPUs, like big servers, for gaming alone the CPU and storage have very little effect compared to the GPU used.
 
@Docop2
Mostly it was based on the following article:
  • Set the CPU affinity (we only used the 'real' physical cores at the time)
  • Disable Processor Vulnerability Mitigation
  • Disable C-States 2 and 3
That's it. That's all we did at the time.
 
This was posted back in March and we have been dealing with NUMA like this on PVE and Epyc hosts. Have you made any additional progress on being able to share/expose the topology to the guest without having to use pinning? Pinning is fine at home and such, but it is not acceptable in the datacenter. I am going to raise this with the PVE dev channel soon but first wanted to check in on your progress. I worked this same issue with VMware back in 2019 when Epyc first shipped, then again when 7002 shipped, forcing VMware to add a bunch of conditional args for the host and vmx layers. I have no problem doing this again with the PVE team, since we saw an insane performance jump after the topology was being presented to the guests correctly.
 
Thanks for this nice summary @kiler129!

I also struggled a lot with latency, lag and audio issues on my new desktop VM during the last months with several VFIO PCI devices.
Even though I got it sorted after some time, having found this thread earlier would have saved me a lot of time.

There is however one general question that comes to my mind:
Why are you not making use of isolated CPU set partitions instead of fiddling with kernel threads?

Looking at the cgroup setup of systemd + Proxmox 8, there is the qemu.slice control group in /sys/fs/cgroup/qemu.slice wherein each VM gets its own sub-group. It is definitely possible to create an isolated subset of cores within the qemu.slice cgroup or $VMID.scope cgroup that simply reduces the effectively used CPUs of the parent cgroup.

Let's say you'd like to run
  1. the VCPU threads of one primary 6x2 VM exclusively pinned on CPUs 6-11,18-23 - let's assume its ID is 100
  2. other tasks of the primary VM and all other VMs on 3x2 CPUs 3-5,15-17
  3. and everything else like system and user processes on 3x2 CPUs 0-2,12-14
you could do the following in some hook-script and the post-start action:

a) Restrict init.scope, system.slice and user.slice according to (3.)
Bash:
systemctl set-property --runtime init.scope AllowedCPUs=0-2,12-14
systemctl set-property --runtime system.slice AllowedCPUs=0-2,12-14
systemctl set-property --runtime user.slice AllowedCPUs=0-2,12-14

b) Restrict qemu.slice to include all VM CPUs from (1.) and (2.)
Bash:
systemctl set-property --runtime qemu.slice AllowedCPUs=3-11,15-23

c) Restrict all cores of the primary VM according to (1.) and (2.)
Bash:
VMID=100 # example
VMPID=$(cat /var/run/qemu-server/$VMID.pid)

# Make sure the $VMID.scope control group allows controlling cpusets
echo +cpuset > /sys/fs/cgroup/qemu.slice/cgroup.subtree_control
echo +cpuset > /sys/fs/cgroup/qemu.slice/$VMID.scope/cgroup.subtree_control

# Restrict the $VMID.scope control group to ALL CPUs for the primary VM
# Note: In this case this is equal to the qemu.slice but it could be a subset, too.
echo 3-11,15-23 > /sys/fs/cgroup/qemu.slice/$VMID.scope/cpuset.cpus
taskset -cpa 3-11,15-23 $VMPID

# Promote the cpuset partition of the qemu.slice and $VMID.scope control groups
echo root > /sys/fs/cgroup/qemu.slice/cpuset.cpus.partition
echo root > /sys/fs/cgroup/qemu.slice/$VMID.scope/cpuset.cpus.partition

# Create a dedicated control group below the $VMID.scope to hold only the VCPU threads
mkdir -p /sys/fs/cgroup/qemu.slice/$VMID.scope/vcpus

# Make the newly created control group threaded (we'd like to transfer threads, not processes)
echo threaded > /sys/fs/cgroup/qemu.slice/$VMID.scope/vcpus/cgroup.type

# Add only the CPUs for the VCPU threads into their dedicated control group
echo 6-11,18-23 > /sys/fs/cgroup/qemu.slice/$VMID.scope/vcpus/cpuset.cpus

# Isolate the new control group from the kernel scheduler
echo isolated >/sys/fs/cgroup/qemu.slice/$VMID.scope/vcpus/cpuset.cpus.partition

# Put the VCPUS and VCPU threads for the (running) primary VM into arrays
VCPUS=(6 7 8 9 10 11 18 19 20 21 22 23)
VCPUTHREADS=($(ps -o tid=,comm= -T -q $VMPID | grep 'CPU' | grep '/KVM' | awk '{print $1}'))

# Pin the VCPU threads to individual cores
for i in {0..5}; do
    j=$(($i + 6))

    cpu_i=${VCPUS[$i]}
    cpu_j=${VCPUS[$j]}

    thread_i=${VCPUTHREADS[$i]}
    thread_j=${VCPUTHREADS[$j]}

    # Move threads
    echo $thread_i > /sys/fs/cgroup/qemu.slice/$VMID.scope/vcpus/cgroup.threads
    echo $thread_j > /sys/fs/cgroup/qemu.slice/$VMID.scope/vcpus/cgroup.threads

    # Set affinity
    taskset -cp $cpu_i $thread_i
    taskset -cp $cpu_j $thread_j
done

I think this should be all that is needed to isolate the primary VM - not 100% sure, though. At least the VM is running smoothly. :-D
Of course the whole procedure could be accompanied with affinity for VFIO IRQs.
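The hard-coded VCPUS array above has to be kept in sync with the CPU list written to cpuset.cpus by hand. As a possible convenience, the list string could be expanded instead. This is just a sketch; expand_cpu_list is a name made up for this example, and it only handles plain "a-b,c" lists, not strides or other extended syntax.

```bash
# Expand a kernel-style CPU list such as "6-11,18-23" into single CPU
# numbers separated by spaces ("6 7 8 9 10 11 18 19 20 21 22 23").
expand_cpu_list() {
    local part start end cpu out=""
    for part in $(echo "$1" | tr ',' ' '); do
        start="${part%%-*}"   # "6-11" -> 6, a lone "5" stays 5
        end="${part##*-}"     # "6-11" -> 11, a lone "5" stays 5
        cpu="$start"
        while [ "$cpu" -le "$end" ]; do
            out="$out${out:+ }$cpu"
            cpu=$((cpu + 1))
        done
    done
    echo "$out"
}
```

In the script above this could replace the literal array, e.g. VCPUS=($(expand_cpu_list "6-11,18-23")), so that the same string feeds both cpuset.cpus and the pinning loop.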

Be aware, however, that this can quickly become a mess while experimenting, as the cgroups get invalidated very easily.
I've debugged this using a small helper script like this:
Bash:
# Show cgroups with custom cpusets
show_cgroups() {
    echo
    echo Control groups with defined CPU sets:
    for path in $(find /sys/fs/cgroup -name cpuset.cpus | sort); do
        local cpus=$(cat "$path")
        if [[ -n "$cpus" ]]; then
            local name=$(dirname "${path#/sys/fs/cgroup/}")

            local effective=$(cat "$path.effective")
            local partition=$(cat "$path.partition")

            local type=$(cat "$(dirname $path)/cgroup.type")
            if [[ "$type" == "threaded" ]]; then
                local pids=$(cat "$(dirname $path)/cgroup.threads")
            else
                local pids=$(cat "$(dirname $path)/cgroup.procs")
            fi
            local cpuset="-- (cpuset: $partition $cpus/$effective)"
            local cgroup=""
            [[ ! -z "$pids" ]] && cgroup="-- ($type: $pids)"
            echo $name $cpuset $cgroup
        fi
    done
}

By running
Bash:
show_cgroups
you should get output like this if it was successful and nothing else has changed:
Code:
Control groups with defined CPU sets:
init.scope -- (cpuset: member 0-2,12-14/0-2,12-14) -- (domain: 1)
qemu.slice/100.scope -- (cpuset: root 3-11,15-23/3-5,15-17) -- (domain threaded: PID PID PID PID)
qemu.slice/100.scope/vcpus -- (cpuset: isolated 6-11,18-23/6-11,18-23) -- (threaded: TID TID TID TID TID TID TID TID TID TID TID TID)
qemu.slice -- (cpuset: root 3-11,15-23/3-5,15-17)
system.slice -- (cpuset: member 0-2,12-14/0-2,12-14)
user.slice -- (cpuset: member 0-2,12-14/0-2,12-14)

Hope I did not make any typos and that this is useful.
Please be careful if you'd like to test this...
 
Awesome info, made my understanding of things a whole lot better.

If I'm counting on this hook-script, do the equivalent GUI pieces still need to be configured? For example, for pinning the vCPUs there's a dedicated function in the script, but that's also available in the GUI. Do we leave the GUI alone and let the hook-script take care of all the logic?
 
About allocating hugepages at boot on a NUMA system with the kernel boot args: it will allocate them evenly across the nodes, so it won't work if you're reserving just enough hugepages for the one VM.

Just something to note in your troubleshooting section. Maybe we can allocate them with a script at boot before any other VM is turned on, no clue. I just allocated less memory to the VM.
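For what it's worth, the kernel exposes per-node huge page counters in sysfs, so a boot-time script along these lines might work. This is only a sketch under assumptions: reserve_node_hugepages is a hypothetical helper, the 1G page size is hard-coded, and a runtime request can be granted only partially when memory is already fragmented, which is why the function reads the value back.

```bash
# Request COUNT 1G huge pages on a single NUMA node and report how many
# the kernel actually granted. Returns non-zero if fewer were granted.
# The optional third argument (node directory) is only there for testing.
reserve_node_hugepages() {
    local node="$1"
    local count="$2"
    local root="${3:-/sys/devices/system/node}"
    local file="$root/node$node/hugepages/hugepages-1048576kB/nr_hugepages"
    local got
    echo "$count" > "$file" || return 1
    got=$(cat "$file")
    echo "node$node: requested=$count granted=$got"
    [ "$got" -ge "$count" ]
}
```

For example, reserve_node_hugepages 1 16 would ask for sixteen 1G pages on node 1 only. It would need to run as root, as early as possible after boot and before any VM is started.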

Great thread BTW. You put a crap ton of work into figuring out cgroups v2 & Proxmox, I applaud you! I really hope the Proxmox team works on making these features native.
 
I also struggled a lot with latency, lag and audio issues on my new desktop VM during the last months with several VFIO PCI devices.
Even though I got it sorted after some time, having found this thread earlier would have saved me a lot of time.
May I ask how you sorted out your lag & audio issues? I am wrestling with this right now on a Windows 10 gaming VM that has USB/HDMI audio popping/crackling under load, as well as periodic lag in games, and my googling has led me here.