proxmox-cpu-affinity service - version 0.0.9

Der Harry

Hello Proxmoxer,

I created a tool to automatically set the CPU affinity (at runtime) for VMs.

https://github.com/egandro/proxmox-cpu-affinity

It picks the best combination of cores, based on the core-to-core latency measurement approach from this project: https://github.com/nviennot/core-to-core-latency.
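If you want to see the latency matrix for your own box, you can run the upstream tool directly. A minimal sketch, assuming a Rust toolchain is installed (this is just the stock cargo install of that project, not something my service needs):

Bash:
# install and run the core-to-core latency benchmark (prints a latency matrix)
cargo install core-to-core-latency
core-to-core-latency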

Here is an example.

[Screenshot: core-to-core latency heatmap]


CPU 1 loves CPU 2.
CPU 16 hates CPU 2 - it's about 3.5x slower (on average) to pick that affinity.
If you have 128 cores (with multiple dies on a chip) or even a dual-socket machine, the results are even more dramatic.
(Fun fact: every CPU has a "buddy" with the lowest latency. That is most likely its HT twin.)


If you have 10 VMs with 8 cores each (2 sockets x 4 cores, 1 socket x 8 cores, ...), it would be awesome to set the affinity.

Proxmox supports affinity - however, it is not calculated for your specific CPU.
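Today you have to pick the core set by hand, roughly like this (VMID 100 and the core list 0-3 are just placeholders):

Bash:
root@proxmox:~# qm set 100 --affinity 0-3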

Wouldn't it be nice if something could automatically pick:

- CPU1 with 3, 5, 6 ...
- CPU2 with 3, 4, 5 ...
- ...
- CPU9 with 10, 11, 12, ...

That is where proxmox-cpu-affinity kicks in.

It has a service that scans your CPU at startup and works out the best core combinations. (It can even listen to CPU changes, e.g. if you have data-center hardware with CPU hot-swap.)

You have to install a hookscript on your VM. On the start event the VM knocks at the service. The service can then detect the PID and, via a round-robin algorithm, assign optimal low-latency cores to the VMs.
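For illustration, a minimal sketch of what such a hookscript could look like - this is not the project's actual script, and the service endpoint/port here are made up. PVE calls a hookscript with the VMID and the phase:

Bash:
#!/bin/bash
# hypothetical hookscript sketch: PVE invokes it as <script> <vmid> <phase>
vmid="$1"
phase="$2"

if [ "$phase" = "post-start" ]; then
    # knock at the affinity service so it can look up the QEMU PID and set the affinity
    curl -s "http://127.0.0.1:8555/affinity?vmid=${vmid}" >/dev/null || true
fi

exit 0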

There is a CLI, too, that helps you attach / detach the hookscript.
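Attaching a hookscript by hand works like this (documented in the PVE admin guide; the snippet name is just an example of what the CLI automates, and it assumes a storage with snippets content enabled):

Bash:
root@proxmox:~# qm set 100 --hookscript local:snippets/affinity-hook.sh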

Hints:

- This is not a scheduler - that is what your kernel does.
- This is not a load balancer - that is also done by your kernel.
- CPU affinity is not a dictatorship, it's a wish request from the user to the kernel. The kernel will try to pin a process or thread to these CPUs.
- On purpose, the hookscript is not installed on VM templates nor on VMs that are part of HA (Proxmox config limitations would require the hookscript to be available on the same storage on all HA nodes).

I have no idea how I can measure a performance gain! I think pinning processes / threads to cores with optimal latency outsmarts the kernel. Otherwise taskset(1) would not exist in the first place.

If you have access to >128 or >512 core systems, please ping me. I am interested in making this fly (or in adding a tool to gather some data).

PVE Hook Scripts: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_hookscripts

PRs + Bug Reports are welcome.

Happy xmas!
 
We now have an SVG exporter.

VMs are automatically assigned. proxmox-cpu-affinity tries to avoid bad combos.

Bash:
root@proxmox:~# proxmox-cpu-affinity status svg --affinity -o heatmap-affinity.svg

This is still not production-ready, but it now runs on all of my servers.


[Screenshot: exported affinity heatmap SVG]
 
In general an interesting approach, but I'd argue that for Proxmox and highly dynamic virtual environments this approach is not as good as it may sound at first.

Let me give you an example:

If you have a NUMA environment with, for example, two CPU sockets, or you are using multi-CCD CPUs like the 7950X, the inter-core communication latency might in fact become noticeable, at least for certain tasks. Though, I'd argue that in general it's a negligible side effect.

Let's assume you have a cluster of three 32-core systems with two CCDs each, so you most likely want VMs to stick to either the group 0-15 or 16-31 for maximum performance. If you have a fairly static number of VMs, you can calculate and pin the VMs to their respective best cores. That works up to the point where you have, for example, 20, 30, 40, 100 VMs which all, in the end, need to share the same number of cores. By taking away the dynamic load balancing due to core pinning, this becomes an increasingly hard task to manage, especially in large(r) environments and also in cluster environments.

Assume you have 400 VMs running and need to pin those to cores. You have to move VMs back and forth between cluster nodes for management purposes, load balancing, etc. This will quickly become less and less manageable and thus impractical.

Even though you lose some nanoseconds due to inter-CCD / NUMA communication, the value of having the system auto-balance the load, especially for a high number of VMs, should be considered as well. Personally, I would not set affinity for each and every VM, because I'd lose a lot of dynamic behaviour and load balancing, and would rather decide on a case-by-case basis whether a VM requires a dedicated affinity.

Also, this whole idea gets a lot more complex on a cluster which does not consist of identical node hardware. Again, NUMA/affinity settings exist for a reason, but I'd argue that in a virtualized environment it highly depends on the use case.
 
In general an interesting approach, but I'd argue that for Proxmox and highly dynamic virtual environments this approach is not as good as it may sound at first.

Well :) this line is the only line in Proxmox that sets task affinity.

https://github.com/proxmox/qemu-ser...db6cd99ac60bbfbb5/src/PVE/QemuServer.pm#L3202

Even with NUMA (where you have multiple CPUs with direct and non-direct memory access), the user (that's you!) has to write a value into the affinity config. Otherwise every CPU is taken. That's the starting point.
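For reference, that value lives in the VM config as a plain cpuset string (the core list here is just an example):

Code:
# /etc/pve/qemu-server/100.conf
affinity: 0,1,8,9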

If you have a NUMA environment with, for example, two CPU sockets, or you are using multi-CCD CPUs like the 7950X, the inter-core communication latency might in fact become noticeable, at least for certain tasks. Though, I'd argue that in general it's a negligible side effect.

That is a different scenario. You can optimize with socket 0, 1, ... Still, Proxmox doesn't do that. The kernel does. And you pray.

Let's assume you have a cluster of three 32-core systems with two CCDs each, so you most likely want VMs to stick to either the group 0-15 or 16-31 for maximum performance. If you have a fairly static number of VMs, you can calculate and pin the VMs to their respective best cores. That works up to the point where you have, for example, 20, 30, 40, 100 VMs which all, in the end, need to share the same number of cores. By taking away the dynamic load balancing due to core pinning, this becomes an increasingly hard task to manage, especially in large(r) environments and also in cluster environments.

This assumption is wrong. I can prove it. Look at the numbers.

1) I totally agree with the socket example.
2) I don't agree with "always stick to 0-15, 16-31". You don't know which number a CPU has (find /sys/devices/system/cpu). They are numbered 0, 1, 2, ... and it is not 100% accurate to assume the grouping - you need to check each core and see which socket/die it sits on (see the sketch after this list).
3) Even within the group 0-15 ... you can measure latency differences. Even on a 13th-gen Intel with P-/E-cores.
4) Let's assume you don't assign 20 VMs with 16 cores. You have 20 VMs with 8 cores. You want to pick the "best combination" of cores and stick to it.
5) That is something the kernel is not aware of. That's the silicon lottery - your CPU is different from mine, even if it's the same type of CPU.
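To actually see which socket/die a logical CPU belongs to, you can ask the kernel directly (standard tools, nothing from my project):

Bash:
# one row per logical CPU with its core, socket and NUMA node
lscpu -e=CPU,CORE,SOCKET,NODE

# or read the topology straight from sysfs
grep . /sys/devices/system/cpu/cpu*/topology/physical_package_id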

-> The gaming industry knows this. If they need 8 out of X cores, games do the latency test. They "might" do this on consoles, too. (I have to be careful with my words here.) :)
-> The overclocking scene knows that some cores are a much better choice than others.

So the only way to "really" know is to measure it. That's what this approach is about.


Assume you have 400 VMs running and need to pin those to cores. You have to move VMs back and forth between cluster nodes for management purposes, load balancing, etc. This will quickly become less and less manageable and thus impractical.

100% agreed. We can't "magically" increase cache or cache efficiency. We are selecting CPU affinity - a wish request saying which CPU cores we think should be buddies. The kernel can still decide - at any time - to say no, destroying the cache efficiency.

Even though you lose some nanoseconds due to inter-CCD / NUMA communication, the value of having the system auto-balance the load, especially for a high number of VMs, should be considered as well. Personally, I would not set affinity for each and every VM, because I'd lose a lot of dynamic behaviour and load balancing, and would rather decide on a case-by-case basis whether a VM requires a dedicated affinity.

Again - affinity. This software only calls taskset. It's not load balancing. It's not scheduling. It is just picking the best buddies.


Bash:
TASKSET(1)                                  User Commands                                   TASKSET(1)

NAME
       taskset - set or retrieve a process's CPU affinity

SYNOPSIS
       taskset [options] mask command [argument...]

       taskset [options] -p [mask] pid

DESCRIPTION
       The taskset command is used to set or retrieve the CPU affinity of a running process given its
       pid, or to launch a new command with a given CPU affinity. CPU affinity is a scheduler property
       that "bonds" a process to a given set of CPUs on the system. The Linux scheduler will honor the
       given CPU affinity and the process will not run on any other CPUs. Note that the Linux scheduler
       also supports natural CPU affinity: the scheduler attempts to keep processes on the same CPU as
       long as practical for performance reasons. Therefore, forcing a specific CPU affinity is useful
       only in certain applications. The affinity of some processes like kernel per-CPU threads cannot
       be set.
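
For a started VM this boils down to something along these lines (the PID file path is where a stock PVE install keeps the QEMU PID, if I'm not mistaken; the core list is only an example, not what the service would actually compute):

Bash:
root@proxmox:~# taskset -cp 2,18 $(cat /run/qemu-server/100.pid)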

Also, this whole idea gets a lot more complex on a cluster which does not consist of identical node hardware. Again, NUMA/affinity settings exist for a reason, but I'd argue that in a virtualized environment it highly depends on the use case.

Why is calling taskset "complex"?

Fun fact: did you read the README? :)

Code:
There is no guarantee that this project will increase performance. This is an experiment.

My next task is a benchmark tool. That's not simple. You need to create a bunch of VMs and run them with and without proxmox-cpu-affinity - e.g. compute, Redis, PostgreSQL, web servers, memcached, ...
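A rough starting point could be a simple CPU-bound run inside a test guest, once with and once without the hookscript attached. A sketch using sysbench as a stand-in (the real benchmark mix is still open):

Bash:
# inside the guest
apt install -y sysbench
sysbench cpu --threads=8 --time=60 run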

The cache latency (as the single factor we are working with here) differs dramatically between two cores on one chip. The AMD example above is from one of my servers. There are combos that I would avoid like the plague as buddies. The kernel doesn't know about this.

Also - in Proxmox - we have the Linux kernel and many, many other (guest) kernels running together at the same time on one system. There is no way that that many schedulers, plus the scheduler in the hardware, are the best team you can have.