So, I ran several tests.
TLDR:
As far as this testing of CPU pinning goes, numactl can do most of what taskset can do and more. One of those additional features offers better NUMA memory latency, but it is a double-edged sword on servers whose NUMA nodes run out of memory. If you only want to pin a VM to cores in a single NUMA domain, taskset works just as well as numactl. However, numactl isn't installed by default. If/when numactl becomes one of the default utilities installed on PVE, I would consider making a patch to switch from taskset to numactl. In order to utilize the memory-binding feature of numactl, a GUI switch would need to be added as well; something to the effect of "Restrict VM Memory to NUMA Nodes".
Test Rig:
TR 3975WX, 32c/64t, 8 NUMA domains
Code:
root@pve-01:~# numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 32 33 34 35
node 0 size: 64295 MB
node 0 free: 61465 MB
node 1 cpus: 4 5 6 7 36 37 38 39
node 1 size: 64508 MB
node 1 free: 62149 MB
node 2 cpus: 8 9 10 11 40 41 42 43
node 2 size: 64508 MB
node 2 free: 62778 MB
node 3 cpus: 12 13 14 15 44 45 46 47
node 3 size: 64508 MB
node 3 free: 62351 MB
node 4 cpus: 16 17 18 19 48 49 50 51
node 4 size: 64508 MB
node 4 free: 61332 MB
node 5 cpus: 20 21 22 23 52 53 54 55
node 5 size: 64473 MB
node 5 free: 62232 MB
node 6 cpus: 24 25 26 27 56 57 58 59
node 6 size: 64508 MB
node 6 free: 62483 MB
node 7 cpus: 28 29 30 31 60 61 62 63
node 7 size: 64488 MB
node 7 free: 61570 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 11 11 11 11 11 11 11
1: 11 10 11 11 11 11 11 11
2: 11 11 10 11 11 11 11 11
3: 11 11 11 10 11 11 11 11
4: 11 11 11 11 10 11 11 11
5: 11 11 11 11 11 10 11 11
6: 11 11 11 11 11 11 10 11
7: 11 11 11 11 11 11 11 10
*Each numa domain has 64G of memory. This is important.
Test Script:
hold_mem.sh
Bash:
#!/bin/bash
# Hold roughly $1 worth of memory: head reads that many zero bytes, pv shows
# progress, and tail buffers everything it receives, keeping it resident.
MEM="$1"
</dev/zero head -c "$MEM" | pv | tail
The script will write zeros to memory, allocating it to the process.
To allocate 96G of memory:
./hold_mem.sh 96000m
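To see where that memory actually ends up, numastat (it comes in the same package as numactl) can report per-node memory for a process. This wasn't part of the test output above, but a rough sketch of the check looks like this (it assumes the zeros are buffered by the tail at the end of the pipe and that no other tail process is running):
Bash:
# Start an allocation in the background, give it time to build up, then show
# how much of the holding process's memory sits on each NUMA node.
./hold_mem.sh 32000m &
sleep 30
numastat -p "$(pgrep -x tail)"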
This table shows the commands used to launch the script.
Scenario | Command |
01 | ./hold_mem.sh 32000m |
02 | ./hold_mem.sh 96000m |
03 | taskset --cpu-list --all-tasks 16-19 ./hold_mem.sh 32000m |
04 | taskset --cpu-list --all-tasks 16-19 ./hold_mem.sh 96000m |
05 | taskset --cpu-list --all-tasks 8-11,16-19 ./hold_mem.sh 96000m |
06 | numactl -C +16-19 ./hold_mem.sh 32000m |
07 | numactl -C +16-19 ./hold_mem.sh 96000m |
08 | numactl -C +8-11,16-19 ./hold_mem.sh 96000m |
09 | numactl -C +16-19 --membind=4 ./hold_mem.sh 32000m |
10 | numactl -C +16-19 --membind=4 ./hold_mem.sh 96000m |
11 | numactl -C +8-11,16-19 --membind=2,4 ./hold_mem.sh 96000m |
The second table shows the properties of each scenario.
Scenario | Allocated Memory | Pinning | Numa Nodes | Cores |
01 | 32G | None | | |
02 | 96G | None | | |
03 | 32G | taskset | 4 | 16-19 |
04 | 96G | taskset | 4 | 16-19 |
05 | 96G | taskset | 2,4 | 8-11,16-19 |
06 | 32G | numactl | 4 | 16-19 |
07 | 96G | numactl | 4 | 16-19 |
08 | 96G | numactl | 2,4 | 8-11,16-19 |
09 | 32G | numactl w/ membind | 4 | 16-19 |
10 | 96G | numactl w/ membind | 4 | 16-19 |
11 | 96G | numactl w/ membind | 2,4 | 8-11,16-19 |
The third table tracks which numa node the memory was allocated to.
For example, in Scenario 04, we allocated 96000m of memory. The memory from 0-64GB was allocated on NUMA node 4, and the memory from 64-96GB was allocated on NUMA node 5 (N + 1).
Scenario | Node 0-64GB | Node 64-96GB |
01 | Any | N/A |
02 | Any | N + 1 |
03 | 4 | N/A |
04 | 4 | N + 1 |
05 | 2 or 4 | N + 1 |
06 | 4 | N/A |
07 | 4 | N + 1 |
08 | 2 or 4 | N + 1 |
09 | 4 | N/A |
10 | 4 | FAILED |
11 | 2 or 4 | 2 or 4 (Not N) |
I ran each scenario a few times because, as you can see, some of these are not deterministic.
When we start a process, the scheduler puts it on a core and starts allocating memory. Both taskset and numactl act as a shim to ensure the scheduler puts our next process on the correct cores. Furthermore, numactl has the ability to lock the allocation of memory to particular numa node memory pools.
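As a minimal illustration of that shim idea (same cores/node as in the tests above), both tools wrap the command at launch time, and numactl just layers a memory policy on top of the CPU affinity:
Bash:
# CPU affinity only -- memory comes from whichever node the process lands on.
taskset --cpu-list 16-19 ./hold_mem.sh 32000m

# CPU affinity plus a memory policy restricted to node 4's pool.
numactl -C +16-19 --membind=4 ./hold_mem.sh 32000m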
1. In situations where you want to pin a VM to a SINGLE NUMA node, there's not really any major problem with taskset or numactl. Both will allocate the memory on the correct NUMA node. However, numactl can force the memory to be bound to that NUMA node, failing if the node does not have enough memory. This is the first concern: if we membind with numactl and go over our memory definition by even the slightest amount, the process will be killed. (A quick pre-check for this is sketched after point 2 below.)
2. In situations where you want to pin a VM to MULTIPLE NUMA nodes, there is a problem. With both taskset and numactl, the scheduler picks a core from the cpuset more or less at random to start the process on, and memory is first allocated from the memory pool of the NUMA node that core belongs to. If we have 8 NUMA nodes and we choose cores from NUMA nodes 2 and 4, there are two possible scenarios.
A. The process lands on 4, and memory gets allocated from the memory pools of numa nodes 4 and 5.
B. The process lands on 2, and memory gets allocated from the memory pools of numa nodes 2 and 3.
Neither of these is desirable. Even when we pick NUMA nodes that sit next to each other, there's only a 50% chance that we will allocate memory from both of the pools we want.
However, numactl has a way around this: the --membind parameter forces the process's allocations into the memory pools of a defined set of NUMA nodes, e.g. --membind=2,4. With this, there is a 100% chance that the process will pull memory from the correct NUMA node pools. This is also a double-edged sword: if there is not enough memory in those pools to satisfy the process, the process will be terminated.
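Two quick sketches related to the membind caveats above; neither was part of the recorded tests, and the node numbers and <pid> are just placeholders:
Bash:
# 1. Before binding to node 4, check that it has enough free memory for the
#    whole allocation; otherwise the bound process is killed once the pool runs dry.
numactl -H | grep 'node 4 free'

# 2. To confirm a bind policy took effect, the mappings of the bound process
#    should show it in numa_maps, e.g. "bind:2,4" (replace <pid> with the PID
#    of the process holding the memory).
grep -m 5 'bind' /proc/<pid>/numa_maps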
--
As the TLDR says above, numactl seems to be able to do almost everything taskset can do. Numactl isn't installed by default, so that would need to change before numactl could be used for this. For binding a VM to a single NUMA node, taskset works just fine -- it doesn't even kill the process if the NUMA node runs out of memory. When binding a VM to multiple NUMA nodes, there's a high probability of ending up with a big chunk of memory outside of the NUMA boundary.
--
Note: In my research, I found a few references saying that numactl cannot change the cpu cores that a running process is already bound to. I believe this to be the case, but I don't ultimately see that as a con.
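For completeness, taskset can change the affinity of a process that is already running; a sketch with a made-up PID:
Bash:
# Re-pin every thread of PID 12345 onto cores 16-19. Memory that has already
# been allocated stays where it is; only future scheduling is affected.
taskset --cpu-list --all-tasks --pid 16-19 12345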