Proxmox NUMA static configuration support

ryang

New Member
May 18, 2023
I'm still struggling to fully understand everything about NUMA, but I feel like I've got a decent understanding of how Proxmox handles it, as I've run many configuration tests over the last few days trying to solve my performance issues.

I have an AMD EPYC 7551P (Zen 1 architecture, 7001 series) 32-core/64-thread CPU and 256 GB of RAM (8 DIMMs, 2 channels per NUMA node, 64 GB per node).

I notice that when I enable the NUMA feature on my two VMs in Proxmox, both VMs usually latch onto the same NUMA node, even though I've configured the VMs to use all of a node's cores (sockets, as defined in Proxmox). I don't want both VMs latching onto the same NUMA node and sharing its resources. I've also manually set affinity on both VMs, but there is no guarantee that, when they start up, they will latch onto the NUMA node that actually contains the cores defined in the affinity configuration in Proxmox.

I'd like to be able to assign each VM to a specific NUMA node and predefine its affinity to that node's cores, ensuring the other VM never tries to use them. I'd also like to guarantee that a VM gets assigned the right NUMA node when it starts up; otherwise it could end up on a different node than the one its affinity points at, which I presume is also causing performance problems.

Am I missing something about how NUMA works with Proxmox?

Currently I feel like I need to disable the NUMA feature in Proxmox, set affinity in Proxmox, and then write a hook script for each VM that uses something like numactl to enforce the NUMA configuration and statically assign the node I want for that VM; a rough sketch of what I have in mind is below.
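
To illustrate, here is a minimal, untested sketch of such a hook script, assuming Proxmox's hookscript interface (it is called with the VMID and the phase) and the usual /var/run/qemu-server/<vmid>.pid PID file; the script name, CPU list and node number are just placeholders for node 0 on my system:

Code:
#!/bin/bash
# Hypothetical hook script, registered with:
#   qm set <vmid> --hookscript local:snippets/pin-numa.sh
vmid="$1"
phase="$2"

# Example: pin this VM to NUMA node 0 (host CPUs 0-7,32-39 on this system).
CPUS="0-7,32-39"

if [ "$phase" = "post-start" ]; then
    # qemu-server writes the QEMU PID here when the VM starts.
    pid=$(cat "/var/run/qemu-server/${vmid}.pid")
    # Re-pin all threads of the QEMU process to the chosen node's CPUs.
    taskset -a -cp "$CPUS" "$pid"
    # migratepages "$pid" all 0   # (numactl package) could also move already-allocated memory
fi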

I also have a hard time understanding how proxmox handles SMT in relation to cores (not vcpus, but cores), as with SMT, one core has two threads, when I assign a numa node which has 8 cores, it has a total of 16 threads, yet in my VM I only see the 8 cores. Thus I've been assigning 16 cores, assuming that proxmox simply things each thread is a core and thus for me to max out a numa node, I must define 16 cores in the configuration - but I've noticed that I had some performance problems when doing so, thus I'd like clarification on how for this CPU I should best define the configuration and should I do a custom hookscript to statically assign the numa nodes?
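
For reference, the host's core-to-thread pairing can be inspected like this (the sysfs path is standard Linux; on this box CPU 0's SMT sibling should be CPU 32, matching the 0-7,32-39 node layout shown further down):

Code:
# Show each logical CPU with its physical core, socket and NUMA node.
lscpu -e=CPU,CORE,SOCKET,NODE

# List the logical CPUs that share CPU 0's physical core (its SMT sibling).
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list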

thoughts?
 
Somehow I didn't come across this during my days of research, but I've since found the method of statically assigning NUMA nodes in Proxmox. I tested it, and my system still feels a bit sluggish, which may be attributable to other factors (that's fine; I can make another post specifically about further optimization, as I don't want to stray from this post's initial question).

For others, this may help.

First, let's look at my system's NUMA nodes:

Code:
root@server:~# lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               AuthenticAMD
  BIOS Vendor ID:        Advanced Micro Devices, Inc.
  Model name:            AMD EPYC 7551P 32-Core Processor
    BIOS Model name:     AMD EPYC 7551P 32-Core Processor                Unknown CPU @ 2.0GHz
    BIOS CPU family:     107
    CPU family:          23
    Model:               1
    Thread(s) per core:  2
    Core(s) per socket:  32
    Socket(s):           1
    Stepping:            2
    Frequency boost:     enabled
    CPU(s) scaling MHz:  114%
    CPU max MHz:         2000.0000
    CPU min MHz:         1200.0000
    BogoMIPS:            3999.77
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_
                         tsc cpuid extd_apicid amd_dcm aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm
                          sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx
                          smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic
                          v_vmsave_vmload vgif overflow_recov succor smca
Virtualization features:
  Virtualization:        AMD-V
Caches (sum of all):
  L1d:                   1 MiB (32 instances)
  L1i:                   2 MiB (32 instances)
  L2:                    16 MiB (32 instances)
  L3:                    64 MiB (8 instances)
NUMA:
  NUMA node(s):          4
  NUMA node0 CPU(s):     0-7,32-39
  NUMA node1 CPU(s):     8-15,40-47
  NUMA node2 CPU(s):     16-23,48-55
  NUMA node3 CPU(s):     24-31,56-63
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Mitigation; untrained return thunk; SMT vulnerable
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected


root@server:~# numactl -H;cat /proc/meminfo | grep -i huge

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 64315 MB
node 0 free: 4134 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 64507 MB
node 1 free: 1278 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 64507 MB
node 2 free: 2405 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 64452 MB
node 3 free: 63651 MB
node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10
AnonHugePages:     22528 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        187695104 kB

We can see that in my case the CPU has 4 NUMA nodes (0-3), and each node has roughly 64 GB of RAM.

In your VM's config file:
/etc/pve/qemu-server/<vmid>.conf

ensure that NUMA is enabled, then define the NUMA nodes you'd like to use. Here is my full config as an example:

Code:
affinity: 1,33,2,34,3,35,4,36,5,37,6,38,7,39,9,41,10,42,11,43,12,44,13,45,14,46,15,47
agent: 1
args: -cpu 'host,+kvm_pv_unhalt,+kvm_pv_eoi,hv_vendor_id=NV43FIX,kvm=off'
balloon: 0
bios: ovmf
boot: order=hostpci0;ide2;sata0;net0
cores: 14
cpu: host
cpuunits: 10000
hostpci0: 0000:21:00,pcie=1
hostpci1: 0000:25:00,pcie=1
hostpci2: 0000:03:00,pcie=1
hostpci3: 0000:41:00,pcie=1,x-vga=1
hostpci4: 0000:24:00,pcie=1
hostpci5: 0000:23:00,pcie=1
hotplug: disk,network,usb
hugepages: 1024
ide2: none,media=cdrom
machine: pc-q35-8.0
memory: 61440
meta: creation-qemu=7.2.0,ctime=1682487998
name: WINPC
net0: virtio=42:61:33:D6:6E:29,bridge=vmbr0,firewall=1
numa: 1
numa0: cpus=0-13,hostnodes=0,memory=30720,policy=bind
numa1: cpus=14-27,hostnodes=1,memory=30720,policy=bind
onboot: 1
ostype: win11
sata0: none,media=cdrom
scsihw: virtio-scsi-pci
smbios1: uuid=145441bc-0976-4db8-a28b-b09fa609ad8a
sockets: 2
tpmstate0: local-lvm:vm-100-disk-0,size=4M,version=v2.0
usb0: host=3-2.2
usb1: host=3-2.3
usb2: host=9-1.3
usb3: host=11-2.1
vga: none
vmgenid: 6ed17591-01f9-4939-a46a-5dee5ec1a2f6

Note: I've read that it's best to leave one core per NUMA node free for the host. Since SMT is enabled, each node exposes 16 logical CPUs, so I figure the most a VM should use per node is 14; hence cores is set to 14 and sockets to 2, totalling 28 of the 32 possible logical CPUs across the two NUMA nodes. This is also why host CPUs 0/32 and 8/40 (the first core of node 0 and of node 1) are left out of the affinity list above.

Note: the 'numa0:' and 'numa1:' lines do not refer to the actual host NUMA nodes you want to use; they just number the virtual NUMA nodes presented to the guest. The 'hostnodes' parameter on those lines is where you specify which host NUMA node backs each one.

Example of wanting to use host NUMA node 2:

Code:
numa: 1
numa0: cpus=0-13,hostnodes=2,memory=61440,policy=bind

In this example, NUMA is enabled and the first guest NUMA node (numa0 in the config) is backed by host NUMA node 2, as set by the hostnodes parameter. Since this example uses only one host node, all of the VM's memory should be allocated from that node, hence memory is set to 61440. (It should be a multiple of 1024 if hugepages is set to 1024; I wanted 60 GB of RAM, so 60 x 1024 = 61440.)
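
To confirm after the VM starts that the hugepages were actually taken from the intended node, the per-node counters can be checked. This is a sketch assuming the numactl package (which provides numastat) is installed; with hugepages: 1024 the relevant size directory is hugepages-1048576kB (for 2 MB pages it would be hugepages-2048kB):

Code:
# Per-node hugepage counts:
grep . /sys/devices/system/node/node*/hugepages/hugepages-1048576kB/nr_hugepages

# Per-node memory statistics, filtered to hugepage lines:
numastat -m | grep -i huge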

Note: the cpus= setting in the numaX lines (cpus=0-13 in the example above) refers to the CPUs inside the VM (people on the forums call them the 'guest CPUs'). It does not refer to the host machine's CPU affinity or the actual core layout. For core affinity (pinning) on the host, set the affinity parameter in your VM config file as seen above; I've pinned host CPUs 1,33,2,34,3,35,4,36,5,37,6,38,7,39,9,41,10,42,11,43,12,44,13,45,14,46,15,47. (A quick way to verify the pinning after the VM starts is shown below.)
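
To verify where a running VM actually landed, the QEMU process's allowed CPUs and per-node memory usage can be inspected. This is a sketch assuming the usual Proxmox PID file location, /var/run/qemu-server/<vmid>.pid, and VMID 100 as the example:

Code:
# PID of the VM's QEMU process (100 is just the example VMID).
pid=$(cat /var/run/qemu-server/100.pid)

# Which host CPUs the process and all of its threads (vCPUs) may run on:
taskset -a -cp "$pid"

# How much of the VM's memory sits on each host NUMA node:
numastat -p "$pid"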


Hopefully this clears things up for other folks; posts on the forums sometimes assume people already understand all the terms, so I've tried to spell them out here.
 