Not if it is a single socket. You can choose to split a single Ryzen/Threadripper/EPYC into NUMA nodes per CCD/L3 cache in the BIOS of most motherboards. If so, you'll probably want to enable NUMA in Proxmox: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_numa
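For the Proxmox side, a minimal sketch of what that can look like (the VM ID 101 is just an example):

# Show the NUMA topology the host sees after splitting by CCD/L3 in the BIOS
numactl --hardware

# Enable the NUMA option for a VM so QEMU builds a matching guest topology
qm set 101 --numa 1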
- Is an AMD Ryzen/Epyc handled like a multi-socket system?
Well, there is a logic: the process scheduler in the Linux kernel has been NUMA-aware for about 25 years, since before NUMA hardware was common on PCs, and so has VMware.
- How are the CPU cores loaded across sockets? Is it the same for LXC and VMs?
--> Each vCPU of a VM is a task on the host; those tasks rotate randomly between all physical cores without any logic.
fork() and pthread_create() both end up calling clone(), just with different parameters, in Linux. As you can have many processes, you can have as many vCPUs as the guest OS supports (over 200 in Linux without special tuning). Usually programmers should avoid sharing memory, as a lot of locking is involved, which limits performance (Amdahl's law).

This would be true if the kernel knew that the vCPUs (tasks) of a VM belong together.
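As a side note, the clone() relationship is easy to watch on any Linux host; a rough sketch (both commands are only examples):

# A fork()-based workload: the shell forks for each command in the pipeline
strace -f -e trace=clone,clone3 sh -c 'ls | wc -l' >/dev/null

# A pthread_create()-based workload: the new thread also arrives via clone()/clone3(),
# just with different flags (CLONE_VM, CLONE_THREAD, ...)
strace -f -e trace=clone,clone3 python3 -c 'import threading; t = threading.Thread(target=print); t.start(); t.join()' >/dev/null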
That's exactly what I would expect under light load: 4 running processes are distributed across 4 nodes, so they can use all the available CPU cache and memory bandwidth, for example.

But in our case the kernel is not aware, and the vCPU tasks of a single VM start out randomly spread across NUMA nodes.
For example, if you configure the VM with 4 vCPUs, those tasks won't run on the same NUMA node on the host, as you would expect from a NUMA-aware hypervisor.
I assume a Windows OS was running within the VM? The question is, does the kernel within the VM do the best it can to avoid cache thrashing? Windows defines a socket as a NUMA node: https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support

This is a fact and has already been confirmed/tested by multiple people.
The kernel itself does its best in the end, sure, but in the case of the L3 cache you get a performance penalty of 30-300% on multitasking apps inside your VM.
The only way around it is CPU pinning, like I wrote before.
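A rough sketch of what that pinning can look like on the host (VM ID 101 and the core range 0-7 are examples; the affinity option needs a reasonably recent PVE release):

# Map logical CPUs to NUMA nodes first
lscpu -e=CPU,NODE,SOCKET,CORE

# Pin the VM to one NUMA node / L3 domain via the VM config (PVE 7.3+)
qm set 101 --affinity 0-7

# Or pin every thread of the already running QEMU process by hand
for tid in /proc/"$(cat /var/run/qemu-server/101.pid)"/task/*; do
    taskset -cp 0-7 "${tid##*/}"
done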
If that were true, the NUMA hit ratio would also be 50% (random choice in a dual-socket system), but it is not, which can easily be checked with numastat, especially the miss ratio:
root@proxmox7 ~ > numastat
node0 node1
numa_hit 171596628597 97699809177
numa_miss 1051531921 3289900378
numa_foreign 3289900378 1051531921
interleave_hit 86 91
local_node 171591366759 97693974754
other_node 1056758057 3295733604
root@proxmox7 ~ > uptime
22:24:00 up 77 days, 6:26, 1 user, load average: 1,98, 2,40, 2,50
numastat hits/misses don't tell the whole story sadly, we had this already.
One can see that it's not 50% (and not 100% as you claim for VMware), but a 99.4% hit ratio on node0 and 96.7% on node1 is much better than a random 50%.
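If someone wants those ratios without a calculator, a quick sketch over the default numastat layout:

# Per-node hit ratio = numa_hit / (numa_hit + numa_miss)
numastat | awk '
    /^numa_hit/  { n = NF; for (i = 2; i <= NF; i++) hit[i]  = $i }
    /^numa_miss/ {         for (i = 2; i <= NF; i++) miss[i] = $i }
    END { for (i = 2; i <= n; i++)
              printf "node%d hit ratio: %.1f%%\n", i - 2, 100 * hit[i] / (hit[i] + miss[i]) }'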
So you expect the scheduler to analyze your network traffic to find the optimal distribution of processes? It is still faster than most physical networks!

No, I even did a stupid iperf3 test on Linux.
I split my EPYC 9374F by L3 cache (= 8 NUMA nodes per CPU) and tested the iperf3 speed from inside one VM to another VM.
Once with CPU pinning on both VMs (pinned to the same NUMA node) and once without pinning, i.e. the default Proxmox behaviour.
The result with pinning is around 72 Gbit/s vs. 26 Gbit/s without pinning.
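For reference, roughly how such a test can be reproduced (VM IDs, the core range and the server address are examples; the affinity option needs a newer PVE release):

# Pin both VMs to the same NUMA node / L3 domain, e.g. cores 0-7
qm set 101 --affinity 0-7
qm set 102 --affinity 0-7

# Inside VM 101 (server):
iperf3 -s

# Inside VM 102 (client), 10.0.0.101 being VM 101's address:
iperf3 -c 10.0.0.101 -t 30 -P 4

# Remove the pinning again to measure the default behaviour
qm set 101 --delete affinity
qm set 102 --delete affinity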
Yes, EPYC is glued together from 8-core chiplets, while Intel is/was a monolithic chip with a ring bus for internal communication. For the Ryzen 9950X, new firmware improves the latency between the chiplets.

But tbh it doesn't matter whether the guest is Linux or Windows or whatever, it's always the same.
The only difference I found is that on Intel-based servers, even with dual sockets, there is no performance penalty at all (or maybe a very minimal one).
While on EPYC (Milan/Genoa/Bergamo) the penalty is extreme, almost 300%.
Well, it may be affected, but how much? No MySQL will deliver data at network speed. Using a file socket between MySQL and PHP on the same kernel may be a good idea to avoid TCP/IP overhead, but if you need more PHP workers it is not possible. Most software has flaws: PHP runs the same initialization code again and again (addressed by OpenSwoole) and MySQL burns CPU by using spinlocks. Bad SQL queries can cause a 3000% or 30000% penalty.

> That's exactly what I would expect under light load: 4 running processes are distributed across 4 nodes, so they can use all the available CPU cache and memory bandwidth, for example.
Yes, if you have a lot of single-threaded applications, or multithreaded applications where the threads don't need to share data.
But in most cases, even application-to-application communication performance (for example PHP talking to MySQL inside the VM, via TCP or a socket) is affected if it needs to cross nodes.
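Inside a NUMA-aware guest (or on bare metal) you can at least keep both ends of such a conversation on one node; a sketch, with the service commands as placeholders only:

# Bind CPU and memory of both processes to node 0 so their traffic
# stays within one L3/memory domain
numactl --cpunodebind=0 --membind=0 mariadbd --user=mysql &
numactl --cpunodebind=0 --membind=0 php-fpm -F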
Yes, if you really need the last few percent of communication performance you can do it, losing flexibility. But in practice I have not seen communication be the biggest problem of a real application.

All the communication inside your VM, no matter what protocol (except if they talk file-based, lol), goes through memory or, whenever possible, through the L3 cache. It doesn't matter whether it's multitasking thread-to-thread or single-task app-to-app. As soon as the communication needs to cross nodes, there is a hit.
Cheers
I remembered the discussion, yet couldn't find it anymore before posting. It's this one. I just found this document, have you looked into that? Could be the solution if it tracks.
This is, according to your own answers, only valid for (bigger) AMD systems, as you said that Intel does not have a big performance penalty, and that is what I see too. I wonder if you get the same performance penalties with the newer generation high-core-count Intel CPUs that also use a chiplet design.

If you want to use Proxmox, try to avoid NUMA and/or multi-socket systems.
$ dd if=/dev/zero of=/dev/null bs=15M count=10k
10240+0 records in
10240+0 records out
161061273600 bytes (161 GB, 150 GiB) copied, 2.9481 s, 54.6 GB/s
$ dd if=/dev/zero of=/dev/null bs=16M count=10k
10240+0 records in
10240+0 records out
171798691840 bytes (172 GB, 160 GiB) copied, 4.71228 s, 36.5 GB/s
$ dd if=/dev/zero of=/dev/null bs=17M count=10k
10240+0 records in
10240+0 records out
182536110080 bytes (183 GB, 170 GiB) copied, 6.21028 s, 29.4 GB/s
$ dd if=/dev/zero of=/dev/null bs=32M count=10k
10240+0 records in
10240+0 records out
343597383680 bytes (344 GB, 320 GiB) copied, 20.997 s, 16.4 GB/s
$ dd if=/dev/zero of=/dev/null bs=64M count=10k
10240+0 records in
10240+0 records out
687194767360 bytes (687 GB, 640 GiB) copied, 47.6466 s, 14.4 GB/s
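The same kind of test can also be forced onto specific nodes to separate the cache effect from the cross-node effect; a sketch:

# CPU and memory on the same node (local)
numactl --cpunodebind=0 --membind=0 dd if=/dev/zero of=/dev/null bs=64M count=10k

# CPU on node 0, memory allocated on node 1 (remote) for comparison
numactl --cpunodebind=0 --membind=1 dd if=/dev/zero of=/dev/null bs=64M count=10k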
Sure, I read that and tried almost everything on that AMD sheet.
It doesn't matter, because there is no logic on the Proxmox side; the VM CPU tasks are not grouped, and the host kernel doesn't know which tasks belong together.

So you can have either big RAM, or fastest computation, or a compromise. As long as the code and data fit in the CPU cache, like your iperf test, using a single NUMA node for all processes is faster, but for bigger code and data the distribution of processes to different nodes wins due to the better CPU-cache usage. In my opinion the memory bandwidth is a bigger issue than the core-to-core speed.
Ok, at least for the case of a single VM we agree now.
This means it's extremely random whether your VM's CPU tasks run on the same NUMA node, or each on its own NUMA node, or 3 on NUMA node 1 and 1 on NUMA node 2 while NUMA nodes 3/4 stay empty...
If you run only a single VM on Proxmox, there will be no issue, since the CPU tasks will be balanced with good logic. If you run a lot of VMs, it will be an issue.
Again, the issue is simply that the Proxmox host kernel doesn't know which VM tasks belong together, so for the kernel they are all just separate tasks.
So even if you want a VM spread evenly across all NUMA nodes, like in your example/wish, that won't be the case unless you use CPU pinning. It's simply random...
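One way to see that randomness for yourself (VM ID 101 is an example): list the VM's QEMU threads and the processor each one last ran on, then map those CPUs back to NUMA nodes:

# PSR = the logical CPU the thread last ran on; the vCPU threads show up as "CPU x/KVM"
ps -L -o tid,psr,comm -p "$(cat /var/run/qemu-server/101.pid)"

# Which NUMA node each logical CPU belongs to
lscpu -e=CPU,NODE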