Not if it is a single socket. You can choose to split a single Ryzen/Threadripper/EPYC into NUMA nodes per CCD/L3 cache in the BIOS of most motherboards. If so, you'll probably want to enable NUMA in Proxmox: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_numa
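For the Proxmox side, a minimal sketch of what that can look like (the VM ID 101 is just an example):

# Show the NUMA topology the host sees after splitting by CCD/L3 in the BIOS
numactl --hardware

# Enable the NUMA option for a VM so QEMU builds a matching guest topology
qm set 101 --numa 1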
- Is an AMD Ryzen/Epyc handled like a multi-socket system?
Well, there is a logic: the process scheduler in the Linux kernel has been NUMA-aware for about 25 years, since before NUMA hardware was common on PCs, and so has VMware.
- How are the CPU cores loaded across sockets? Is it the same for LXC and VMs?
--> Each vCPU of a VM is a task on the host; those tasks rotate randomly between all physical cores without any logic.
fork() and pthread_create() both end up calling clone(), just with different parameters, in Linux. As you can have many processes, you can have as many vCPUs as the guest OS supports (over 200 in Linux without special tuning). Usually programmers should avoid sharing memory, as a lot of locking is involved, which limits performance (Amdahl's law).

This would be true if the kernel knew that the vCPUs (tasks) of a VM belong together.
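As a side note, the clone() relationship is easy to watch on any Linux host; a rough sketch (both commands are only examples):

# A fork()-based workload: the shell forks for each command in the pipeline
strace -f -e trace=clone,clone3 sh -c 'ls | wc -l' >/dev/null

# A pthread_create()-based workload: the new thread also arrives via clone()/clone3(),
# just with different flags (CLONE_VM, CLONE_THREAD, ...)
strace -f -e trace=clone,clone3 python3 -c 'import threading; t = threading.Thread(target=print); t.start(); t.join()' >/dev/null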
That's exactly what I would expect under light load: 4 running processes are distributed across 4 nodes, so they can use all the available CPU cache and memory bandwidth, for example.

But in our case the kernel is not aware, and the vCPU tasks of a single VM start out randomly spread across NUMA nodes.
For example, if you configure the VM with 4 vCPUs, those tasks won't run on the same NUMA node on the host, as you would expect from a NUMA-aware hypervisor.
I assume a Windows OS was running within the VM? The question is, does the kernel within the VM do the best it can to avoid cache thrashing? Windows defines a socket as a NUMA node: https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support

This is a fact and has already been confirmed/tested by multiple people.
The kernel itself does its best in the end, sure, but in the case of the L3 cache you get a performance penalty of 30-300% on multitasking apps inside your VM.
The only way around it is CPU pinning, like I wrote before.
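A rough sketch of what that pinning can look like on the host (VM ID 101 and the core range 0-7 are examples; the affinity option needs a reasonably recent PVE release):

# Map logical CPUs to NUMA nodes first
lscpu -e=CPU,NODE,SOCKET,CORE

# Pin the VM to one NUMA node / L3 domain via the VM config (PVE 7.3+)
qm set 101 --affinity 0-7

# Or pin every thread of the already running QEMU process by hand
for tid in /proc/"$(cat /var/run/qemu-server/101.pid)"/task/*; do
    taskset -cp 0-7 "${tid##*/}"
done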
If that were true, the NUMA hit ratio would also be 50% (random choice in a dual-socket system), but it is not, which can easily be checked with numastat, especially the miss ratio:
root@proxmox7 ~ > numastat
node0 node1
numa_hit 171596628597 97699809177
numa_miss 1051531921 3289900378
numa_foreign 3289900378 1051531921
interleave_hit 86 91
local_node 171591366759 97693974754
other_node 1056758057 3295733604
root@proxmox7 ~ > uptime
22:24:00 up 77 days, 6:26, 1 user, load average: 1,98, 2,40, 2,50
numastat hits/misses don't tell the whole story sadly, we had this already.
One can see that it's not 50% (and not 100% as you claim for VMware), but a 99.4% hit ratio on node0 and 96.7% on node1 is much better than a random 50%.
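If someone wants those ratios without a calculator, a quick sketch over the default numastat layout:

# Per-node hit ratio = numa_hit / (numa_hit + numa_miss)
numastat | awk '
    /^numa_hit/  { n = NF; for (i = 2; i <= NF; i++) hit[i]  = $i }
    /^numa_miss/ {         for (i = 2; i <= NF; i++) miss[i] = $i }
    END { for (i = 2; i <= n; i++)
              printf "node%d hit ratio: %.1f%%\n", i - 2, 100 * hit[i] / (hit[i] + miss[i]) }'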
So you expect the scheduler to analyze your network traffic to find the optimal distribution of processes? It is still faster than most physical networks!

No, I even did a stupid iperf3 test on Linux.
I split my EPYC 9374F by L3 cache (= 8 NUMA nodes per CPU) and tested the iperf3 speed from inside one VM to another VM.
Once with CPU pinning on both VMs (pinned to the same NUMA node) and once without pinning, i.e. the default Proxmox behaviour.
The result with pinning is around 72 Gbit/s vs. 26 Gbit/s without pinning.
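For reference, roughly how such a test can be reproduced (VM IDs, the core range and the server address are examples; the affinity option needs a newer PVE release):

# Pin both VMs to the same NUMA node / L3 domain, e.g. cores 0-7
qm set 101 --affinity 0-7
qm set 102 --affinity 0-7

# Inside VM 101 (server):
iperf3 -s

# Inside VM 102 (client), 10.0.0.101 being VM 101's address:
iperf3 -c 10.0.0.101 -t 30 -P 4

# Remove the pinning again to measure the default behaviour
qm set 101 --delete affinity
qm set 102 --delete affinity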
Yes, EPYC is glued together from 8-core chiplets, while Intel is/was a monolithic chip with a ring bus for internal communication. For the Ryzen 9950X, new firmware improves the latency between the chiplets.

But tbh it doesn't matter whether the guest is Linux or Windows or whatever, it's always the same.
The only difference I found is that on Intel-based servers, even with dual sockets, there is no performance penalty at all (or maybe a very minimal one).
While on EPYC (Milan/Genoa/Bergamo) the penalty is extreme, almost 300%.
Well, it may be affected, but how much? No MySQL will deliver data at network speed. Using a file socket between MySQL and PHP on the same kernel may be a good idea to avoid TCP/IP overhead, but if you need more PHP workers it is not possible. Most software has flaws: PHP runs the same initialization code again and again (addressed by OpenSwoole) and MySQL burns CPU by using spinlocks. Bad SQL queries can cause a 3000% or 30000% penalty.

> That's exactly what I would expect under light load: 4 running processes are distributed across 4 nodes, so they can use all the available CPU cache and memory bandwidth, for example.
Yes, if you have a lot of single-threaded applications, or multithreaded applications where the threads don't need to share data.
But in most cases, even application-to-application communication performance (for example PHP talking to MySQL inside the VM, via TCP or a socket) is affected if it needs to cross nodes.
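Inside a NUMA-aware guest (or on bare metal) you can at least keep both ends of such a conversation on one node; a sketch, with the service commands as placeholders only:

# Bind CPU and memory of both processes to node 0 so their traffic
# stays within one L3/memory domain
numactl --cpunodebind=0 --membind=0 mariadbd --user=mysql &
numactl --cpunodebind=0 --membind=0 php-fpm -F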
Yes, if you really need the last few percent of communication performance you can do it, losing flexibility. But in practice I have not seen communication be the biggest problem of a real application.

All the communication inside your VM, no matter what protocol (except if they talk file-based, lol), goes through memory or, whenever possible, through the L3 cache. It doesn't matter whether it's multitasking thread-to-thread or single-task app-to-app. As soon as the communication needs to cross nodes, there is a hit.
Cheers
I remembered the discussion, yet couldn't find it anymore before posting. It's this one. I just found this document, have you looked into that? Could be the solution if it tracks.
This is, according to your own answers, only valid for (bigger) AMD systems, as you said that Intel does not have a big performance penalty, and that is what I see too. I wonder if you get the same performance penalties with the newer generation high-core-count Intel CPUs that also use a chiplet design.

If you want to use Proxmox, try to avoid NUMA and/or multi-socket systems.
$ dd if=/dev/zero of=/dev/null bs=15M count=10k
10240+0 records in
10240+0 records out
161061273600 bytes (161 GB, 150 GiB) copied, 2.9481 s, 54.6 GB/s
$ dd if=/dev/zero of=/dev/null bs=16M count=10k
10240+0 records in
10240+0 records out
171798691840 bytes (172 GB, 160 GiB) copied, 4.71228 s, 36.5 GB/s
$ dd if=/dev/zero of=/dev/null bs=17M count=10k
10240+0 records in
10240+0 records out
182536110080 bytes (183 GB, 170 GiB) copied, 6.21028 s, 29.4 GB/s
$ dd if=/dev/zero of=/dev/null bs=32M count=10k
10240+0 records in
10240+0 records out
343597383680 bytes (344 GB, 320 GiB) copied, 20.997 s, 16.4 GB/s
$ dd if=/dev/zero of=/dev/null bs=64M count=10k
10240+0 records in
10240+0 records out
687194767360 bytes (687 GB, 640 GiB) copied, 47.6466 s, 14.4 GB/s
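The same kind of test can also be forced onto specific nodes to separate the cache effect from the cross-node effect; a sketch:

# CPU and memory on the same node (local)
numactl --cpunodebind=0 --membind=0 dd if=/dev/zero of=/dev/null bs=64M count=10k

# CPU on node 0, memory allocated on node 1 (remote) for comparison
numactl --cpunodebind=0 --membind=1 dd if=/dev/zero of=/dev/null bs=64M count=10k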
Sure, I read that and tried almost everything on that AMD sheet.
It doesn't matter, because there is no logic on the Proxmox side; the VM CPU tasks are not grouped, and the host kernel doesn't know which tasks belong together.

So you can have either big RAM, or fastest computation, or a compromise. As long as the code and data fit in the CPU cache, like your iperf test, using a single NUMA node for all processes is faster, but for bigger code and data the distribution of processes to different nodes wins due to the better CPU-cache usage. In my opinion the memory bandwidth is a bigger issue than the core-to-core speed.
Ok, at least for the case of a single VM we agree now.
This means it's extremely random whether your VM's CPU tasks run on the same NUMA node, or each on its own NUMA node, or 3 on NUMA node 1 and 1 on NUMA node 2 while NUMA nodes 3/4 stay empty...
If you run only a single VM on Proxmox, there will be no issue, since the CPU tasks will be balanced with good logic. If you run a lot of VMs, it will be an issue.
Again, the issue is simply that the Proxmox host kernel doesn't know which VM tasks belong together, so for the kernel they are all just separate tasks.
So even if you want a VM spread evenly across all NUMA nodes, like in your example/wish, that won't be the case unless you use CPU pinning. It's simply random...
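One way to see that randomness for yourself (VM ID 101 is an example): list the VM's QEMU threads and the processor each one last ran on, then map those CPUs back to NUMA nodes:

# PSR = the logical CPU the thread last ran on; the vCPU threads show up as "CPU x/KVM"
ps -L -o tid,psr,comm -p "$(cat /var/run/qemu-server/101.pid)"

# Which NUMA node each logical CPU belongs to
lscpu -e=CPU,NODE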