It's called split tables or something like that, and it's available on any Genoa/Milan, but you'll usually get 8 or even 16 NUMA domains per CPU, depending on the core count.
One domain per L3 cache.
Each domain has only 8 cores.
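If you want to sanity-check what the BIOS actually handed you, the quickest way is to look at the NUMA nodes the kernel sees (numactl --hardware or lscpu shows the same thing). A minimal sketch, assuming a normal Linux host with sysfs mounted:

```python
#!/usr/bin/env python3
# Quick sketch: list the NUMA nodes the kernel sees and the cores in each.
# With L3-as-NUMA enabled you should get one node per L3 cache, each with
# at most 8 cores (16 threads with SMT).
import glob
import os

for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*"),
                   key=lambda p: int(p.rsplit("node", 1)[1])):
    with open(os.path.join(node, "cpulist")) as f:
        print(f"{os.path.basename(node)}: cpus {f.read().strip()}")
```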
This is presented at the BIOS level with MADT/L3-as-NUMA, on Dell systems only today. ACPI spec:
https://uefi.org/sites/default/files/resources/ACPI_5_1release.pdf, page 138. Dell BIOS manual -
https://dl.dell.com/manuals/common/dell-emc-poweredge-15g-set-up-bios.pdf, page 26 & page 27.
- HPE supported this on some of their AMD systems but not all. SMC, Gigabyte, and Asus refuse to adopt this. Here is my request to SMC, both on Reddit and via a ticket, to adopt the EFI standard. It yielded nothing:
https://www.reddit.com/r/supermicro/comments/k5q0ex/req_adding_madt_ccx_as_numa_to_h11_and_h12_amd/
Managing this with CPU pinning, without Proxmox having any support to align the cores of one VM to a single NUMA node, is insane.
It's not even possible if you have a cluster and migrate VMs around.
10000% agree with you. This is why I pushed the EFI changes down to the SI 4-5 years ago, and it's also why I only run Dell AMD-powered servers today. That way core pinning is a non-issue and you maintain the desired performance across NUMA domains.
If the multi-threading application inside your VM uses, for example, 6 cores and those aren't on the same NUMA node, the whole L3 cache basically stops helping.
Those tasks of the application cannot share data with the other tasks over the L3 cache, so it goes over memory instead, which is insanely slower.
In the end your multi-threading app runs at around 33% of the speed it could run at.
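You can actually see this from the host: take the PID of the VM's KVM process and check which NUMA node each of its threads last ran on. A rough illustration (my own sketch, nothing Proxmox ships), using only /proc and /sys:

```python
#!/usr/bin/env python3
# Rough illustration (not a Proxmox tool): show which NUMA node each
# thread of a process last ran on.  Usage: python3 thread_nodes.py <pid>
import glob
import sys

def cpu_to_node():
    """Map cpu id -> NUMA node using the node*/cpu* symlinks in sysfs."""
    mapping = {}
    for path in glob.glob("/sys/devices/system/node/node[0-9]*/cpu[0-9]*"):
        parts = path.split("/")
        mapping[int(parts[-1][3:])] = int(parts[-2][4:])
    return mapping

def last_cpu(stat_path):
    """Field 39 of /proc/<pid>/task/<tid>/stat is the CPU the task last ran on."""
    with open(stat_path) as f:
        after_comm = f.read().rsplit(")", 1)[1]   # skip pid and (comm)
    return int(after_comm.split()[36])            # 39th field overall

pid = sys.argv[1]
nodes = cpu_to_node()
for task in sorted(glob.glob(f"/proc/{pid}/task/[0-9]*"),
                   key=lambda p: int(p.split("/")[-1])):
    cpu = last_cpu(task + "/stat")
    print(f"tid {task.split('/')[-1]}: cpu {cpu} -> numa node {nodes.get(cpu, '?')}")
```

If the threads of one multi-threaded workload show up on several nodes, they are not sharing an L3.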
Not quite how it works. This is classified as an L3 cache miss: the L3 cache lines shared between NUMA X cores and NUMA Y cores get refetched, causing a drop/dip in performance. Left untuned this can add 26ns-96ns of latency, which scales with the vCPU core count across the VM. The more L3 domains that are not classified up through the VM, the higher that latency gets. I have measured 240ns of latency on VMs spanning 8 physical NUMA domains with a unified virtual NUMA.
Nice write up on this -
http://www.staroceans.org/cache_coherency.htm
So we can safely say that on AMD Epyc (Milan/Genoa) platforms, if people don't pin CPU cores, every multi-threading application will run around 3x slower.
Not quite, there are many ways to handle this.
- The easiest is to throw money at the build and make sure your CCDs are 8 cores wide and your VMs do not exceed 8 virtual cores. If they do exceed 8 cores, then make sure you limit the VM to two NUMA domains and build the virtual NUMA topology to match the physical, virtual sockets and all. Just remember, SMT is not used in a meaningful way with KVM under Proxmox today.
- The hardest is to stick with Dell, enable L3 as NUMA and MADT=round robin, and make sure your VMs are mapped 1:1 for vCPU:socket:NUMA until the virtual cores wrap back around onto the same CCDs, or allow the OSE to suffer the latency and control it on the OSE/application side with affinity protection. In this model you can leverage SMT, and since the siblings share L3 cache you can wrap twice before needing to break out into more virtual sockets.
- The insane way is to build core topology maps, cron-schedule core pinning, and follow your HA movements with the same planning (roughly along the lines of the sketch below). This is a pain in the ass to plan for and mistakes will happen.
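For what that planning looks like in practice, here is a rough, hypothetical sketch: build the node topology from sysfs and emit one pinning command per VM so each VM stays inside a single NUMA node. The VM list is made up, and it assumes the usual /run/qemu-server/<vmid>.pid location for the running KVM process; newer Proxmox releases also have a per-VM affinity option, which is the cleaner route if you have it.

```python
#!/usr/bin/env python3
# Hypothetical sketch of the "insane way": build a core topology map and
# emit a pinning command per VM so each VM stays inside one NUMA node.
import glob

# vmid -> number of vCPUs (made-up example values, replace with your own)
VMS = {101: 8, 102: 4, 103: 6}

def node_cpus():
    """Return {numa_node: [cpu, ...]} parsed from the sysfs cpulist files."""
    nodes = {}
    for path in glob.glob("/sys/devices/system/node/node[0-9]*/cpulist"):
        node = int(path.split("/")[-2].replace("node", ""))
        cpus = []
        with open(path) as f:
            for part in f.read().strip().split(","):
                lo, _, hi = part.partition("-")
                cpus.extend(range(int(lo), int(hi or lo) + 1))
        nodes[node] = sorted(cpus)
    return nodes

free = dict(sorted(node_cpus().items()))

for vmid, vcpus in VMS.items():
    # first node that can still hold the whole VM on its own cores
    node = next((n for n, c in free.items() if len(c) >= vcpus), None)
    if node is None:
        print(f"# VM {vmid}: does not fit inside one node, needs manual planning")
        continue
    chosen, free[node] = free[node][:vcpus], free[node][vcpus:]
    cpulist = ",".join(str(c) for c in chosen)
    # taskset re-pins every thread of the running KVM process; assumes the
    # usual Proxmox pidfile location (adjust if yours differs)
    print(f'taskset -a -c -p {cpulist} "$(cat /run/qemu-server/{vmid}.pid)"  # node {node}')
```

And yes, you would have to re-run something like this after every HA move, which is exactly why I call it the insane way.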
On Ryzen it's a completely different story: it's one chiplet, and even if your Ryzen has 2x L3 cache, they are shared.
On Genoa/Milan the L3 cache is NOT shared across CCDs, and that's the issue.
This greatly depends on the architecture you are looking at. Zen 2 (3000 series, 4000 series, and some re-badged low-end 5000 series) has dual CCXs inside its CCDs. Even so, the higher core count SKUs have two CCDs and the same NUMA topology found in their Epyc counterparts. 3900X? Two CCDs, four CCXs, 12 cores split 3+3 / 3+3. 5950X? Two CCDs, two CCXs, 16 cores split 8 / 8. CCXs are isolated L3 cache domains inside their CCD, while modern CCDs that do not contain split CCXs are unified across all cores. The same truth applies to Ryzen/Threadripper/Epyc alike.
No CCD's L3 cache is shared BETWEEN CCDs - that is the definition of NUMA here. But the L3 cache is completely shared INSIDE the CCD for all 7003/9004/9006/9008 Epyc SKUs.
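You don't have to take either of our words for it: you can read the L3 boundaries straight out of sysfs and group cores by which L3 they report as shared. Quick sketch, works the same on Ryzen, Threadripper, and Epyc:

```python
#!/usr/bin/env python3
# Sketch: group CPUs by the set of cores they share an L3 with.
# Each group is one L3 domain - a CCD on Zen 3/4, a CCX on Zen 2.
import glob

groups = {}
for cpu_dir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*"):
    cpu = int(cpu_dir.rsplit("cpu", 1)[1])
    for idx in glob.glob(cpu_dir + "/cache/index*"):
        with open(idx + "/level") as f:
            if f.read().strip() != "3":
                continue
        with open(idx + "/shared_cpu_list") as f:
            groups.setdefault(f.read().strip(), set()).add(cpu)
        break

for i, (shared, cpus) in enumerate(sorted(groups.items(), key=lambda kv: min(kv[1]))):
    print(f"L3 domain {i}: {len(cpus)} cpus -> {shared}")
```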
On Intel it works somewhat differently; I'd have to dig into that, but none of my Intel servers needed any sort of NUMA tuning, except of course if they are dual/quad socket.
Intel is monolithic in design and does not have a NUMA presence within the socket today for Xeon (see my next comment...). Big.Little is their first approach to NUMA-like behavior with the E and P cores. It's coming to Xeon sooner rather than later, and when it does you can place your bets that Intel is going to have the exact same issues as we see on Epyc here.
I think that Intel's interconnect between the chiplets on the CPU is simply insanely faster than on the AMD side.
And earlier Intel CPUs didn't have the issue anyway, because they were monolithic.
Intel does not use chiplets in GA production Xeons yet; they use a single monolithic silicon die. However, they do have one MCM design with the 56-core Xeon, but I have only seen it in the real world once, and it was powering a really complicated finance system (I cannot say more...). The cost was 4x what you would pay for a 64-core Epyc 7003 part, and that Xeon was not without its own issues (such as 560W per socket under the application load...).
So as a conclusion, Genoa/Milan is definitively slower with multi-threading apps. Up to 3x.
And there is no way around that other than CPU pinning.
Ryzen/Intel will be 2-3x faster on multi-threading apps for every normal user.
Because no one will pin CPUs here; it's not even manageable on clusters with migration.
Your conclusion is factually wrong and missing a ton of vital information. I get that you have Epyc servers that are not living up to their potential. I don't know what vendor they are from, or how they are configured. But to say that Ryzen is faster than Epyc just tells me you do not have your servers configured optimally, or you went with a vendor that is not "fully" supporting Epyc to spec.
I am more than happy to help if you want to take this offline in messages/Reddit/Discord, but be mindful that I am in the middle of my own project of migrating from VMware to Proxmox across 6 campuses on a mix of AMD and Intel hardware.