It's called split tables or something like that, and it's available on any Genoa/Milan, but you'll usually get 8 or even 16 NUMA domains per CPU, depending on the core count.
One domain per L3 cache.
Each domain has only 8 cores.
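If you want to sanity-check what the BIOS actually handed you, the quickest way is to look at the NUMA nodes the kernel sees (numactl --hardware or lscpu shows the same thing). A minimal sketch, assuming a normal Linux host with sysfs mounted:

```python
#!/usr/bin/env python3
# Quick sketch: list the NUMA nodes the kernel sees and the cores in each.
# With L3-as-NUMA enabled you should get one node per L3 cache, each with
# at most 8 cores (16 threads with SMT).
import glob
import os

for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*"),
                   key=lambda p: int(p.rsplit("node", 1)[1])):
    with open(os.path.join(node, "cpulist")) as f:
        print(f"{os.path.basename(node)}: cpus {f.read().strip()}")
```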
This is presented at the BIOS level with MADT/L3-as-NUMA, on Dell systems only today. ACPI spec:
https://uefi.org/sites/default/files/resources/ACPI_5_1release.pdf, page 138. Dell BIOS manual -
https://dl.dell.com/manuals/common/dell-emc-poweredge-15g-set-up-bios.pdf, page 26 & page 27.
- HPE supported this on some of their AMD systems but not all. SMC, Gigabyte, and Asus refuse to adopt this. Here is my request to SMC, both on Reddit and via a ticket, to adopt the EFI standard. It yielded nothing:
https://www.reddit.com/r/supermicro/comments/k5q0ex/req_adding_madt_ccx_as_numa_to_h11_and_h12_amd/
Managing this with CPU pinning, without Proxmox having any support to align the cores of one VM to a single NUMA node, is insane.
It's not even possible if you have a cluster and migrate VMs around.
10000% agree with you. This is why I pushed the EFI changes down to the SI 4-5 years ago, and it's also why I only run Dell AMD-powered servers today. That way core pinning is a non-issue and you maintain the desired performance across NUMA domains.
If the multi-threading application inside your VM uses, for example, 6 cores and those aren't on the same NUMA node, the whole L3 cache basically stops helping.
Those tasks of the application cannot share data with the other tasks over the L3 cache, so it goes over memory instead, which is insanely slower.
In the end your multi-threading app runs at around 33% of the speed it could run at.
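You can actually see this from the host: take the PID of the VM's KVM process and check which NUMA node each of its threads last ran on. A rough illustration (my own sketch, nothing Proxmox ships), using only /proc and /sys:

```python
#!/usr/bin/env python3
# Rough illustration (not a Proxmox tool): show which NUMA node each
# thread of a process last ran on.  Usage: python3 thread_nodes.py <pid>
import glob
import sys

def cpu_to_node():
    """Map cpu id -> NUMA node using the node*/cpu* symlinks in sysfs."""
    mapping = {}
    for path in glob.glob("/sys/devices/system/node/node[0-9]*/cpu[0-9]*"):
        parts = path.split("/")
        mapping[int(parts[-1][3:])] = int(parts[-2][4:])
    return mapping

def last_cpu(stat_path):
    """Field 39 of /proc/<pid>/task/<tid>/stat is the CPU the task last ran on."""
    with open(stat_path) as f:
        after_comm = f.read().rsplit(")", 1)[1]   # skip pid and (comm)
    return int(after_comm.split()[36])            # 39th field overall

pid = sys.argv[1]
nodes = cpu_to_node()
for task in sorted(glob.glob(f"/proc/{pid}/task/[0-9]*"),
                   key=lambda p: int(p.split("/")[-1])):
    cpu = last_cpu(task + "/stat")
    print(f"tid {task.split('/')[-1]}: cpu {cpu} -> numa node {nodes.get(cpu, '?')}")
```

If the threads of one multi-threaded workload show up on several nodes, they are not sharing an L3.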
Not quite how it works. This is classified as an L3 cache miss: the L3 cache lines shared between NUMA X cores and NUMA Y cores get refetched, causing a drop/dip in performance. Left untuned this can add 26ns-96ns of latency, which scales with the vCPU core count across the VM. The more L3 domains that are not classified up through the VM, the higher that latency gets. I have measured 240ns of latency on VMs spanning 8 physical NUMA domains with a unified virtual NUMA.
Nice write up on this -
http://www.staroceans.org/cache_coherency.htm
So we can safely say that on AMD Epyc (Milan/Genoa) platforms, if people don't pin CPU cores, every multi-threading application will run around 3x slower.
Not quite, there are many ways to handle this.
- The easiest is to throw money at the build and make sure your CCDs are 8 cores wide and your VMs do not exceed 8 virtual cores. If they do exceed 8 cores, then make sure you limit the VM to two NUMA domains and build the virtual NUMA topology to match the physical, virtual sockets and all. Just remember, SMT is not used in a meaningful way with KVM under Proxmox today.
- The hardest is to stick with Dell, enable L3 as NUMA and MADT=round robin, and make sure your VMs are mapped 1:1 for vCPU:socket:NUMA until the virtual cores wrap back around onto the same CCDs, or allow the OSE to suffer the latency and control it on the OSE/application side with affinity protection. In this model you can leverage SMT, and since the siblings share L3 cache you can wrap twice before needing to break out into more virtual sockets.
- The insane way is to build core topology maps, cron-schedule core pinning, and follow your HA movements with the same planning (roughly along the lines of the sketch below). This is a pain in the ass to plan for and mistakes will happen.
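For what that planning looks like in practice, here is a rough, hypothetical sketch: build the node topology from sysfs and emit one pinning command per VM so each VM stays inside a single NUMA node. The VM list is made up, and it assumes the usual /run/qemu-server/<vmid>.pid location for the running KVM process; newer Proxmox releases also have a per-VM affinity option, which is the cleaner route if you have it.

```python
#!/usr/bin/env python3
# Hypothetical sketch of the "insane way": build a core topology map and
# emit a pinning command per VM so each VM stays inside one NUMA node.
import glob

# vmid -> number of vCPUs (made-up example values, replace with your own)
VMS = {101: 8, 102: 4, 103: 6}

def node_cpus():
    """Return {numa_node: [cpu, ...]} parsed from the sysfs cpulist files."""
    nodes = {}
    for path in glob.glob("/sys/devices/system/node/node[0-9]*/cpulist"):
        node = int(path.split("/")[-2].replace("node", ""))
        cpus = []
        with open(path) as f:
            for part in f.read().strip().split(","):
                lo, _, hi = part.partition("-")
                cpus.extend(range(int(lo), int(hi or lo) + 1))
        nodes[node] = sorted(cpus)
    return nodes

free = dict(sorted(node_cpus().items()))

for vmid, vcpus in VMS.items():
    # first node that can still hold the whole VM on its own cores
    node = next((n for n, c in free.items() if len(c) >= vcpus), None)
    if node is None:
        print(f"# VM {vmid}: does not fit inside one node, needs manual planning")
        continue
    chosen, free[node] = free[node][:vcpus], free[node][vcpus:]
    cpulist = ",".join(str(c) for c in chosen)
    # taskset re-pins every thread of the running KVM process; assumes the
    # usual Proxmox pidfile location (adjust if yours differs)
    print(f'taskset -a -c -p {cpulist} "$(cat /run/qemu-server/{vmid}.pid)"  # node {node}')
```

And yes, you would have to re-run something like this after every HA move, which is exactly why I call it the insane way.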
On Ryzen it's a completely different story: it's one chiplet, and even if your Ryzen has 2x L3 cache, they are shared.
On Genoa/Milan the L3 cache is NOT shared across CCDs, and that's the issue.
This greatly depends on the architecture you are looking at. Zen 2 (3000 series, 4000 series, and some re-badged low-end 5000 series) has dual CCXs inside its CCDs. Even so, the higher core count SKUs have two CCDs and the same NUMA topology found in their Epyc counterparts. 3900X? Two CCDs, four CCXs, 12 cores split 3+3 / 3+3. 5950X? Two CCDs, two CCXs, 16 cores split 8 / 8. CCXs are isolated L3 cache domains inside their CCD, while modern CCDs that do not contain split CCXs are unified across all cores. The same truth applies to Ryzen/Threadripper/Epyc alike.
No CCD's L3 cache is shared BETWEEN CCDs - that is the definition of NUMA here. But the L3 cache is completely shared INSIDE the CCD for all 7003/9004/9006/9008 Epyc SKUs.
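You don't have to take either of our words for it: you can read the L3 boundaries straight out of sysfs and group cores by which L3 they report as shared. Quick sketch, works the same on Ryzen, Threadripper, and Epyc:

```python
#!/usr/bin/env python3
# Sketch: group CPUs by the set of cores they share an L3 with.
# Each group is one L3 domain - a CCD on Zen 3/4, a CCX on Zen 2.
import glob

groups = {}
for cpu_dir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*"):
    cpu = int(cpu_dir.rsplit("cpu", 1)[1])
    for idx in glob.glob(cpu_dir + "/cache/index*"):
        with open(idx + "/level") as f:
            if f.read().strip() != "3":
                continue
        with open(idx + "/shared_cpu_list") as f:
            groups.setdefault(f.read().strip(), set()).add(cpu)
        break

for i, (shared, cpus) in enumerate(sorted(groups.items(), key=lambda kv: min(kv[1]))):
    print(f"L3 domain {i}: {len(cpus)} cpus -> {shared}")
```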
On Intel it works somewhat differently; I'd have to dig into that, but none of my Intel servers needed any sort of NUMA tuning, except of course if they are dual/quad socket.
Intel is monolithic in design and does not have a NUMA presence within the socket today for Xeon (see my next comment...). Big.Little is their first approach to NUMA-like behavior with the E and P cores. It's coming to Xeon sooner rather than later, and when it does you can place your bets that Intel is going to have the exact same issues as we see on Epyc here.
I think that Intel's interconnect between the chiplets on the CPU is simply insanely faster than on the AMD side.
And earlier Intel CPUs didn't have the issue anyway, because they were monolithic.
Intel does not use chiplets in GA production Xeons yet; they use a single monolithic silicon die. However, they do have one MCM design with the 56-core Xeon, but I have only seen it in the real world once, and it was powering a really complicated finance system (I cannot say more...). The cost was 4x what you would pay for a 64-core Epyc 7003 part, and that Xeon was not without its own issues (such as 560W per socket under the application load...).
So as a conclusion, Genoa/Milan is definitively slower with multi-threading apps. Up to 3x.
And there is no way around that other than CPU pinning.
Ryzen/Intel will be 2-3x faster on multi-threading apps for every normal user.
Because no one will pin CPUs here; it's not even manageable on clusters with migration.
Your conclusion is factually wrong and missing a ton of vital information. I get that you have Epyc servers that are not living up to their potential. I don't know what vendor they are from, or how they are configured. But to say that Ryzen is faster than Epyc just tells me you do not have your servers configured optimally, or you went with a vendor that is not "fully" supporting Epyc to spec.
I am more than happy to help if you want to take this offline in messages/Reddit/Discord, but be mindful that I am in the middle of my own project of migrating from VMware to Proxmox across 6 campuses on a mix of AMD and Intel hardware.