Hardware Feedback - Proxmox Ceph Cluster (3-4 nodes)

adresner

After going down many rabbit holes, I have finally come to the conclusion that the best solution (for my office) is a Proxmox cluster with 4 nodes. Depending on my final build, I might be able to get by with only 3.

For now, I will use both Proxmox Backup Server and Veeam to back up my VMs to a TrueNAS box and replicate those backups to 2 remote locations.

The hardware is off-lease from a trusted supplier. I have been using their gear for 5 years now, with excellent support.

Use case: 10 VMs running MS Exchange, Active Directory, SQL, reporting and accounting software, a file server, an anti-virus server, a Mattermost server, Nextcloud, and an array of LXCs with little services. Currently using Hyper-V on a single server, backed up with Veeam to TrueNAS. I had been thinking about a TrueNAS box behind a Proxmox or xcp-ng hypervisor, but that's a single point of failure: even if I add 3 Proxmox nodes, if the TrueNAS goes down, I'm down. So I circled back to a Proxmox Ceph cluster: 3 nodes, but I realize 4 is better.

I was going to use 25GbE but realize I need a 100GbE mesh network, based on the excellent benchmark guide posted here.

Still learning about Ceph: is 8 drives per node too many? Are the drives too big? According to https://florian.ca/ceph-calculator/ I'll have about 19TB of safely usable space. Correct?
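Here's my rough attempt at reproducing that number so people can check my math. I'm not sure exactly what formula the calculator uses, so the ~25% headroom factor at the end is my own guess at its "safely usable" margin:

Code:
# 4 nodes x 8 OSDs x 3.2 TB, default replicated pool size=3
# raw capacity:                                4 * 8 * 3.2      = 102.4 TB
# divided by 3 replicas:                       102.4 / 3        = ~34.1 TB absolute max
# must still fit on 3 nodes if one node dies:  (3 * 8 * 3.2) / 3 = 25.6 TB
# minus ~25% headroom so a re-balance never hits the full ratio: ~19.2 TB
awk 'BEGIN { printf "%.1f TB safely usable\n", (3 * 8 * 3.2) / 3 * 0.75 }'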

Went back and forth on Intel vs AMD, SSD vs NVMe, etc. Settled on the following build.

Dell PowerEdge R7425 24-Bay NVMe 2.5" 2U Rackmount Server
512GB [8x 64GB] DDR4 PC4-3200AA ECC RDIMM
2x AMD EPYC 7532 2.4GHz 32 Core 256MB 200W 2nd Gen Processor
8x Dell 3.2TB NVMe SSD 2.5" Gen3 MU Solid State Drive [51.2TB Raw]
Dell Dual Port 10GBASE-T + Dual Port 1GBASE-T rNDC | Intel X550 I350
PCIe Slot: Dell Dual Port 25GB SFP28 PCI-E CNA | Intel XXV710-DA2
PCIe Slot: Dell Dual M.2 6G PCI-E BOSS-S1 Controller + 2x Dell PE 120GB SATA SSD M.2 6Gbps RI Solid State Drives
2x Dell PE 1600W 100-240V 80+ Platinum AC Power Supplies
**Add in 100GbE networking; still waiting on that quote.

Would appreciate any feedback on my concepts or hardware choices (like whether I'm a bonehead for going AMD?).

Thanks!
 
From what I have read, 3 nodes is the minimum but it's also not optimal? If one node goes down, you cannot recover? That 4+ is safer?

Alternatively, if someone only has 2 nodes, can they set up replication and use that instead of a Ceph cluster?
 
From what I have read, 3 nodes is the minimum but it's also not optimal?
"Optimal" depends on the use case. When one node fail the whole system is degraded. No way for automatic repair.

That 4+ is safer?
It depends ;-)

With four nodes, one node may fail and Ceph will re-balance the data on the OSDs, reconstructing the lost redundant copies of the data. At the end everything is fine = NOT degraded.
Alternatively, if someone only has 2 nodes, can they set up replication and use that instead of a Ceph cluster?
Sure! ZFS replication is a completely different beast and it works fine. (A cheap quorum device (QDevice) will be sufficient to keep quorum in the cluster.)
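For completeness, a minimal sketch of adding such a QDevice to a two-node cluster. The quorum host here is an imaginary small Debian box at 192.168.1.50; adjust to your network:

Code:
# on the external quorum host (any small always-on Debian machine)
apt install corosync-qnetd
# on every cluster node
apt install corosync-qdevice
# on one cluster node, register the QDevice
pvecm qdevice setup 192.168.1.50
# confirm the extra vote
pvecm status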


Side note: you also need to differentiate the requirements of Ceph and PVE. They are constructed similarly, and the terms are sometimes identical, but they are independent and NOT identical when you look at the details. A simple example for illustration: you may have 10 PVE/Ceph nodes with 10 OSDs each. Without any explicit configuration the defaults are active. Now for PVE, four nodes may fail and HA will automatically recover the lost VMs. For Ceph, only one single node may fail and it will recover from this automatically. A second failing node (during re-balance) will possibly bring the whole cluster to a stop when a placement group goes read-only. (Again: the default setting is "size=3/min_size=2". You could increase "size=x" to increase the redundancy in Ceph. If Ceph shall tolerate the loss of four nodes then "size=6/min_size=2" would suffice.)
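To illustrate those defaults in practice (the pool name "vm-pool" is only a placeholder):

Code:
# show the replication settings of an existing pool
ceph osd pool get vm-pool size
ceph osd pool get vm-pool min_size
# raise the number of copies (costs proportionally more raw space)
ceph osd pool set vm-pool size 4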

Best regards
 
=) Thank you

No automatic repair, but is there a way to manually repair? If an OSD (I'm trying to learn the verbiage here, OSD means drive?) on node 3 fails, and I replace that OSD or have a hot spare, Ceph won't repair itself? If node 3's mainboard is borked or the (dual) PSUs die and I have to replace the hardware and then it comes back up, will the Ceph cluster repair?

My use case is to run VMs that keep my company going. Can we go down for an hour or a day to fix things or restore? Yes we can... it's not e-comm. We are a distributor. I would like to create a situation that is self-healing so I can manage my IT from remote locations and not have to worry too much about physically replacing hardware if I am not there. I'm also using second-hand equipment, which is from a reliable source, but you never know. I'm pretty sure I can buy 4 nodes, but might have to start with 3.

And the rebalance cannot take place with 3 nodes. If I cannot afford the 3 nodes, I will go in the ZFS replication direction with a QDevice. Just googled that and found https://forum.proxmox.com/threads/pve5-and-quorum-device.37183/page-2#post-252807

I plan to run both PBS and Veeam. If Ceph goes sideways, I will always have a backup. My PBS will be on its own hardware, and separately Veeam will back up to TrueNAS, also separate from the cluster.

I hear you on the Ceph settings; I have read that it should be tuned to my use, and I'll be at 4 nodes within 6 months of the first 3. It will probably take me 12+ months to migrate my Hyper-V over and make sure my new structure is set up right. I will ask those Ceph questions separately.

Any comments on the hardware selection? Thinking of buying 1 and testing it out first.
 
No automatic repair, but is there a way to manually repair? If an OSD (I'm trying to learn the verbiage here, OSD means drive?) on node 3 fails, and I replace that OSD or have a hot spare, Ceph won't repair itself? If node 3's mainboard is borked or the (dual) PSUs die and I have to replace the hardware and then it comes back up, will the Ceph cluster repair?
Yes.
Having three nodes and one failing is fine. It cannot automatically repair that bad state, but adding another computer supplying new replacement OSDs is expected to work. In the end Ceph wants to be able to use three computers with at least one OSD each at minimum. Then it is able to fulfill that default "size=3" requirement. An OSD is an Object Storage Daemon. It is software, but it usually stands for "one physical disc" at the same time.
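To sketch the manual part of an OSD replacement (again, I am not a Ceph specialist, so treat this as a rough outline; osd.5 and the device path are placeholders):

Code:
# take the dead OSD out and remove it from the cluster
ceph osd out osd.5
pveceph osd destroy 5 --cleanup
# create a replacement OSD on the new disk
pveceph osd create /dev/nvme2n1
# watch the recovery / re-balance progress
ceph -s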

Having three nodes with one OSD each is the absolute minimum one could use in a productive setting. This does not mean that it is recommended. Personally I would aim for five nodes. Each node having four discs for Ceph (and two for the PVE OS). With at least 10 GBit/s physically redundant network and switches. And a UPS. And ECC RAM. And... and... and. Your hardware choice is actually aiming higher :-)

Ceph scales into infinity, which is really great. For a small cluster with only a few nodes I really prefer ZFS with replication. This implies other deficits, like not being shared storage - ZFS is always local only! A killed node will always result in some data loss. But in my own world I can actually live with that. (Replication can be scheduled to run every few minutes.)

Please compare https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_pveceph and https://docs.ceph.com/en/latest/start/hardware-recommendations/#minimum-hardware-recommendations

Disclaimer, not sure if I mentioned this: I am NOT a Ceph specialist...


Any comments on the hardware selection?
That Dell looks great - but I cannot judge the details!
 
Since you want full remote support here, make sure your server hardware has iDRAC Enterprise so remote management is a breeze.

Looking at your builds, you want to change a couple things.

First off, you are running dual-socket servers with only 4 DIMMs per socket. You really need 8 DIMMs per socket for AMD Epyc to shine, due to how localized the CCDs are to NUMA. You want each CCD to have access to dual-channel memory. With 4 DIMMs per socket it will be localized single-channel, where dual/quad-channel bandwidth has to reach across the IOD, increasing latency. I suggest looking at the R6515 or R7515 and pricing out a single 64-core CPU; you might find it cheaper than two 32-core CPUs, and then you can maintain 8 DIMMs per socket.

Don't forget about Windows licensing: you MUST license every single core in your cluster that Windows touches, or else you must fence your nodes. For that core count you might want to consider Datacenter licensing; otherwise you need to buy Standard a few times over. You can be selective for SQL based on core needs, but SQL 2022 requires SA for virtualization rights, or you must pay the Azure SQL subscription license now.

When you pre-deploy on Dell Epyc servers, change the BIOS over to CCX as NUMA and MDAT to Round Robin, instead of Linear. As you scale out your VMs' virtual cores they will light up more resources on that socket, while staying within memory UMA. If you keep VMs linear then you are limited to your CCX/CCD-local resources. 7002 is split into CCXs inside the CCD; as such your 32-core CPU has 2+2 cores in each CCD and eight CCDs to make up the 32 cores. To get more resources out of each VM you either need to change the BIOS setup or give VMs more cores, or else you will find your smaller 2-4 core VMs are limited to "dual channel" memory bandwidth, limited access to PCIe bus speeds, and limited L3 cache.
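One way to sanity-check the resulting topology from the PVE host after flipping those BIOS options (numactl may need installing first; the number of NUMA nodes you see depends on your exact settings):

Code:
apt install numactl
# list the NUMA nodes the host exposes, with their CPUs and memory
numactl --hardware
lscpu | grep -i numa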

Secondly, you didn't specify your NVMe SSD class. Are they high-DWPD endurance SSDs or a lower class? Are you buying the storage used from this gray-market seller, or new? Ceph has a ton of back-end writes that can burn down NAND if we use the wrong class of SSDs. Since you are running Exchange and database engines on this setup, you want to make sure your SSDs can handle the writes. Case in point: I have a small Ceph cluster of consumer trash NVMe drives that are down to 12% endurance after a 3-week MSSQL torture run :)
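If you do end up buying used drives, check the remaining endurance before they go into the cluster (assumes nvme-cli and smartmontools are installed and /dev/nvme0 is the drive in question):

Code:
# look at "percentage_used": how much of the rated write endurance is already gone (0% = new)
nvme smart-log /dev/nvme0
# smartmontools reports the same counters plus model and firmware
smartctl -a /dev/nvme0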

I am on the fence about needing 100GbE for your setup. I know what the benchmarks say, but unless you are scaling out to a dozen+ nodes you might not need it. Running 25GbE with LACP across your four nodes will be fine until you need to scale out. That aside, you want to break out your Ceph front-end and back-end networks, and have a dedicated network for your VMs' LAN traffic. So if you need more NICs to get that done, 100GbE makes sense to cover Ceph while leaving 25GbE for the VM LAN traffic.
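For reference, the front-end/back-end split is just two subnets handed to Ceph at init time on PVE (the subnets below are made-up examples):

Code:
# public network = client/monitor traffic (front end), cluster network = OSD replication (back end)
pveceph init --network 10.10.10.0/24 --cluster-network 10.10.20.0/24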

Last, with four nodes you want a QDevice. I have had four-node clusters split-brain and do weird stuff; the only fix was to have 5-node clusters or 4 + QDevice. Don't make that mistake, just plan for a 5th node or adopt a QDevice. Additionally, you can be selective on which nodes get Ceph and which ones don't. As long as all nodes live on the Ceph front-end network, you can export the RBD/CephFS pools to all hosts in that network. This way you can dedicate nodes to compute, compared to a mixed HCI deployment.
 
For a small cluster with only a few nodes I really prefer ZFS with replication.
ZFS replication is also much, much higher performance for databases. If your database(s) are fairly actively used, then Ceph in small clusters can be a problem due to the network latency added to every transaction, unlike doing database transactions against local storage, where you can do millions of IOs per second.

ZFS replication can be configured to push replication changes at a user-defined interval, with 1-minute replication intervals being the shortest configurable gap. That 1-minute delay can be unacceptable for a lot of databases too, so it's all a matter of "do your research and pick the best option for your use cases". :rolleyes:
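For example, such a job set up from the CLI could look like this (VM 100, target node "pve2", the 15-minute schedule and the rate limit are all illustrative values; the same thing can be configured in the GUI under Replication):

Code:
# replicate VM 100 to node pve2 every 15 minutes, capped at 100 MB/s
pvesr create-local-job 100-0 pve2 --schedule "*/15" --rate 100
# list configured jobs and check their status
pvesr list
pvesr status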
 
You guys are incredible and I'm so grateful for all the advice.

Personally I would aim for five nodes. Each node having four discs for Ceph (and two for the PVE OS). With at least 10 GBit/s physically redundant network and switches.
I have read this a few times, where people mention having fewer drives per node? Or have I misunderstood things? Unfortunately 4x 3.2TB drives won't cut it. That's 9TB of usable data? https://florian.ca/ceph-calculator/ Is there a problem with using 8x3.2 or should I try to get 4x larger capacity drives? I need around 25 to 30TB. I feel really ignorant asking this. https://docs.ceph.com/en/latest/start/hardware-recommendations/#hard-disk-drives

The other option isn't looking so bad: set up 2 nodes with ZFS and replication. For my use case, this might be all I need? Add in backups (PBS and Veeam), UPS of course. If we replicate (worst case) every 15-30 min, that's going to be enough if something goes wrong. It also sounds like I could get better performance? @justinclift Give up joining the Ceph club for now :( My reporting software only updates 2x per day. My accounting team can recreate 30 min of work in a few minutes. I have done a lot worse to them! Sent them back a few days before.

Feels like if I can't miss a minute of data, Ceph; but if I can miss up to a day (or down to 1 min with replication)... maybe ZFS replication for my small case can work?
 
ZFS replication is also much, much higher performance for databases. If your database(s) are fairly actively used, then Ceph in small clusters can be a problem due to the network latency added to every transaction, unlike doing database transactions against local storage, where you can do millions of IOs per second.

ZFS replication can be configured to push replication changes at a user-defined interval, with 1-minute replication intervals being the shortest configurable gap. That 1-minute delay can be unacceptable for a lot of databases too, so it's all a matter of "do your research and pick the best option for your use cases". :rolleyes:
This might be me... you got me really thinking about ZFS and replication now.
 
Since you want full remote support here, make sure your server hardware has iDRAC Enterprise so remote management is a breeze.

Looking at your builds, you want to change a couple things.

First off, you are running dual-socket servers with only 4 DIMMs per socket. You really need 8 DIMMs per socket for AMD Epyc

I use iDRAC Enterprise now.. it's great, it let me know about a drive failure today :D

Easy fix with the memory; my actual quote is for 16 sticks / 1TB of RAM. I was sloppy in cutting that in half.. and my reseller wouldn't let that happen.

Understand all the licensing... will be leaving a Hyper-V Datacenter license, sad about that. Best thing about Hyper-V .. a Datacenter license.
When you pre-deploy on Dell Epyc servers, change the BIOS over to CCX as NUMA and MDAT to Round Robin,.......
This is amazing detail! =) I barely understand it - can you explain this to me a bit more? Am I going to use CCX and not NUMA? And the MDAT setting set to Round Robin?

Secondly, you didn't specify your NVMe SSD class.
It's a good question and I'll ask the Dell reseller.. if I go ZFS with replication instead of Ceph, then this will matter less?

I am on the fence about needing 100GbE for your setup. I know what the benchmarks say, but unless you are scaling out to a dozen+ nodes you might not need it. Running 25GbE with LACP across your four nodes will be fine until you need to scale out. That aside, you want to break out your Ceph front-end and back-end networks, and have a dedicated network for your VMs' LAN traffic. So if you need more NICs to get that done, 100GbE makes sense to cover Ceph while leaving 25GbE for the VM LAN traffic.

Last, with four nodes you want a QDevice. I have had four-node clusters split-brain and do weird stuff; the only fix was to have 5-node clusters or 4 + QDevice. Don't make that mistake, just plan for a 5th node or adopt a QDevice. Additionally, you can be selective on which nodes get Ceph and which ones don't. As long as all nodes live on the Ceph front-end network, you can export the RBD/CephFS pools to all hosts in that network. This way you can dedicate nodes to compute, compared to a mixed HCI deployment.
Dedicated ZFS replication or Ceph network at 25GbE, either through a switch or directly connected.. I can always upgrade the NIC. If I go ZFS, I could set up a 100GbE direct-connect replication network and 25GbE VM traffic to the switch.. which has a 25GbE uplink to the 10GbE switch.. and a 10GbE uplink to the 1GbE user network.

I currently have a Dell R730 and a 13900K Supermicro workstation, both running PVE. If I add in 4 identical nodes for Ceph/compute, sounds like I can easily set up one of the current servers as the QDevice? Or am I better off with a little Raspberry Pi?

Thank you! =)
 
Is there a problem with using 8x3.2 or should I try to get 4x larger capacity drives?
No, of course not. :)

Homelab users tend to use too few drives. The higher the number of OSDs the better!

Example: three nodes with two OSDs each and the default rule "size=3,min_size=2". When one disk fails, the other one in this node has to store double the amount of data it had before, as re-balancing has to happen on this very node.

For this to happen successfully, that drive needs to have more than 50% of its space free. So both drives can only ever be filled below 50% (better: below 40%) of their available volume. This is problematic because it is a behavior which surprises some users ;-)

When you have 8 OSDs and one fails, re-balancing will happen on this node and there is a much better chance that the lost 12.5% can be distributed over the other 7 neighbors without overloading their free space.
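Two commands worth knowing to keep an eye on exactly this once the cluster is running:

Code:
# per-OSD fill levels, so you can see whether a failed OSD's data would still fit elsewhere
ceph osd df tree
# the nearfull / backfillfull / full thresholds currently in effect
ceph osd dump | grep ratio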


And: the more OSDs you have the more IOPS you get. Even if this might not be relevant with the hardware you chose :-)

Best regards
 
@adresner How feasible is it for you to put together an initial "test lab" of stuff (cheaply) for testing various ideas in, prior to making the final decisions about the production hardware and software configuration?

Asking because it sounds like having an extra 1-2 months time for trying this stuff "hands on" in a test environment would be super useful prior to making hardware/software configuration and deployment decisions. :)


I'd personally go the route of grabbing some representative servers from eBay, adding some cheap-arse high-speed networking (2nd-hand Mellanox adapters), cheap-arse 2nd-hand SAS SSDs, etc.

From a quick look on ebay.com just now (I don't know the sellers at all):
  • Maybe something like this R730xd: https://www.ebay.com/itm/204779478529
    • Because you mentioned already having an R730, so it should be fairly familiar
    • That eBay item has only 128GB ram, so depending on what level of testing you're doing you might want to bump that up
  • Cheap arse high speed networking: https://www.ebay.com/itm/384097637844
    • Those are dual port 25GbE network adapters. The cabling will probably cost as much, but it's not too eye watering if bought through (say) fs.com
  • Maybe a few of these (or similar): https://www.ebay.com/itm/326085063403
    • This model is 10DWPD, so probably pretty decent
    • Storage sizing really depends on how much storage is needed for reasonably accurate mock testing
 
This is amazing detail! =) I barely understand it - can you explain this to me a bit more? Am I going to use CCX and not NUMA? And the MDAT setting set to Round Robin?
AMD's cores are clustered in L3 cache NUMA domains. These are called CCDs. For Epyc 7002 each CCD has two CCX sub-domains, while Epyc 7003+ does not have the CCX sub-split. This is how AMD was able to reach high core counts and not break the 280W per-socket power limitation (staying eco-green) while maintaining ~3.2GHz per core under max load. Meanwhile the memory channels are unified in a central IOD, making memory physically uniform. The main problem with this is virtualization, because of how vCPUs are treated as normal threads on the host. It means smaller-core VMs will be limited to the resources inside of those CCDs and the singular path to the IOD (96GB/s reads, 26GB/s writes) across the PCIe Infinity Fabric. On top of that, memory mapping happens local to the CCD in the IOD, and you don't actually get much performance above dual-channel DDR4-3200/DDR5-5600 when running virtual loads. Add to that the fact that a single 7002 Epyc core is capable of pushing 9.2GB/s, and you can quickly saturate an 8-core CCD with a very high compute load.

To combat this, one might want to spread those vCPU threads as evenly across the socket as possible, but without creating memory NUMA, as that does create latency. That's where these two BIOS options come in. CCX as NUMA puts the server into multiple NUMA domains based on the CCX (7002) or CCD (7003+) count, while keeping the IOD unified. The other option is MDAT=Round Robin; this reorders the CPU core initiation table so that the first CPU on each CCX/CCD is addressed first before going back to the same CCX/CCD. This allows VMs to spread their compute across the socket, so that they all benefit from the Infinity Fabric pathing into the IOD, the same bandwidth access into the PCIe bus for direct-IO pathing, and access to multiple L3 cache domains.

Of course, when creating multiple physical NUMA domains you have to map them up through to your VM too. The recommended VM deployment would be 1 vCPU per vSocket so that the virtual OSE knows about the L3 cache domains. If you need to run 16 vCPUs then it would be 2 vCPUs per vSocket to map the NUMA domains correctly. As such, an Epyc 7532 has 8 CCDs (2c+2c) and the central IOD for 8 memory channels. The above BIOS settings will put a dual-socket system with two of these CPUs into 32 NUMA domains bound by the L3 cache topology found down to the CCX layer.
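On the Proxmox side, that mapping for a 16 vCPU guest might look like this (the VM ID and the exact topology are example values; the important part is enabling NUMA for the guest):

Code:
# 8 virtual sockets x 2 cores = 16 vCPUs, with NUMA topology passed to the guest
qm set 100 --sockets 8 --cores 2 --numa 1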

Now if you find that you do not need the hardware spread for your systems, then you do not need to do any of the above. But I have seen a 25-user finance DB system push 140GB/s in memory while maintaining 1,500,000 IOPS due to how poorly those queries/SPs were written. Coming off Intel, where everything is monolithic in socket design and true UMA, this is knowledge everyone deploying on AMD needs to have.
 
The recommendation then is to add a QDev --> five votes --> TWO nodes may fail
It's worth noting exactly what may fail in this context.

The number of NODES isn't the important factor for Ceph - that's the number required for cluster quorum (PVE). The important number for Ceph is the number of service providers. Service providers in this context are OSDs, MONitors, MGRs, MDSs (for CephFS) and RGWs (gateways). If you are keeping Ceph nodes and PVE cluster node members separate, then yes, any two nodes may fail in the above configuration.

In practice, if you have two OSD-bearing nodes out simultaneously you WILL have file system lockup issues, because some PGs will not have enough members to operate. Thankfully, server hardware with sufficient redundancy (the OP's example is a good one) is REALLY unlikely to fail, so the only consideration should be how long of an outage to deal with when updating/rebooting.
 
Understand all the licensing... will be leaving a Hyper-V Datacenter license, sad about that. Best thing about Hyper-V .. a Datacenter license.
Quoted this out specifically, because you still need to ensure you have the core-count licensing in Windows Datacenter or Standard for your virtual machines on the hosts, even when running KVM/VMware/other. Hyper-V licensing has nothing to do with it. I have seen a lot of companies bitten by this, in this year alone. All it takes is one pissed-off employee to report you to the BSA, get you audited, and be found non-compliant, and Microsoft is not playing around.

MS SQL needs to be considered too. If you are running SQL 2022 you MUST pay for those virtualization rights: either an active SA on the licensing, or the Azure SQL subscription model.

I am tired of seeing people get bitten with an insane fine + fees + service handling because of these mistakes. The last audit I was brought in on cost the company 1.3 million, and they were only about 150 users, just saying.
 
@adresner How feasible is it for you to put together an initial "test lab" of stuff (cheaply) for testing various ideas in, prior to making the final decisions about the production hardware and software configuration?
So on this note, if the OP plays hardball with Dell, there is a customer-facing lab up in NorCal (Bay Area) that can be used for this. My advice on this would be to leverage "Nutanix" against Proxmox to get Dell interested.

Alternatively, SHI has an "innovation" lab in Texas for the same purpose, but it's a paid engagement.

Both are better options for this than dropping money on used hardware for testing.
 