hardware renewal for three node PVE/Ceph cluster

Hi,

after almost five years with our three-node PVE/Ceph cluster, it's now hardware renewal time!

Core requirements are:
- about 24 TB of usable storage (fast and scalable)
- about 512 GB RAM per node (scalable)

Unfortunately we can't go with AMD EPYC CPUs because of Oracle.
Together with Thomas Krenn we created the following suggestion (per node):

- 2x Intel Xeon Gold 6444Y
- 512 GB RAM (8x 64 GB DDR5 4800)
- 2x 480 GB Samsung PM897 (OS and software)
- 4x 6.4 TB Kioxia CM-7V SIE U.3 NVMe SSD (Ceph)
- 2x Broadcom P425G (4x 25G)
- 1x Broadcom P225P (2x 25G)

You can find the details here:
https://www.thomas-krenn.com/loadproduct?id=y36736c694751a1fa&lang=de

Ceph will get a full mesh network with "routed setup with round-robin bonds" with 2x 25G per node.
VM network will be 2x 25G (2x 10G at the switch side)

I would be glad to hear your opinion! Is something missing or inconsistent?

Thanks and greets
Stephan
 
- about 24 TB of usable storage (fast and scalable)

- 4x 6.4 TB Kioxia CM-7V SIE U.3 NVMe SSD (Ceph)
Will you go with the usual 3/2 copies rule?

I am by no means a Ceph specialist, but: when (not if) one OSD fails, the other OSDs on that node need to compensate for it. If you want this situation to be handled as documented (and not stay degraded for a possibly long time, or run into a "disk full" panic), you can effectively only use three of those four --> 19.2 TB.

And OSDs should not be filled above roughly 90%. Above about 17 TB per node, even a "normal" problem might get you into trouble.
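
A quick back-of-the-envelope check of that reasoning (just a sketch; the 0.9 is the rough 90% fill limit mentioned above, not an exact Ceph default):

Code:
# per-node capacity if one of four 6.4 TB OSDs fails (illustrative numbers only)
osd_size_tb = 6.4
osds_per_node = 4

raw_per_node = osds_per_node * osd_size_tb              # 25.6 TB raw
after_one_failure = (osds_per_node - 1) * osd_size_tb    # 19.2 TB left to hold everything
with_headroom = after_one_failure * 0.9                  # ~17.3 TB before it gets tight

print(raw_per_node, after_one_failure, round(with_headroom, 1))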

If my not-very-well-founded understanding is right I would go for five OSDs...

Best regards
 
Thanks for this - as far as I understand, you're right: when an OSD fails, the pool becomes degraded, which means the pool keeps working, but there is no redundancy left. But I think that's OK for us. Three times the usable storage must be enough. :D
 
You might consider adding an extra 1G network for Corosync. It is always advisable to have at the very least one dedicated network for Corosync. Corosync does not require much bandwidth but latency is extremely important, and sharing the network with other network intensive processes (e.g. Ceph, backups, migrations, etc) can easily create issues.
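
For reference, a dedicated Corosync link is just a separate ring address per node in /etc/pve/corosync.conf; a minimal sketch, assuming a made-up 10.10.1.0/24 subnet for Corosync (a second link could later be added as ring1_addr in the same way):

Code:
# excerpt from /etc/pve/corosync.conf (sketch; names and addresses are examples)
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.1.1   # dedicated Corosync network
  }
  # pve2 / pve3 analogous with 10.10.1.2 / 10.10.1.3
}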
 
Yes, we already took this into account. Networking is planned like this:

available:
2x 10G onboard
10x 25G Broadcom cards

usage:
1x 10G management (PVE WebGUI)
1x 10G Corosync
4x 25G for Ceph "bonded full mesh"
2x 25G VM network (bonded)
2x 25G VM Migration (full mesh)
1x 25G Backup
 
I'm guessing the Samsung PM897 is an enterprise SSD with PLP (power-loss protection), otherwise consumer SSDs will burn out.

May want to look into a full-mesh broadcast setup for Ceph.
 
May want to look into a full-mesh broadcast setup for Ceph.
If he needed a mesh network he'd be in real trouble with all his other links ;)

With the assumption that you have enough switch ports for all your NICs, I would only comment that your networking is massive overkill for a three-node cluster with 12 OSDs. Other than that, I'd suggest you have a second link for Corosync, and I assume your BMCs are just not noted.
 
If he needed a mesh network he'd be in real trouble with all his other links
I still didn't get the point.
I try to clarify the network setup:

these are connections to our switches:
1x 10G management (PVE WebGUI)
1x 10G Corosync
2x 25G VM network (bonded)

and these are direct connections, no switches involved:
4x 25G for Ceph "bonded full mesh"
2x 25G VM Migration (full mesh)
1x 25G Backup

I'd suggest you have a second link for Corosync
Yes, in a perfect world with enough NICs that would be nice. Unfortunately there are no further options.

and I assume your BMCs are just not noted.
Good point. I guess there's a dedicated BMC, but I will check this.
 
PVE doesn't use a dedicated "VM migration" network. You can use those links for your second Corosync network.

Edit: also, since you're meshed, there's not much point in bonding your Ceph interfaces. You'd get better utility from keeping two networks (one per pair), one for Ceph public and one for Ceph private networking. It'll work out the same for bandwidth, but you'll have better latency for both.
 
PVE doesn't use a dedicated "VM migration" network.
Yes, you can: https://pve.proxmox.com/wiki/Manual:_datacenter.cfg
At the moment our live migrations use the switches and therefore the (cross-room) links between the switches. These are 2x 10G and can easily be saturated by live migrations. That leads to higher latency between the switches, which was never a problem, but doesn't feel good.
This is why we plan to build a full mesh for live migration to relieve the switches. And as a nice side effect, live migration will get 25G instead of 10G.
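
For reference, the migration network is a single line in /etc/pve/datacenter.cfg; a sketch, with a made-up 10.10.3.0/24 as the migration mesh subnet:

Code:
# /etc/pve/datacenter.cfg (excerpt; the subnet is only an example)
migration: secure,network=10.10.3.0/24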

Also, since you're meshed, there's not much point in bonding your Ceph interfaces.
https://www.proxmox.com/en/download...cumentation/proxmox-ve-ceph-benchmark-2023-12
In these benchmarks you can read that "2x 25 Gbit/s Routed" outperforms every 1x 25G solution. It's even close to a 100G mesh. I got this idea from Thomas Krenn.
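
In case anyone wonders what that looks like in practice: roughly one balance-rr bond per peer plus /32 routes, along the lines of the routed full-mesh examples in the Proxmox docs. A rough sketch only; interface names and the 10.99.99.0/24 addressing are invented:

Code:
# /etc/network/interfaces excerpt on node 1 (sketch only)
auto bond0
iface bond0 inet static
        address 10.99.99.1/24
        bond-slaves enp65s0f0 enp65s0f1
        bond-mode balance-rr
        bond-miimon 100
        # node 2 is reachable only via this direct bond
        up ip route add 10.99.99.2/32 dev bond0
        down ip route del 10.99.99.2/32
# bond1 (towards node 3) looks the same, with a route to 10.99.99.3/32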

two networks (one per pair), one for Ceph public and one for Ceph private networking.
We considered separating these networks, but in "normal" three-node clusters this doesn't seem to be that important. In fact we ran it that way for the last five years (Ceph public and Ceph private on the same network), and we didn't face any issues.
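
If we ever do want to split them, it's only two settings in /etc/pve/ceph.conf (the subnets below are placeholders); only the OSD replication traffic moves to the cluster network, clients and MONs stay on the public one:

Code:
# /etc/pve/ceph.conf excerpt (sketch; subnets are placeholders)
[global]
        public_network = 10.99.99.0/24
        cluster_network = 10.99.98.0/24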
 
I never knew that, thank you! The point still stands: you can still use that same interface as your second Corosync ring.
In these benchmarks you can read that "2x 25 Gbit/s Routed" outperforms every 1x 25G solution. It's even close to a 100G mesh. I got this idea from Thomas Krenn.
These are maximum values from synthetic benchmarks. In your actual use case it may or may not have any impact at all.
But in "normal" three-node clusters this doesn't seem to be that important.
You are correct, since there are no cross-node PG migrations. However, each LAG necessarily has the latency of a single link; two LAGs would give you better latency than one. It's up to you to decide what is more advantageous.
 
