hardware renewal for three node PVE/Ceph cluster

Hi,

after almost five years with our three-node PVE/Ceph cluster, it's now hardware renewal time!

Core requirements are:
- about 24 TB of usable storage (fast and scalable)
- about 512 GB RAM per node (scalable)

Unfortunately we can't go with AMD EPYC CPUs because of Oracle.
Together with Thomas Krenn we created the following suggestion (per node):

- 2x Intel Xeon Gold 6444Y
- 512 GB RAM (8x 64 GB DDR5 4800)
- 2x 480 GB Samsung PM897 (OS and software)
- 4x 6.4 TB Kioxia CM-7V SIE U.3 NVMe SSD (Ceph)
- 2x Broadcom P425G (4x 25G)
- 1x Broadcom P225P (2x 25G)

You can find the details here:
https://www.thomas-krenn.com/loadproduct?id=y36736c694751a1fa&lang=de

Ceph will get a full mesh network with "routed setup with round-robin bonds" with 2x 25G per node.
VM network will be 2x 25G (2x 10G at the switch side)

I would be glad to hear your opinion! Is something missing or inconsistent?

Thanks and greets
Stephan
 
- about 24 TB of usable storage (fast and scalable)

- 4x 6.4 TB Kioxia CM-7V SIE U.3 NVMe SSD (Ceph)
Will you go with the usual 3/2 copies rule?

I am by no means a Ceph specialist, but: when (not if) one OSD fails, the other OSDs on that node need to compensate for it. If you want this situation to be handled as documented (and not stay degraded for a possibly long time, or run into a "disk full" panic), you can effectively only use three of those four --> 19.2 TB.

And OSDs should not be filled above roughly 90%. Above about 17 TB per node, even a "normal" problem might get you into trouble.
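
A quick back-of-the-envelope check of that reasoning (just a sketch; the 0.9 is the rough 90% fill limit mentioned above, not an exact Ceph default):

Code:
# per-node capacity if one of four 6.4 TB OSDs fails (illustrative numbers only)
osd_size_tb = 6.4
osds_per_node = 4

raw_per_node = osds_per_node * osd_size_tb              # 25.6 TB raw
after_one_failure = (osds_per_node - 1) * osd_size_tb    # 19.2 TB left to hold everything
with_headroom = after_one_failure * 0.9                  # ~17.3 TB before it gets tight

print(raw_per_node, after_one_failure, round(with_headroom, 1))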

If my not-very-well-founded understanding is right I would go for five OSDs...

Best regards
 
Thanks for this - as far as I understand, you're right: when an OSD fails, the pool becomes degraded, which means the pool keeps working, but there is no redundancy left. But I think that's OK for us. Three times the usable storage must be enough. :D
 
You might consider adding an extra 1G network for Corosync. It is always advisable to have at the very least one dedicated network for Corosync. Corosync does not require much bandwidth but latency is extremely important, and sharing the network with other network intensive processes (e.g. Ceph, backups, migrations, etc) can easily create issues.
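
For reference, a dedicated Corosync link is just a separate ring address per node in /etc/pve/corosync.conf; a minimal sketch, assuming a made-up 10.10.1.0/24 subnet for Corosync (a second link could later be added as ring1_addr in the same way):

Code:
# excerpt from /etc/pve/corosync.conf (sketch; names and addresses are examples)
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.1.1   # dedicated Corosync network
  }
  # pve2 / pve3 analogous with 10.10.1.2 / 10.10.1.3
}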
 
Yes, we already took this into account. Networking is planned like this:

available:
2x 10G onboard
10x 25G Broadcom cards

usage:
1x 10G management (PVE WebGUI)
1x 10G Corosync
4x 25G for Ceph "bonded full mesh"
2x 25G VM network (bonded)
2x 25G VM Migration (full mesh)
1x 25G Backup
 
I'm guessing the Samsung PM897 is an enterprise SSD with PLP (power-loss protection), otherwise consumer SSDs will burn out.

May want to look into a full-mesh broadcast setup for Ceph.
 
May want to look into a full-mesh broadcast setup for Ceph.
If he needed a mesh network he'd be in real trouble with all his other links ;)

With the assumption that you have enough switch ports for all your NICs, I would only comment that your networking is massive overkill for a three-node cluster with 12 OSDs. Other than that, I'd suggest you have a second link for Corosync, and I assume your BMCs are just not noted.
 
If he needed a mesh network he'd be in real trouble with all his other links
I still didn't get the point.
I try to clarify the network setup:

these are connections to our switches:
1x 10G management (PVE WebGUI)
1x 10G Corosync
2x 25G VM network (bonded)

and these are direct connections, no switches involved:
4x 25G for Ceph "bonded full mesh"
2x 25G VM Migration (full mesh)
1x 25G Backup

I'd suggest you have a second link for Corosync
Yes, in a perfect world with enough NICs that would be nice. Unfortunately there are no further options.

and I assume your BMCs are just not noted.
Good point. I guess there's a dedicated BMC, but I will check this.
 
PVE doesn't use a dedicated "VM migration" network. You can use those links for your second Corosync network.

Edit: also, since you're meshed, there's not much point in bonding your Ceph interfaces. You'd get better utility from keeping two networks (one per pair), one for Ceph public and one for Ceph private networking. It'll work out the same for bandwidth, but you'll have better latency for both.
 
PVE doesn't use a dedicated "VM migration" network.
Yes, you can: https://pve.proxmox.com/wiki/Manual:_datacenter.cfg
At the moment our live migrations use the switches and therefore the (cross-room) links between the switches. These are 2x 10G and can easily be saturated by live migrations. That leads to higher latency between the switches, which was never a problem, but doesn't feel good.
This is why we plan to build a full mesh for live migration to relieve the switches. And as a nice side effect, live migration will get 25G instead of 10G.
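
For reference, the migration network is a single line in /etc/pve/datacenter.cfg; a sketch, with a made-up 10.10.3.0/24 as the migration mesh subnet:

Code:
# /etc/pve/datacenter.cfg (excerpt; the subnet is only an example)
migration: secure,network=10.10.3.0/24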

Also, since you're meshed, there's not much point in bonding your Ceph interfaces.
https://www.proxmox.com/en/download...cumentation/proxmox-ve-ceph-benchmark-2023-12
In these benchmarks you can read that "2x 25 Gbit/s Routed" outperforms every 1x 25G solution. It's even close to a 100G mesh. I got this idea from Thomas Krenn.
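
In case anyone wonders what that looks like in practice: roughly one balance-rr bond per peer plus /32 routes, along the lines of the routed full-mesh examples in the Proxmox docs. A rough sketch only; interface names and the 10.99.99.0/24 addressing are invented:

Code:
# /etc/network/interfaces excerpt on node 1 (sketch only)
auto bond0
iface bond0 inet static
        address 10.99.99.1/24
        bond-slaves enp65s0f0 enp65s0f1
        bond-mode balance-rr
        bond-miimon 100
        # node 2 is reachable only via this direct bond
        up ip route add 10.99.99.2/32 dev bond0
        down ip route del 10.99.99.2/32
# bond1 (towards node 3) looks the same, with a route to 10.99.99.3/32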

two networks (one per pair), one for Ceph public and one for Ceph private networking.
We considered separating these networks, but in "normal" three-node clusters this doesn't seem to be that important. In fact we ran it that way for the last five years (Ceph public and Ceph private on the same network), and we didn't face any issues.
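
If we ever do want to split them, it's only two settings in /etc/pve/ceph.conf (the subnets below are placeholders); only the OSD replication traffic moves to the cluster network, clients and MONs stay on the public one:

Code:
# /etc/pve/ceph.conf excerpt (sketch; subnets are placeholders)
[global]
        public_network = 10.99.99.0/24
        cluster_network = 10.99.98.0/24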
 
I never knew that, thank you! The point still stands: you can still use that same interface as your second Corosync ring.
In these benchmarks you can read that "2x 25 Gbit/s Routed" outperforms every 1x 25G solution. It's even close to a 100G mesh. I got this idea from Thomas Krenn.
These are maximum values from synthetic benchmarks. In your actual use case it may or may not have any impact at all.
But in "normal" three-node clusters this doesn't seem to be that important.
You are correct, since there are no cross-node PG migrations. However, each LAG necessarily has the latency of a single link; two LAGs would give you better latency than one. It's up to you to decide what is more advantageous.
 
