Is a 3-Node Full Mesh Setup for Ceph and Corosync Good or Bad?

cluster expansion
We do not foresee any cluster expansion in the future. We sized the potential new hardware based on our current hardware, and the current 2-node cluster has plenty of spare capacity. We would most likely just scale each node up (bigger CPUs, more RAM, more disks) if needed.

can you imagine how much less cable mess we would have that way
That is a big reason why we want to go 25Gb and mesh. All the mesh network cables would be nice and tight next to the nodes. On a tangent, if we did go with a switched architecture, we would need 2x10Gb links, bonded, for both the Ceph private and public networks = 4 cables per node. That doesn't connect us to redundant switches, though...

mental quest for me - I like it!
I appreciate the thoughtfulness of your reply!
 
We have a few production PVE+Ceph clusters using a custom FRR setup with fallback, running Corosync both inside the mesh and on the outside NICs. Zero issues at all for years. We usually deploy them when a customer has a tight budget and requires Ceph on 25G+ (or plans to add more nodes in the foreseeable future). If a 10G network is enough, we just buy switches, as they are affordable enough and simplify adding/replacing nodes.
This sounds pretty similar to the architecture we are going for, and the limitations we are under (minus the adding more nodes in the future).

Are you saying that if 10Gb is enough for Ceph traffic, you use a switched network instead of a mesh, and only go the mesh route once you need 25Gb of bandwidth?
 
I should add that going with 25Gb on a switched network also helps with cable count. We would want redundant switches and redundant network cables per node. So with 25Gb, we need 2 cables for Ceph public and 2 cables for Ceph private, one of each going to each switch. If a switch dies, no problem, all traffic goes through the other switch. If a cable is pulled, no problem, all traffic goes through the other cable.

If we were to go 10Gb networking, we would want bonded 2x10Gb for each link, therefore requiring 8 cables per node.

Then we still have another 4 cables for VM-LAN traffic and management, plus at least 1 cable for Corosync (failover on the management link). 13 cables per node with 10Gb networking to multiple switches, yuck! 9 cables with 25Gb switched. 5 with 25Gb mesh.
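The per-node cable tallies above can be sanity-checked with a little arithmetic. This is just a sketch of my reading of the thread's numbers; in particular, I'm reading the mesh figure of 5 as counting only switch-facing cables (the node-to-node DACs excluded), which is an assumption.

```python
# Per-node cable counts for the three designs discussed above.
# Shared in every design: 4 cables for VM-LAN + management, 1 for Corosync.
COMMON = 4 + 1

# 10Gb switched: a bonded 2x10Gb pair per Ceph network (public + private),
# duplicated across two redundant switches -> 2 bonds * 2 networks * 2 switches
ten_gb_switched = COMMON + 2 * 2 * 2

# 25Gb switched: 2 cables per Ceph network, one to each redundant switch
twenty5_gb_switched = COMMON + 2 * 2

# 25Gb mesh: only the common cables reach a switch; the mesh DACs run
# directly between chassis and are not counted here (assumption).
twenty5_gb_mesh = COMMON

print(ten_gb_switched, twenty5_gb_switched, twenty5_gb_mesh)  # 13 9 5
```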
 
I spoke again with the vendor who advised against mesh networking. They now say that bonded 2x10Gb performs better, even though total bandwidth is lower, due to multipathing to OSDs. I have a hard time believing an OSD could saturate a 25Gb connection, let alone under our low-utilization workload.
 
I've just completed transition from 2x10G (LACP) to 25G mesh.
25G outperforms 2x10G because LACP hashing does not take link load into account... you can have one link overwhelmed by multiple heavy transfers while the other link is completely bored.
`rados bench` (same parameters for both network types) gives higher values for 25G...
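The hashing point above can be illustrated with a toy model. This is not the actual Linux bonding driver, just a simplified stand-in for a layer2-style hash policy, which derives the egress link from the (source MAC, destination MAC) pair; the MAC addresses are made up.

```python
# Toy illustration of why layer2 LACP hashing can leave one link idle:
# the link is chosen from the MAC pair, so every flow between the same
# two hosts lands on the same physical link.
def pick_link(src_mac: str, dst_mac: str, n_links: int = 2) -> int:
    # Simplified stand-in for a layer2 xmit hash (XOR of the MACs,
    # modulo the number of bond members) -- not the kernel's exact code.
    src = int(src_mac.replace(":", ""), 16)
    dst = int(dst_mac.replace(":", ""), 16)
    return (src ^ dst) % n_links

node_a = "aa:bb:cc:dd:ee:01"
node_b = "aa:bb:cc:dd:ee:02"

# Ten "heavy transfers" between node A and node B all hash to one link:
links_used = {pick_link(node_a, node_b) for _ in range(10)}
print(links_used)  # a single link; the second bond member stays idle
```

This is why a single 25G link can beat a 2x10G bond for node-to-node replication traffic: the bond never spreads one host pair's flows across both members under a layer2 hash.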
 
3 x AMD EPYC 7713, 512GB RAM, 2x1TB SSD (RAID 1, OS), 5 x 3.84TB NVMe (Ceph)
Networking:
- 2 port embedded 1G NIC: 1 port for public internet access, 1 port for private network - both connected to switch ports acting as access ports to different VLANs

10G setup was:
2-port 10G NIC connected to a 10G switch via fiber links; the ports were bonded with 802.3ad LACP (layer2 hash)

25G setup is:
2-port 25G NIC connected directly to the other nodes via DAC cables - no bridges, no bonds - a routed setup (each pair of ports connected by a single cable has its own /30 network).
On top of that I've set up FRR with OSPF to handle access via "identity" IPs - IPs assigned to the loopback interface, with routing handled by FRR.
FRR is very quick in handling "node down" scenarios, so Ceph doesn't suffer when I shut down/reboot a node for maintenance.
Ceph uses that network for both public and private communication; additionally, Proxmox is set to use it for migrations.
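An addressing plan along the lines described above can be sketched out programmatically. All subnets, node names, and the identity-IP range here are illustrative assumptions, not taken from the post.

```python
# Sketch of a routed-mesh addressing plan: each node-to-node DAC link
# gets its own /30, plus a per-node loopback "identity" IP that OSPF
# advertises so Ceph/migration traffic survives a dead link.
import ipaddress

nodes = ["node1", "node2", "node3"]
links = [("node1", "node2"), ("node2", "node3"), ("node1", "node3")]

# Carve /30 point-to-point networks out of an (assumed) private range.
p2p = list(ipaddress.ip_network("10.99.0.0/28").subnets(new_prefix=30))

plan = {}
for (a, b), net in zip(links, p2p):
    first, second = net.hosts()  # a /30 has exactly two usable addresses
    plan[(a, b)] = {a: first, b: second}

# Loopback identity IPs: the stable addresses Ceph and migrations target;
# FRR/OSPF routes to them over whichever physical links are still up.
identity = {n: ipaddress.ip_address(f"10.99.99.{i + 1}")
            for i, n in enumerate(nodes)}

print(plan[("node1", "node2")], identity["node1"])
```

The design choice here is that services bind to the loopback identity IPs rather than to any physical interface, so a single failed DAC only changes the route, not the endpoint.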

This setup is working very well for me.
I've had a single situation that needs investigation, but I cannot reproduce it: for some reason, one of my VMs was stuck on I/O during single-node maintenance, even though Ceph did not report any PGs unavailable and the other VMs were fully functional. The VM became fully operational again once the node was booted back up. I spent 2 hours migrating VMs back and forth and restarting different nodes, and I couldn't reproduce the issue.
 
@Nexces Thanks so much for the info! Our plan would be to implement something pretty similar to what you have done.

I had not considered running both Ceph public and private networks over the same 25Gb connection. Certainly makes things a bit simpler. No I/O performance/contention concerns? Our use case sounds less demanding than yours (we would have far less CPU cores, RAM, and storage).

Which network are you running corosync on?

I assume the 2x 10Gb is for node-to-LAN traffic?

Do you have any redundancy for the Ceph networks (failing over to 10Gb), or do you just assume that since the 25Gb is directly connected node to node, this is unnecessary?

Thanks again. It is great to hear from someone with a similar build to what we want to do in (what sounds like) a business-centric environment.
 
@x509
corosync is on 1G "private" link
no redundancy for the 25G - a single node failure is accepted, as the datacenter housing those servers has 24h service and spare parts - a faulty node will be back up in a matter of minutes/hours
there is a total of 70 VMs on that cluster, of which 14 are dedicated to IoT data processing (postgresql, rabbit, emqx, our own software); the others handle monitoring, utilities, and other services (fairly light compared to the IoT processing)
given that a few people get notified about any failure in the cluster, there is almost always someone who can set NOOUT on Ceph so it will not commit seppuku after a node failure - so we can even push Ceph storage above some safety margins

The whole load can be handled by two nodes - they are pushed to their limits on CPU - but we can survive in that state for quite some time.

So to summarize: those 3 nodes, configured that way, are quite enough for our use case.
 
@shanreich Thank you for the document. Based on what I see in the charts, 25Gb is more than sufficient for our needs (expected, since that is what our storage traffic currently runs on).

Do you have any input on the design @Nexces has used (specifically, the networking)?