4 node CEPH cluster - No Switch

bigobx
Jul 27, 2023
https://forum.proxmox.com/threads/ceph-network-configuration-3-servers-without-switch.66539/
https://forum.proxmox.com/threads/10gbe-cluster-network-s-3-nodes-without-switch.101425/

So, kinda similar to the threads above, I want to do a 4-node cluster for home labbing, so it doesn't have to be the most performant or fully redundant system.

To make this simple, I have one "Master" node that has a Mellanox ConnectX-3, and I have the 40Gb to 4x 10Gb breakout cable.

The (3) "slave" nodes all have 10Gb Mellanox cards that I want to use JUST for Ceph.

Again, I understand that the entire setup becomes dependent on the master node, but I'm OK with that dependency and the potential performance hit.

Is this doable and reasonable?

Each node also has dual 1Gb LACP bonds to my main network for normal data and access to the rest of my LAN.

Three nodes have 500GB Samsung EVOs and one only has a 256GB drive (not to jumble issues, but is this a problem?)

These drives are JUST for Ceph; I have separate boot drives, and possibly even a spare SSD per node.

Each node also has at least 32GB of memory (the master has 48GB).

Master = HP ML360 - dual 2.9GHz Xeons - 48GB ECC RAM
Slave1 - HP EliteDesk - i5-4950 @ 3.3GHz - 32GB non-ECC
Slaves 3-4 - Dell OptiPlex - i5-6500 @ 3.2GHz - 32GB non-ECC

For context, I'm doing ALL of this to learn a bit. I would like to run a K8s cluster on this, using the Learn Linux videos from YouTube as a baseline for this project:

https://www.youtube.com/watch?v=U1VzcjCB_sY&t=806s
 
Hello,

In theory you can also do a 'meshed' setup with 4 nodes. The problem is that in a meshed network with n hosts, every host needs n-1 network ports for the mesh (and the full mesh needs n(n-1)/2 cables in total). In the classic 3-node setup that is two ports each and 3 cables total, but in a 4-node setup that is already 3 ports each and 6 cables total.

So meshes don't scale well.

Secondly, it is not advised to put an even number of hosts in a cluster, since quorum is based on majority (and if your cluster splits 2-2, there will be no subgroup with a majority). There are QDevices, but those only work for the Proxmox clustering, not for the Ceph part.
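
For the Proxmox quorum side, a QDevice on some always-on fifth machine (even a Raspberry Pi) can provide the tie-breaking vote. A rough sketch of the setup, where 192.0.2.10 is a placeholder for the external host:
Code:
# on the external tie-breaker host (NOT a cluster member)
apt install corosync-qnetd

# on every cluster node
apt install corosync-qdevice

# on one cluster node, pointing at the external host
pvecm qdevice setup 192.0.2.10
Again, this only helps Proxmox clustering; the Ceph monitors still need their own odd-numbered majority.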

Oh, and there is nothing like 'Master' or 'Slaves' in Proxmox. Every host in a cluster has the same priority.

Three nodes have 500GB Samsung EVOs and one only has a 256GB drive (not to jumble issues, but is this a problem?)
Proxmox itself doesn't need much space. The question is how much space your VMs need.

Also, consumer-grade SSDs (like Samsung's EVOs) are terrible for virtualized environments. They don't offer PLP (power-loss protection) and get incredibly slow once you get past the small write cache. They also have a very low DWPD rating and die really quickly under standard server load. They might work for getting to know Proxmox in a test environment, but won't be any fun in a production environment.
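
If you do run the EVOs anyway, it is worth watching their wear level. A quick check with smartmontools (assuming the drive shows up as /dev/nvme0; adjust to your device):
Code:
apt install smartmontools
# NVMe drives report wear as "Percentage Used" and total writes as "Data Units Written"
smartctl -a /dev/nvme0 | grep -iE 'percentage used|data units written'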

Kind regards,
Benedikt
 
Thanks for the insight,

Makes sense on the even number of nodes. Since I'm just testing, I could technically NOT use one in the cluster, but I had the hardware in place, so I built the stack as best I could hahaha

As for space, again a bit unknown, BUT any space needed would be boot drive space; I have over 250TB of Synology spinning rust for any "data" storage.

My end goal would be a few Ubuntu servers running K8s with the *arr stack on them for my own use, and a second K8s stack for my automation buddies to do some similar testing on.

I technically don't need Ceph, BUT it looks super cool, so I figured it might be a cool experiment to try to set it up this way.
 
Should work as well if you build a ring, either with the RSTP or routed (with fallback) variant: https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server

Redundancy won't be great, but that isn't an issue here.

Do you have at least 2 disks? One for the OS, and one for a Ceph OSD?
Ok, my networking skills aren't that awesome, but I'll look into that some more. Any other supporting docs I can read?

Yes, it doesn't need to be redundant. Maybe later on down the line I can score a better switch with more 10G ports and that can change.

Yes on the disks; I updated the OP as well.

I have spinning rust for boot drives in most of them, since the base OS doesn't need much and I'm homelabbing, so building on the cheap here lol.

The EVOs are for dedicated Ceph use, and I can probably scratch in another smaller SSD if it really helps, maybe a 128 or 256GB, as I might have enough lying around.

Goal is to run a few Ubuntu servers and the *arr stack for sourcing new goodies from the net. Probably add a Plex server, mostly for back-end services, as I have dedicated machines for those today. And a handful of containers for the smaller, more isolated things like Pi-hole or other "services".
 
Node with dual 40Gb network cards:
Code:
auto lo
iface lo inet loopback

auto enp3s4f0
iface enp3s4f0 inet manual

auto enp3s4f1
iface enp3s4f1 inet manual

# unused 40Gb
iface ens3f0 inet manual

# unused 40Gb
iface ens3f1 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves enp3s4f0 enp3s4f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3

auto vmbr0
iface vmbr0 inet static
        address 192.168.128.53/24
        gateway 192.168.128.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0

Sample node with single 10Gb:
Code:
auto lo
iface lo inet loopback

# unused 1Gb
iface enp0s31f6 inet manual

# unused 10Gb
iface enp1s0 inet manual

auto enp3s0f0
iface enp3s0f0 inet manual

auto enp3s0f1
iface enp3s0f1 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves enp3s0f0 enp3s0f1
        bond-miimon 100
        bond-mode 802.3ad

auto vmbr0
iface vmbr0 inet static
        address 192.168.128.45/24
        gateway 192.168.128.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
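
To confirm an 802.3ad bond actually negotiated LACP with the switch, the kernel's bonding status file is the quickest check (bond0 matches the configs above):
Code:
cat /proc/net/bonding/bond0
# look for "Bond Mode: IEEE 802.3ad Dynamic link aggregation"
# and a matching Aggregator ID on both slave interfaces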

Is ens21 the physical port, and ens18 and ens19 the virtual OVS ports running on it?

ens20 in my case would be my current LACP bond, so I'd kinda leave that in place.

Code:
auto lo
iface lo inet loopback

iface ens20 inet manual

auto ens21
iface ens21 inet static
address 10.14.14.51
netmask 255.255.255.0

auto vmbr0
iface vmbr0 inet static
address 192.168.2.51
netmask 255.255.240.0
gateway 192.168.2.1
bridge_ports ens20
bridge_stp off
bridge_fd 0

auto ens18
iface ens18 inet manual
ovs_type OVSPort
ovs_bridge vmbr1
ovs_options other_config:rstp-enable=true other_config:rstp-path-cost=150 other_config:rstp-port-admin-edge=false other_config:rstp-port-auto-edge=false other_config:rstp-port-mcheck=true vlan_mode=native-untagged

auto ens19
iface ens19 inet manual
ovs_type OVSPort
ovs_bridge vmbr1
ovs_options other_config:rstp-enable=true other_config:rstp-path-cost=150 other_config:rstp-port-admin-edge=false other_config:rstp-port-auto-edge=false other_config:rstp-port-mcheck=true vlan_mode=native-untagged

auto vmbr1
iface vmbr1 inet static
address 10.15.15.50/24
ovs_type OVSBridge
ovs_ports ens18 ens19
up ovs-vsctl set Bridge ${IFACE} rstp_enable=true other_config:rstp-priority=32768 other_config:rstp-forward-delay=4 other_config:rstp-max-age=6
post-up sleep 10
 
Should work as well if you build a ring, either with the RSTP or routed (with fallback) variant: https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server

Redundancy won't be great, but that isn't an issue here.

Do you have at least 2 disks? One for the OS, and one for a Ceph OSD?
Sorry for the double quote, but this felt easier.

So I took some time to actually read the guide slowly, and I think I got it working:
  • ens18, ens19 will be used for the actual full mesh: physical direct connections to the other two servers, 10.15.15.y/24
  • ens20 is the connection to WAN (internet/router), used as vmbr0, 192.168.2.y
In my case the one machine has three NICs in it and the rest only have a single.

Tried Ceph and didn't understand the public vs. cluster (heartbeat) networks, so I had to redo stuff. I decided to take some lessons learned, wiped the stack, and started over, and it seems to be working well so far, minus any VMs. But that's the point: I wanted to focus first on building a good platform and go from there.

Had to quickly dust off / learn some networking stuff to create a few different networks, but got that working too:

- Corosync: 1Gb dedicated network on a small unmanaged switch
- RSTP 10Gb deployed with one host serving as the switch (ordered a MikroTik 4+1 port jammie, so I'll update this)
- Management / LAN: dual 1Gb LACP from each node back to my core

All nodes have dedicated boot drives NOT listed, as they are JUST for PVE.

Ceph-dedicated storage is as follows:

OptiPlex [1&2]:
500GB EVO - NVMe

EliteDesk:
500GB EVO - NVMe
256GB Samsung SSD - SATA

Proliant:
256GB SK hynix - NVMe
256GB SK hynix SSD - SATA (DB for SAS drives)
128GB Kingston - SATA (DB for SATA drives)
[2] 256GB WD Blue - HDD
[8] 146GB HP SAS (10K RPM) - HDD

It's giving me about 1TB of shared storage in the FS, which should work great for my needs. I understand that I can "tier" the storage a bit, which is perfect, as I can have some boot and scratch disk space available and then copy to the NAS for backup and true data-ingestion storage.
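
If you want to see how Ceph classed the mixed drives and how that ~1TB breaks down, these read-only commands show it (nothing here is specific to this setup):
Code:
ceph df               # raw capacity and per-pool usage/availability
ceph osd crush tree   # OSD hierarchy with device classes (ssd/hdd)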

Thanks so much for the help. I hope you read into this more and give me some more great advice as I move along.
 
In my case the one machine has three NICs in it and the rest only have a single.
Wait, in the first post you mentioned that one machine has 40Gbit, which you can split with a breakout cable, and the other machines have 2x 10Gbit and 1x 1Gbit?

If that is not correct, then the overall setup might not work. If that is the case, then my idea was to build something like this:
Code:
┌─────────┐    ┌─────────┐
│  node1  ├────┤  node2  │
└────┬────┘    └────┬────┘
     │              │
┌────┴────┐    ┌────┴────┐
│  node4  ├────┤  node3  │
└─────────┘    └─────────┘
Where each of the 2x 10Gbit NICs (or breakout cable) connects to one of the other nodes, building a ring. Then either the Routed (with fallback) or RSTP Loop variants should technically work.
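
For reference, the routed variant boils down to giving both mesh ports the same node address and adding host routes over the direct links. A rough sketch of what node1 could look like in such a ring, adapting the wiki's 3-node example (the 10.15.15.0/24 addresses and interface names are assumptions; reaching the opposite corner, node3, through a neighbor also requires IP forwarding on the nodes):
Code:
auto ens18
iface ens18 inet static
        address 10.15.15.51/24
        # direct link to node2
        up ip route add 10.15.15.52/32 dev ens18
        # node3 sits across the ring and is reached via node2
        up ip route add 10.15.15.53/32 via 10.15.15.52 dev ens18
        down ip route del 10.15.15.53/32
        down ip route del 10.15.15.52/32

auto ens19
iface ens19 inet static
        address 10.15.15.51/24
        # direct link to node4
        up ip route add 10.15.15.54/32 dev ens19
        down ip route del 10.15.15.54/32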

Can the 40 Gbit NIC handle a breakout cable and present 4 independent NICs to the OS?

It's giving me about 1TB of shared storage in the FS, which should work great for my needs. I understand that I can "tier" the storage a bit, which is perfect, as I can have some boot and scratch disk space available and then copy to the NAS for backup and true data-ingestion storage.
It really depends on how you configure the OSDs. Keep in mind that Ceph wants to have 3 replicas by default. Therefore, each data object will be stored 3 times on different nodes.

This limits you when you want to set up multiple storages and separate them by speed (SSD vs. HDD), as you only have one node with HDDs and not 3.
Device classes and rules that match them are what you'd use:
Code:
ceph osd crush rule create-replicated replicated_ssd default host ssd

You could of course change the rule for the HDD-based pool to distribute the data not on the node level, but on the OSD level. That way the cluster is only protected against the loss of OSDs, but not of a full node. It is definitely not a production setup, and should you have any issues, you should make that clear, as people will not expect such a setup.
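
Sketched out, such an OSD-level rule would look like this (the rule name replicated_hdd_osd and the pool name are placeholders):
Code:
# replicate per OSD instead of per host, restricted to HDD-class OSDs
ceph osd crush rule create-replicated replicated_hdd_osd default osd hdd
# assign the rule to the HDD-backed pool
ceph osd pool set <hdd-pool> crush_rule replicated_hdd_osd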
 
Yeah, sorry for not explaining things correctly and changing things around a bit. Let me set the stage by responding and updating my current working config.

Hardware Info:
Master = HP ML360 - dual 2.9GHz Xeons - 48GB ECC RAM
Slave1 - HP EliteDesk - i5-4950 @ 3.3GHz - 32GB non-ECC
Slaves 3-4 - Dell OptiPlex - i5-6500 @ 3.2GHz - 32GB non-ECC

Wait, in the first post you mentioned that one machine has 40Gbit, which you can split with a breakout cable, and the other machines have 2x 10Gbit and 1x 1Gbit?

Network Info:

Corosync is on a separate physical MANAGED switch (GST108TV1)


Proliant - HP ML360 G6 - MANY PCIe ports and lanes - the "Master" for explanation purposes ONLY (I get that in clusters there is no real master)
[screenshot]

Currently has three 10Gb interfaces for Ceph only:
Dual-SFP Intel and single-SFP Mellanox in a bridge with RSTP set up

Dual 1Gb in LACP for LAN/MGMT

1Gb Corosync

Elite:
[screenshot]

Single Mellanox with RSTP for Ceph, with a DAC hooked up to the Proliant

Dual 1Gb in LACP for LAN/MGMT

1Gb Corosync

Opti 1&2 are identical:
[screenshot]

Single Mellanox with RSTP for Ceph, with a DAC hooked up to the Proliant

Dual 1Gb in LACP for LAN/MGMT

1Gb Corosync

OSD Layout
[screenshot]

The large array of drives is on the Proliant, which is also why it's dubbed "the Master" for discussion purposes.
The spinning-rust drives have SSDs as their DB devices in Ceph:
2x 250GB -> 128GB SSD for DB
8x 146GB SAS -> 256GB SSD

I like the one-large-pool idea, and Ceph just handles the distribution logic under the hood, if I understand things correctly. With that logic I threw everything at the "array" as shown and explained above.

On, I think, a good side note: since those 40Gb cards and breakout cables didn't work, I returned them and ordered a MikroTik SFP switch to use instead of the advanced config above.

Also, a new kinda issue: I'm having a problem with the Proliant and Corosync. It was losing connectivity a bit and I'm not sure why.

Secondarily, and only because maybe it's related: one of my Dell OptiPlex nodes' onboard i217, which I'm using for Corosync, is defaulting to 100Mb. I think it's a hardware defect, as I tried 6 different cables and 3 different upstream switches, all with the same result.
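
For what it's worth, ethtool shows what the link actually negotiated and can retrigger negotiation, which helps separate a cabling/negotiation issue from a dead PHY (the interface name eno1 is a placeholder):
Code:
ethtool eno1   # check Speed, Duplex and "Link partner advertised link modes"
ethtool -s eno1 speed 1000 duplex full autoneg on   # re-advertise gigabit and renegotiate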

I recently purchased that node from someone, and I'm going to swap it out this week at some point.
 
Secondly, it is not advised to put an even number of hosts in a cluster, since quorum is based on majority (and if your cluster splits 2-2, there will be no subgroup with a majority).
I have a comprehension problem regarding split brain. With an even cluster of more than 3 nodes in one location, how can a split brain happen? All nodes are connected via a switch in a star topology. If I use 4 nodes in a ring like the one mentioned here, there is also no chance of a split brain, unless 2 network cables fail at once in just the right positions.
Only with a geo cluster is there a chance of a split brain. Am I thinking about this right, or am I wrong?
 
Chances are low, but it could still happen. Maybe some firewall rules that are a bit weird/wrong. What about having multiple networks for Corosync that are also connected to different switches? If you stumble over the cables just right, it could happen.

Don't underestimate the situations that can result from all kinds of causes that are not even on your radar :)
 
What about having multiple networks for Corosync that are also connected to different switches?
Normally you have a switch stack not only for the storage network but also for the cluster network. I take that as a given. That's why I think there is no chance.
 
