Ceph network recommendations

daros

Hello all,

Currently we use a network bond (2x 10Gb) with different VLANs for Ceph, internet, and the inter-VM network.
We use 2x Arista 7050S switches with channel groups.

We're thinking about expanding our network and adding a dual-port 10Gb NIC to every node.
What would be the best option?
1) Make a bond for Ceph (20Gb) and a bond for the rest (20Gb)?
2) Make a single 40Gb bond?
3) Or is it not necessary to extend the current setup with extra network cards?

Cluster information:
6 nodes (each with 256GB RAM and 2x E5-2620 v4)
Ceph, full SSD
38 OSDs
OSDs are a mix of:
19x 1TB PM863a
19x 4TB PM883

Example network config:
.15 internal network
.16 ceph network
MTU is the default 1500

Code:
root@prox-s05:~# cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface ens1f0 inet manual
iface ens1f1 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves ens1f0 ens1f1
        bond-miimon 100
        bond-mode 4
        bond-downdelay 400
        bond-updelay 800

auto vmbr0
iface vmbr0 inet static
        address 192.168.15.95
        netmask 255.255.255.0
        gateway 192.168.15.251
        bridge-ports bond0.15
        bridge-stp off
        bridge-fd 0

auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0

auto vmbr2
iface vmbr2 inet static
        address 192.168.16.95
        netmask 255.255.255.0
        bridge-ports bond0.25
        bridge-stp off
        bridge-fd 0
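For reference, the active mode and per-slave link state of the bond can be checked with the standard Linux bonding status file (nothing here is specific to this cluster):

Code:
# shows 802.3ad (mode 4) status, LACP partner info and slave link state
cat /proc/net/bonding/bond0
# driver-level details of the bond
ip -d link show bond0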
 
The recommendation for Ceph is to have the Ceph backend traffic on a separate network; corosync traffic should ideally also be on its own network.

LACP bonding over 4 ports would probably not have the expected effect, as the bandwidth between two given nodes would still not exceed 10G (LACP hashes each connection onto a single link).

So I would recommend your option 1: make a bond for Ceph (20Gb) and a bond for the rest (20Gb); a rough sketch is below.

If you have additional 1G ports available on the hosts, I would recommend setting up a second dedicated network (it does not need to be bonded) for a second corosync ring.
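Roughly, assuming the new dual-port card shows up as ens2f0/ens2f1 (adjust the names to your hardware), option 1 could look like this on top of your existing config; the Ceph address and VLAN tag are simply carried over from your current vmbr2:

Code:
# sketch only: bond0 keeps VM/internal traffic, the new bond1 carries Ceph
# the switches need a matching channel group / MLAG for the new bond
auto bond1
iface bond1 inet manual
        bond-slaves ens2f0 ens2f1
        bond-miimon 100
        bond-mode 4

auto vmbr2
iface vmbr2 inet static
        address 192.168.16.95
        netmask 255.255.255.0
        bridge-ports bond1.25
        bridge-stp off
        bridge-fd 0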
 
Thank you.

Will do that.

And since we have two 10Gb NICs, each with two SFP+ ports:
is it best to create the bonds like this
bond0: nic1-sfp1 + nic2-sfp1
bond1: nic1-sfp2 + nic2-sfp2

or like this
bond0: nic1-sfp1 + nic1-sfp2
bond1: nic2-sfp1 + nic2-sfp2
 
If you spread the bond over the two NIC boards you can survive a defective board, but from a performance point of view it is the same (at least if both PCIe slots can support the data rates); roughly like this:
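In terms of the sketch above (interface names still assumed as ens1f0/ens1f1 and ens2f0/ens2f1), spreading the bonds across both cards would look roughly like this:

Code:
# one port from each physical card per bond, so a dead card only
# halves the bandwidth of each bond instead of taking one bond down
auto bond0
iface bond0 inet manual
        bond-slaves ens1f0 ens2f0
        bond-miimon 100
        bond-mode 4

auto bond1
iface bond1 inet manual
        bond-slaves ens1f1 ens2f1
        bond-miimon 100
        bond-mode 4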
 
We have a 6-node cluster, so we can handle a single NIC failure even if it means the node is gone.
But it would be nice to have, as long as it doesn't come with drawbacks.

Maybe someone else knows what's best to do?
 
This is the structure Proxmox recommends:
[Attachment: Proxmox Ceph_small.PNG]

The most essential part of a Proxmox HA cluster is the corosync link. Corosync should have at least 1 (better 2) separate physical links (not shared with any other traffic) because of latency. Corosync links do not need high bandwidth, but they do need a stable, low-latency connection. You can get into big trouble if there is an interruption on the cluster (corosync) link and a Ceph node gets restarted: when it comes up again, the storage traffic for the Ceph rebalance is high, corosync latency rises again, and maybe the next node restarts as well...

If you have 2 powerful switches, the configuration does not really need 6 switches; it can be 2 as well, but it is necessary that you have those (minimum 5, ideally 6) physical connections from each node. I would use one 20Gbit bond for the storage (Ceph) network with LACP, and one 20Gbit bond for the public/VM traffic, also with LACP and additional VLANs to virtually separate VMs if needed. For the cluster network (which is corosync) there should be, as described, a minimum of 1 (better 2) separate physical links. Do you have some 1Gbit ports on your nodes?
If you have enough ports on your switches (18 ports per switch for this 6-node cluster), use your two Aristas. If you cannot spend 6 10Gbit ports per switch on the (just) 1Gbit corosync connections, you need two more switches, which can also be older 1Gbit models, but you should check their latency. Corosync works with a redundant ring protocol (https://pve.proxmox.com/wiki/Separate_Cluster_Network). Do not use LACP for corosync. If you have one separate ring on each switch (a subnet dedicated to corosync which connects all nodes via one switch, and another subnet for the 2nd ring which connects all nodes via the second switch), corosync can work in active mode, similar to LACP (this must be configured). The standard passive mode automatically switches to the better connection, and this is also without a single point of failure; a rough sketch follows at the end of this post.
Use the bonds over the 2 network cards so that each network stays connected if one adapter card fails.

Always test all possible scenarios before running the cluster in production (e.g. failure of one switch, failure of each node, failure of each link, etc.).
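For illustration only, the two-ring part of a corosync 2.x /etc/pve/corosync.conf could look roughly like the snippet below; the 192.168.17.0 and 192.168.18.0 subnets are placeholders for the two dedicated corosync networks, not addresses from this cluster:

Code:
totem {
        # other totem options (version, cluster_name, ...) omitted
        rrp_mode: passive   # or "active", see the wiki page linked above
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.17.0
        }
        interface {
                ringnumber: 1
                bindnetaddr: 192.168.18.0
        }
}
# each node entry in the nodelist then needs a ring0_addr and a ring1_addr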
 
Update:

Currently all nodes have a second bond.
We went for the easy solution, so one bond per NIC.
 
