Ceph network recommendations

daros

Hello all,

Currently we use a network bond (2x 10Gb) with different VLANs for Ceph, internet, and the inter-VM network.
We use 2x Arista 7050S switches with channel groups.

We're thinking about expanding our network and adding a dual-port 10Gb NIC to every node.
What would be the best option?
1) Make a bond for Ceph (20Gb) and a bond for the rest (20Gb)?
2) Make a single 40Gb bond?
3) Or is it not necessary to extend the current setup with extra network cards?

Cluster information:
6 nodes (each with 256GB RAM and 2x E5-2620 v4)
Ceph, full SSD
38 OSDs
OSDs are a mix of:
19x 1TB PM863a
19x 4TB PM883

Example network config:
.15 internal network
.16 ceph network
MTU is the default 1500

Code:
root@prox-s05:~# cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface ens1f0 inet manual
iface ens1f1 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves ens1f0 ens1f1
        bond-miimon 100
        bond-mode 4
        bond-downdelay 400
        bond-updelay 800

auto vmbr0
iface vmbr0 inet static
        address 192.168.15.95
        netmask 255.255.255.0
        gateway 192.168.15.251
        bridge-ports bond0.15
        bridge-stp off
        bridge-fd 0

auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0

auto vmbr2
iface vmbr2 inet static
        address 192.168.16.95
        netmask 255.255.255.0
        bridge-ports bond0.25
        bridge-stp off
        bridge-fd 0
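For reference, the active mode and per-slave link state of the bond can be checked with the standard Linux bonding status file (nothing here is specific to this cluster):

Code:
# shows 802.3ad (mode 4) status, LACP partner info and slave link state
cat /proc/net/bonding/bond0
# driver-level details of the bond
ip -d link show bond0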
 
The recommendation for Ceph is to have the Ceph backend traffic on a separate network; corosync traffic should ideally also be on its own network.

LACP bonding over 4 ports would probably not have the expected effect, as the bandwidth between two given nodes would still not exceed 10G (LACP hashes each connection onto a single link).

So I would recommend your option 1: make a bond for Ceph (20Gb) and a bond for the rest (20Gb); a rough sketch is below.

If you have additional 1G ports available on the hosts, I would recommend setting up a second dedicated network (it does not need to be bonded) for a second corosync ring.
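Roughly, assuming the new dual-port card shows up as ens2f0/ens2f1 (adjust the names to your hardware), option 1 could look like this on top of your existing config; the Ceph address and VLAN tag are simply carried over from your current vmbr2:

Code:
# sketch only: bond0 keeps VM/internal traffic, the new bond1 carries Ceph
# the switches need a matching channel group / MLAG for the new bond
auto bond1
iface bond1 inet manual
        bond-slaves ens2f0 ens2f1
        bond-miimon 100
        bond-mode 4

auto vmbr2
iface vmbr2 inet static
        address 192.168.16.95
        netmask 255.255.255.0
        bridge-ports bond1.25
        bridge-stp off
        bridge-fd 0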
 
Thank you.

Will do that.

And since we have two 10Gb NICs, each with two SFP+ ports:
is it best to create the bonds like this
bond0: nic1-sfp1 + nic2-sfp1
bond1: nic1-sfp2 + nic2-sfp2

or like this
bond0: nic1-sfp1 + nic1-sfp2
bond1: nic2-sfp1 + nic2-sfp2
 
If you spread the bond over the two NIC boards you can survive a defective board, but from a performance point of view it is the same (at least if both PCIe slots can support the data rates); roughly like this:
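In terms of the sketch above (interface names still assumed as ens1f0/ens1f1 and ens2f0/ens2f1), spreading the bonds across both cards would look roughly like this:

Code:
# one port from each physical card per bond, so a dead card only
# halves the bandwidth of each bond instead of taking one bond down
auto bond0
iface bond0 inet manual
        bond-slaves ens1f0 ens2f0
        bond-miimon 100
        bond-mode 4

auto bond1
iface bond1 inet manual
        bond-slaves ens1f1 ens2f1
        bond-miimon 100
        bond-mode 4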
 
We have a 6-node cluster, so we can handle a single NIC failure even if it means the node is gone.
But it would be nice to have, as long as it doesn't come with drawbacks.

Maybe someone else knows what's best to do?
 
This is the structure Proxmox recommends:
[Attachment: Proxmox Ceph_small.PNG]

The most essential part of a Proxmox HA cluster is the corosync link. Corosync should have at least 1 (better 2) separate physical links (not shared with any other traffic) because of latency. Corosync links do not need high bandwidth, but they do need a stable, low-latency connection. You can get into big trouble if there is an interruption on the cluster (corosync) link and a Ceph node gets restarted: when it comes up again, the storage traffic for the Ceph rebalance is high, corosync latency rises again, and maybe the next node restarts as well...

If you have 2 powerful switches, the configuration does not really need 6 switches; it can be 2 as well, but it is necessary that you have those (minimum 5, ideally 6) physical connections from each node. I would use one 20Gbit bond for the storage (Ceph) network with LACP, and one 20Gbit bond for the public/VM traffic, also with LACP and additional VLANs to virtually separate VMs if needed. For the cluster network (which is corosync) there should be, as described, a minimum of 1 (better 2) separate physical links. Do you have some 1Gbit ports on your nodes?
If you have enough ports on your switches (18 ports per switch for this 6-node cluster), use your two Aristas. If you cannot spend 6 10Gbit ports per switch on the (just) 1Gbit corosync connections, you need two more switches, which can also be older 1Gbit models, but you should check their latency. Corosync works with a redundant ring protocol (https://pve.proxmox.com/wiki/Separate_Cluster_Network). Do not use LACP for corosync. If you have one separate ring on each switch (a subnet dedicated to corosync which connects all nodes via one switch, and another subnet for the 2nd ring which connects all nodes via the second switch), corosync can work in active mode, similar to LACP (this must be configured). The standard passive mode automatically switches to the better connection, and this is also without a single point of failure; a rough sketch follows at the end of this post.
Use the bonds over the 2 network cards so that each network stays connected if one adapter card fails.

Always test all possible scenarios before running the cluster in production (e.g. failure of one switch, failure of each node, failure of each link, etc.).
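For illustration only, the two-ring part of a corosync 2.x /etc/pve/corosync.conf could look roughly like the snippet below; the 192.168.17.0 and 192.168.18.0 subnets are placeholders for the two dedicated corosync networks, not addresses from this cluster:

Code:
totem {
        # other totem options (version, cluster_name, ...) omitted
        rrp_mode: passive   # or "active", see the wiki page linked above
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.17.0
        }
        interface {
                ringnumber: 1
                bindnetaddr: 192.168.18.0
        }
}
# each node entry in the nodelist then needs a ring0_addr and a ring1_addr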
 
Update:

Currently all nodes have a second bond.
We went for the easy solution, so one bond per NIC.
 
