Evaluate my network setup?

Hi all,
I'm pretty new to Proxmox and Ceph. We've been running a test cluster on three nodes, with the Ceph network on Gigabit as well, and so far we are satisfied with the performance and resiliency. So we're planning to deploy a production cluster soon.

The new cluster will start with 6 nodes, each hosting 4x 1TB SSDs as OSDs. I plan to scale it to possibly 30+ nodes.

I would like to setup each node's network cards like this:
  • VM public bridge: 2x 1Gbps NICs, connected to distinct switches
  • Management (for node reachability/management): 1Gbps NIC connected to a dedicated switch
  • OOB (iLO) network: dedicated port on the server, connected to a dedicated switch
  • Ceph public network: 2x 10Gbps, each one connected to a different switch and with a distinct IP address (e.g. 10.0.0.1/24 and 10.0.1.1/24)
  • Ceph cluster network: 1x 1Gbps, connected to a dedicated switch
Three questions:
  1. Can I actually set up the Ceph public network on two distinct network trunks? Or am I forced to use a meshed setup?
  2. Is 1Gbps enough for the Ceph cluster network (OSD replication + heartbeat according to the Proxmox VE Administration Guide)?
  3. Should I create a separate cluster network for corosync?
Thanks in advance!
 
If you post the contents of /etc/network/interfaces, it will be easiest to spot any problems in the network config :)

For the Ceph network: 2 NICs connected to different switches is a good idea, in case a switch fails or needs to be rebooted. In that case, consider setting up a bond using those two NICs; then there is no need for two different subnets. Placing the optional Ceph cluster network on a 1Gbit link is a bad idea: either don't use it at all, or also place it on the 10Gbit link. You can define a VLAN on top of the bond and configure the IP for the cluster network on it.
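Just to illustrate, a bond with the cluster network as a VLAN on top could look roughly like this in /etc/network/interfaces. This is only a sketch: the interface names, addresses and VLAN ID are made up, and the LACP (802.3ad) mode assumes your two switches support MLAG/stacking; otherwise active-backup works without any special switch support.

Code:
auto bond1
iface bond1 inet static
    bond-slaves enp65s0f0 enp65s0f1
    bond-miimon 100
    bond-mode 802.3ad
    address 10.0.0.1/24
    mtu 9000
#ceph public on the bond

auto bond1.100
iface bond1.100 inet static
    address 10.0.1.1/24
    mtu 9000
#ceph cluster as VLAN 100 on top of the bond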

The replication traffic between the OSDs is one of the larger parts of Ceph traffic and if you place it on a 1Gbit link you will severely limit the performance.

Corosync itself can switch between different networks if one becomes unavailable. You can configure up to 8 corosync links. The recommendation is to have at least one dedicated physical corosync link to avoid any problems that might occur if other services use the same link. If you want to use HA, it is of the utmost importance that corosync is working. Corosync is used in the PVE HA stack to determine if a node is still part of the cluster. If the corosync connection fails for too long (I think about 2 minutes), the node will fence itself -> like pushing the reset button. It does this to make sure that any HA guests on it are not running anymore and that it is safe to start them on the remaining nodes.
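For reference, the links end up in /etc/pve/corosync.conf roughly like this (a sketch; the node name and addresses are made up, ring0_addr corresponds to link0 and ring1_addr to link1):

Code:
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.50.11
    ring1_addr: 10.10.27.11
  }
  # further node entries follow, one per cluster member
}

If I remember correctly, pvecm create/add also accept --link0 and --link1, so the links can be set directly when creating or joining the cluster.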

Now if corosync shares its network with other services that might take up a lot of bandwidth from time to time, for example Ceph, backups, ..., then it can easily happen that these other services take up almost all the bandwidth, which causes the latency of the corosync packets to go up. With a bit of bad luck the latency gets so high that corosync considers the link unusable. If there is no other stable link available to switch to, it can happen that all nodes think they have lost the connection to the cluster, and what you see is the whole cluster restarting out of the blue.
 

Hi Aaron,
thank you very much! I haven't installed any node yet, but if I have any doubts I'll post /etc/network/interfaces for sure :)

Anyway here's what I'm going to do based on your suggestions:
  • VM public bridge: bond between two Gigabit interfaces, connected to distinct switches for redundancy
  • Management interface: unchanged
  • OOB (iLO): unchanged
  • Corosync:
    • one link on the public bond (so I have at least one redundant link)
    • one link on the management interface as a fallback (the management interface should never take up a lot of bandwidth anyway)
  • Ceph public network: bond between two 10Gbps interfaces, connected to distinct switches (I'm going to use SSDs for all OSDs, so I need enough bandwidth)
  • Ceph cluster network: VLAN on the 10Gbps bond
Do you think it's good enough?

Also, I'm considering setting up QinQ on my "public" switches so I can assign arbitrary VLANs to my customers and have their packets tagged with my public (outer) VLAN in order to reach the gateway, see the sketch below. Do you think it's a good idea overall?
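Roughly what I have in mind for the QinQ part, loosely based on the QinQ example in the Proxmox network documentation (just a sketch: the bridge names and outer VLAN ID 100 are made up, and the reduced MTU on the inner bridge assumes the physical MTU stays at 1500, since the extra tag costs 4 bytes):

Code:
auto vmbr1
iface vmbr1 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
#VLAN-aware bridge on the public bond, carries the outer (service) tags

auto vmbr1.100
iface vmbr1.100 inet manual
#outer VLAN 100 for one customer zone

auto zone100
iface zone100 inet manual
    bridge-ports vmbr1.100
    bridge-stp off
    bridge-fd 0
    mtu 1496
#customer-facing bridge; guests attach here and can use their own inner VLAN tags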

Thank you again!
 
Hi,

here's my /etc/network/interfaces:

Code:
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface eno1 inet manual

auto eno2
iface eno2 inet manual

auto eno3
iface eno3 inet manual

iface eno4 inet manual

auto eno49
iface eno49 inet static
    address 172.27.0.1/24
    mtu 9000
#ceph public

auto eno50
iface eno50 inet static
    address 172.28.0.1/24
    mtu 9000
#ceph cluster

auto bond0
iface bond0 inet manual
    bond-slaves eno2 eno3
    bond-miimon 100
    bond-mode balance-rr

auto vmbr0
iface vmbr0 inet static
    address 10.10.27.1/20
    gateway 10.10.16.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0

auto vmbr1
iface vmbr1 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
#VM traffic

As you can see, I have vmbr1 on top of bond0, which I use for VM traffic, and two separate 10G networks for Ceph public and Ceph cluster, each of them on a dedicated switch.

My only problem so far: what if the switch for the Ceph public network goes down? I tested rebooting it and (as expected) all I/O completely stopped.

Could I add both 10GbE interfaces to a bond and then separate the traffic with VLANs (something like the sketch below)? Or maybe just make the bond and let the cluster traffic use the doubled bandwidth?
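Something like this is what I'd try for the two 10G ports, just as a sketch (active-backup because my two Ceph switches are independent, so no LACP across them; VLAN ID 50 is arbitrary):

Code:
auto bond1
iface bond1 inet static
    bond-slaves eno49 eno50
    bond-miimon 100
    bond-mode active-backup
    address 172.27.0.1/24
    mtu 9000
#ceph public on the bond

auto bond1.50
iface bond1.50 inet static
    address 172.28.0.1/24
    mtu 9000
#ceph cluster as VLAN 50 on top of the bond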

Thank you!
 