Evaluate my network setup?

Hi all,
I'm pretty new to Proxmox and Ceph. We've been running a test cluster on three nodes, with the Ceph network on Gigabit as well, and so far we are satisfied with the performance and resiliency. So we're planning to deploy a production cluster soon.

The new cluster will start with 6 nodes, each hosting 4x 1TB SSDs as OSDs. I plan to scale it to possibly 30+ nodes.

I would like to setup each node's network cards like this:
  • VM public bridge: 2x 1Gbps NICs, connected to distinct switches
  • Management (for node reachability/management): 1Gbps NIC connected to a dedicated switch
  • OOB (iLO) network: dedicated port on the server, connected to a dedicated switch
  • Ceph public network: 2x 10Gbps, each one connected to a different switch and with a distinct IP address (e.g. 10.0.0.1/24 and 10.0.1.1/24)
  • Ceph cluster network: 1x 1Gbps, connected to a dedicated switch
Three questions:
  1. Can I actually set up the Ceph public network on two distinct network trunks? Or am I forced to use a meshed setup?
  2. Is 1Gbps enough for the Ceph cluster network (OSD replication + heartbeat according to the Proxmox VE Administration Guide)?
  3. Should I create a separate cluster network for corosync?
Thanks in advance!
 
If you post the contents of /etc/network/interfaces, it will be easiest to spot any problems in the network config :)

For the Ceph network: 2 NICs connected to different switches is a good idea, in case a switch fails or needs to be rebooted. In that case, consider setting up a bond using those two NICs; then there is no need for two different subnets. Placing the optional Ceph cluster network on a 1Gbit link is a bad idea: either don't use it at all, or also place it on the 10Gbit link. You can define a VLAN on top of the bond and configure the IP for the cluster network on it.
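Just to illustrate, a bond with the cluster network as a VLAN on top could look roughly like this in /etc/network/interfaces. This is only a sketch: the interface names, addresses and VLAN ID are made up, and the LACP (802.3ad) mode assumes your two switches support MLAG/stacking; otherwise active-backup works without any special switch support.

Code:
auto bond1
iface bond1 inet static
    bond-slaves enp65s0f0 enp65s0f1
    bond-miimon 100
    bond-mode 802.3ad
    address 10.0.0.1/24
    mtu 9000
#ceph public on the bond

auto bond1.100
iface bond1.100 inet static
    address 10.0.1.1/24
    mtu 9000
#ceph cluster as VLAN 100 on top of the bond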

The replication traffic between the OSDs is one of the larger parts of Ceph traffic and if you place it on a 1Gbit link you will severely limit the performance.

Corosync itself can switch between different networks if one becomes unavailable. You can configure up to 8 corosync links. The recommendation is to have at least one dedicated physical corosync link to avoid any problems that might occur if other services use the same link. If you want to use HA, it is of the utmost importance that corosync is working. Corosync is used in the PVE HA stack to determine if a node is still part of the cluster. If the corosync connection fails for too long (I think about 2 minutes), the node will fence itself -> like pushing the reset button. It does this to make sure that any HA guests on it are not running anymore and that it is safe to start them on the remaining nodes.
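For reference, the links end up in /etc/pve/corosync.conf roughly like this (a sketch; the node name and addresses are made up, ring0_addr corresponds to link0 and ring1_addr to link1):

Code:
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.50.11
    ring1_addr: 10.10.27.11
  }
  # further node entries follow, one per cluster member
}

If I remember correctly, pvecm create/add also accept --link0 and --link1, so the links can be set directly when creating or joining the cluster.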

Now if corosync shares its network with other services that might take up a lot of bandwidth from time to time, for example Ceph, backups, ..., then it can easily happen that these other services take up almost all the bandwidth, which causes the latency of the corosync packets to go up. With a bit of bad luck the latency gets so high that corosync considers the link unusable. If there is no other stable link available to switch to, it can happen that all nodes think they have lost the connection to the cluster, and what you see is the whole cluster restarting out of the blue.
 

Hi Aaron,
thank you very much! I haven't installed any node yet, but if I have any doubts I'll post /etc/network/interfaces for sure :)

Anyway here's what I'm going to do based on your suggestions:
  • VM public bridge: bond between two Gigabit interfaces, connected to distinct switches for redundancy
  • Management interface: unchanged
  • OOB (iLO): unchanged
  • Corosync:
    • one link on the public bond (so I have at least one redundant link)
    • one link on the management interface as a fallback (the management interface should never take up a lot of bandwidth anyway)
  • Ceph public network: bond between two 10Gbps interfaces, connected to distinct switches (I'm going to use SSDs for all OSDs, so I need enough bandwidth)
  • Ceph cluster network: VLAN on the 10Gbps bond
Do you think it's good enough?

Also, I'm considering setting up QinQ on my "public" switches so I can assign arbitrary VLANs to my customers and have their packets tagged with my public (outer) VLAN in order to reach the gateway, see the sketch below. Do you think it's a good idea overall?
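Roughly what I have in mind for the QinQ part, loosely based on the QinQ example in the Proxmox network documentation (just a sketch: the bridge names and outer VLAN ID 100 are made up, and the reduced MTU on the inner bridge assumes the physical MTU stays at 1500, since the extra tag costs 4 bytes):

Code:
auto vmbr1
iface vmbr1 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
#VLAN-aware bridge on the public bond, carries the outer (service) tags

auto vmbr1.100
iface vmbr1.100 inet manual
#outer VLAN 100 for one customer zone

auto zone100
iface zone100 inet manual
    bridge-ports vmbr1.100
    bridge-stp off
    bridge-fd 0
    mtu 1496
#customer-facing bridge; guests attach here and can use their own inner VLAN tags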

Thank you again!
 
Hi,

here's my /etc/network/interfaces:

Code:
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface eno1 inet manual

auto eno2
iface eno2 inet manual

auto eno3
iface eno3 inet manual

iface eno4 inet manual

auto eno49
iface eno49 inet static
    address 172.27.0.1/24
    mtu 9000
#ceph public

auto eno50
iface eno50 inet static
    address 172.28.0.1/24
    mtu 9000
#ceph cluster

auto bond0
iface bond0 inet manual
    bond-slaves eno2 eno3
    bond-miimon 100
    bond-mode balance-rr

auto vmbr0
iface vmbr0 inet static
    address 10.10.27.1/20
    gateway 10.10.16.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0

auto vmbr1
iface vmbr1 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
#VM traffic

As you can see, I have vmbr1 on top of bond0, which I use for VM traffic, and two separate 10G networks for Ceph public and Ceph cluster, each of them on a dedicated switch.

My only problem so far: what if the switch for the Ceph public network goes down? I tested rebooting it and (as expected) all I/O completely stopped.

Could I add both 10GbE interfaces to a bond and then separate the traffic with VLANs (something like the sketch below)? Or maybe just make the bond and let the cluster traffic use the doubled bandwidth?
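Something like this is what I'd try for the two 10G ports, just as a sketch (active-backup because my two Ceph switches are independent, so no LACP across them; VLAN ID 50 is arbitrary):

Code:
auto bond1
iface bond1 inet static
    bond-slaves eno49 eno50
    bond-miimon 100
    bond-mode active-backup
    address 172.27.0.1/24
    mtu 9000
#ceph public on the bond

auto bond1.50
iface bond1.50 inet static
    address 172.28.0.1/24
    mtu 9000
#ceph cluster as VLAN 50 on top of the bond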

Thank you!
 