[SOLVED] Proxmox Ceph Redundant Network Setup help required

Mario Minati

Active Member
Jun 11, 2018
Hello @all,

We are new to Proxmox. Currently we use Univention Corporate Server to virtualise 15 machines on 3 physical servers. We lack shared storage and HA, therefore we would like to set up a Proxmox cluster with 5 physical machines: 3 identically configured machines for Ceph and 2 machines for virtualisation.

We have read a lot of posts in the forum, the wiki, and the docs, and think we have enough background to start setting things up. We would like to ask whether our network setup is suitable for the goals of high availability and redundancy.

Attached you will find a diagram of our network setup:
- 2 separate 1 GbE networks for corosync ring 0 and ring 1 with separate switches, of which we use 1 network for management (external access to the Proxmox web interfaces and lights-out management)
- 2 separate 10 GbE networks as Ceph public network with separate switches and use of bonding
- 2 separate 10 GbE networks as Ceph cluster network with separate switches and use of bonding
- 1 separate 1 GbE network to access the virtual machines from the outside (DMZ / intranet)

Questions:
- Is this suitable for redundancy?
- Is this suitable for good performance?
- Is the selected bond_mode (balance_rr) OK for use in a configuration with separate switches, and will it also achieve good performance?

Thanks for your suggestions!

Best regards,

Mario Minati
 

Attachments

  • 2018-06-12 Netzwerkstruktur - Vereinfacht.pdf
70.2 KB
- 2 separate 1 GbE networks for corosync ring 0 and ring 1 with separate switches, of which we use 1 network for management (external access to the Proxmox web interfaces and lights-out management)
+1 :thumbsup: Depending on the needs of your backup traffic (backups/ISOs/templates), you may need more than 1 GbE.

- 2 separate 10 GbE networks as Ceph public network with separate switches and use of bonding
- 2 separate 10 GbE networks as Ceph cluster network with separate switches and use of bonding
Looks good on the redundancy level, but check the latency; Ceph is very sensitive to latency, and the lower, the better.

- 1 separate 1 GbE network to access the virtual machines from the outside (DMZ / intranet)
Is 1 GbE sufficient, even for peak traffic? And this link isn't redundant, is it?

- Is this suitable for redundancy?
- Is this suitable for good performance?
- Is the selected bond_mode (balance_rr) OK for use in a configuration with separate switches, and will it also achieve good performance?
First, +1 for the nice network diagram. For redundancy, see my comments above. The balance_rr mode will send TCP packets out of order as traffic increases; this triggers retransmits and will stall your Ceph network. Better to use active-backup or LACP. For connecting two switches you may need MLAG. An "easier" setup might be to use 2x10 GbE links to both switches, separate Ceph's public and cluster networks through VLANs, and build an active-backup bond per network that uses a different primary link, up to a switch failure. On failure, both networks would be put on one link. This still keeps redundancy, but you don't need MLAG or any other method for inter-switch LACP.
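For illustration only, here is a minimal /etc/network/interfaces sketch of such an LACP bond (interface names and the address are made-up examples, not from this thread; 802.3ad also requires a matching LACP/MLAG configuration on the switch side):

auto bond0
iface bond0 inet static
address 10.10.10.11
netmask 255.255.255.0
slaves enp8s0f0 enp65s0f0
bond_miimon 100
bond_mode 802.3ad
bond_xmit_hash_policy layer3+4
#single LACP bond across both switches (needs MLAG/stacking)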
 
Hello Alwin,

thanks for your advice. We have improved our network setup according to your suggestions (the new network diagram is attached):

- The management net is now separated from the corosync ring 0 net; we use additional 10 GbE network ports, so we can replace the existing 1 GbE network switch if we run into bandwidth problems on that network.

- To add redundancy and improve peak bandwidth to the outside world (1 GbE DMZ network) we use an additional 10 GbE network port in a bonding configuration with a 1 GbE port. Is bond_mode balance_rr suitable for that connection, or should it also be an active-backup bond?

- The bond mode for the Ceph private and public networks has been changed to active-backup, but we would still like to use separate switches, which should provide us with the desired redundancy, right? Personally I don't like using VLANs that much, as they add one more layer of complexity where we can make mistakes.

After setting up the Ceph private and public networks we will check latency with the test commands given in the docs... We expect low latency.

If you would like our kind of network diagram for the documentation, we can provide you with the LibreOffice Draw file. :)


Best regards,

Mario Minati
 

Attachments

  • 2018-06-15 Netzwerkstruktur - Vereinfacht.pdf
84.7 KB
- The management net is now separated from the corosync ring 0 net; we use additional 10 GbE network ports, so we can replace the existing 1 GbE network switch if we run into bandwidth problems on that network.
Good, the corosync separation will save your bacon. ;)

- To add redundancy and improve peak bandwidth to the outside world (1 GbE DMZ network) we use an additional 10 GbE network port in a bonding configuration with a 1 GbE port. Is bond_mode balance_rr suitable for that connection, or should it also be an active-backup bond?
As written in the post above, since packets may arrive out of order, the network card has the extra job of putting them all back in sequence. This may or may not work, depending on your application. If you have a 10 GbE connection, then an active-backup bond with the 10 GbE port as primary would give you not only more bandwidth but also lower latency.

- The bond mode for the Ceph private and public networks has been changed to active-backup, but we would still like to use separate switches, which should provide us with the desired redundancy, right? Personally I don't like using VLANs that much, as they add one more layer of complexity where we can make mistakes.
I guess my description was a little confusing. If you can afford the extra interfaces on both machines, then you don't need my idea. But if you want redundancy with no extra hardware, the idea is as follows.

| eth0.100 & eth1.100 => bond0 => primary (eth0) on switch1 (cluster)
| eth0.101 & eth1.101 => bond1 => primary (eth1) on switch2 (public)

I hope this illustrates what I meant. On failure, public & cluster both reside on the same bond member (the same physical port). In normal operation, they run separated.
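As a rough /etc/network/interfaces sketch of that scheme (VLAN IDs 100/101, addresses, and interface names are placeholders; both bonds are built from VLAN sub-interfaces of the same two physical ports, with opposite primaries):

auto bond0
iface bond0 inet static
address 10.10.100.11
netmask 255.255.255.0
slaves eth0.100 eth1.100
bond_miimon 100
bond_mode active-backup
bond_primary eth0.100
#ceph cluster network, VLAN 100, normally on switch1

auto bond1
iface bond1 inet static
address 10.10.101.11
netmask 255.255.255.0
slaves eth0.101 eth1.101
bond_miimon 100
bond_mode active-backup
bond_primary eth1.101
#ceph public network, VLAN 101, normally on switch2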

After setting up the Ceph private and public networks we will check latency with the test commands given in the docs... We expect low latency.
You may compare your results with our benchmarks and with the results from other users in the thread:
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/
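A simple round-trip check against another node's Ceph address could look like this (run as root; the target IP is a placeholder):

# 1000 pings at 10 ms interval, summary only; watch the min/avg/max rtt
ping -q -c 1000 -i 0.01 10.246.31.12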

If you would like our kind of network diagram for the documentation, we can provide you with the LibreOffice Draw file. :)
Thanks for the offer, but I must decline. It is definitely a good reference for the discussion, though.
 
Hello @Alwin,
hello @all,

we have set up the hardware according to our plans (see the attached network scheme) and are having trouble testing the bonded interfaces.
If we run a ping from one Proxmox node to another over a bonded interface and pull the cable out of one of the two NICs, the ping doesn't recover.
Even if we reattach the cable, the ping doesn't resume.
Only after reattaching the cable to NIC1 and disconnecting NIC2 does the ping command receive answers again.

We observe the same behaviour when watching the quorum state of the Ceph network in the Proxmox web interface and pulling one of the two bonded NICs of the Ceph internal network.

What could be wrong with our configuration? Below you find the network configuration of the three nodes.

#
# pub-ceph-node-01
#

auto lo
iface lo inet loopback

auto enp66s0f1
iface enp66s0f1 inet static
address 10.247.11.11
netmask 255.255.0.0
gateway 10.247.1.1
#man.pub.intranet

auto eno1
iface eno1 inet static
address 10.246.11.11
netmask 255.255.255.0
#sync1.pub.intranet

auto eno2
iface eno2 inet static
address 10.246.12.11
netmask 255.255.255.0
#sync2.pub.intranet

iface enp8s0f0 inet manual
#san1.pub.intranet

iface enp8s0f1 inet manual
#ceph1.pub.intranet

iface enp65s0f0 inet manual
#san2.pub.intranet

iface enp65s0f1 inet manual
#ceph2.pub.intranet

iface enp66s0f0 inet manual

auto bond0
iface bond0 inet static
address 10.246.21.11
netmask 255.255.255.0
slaves enp65s0f0 enp8s0f0
bond_miimon 100
bond_mode active-backup
#san.pub.intranet

auto bond1
iface bond1 inet static
address 10.246.31.11
netmask 255.255.255.0
slaves enp65s0f1 enp8s0f1
bond_miimon 100
bond_mode active-backup
#ceph.pub.intranet



#
# pub-ceph-node-02
#

auto lo
iface lo inet loopback

auto enp66s0f1
iface enp66s0f1 inet static
address 10.247.11.12
netmask 255.255.0.0
gateway 10.247.1.1
#man.pub.intranet

auto eno1
iface eno1 inet static
address 10.246.11.12
netmask 255.255.255.0
#sync1.pub.intranet

auto eno2
iface eno2 inet static
address 10.246.12.12
netmask 255.255.255.0
#sync2.pub.intranet

iface enp65s0f0 inet manual
#san2.pub.intranet

iface enp65s0f1 inet manual
#ceph2.pub.intranet

iface enp66s0f0 inet manual

iface enp8s0f0 inet manual
#san1.pub.intranet

iface enp8s0f1 inet manual
#ceph1.pub.intranet

auto bond0
iface bond0 inet static
address 10.246.21.12
netmask 255.255.255.0
slaves enp65s0f0 enp8s0f0
bond_miimon 100
bond_mode active-backup
#san.pub.intranet

auto bond1
iface bond1 inet static
address 10.246.31.12
netmask 255.255.255.0
slaves enp65s0f1 enp8s0f1
bond_miimon 100
bond_mode active-backup
#ceph.pub.intranet



#
# pub-ceph-node-03
#

auto lo
iface lo inet loopback

auto enp66s0f1
iface enp66s0f1 inet static
address 10.247.11.13
netmask 255.255.0.0
gateway 10.247.1.1
#man.pub.intranet

auto eno1
iface eno1 inet static
address 10.246.11.13
netmask 255.255.255.0
#sync1.pub.intranet

auto eno2
iface eno2 inet static
address 10.246.12.13
netmask 255.255.255.0
#sync2.pub.intranet

iface enp8s0f0 inet manual
#san1.pub.intranet

iface enp8s0f1 inet manual
#ceph1.pub.intranet

iface enp65s0f0 inet manual
#san2.pub.intranet

iface enp65s0f1 inet manual
#ceph2.pub.intranet

iface enp66s0f0 inet manual

auto bond0
iface bond0 inet static
address 10.246.21.13
netmask 255.255.255.0
slaves enp65s0f0 enp8s0f0
bond_miimon 100
bond_mode active-backup
#san.pub.intranet

auto bond1
iface bond1 inet static
address 10.246.31.13
netmask 255.255.255.0
slaves enp65s0f1 enp8s0f1
bond_miimon 100
bond_mode active-backup
#ceph.pub.intranet


Any tips on which direction we should investigate are very welcome.

Best regards,
Mario
 
You can see the state of the bond at '/proc/net/bonding/bondX'. Secondly, you may want to set the primary slave interface, so the bond will switch back when the interface comes back online.
primary
A string (eth0, eth2, etc) specifying which slave is the primary device. The specified device will always be the active slave while it is available. Only when the primary is off-line will alternate devices be used. This is useful when one slave is preferred over another, e.g., when one slave has higher throughput than another. The primary option is only valid for active-backup mode.
https://wiki.linuxfoundation.org/networking/bonding
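Applied to your bond1, setting the primary could look like this (a sketch; bond_primary follows the option style already used in your config):

auto bond1
iface bond1 inet static
address 10.246.31.11
netmask 255.255.255.0
slaves enp65s0f1 enp8s0f1
bond_miimon 100
bond_mode active-backup
bond_primary enp65s0f1
#ceph.pub.intranet, prefer enp65s0f1 whenever its link is up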

Last but not least, check your networking with each interface, to have all the scenarios covered. Any routing issues?
 
Hello Alwin,

thanks for the quick reply.

You can see the state of the bond at '/proc/net/bonding/bondX'. Secondly, you may want to set the primary slave interface, so the bond will switch back when the interface comes back online.

I know about them and can set the primary interface manually in /etc/network/interfaces; it can't be set via the GUI, right?

But I still have trouble understanding active-backup bonding after reading the bonding docs more than once:

If I pull, e.g., node01's Ceph cluster or Ceph public network connection off the switch, the ping command stops; it doesn't recover even after waiting a while. My understanding of active-backup bonding mode is that the connection will be recovered automatically, isn't it?

As we are using separate switches for both networks (physical separation), do all nodes on the network have to switch to the other network? Otherwise the disconnected node (on one of the two bonded ports) will not receive packets on the net from which it was separated, right?

Is the configuration with active-backup bonding perhaps not correct in our case? Do we have to switch to broadcast?

Sorry for the many questions, as my knowledge of network bonding is not very deep... I hope you can enlighten me and guide us to a really redundant setup.


Best regards,

Mario Minati
 
I know about them and can set the primary interface manually in /etc/network/interfaces; it can't be set via the GUI, right?
Not through the GUI.

If I pull, e.g., node01's Ceph cluster or Ceph public network connection off the switch, the ping command stops; it doesn't recover even after waiting a while. My understanding of active-backup bonding mode is that the connection will be recovered automatically, isn't it?

As we are using separate switches for both networks (physical separation), do all nodes on the network have to switch to the other network? Otherwise the disconnected node (on one of the two bonded ports) will not receive packets on the net from which it was separated, right?
In active-backup mode, the failed link is replaced by one of the other slaves in the bond. But this happens only on the bond with the failed link; all other network traffic stays untouched. This means your network needs to be able to route traffic to each interface on the other nodes. With two switches, the two need to be connected through a trunk.

active-backup or 1

Active-backup policy: Only one slave in the bond is
active. A different slave becomes active if, and only
if, the active slave fails. The bond's MAC address is
externally visible on only one port (network adapter)
to avoid confusing the switch.
https://www.kernel.org/doc/Documentation/networking/bonding.txt
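To verify which path the traffic actually takes during your pull-the-cable tests, check the active slave on each node, e.g.:

# show the currently active slave and the per-slave link state
grep -E "Currently Active Slave|MII Status" /proc/net/bonding/bond1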

Is the configuration with active-backup bonding perhaps not correct in our case? Do we have to switch to broadcast?
You don't want to flood your network.

Scenarios with an active-backup bond:
  • One switch dies: all connected bonds switch to the remaining connected slave (traffic happens completely on the other switch).
  • A link of a NIC port in the bond fails: the bond switches to its remaining connected slave (traffic runs through both switches).
  • A NIC dies: if the bond spans two different NICs it can switch; otherwise the node goes dark.
 
Hello Alwin,

the trunk cable between the two switches was the missing link. After connecting the switches, the redundancy worked fine.

Thank you very much for your help!


Best regards,

Mario Minati
 
