Ceph in Mesh network, fault tolerance

Stefano Giunchi

I'm following the Full Mesh guide, method 2 (routing, not broadcast), and everything works.
I want to add fault tolerance, to handle cable/NIC port failures.

At first I thought of using bonding: I have 3 nodes with four 10Gb ports each, and I connected each pair of nodes with 2 bonded cables. It works, but I see 10%-50% packet loss if one of the two links fails. Also, I found a Red Hat document that states that bonding without a switch is an unsupported method, highly dependent on the NIC hardware.

The other option I'm considering is to use routing, so that each node can reach the others through the "other" node if the primary connection fails. I mean:

NODEA: 10.10.2.10
NODEB: 10.10.2.11
NODEC: 10.10.2.12



Code:
root@NODEA:~# ip a
[...]
6: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether ac:1f:6b:ba:a4:7a brd ff:ff:ff:ff:ff:ff
    inet 10.10.2.10/24 brd 10.10.2.255 scope global eno1
       valid_lft forever preferred_lft forever
    inet6 fe80::ae1f:6bff:feba:a47a/64 scope link
       valid_lft forever preferred_lft forever
[...]
8: enp24s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether ac:1f:6b:ba:a4:78 brd ff:ff:ff:ff:ff:ff
    inet 10.10.2.10/24 brd 10.10.2.255 scope global enp24s0f0
       valid_lft forever preferred_lft forever
    inet6 fe80::ae1f:6bff:feba:a478/64 scope link
       valid_lft forever preferred_lft forever
[...]
Code:
root@NODEA:~# ip route add 10.10.2.11/32 nexthop  dev enp24s0f0  weight 10 nexthop dev eno1 weight 1
root@NODEA:~# ip r
[...]
10.10.2.11
        nexthop dev enp24s0f0 weight 10
        nexthop dev eno1 weight 1
10.10.2.12 dev eno1 scope link

I enable ip forwarding in NODEC:
Code:
root@NODEC:~# echo 1 > /proc/sys/net/ipv4/ip_forward
I ping NODEB from NODEA
Code:
root@NODEA:~# ping 10.10.2.11
PING 10.10.2.11 (10.10.2.11) 56(84) bytes of data.
64 bytes from 10.10.2.11: icmp_seq=1 ttl=64 time=0.156 ms
I ping NODEB from NODEC
Code:
root@NODEC:~# ping 10.10.2.11
PING 10.10.2.11 (10.10.2.11) 56(84) bytes of data.
64 bytes from 10.10.2.11: icmp_seq=1 ttl=64 time=0.208 ms
I bring down the primary NODEA-NODEB connection:
Code:
root@NODEA:~# ip link set enp24s0f0 down

Now I would expect NODEA to reach NODEB through NODEC, but it doesn't work:
Code:
root@NODEA:~# ping 10.10.2.11
PING 10.10.2.11 (10.10.2.11) 56(84) bytes of data.
^C
--- 10.10.2.11 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
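
If it helps with debugging, this should show which path the kernel would pick for that destination (output not pasted here):
Code:
root@NODEA:~# ip route get 10.10.2.11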

All advice welcome.
 
At first I thought of using bonding: I have 3 nodes with four 10Gb ports each, and I connected each pair of nodes with 2 bonded cables. It works, but I see 10%-50% packet loss if one of the two links fails. Also, I found a Red Hat document that states that bonding without a switch is an unsupported method, highly dependent on the NIC hardware.
Yes, but it depends on the bond mode too.

The other option I'm considering is to use routing, so that each node can reach the others through the "other" node if the primary connection fails. I mean:
You need to allow ip forwarding, otherwise the packets will just be dropped.
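To make it persistent across reboots, something like this in /etc/sysctl.conf should do (from the top of my head):
Code:
# /etc/sysctl.conf on the forwarding node (NODEC)
net.ipv4.ip_forward = 1
and then sysctl -p to apply it without a reboot.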
 
Hi Alwin, thanks for your answer.

Yes, but it depends on the bond mode too.
I tried balance-rr, balance-alb and active-backup. But I don't want to insist on this, as the routing method seems more elegant to me, and that way I only use two NIC ports per server.

You need to allow ip forwarding, otherwise the packets will just be dropped.
I had already enabled it on NODEC (the "middle" one); I tried enabling it on NODEA (the "pinging" one) too, but it still doesn't work.

I tried to capture ICMP traffic on NODEC (tcpdump -i eno2 -n icmp), but it doesn't receive anything; it seems that pings don't go out of the lower-weight interface of NODEA once the higher-weight interface is down.
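
Maybe I should also capture ARP together with ICMP, to see whether anything at all tries to go out that way, something like:
Code:
root@NODEC:~# tcpdump -i eno2 -n 'icmp or arp'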
 
What does the complete routing table look like (ip route)?
 
What does the complete routing table look like (ip route)?

This is the routing table of NODEA:
Code:
root@NODEA:/etc/network# ip r
default via 10.10.1.254 dev vmbr0 onlink
10.10.1.0/24 dev vmbr0 proto kernel scope link src 10.10.1.10
10.10.2.0/24 dev eno1 proto kernel scope link src 10.10.2.10
10.10.2.0/24 dev enp24s0f0 proto kernel scope link src 10.10.2.10
10.10.2.11
        nexthop dev enp24s0f0 weight 10
        nexthop dev eno1 weight 1
10.10.2.12 dev eno1 scope link
10.10.3.0/24 dev bond1 proto kernel scope link src 10.10.3.10 linkdown

And this is the interfaces file:
Code:
auto lo
iface lo inet loopback

auto enp24s0f0
iface enp24s0f0 inet static
        address  10.10.2.10
        netmask  24
#       up ip route add 10.10.2.11/32 dev enp24s0f0
        up  ip route add 10.10.2.11/32 nexthop  dev enp24s0f0  weight 10 nexthop dev eno1 weight 1
        down ip route del 10.10.2.11
#10GB Ceph Sync NodeB

auto eno1
iface eno1 inet static
        address  10.10.2.10
        netmask  24
        up ip route add 10.10.2.12/32 dev eno1
        down ip route del 10.10.2.12
#10GB Ceph Sync NodeC

iface enp101s0f0 inet manual
#Backup

iface enp101s0f1 inet manual
#Backup

iface enp101s0f2 inet manual
#Public

iface enp101s0f3 inet manual
#Public

auto bond1
iface bond1 inet static
        address 10.10.3.10
        netmask 24
        bond-slaves enp101s0f0 enp101s0f1
        bond-miimon 100
        bond-mode active-backup
#Backup / Migration / Corosync

auto bond0
iface bond0 inet manual
        bond-slaves enp101s0f2 enp101s0f3
        bond-miimon 100
        bond-mode active-backup
#Public

auto vmbr0
iface vmbr0 inet static
        address  10.10.1.10
        netmask  255.255.255.0
        gateway  10.10.1.254
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
 
10.10.2.0/24 dev eno1 proto kernel scope link src 10.10.2.10
10.10.2.0/24 dev enp24s0f0 proto kernel scope link src 10.10.2.10
I suppose these entries will have priority. Set the netmask to /32 for 10.10.2.10 and add the 'up ip route' to each interface specifically.

For Example (from the top of my head):
Code:
auto enp24s0f0
iface enp24s0f0 inet static
        address  10.10.2.10
        netmask  32       
        up ip route add 10.10.2.11/32 dev enp24s0f0
        up  ip route add 10.10.2.12/32 nexthop  dev enp24s0f0 weight 10
        down ip route del 10.10.2.11/32 dev enp24s0f0
        down ip route del 10.10.2.12/32 dev enp24s0f0
#10GB Ceph Sync NodeB

auto eno1
iface eno1 inet static
        address  10.10.2.10
        netmask  32
        up ip route add 10.10.2.12/32 dev eno1
        up  ip route add 10.10.2.11/32 nexthop  dev eno1 weight 10
        down ip route del 10.10.2.11/32 dev eno1
        down ip route del 10.10.2.12/32 dev eno1
#10GB Ceph Sync NodeC
 
It doesn't work:
Code:
root@NODEA:/etc/network# ifup enp24s0f0
root@NODEA:/etc/network# ifup eno1
RTNETLINK answers: File exists
ifup: failed to bring up eno1

It seems I can't add a second route to the same destination this way (even with a different dev); the second ip route add just fails with "File exists".
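Maybe ip route replace instead of ip route add could work around the "File exists" part, since replace overwrites an existing route to the same destination instead of failing, e.g. with both nexthops in a single command:
Code:
ip route replace 10.10.2.11/32 nexthop dev enp24s0f0 weight 10 nexthop dev eno1 weight 1
But that still means one command has to know about both interfaces, and it probably won't help if one of the devices isn't up yet.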
 
I'm almost there.

Once all links are up, I can issue these commands in NODEA and NODEB:
Code:
root@NODEA:~#ip route add 10.10.2.11 nexthop dev enp24s0f0  weight 2 nexthop via 10.10.2.12
root@NODEB:~#ip route add 10.10.2.10 nexthop dev enp24s0f0  weight 2 nexthop via 10.10.2.12

And everything works.
The problem now is putting it in the interfaces file.
If I put the full command in the up section of both interfaces, I get errors at boot when the first interface comes up, because the "other" interface is not up yet. Also, if one of the two interfaces is not working, the route never gets added this way:
Code:
auto enp24s0f0
iface enp24s0f0 inet static
        address  10.10.2.11
        netmask  24
        up ip route add 10.10.2.10/32 nexthop dev enp24s0f0 weight 2 nexthop via 10.10.2.12
        up ip route add 10.10.2.12/32 nexthop dev eno1 weight 2 nexthop via 10.10.2.12
#        down ip route del 10.10.2.10
#10GB Ceph Sync NodeB

auto eno1
iface eno1 inet static
        address  10.10.2.11
        netmask  24
        up ip route add 10.10.2.10/32 nexthop dev enp24s0f0 weight 2 nexthop via 10.10.2.12
        up ip route add 10.10.2.12/32 nexthop dev eno1 weight 2 nexthop via 10.10.2.12
#        down ip route del 10.10.2.12
#10GB Ceph Sync NodeC

If I try to split the command in two, which would be the cleanest option, the second one fails with "File exists", because a route to that destination already exists:
Code:
root@NODEB:~# ip route add 10.10.2.10 nexthop dev enp24s0f0 weight 2
root@NODEB:~# ip route add 10.10.2.10 nexthop via 10.10.2.12
RTNETLINK answers: File exists

In the end, I could create a script in if-up.d like this, if nothing else works:
Code:
#!/bin/sh
# NODEB (10.10.2.11): NODEA = 10.10.2.10 via enp24s0f0, NODEC = 10.10.2.12 via eno1
ENO1=$(cat /sys/class/net/eno1/operstate)
ENP=$(cat /sys/class/net/enp24s0f0/operstate)
if [ "$ENO1" = "up" ] && [ "$ENP" = "up" ]; then
        ip route replace 10.10.2.10/32 nexthop dev enp24s0f0 weight 2 nexthop via 10.10.2.12
        ip route replace 10.10.2.12/32 nexthop dev eno1 weight 2 nexthop via 10.10.2.10
elif [ "$ENO1" = "up" ]; then
        ip route replace 10.10.2.12/32 nexthop dev eno1
        ip route replace 10.10.2.10/32 via 10.10.2.12
elif [ "$ENP" = "up" ]; then
        ip route replace 10.10.2.10/32 nexthop dev enp24s0f0
        ip route replace 10.10.2.12/32 via 10.10.2.10
fi

I don't like it very much, and I'm trying to find a solution that would be good enough to be included in your Ceph mesh documentation.

Thanks
Stefano
 
The netmask on the interfaces is still /24; that will create the connected route for the whole network on both interfaces. That could have an impact.
 
After creating a script in if-up.d and if-post-down.d, and banging my head on various exceptions, I settled on using balance-rr bonding, even if it doesn't work very well when bringing down an interface with ifdown. I'll see how it behaves when physically disconnecting a cable.
This is my final configuration:
Code:
auto bond2
iface bond2 inet static
        address  10.10.2.10
        netmask  24
        bond-slaves enp24s0f0 enp24s0f1
        bond-miimon 100
        bond-mode balance-rr
        up ip route add 10.10.2.11/32 dev bond2
        down ip route del 10.10.2.11
#Ceph Sync NodeB
auto bond3
iface bond3 inet static
        address  10.10.2.10
        netmask  24
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode balance-rr
        up ip route add 10.10.2.12/32 dev bond3
        down ip route del 10.10.2.12
#Ceph Sync NodeC

Thanks
Stefano
 
Just an update: the bond with balance-rr works perfectly if I disconnect a cable.
I had problems when bringing an interface down with ifdown (I lost 50% of the pings), but if I physically disconnect the cable, all traffic is correctly carried by the still-connected interface of the bond.
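For what it's worth, the unplugged slave should also show up as down in the bond status, e.g.:
Code:
root@NODEA:~# cat /proc/net/bonding/bond2
(the MII Status of that slave goes to "down" there).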
 
