Ceph in Mesh network, fault tolerance

Stefano
I'm following the Full Mesh guide, method 2 (routing, not broadcast), and everything works.
I want to add fault tolerance, to handle cable/NIC port failures.

At first, I thought of using bonding: I have 3 nodes with 4 10Gb ports each, and I connected each pair of nodes with 2 bonded cables. It works, but I get 10%-50% packet loss if one of the two connections fails. Also, I found a Red Hat document that states that bonding without a switch is an unsupported method, highly dependent on the NIC hardware.

The other option I'm thinking of is to use routing to reach each node through the "other" node if the direct connection fails. I mean:

NODEA: 10.10.2.10
NODEB: 10.10.2.11
NODEC: 10.10.2.12
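For reference, this is how the direct links on NODEA are wired (taken from the routes and interface comments below; I assume the cabling is mirrored on NODEB and NODEC):
Code:
# NODEA point-to-point links
#   enp24s0f0  <-->  NODEB (10.10.2.11)
#   eno1       <-->  NODEC (10.10.2.12)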



Code:
root@NODEA:~# ip a
[...]
6: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether ac:1f:6b:ba:a4:7a brd ff:ff:ff:ff:ff:ff
    inet 10.10.2.10/24 brd 10.10.2.255 scope global eno1
       valid_lft forever preferred_lft forever
    inet6 fe80::ae1f:6bff:feba:a47a/64 scope link
       valid_lft forever preferred_lft forever
[...]
8: enp24s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether ac:1f:6b:ba:a4:78 brd ff:ff:ff:ff:ff:ff
    inet 10.10.2.10/24 brd 10.10.2.255 scope global enp24s0f0
       valid_lft forever preferred_lft forever
    inet6 fe80::ae1f:6bff:feba:a478/64 scope link
       valid_lft forever preferred_lft forever
[...]
Code:
root@NODEA:~# ip route add 10.10.2.11/32 nexthop  dev enp24s0f0  weight 10 nexthop dev eno1 weight 1
root@NODEA:~# ip r
[...]
10.10.2.11
        nexthop dev enp24s0f0 weight 10
        nexthop dev eno1 weight 1
10.10.2.12 dev eno1 scope link
I enable IP forwarding on NODEC:
Code:
root@NODEC:~# echo 1 > /proc/sys/net/ipv4/ip_forward
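(To make this persistent across reboots, a sysctl drop-in along these lines should do; the file name is just an example:)
Code:
root@NODEC:~# echo "net.ipv4.ip_forward = 1" > /etc/sysctl.d/99-ip-forward.conf
root@NODEC:~# sysctl -p /etc/sysctl.d/99-ip-forward.conf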
I ping NODEB from NODEA:
Code:
root@NODEA:~# ping 10.10.2.11
PING 10.10.2.11 (10.10.2.11) 56(84) bytes of data.
64 bytes from 10.10.2.11: icmp_seq=1 ttl=64 time=0.156 ms
I ping NODEB from NODEC:
Code:
root@NODEC:~# ping 10.10.2.11
PING 10.10.2.11 (10.10.2.11) 56(84) bytes of data.
64 bytes from 10.10.2.11: icmp_seq=1 ttl=64 time=0.208 ms
Bring down the primary NODEA-NODEB connection:
Code:
root@NODEA:~# ip link set enp24s0f0 down
Now I would like NODEA to reach NODEB through NODEC, but it doesn't work:
Code:
root@NODEA:~# ping 10.10.2.11
PING 10.10.2.11 (10.10.2.11) 56(84) bytes of data.
^C
--- 10.10.2.11 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
All advice welcome.
 

Alwin (Proxmox Staff Member)
At first, I thought of using bonding: I have 3 nodes with 4 10Gb ports each, and I connected each pair of nodes with 2 bonded cables. It works, but I get 10%-50% packet loss if one of the two connections fails. Also, I found a Red Hat document that states that bonding without a switch is an unsupported method, highly dependent on the NIC hardware.
Yes, but it depends on the bond mode too.

The other option I'm thinking of is to use routing to reach each node through the "other" node if the direct connection fails. I mean:
You need to allow ip forwarding, otherwise the packets will just be dropped.
 
Stefano
Hi Alwin, thanks for your answer.

Yes, but it depends on the bond mode too.
I tried balance-rr, balance-alb, and active-backup. But I don't want to insist on this, as the routing method seems more elegant to me and only needs two NIC ports per server.

You need to allow ip forwarding, otherwise the packets will just be dropped.
I had already enabled it on NODEC (the "middle" one); I also tried enabling it on NODEA (the "pinging" one), but it still doesn't work.

I tried to capture ICMP traffic on NODEC (tcpdump -i eno2 -n icmp), but it doesn't receive anything; it seems that pings don't go out through the lower-weight interface of NODEA once the higher-weight interface is down.
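To see which path the kernel actually picks, and whether it has noticed the dead link, something like this might help (the ignore_routes_with_linkdown sysctl is my guess at what matters here; if it is 0, I believe the kernel can keep selecting a nexthop whose carrier is gone):
Code:
root@NODEA:~# ip route get 10.10.2.11      # which nexthop is chosen right now
root@NODEA:~# ip r                         # dead nexthops may be flagged "linkdown"
root@NODEA:~# sysctl net.ipv4.conf.all.ignore_routes_with_linkdown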
 

Alwin (Proxmox Staff Member)
What does the complete routing table look like (ip route)?
 
Stefano
What does the complete routing table look like (ip route)?
This is the routing table of NODEA:
Code:
root@NODEA:/etc/network# ip r
default via 10.10.1.254 dev vmbr0 onlink
10.10.1.0/24 dev vmbr0 proto kernel scope link src 10.10.1.10
10.10.2.0/24 dev eno1 proto kernel scope link src 10.10.2.10
10.10.2.0/24 dev enp24s0f0 proto kernel scope link src 10.10.2.10
10.10.2.11
        nexthop dev enp24s0f0 weight 10
        nexthop dev eno1 weight 1
10.10.2.12 dev eno1 scope link
10.10.3.0/24 dev bond1 proto kernel scope link src 10.10.3.10 linkdown
And this is the interfaces file:
Code:
auto lo
iface lo inet loopback

auto enp24s0f0
iface enp24s0f0 inet static
        address  10.10.2.10
        netmask  24
#       up ip route add 10.10.2.11/32 dev enp24s0f0
        up  ip route add 10.10.2.11/32 nexthop  dev enp24s0f0  weight 10 nexthop dev eno1 weight 1
        down ip route del 10.10.2.11
#10GB Ceph Sync NodeB

auto eno1
iface eno1 inet static
        address  10.10.2.10
        netmask  24
        up ip route add 10.10.2.12/32 dev eno1
        down ip route del 10.10.2.12
#10GB Ceph Sync NodeC

iface enp101s0f0 inet manual
#Backup

iface enp101s0f1 inet manual
#Backup

iface enp101s0f2 inet manual
#Public

iface enp101s0f3 inet manual
#Public

auto bond1
iface bond1 inet static
        address 10.10.3.10
        netmask 24
        bond-slaves enp101s0f0 enp101s0f1
        bond-miimon 100
        bond-mode active-backup
#Backup / Migration / Corosync

auto bond0
iface bond0 inet manual
        bond-slaves enp101s0f2 enp101s0f3
        bond-miimon 100
        bond-mode active-backup
#Public

auto vmbr0
iface vmbr0 inet static
        address  10.10.1.10
        netmask  255.255.255.0
        gateway  10.10.1.254
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
 

Alwin (Proxmox Staff Member)
10.10.2.0/24 dev eno1 proto kernel scope link src 10.10.2.10
10.10.2.0/24 dev enp24s0f0 proto kernel scope link src 10.10.2.10
I suppose these entries take priority. Set the netmask to /32 for 10.10.2.10 and add the 'up ip route' lines to each interface specifically.

For Example (from the top of my head):
Code:
auto enp24s0f0
iface enp24s0f0 inet static
        address  10.10.2.10
        netmask  32       
        up ip route add 10.10.2.11/32 dev enp24s0f0
        up  ip route add 10.10.2.12/32 nexthop  dev enp24s0f0 weight 10
        down ip route del 10.10.2.11/32 dev enp24s0f0
        down ip route del 10.10.2.12/32 dev enp24s0f0
#10GB Ceph Sync NodeB

auto eno1
iface eno1 inet static
        address  10.10.2.10
        netmask  32
        up ip route add 10.10.2.12/32 dev eno1
        up  ip route add 10.10.2.11/32 nexthop  dev eno1 weight 10
        down ip route del 10.10.2.11/32 dev eno1
        down ip route del 10.10.2.12/32 dev eno1
#10GB Ceph Sync NodeC
 
Stefano
I'm almost there.

Once all links are up, I can issue these commands on NODEA and NODEB:
Code:
root@NODEA:~#ip route add 10.10.2.11 nexthop dev enp24s0f0  weight 2 nexthop via 10.10.2.12
root@NODEB:~#ip route add 10.10.2.10 nexthop dev enp24s0f0  weight 2 nexthop via 10.10.2.12
And everything works.
The problem now is to put it in the interfaces file.
If I put the full command in the 'up' of both interfaces, I get errors at boot when the first interface comes up, because the "other" interface is not up yet. Also, if one of the two interfaces is not working, the route never gets added this way:
Code:
auto enp24s0f0
iface enp24s0f0 inet static
        address  10.10.2.11
        netmask  24
        up ip route add 10.10.2.10/32 nexthop dev enp24s0f0 weight 2 nexthop via 10.10.2.12
        up ip route add 10.10.2.12/32 nexthop dev eno1 weight 2 nexthop via 10.10.2.12
#        down ip route del 10.10.2.10
#10GB Ceph Sync NodeB

auto eno1
iface eno1 inet static
        address  10.10.2.11
        netmask  24
        up ip route add 10.10.2.10/32 nexthop dev enp24s0f0 weight 2 nexthop via 10.10.2.12
        up ip route add 10.10.2.12/32 nexthop dev eno1 weight 2 nexthop via 10.10.2.12
#        down ip route del 10.10.2.12
#10GB Ceph Sync NodeC
If I try to split the command in two, which would be the cleanest approach, the second one fails, probably because a route for that destination already exists:
Code:
root@NODEB:~# ip route add 10.10.2.10 nexthop dev enp24s0f0 weight 2
root@NODEB:~# ip route add 10.10.2.10 nexthop via 10.10.2.12
RTNETLINK answers: File exists
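One way around the "File exists" error, I think, would be to rewrite the whole multipath route in a single command instead of adding nexthops one by one; a sketch:
Code:
root@NODEB:~# ip route replace 10.10.2.10 nexthop dev enp24s0f0 weight 2 nexthop via 10.10.2.12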
In the end, I could create a script in if-up.d like this, if nothing else works:
Code:
#!/bin/sh
# sketch for NODEB: prefer the direct link, fall back through the third node
# ("replace" instead of "add" so re-running the script doesn't fail with "File exists")
up() { [ "$(cat /sys/class/net/"$1"/operstate 2>/dev/null)" = "up" ]; }

if up eno1 && up enp24s0f0; then
        ip route replace 10.10.2.10/32 nexthop dev enp24s0f0 weight 2 nexthop via 10.10.2.12
        ip route replace 10.10.2.12/32 nexthop dev eno1 weight 2 nexthop via 10.10.2.10
elif up eno1; then
        ip route replace 10.10.2.10/32 via 10.10.2.12
        ip route replace 10.10.2.12/32 dev eno1
elif up enp24s0f0; then
        ip route replace 10.10.2.12/32 via 10.10.2.10
        ip route replace 10.10.2.10/32 dev enp24s0f0
fi
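If going this way, the script would live in /etc/network/if-up.d/ (and /etc/network/if-post-down.d/ for the teardown), be executable, and could look at the IFACE environment variable that ifupdown passes to each hook. A sketch of the install step, with a hypothetical script name:
Code:
# hypothetical file name; ifupdown runs everything executable in these directories
install -m 0755 ceph-mesh-routes /etc/network/if-up.d/ceph-mesh-routes
install -m 0755 ceph-mesh-routes /etc/network/if-post-down.d/ceph-mesh-routes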
I don't like it very much, and I'm trying to find a solution which could be good enough to be used in your ceph mesh documentation.

Thanks
Stefano
 

Alwin (Proxmox Staff Member)
The netmask of the interfaces is still /24, so the kernel creates a route for the whole network on both interfaces. That could have an impact.
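A possible alternative sketch, assuming the /24 addresses are kept (untested): remove the kernel-created network route per interface so only the explicit host routes remain:
Code:
# hypothetical addition, one line per stanza, untested
# in the enp24s0f0 stanza
        up ip route del 10.10.2.0/24 dev enp24s0f0
# in the eno1 stanza
        up ip route del 10.10.2.0/24 dev eno1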
 
Stefano
After creating a script in if-up.d and if-post-down.d, and banging my head against various exceptions, I settled on balance-rr bonding, even if it doesn't work very well when bringing down an interface. I'll see how it behaves when physically disconnecting a cable.
This is my final configuration:
Code:
auto bond2
iface bond2 inet static
        address  10.10.2.10
        netmask  24
        bond-slaves enp24s0f0 enp24s0f1
        bond-miimon 100
        bond-mode balance-rr
        up ip route add 10.10.2.11/32 dev bond2
        down ip route del 10.10.2.11
#Ceph Sync NodeB
auto bond3
iface bond3 inet static
        address  10.10.2.10
        netmask  24
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode balance-rr
        up ip route add 10.10.2.12/32 dev bond3
        down ip route del 10.10.2.12
#Ceph Sync NodeC
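To check how the bonds react when a link goes away (e.g. when pulling a cable), the kernel's view of each bond can be inspected; a quick sketch, assuming the bond and NIC names above:
Code:
# MII status, active slave and link-failure counters per bond
cat /proc/net/bonding/bond2
cat /proc/net/bonding/bond3
# per-slave traffic/error counters while testing
ip -s link show enp24s0f0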
Thanks
Stefano
 
