3 node cluster replication error (networking)

jaykavathe

Member
Feb 25, 2021
Spent a while today learning a bit about networking and traffic, and set up 3 different networks on my 3-node cluster.
Each of the 3 nodes has 2x 1GbE and 2x 10GbE ports.

I used eno3 for management, eno4 for the cluster, and eno1/eno2 for a point-to-point full mesh. Now when I try to start replication, I get "-get_migration_ip' failed: exit code 255".

I am quite sure this is because I can't ping the nodes on their full-mesh addresses. Can someone help me figure out the error? If I recall correctly, I was able to ping the nodes on their 10.15.15.x addresses earlier, and that stopped working after I created a Linux bond in "broadcast" mode.

192.168.1.201/24 (202/203) - Management network/IP
192.168.30.201/24 (202/203) - Cluster network/IP
10.15.15.51/32 (52/53) - Full-mesh 10GbE network/IP for migration

Code:
root@myst1:~# ip route
default via 192.168.1.1 dev vmbr0 proto kernel onlink
10.15.15.52 nhid 21 via 10.15.15.52 dev eno1 proto openfabric metric 20 onlink
10.15.15.53 nhid 24 via 10.15.15.53 dev eno2 proto openfabric metric 20 onlink
192.168.1.0/24 dev vmbr0 proto kernel scope link src 192.168.1.201
192.168.30.0/24 dev eno4 proto kernel scope link src 192.168.30.201

Code:
#/etc/pve/network/interfaces.d
auto lo
iface lo inet loopback

iface eno3 inet manual

auto eno4
iface eno4 inet static
        address 192.168.30.201/24
#cluster

auto eno1
iface eno1 inet static
        mtu 9000

auto eno2
iface eno2 inet static
        mtu 9000

auto bond0
iface bond0 inet static
        address 10.15.15.51/32
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode broadcast
#Full Mesh

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.201/24
        gateway 192.168.1.1
        bridge-ports eno3
        bridge-stp off
        bridge-fd 0
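
As a sanity check on the bond itself, the kernel's bonding driver can confirm the mode and the enslaved NICs (standard Linux checks; a sketch, not output from these nodes):

Code:
# The bonding driver reports the active mode and slaves; for
# bond-mode broadcast it should read "fault-tolerance (broadcast)"
# with both eno1 and eno2 listed as slave interfaces.
cat /proc/net/bonding/bond0

# Confirm the 10.15.15.51/32 address actually landed on bond0.
ip -br addr show bond0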

Code:
#/etc/frr/frr.conf
frr defaults traditional
hostname myst1
log syslog warning
ip forwarding
no ipv6 forwarding
service integrated-vtysh-config
!
interface lo
 ip address 10.15.15.51/32
 ip router openfabric 1
 openfabric passive
!
interface eno1
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
!
interface eno2
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
!
line vty
!
router openfabric 1
 net 49.0001.1111.1111.1111.00
 lsp-gen-interval 1
 max-lsp-lifetime 600
 lsp-refresh-interval 180

Code:
root@myst1:~# vtysh

Hello, this is FRRouting (version 8.5.2).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

myst1# show openfabric neighbor
Area 1:
  System Id           Interface   L  State        Holdtime SNPA
 myst2               eno1        2  Up            2        2020.2020.2020
 myst3               eno2        2  Up            2        2020.2020.2020
myst1# show openfabric route
Area 1:
IS-IS L2 IPv4 routing table:

 Prefix          Metric  Interface  Nexthop      Label(s)
 ----------------------------------------------------------
 10.15.15.51/32  0       -          -            -
 10.15.15.52/32  20      eno1       10.15.15.52  -
 10.15.15.53/32  20      eno2       10.15.15.53  -

Code:
root@myst1:~# ping 10.15.15.52
PING 10.15.15.52 (10.15.15.52) 56(84) bytes of data.
From 10.15.15.51 icmp_seq=1 Destination Host Unreachable
From 10.15.15.51 icmp_seq=2 Destination Host Unreachable
....
14 packets transmitted, 0 received, +8 errors, 100% packet loss, time 13288ms
pipe 4
root@myst1:~#
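
Since the unreachable replies come from the node's own address (10.15.15.51), the route lookup succeeds but next-hop resolution on the mesh link does not. Two standard checks for that (a sketch, assuming ARP on eno1 is the failing step):

Code:
# FAILED/INCOMPLETE neighbor entries mean ARP for the peer never resolves.
ip neigh show dev eno1

# Watch whether ARP requests for 10.15.15.52 leave eno1 and get answered.
tcpdump -ni eno1 arp or icmp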
 
Code:
auto bond0
iface bond0 inet static
        address 10.15.15.51/32
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode broadcast
According to the instructions at https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server, it should look like this:
Code:
auto bond0
iface bond0 inet static
    address  10.15.15.51/32
    slaves eno1 eno2
    bond_miimon 100
    bond_mode broadcast
#Full Mesh
With bond options you often don't know what the correct spelling is or whether all variants work, so I would just take it straight from the instructions.

Furthermore, a /32 contains only a single IP, which is why that alone cannot and should not work. For point-to-point links you need at least a /31 (https://datatracker.ietf.org/doc/html/rfc3021).
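
For example, the first node's bond stanza could be widened to a shared subnet like this (a sketch; 10.15.15.0/24 as the mesh subnet is my assumption, with .52 and .53 on the other nodes):

Code:
auto bond0
iface bond0 inet static
    address  10.15.15.51/24
    slaves eno1 eno2
    bond_miimon 100
    bond_mode broadcast
#Full Mesh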
 
Thank you, updating the 10G subnet to /24 helped, and the nodes can at least ping each other now, but I am still not able to migrate.

"2023-12-03 18:39:02 100-0: end replication job with error: command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=myst2' root@192.168.1.202 pvecm mtunnel -migration_network 192.168.30.201/24 -get_migration_ip' failed: exit code 255"

I am testing migration on the cluster network since it didn't work on the full mesh. Should I remove the nodes from the cluster and add them back in?
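
For reference, the failing lookup can be reproduced by hand with the exact command from the log; pointing it at the full-mesh subnet instead is a guess on my part (assuming 10.15.15.0/24):

Code:
# Re-run the lookup from the replication log verbatim:
ssh root@192.168.1.202 pvecm mtunnel -migration_network 192.168.30.201/24 -get_migration_ip

# The same lookup against the full-mesh subnet (assumption: 10.15.15.0/24):
ssh root@192.168.1.202 pvecm mtunnel -migration_network 10.15.15.0/24 -get_migration_ip

# The network handed to -migration_network comes from the datacenter-wide
# migration setting, e.g. in /etc/pve/datacenter.cfg:
#   migration: secure,network=10.15.15.0/24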
 
