3 node cluster replication error (networking)

jaykavathe

Member
Feb 25, 2021
Spent a while today learning a bit about networking and traffic, and set up 3 different networks on my 3-node cluster.
Each of the 3 nodes has 2x 1GbE and 2x 10GbE ports.

I used eno3 for management, eno4 for the cluster, and eno1/eno2 for a point-to-point full mesh. Now when I try to start replication, I get "-get_migration_ip' failed: exit code 255".

I am quite sure this is because I can't ping the nodes on their full-mesh addresses. Can someone help me figure out the error? If I recall correctly, I was able to ping the nodes on their 10.15.15.x addresses earlier, and that stopped working after I created a Linux bond in "broadcast" mode.

192.168.1.201/24 (202/203) - Management network/IP
192.168.30.201/24 (202/203) - Cluster network/IP
10.15.15.51/32 (52/53) - Full-mesh 10GbE network/IP for migration

Code:
root@myst1:~# ip route
default via 192.168.1.1 dev vmbr0 proto kernel onlink
10.15.15.52 nhid 21 via 10.15.15.52 dev eno1 proto openfabric metric 20 onlink
10.15.15.53 nhid 24 via 10.15.15.53 dev eno2 proto openfabric metric 20 onlink
192.168.1.0/24 dev vmbr0 proto kernel scope link src 192.168.1.201
192.168.30.0/24 dev eno4 proto kernel scope link src 192.168.30.201

Code:
#/etc/pve/network/interfaces.d
auto lo
iface lo inet loopback

iface eno3 inet manual

auto eno4
iface eno4 inet static
        address 192.168.30.201/24
#cluster

auto eno1
iface eno1 inet static
        mtu 9000

auto eno2
iface eno2 inet static
        mtu 9000

auto bond0
iface bond0 inet static
        address 10.15.15.51/32
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode broadcast
#Full Mesh

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.201/24
        gateway 192.168.1.1
        bridge-ports eno3
        bridge-stp off
        bridge-fd 0
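
As a sanity check on the bond itself, the kernel's bonding driver can confirm the mode and the enslaved NICs (standard Linux checks; a sketch, not output from these nodes):

Code:
# The bonding driver reports the active mode and slaves; for
# bond-mode broadcast it should read "fault-tolerance (broadcast)"
# with both eno1 and eno2 listed as slave interfaces.
cat /proc/net/bonding/bond0

# Confirm the 10.15.15.51/32 address actually landed on bond0.
ip -br addr show bond0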

Code:
#/etc/frr/frr.conf
frr defaults traditional
hostname myst1
log syslog warning
ip forwarding
no ipv6 forwarding
service integrated-vtysh-config
!
interface lo
 ip address 10.15.15.51/32
 ip router openfabric 1
 openfabric passive
!
interface eno1
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
!
interface eno2
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
!
line vty
!
router openfabric 1
 net 49.0001.1111.1111.1111.00
 lsp-gen-interval 1
 max-lsp-lifetime 600
 lsp-refresh-interval 180

Code:
root@myst1:~# vtysh

Hello, this is FRRouting (version 8.5.2).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

myst1# show openfabric neighbor
Area 1:
  System Id           Interface   L  State        Holdtime SNPA
 myst2               eno1        2  Up            2        2020.2020.2020
 myst3               eno2        2  Up            2        2020.2020.2020
myst1# show openfabric route
Area 1:
IS-IS L2 IPv4 routing table:

 Prefix          Metric  Interface  Nexthop      Label(s)
 ----------------------------------------------------------
 10.15.15.51/32  0       -          -            -
 10.15.15.52/32  20      eno1       10.15.15.52  -
 10.15.15.53/32  20      eno2       10.15.15.53  -

Code:
root@myst1:~# ping 10.15.15.52
PING 10.15.15.52 (10.15.15.52) 56(84) bytes of data.
From 10.15.15.51 icmp_seq=1 Destination Host Unreachable
From 10.15.15.51 icmp_seq=2 Destination Host Unreachable
....
14 packets transmitted, 0 received, +8 errors, 100% packet loss, time 13288ms
pipe 4
root@myst1:~#
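
Since the unreachable replies come from the node's own address (10.15.15.51), the route lookup succeeds but next-hop resolution on the mesh link does not. Two standard checks for that (a sketch, assuming ARP on eno1 is the failing step):

Code:
# FAILED/INCOMPLETE neighbor entries mean ARP for the peer never resolves.
ip neigh show dev eno1

# Watch whether ARP requests for 10.15.15.52 leave eno1 and get answered.
tcpdump -ni eno1 arp or icmp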
 
Code:
auto bond0
iface bond0 inet static
        address 10.15.15.51/32
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode broadcast
According to the instructions at https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server, it should look like this:
Code:
auto bond0
iface bond0 inet static
    address  10.15.15.51/32
    slaves eno1 eno2
    bond_miimon 100
    bond_mode broadcast
#Full Mesh
With bond options you often don't know what the correct spelling is or whether all variants work, so I would just take it straight from the instructions.

Furthermore, a /32 contains only a single IP, which is why that alone cannot and should not work. For point-to-point links you need at least a /31 (https://datatracker.ietf.org/doc/html/rfc3021).
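
For example, the first node's bond stanza could be widened to a shared subnet like this (a sketch; 10.15.15.0/24 as the mesh subnet is my assumption, with .52 and .53 on the other nodes):

Code:
auto bond0
iface bond0 inet static
    address  10.15.15.51/24
    slaves eno1 eno2
    bond_miimon 100
    bond_mode broadcast
#Full Mesh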
 
Thank you, updating the 10G subnet to /24 helped, and the nodes can at least ping each other now, but I am still not able to migrate.

"2023-12-03 18:39:02 100-0: end replication job with error: command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=myst2' root@192.168.1.202 pvecm mtunnel -migration_network 192.168.30.201/24 -get_migration_ip' failed: exit code 255"

I am testing migration on the cluster network since it didn't work on the full mesh. Should I remove the nodes from the cluster and add them back in?
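
For reference, the failing lookup can be reproduced by hand with the exact command from the log; pointing it at the full-mesh subnet instead is a guess on my part (assuming 10.15.15.0/24):

Code:
# Re-run the lookup from the replication log verbatim:
ssh root@192.168.1.202 pvecm mtunnel -migration_network 192.168.30.201/24 -get_migration_ip

# The same lookup against the full-mesh subnet (assumption: 10.15.15.0/24):
ssh root@192.168.1.202 pvecm mtunnel -migration_network 10.15.15.0/24 -get_migration_ip

# The network handed to -migration_network comes from the datacenter-wide
# migration setting, e.g. in /etc/pve/datacenter.cfg:
#   migration: secure,network=10.15.15.0/24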
 
