Linux Bond balance-rr and Ceph Squid 19.2.1: OSDs Lost after Setting Up Bond

jeromehd

New Member
Dec 23, 2023
Here's what I did. When I first set up the 3-node cluster, I only had one cable per host for the storage network, i.e. the network where each host is connected at 10 Gb/s to a switch so they can use shared storage (Ceph, which I'm fairly new to and still learning).
This week the extra cables we ordered arrived (we weren't satisfied with the few cables we had), so I installed the additional connections on each host and then set up a Linux Bond under Networking, like so:

Code:
auto lo
iface lo inet loopback

iface eno8303 inet manual

iface eno8403 inet manual

auto eno12399np0
iface eno12399np0 inet static
        address 10.10.0.6/24

auto eno12409np1
iface eno12409np1 inet static
        address 10.10.0.3/24

iface ens2f0np0 inet manual

iface ens2f1np1 inet manual

auto bond0
iface bond0 inet static
        address 10.10.0.9/24
        bond-slaves eno12399np0 eno12409np1
        bond-miimon 100
        bond-mode balance-rr

auto vmbr0
iface vmbr0 inet static
        address 192.168.5.127/24
        gateway 192.168.5.1
        bridge-ports eno8303
        bridge-stp off
        bridge-fd 0

source /etc/network/interfaces.d/*
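
Looking back at that file, I notice the two NICs I put into the bond (eno12399np0 and eno12409np1) still have their own static addresses (10.10.0.6 and 10.10.0.3), while bond0 got a new address (10.10.0.9). If I understand bonding right, the member NICs should probably have been set to manual with no address, and only the bond should carry the storage IP. This is just my guess at what that part should have looked like, so please correct me if I'm wrong:

Code:
# My guess at the corrected bond section - the member NICs carry
# no addresses of their own, only bond0 has the storage IP
auto eno12399np0
iface eno12399np0 inet manual

auto eno12409np1
iface eno12409np1 inet manual

auto bond0
iface bond0 inet static
        address 10.10.0.9/24
        bond-slaves eno12399np0 eno12409np1
        bond-miimon 100
        bond-mode balance-rr

I also suspect that since Ceph was originally set up while the hosts were using the 10.10.0.6 and 10.10.0.3 addresses, moving the node's storage address to 10.10.0.9 is what actually cut the monitors/OSDs off, but that's only a guess.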

Anyway, my changes caused some kind of disruption in Ceph, and I ended up losing the connections to the RBD pools I had made. The good thing is that I had only just set up this cluster and had only started migrating a few VMs from our old cluster. In the end I had to redo my Ceph setup from scratch, including working through keyring issues (at one point I was getting "rados_connect failed - Operation not supported", and another time a failure to authenticate). I somehow managed to get it online again and made new pools, and all is fixed now after restoring my VMs from the weekly backups.
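
For anyone who runs into the same thing, the kind of checks I was using to see what Ceph thought its monitor addresses and networks were looked roughly like this (reconstructed from memory, so treat it as a sketch rather than exactly what I typed):

Code:
# Overall cluster and monitor state
ceph -s
ceph mon dump

# Which networks and monitor addresses Ceph expects - this is where
# I expected to still see the old per-NIC 10.10.0.x addresses
grep -E 'public_network|cluster_network|mon_host' /etc/pve/ceph.conf

# OSD and auth state (on Proxmox the keyrings live under /etc/pve/priv/)
ceph osd tree
ceph auth ls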

Has anyone else seen this sort of thing, or does anyone have a better understanding of what happened here?

I'm also planning to upgrade to Proxmox VE 9 next week; I already tested a successful upgrade on the old cluster, which I set up last year. Any hints on how I can test my new cluster and what to look out for before upgrading? I'm concerned about Ceph breaking again, since it looks like the upgrade will move Ceph Squid to 19.2.3.
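
For the upgrade itself, my rough plan (based on my reading of the upgrade docs, so please correct me if any of this is off) is to run the pre-upgrade checker and make sure Ceph is completely healthy before touching anything:

Code:
# Proxmox's pre-upgrade checklist script - pve8to9 is what I
# understand replaces the earlier pveXtoY checkers
pve8to9 --full

# Make sure Ceph is healthy and every daemon reports the same version
ceph -s
ceph health detail
ceph versions

# Before rebooting/upgrading an OSD node, keep Ceph from rebalancing
ceph osd set noout
# ... upgrade and reboot the node ...
ceph osd unset noout

Does that look reasonable, or is there anything else I should test on the new cluster first?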

Thanks!