Minor Change - Thunderbolt Networking

B-C

New Member
Sep 24, 2023
I reviewed a bit and went with the Thunderbolt networking setup provided here for my MS-01 cluster.

https://gist.github.com/scyto/67fdc9a517faefa68f730f82d7fa3570
&
https://gist.github.com/scyto/4c664734535da122f4ab2951b22b2085

The difference that's giving me grief is that I already had the base cluster up and running on the latest version over the 2.5G interfaces - no ring, just a flat network with Ceph on the same subnet (to get things up and running).

The isolated network uses the 10.99.99.21-23 IPs.

However, if I change the migration network in Datacenter > Options to the 10.99.99.0/24 network, I get these errors on normal migrations.
98% sure I missed something! ;p
Code:
could not get migration ip: no IP address configured on local node for network '10.99.99.21/32'
TASK ERROR: command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve03' -o 'UserKnownHostsFile=/etc/pve/nodes/pve03/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@10.22.20.23 pvecm mtunnel -migration_network 10.99.99.21/32 -get_migration_ip' failed: exit code 255
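Looking at the error, the network value that actually got saved appears to be the node's /32 host address (10.99.99.21/32) rather than the whole subnet - presumably because that's what the GUI dropdown offers when the mesh addresses are configured as /32s. My untested guess is that /etc/pve/datacenter.cfg needs to hold the full /24 by hand, roughly like this:
Code:
# /etc/pve/datacenter.cfg -- rough sketch, not yet tested on my cluster
# "secure" is the migration type; network= should be the whole subnet,
# not the per-node /32 the dropdown suggested
migration: secure,network=10.99.99.0/24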


Switching the migration network back to 10.22.20.0/24 does work as a temporary fallback to the slower network.

Ran these from each host to each of the other hosts:
ssh -o 'HostKeyAlias=pve01' root@10.99.99.21
ssh -o 'HostKeyAlias=pve02' root@10.99.99.22
ssh -o 'HostKeyAlias=pve03' root@10.99.99.23

and they connect without issue using keys (just have to accept the fingerprint once for each).
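For anyone else poking at this, running the same helper the migration task calls, but by hand with the full /24, seems like a reasonable sanity check (an untested guess on my part - it should print the node's 10.99.99.x address if the subnet can be matched against a local address):
Code:
# run locally on each node
pvecm mtunnel -migration_network 10.99.99.0/24 -get_migration_ip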

Just wondering what I need to correct so that migrations actually use that network?

Additionally, it looks like I'd follow these steps to move the Ceph network over to that same 10G+ network, as smoothly as possible with limited outage; doing it one node at a time seems possible if I'm reading correctly:
https://forum.proxmox.com/threads/ceph-changing-public-network.119116/
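If I do go that route, my reading of that thread is that it boils down to roughly the following - a sketch only, I haven't run it, and the monitor recreation has to be done one node at a time with the cluster healthy in between:
Code:
# 1) point Ceph at the new subnet in /etc/pve/ceph.conf
#      public_network = 10.99.99.0/24
#      cluster_network = 10.99.99.0/24   # only if moving this too

# 2) recreate the monitors one node at a time so they bind to the new IPs
pveceph mon destroy pve01
pveceph mon create
# wait for HEALTH_OK, then repeat on pve02 and pve03

# 3) restart OSDs/managers afterwards so they pick up the new public network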
 
Did you get this resolved in the end - has it been reliable for you? I'm just about to set up 2 x MS-01 in a cluster and have a Thunderbolt cable between them ready to set this up.
 
Nope - it ended up not working out for me...
I would have needed to edit/modify a bunch of configs that I didn't want to touch on a production cluster.

TL;DR:
If I were starting from scratch or in full development, I'd love to test this more.

When I implemented the mesh, the cluster was already set up and running; I would have had to migrate machines off Ceph and rebuild the OSDs on the correct network, so I abandoned it.

--- Long version ---

Subsequent updates to the MS-01s seem to have broken the USB4/Thunderbolt networking and I never went back to review it.
Unable to ping across those interfaces, at least.
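For anyone hitting the same thing, the checks I was doing amounted to roughly this - nothing fancy, just confirming the interfaces (en05/en06 are the names from the gist's rename rules) were up and the peers reachable:
Code:
# did the thunderbolt-net interfaces come up after the update?
ip -br link show en05 en06
dmesg | grep -i thunderbolt

# are the mesh addresses there and the peers reachable?
ip -br addr show
ping -c 3 10.99.99.22
ping -c 3 10.99.99.23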

So I'm holding on my 2.5G interfaces and it's been stable, with no MEBx issues - using an IP KVM (v4 mini) to get to them if they fail.

All 3 are on the i9-13900H with 0x4121 microcode.
1x is still on firmware 1.22 vs 1.24.
2x of the 3 MS-01s are running with reduced RAM speeds (4400 vs the max 5200/5600) via test firmware 1.24, and the cluster has now been holding well past 30+ days without failures. (Currently at 28 days since the last power issue at the site, which outlasted the UPS, but everything came back up on its own without issue.)
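If anyone wants to compare notes, the loaded microcode revision is easy to check (standard Linux, nothing MS-01 specific):
Code:
# microcode revision as seen by the running kernel
grep -m1 microcode /proc/cpuinfo
# expect 0x4121 on these i9-13900H units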

-----

Apologies I couldn't be of more help!
 
@B-C thanks for the update, that's really detailed. I'm setting these up basically fresh, so I'm tempted to have a play before putting them in production. I've updated both to the latest 1.26 firmware and done the microcode updates (via the tteck scripts). I've not tried Proxmox with Ceph before, but it would be useful for a couple of VMs where I could do with HA - what's the performance like?
 
Performance is good - small office, nothing major.
Those little MS-01s scream pretty darn well - I haven't done any benchmarks though.

2 nodes is a bit of a "hack" since the cluster really wants 3 nodes for quorum, but like you saw, his scripts might make it doable!
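For the quorum side of a 2-node setup, my understanding is that a QDevice on any small third box (a Pi or similar) covers it - something roughly like this, though I haven't set one up myself:
Code:
# on the small third box (not part of the cluster)
apt install corosync-qnetd

# on both MS-01 nodes
apt install corosync-qdevice
pvecm qdevice setup <QDEVICE-IP>

# verify the qdevice shows up with a vote
pvecm status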

Beyond that, nice on the 1.26 - I'll probably need to get myself updated to that; I haven't been monitoring much the last couple of months!

--- One huge favor ---
Document everything you can!
I might buy another set of 3 just for testing here in a few months, but I'm interested in how the 2-node cluster goes.
Especially with the Thunderbolt mesh - I'd really like to re-attempt that.

My test cluster is OptiPlex 3040 i5s on 1G NICs only.

They have 2TB M.2 SATA drives, and I ended up putting the Ceph OSDs on partitions (also not recommended), but HA and everything is holding well and healthy, just not "optimal".
(These only have a single SATA connection, so I had to use a SATA-to-M.2 adapter - it's a dual M.2 adapter but it only presents as a single drive, so I ended up populating just one M.2 slot.)
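For reference, pveceph wants whole disks for OSDs as far as I know, so partitions typically need ceph-volume directly - something along these lines (a sketch only, and /dev/sda4 just stands in for whichever spare partition was carved out):
Code:
# create an OSD on a spare partition (not the recommended route)
ceph-volume lvm create --data /dev/sda4
# it then shows up like any other OSD
ceph osd tree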

I have a Micro 7060 with an M.2 slot as node #4 - loaded ESXi on that one to test migrations from VMware... it worked well enough.
 
