Intel NUC 13 Pro Thunderbolt Ring Network Ceph Cluster

I'm so grateful to all who documented their process of building a home lab cluster like this! I've just acquired 3 Intel NUC 13 Pro i7s for this exact purpose, and the information here, as well as the gist from @scyto, will help immensely.
Glad it is useful, apologies for any mistakes in my docs, and thanks to the others in this community who helped me. Let me take a stab at some answers.

1. I just upgraded to 6.5.13-3 with no ill effects so far (I know typing that is tempting fate), though I don't use ZFS, so if you do, YMMV.
2. I use the pve-no-subscription repo and keep everything upgraded using the UI (unless I am doing something very specific) - an example repo entry is at the end of this post.
3. If you use 6.5.13-3, I am not aware of any need for kernel patches for the things I documented.
4. I migrated to Reef ages ago; I would start there, as Pacific is about to be EOS/EOL.

All my answers are given for homelabbers; I would be more conservative on enterprise-grade production.
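
For reference, on a PVE 8 / Debian Bookworm install the no-subscription repo entry looks something like this (the file name is just an example):

Code:
# /etc/apt/sources.list.d/pve-no-subscription.list (example file name)
deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription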
 
Thanks! I won't be using ZFS either, so my results will likely reflect yours. I appreciate the additional guidance. I'll be tackling this tomorrow!
 
Just another hint ... my en0x interfaces sporadically didn't come up after a reboot.
Changing auto en0x to allow-hotplug en0x in the interfaces file did the trick there.
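
Concretely, with the en05/en06 names and the 65520 MTU used in this thread, the stanzas end up looking something like this:

Code:
# /etc/network/interfaces - allow-hotplug instead of auto for the TB links
allow-hotplug en05
iface en05 inet manual
        mtu 65520

allow-hotplug en06
iface en06 inet manual
        mtu 65520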

I've been running Reef on IPv6 for a while now, but I've already had two cluster freezes where I had to completely cut the power to bring it up again; sadly I haven't gathered any logs so far.
 
Great tip - what kernel are you on? For mine, all I ever needed was this:

Code:
auto en05
iface en05 inet manual
        mtu 65520

auto en06
iface en06 inet manual
        mtu 65520

I have had no cluster freezes at all on Reef. I did have issues on one of the kernels around 6 months ago, but none since. I will keep monitoring the latest kernel and post back if I also get freezes - how often do you see them?
 
Thanks @rene.bayer

I am looking more deeply at my system now that the new kernel is on it - I found my entire mesh was down (I was lucky I had static routes on my router, so Ceph traffic was still reaching the IPv6 addresses via my 2.5GbE interfaces).

I have moved to hotplug, and that seems to have helped, but one node appears to be completely dead with respect to Thunderbolt (one thunderbolt message in dmesg and that was it - very weird).

Very interesting... this is like the original retimer issues I hit that were fixed in backports; I am hoping there isn't a Thunderbolt regression in the latest kernels... I will keep digging...

--edit--
Hmm, a reboot of the bad node fixed it (for bonus points I also updated the NUC firmware; it seems ASUS is better than Intel at firmware updates for the Intel NUCs).
 
My NUC 13 Pro cluster is up and running! I didn't hit any major speed bumps thanks to you guys. :cool:

I'm posting for two reasons:
  1. To share my method for backing up my entire proxmox cluster configuration at a disk image level since it still took a bit of time to run through the steps to get it all working well, and I like the clean state it's in.
  2. To share a few benchmarks in order to compare notes, with a secondary goal of figuring out how to get my Thunderbolt network to perform at 26Gbps (it's currently "only" at 21Gbps).
    • Specifically, I saw earlier in this thread that @scyto posted a 26Gbps benchmark using the same exact hardware as mine, but that was back in August (and apparently before moving from the OSPF routing method to the OpenFabric routing method).
    • So, before I dive down the rabbit hole, I wanted to check - is the 26Gbps performance target still relevant, or am I getting the best I can at 21 Gbps? (Would you be willing to re-test @scyto?)
    • Any ideas/clues as to why it would be different, especially considering I attempted to replicate the same setup (hardware, software, and config)?
    • I've posted my config and TB4 cables below as well

BACKUP & RESTORE
Options
  • You can backup to a SMB share or a local USB disk
  • These instructions cover the SMB share
  • To back up to a local USB disk instead, simply modify the "mount" command below to match your device/filesystem (and you don't need to assign an IP address) - see the example right after this list
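For example, assuming an ext4-formatted USB disk that shows up as /dev/sdb1 (adjust to your actual device):

Code:
mkdir /media/usbdisk
mount -t ext4 /dev/sdb1 /media/usbdisk
# then use /media/usbdisk in place of /media/synology in the commands below
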
Context
  • Each of my nodes has a 1TB SSD as my boot disk (sda), and a 4TB NVMe dedicated to Ceph (nvme0n1)
  • These instructions assume the following IP addresses:
    • Synology SMB server at 192.168.0.10
    • Proxmox cluster node at 192.168.0.51
Instructions (Backup)
  1. Download SuSE Enterprise Desktop Online Install ISO
    • https://www.suse.com/download/sled/
    • Download the "online" installer, (e.g. SLE-15-SP5-Online-x86_64-GM-Media1.iso)
    • Note that you will need a free SuSE account to download this ISO
    • Use Balena Etcher (or similar) to create bootable USB thumb drive
  2. Boot
    • Press F10 on BIOS screen to choose USB drive
    • Select Rescue System (under the "More…" boot menu)
    • Be patient - it will take about 3 minutes to boot
    • Once booted, log in as root (no password)
  3. Assign IP address
    • ip a add 192.168.0.51/24 dev eth0
      • (or .52, .53, etc) - make sure you don’t choose the same IP address as another device!!!
    • ip link set dev eth0 up
    • ping 192.168.0.10
  4. Mount SMB share
    • mkdir /media/synology
    • mount -t cifs -o username=Proxmox //192.168.0.10/Backup-Proxmox-Manual /media/synology
      • Enter password when prompted
    • Verify it mounted with:
      • ls -lah /media/synology
    • mkdir /media/synology/NUC-Cluster-Nodes-Raw-Disk-Backups
  5. Use dd + pigz to image and compress entire disk
    • dd bs=10000000 if=/dev/sda | pigz -9 > /media/synology/NUC-Cluster-Nodes-Raw-Disk-Backups/NUC-1_sda_Full-Image-1TB-Crucial-Boot-Disk_2024-04-10_0830_3NodeClusterFullySetUp_NoVMs.dd.pigz
    • NOTE: You can run more than one instance (e.g. one for each physical disk) simultaneously by using a second terminal window (Ctrl-Alt-F1, F2, F3, F4, F5, F6)
    • To backup the NVMe, change "sda" to "nvme0n1"
    • I backed up all 3 nodes at the same time, using the same SuSE USB boot drive. Once the rescue system is loaded and you're logged in, you can unplug the USB drive and use it on the next node
    • It took about 30 minutes for my 1TB SSD, and about 70 minutes for my 4TB NVMe
    • Each SSD image ended up being 4.5GB, and each NVMe image 3.5GB
Instructions (Restore):
  1. Boot and Mount SMB share with instructions above
  2. unpigz -d -c /media/synology/NUC-Cluster-Nodes-Raw-Disk-Backups/NUC-1_sda_Full-Image-1TB-Crucial-Boot-Disk_2024-04-10_0830_3NodeClusterFullySetUp_NoVMs.dd.pigz | dd of=/dev/sda bs=10000000
  3. NOTE: I know I can restore the SSD in this way, but I'm not as familiar with restoring NVMe at the block device level like this (especially with Ceph installed) so that part is untested at this point, but it should work in theory


BENCHMARKS
Each node has one Samsung 990 Pro NVMe dedicated to Ceph.
Prep: I created a storage pool called ceph-benchmark with 32 PGs.
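A pool like that can be created from the CLI with the plain Ceph tooling, for example (the GUI or pveceph would work just as well):

Code:
# create a throwaway pool named ceph-benchmark with 32 placement groups
ceph osd pool create ceph-benchmark 32
# the objects left behind by "bench ... --no-cleanup" can be removed later with:
#   rados -p ceph-benchmark cleanup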

Code:
root@nuc1:~# rados -p ceph-benchmark bench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_nuc1_22541
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       193       177   707.957       708    0.028086   0.0883301
    2      16       510       494    987.92      1268   0.0338487   0.0641823
    3      16       826       810   1079.91      1264    0.233411   0.0586165
    4      16      1029      1013   1012.86       812   0.0261111   0.0609416
    5      16      1173      1157   925.464       576   0.0123211   0.0645023
    6      16      1367      1351   900.517       776   0.0147871   0.0694326
    7      16      1600      1584   905.002       932   0.0416352    0.070401
    8      16      1825      1809   904.353       900   0.0228454   0.0699957
    9      16      2029      2013   894.526       816   0.0112502   0.0698831
   10      16      2224      2208   883.065       780   0.0287672   0.0723378
Total time run:         10.027
Total writes made:      2224
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     887.207
Stddev Bandwidth:       224.236
Max bandwidth (MB/sec): 1268
Min bandwidth (MB/sec): 576
Average IOPS:           221
Stddev IOPS:            56.0591
Max IOPS:               317
Min IOPS:               144
Average Latency(s):     0.0720637
Stddev Latency(s):      0.0767273
Max latency(s):         0.516354
Min latency(s):         0.00831436

Code:
root@nuc1:~# rados -p ceph-benchmark bench 10 seq
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0      16        16         0         0         0           -           0
    1      16       729       713   2851.33      2852  0.00964088   0.0213249
    2      16      1444      1428   2855.45      2860   0.0153747   0.0213478
    3      16      2185      2169   2891.54      2964   0.0229736   0.0211632
Total time run:       3.07695
Total reads made:     2224
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   2891.17
Average IOPS:         722
Stddev IOPS:          15.6205
Max IOPS:             741
Min IOPS:             713
Average Latency(s):   0.0211742
Max latency(s):       0.0738357
Min latency(s):       0.00305585

All 3 nodes perform at this level, even when running simultaneously (Node 1>>2, Node 2>>3, Node 3>>1), and running with -P 10 yields identical performance
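(Here -P is iperf3's parallel-streams flag, i.e. a run along the lines of the following:)

Code:
iperf3 -c fc00::112 -P 10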

Code:
root@nuc1:~# iperf3 -c fc00::112
Connecting to host fc00::112, port 5201
[  5] local fc00::111 port 42512 connected to fc00::112 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.49 GBytes  21.4 Gbits/sec  813   1023 KBytes  
[  5]   1.00-2.00   sec  2.49 GBytes  21.4 Gbits/sec  731   1.31 MBytes  
[  5]   2.00-3.00   sec  2.49 GBytes  21.4 Gbits/sec  829    575 KBytes  
[  5]   3.00-4.00   sec  2.47 GBytes  21.2 Gbits/sec  1032   1.37 MBytes  
[  5]   4.00-5.00   sec  2.49 GBytes  21.4 Gbits/sec  1106   1.06 MBytes  
[  5]   5.00-6.00   sec  2.48 GBytes  21.3 Gbits/sec  1103   1.19 MBytes  
[  5]   6.00-7.00   sec  2.48 GBytes  21.3 Gbits/sec  1013   1.12 MBytes  
[  5]   7.00-8.00   sec  2.50 GBytes  21.5 Gbits/sec  1019   1.06 MBytes  
[  5]   8.00-9.00   sec  2.03 GBytes  17.5 Gbits/sec  803   1.37 MBytes  
[  5]   9.00-10.00  sec  2.52 GBytes  21.6 Gbits/sec  1156   1023 KBytes  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  24.4 GBytes  21.0 Gbits/sec  9605             sender
[  5]   0.00-10.00  sec  24.4 GBytes  21.0 Gbits/sec                  receiver

Code:
root@nuc1:~# iperf3 -c 192.168.10.112
Connecting to host 192.168.10.112, port 5201
[  5] local 192.168.10.111 port 49330 connected to 192.168.10.112 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   284 MBytes  2.38 Gbits/sec    0    711 KBytes  
[  5]   1.00-2.00   sec   280 MBytes  2.35 Gbits/sec    0    711 KBytes  
[  5]   2.00-3.00   sec   280 MBytes  2.35 Gbits/sec    0    711 KBytes  
[  5]   3.00-4.00   sec   281 MBytes  2.36 Gbits/sec    0    711 KBytes  
[  5]   4.00-5.00   sec   280 MBytes  2.35 Gbits/sec    0    711 KBytes  
[  5]   5.00-6.00   sec   281 MBytes  2.36 Gbits/sec    0    711 KBytes  
[  5]   6.00-7.00   sec   280 MBytes  2.35 Gbits/sec    0    711 KBytes  
[  5]   7.00-8.00   sec   281 MBytes  2.36 Gbits/sec    0    711 KBytes  
[  5]   8.00-9.00   sec   280 MBytes  2.35 Gbits/sec    0    711 KBytes  
[  5]   9.00-10.00  sec   281 MBytes  2.36 Gbits/sec    0    711 KBytes  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.74 GBytes  2.36 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  2.74 GBytes  2.35 Gbits/sec                  receiver


MY CONFIG
Code:
root@nuc1:~# cat /etc/network/interfaces
auto lo
iface lo inet loopback

auto lo:0
iface lo:0 inet static
    address 10.0.0.111/32
   
auto lo:6
iface lo:6 inet static
    address fc00::111/128

iface enp86s0 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.168.10.111/23
    gateway 192.168.10.1
    bridge-ports enp86s0
    bridge-stp off
    bridge-fd 0

iface enp87s0 inet manual

iface wlo1 inet manual

auto en05
iface en05 inet manual
    mtu 65520

iface en05 inet6 manual
    mtu 65520

auto en06
iface en06 inet manual
    mtu 65520

iface en06 inet6 manual
    mtu 65520

source /etc/network/interfaces.d/*

# This must be the last line in the file
post-up /usr/bin/systemctl restart frr.service

Code:
root@nuc1:~# cat /etc/systemd/network/00-thunderbolt0.link
[Match]
Path=pci-0000:00:0d.2
Driver=thunderbolt-net
[Link]
MACAddressPolicy=none
Name=en05

Code:
root@nuc1:~# cat /etc/systemd/network/00-thunderbolt1.link
[Match]
Path=pci-0000:00:0d.3
Driver=thunderbolt-net
[Link]
MACAddressPolicy=none
Name=en06

Code:
root@nuc1:~# cat /etc/sysctl.conf | grep forward
# Uncomment the next line to enable packet forwarding for IPv4
net.ipv4.ip_forward=1
# Uncomment the next line to enable packet forwarding for IPv6
net.ipv6.conf.all.forwarding=1

Code:
root@nuc1:~# vtysh

Hello, this is FRRouting (version 8.5.2).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

nuc1# show running-config
Building configuration...

Current configuration:
!
frr version 8.5.2
frr defaults traditional
hostname nuc1
log syslog informational
service integrated-vtysh-config
!
interface en05
 ipv6 router openfabric 1
exit
!
interface en06
 ipv6 router openfabric 1
exit
!
interface lo
 ip router openfabric 1
 ipv6 router openfabric 1
 openfabric passive
exit
!
router openfabric 1
 net 49.0000.0000.0001.00
exit
!
end

Code:
root@nuc1:~# cat /etc/pve/datacenter.cfg

crs: ha-rebalance-on-start=1
ha: shutdown_policy=migrate
keyboard: en-us
migration: insecure,network=fc00::110/125

My TB4 Cables: Amazon
 
Last edited:
  • Like
Reactions: scyto
Congratulations!

Interesting node backup approach; Synology annoys the hell out of me - they don't keep their agent updated for newer kernels.

On speed, the upper limit (and why you don't get 40Gbps) is the DMA controller on Intel platforms; 27Gbps is the upper limit on 13th-gen and earlier systems according to the guy at Intel, and I see this on most of my connections:

Code:
Connecting to host fc00::83, port 5201
[  5] local fc00::82 port 37090 connected to fc00::83 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  3.07 GBytes  26.4 Gbits/sec   21   1.87 MBytes      
[  5]   1.00-2.00   sec  3.12 GBytes  26.8 Gbits/sec    0   1.87 MBytes      
[  5]   2.00-3.00   sec  3.12 GBytes  26.8 Gbits/sec    0   3.06 MBytes      
[  5]   3.00-4.00   sec  3.12 GBytes  26.8 Gbits/sec    0   3.06 MBytes      
[  5]   4.00-5.00   sec  3.12 GBytes  26.8 Gbits/sec    0   3.06 MBytes      
[  5]   5.00-6.00   sec  3.12 GBytes  26.8 Gbits/sec    0   3.06 MBytes      
[  5]   6.00-7.00   sec  3.11 GBytes  26.7 Gbits/sec   23   2.93 MBytes      
[  5]   7.00-8.00   sec  3.12 GBytes  26.8 Gbits/sec    0   2.93 MBytes      
[  5]   8.00-9.00   sec  3.12 GBytes  26.8 Gbits/sec    0   2.93 MBytes      
[  5]   9.00-10.00  sec  3.08 GBytes  26.4 Gbits/sec   17   2.06 MBytes      
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  31.1 GBytes  26.7 Gbits/sec   61             sender
[  5]   0.00-10.00  sec  31.1 GBytes  26.7 Gbits/sec                  receiver

Oddly, back in the opposite direction I only get 18Gbps, not sure why - now I need to test all 9 combinations of to and from, lol (see the quick loop below).
The node that is slow when the server is on it is pve1; it has the following differences:
  1. i915 video drivers installed
  2. secure boot enabled
That's it...
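
Something like this little loop (untested, and assuming the third node is fc00::81 with iperf3 -s running on every node) should cover the combinations from one side:

Code:
# run from each node in turn; the entry for the node's own address will simply fail
for dst in fc00::81 fc00::82 fc00::83; do
    echo "== to $dst =="
    iperf3 -c "$dst" -t 5 | tail -n 3
done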
 
Thanks! OK - it sounds like I don't need to dig into performance as an isolated problem on my end, then. But I've never seen 26Gbps in any direction; they all consistently run at 21Gbps (even simultaneously).

Other config notes
  • I am running the latest kernel (installed with Proxmox 8.1-2 ISO then updated everything)
  • I am running Intel's BIOS version ANRPL357.0027.2023.0607.1754
  • I don't have secure boot enabled on any of these nodes
  • I don't have the i915 video drivers installed in the method you describe in your gist
  • However, I do use a different i915 enablement method, which allows Plex in a Linux VM to do hardware transcoding. If I recall, this method doesn't necessarily work for Windows VMs (or at least nobody seems to have found a way). A brief overview of the steps I took for that is below
  • Unrelated, I just switched to the "allow-hotplug" in interfaces and realized I haven't really tested my cluster's failover/HA behavior, so I'll plan to do that soon
PVE Host Implementation
  • nano /etc/default/grub
    • On the line that says "GRUB_CMDLINE_LINUX_DEFAULT"...
    • Change "quiet" --> "quiet intel_iommu=on iommu=pt"
  • update-grub
  • update-initramfs -u -k all
  • reboot
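For reference, the resulting line in /etc/default/grub should look like this (any other flags you already have can stay):

Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"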

PVE Host Verification
  • dmesg | grep -e DMAR -e IOMMU
    • Should see something like "DMAR: IOMMU enabled" in the output
  • lspci -k | grep -A 4 "VGA"
    • Should see something like "Kernel Modules: i915" in the output
    • Note the device id number to the left of "VGA" (e.g. 00:02.0)
  • dmesg | grep 'remapping'
    • Should see something like the following in the output:
    • DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping
      DMAR-IR: Enabled IRQ remapping in x2apic mode

Ubuntu VM Config
  • Note that this may disable the nice Console view in Proxmox, but it will enable the GPU in the VM
  • SSH and RDP should both continue to work fine
  • Add PCI Device (a rough CLI equivalent is sketched after this list)
    • Shut down the VM
    • VM --> Hardware --> Add: PCI Device
      • RAW Device
      • Select the VGA device ID you noted earlier and verify the device description makes sense as well
      • All Functions: Yes
      • Primary GPU: No (this is guaranteed to disable the Console)
      • (Advanced) ROM Bar: Yes
      • (Advanced) PCI-Express: Yes
  • Verify GPU is appearing in VM
    • Start VM and run this:
    • ls -l /dev/dri
    • Output should show at least one "render" item, such as "renderD128"
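For reference, the same passthrough can be done with qm set - a rough, untested sketch assuming VM ID 100 and the 00:02.0 iGPU noted earlier (double-check the option names against your PVE version):

Code:
# map the GUI choices above onto a hostpci entry (VM ID 100 is just an example)
# pcie=1 requires the q35 machine type
qm set 100 --hostpci0 0000:00:02.0,pcie=1,rombar=1
# "All Functions: Yes" corresponds to passing the device without the .0 function suffix,
# e.g. --hostpci0 0000:00:02,pcie=1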

Plex Config (via docker in Ubuntu VM)
  • The "/dev/dri:/dev/dri" line is the one that matters for the GPU
  • sudo nano /opt/plex/docker-compose.yml
Code:
version: '2'
services:
  plex:
    image: plexinc/pms-docker
    container_name: plex
    network_mode: host
    restart: unless-stopped
    privileged: true
    environment:
      - PUID=1000
      - PGID=1000
      - TZ="America/Denver"
    volumes:
      - /media/Data/plex/config:/config
      - /dev/shm:/transcode
      - /media/Data/plex/deb:/deb
      - /media/Media:/media
      - /media/Media/IN.to.Sort/Plex-DVR:/dvr
    devices:
      - /dev/dri:/dev/dri
 
Thanks, that's helpful.

I uninstalled this variant of the i915 driver, rebooted, and speeds are now fine - not saying the two were related, lol; it's also possible that TB was negotiating weirdly while my nodes had different firmware...

I also know that the length and quality of the TB cable can matter.

As an aside, I upgraded all my NUC 13 nodes to version ANRPL357.0031.2024.0207.1420 last night / today - I will keep folks posted on how it fares after a few days.
 
Yes - I'll be interested in the results of your BIOS firmware upgrade. Since your cluster has been so stable running on the version I have currently, I'm less inclined to upgrade without a compelling reason.

Could you remind me what problem you're trying to resolve at the moment that inspired you to update the BIOS?

Also, regarding the "auto" vs "allow-hotplug" interface, I came across this Q+A that made me wonder whether it might cause unintended side effects. I don't think so, since it technically is a hotpluggable interface, but I thought I'd share.

As for my TB4 cables, perhaps I'll buy and test the same cables you have. The ones I have seem to be good from all the reviews though, and they are a good length at 1.6 ft, but I'll happily replace them if they're affecting performance.


Edit: Ok, I just ordered these cables to test. Amazon is saying they won't be here until Saturday though...
 
Interesting. Ok - so is the “allow-hotplug” a fix for problems you’re experiencing in trying to get that working, or is it just related to compatibility of the latest kernel?

In other words, if I’m not setting up the Windows vGPU driver, would you recommend I stay with “auto”, or should I use “allow-hotplug”? The fact that your setup was so stable and had reliable failover behavior for so long - that’s what I’m trying to achieve, of course.
 
The hot-plug was something someone else recommended; it makes sense, as the cables are easily and accidentally removed. I did a quick Google and it looked like it can avoid some edge-case issues.
 
Ah - gotcha. Seems like I don’t really need to pursue that course at this point.

Then again, I did just discover that I likely won’t be able to auto-migrate my 3 Plex VMs (one on each node) during cluster failover because they each require a local PCI resource - the GPU. If a vGPU gets around the limitation by virtualizing the physical interface, that may be worth looking into. Currently each Plex server will simply go down with the node if it goes down.
 
My cluster seems to be working great now... except for one thing.

Manual migration works well (and fast!) over the TB network. Also, when I shut down a node, the auto-migration of VMs/CTs DOES work as intended. However, when I bring the node back up, I get the following error for each VM/CT when Proxmox tries to migrate them back:

Code:
task started by HA resource agent
2024-04-12 03:42:41 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=nuc3' root@fc00::113 /bin/true
2024-04-12 03:42:41 ssh: connect to host fc00::113 port 22: Network is unreachable
2024-04-12 03:42:41 ERROR: migration aborted (duration 00:00:00): Can't connect to destination address using public key
TASK ERROR: migration aborted

I've followed advice on getting ssh certs/keys in place in case that was the problem, but to no avail. The part that stands out to me is "Network is unreachable". It appears that my Thunderbolt mesh network isn't coming up soon enough to support the migration back to the original node. To test this, I set my main 2.5GbE LAN network as the migration network and it fixed the issue.
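To be specific, the temporary change in /etc/pve/datacenter.cfg looked something like this (using my 192.168.10.0/23 LAN):

Code:
migration: network=192.168.10.0/23,type=insecure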

I'd prefer to use the Thunderbolt network for migration if possible (due to the extra bandwidth/speed).

Has anyone had a similar experience? Any ideas for delaying the auto-failback migration until the TB network is up for certain? Any better solutions?

/etc/pve/datacenter.cfg:
migration: network=fc00::110/125,type=insecure
 
If a vGPU gets around the limitation by virtualizing the physical interface
I believe it's being worked on for NVIDIA professional cards, but that's it.
Note that for non-live migration (failover when a node goes down), IIRC it should work - that's what the pool is for.
 
Has anyone had a similar experience?
Well, I had wondered why my failback stopped working, lol; I haven't checked the logs to see if it is the same thing.

Ideas (all centered on making virtualization start after the network is up - note this is off the top of my head; a rough sketch of one option follows the list):
  • FRR can take a while to converge - maybe we should make that part of the network service systemd chain?
  • Make QEMU dependent on FRR being up?
  • Given FRR needs to be up and converged, maybe more...
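e.g. something like this drop-in (completely untested) to order the HA local resource manager after FRR - though plain ordering still wouldn't guarantee FRR has actually converged:

Code:
# untested sketch: start pve-ha-lrm only after frr has been started
mkdir -p /etc/systemd/system/pve-ha-lrm.service.d
cat > /etc/systemd/system/pve-ha-lrm.service.d/after-frr.conf <<'EOF'
[Unit]
Wants=frr.service
After=frr.service
EOF
systemctl daemon-reload
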
qq have you seen a meaningful difference in speed doing migration with mesh network vs the standard?
 
qq have you seen a meaningful difference in speed doing migration with mesh network vs the standard?

Now that I've been testing it more, not really. Ceph makes the migration just about as fast over either. I haven't timed it, but I imagine it's only a few seconds' difference normally - perhaps longer if there's a major amount of state to transfer in a live migration (if I have the conceptual model in my head right about how it works).

So, there's not necessarily a compelling reason to get the TB network to be the migration network... I just want it to be used, out of principle. Having a dedicated node-to-node network that has 10x the bandwidth (but which I can't use for this) seems like a waste. :)

Ideally automatic migration will happen so infrequently that it won't matter, so I'm just relying on my 2.5GbE LAN for now. Maybe I'll get curious and test more later.

----------
EDIT: I just tested the online migration of an Ubuntu VM with 4GB RAM (amount of RAM will have the greatest impact on migration time). Ran a few tests, and these are the results (average time):
  • Thunderbolt: 7 Seconds, timed by Proxmox (could remote desktop into it in 9 seconds)
  • 2.5GbE LAN: 16 Seconds, timed by Proxmox (could remote desktop into it in 12 seconds)
 
seems like a waste
Agreed, I like these puzzles; it's why I have a Proxmox home system :)

I looked at the startup order and, to my uneducated eye, both FRR and Ceph have to be up before the cluster services;
for example, pve-storage.target has to be up before pve-ha-crm.service and pve-ha-lrm.service start.

So this needs to be less about whether the services have started and more about whether the FRR service is actually functional... I don't think this is about Thunderbolt per se, as that starts very, very early (see the quick check below)...
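e.g. to see what actually gates the HA services, something like:

Code:
# ordering/dependency properties systemd applies to the HA resource manager
systemctl show -p After,Wants,Requires pve-ha-lrm.service
# and the time-critical startup chain for the last boot
systemd-analyze critical-chain pve-ha-lrm.service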


Code:
root@pve1:/lib/systemd/system# dmesg | grep thunder
[    1.538859] ACPI: bus type thunderbolt registered
[    2.660658] thunderbolt 0-0:1.1: new retimer found, vendor=0x8087 device=0x15ee
[    3.791349] thunderbolt 1-0:1.1: new retimer found, vendor=0x8087 device=0x15ee
[    7.969976] thunderbolt 0-1: new host found, vendor=0x8086 device=0x1
[    7.969979] thunderbolt 0-1: Intel Corp. pve3
[    7.979902] thunderbolt-net 0-1.0 en05: renamed from thunderbolt0
[    9.411509] thunderbolt 1-1: new host found, vendor=0x8086 device=0x1
[    9.411513] thunderbolt 1-1: Intel Corp. pve2
[    9.412838] thunderbolt-net 1-1.0 en06: renamed from thunderbolt0

Hmm, though I note it starts AFTER the Ceph service...

Code:
root@pve1:/lib/systemd/system# dmesg | grep thunder
[    1.538859] ACPI: bus type thunderbolt registered
[    2.660658] thunderbolt 0-0:1.1: new retimer found, vendor=0x8087 device=0x15ee
[    3.791349] thunderbolt 1-0:1.1: new retimer found, vendor=0x8087 device=0x15ee
[    7.969976] thunderbolt 0-1: new host found, vendor=0x8086 device=0x1
[    7.969979] thunderbolt 0-1: Intel Corp. pve3
[    7.979902] thunderbolt-net 0-1.0 en05: renamed from thunderbolt0
[    9.411509] thunderbolt 1-1: new host found, vendor=0x8086 device=0x1
[    9.411513] thunderbolt 1-1: Intel Corp. pve2
[    9.412838] thunderbolt-net 1-1.0 en06: renamed from thunderbolt0
root@pve1:/lib/systemd/system# dmesg | grep free
[    0.084664] mem auto-init: stack:all(zero), heap alloc:on, heap free:off
[    0.221051] HugeTLB: 16380 KiB vmemmap can be freed for a 1.00 GiB page
[    0.221051] HugeTLB: 28 KiB vmemmap can be freed for a 2.00 MiB page
root@pve1:/lib/systemd/system# dmesg | grep network
[    1.198101] drop_monitor: Initializing network drop monitor service
root@pve1:/lib/systemd/system# dmesg | grep net
[    0.214839] audit: initializing netlink subsys (disabled)
[    1.198101] drop_monitor: Initializing network drop monitor service
[    1.528763] Intel(R) 2.5G Ethernet Linux Driver
[    1.576903] igc 0000:56:00.0 (unnamed net_device) (uninitialized): PHC added
[    1.664992] igc 0000:57:00.0 (unnamed net_device) (uninitialized): PHC added
[    7.979902] thunderbolt-net 0-1.0 en05: renamed from thunderbolt0
[    9.412838] thunderbolt-net 1-1.0 en06: renamed from thunderbolt0



root@pve1:/lib/systemd/system# dmesg | grep ceph
[    4.744503] systemd[1]: /lib/systemd/system/ceph-volume@.service:8: Unit uses KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update the service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
[    4.745084] systemd[1]: /lib/systemd/system/ceph-volume@.service:8: Unit uses KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update the service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
[    4.745256] systemd[1]: /lib/systemd/system/ceph-volume@.service:8: Unit uses KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update the service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
[    4.745435] systemd[1]: /lib/systemd/system/ceph-volume@.service:8: Unit uses KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update the service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
[    4.745605] systemd[1]: /lib/systemd/system/ceph-volume@.service:8: Unit uses KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update the service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
[    4.797853] systemd[1]: Created slice system-ceph\x2dmds.slice - Slice /system/ceph-mds.
[    4.798015] systemd[1]: Created slice system-ceph\x2dmgr.slice - Slice /system/ceph-mgr.
[    4.798150] systemd[1]: Created slice system-ceph\x2dmon.slice - Slice /system/ceph-mon.
[    4.798269] systemd[1]: Created slice system-ceph\x2dvolume.slice - Slice /system/ceph-volume.
[    4.799074] systemd[1]: Reached target ceph-fuse.target - ceph target allowing to start/stop all ceph-fuse@.service instances at once.
[   30.834473] Key type ceph registered
[   30.834519] libceph: loaded (mon/osd proto 15/24)
[   30.847651] ceph: loaded (mds proto 32)
[   32.966288] libceph: mon2 (1)[fc00::83]:6789 session established
[   32.966770] libceph: client16576734 fsid 5e55fd50-d135-413d-bffe-9d0fae0ef5fa
[   33.152985] libceph: mon2 (1)[fc00::83]:6789 session established
[   33.153572] libceph: client16576758 fsid 5e55fd50-d135-413d-bffe-9d0fae0ef5fa

So I wonder how to delay Ceph startup until Thunderbolt is up and the interfaces are renamed? One untested idea is sketched below.
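
e.g. a systemd drop-in (untested, just a sketch) that makes the ceph target wait for the renamed TB devices to appear:

Code:
# untested idea: make ceph.target wait for en05/en06 before the ceph units start
mkdir -p /etc/systemd/system/ceph.target.d
cat > /etc/systemd/system/ceph.target.d/wait-for-tb.conf <<'EOF'
[Unit]
Wants=sys-subsystem-net-devices-en05.device sys-subsystem-net-devices-en06.device
After=sys-subsystem-net-devices-en05.device sys-subsystem-net-devices-en06.device
EOF
systemctl daemon-reload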
 
