Intel NUC 13 Pro Thunderbolt Ring Network Ceph Cluster

Ah OK, so it looks like there's a bit of overhead in getting the packets routed in the other direction. Not too bad, I guess, considering it's still pretty fast. I wonder if any adjustments can be made to make it just as fast. Once I get my third node, I'll test this to see what mine does; right now I only have two nodes. I get the full 26Gb/s, but I do see a small number of retries in iperf3 tests, about 5-25. I tried two different cables and it made no difference, one of them being the OWC cable that's supposed to be one of the best you can get.
 
Have you managed to figure out something?
I did not figure out what the FRR routing performance issue was with a simulated failure of a direct connection. I got tired of it all, wiped my nodes and started fresh.
For what it's worth, and unrelated to that routing issue: I was running Kingston SEDC600M/960G SATA SSDs as the Ceph drives and 980 Pro NVMe 1TB drives as boot drives.

This is what I was getting over Thunderbolt with those Kingston enterprise SATA SSDs. I didn't change any cache settings, which probably could have been done since the drives have power-loss protection (PLP).
Code:
rados bench -p ceph-vm 120 write -b 4M -t 16 --run-name pve02 --no-cleanup
Total time run:         120.017
Total writes made:      13831
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     460.969 (Kingston advertises 530MB/s)
Stddev Bandwidth:       40.2623
Max bandwidth (MB/sec): 512
Min bandwidth (MB/sec): 316
Average IOPS:           115 (?!?Not even close to the advertised tens of thousands of IOPS)
Stddev IOPS:            10.0656
Max IOPS:               128
Min IOPS:               79
Average Latency(s):     0.138835
Stddev Latency(s):      0.0341182
Max latency(s):         0.588993
Min latency(s):         0.0137876
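
The 115 IOPS figure follows from the 4 MiB object size rather than from a drive limit: rados bench counts one operation per object written, while vendor IOPS ratings are normally quoted for small random I/O. A minimal sketch of the arithmetic, using only the numbers from the output above (the function names are illustrative and not part of any Ceph tool):

```python
def bench_iops(total_writes: int, total_seconds: float) -> float:
    """Operations per second: one operation per object written."""
    return total_writes / total_seconds


def bench_bandwidth_mib(iops: float, object_size_bytes: int) -> float:
    """Bandwidth in MiB/s implied by an IOPS rate at a fixed object size."""
    return iops * object_size_bytes / (1024 * 1024)


# Values copied from the rados bench output above.
iops = bench_iops(13831, 120.017)
bandwidth = bench_bandwidth_mib(iops, 4194304)

print(f"{iops:.1f} IOPS x 4 MiB = {bandwidth:.1f} MiB/s")
```

About 115 writes per second at 4 MiB each already accounts for the roughly 461 MB/s reported, so a run with a much smaller -b value would be the fairer comparison against an advertised IOPS number.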

Since wiping, I reversed it, and now the SATA SSDs are the boot drives. I was going to set it all up again and rerun the tests.
BUT I suffer from analysis paralysis. Debating doing Ceph at all. Debating using FRR and Thunderbolt. Debating ditching IPv6 and just doing a simple OSPF FRR implementation. Debating pinning the PVE kernel to a lower version and virtualizing my iGPUs with SR-IOV. Debating burning it all down to the ground and just enjoying the summer.

I'm also waiting to see how Jim's Garage gets on with his YouTube MS-01 and Thunderbolt implementation.
 
I think I know what the issue is. What MTU do you have set on the en05 and en06 interfaces? I only see these abysmal forwarding speeds when the interface MTU is 1500; with the MTU set to 65520 I get normal speeds.
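
A quick way to confirm which MTU is actually in effect is to read it from sysfs on each node. The sketch below assumes a Linux host and the en05/en06 interface names used in this thread; read_mtu and report_mtus are hypothetical helper names, not part of any existing tool.

```python
from pathlib import Path


def read_mtu(iface: str, sysfs_root: str = "/sys/class/net"):
    """Return the MTU of iface as an int, or None if the interface is absent."""
    mtu_file = Path(sysfs_root) / iface / "mtu"
    try:
        return int(mtu_file.read_text().strip())
    except FileNotFoundError:
        return None


def report_mtus(ifaces, sysfs_root: str = "/sys/class/net"):
    """Map each interface name to its current MTU (None if not present)."""
    return {iface: read_mtu(iface, sysfs_root) for iface in ifaces}


if __name__ == "__main__":
    # en05 and en06 are the Thunderbolt interface names used in this thread.
    for iface, mtu in report_mtus(["en05", "en06"]).items():
        print(f"{iface}: {mtu if mtu is not None else 'not present'}")
```

Running it on every node before a failover test makes it easier to tell whether a bad result came from a 1500 or a 65520 setting.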
 
I believe I had it set to 1500 when I got the bad result so it sounds like you're onto something. I did not perform the failure test when I had 65520 set for comparison.
I was initially using 65520 but since I didn't notice any measurable performance gains, and higher retransmits, I reverted to the standard 1500.
 
I see quite a big difference in Ceph write performance: with MTU 65520 I get around 600Mb/s, versus over 1GB/s with MTU 1500. But in case of a link failure the cluster is basically unusable at MTU 1500, since the forwarding speed is so low that Ceph will basically just lock up.

Also, I'm on the latest kernel, so that might be related; as I wrote in a previous post, this kernel is a bit worse with higher MTU settings.
 
Have you found a reliable method to run the affinity changing script by any chance? I tried the if-up.d method and it seems hit and miss so far.
 
I have it there for the affinity and haven't had any issues with it since I set it up. Ceph performance is always around 1Gb/s write and 2Gb/s read. It's been solid for a while now.
 
I finally got around to working on the pve cluster and testing out MTU size.
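For whatever reason I can't get top bandwidth when using MTU 65520, but the routing performance is great. MTU 1500 gives me top bandwidth but abysmal routing. The four iperf3 runs below are, in order: MTU 1500 over the direct link, MTU 1500 rerouted through the other node, MTU 65520 over the direct link, and MTU 65520 rerouted.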

Code:
root@pve03:~# iperf3 -c 10.0.0.81
Connecting to host 10.0.0.81, port 5201
[  5] local 10.0.0.83 port 55860 connected to 10.0.0.81 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.76 GBytes  23.7 Gbits/sec   93   1.29 MBytes       
[  5]   1.00-2.00   sec  2.90 GBytes  24.9 Gbits/sec  450   1.21 MBytes       
[  5]   2.00-3.00   sec  2.83 GBytes  24.3 Gbits/sec  495   1.51 MBytes       
[  5]   3.00-4.00   sec  2.85 GBytes  24.5 Gbits/sec  540   1.38 MBytes       
[  5]   4.00-5.00   sec  2.82 GBytes  24.3 Gbits/sec  271   1.17 MBytes       
[  5]   5.00-6.00   sec  2.82 GBytes  24.2 Gbits/sec  182   1.27 MBytes       
[  5]   6.00-7.00   sec  2.85 GBytes  24.5 Gbits/sec  315   1.47 MBytes       
[  5]   7.00-8.00   sec  2.84 GBytes  24.4 Gbits/sec  585   1.42 MBytes       
[  5]   8.00-9.00   sec  2.85 GBytes  24.5 Gbits/sec  363   1.12 MBytes       
[  5]   9.00-10.00  sec  2.84 GBytes  24.4 Gbits/sec  360   1.26 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  28.4 GBytes  24.4 Gbits/sec  3654             sender
[  5]   0.00-10.00  sec  28.4 GBytes  24.4 Gbits/sec                  receiver


Code:
root@pve03:~# iperf3 -c 10.0.0.81
Connecting to host 10.0.0.81, port 5201
[  5] local 10.0.0.83 port 52084 connected to 10.0.0.81 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   488 KBytes  4.00 Mbits/sec   38   2.83 KBytes       
[  5]   1.00-2.00   sec   950 KBytes  7.78 Mbits/sec   40   2.83 KBytes       
[  5]   2.00-3.00   sec   158 KBytes  1.30 Mbits/sec   26   5.66 KBytes       
[  5]   3.00-4.00   sec   160 KBytes  1.31 Mbits/sec   28   2.83 KBytes       
[  5]   4.00-5.00   sec   164 KBytes  1.34 Mbits/sec   26   4.24 KBytes       
[  5]   5.00-6.00   sec   236 KBytes  1.93 Mbits/sec   28   2.83 KBytes       
[  5]   6.00-7.00   sec   161 KBytes  1.32 Mbits/sec   26   7.07 KBytes       
[  5]   7.00-8.00   sec   163 KBytes  1.33 Mbits/sec   28   2.83 KBytes       
[  5]   8.00-9.00   sec  80.6 KBytes   660 Kbits/sec   26   2.83 KBytes       
[  5]   9.00-10.00  sec   478 KBytes  3.92 Mbits/sec   36   2.83 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.97 MBytes  2.49 Mbits/sec  302             sender
[  5]   0.00-10.00  sec  2.91 MBytes  2.44 Mbits/sec                  receiver


Code:
root@pve03:~# iperf3 -c 10.0.0.81
Connecting to host 10.0.0.81, port 5201
[  5] local 10.0.0.83 port 47986 connected to 10.0.0.81 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.16 GBytes  18.6 Gbits/sec  519   3.18 MBytes       
[  5]   1.00-2.00   sec  1.62 GBytes  13.9 Gbits/sec  395   1.50 MBytes       
[  5]   2.00-3.00   sec   999 MBytes  8.38 Gbits/sec  329   1.62 MBytes       
[  5]   3.00-4.00   sec  1.54 GBytes  13.3 Gbits/sec  573   1.69 MBytes       
[  5]   4.00-5.00   sec  1.74 GBytes  14.9 Gbits/sec  451   1.75 MBytes       
[  5]   5.00-6.00   sec  1.93 GBytes  16.5 Gbits/sec  457   1.62 MBytes       
[  5]   6.00-7.00   sec  2.67 GBytes  23.0 Gbits/sec  724   2.31 MBytes       
[  5]   7.00-8.00   sec  2.29 GBytes  19.6 Gbits/sec  674   1.19 MBytes       
[  5]   8.00-9.00   sec  1.99 GBytes  17.1 Gbits/sec  468   1.62 MBytes       
[  5]   9.00-10.00  sec  2.71 GBytes  23.2 Gbits/sec  663   1.69 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  19.6 GBytes  16.9 Gbits/sec  5253             sender
[  5]   0.00-10.00  sec  19.6 GBytes  16.8 Gbits/sec                  receiver

iperf Done.
root@pve03:~# ip route
default via 192.168.10.1 dev vmbr0 proto kernel onlink
10.0.0.81 nhid 18 via 10.0.0.81 dev en06 proto openfabric metric 20 onlink
10.0.0.82 nhid 16 via 10.0.0.82 dev en05 proto openfabric metric 20 onlink
172.16.1.0/24 dev vmbr1 proto kernel scope link src 172.16.1.13
192.168.10.0/24 dev vmbr0 proto kernel scope link src 192.168.10.13

Code:
root@pve03:~# iperf3 -c 10.0.0.81
Connecting to host 10.0.0.81, port 5201
[  5] local 10.0.0.83 port 50444 connected to 10.0.0.81 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.10 GBytes  18.0 Gbits/sec  112   2.25 MBytes       
[  5]   1.00-2.00   sec  2.11 GBytes  18.1 Gbits/sec  126   1.75 MBytes       
[  5]   2.00-3.00   sec  2.12 GBytes  18.2 Gbits/sec  145   2.12 MBytes       
[  5]   3.00-4.00   sec  2.10 GBytes  18.0 Gbits/sec  113   2.19 MBytes       
[  5]   4.00-5.00   sec  1.65 GBytes  14.2 Gbits/sec  124   2.06 MBytes       
[  5]   5.00-6.00   sec  1.67 GBytes  14.3 Gbits/sec  103   1.50 MBytes       
[  5]   6.00-7.00   sec  1.65 GBytes  14.2 Gbits/sec  102   2.87 MBytes       
[  5]   7.00-8.00   sec  1.61 GBytes  13.8 Gbits/sec   86   2.81 MBytes       
[  5]   8.00-9.00   sec  1.25 GBytes  10.7 Gbits/sec   84   2.06 MBytes       
[  5]   9.00-10.00  sec  1.73 GBytes  14.9 Gbits/sec  105   3.00 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  18.0 GBytes  15.4 Gbits/sec  1100             sender
[  5]   0.00-10.00  sec  18.0 GBytes  15.4 Gbits/sec                  receiver


iperf Done.
root@pve03:~# ip route
default via 192.168.10.1 dev vmbr0 proto kernel onlink
10.0.0.81 nhid 34 via 10.0.0.82 dev en05 proto openfabric metric 20 onlink
10.0.0.82 nhid 34 via 10.0.0.82 dev en05 proto openfabric metric 20 onlink
172.16.1.0/24 dev vmbr1 proto kernel scope link src 172.16.1.13
192.168.10.0/24 dev vmbr0 proto kernel scope link src 192.168.10.13
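
To put the four results side by side, the sender summary lines can be converted to bits per second and compared. This is only a sketch: bitrate_bps is a made-up helper, and treating runs 1-2 and runs 3-4 as direct/rerouted pairs for the two MTU settings is an assumption based on the description and the ip route output above.

```python
import re

_UNITS = {"": 1.0, "K": 1e3, "M": 1e6, "G": 1e9}
_BITRATE = re.compile(r"([\d.]+)\s+([KMG]?)bits/sec")


def bitrate_bps(summary_line: str) -> float:
    """Extract the bitrate from an iperf3 text line and convert it to bits/s."""
    match = _BITRATE.search(summary_line)
    if match is None:
        raise ValueError(f"no bitrate found in: {summary_line!r}")
    value, prefix = match.groups()
    return float(value) * _UNITS[prefix]


# Sender summary lines copied from the four runs above, in order.
runs = [
    "[  5]   0.00-10.00  sec  28.4 GBytes  24.4 Gbits/sec  3654             sender",
    "[  5]   0.00-10.00  sec  2.97 MBytes  2.49 Mbits/sec  302             sender",
    "[  5]   0.00-10.00  sec  19.6 GBytes  16.9 Gbits/sec  5253             sender",
    "[  5]   0.00-10.00  sec  18.0 GBytes  15.4 Gbits/sec  1100             sender",
]

# Assumed pairing: (run 1, run 2) and (run 3, run 4) are direct vs rerouted.
direct_a, rerouted_a, direct_b, rerouted_b = (bitrate_bps(line) for line in runs)

print(f"runs 1 -> 2: {direct_a / rerouted_a:,.0f}x slower")
print(f"runs 3 -> 4: {(1 - rerouted_b / direct_b) * 100:.0f}% slower")
```

Under that pairing, the first setting loses almost four orders of magnitude when traffic is rerouted, while the second loses only around a tenth of its bandwidth.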
 
I've got a relatively recent issue. If one of my nodes reboots, the other two lock up and reboot within a few seconds. I'm wondering if this is due to pinning the IRQs to a specific CPU core, as this hasn't happened to me in the past and that's the most recent change I've made outside of patches (though I have a pinned 6.5 kernel).

I first had this happen last week when I had a hardware issue with one node; when I shut that node down, the others rebooted.

Then today I had a hung process on a different node. I rebooted it, and my other two nodes both hard locked up. One rebooted automatically; the other didn't.
 
Hi Everyone,

I know this setup is based on Intel NUC devices that have thunderbolt built in.

I'm wondering if this setup can be done on devices that don't support Thunderbolt but have USB 3.2 Gen 2, which is technically capable of 10/20Gbps. Has anybody tried this?

I've done some initial checks, but there don't seem to be any built-in "thunderbolt-net" devices on this hardware, so maybe some sort of mods or drivers are needed to get it going.

Guidance would be appreciated.

Thank you.
 
Hi, I'm hoping someone can help me write a [Match] section that matches the udev path IDs of my two Thunderbolt ports.

Here is the udevadm monitor output for the LEFT port:
Code:
KERNEL[1419.758525] remove   /devices/pci0000:00/0000:00:1c.4/0000:07:00.0/0000:08:00.0/0000:09:00.0/domain0/0-0/0-3 (thunderbolt)
UDEV  [1419.764655] remove   /devices/pci0000:00/0000:00:1c.4/0000:07:00.0/0000:08:00.0/0000:09:00.0/domain0/0-0/0-3 (thunderbolt)
KERNEL[1426.105479] change   /0-3 (thunderbolt)
KERNEL[1426.105533] add      /devices/pci0000:00/0000:00:1c.4/0000:07:00.0/0000:08:00.0/0000:09:00.0/domain0/0-0/0-3 (thunderbolt)
UDEV  [1426.111629] change   /0-3 (thunderbolt)
UDEV  [1426.112230] add      /devices/pci0000:00/0000:00:1c.4/0000:07:00.0/0000:08:00.0/0000:09:00.0/domain0/0-0/0-3 (thunderbolt)

...and here is the udevadm monitor output for the RIGHT port:
Code:
KERNEL[1471.631527] remove   /devices/pci0000:00/0000:00:1c.4/0000:07:00.0/0000:08:00.0/0000:09:00.0/domain0/0-0/0-1 (thunderbolt)
UDEV  [1471.637664] remove   /devices/pci0000:00/0000:00:1c.4/0000:07:00.0/0000:08:00.0/0000:09:00.0/domain0/0-0/0-1 (thunderbolt)
KERNEL[1478.137430] change   /0-1 (thunderbolt)
KERNEL[1478.137494] add      /devices/pci0000:00/0000:00:1c.4/0000:07:00.0/0000:08:00.0/0000:09:00.0/domain0/0-0/0-1 (thunderbolt)
UDEV  [1478.143898] change   /0-1 (thunderbolt)
UDEV  [1478.144416] add      /devices/pci0000:00/0000:00:1c.4/0000:07:00.0/0000:08:00.0/0000:09:00.0/domain0/0-0/0-1 (thunderbolt)

In both cases the difference is at the end of the path, rather than in the PCI part as in the gist guide. Can someone suggest a [Match] section for a systemd/network/thunderbolt.link file for these?

Here is what udevadm info says:
Code:
P: /devices/pci0000:00/0000:00:1c.4/0000:07:00.0/0000:08:00.0/0000:09:00.0/domain0/0-0/0-1
M: 0-1
R: 1
U: thunderbolt
T: thunderbolt_xdomain
E: DEVPATH=/devices/pci0000:00/0000:00:1c.4/0000:07:00.0/0000:08:00.0/0000:09:00.0/domain0/0-0/0-1
E: SUBSYSTEM=thunderbolt
E: DEVTYPE=thunderbolt_xdomain

P: /devices/pci0000:00/0000:00:1c.4/0000:07:00.0/0000:08:00.0/0000:09:00.0/domain0/0-0/0-3
M: 0-3
R: 3
U: thunderbolt
T: thunderbolt_xdomain
E: DEVPATH=/devices/pci0000:00/0000:00:1c.4/0000:07:00.0/0000:08:00.0/0000:09:00.0/domain0/0-0/0-3
E: SUBSYSTEM=thunderbolt
E: DEVTYPE=thunderbolt_xdomain

P: /devices/pci0000:00/0000:00:1c.4/0000:07:00.0/0000:08:00.0/0000:09:00.0/domain0/0-0/0-3/0-3.0
M: 0-3.0
R: 0
U: thunderbolt
T: thunderbolt_service
V: thunderbolt-net
E: DEVPATH=/devices/pci0000:00/0000:00:1c.4/0000:07:00.0/0000:08:00.0/0000:09:00.0/domain0/0-0/0-3/0-3.0
E: SUBSYSTEM=thunderbolt
E: DEVTYPE=thunderbolt_service
E: DRIVER=thunderbolt-net
E: MODALIAS=tbsvc:knetworkp00000001v00000001r00000001

P: /devices/pci0000:00/0000:00:1c.4/0000:07:00.0/0000:08:00.0/0000:09:00.0/domain0/0-0/0-3/0-3.0/net/thunderbolt0
M: thunderbolt0
R: 0
U: net
I: 6
E: DEVPATH=/devices/pci0000:00/0000:00:1c.4/0000:07:00.0/0000:08:00.0/0000:09:00.0/domain0/0-0/0-3/0-3.0/net/thunderbol>
E: SUBSYSTEM=net
E: INTERFACE=thunderbolt0
E: IFINDEX=6
E: USEC_INITIALIZED=5207507238
E: ID_NET_NAMING_SCHEME=v252
E: ID_NET_NAME_MAC=enx1234a92a085f
E: ID_BUS=pci
E: ID_VENDOR_ID=0x8086
E: ID_MODEL_ID=0x15d2
E: ID_PCI_CLASS_FROM_DATABASE=Generic system peripheral
E: ID_PCI_SUBCLASS_FROM_DATABASE=System peripheral
E: ID_VENDOR_FROM_DATABASE=Intel Corporation
E: ID_MODEL_FROM_DATABASE=JHL6540 Thunderbolt 3 NHI (C step) [Alpine Ridge 4C 2016]
E: ID_PATH=pci-0000:09:00.0
E: ID_PATH_TAG=pci-0000_09_00_0
E: ID_NET_DRIVER=thunderbolt-net
E: ID_NET_LINK_FILE=/usr/lib/systemd/network/99-default.link
E: ID_NET_NAME=thunderbolt0
E: SYSTEMD_ALIAS=/sys/subsystem/net/devices/thunderbolt0
E: TAGS=:systemd:
E: CURRENT_TAGS=:systemd:

I don't see a 'thunderbolt1' in the udevadm info output even though there are two thunderbolt ports?

Thanks!
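
To confirm that the two ports differ only in the last path component, the two device paths from the udevadm monitor output can be compared directly. This is only an illustrative sketch, and differing_components is a made-up helper. Since both ports sit behind the same PCI function (0000:09:00.0), a match keyed on the PCI path alone would presumably not tell them apart.

```python
def differing_components(path_a: str, path_b: str):
    """Return (index, a, b) for each path component that differs."""
    parts_a = path_a.strip("/").split("/")
    parts_b = path_b.strip("/").split("/")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(parts_a, parts_b))
        if a != b
    ]


# Device paths copied from the udevadm monitor output above.
PCI_PREFIX = "/devices/pci0000:00/0000:00:1c.4/0000:07:00.0/0000:08:00.0/0000:09:00.0"
LEFT = PCI_PREFIX + "/domain0/0-0/0-3"
RIGHT = PCI_PREFIX + "/domain0/0-0/0-1"

print(differing_components(LEFT, RIGHT))
```

The only difference reported is the final component (0-3 versus 0-1), so whatever does the matching has to look at the Thunderbolt device name rather than the PCI address.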
 
I have all three nodes set up and getting 26Gb/s to each other. But if I unplug one of the Thunderbolt cables, so that the data has to go from Node1 through Node2 to reach Node3, it's very slow, about 1Mb/s.

Shouldn't it still be 26Gb/s, or at least something a lot faster than 1Mb/s, when a cable is unplugged? I thought that was the whole point of having these in a ring.
 
