[SOLVED] Thunderbolt : Linux Kernel Error "kernel: thunderbolt 1-1: failed request link state change, aborting"

scyto

Active Member
Aug 8, 2023
I have a 3 node NUC cluster where I do a thunderbolt-net mesh/routed network between the 3 nodes. (Routed 26Gbps is pretty awesome)

In one scenario it generates the error in the title : pve1 kernel: thunderbolt 1-1: failed request link state change, aborting

To expand on the failure test scenarios that work vs don't work:

I find the following scenarios super reliable:
  • pulling one of the three TB cables
  • Rebooting any node
  • hard failure of any node (pull the power cord)
In these tests everything comes back ok when the physical fault is corrected

However, for this scenario alone things are not so pretty:
  • shutdown a node (gracefully) and power back on by pressing the front button
This generates the error and usually affects one TB connection (and very occasionally two)

I know this is a very edge case scenario; I am hoping someone has seen this on other devices and found a fix.

I am running Proxmox 8

Code:
[    1.585102] ACPI: bus type thunderbolt registered.
[    3.532746] thunderbolt 0-0:1.1: new retimer found, vendor=0x8087 device=0x15ee
[    5.471801] thunderbolt 1-0:1.1: new retimer found, vendor=0x8087 device=0x15ee
[   17.035024] thunderbolt 0-1: new host found, vendor=0x8086 device=0x1
[   17.035028] thunderbolt 0-1: Intel Corp. pve3
[   17.038497] thunderbolt-net 0-1.0 en05: renamed from thunderbolt0
[   18.230611] thunderbolt 1-1: failed request link state change, aborting
....
[   83.895648] thunderbolt 1-1: failed request link state change, aborting
[   84.919547] thunderbolt 1-1: failed request link state change, aborting
[   85.943324] thunderbolt 1-1: failed request link state change, aborting
[   86.899519] thunderbolt 1-0:1.1: retimer disconnected
[   91.407058] thunderbolt 1-0:1.1: new retimer found, vendor=0x8087 device=0x15ee
[   96.726934] thunderbolt 1-1: new host found, vendor=0x8086 device=0x1
[   96.726938] thunderbolt 1-1: Intel Corp. pve2
[   96.729412] thunderbolt-net 1-1.0 en06: renamed from thunderbolt0
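
To see which Thunderbolt device is affected and how often the error repeats, the kernel log can be summarised. This is a minimal sketch, assuming the message format shown in the log above; `tb_link_failures` is a hypothetical helper name, not part of any tool.

```shell
# Summarise "failed request link state change" errors per Thunderbolt device.
# Reads kernel log text on stdin and prints "<device> <count>" per device.
# Usage (on a node):  dmesg | tb_link_failures
tb_link_failures() {
    grep 'failed request link state change' \
        | sed -E 's/.*thunderbolt ([0-9]+-[0-9]+): failed.*/\1/' \
        | sort | uniq -c | awk '{print $2, $1}'
}
```

Piping `journalctl -k` into it instead of `dmesg` should work as well, since only the part of the line after `thunderbolt` is matched.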
 
I have a 3 node NUC cluster where I do a thunderbolt-net mesh/routed network between the 3 nodes. (Routed 26Gbps is pretty awesome)
Sounds neat!
This generates the error and usually affects one TB connection (and very occasionally two)

I know this is a very edge case scenario, i am hoping someone has seen this on other devices and found a fix.
Hmm, yeah this is rather an edge case and not sure how much we can help here.
In general, I'd recommend checking that the firmware for all NUCs is up-to-date.

There seems to be Thunderbolt specific firmware, for intel NUC see:
https://www.intel.de/content/www/de/de/support/articles/000026171/intel-nuc/intel-nuc-kits.html
And for how to apply it see the kernel docs (disclaimer, I did not personally test this):
https://www.kernel.org/doc/html/lat...ing-nvm-on-thunderbolt-device-host-or-retimer


You could also test if forcing power off and then on again on the other host could work to regain network:
https://www.kernel.org/doc/html/latest/admin-guide/thunderbolt.html#forcing-power

If it does, you could write a small script that runs on boot and connects to the other NUCs to trigger such a reset.
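
A rough sketch of what such a script could look like. Assumptions: the `force_power` path is the one given in the kernel documentation linked above and only exists when the intel-wmi-thunderbolt driver is bound on that hardware; the peer host names passed to `tb_reset_peers` and root SSH access between the nodes are hypothetical; and it is untested whether releasing and re-forcing power actually resets the link.

```shell
# Path from the kernel admin guide ("Forcing power"). It may not exist on
# every machine. TB_FORCE_POWER can be overridden for testing.
TB_FORCE_POWER="${TB_FORCE_POWER:-/sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power}"

# Release forced power, wait, then force it on again (local node).
tb_power_cycle() {
    if [ ! -w "$TB_FORCE_POWER" ]; then
        echo "force_power not available: $TB_FORCE_POWER" >&2
        return 1
    fi
    echo 0 > "$TB_FORCE_POWER"
    sleep "${TB_POWER_DELAY:-2}"
    echo 1 > "$TB_FORCE_POWER"
}

# Do the same on each peer over SSH (host names are hypothetical).
# Usage: tb_reset_peers pve2 pve3
tb_reset_peers() {
    for peer in "$@"; do
        ssh "root@$peer" \
            "echo 0 > '$TB_FORCE_POWER'; sleep 2; echo 1 > '$TB_FORCE_POWER'"
    done
}
```

The local function refuses to write if the attribute is missing, so it is safe to try on a node first before wiring it into a boot-time unit.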
 
@t.lamprecht

Thanks, this is a NUC13 and there appears to be no updated TB firmware (the article is only for NUC10 and lower)

I like the idea of the power - though I am unclear whether one can force the port off with that command

Looking at the firmware upgrade docs, I noticed these commands for changing the state of the retimers:
# echo 1 > /sys/bus/thunderbolt/devices/0-0/usb4_port1/offline
# echo 1 > /sys/bus/thunderbolt/devices/0-0/usb4_port1/rescan
# echo 0 > /sys/bus/thunderbolt/devices/0-0/usb4_port1/offline

I am not sure how to find out from the code (I am not a programmer) what other actions there are besides offline and rescan - maybe there are more, like a reset.

I will try issuing these commands next time I see the issue
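
Wrapped up as a small function, the sequence could look like the sketch below. It only assumes the three writes quoted above; `tb_port_rescan` is a hypothetical name, and the function checks that the sysfs attributes exist and are writable first, because not every controller exposes them.

```shell
# Run the offline -> rescan -> online sequence on one USB4 port directory,
# but only if both sysfs attributes exist and are writable.
# Usage: tb_port_rescan /sys/bus/thunderbolt/devices/0-0/usb4_port1
tb_port_rescan() {
    port_dir="$1"
    for attr in offline rescan; do
        if [ ! -w "$port_dir/$attr" ]; then
            echo "missing or not writable: $port_dir/$attr" >&2
            return 1
        fi
    done
    echo 1 > "$port_dir/offline"   # take the port offline
    echo 1 > "$port_dir/rescan"    # rescan for retimers
    echo 0 > "$port_dir/offline"   # bring the port back online
}
```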

also i don't think this helps, but i found the code in question generating the error....

https://github.com/torvalds/linux/b...70f/drivers/thunderbolt/xdomain.c#L1266-L1288
 
Looking at the firmware upgrade docs, I noticed these commands for changing the state of the retimers:
# echo 1 > /sys/bus/thunderbolt/devices/0-0/usb4_port1/offline
# echo 1 > /sys/bus/thunderbolt/devices/0-0/usb4_port1/rescan
# echo 0 > /sys/bus/thunderbolt/devices/0-0/usb4_port1/offline

I am not sure how to find out from the code (I am not a programmer) what other actions there are besides offline and rescan - maybe there are more, like a reset.

I will try issuing these commands next time I see the issue
Nope, that was a blind alley: those nodes don't exist on my NUC, and I get access denied when trying to issue the commands.
 
In the same way that I compiled an i915 driver for vGPU and used DKMS to run it, is it possible for me to do something similar for some later version of thunderbolt and thunderbolt-net held in the Linux kernel repo?
 
Hmm, OK, yeah, I don't have much experience with Thunderbolt in such use cases, so you'd need to research and experiment with triggering a reset on your own.

In the same way that I compiled an i915 driver for vGPU and used DKMS to run it, is it possible for me to do something similar for some later version of thunderbolt and thunderbolt-net held in the Linux kernel repo?
Maybe, but as those are quite intertwined with the (not stable) internal kernel ABI it might be hard to do; that's just a guesstimate, though.

What I'd recommend now comes close to that anyway, though: if you do not use ZFS for boot, you could simply try the newest mainline build of the kernel to see if that fixes things.

You could use the Ubuntu kernel mainline PPA, where every kernel version gets auto-built. The newest would be v6.5.2, but that one seems to have failed to build for x86_64/amd64, so maybe try v6.5.1:

https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.5.1/amd64/

You need to download both, the image and the modules package, then you can install them with e.g.:

Code:
apt install ./linux-image-unsigned-6.5.1-060501-generic_6.5.1-060501.202309020842_amd64.deb ./linux-modules-6.5.1-060501-generic_6.5.1-060501.202309020842_amd64.deb

As you mentioned some dkms package you might also need to add the headers package to the mix
 
@t.lamprecht

FYI The owner of the thunderbolt and thunderbolt-net code has reproduced both my bugs (this connection bug and the IPv6 one in my other thread). I will be sure to reply to both threads when I have an update. Also means no point me trying to use the 6.5 kernel at this time.

However, thanks for the instructions. I am sure I will need to test when there is a fix, so I will keep those in my back pocket.
 
FYI The owner of the thunderbolt and thunderbolt-net code has reproduced both my bugs (this connection bug and the IPv6 one in my other thread). I will be sure to reply to both threads when I have an update. Also means no point me trying to use the 6.5 kernel at this time.
Great to hear, out of interest, did you ask them on some mailing list or so, i.e., something where others could follow the discussion too?

Anyhow, thanks for relaying this to upstream and improving FOSS that way!
 
Great to hear, out of interest, did you ask them on some mailing list or so, i.e., something where others could follow the discussion too?
Being utterly new to linux kernel and not a developer I took a blind leap at emailing them (after the debian mailing list for the IPv6 bug was a bust)
I mentioned this second connection bug to them and I think that one got them interested.

I just received two patches; do you have something you can point me to on how to apply them to my PVE kernel?

I think this is the basics, but i am way out of my depth, lol.
 
OK, I will use this post to update how far I got, and anyone else can jump in and tell me how I am being an idiot (so long as you tell me the right step too, lol)

i started with this on my WSL2 dev machine (I hope i can compile in WSL?)

Code:
mkdir -p ~/src/linux
cd ~/src/linux
sudo apt-get install libncurses5-dev gcc make git exuberant-ctags bc libssl-dev build-essential wget bison flex libncurses-dev libelf-dev rsync zstd debhelper
git clone git://kernel.ubuntu.com/kernel-ppa/stable-queue-branches.git
cd ~/src/linux/cod/mainline/v6.5.1

copy the config file from proxmox aka /boot/config-6.2.16-10-pve from running node to my build location
copy patch files to same build location

Code:
git config --global user.email "you@example.com"
git config --global user.name "Your Name"
git am <patch1>
git am <patch2>
echo "-tbfixes" > localversion

When i run make oldconfig it asks lots of questions i don't know how to answer so i switched to make olddefconfig

Code:
make clean
make olddefconfig
make -j `getconf _NPROCESSORS_ONLN`

OK, completed that step; it took a bunch of extra apt-gets on WSL to get through the compile.
Now to make the deb files:

Code:
make deb-pkg
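
Because `make olddefconfig` silently picks defaults for every new option, it can be worth checking what actually changed relative to the Proxmox config before installing the result. A minimal sketch, assuming both files are plain kernel config files; `config_changes` is a hypothetical helper, and it only compares options that are set (lines of the form `CONFIG_X=...`), not the `# CONFIG_X is not set` comments. The kernel tree also ships `scripts/diffconfig` for a more complete comparison.

```shell
# Print CONFIG_ options whose value differs between two kernel config files.
# Lines only in the old file are prefixed with "-", lines only in the new
# file with "+".
# Usage: config_changes /boot/config-6.2.16-10-pve .config
config_changes() {
    old="$1"
    new="$2"
    tmp_old=$(mktemp)
    tmp_new=$(mktemp)
    grep -E '^CONFIG_[A-Za-z0-9_]+=' "$old" | sort > "$tmp_old"
    grep -E '^CONFIG_[A-Za-z0-9_]+=' "$new" | sort > "$tmp_new"
    comm -23 "$tmp_old" "$tmp_new" | sed 's/^/-/'
    comm -13 "$tmp_old" "$tmp_new" | sed 's/^/+/'
    rm -f "$tmp_old" "$tmp_new"
}
```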

well this seems like a result:
  • linux-headers-6.5.2-tbfixes+_6.5.2-00015-ga2c22042b570-5_amd64.deb
  • linux-image-6.5.2-tbfixes+_6.5.2-00015-ga2c22042b570-5_amd64.deb
  • linux-image-6.5.2-tbfixes+-dbg_6.5.2-00015-ga2c22042b570-5_amd64.deb
  • linux-libc-dev_6.5.2-00015-ga2c22042b570-5_amd64.deb
Not sure I am brave enough to install on one of the nodes, until someone tells me I haven't been super stupid...

also if i install these:
  1. will it do all the initramfs stuff for me?
  2. how do I roll back afterwards to the production kernel and modules... ?
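
For the first question, one way to check after installing is to look for the initramfs that the package hooks should have generated. This is a sketch under assumptions: the release string `6.5.2-tbfixes+` is inferred from the package names above, the image is expected at the usual Debian location `/boot/initrd.img-<release>`, and `has_initrd` is a hypothetical helper.

```shell
# Return success if an initramfs exists for the given kernel release.
# BOOT_DIR is overridable only so the check can be tried outside /boot.
has_initrd() {
    release="$1"
    [ -f "${BOOT_DIR:-/boot}/initrd.img-$release" ]
}

# Example, run as root on the node after installing the packages:
#   has_initrd '6.5.2-tbfixes+' || update-initramfs -c -k '6.5.2-tbfixes+'
#
# Rolling back (assumption: the stock pve kernel packages stay installed):
# boot the stock kernel from the boot menu, then remove the test kernel with
#   dpkg --remove 'linux-image-6.5.2-tbfixes+'
```

Installing a custom kernel package does not remove the existing Proxmox kernel, so the old entry should remain selectable in the boot menu.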
 
Ok i didn't wait, i installed them.

This feels really weird running my own kernel - kinda cool.

1. TCP-IPv6 is now working perfectly - yay!
2. the connection issues seem to be gone - need to do some more testing to prove this (reboots, power offs, hard power pulls etc)

but so far - looking good.

@t.lamprecht should i be worried about running with a 6.5.2 non-proxmox kernel?
Should i ask y'all nicely when the code is live in linux kernel to backport to yours?
 
@t.lamprecht should i be worried about running with a 6.5.2 non-proxmox kernel?
Well, it's not officially supported for e.g. enterprise support, so there's that.
W.r.t. actual practical implication for running VMs and all that on it there are the following general issues:
  • ZFS support isn't included by default on those kernels IIRC, so you cannot use that feature without some DKMS module or the like.
  • You're missing some targeted bug fixes we add to our kernel, which are often rather specific to running and live-migrating virtual machines. If you use the same hardware (identical CPU and firmware) for each of your nodes, you should already be covered for half of them. The others are for PCIe pass-through, some security fixes that are highly likely to be included in the v6.5.2 kernel already, and some fixes for more specific things. So you could run into an issue that wouldn't be there in our kernel, but it could just as well be the other way around, and you should not expect corruption or the like just from running a mainline kernel for a while. So, if this isn't some production system you depend on to run rock stable with financial implication otherwise, I'd think that it's relatively safe to say you'll be fine.
Should i ask y'all nicely when the code is live in linux kernel to backport to yours?
Sure, if you can give me a link to the patch, e.g., on some kernel mailing list or in the git repo of the thunderbolt maintainers or Torvalds himself, I can see if I can cherry-pick it for our 6.2 based kernel so that it's included with the next bump.
As this is a bit of a niche use case, and you already provided positive feedback that it works, I don't have many concerns about spearheading the addition of this to our kernel.
 
So, if this isn't some production system you depend on to run rock stable with financial implication otherwise, I'd think that it's relatively safe to say you'll be fine.
Thanks, this is a homelab, nothing truly critical. I don’t plan on using ZFS. I am using Ceph - a 26 Gbps Thunderbolt network in a 3-node mesh is awesome for Ceph…
 
thanks for the pointer to the patches.
There's a v2 for that, which seems OK as is: https://lore.kernel.org/all/20230913052647.407420-1-mika.westerberg@linux.intel.com/
Did you just need that single patch, or (parts of) the rest of the series too? I'm asking because it is part of a series of five patches. They all seem rather unrelated to each other, but I would like to avoid missing a preparatory patch for this fix.
 
thanks for the pointer to the patches.
no thank you for considering them! (also it felt like you couldn't quite believe i got a private fix for this initially, lol. as my mum said when i was a kid "if you never ask, you will never get" lol)
 
