[SOLVED] Thunderbolt : Linux Kernel Error "kernel: thunderbolt 1-1: failed request link state change, aborting"

scyto · Sep 3, 2023

I have a 3 node NUC cluster where I do a thunderbolt-net mesh/routed network between the 3 nodes. (Routed 26Gbps is pretty awesome)

In one scenario it generates the error in the title : pve1 kernel: thunderbolt 1-1: failed request link state change, aborting

To expand on the failure test scenarios that work vs don't work:

I find the following scenarios super reliable:

pulling one of the three TB cables
Rebooting any node
hard failure of any node (pull the power cord)

In these tests everything comes back ok when the physical fault is corrected

However, for this scenario alone things are not so pretty:

shutdown a node (gracefully) and power back on by pressing the front button

This generates the error, usually on one TB connection (and very occasionally two).

I know this is very much an edge case, but I am hoping someone has seen this on other devices and found a fix.

I am running Proxmox 8

Code:

[    1.585102] ACPI: bus type thunderbolt registered.
[    3.532746] thunderbolt 0-0:1.1: new retimer found, vendor=0x8087 device=0x15ee
[    5.471801] thunderbolt 1-0:1.1: new retimer found, vendor=0x8087 device=0x15ee
[   17.035024] thunderbolt 0-1: new host found, vendor=0x8086 device=0x1
[   17.035028] thunderbolt 0-1: Intel Corp. pve3
[   17.038497] thunderbolt-net 0-1.0 en05: renamed from thunderbolt0
[   18.230611] thunderbolt 1-1: failed request link state change, aborting
....
[   83.895648] thunderbolt 1-1: failed request link state change, aborting
[   84.919547] thunderbolt 1-1: failed request link state change, aborting
[   85.943324] thunderbolt 1-1: failed request link state change, aborting
[   86.899519] thunderbolt 1-0:1.1: retimer disconnected
[   91.407058] thunderbolt 1-0:1.1: new retimer found, vendor=0x8087 device=0x15ee
[   96.726934] thunderbolt 1-1: new host found, vendor=0x8086 device=0x1
[   96.726938] thunderbolt 1-1: Intel Corp. pve2
[   96.729412] thunderbolt-net 1-1.0 en06: renamed from thunderbolt0
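To see at a glance which Thunderbolt device is stuck retrying, the log above can be summarised per device. This is only a rough sketch: the function name `count_link_failures` is made up for illustration, and it assumes the message text matches the lines shown in the log.

```shell
# Count "failed request link state change" messages per Thunderbolt device.
# Reads kernel log lines on stdin and prints "<device> <count>" per device.
count_link_failures() {
    awk '/thunderbolt [^ ]+: failed request link state change/ {
        for (i = 1; i <= NF; i++) {
            if ($i == "thunderbolt") {
                dev = $(i + 1)          # e.g. "1-1:"
                sub(/:$/, "", dev)      # strip the trailing colon
                count[dev]++
                break
            }
        }
    }
    END { for (d in count) print d, count[d] }' | sort
}

# Example (run on the affected node):
#   dmesg | count_link_failures
```

Retimer messages such as "retimer disconnected" are ignored because they do not match the error text.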

t.lamprecht · Sep 6, 2023

scyto said:
I have a 3 node NUC cluster where I do a thunderbolt-net mesh/routed network between the 3 nodes. (Routed 26Gbps is pretty awesome)

Sounds neat!

scyto said:
This generates the error, usually on one TB connection (and very occasionally two).

I know this is very much an edge case, but I am hoping someone has seen this on other devices and found a fix.

Hmm, yeah, this is rather an edge case and I'm not sure how much we can help here.
In general, I'd recommend checking that the firmware for all NUCs is up-to-date.

There seems to be Thunderbolt-specific firmware; for Intel NUCs see:
https://www.intel.de/content/www/de/de/support/articles/000026171/intel-nuc/intel-nuc-kits.html
And for how to apply it see the kernel docs (disclaimer, I did not personally test this):
https://www.kernel.org/doc/html/lat...ing-nvm-on-thunderbolt-device-host-or-retimer

You could also test if forcing power off and then on again on the other host could work to regain network:
https://www.kernel.org/doc/html/latest/admin-guide/thunderbolt.html#forcing-power

If it does, you could write a small script that runs on boot and connects to the other NUCs to trigger such a reset.
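A boot-time script along those lines might look roughly like the sketch below. It is untested and makes several assumptions: the node names are taken from the log earlier in the thread, root SSH between nodes is assumed to work, and the `force_power` path is the example path from the kernel admin guide for the intel-wmi-thunderbolt driver, which may not exist on this hardware. The function names are made up, and the script only prints the commands it would run (a dry run).

```shell
# Node names as seen in the kernel log; adjust to your cluster.
NODES="pve1 pve2 pve3"
# Example path from the kernel admin guide; assumption: it may be absent here.
FORCE_POWER="/sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power"

# Print every node in $NODES except the one given as $1.
peers_of() {
    for n in $NODES; do
        [ "$n" = "$1" ] || echo "$n"
    done
}

# Dry run: print the ssh commands that would force power off and then on
# again on each peer of node $1. Nothing is executed.
reset_plan() {
    for p in $(peers_of "$1"); do
        echo "ssh root@$p 'echo 0 > $FORCE_POWER; sleep 2; echo 1 > $FORCE_POWER'"
    done
}

# Example: reset_plan "$(hostname)"
```

Keeping it as a dry run makes it easy to check the commands by hand first; whether toggling this attribute actually recovers the link is exactly what would need testing.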

scyto · Sep 7, 2023

@t.lamprecht

Thanks. This is a NUC13, and there appears to be no updated TB firmware (the article only covers NUC10 and lower).

I like the force-power idea, though I am unclear whether one can force the port off with that command.

Looking at the firmware upgrade instructions, I noticed these commands for changing the state of the retimers:

Code:

# echo 1 > /sys/bus/thunderbolt/devices/0-0/usb4_port1/offline
# echo 1 > /sys/bus/thunderbolt/devices/0-0/usb4_port1/rescan
# echo 0 > /sys/bus/thunderbolt/devices/0-0/usb4_port1/offline

I am not a programmer, so I am not sure how to find out from the code what other actions exist besides offline and rescan. Maybe there are more, such as a reset.

I will try issuing these commands next time I see the issue.

Also, I don't think this helps, but I found the code that generates the error:

https://github.com/torvalds/linux/b...70f/drivers/thunderbolt/xdomain.c#L1266-L1288

scyto · Sep 7, 2023

scyto said:
Looking at the firmware upgrade instructions, I noticed these commands for changing the state of the retimers:
# echo 1 > /sys/bus/thunderbolt/devices/0-0/usb4_port1/offline
# echo 1 > /sys/bus/thunderbolt/devices/0-0/usb4_port1/rescan
# echo 0 > /sys/bus/thunderbolt/devices/0-0/usb4_port1/offline

I am not a programmer, so I am not sure how to find out from the code what other actions exist besides offline and rescan. Maybe there are more, such as a reset.

I will try issuing these commands next time I see the issue

Nope, that was a blind alley: those nodes don't exist on my NUC, and I get access denied when trying to issue the command.
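One way to check which attributes a given machine actually exposes, without reading kernel code, is to list the owner-writable files under the Thunderbolt sysfs directory. A minimal sketch follows; the function name `list_writable_attrs` is made up, and it assumes a `find` that supports `-maxdepth`, as GNU find does.

```shell
# List owner-writable attribute files up to two levels below each device
# directory in $1 (for example /sys/bus/thunderbolt/devices).
list_writable_attrs() {
    for dev in "$1"/*/; do
        [ -d "$dev" ] || continue
        # The trailing slash makes find descend into symlinked device dirs.
        find "$dev" -maxdepth 2 -type f -perm -200 2>/dev/null
    done | sort
}

# Example:
#   list_writable_attrs /sys/bus/thunderbolt/devices
```

A writable file is only a candidate for an "action"; what each attribute does is described in the kernel's sysfs ABI documentation for Thunderbolt.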

scyto · Sep 7, 2023

For vGPU I compiled an i915 driver and used DKMS to run it. Is it possible for me to do something similar with a later version of thunderbolt and thunderbolt-net from the Linux kernel repo?

t.lamprecht · Sep 7, 2023

Hmm, OK, yeah, I don't have much experience with Thunderbolt in such use cases, so you'd need to research and experiment with triggering a reset on your own.

scyto said:
For vGPU I compiled an i915 driver and used DKMS to run it. Is it possible for me to do something similar with a later version of thunderbolt and thunderbolt-net from the Linux kernel repo?

Maybe, but as those are quite intertwined with the (not stable) internal kernel ABI, it might be hard to do. That's just a guesstimate, though.

What I'd recommend now comes close to that anyway: if you do not use ZFS for boot, you could simply try the newest mainline build of the kernel to see if that fixes things.

You could use the Ubuntu kernel mainline PPA, where every kernel version gets auto-built. The newest would be v6.5.2, but that one seems to have failed to build for x86_64/amd64, so maybe try v6.5.1:

https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.5.1/amd64/

You need to download both the image and the modules package; then you can install them with, e.g.:

Code:

apt install ./linux-image-unsigned-6.5.1-060501-generic_6.5.1-060501.202309020842_amd64.deb ./linux-modules-6.5.1-060501-generic_6.5.1-060501.202309020842_amd64.deb

As you mentioned some dkms package you might also need to add the headers package to the mix
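The download and install steps above can be sketched as a small dry run that prints the commands instead of executing them. The URL and file names are the ones quoted in this post; the function name `download_plan` is made up, and the headers package is left out because its file name is not given here.

```shell
# Values copied from the post above.
BASE_URL="https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.5.1/amd64"
SUFFIX="6.5.1-060501-generic_6.5.1-060501.202309020842_amd64.deb"
PKGS="linux-image-unsigned-$SUFFIX linux-modules-$SUFFIX"

# Dry run: print one wget command per package, then the apt install line.
download_plan() {
    for pkg in $PKGS; do
        echo "wget $BASE_URL/$pkg"
    done
    echo "apt install$(for pkg in $PKGS; do printf ' ./%s' "$pkg"; done)"
}

# Example: download_plan
```

Reviewing the printed commands before running them is a cheap safeguard, since the build IDs in the file names change with every mainline build.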

scyto · Sep 7, 2023

@t.lamprecht

FYI The owner of the thunderbolt and thunderbolt-net code has reproduced both my bugs (this connection bug and the IPv6 one in my other thread). I will be sure to reply to both threads when I have an update. Also means no point me trying to use the 6.5 kernel at this time.

However, thanks for the instructions. I am sure I will need to test when there is a fix, so I will keep those in my back pocket.

t.lamprecht · Sep 7, 2023

scyto said:
FYI The owner of the thunderbolt and thunderbolt-net code has reproduced both my bugs (this connection bug and the IPv6 one in my other thread). I will be sure to reply to both threads when I have an update. Also means no point me trying to use the 6.5 kernel at this time.

Great to hear, out of interest, did you ask them on some mailing list or so, i.e., something where others could follow the discussion too?

Anyhow, thanks for relaying this to upstream and improving FOSS that way!

scyto · Sep 7, 2023

t.lamprecht said:
Great to hear, out of interest, did you ask them on some mailing list or so, i.e., something where others could follow the discussion too?

Being utterly new to linux kernel and not a developer I took a blind leap at emailing them (after the debian mailing list for the IPv6 bug was a bust)
I mentioned this second connection bug to them and I think that one got them interested.

I just received two patches; do you have something you can point me to on how to apply them to my PVE kernel?

I think this is the basics, but i am way out of my depth, lol.

scyto · Sep 7, 2023

OK, I will use this post to update how far I got, and anyone else can jump in and tell me how I am being an idiot (so long as you tell me the right step too, lol).

i started with this on my WSL2 dev machine (I hope i can compile in WSL?)

Code:

mkdir -p ~/src/linux
cd ~/src/linux
sudo apt-get install libncurses5-dev gcc make git exuberant-ctags bc libssl-dev build-essential wget bison flex libncurses-dev libelf-dev rsync zstd debhelper
git clone git://kernel.ubuntu.com/kernel-ppa/stable-queue-branches.git
cd ~/src/linux/cod/mainline/v6.5.1

Copy the config file from the running Proxmox node (/boot/config-6.2.16-10-pve) to my build location (the kernel build expects it to be named .config).
Copy the patch files to the same build location.

Code:

git config --global user.email "you@example.com"
git config --global user.name "Your Name"
git am <patch1>
git am <patch2>
echo "-tbfixes" > localversion

When I run make oldconfig it asks lots of questions I don't know how to answer, so I switched to make olddefconfig, which accepts the default for every new option.

Code:

make clean
make olddefconfig
make -j "$(getconf _NPROCESSORS_ONLN)"
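Before starting a long build like the one above, it can help to check that the basic tools are on the PATH. A minimal sketch, with a made-up helper name `missing_tools`; the tool list in the example is taken from the apt-get line earlier in this post and is not claimed to be complete.

```shell
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
    for t in "$@"; do
        command -v "$t" >/dev/null 2>&1 || echo "$t"
    done
}

# Example:
#   missing_tools gcc make git bc bison flex rsync zstd wget
```

An empty result only means the commands exist; library packages such as libssl-dev or libelf-dev are not detected this way.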

OK, completed that step. It took a bunch of extra apt-gets on WSL to get through the compile.
Now to make the deb files:

Code:

make deb-pkg

well this seems like a result:

linux-headers-6.5.2-tbfixes+_6.5.2-00015-ga2c22042b570-5_amd64.deb
linux-image-6.5.2-tbfixes+_6.5.2-00015-ga2c22042b570-5_amd64.deb
linux-image-6.5.2-tbfixes+-dbg_6.5.2-00015-ga2c22042b570-5_amd64.deb
linux-libc-dev_6.5.2-00015-ga2c22042b570-5_amd64.deb

Not sure I am brave enough to install on one of the nodes, until someone tells me I haven't been super stupid...

Also, if I install these:

will it do all the initramfs stuff for me?
how do I roll back afterwards to the production kernel and modules?
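The thread does not answer the rollback question directly. One possible approach, sketched here as a dry run, is to pin the stock PVE kernel and then remove the custom package. This assumes a Proxmox VE release whose `proxmox-boot-tool` has the `kernel pin` subcommand; the function name `rollback_plan` is made up, and the version and package names in the example come from earlier in this post.

```shell
# Dry run: print the steps for booting back into kernel version $1 and
# removing the custom kernel package $2. Nothing is executed.
rollback_plan() {
    echo "proxmox-boot-tool kernel list"
    echo "proxmox-boot-tool kernel pin $1"
    echo "reboot"
    echo "dpkg --remove $2"
    echo "proxmox-boot-tool kernel unpin"
}

# Example:
#   rollback_plan 6.2.16-10-pve 'linux-image-6.5.2-tbfixes+'
```

`dpkg --remove` is used rather than `apt remove` in this sketch because the package name ends in `+`, which apt can interpret as an install/remove modifier.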

scyto · Sep 8, 2023

Ok i didn't wait, i installed them.

This feels really weird running my own kernel - kinda cool.

1. TCP-IPv6 is now working perfectly - yay!
2. the connection issues seem to be gone - need to do some more testing to prove this (reboots, power offs, hard power pulls etc)

but so far - looking good.

@t.lamprecht should i be worried about running with a 6.5.2 non-proxmox kernel?
Should i ask y'all nicely when the code is live in linux kernel to backport to yours?

t.lamprecht · Sep 8, 2023

scyto said:
@t.lamprecht should i be worried about running with a 6.5.2 non-proxmox kernel?

Well, it's not officially supported for e.g. enterprise support, so there's that.
W.r.t. actual practical implication for running VMs and all that on it there are the following general issues:

ZFS support isn't included by default on those kernels IIRC, so you cannot use that feature without some DKMS module or the like.
You're missing some targeted bug fixes we add to our kernel, which are often rather specific to running and live-migrating virtual machines. If you use the same hardware (identical CPU and firmware) for all of your nodes, you should already be covered for half of them. The others are for PCIe pass-through, some security fixes that are highly likely to be included in the v6.5.2 kernel already, and some fixes for more specific stuff.

So, you could run into an issue that wouldn't be there in our kernel, but it could just as well be the other way around. It's also not as if you should expect corruption or the like just from running a mainline kernel for a while. So, if this isn't some production system you depend on to run rock stable with financial implication otherwise, I'd think that it's relatively safe to say you'll be fine.

scyto said:
Should i ask y'all nicely when the code is live in linux kernel to backport to yours?

Sure, if you can give me a link to the patch, e.g., on some kernel mailing list or on the git repo from the thunderbolt maintainers or Torvalds himself, I can see if I can cherry-pick it for our 6.2 based kernel so that it's included with the next bump.
As this is a bit of a niche use case, and you already provided positive feedback that it works, I don't have many concerns about spearheading the addition of this to our kernel.

scyto · Sep 8, 2023

t.lamprecht said:
git repo from the thunderbolt maintainers

I will flag when the commits are posted to https://github.com/torvalds/linux/tree/master/drivers/thunderbolt and https://github.com/torvalds/linux/tree/master/drivers/net/thunderbolt - there is no mailing list for this issue. I got the patches 1:1.

for anyone reading this who wants the patches before anything official, just DM me and I can share

scyto · Sep 8, 2023

t.lamprecht said:
So, if this isn't some production system you depend on to run rock stable with financial implication otherwise, I'd think that it's relatively safe to say you'll be fine.

Thanks, this is a homelab, nothing truly critical. I don’t plan on using ZFS. I am using Ceph, and 26 Gbps Thunderbolt networking in a 3-node mesh is awesome for Ceph…

scyto · Sep 13, 2023

@t.lamprecht as promised, here are the public threads on the patch if folks want to track when they hit the kernel / revisions to the patches.

IPv6 issue https://lore.kernel.org/netdev/20230911100956.GZ1599918@black.fi.intel.com/T/#t
inter-domain connection issue: https://lore.kernel.org/all/20230911100445.3612655-6-mika.westerberg@linux.intel.com/#t

t.lamprecht · Sep 13, 2023

thanks for the pointer to the patches.

scyto said:
IPv6 issue https://lore.kernel.org/netdev/20230911100956.GZ1599918@black.fi.intel.com/T/#t

There's a v2 for that: https://lore.kernel.org/all/20230913052647.407420-1-mika.westerberg@linux.intel.com/ It seems OK as is.

scyto said:
inter-domain connection issue: https://lore.kernel.org/all/20230911100445.3612655-6-mika.westerberg@linux.intel.com/#t

Did you just need that single patch, or (parts of) the rest of the series too? I'm asking because this is part of a series of five patches. They all seem rather unrelated to each other, but I would like to avoid missing a preparatory patch for this fix.

scyto · Sep 13, 2023

t.lamprecht said:
There's a v2 for that: https://lore.kernel.org/all/20230913052647.407420-1-mika.westerberg@linux.intel.com/ It seems OK as is.

I compiled it last night, will be testing today to make sure it works.

t.lamprecht said:
Did you just need that single patch

I believe I only need [PATCH 5/5] thunderbolt: Restart XDomain discovery handshake after failure

scyto · Sep 13, 2023

t.lamprecht said:
thanks for the pointer to the patches.

No, thank you for considering them! (It also felt like you couldn't quite believe I got a private fix for this initially, lol. As my mum said when I was a kid: "if you never ask, you will never get", lol.)

scyto · Sep 13, 2023

scyto said:
will be testing today to make sure it works.

No surprise, the new v2 works perfectly (my test is SSH between nodes on the IPv6 Thunderbolt addresses).
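A test like that can be repeated after each reboot or cable pull with a small loop. This is a sketch under assumptions: the function name `check_peers` is made up, the addresses in the example are hypothetical placeholders, and key-based root SSH between the nodes is assumed.

```shell
# Run an SSH no-op over IPv6 against each address given as an argument.
# Prints "ok <addr>" or "FAIL <addr>" and returns non-zero if any peer failed.
check_peers() {
    rc=0
    for addr in "$@"; do
        if ssh -6 -o BatchMode=yes -o ConnectTimeout=5 "root@$addr" true; then
            echo "ok $addr"
        else
            echo "FAIL $addr"
            rc=1
        fi
    done
    return "$rc"
}

# Example with hypothetical Thunderbolt IPv6 addresses:
#   check_peers fc00::82 fc00::83
```

`BatchMode=yes` stops ssh from prompting for a password, so a broken link shows up as a failure instead of a hang.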

scyto · Sep 15, 2023

The TCPv6 patch is now applied to netdev/net.git (main) for those that care / are interested.
