Kernel 6.8.12 breaks USB controller passthrough in a weird way

leesteken

I use PCI passthrough of a USB controller [1b21:3241] to a VM (together with a GPU), including early binding to vfio-pci and a softdep to make sure vfio-pci is loaded before xhci_pci. This has worked for a long time (since before PVE 8.0; currently on an up-to-date 8.2 no-subscription) up to and including kernel 6.8.8-4-pve.
Today's new kernel, 6.8.12-1-pve, breaks it in a weird way: the keyboard does not work while the mouse is connected, and the mouse does not work while the keyboard is connected.
Code:
05:00.0 USB controller [0c03]: ASMedia Technology Inc. ASM3241 USB 3.2 Gen 2 Host Controller [1b21:3241]
    Subsystem: Gigabyte Technology Co., Ltd ASM3241 USB 3.2 Gen 2 Host Controller [1458:5007]
    Kernel driver in use: vfio-pci
    Kernel modules: xhci_pci

options vfio_pci ids=1b21:3241 disable_vga=1
softdep xhci_pci pre: vfio-pci
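
(For completeness: these settings only take effect at boot after refreshing the initramfs; the commands below are just the standard Debian/Proxmox steps, with 05:00.0 being my controller's address.)
Code:
# apply the modprobe.d changes so vfio-pci claims the device at boot
update-initramfs -u -k all
# after a reboot, verify which driver owns the controller
lspci -nnks 05:00.0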

I PCI passthrough another USB controller [1022:149c] to another VM (with another GPU), and that still works fine. I don't use early binding to vfio-pci for that one, and the device resets properly. I use the same keyboard and mouse via a USB switch (a KVM without the V, so to speak).
Code:
10:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:149c]
    Subsystem: Gigabyte Technology Co., Ltd Matisse USB 3.0 Host Controller [1458:5007]
    Kernel driver in use: vfio-pci
    Kernel modules: xhci_pci

It's unexpected that the passthrough is influenced by the Proxmox kernel version, since I took all the steps to prevent the Proxmox kernel from touching the device before the VM starts. Otherwise, I would not be surprised that passthrough sometimes breaks and new work-arounds are sometimes needed. I can work around it by pinning kernel 6.8.8-4-pve. And I can do some more testing, like doing the opposite of preventing Proxmox from touching the device, for example.

Does anyone have any idea why this might be happening? Did Linux or Ubuntu change something so that early binding and the softdep on xhci_pci no longer work in Proxmox?
 
I tested the USB controller (without passthrough) on the host and it works. I also tested without early binding, and that also (unsurprisingly) fails on 6.8.12-1-pve but works on 6.8.8-4-pve (as the device resets properly).
Maybe it's an issue introduced by vfio-pci, since early binding to vfio-pci (which should be device-agnostic) no longer works and nothing else touches the device.

Comparing journalctl -b 0 output from both kernel versions only shows a difference in memory layout (which I would not expect, but maybe this is common?):
Code:
e820: update [mem 0xb47c9018-0xb47d7e57] usable ==> usable
e820: update [mem 0xb47c9018-0xb47d7e57] usable ==> usable
e820: update [mem 0xb0e3f018-0xb0e4e057] usable ==> usable
e820: update [mem 0xb0e3f018-0xb0e4e057] usable ==> usable
e820: update [mem 0xb0e20018-0xb0e3e257] usable ==> usable
e820: update [mem 0xb0e20018-0xb0e3e257] usable ==> usable
e820: update [mem 0xb0e0f018-0xb0e1f057] usable ==> usable
e820: update [mem 0xb0e0f018-0xb0e1f057] usable ==> usable

software IO TLB: mapped [mem 0x00000000ace0f000-0x00000000b0e0f000] (64MB)
Code:
e820: update [mem 0xb47c9018-0xb47d7e57] usable ==> usable
e820: update [mem 0xb47c9018-0xb47d7e57] usable ==> usable
e820: update [mem 0xb0d25018-0xb0d34057] usable ==> usable
e820: update [mem 0xb0d25018-0xb0d34057] usable ==> usable
e820: update [mem 0xb0d06018-0xb0d24257] usable ==> usable
e820: update [mem 0xb0d06018-0xb0d24257] usable ==> usable
e820: update [mem 0xb0cf5018-0xb0d05057] usable ==> usable
e820: update [mem 0xb0cf5018-0xb0d05057] usable ==> usable

software IO TLB: mapped [mem 0x00000000accf5000-0x00000000b0cf5000] (64MB)
This small change in memory addresses does appear in several parts of the logs.
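
(For reference, the comparison was roughly along these lines, assuming the previous boot was on the old kernel; the file names are just what I used.)
Code:
# kernel messages from the previous boot and from the current boot
journalctl -k -b -1 > boot-old.txt
journalctl -k -b 0 > boot-new.txt
# most of the differences are only the shifted memory addresses shown above
diff -u boot-old.txt boot-new.txt | less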

The new kernel identifies the USB controller as:
Code:
xhci_hcd 0000:05:00.0: xHCI Host Controller
xhci_hcd 0000:05:00.0: new USB bus registered, assigned bus number 1
xhci_hcd 0000:05:00.0: hcc params 0x0200ef81 hci version 0x110 quirks 0x0000000000000010
xhci_hcd 0000:05:00.0: xHCI Host Controller
xhci_hcd 0000:05:00.0: new USB bus registered, assigned bus number 2
xhci_hcd 0000:05:00.0: Host supports USB 3.2 Enhanced SuperSpeed

The logs inside the VM are less clear (because it's a Linux Mint with a graphical desktop) and also only appear to show a memory offset difference. Both keyboard and mouse appear to be detected correctly inside the VM; either the key presses are not coming through, or the mouse does not move or click. Maybe it's an interrupt or MSI issue with VFIO passthrough?

I fear that my issue is way too specific and the hardware not common enough for anyone to comment on. I guess the next kernel version will point out whether this was a generic bug or whether this is the new normal and this USB controller can unfortunately no longer be used for passthrough.
 
It's still broken in this weird way with kernel version 6.8.12-2-pve: the keyboard loses most key presses and releases when the mouse is also connected.
How can the host kernel version be relevant for a PCIe USB controller that is early-bound to vfio-pci (so the host does not touch it) and passed through to a VM (that did not change)?
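
(To double-check that the host really leaves it alone, this is roughly what I look at after booting the new kernel: which driver owns the controller and whether anything on the host touched that slot.)
Code:
# confirm vfio-pci, not xhci_hcd, is bound to the controller
lspci -nnks 05:00.0
# search this boot's kernel log for any host-side activity on that slot
journalctl -k -b 0 | grep -i -e '0000:05:00' -e vfio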
 
I updated yesterday to 6.8.12-2, and it somehow breaks the USB on my LarkboxX: after some time the USB devices cannot be found anymore, the keyboard stops working even when plugged in, and this of course also breaks the passthrough. This is the "normal" situation after a reboot:
Code:
root@pve:~# lsusb
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 002: ID 1b1f:c020 eQ-3 Entwicklung GmbH HmIP-RFUSB
Bus 003 Device 003: ID 8087:0026 Intel Corp. AX201 Bluetooth
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
root@pve:~#

When the issue appears, just one USB 2.0 and one USB 3.0 root hub remain in the list (Bus 001 & Bus 002, if I remember correctly); the rest, including the plugged-in devices, vanishes.
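
(Something like this should catch the moment it happens in the kernel log:)
Code:
# watch kernel messages live and highlight xhci/USB events
journalctl -kf | grep -iE 'xhci|usb'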

Any news on this USB issue?

Looks like the issue appeared just after a scheduled backup was started @1am last night:

Code:
Oct 13 00:59:01 pve pmxcfs[920]: [dcdb] notice: data verification successful
Oct 13 01:00:05 pve pvescheduler[195920]: <root@pam> starting task UPID:pve:0002FD51>
Oct 13 01:00:05 pve pvescheduler[195921]: INFO: starting new backup job: vzdump --mo>
Oct 13 01:00:05 pve pvescheduler[195921]: INFO: Starting Backup of VM 100 (qemu)
Oct 13 01:00:50 pve pvescheduler[195921]: INFO: Finished Backup of VM 100 (00:00:45)
Oct 13 01:00:50 pve pvescheduler[195921]: INFO: Starting Backup of VM 101 (lxc)
Oct 13 01:00:50 pve dmeventd[401]: No longer monitoring thin pool pve-data-tpool.
Oct 13 01:00:50 pve dmeventd[401]: Monitoring thin pool pve-data-tpool.
Oct 13 01:00:50 pve kernel: EXT4-fs (dm-19): mounted filesystem d6154dac-2288-4263-b>
Oct 13 01:01:05 pve kernel: EXT4-fs (dm-19): unmounting filesystem d6154dac-2288-426>
Oct 13 01:01:05 pve pvescheduler[195921]: INFO: Finished Backup of VM 101 (00:00:15)
Oct 13 01:01:05 pve pvescheduler[195921]: INFO: Starting Backup of VM 102 (qemu)
=====>
Oct 13 01:01:05 pve kernel: xhci_hcd 0000:00:14.0: remove, state 4
Oct 13 01:01:05 pve kernel: usb usb4: USB disconnect, device number 1
Oct 13 01:01:05 pve kernel: xhci_hcd 0000:00:14.0: USB bus 4 deregistered
Oct 13 01:01:05 pve kernel: xhci_hcd 0000:00:14.0: remove, state 1
Oct 13 01:01:05 pve kernel: usb usb3: USB disconnect, device number 1
Oct 13 01:01:05 pve kernel: usb 3-2: USB disconnect, device number 2
Oct 13 01:01:05 pve kernel: usb 3-10: USB disconnect, device number 3
Oct 13 01:01:05 pve kernel: xhci_hcd 0000:00:14.0: USB bus 3 deregistered
<=====
Oct 13 01:01:05 pve systemd[1]: Starting systemd-rfkill.service - Load/Save RF Kill Switch Status...
Oct 13 01:01:05 pve QEMU[1692]: kvm: libusb_release_interface: -4 [NO_DEVICE]
Oct 13 01:01:05 pve systemd[1]: Stopped target bluetooth.target - Bluetooth Support.
Oct 13 01:01:05 pve systemd[1]: Started systemd-rfkill.service - Load/Save RF Kill Switch Status.
Oct 13 01:01:06 pve systemd[1]: Started 102.scope.
Oct 13 01:01:06 pve kernel: tap102i1: entered promiscuous mode
Oct 13 01:01:06 pve kernel: vmbr1: port 4(tap102i1) entered blocking state
Oct 13 01:01:06 pve kernel: vmbr1: port 4(tap102i1) entered disabled state
Oct 13 01:01:06 pve kernel: tap102i1: entered allmulticast mode
Oct 13 01:01:06 pve kernel: vmbr1: port 4(tap102i1) entered blocking state
Oct 13 01:01:06 pve kernel: vmbr1: port 4(tap102i1) entered forwarding state
Oct 13 01:01:07 pve kernel: tap102i2: entered promiscuous mode
Oct 13 01:01:07 pve kernel: vmbr0: port 8(tap102i2) entered blocking state
Oct 13 01:01:07 pve kernel: vmbr0: port 8(tap102i2) entered disabled state
Oct 13 01:01:07 pve kernel: tap102i2: entered allmulticast mode
Oct 13 01:01:07 pve kernel: vmbr0: port 8(tap102i2) entered blocking state
Oct 13 01:01:07 pve kernel: vmbr0: port 8(tap102i2) entered forwarding state
Oct 13 01:01:07 pve kernel: vfio-pci 0000:00:14.2: enabling device (0000 -> 0002)
Oct 13 01:01:09 pve kernel: tap102i1: left allmulticast mode
Oct 13 01:01:09 pve kernel: vmbr1: port 4(tap102i1) entered disabled state
Oct 13 01:01:09 pve kernel: tap102i2: left allmulticast mode
Oct 13 01:01:09 pve kernel: vmbr0: port 8(tap102i2) entered disabled state
Oct 13 01:01:09 pve qmeventd[673]: read: Connection reset by peer
Oct 13 01:01:09 pve pvescheduler[195921]: INFO: Finished Backup of VM 102 (00:00:04)
Oct 13 01:01:09 pve pvescheduler[195921]: INFO: Starting Backup of VM 104 (qemu)
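
(For what it's worth, this is a quick way to see which VM config references 0000:00:14.x and what shares its IOMMU group; the paths are the standard Proxmox locations.)
Code:
# list all PCI passthrough entries across the VM configs
grep -H hostpci /etc/pve/qemu-server/*.conf
# show what shares an IOMMU group with the on-board xhci controller that got removed
ls /sys/bus/pci/devices/0000:00:14.0/iommu_group/devices/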
 
Thanks for pointing out the work-around kernel.
I upgraded from PVE 7 to 8 and found that passing through a USB controller crashed the host.
Luckily, the 5.15 kernel still works fine and is rock solid.
I'll have a go with 6.8.8-4-pve.

After testing, the macOS VM still goes into a boot loop on kernel 6.8.8-4-pve. Hmm... looks like I need to stay on 5.15 for a while.
 
Tested with kernels 6.11 and 6.8.12-3, but the USB 3.0 controller still cannot handle the mouse (USB 2.0) and the keyboard (USB 1.1) at the same time.
Does anyone have any troubleshooting tips or possible work-arounds? This PCIe passthrough worked fine up to Proxmox kernel 6.8.8. I made sure that Proxmox only loads vfio-pci for the device. What could the change in the VFIO driver between 6.8.8-4-pve and 6.8.12-1-pve be that breaks this USB controller? What could I try?

EDIT: Is it just me, or could someone else with an ASM3241 (on an X570S AERO G) test this? Should I just disable the chip and buy another controller, since this appears to be still broken? I also tested without early binding, but that makes no difference on any kernel version.
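
(For reference, the usual IOMMU-group sanity check, just to rule that out; a generic loop like this lists every PCI device with its group:)
Code:
# print every PCI device together with its IOMMU group number
for d in /sys/kernel/iommu_groups/*/devices/*; do
    g=${d#/sys/kernel/iommu_groups/}; g=${g%%/*}
    printf 'group %s\t%s\n' "$g" "$(lspci -nns "${d##*/}")"
done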
 
I just upgraded to 6.8.12-3-pve and now I can't pass through any USB controllers. GPUs pass through fine, but USB gives me the error "TASK ERROR: Cannot bind 0000:10:00.3 to vfio". No errors in the log.

This is happening with both an ASMedia PCI card and the onboard [AMD] Matisse USB 3.0 Host Controller. I am using an ASUS WS PRO X570 motherboard.

I will try the previous kernel in a bit.
Cheers

FIXED: Needed to add softdep xhci_hcd pre: vfio-pci to /etc/modprobe.d/vfio.conf
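
For anyone else hitting this, the resulting /etc/modprobe.d/vfio.conf looks roughly like the sketch below (the ids= line is my existing early-bind entry shown with example IDs; take yours from lspci -nn), followed by update-initramfs -u -k all and a reboot:
Code:
# /etc/modprobe.d/vfio.conf
# existing early-bind line (example IDs, use your controller's vendor:device from lspci -nn)
options vfio-pci ids=1022:149c
# the fix: make sure vfio-pci loads before the xhci driver grabs the controller
softdep xhci_hcd pre: vfio-pci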
 
Tested with kernels 6.11 and 6.8.12-3, but the USB 3.0 controller still cannot handle the mouse (USB 2.0) and the keyboard (USB 1.1) at the same time.
Does anyone have any troubleshooting tips or possible work-arounds? This PCIe passthrough worked fine up to Proxmox kernel 6.8.8. I made sure that Proxmox only loads vfio-pci for the device. What could the change in the VFIO driver between 6.8.8-4-pve and 6.8.12-1-pve be that breaks this USB controller? What could I try?

EDIT: Is it just me, or could someone else with an ASM3241 (on an X570S AERO G) test this? Should I just disable the chip and buy another controller, since this appears to be still broken? I also tested without early binding, but that makes no difference on any kernel version.
could you maybe post the full dmesg for both the working kernel and a broken one (best the earliest that broke it)? maybe we can see something ..
 
could you maybe post the full dmesg for both the working kernel and a broken one (best the earliest that broke it)? maybe we can see something ..
Thank you for taking an interest. I reproduced the issue with just VM 112, which starts automatically. Then I switched my 4-port USB switch from the host to the USB controller (with only one USB-C port) that is passed through to the VM. There the mouse works, but the keyboard seems very laggy and not all key presses/releases are registered in the graphical login screen (practically unusable since 6.8.12-1). Then I switched back to the host and saved the dmesg (for both kernel versions, as requested).
 

Attachments

  • dmesg-6.8.8-4.txt.zip (27.2 KB)
  • dmesg-6.8.12-1.txt.zip (27.2 KB)
ok, so after sorting out the diffs between the two boots, there are the following relevant differences (sans address/etc changes):

on 6.8.12, the following lines are missing:
Code:
PCI: not using ECAM ([mem 0xf0000000-0xf7ffffff] not reserved)                      
PCI: Using configuration type 1 for extended access                               
PCI: ECAM [mem 0xf0000000-0xf7ffffff] (base 0xf0000000) for domain 0000 [bus 00-7f] 
PCI: ECAM [mem 0xf0000000-0xf7ffffff] reserved as ACPI motherboard resource

but gained the new lines:

Code:
ucsi_ccg 3-0008: failed to reset PPM!             
ucsi_ccg 3-0008: error -ETIMEDOUT: PPM init failed

according to https://github.com/torvalds/linux/blob/master/drivers/usb/typec/ucsi/ucsi_ccg.c
this is a driver for a Cypress CCGx Type-C controller

i'll look a bit further to see if there were any significant changes in the kernel to that driver..
 
ok, so after sorting out the diffs between the two boots, there are the following relevant differences (sans address/etc changes):

on 6.8.12, the following lines are missing:
Code:
PCI: not using ECAM ([mem 0xf0000000-0xf7ffffff] not reserved)                     
PCI: Using configuration type 1 for extended access                              
PCI: ECAM [mem 0xf0000000-0xf7ffffff] (base 0xf0000000) for domain 0000 [bus 00-7f]
PCI: ECAM [mem 0xf0000000-0xf7ffffff] reserved as ACPI motherboard resource

but gained the new lines:

Code:
ucsi_ccg 3-0008: failed to reset PPM!            
ucsi_ccg 3-0008: error -ETIMEDOUT: PPM init failed

according to https://github.com/torvalds/linux/blob/master/drivers/usb/typec/ucsi/ucsi_ccg.c
this is a driver for a Cypress CCGx Type-C controller

i'll look a bit further to see if there were any significant changes in the kernel to that driver..
I did not spot that when I tried to compare the journalctl logs myself two months ago. The reproduction was done with early binding to vfio-pci to prevent the kernel from touching the device, so I'm surprised that the kernel tries to do something with it. Are you sure it's about the ASM3241 and not another USB-C controller like the one on the AMD 6950XT?

I also tested it without binding to vfio-pci, but that gives the same behavior: it is only when doing PCIe passthrough that it starts misbehaving and losing events or interrupts or something. The same USB controller works fine on the host, with a TV tuner passed to a container, using the standard kernel driver (on all kernel versions).

Thank you again for looking into this, as I know that passthrough is never guaranteed and I'm just a home user (who can work around this if it takes you too much time to investigate).
 
Are you sure it's about the ASM3241 and not another USB-C controller like the one on the AMD 6950XT?
no i'm not and it may be entirely unrelated ;) it's just what i noticed is different.

it could also of course be some kernel change that does not produce any different dmesg output and this is just a red herring.

is that guest a linux or windows vm? is there any log output there that might give a hint? (e.g. also dmesg on linux, or event viewer in windows)

anyway, i'll look into the kernel changes for 'ucsi_ccg' and the ASM3241 (if i can find the driver...)
 
ok, so after looking through the changes, i could not see a commit that would stand out which could cause this (no kernel expert here though)

also i don't think we currently have any matching spare hardware here with which we could try to reproduce that

if you're motivated though, you could try to bisect the kernel

the changes from 6.8.8-4 to 6.8.12-1 correspond to ubuntu kernels Ubuntu-6.8.0-38.38 to Ubuntu-6.8.0-43.43
so you'd have to git bisect between those two tags in the submodule ubuntu-kernel there https://git.proxmox.com/?p=pve-kern...574e2d8539267232be;hb=refs/heads/bookworm-6.8
(on the bookworm-6.8 branch)

there are ~2200 commits between those tags, which would result in a maximum of ~12 bisect steps
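
roughly, the bisect itself would then look like this (submodule path from memory, might differ slightly in the repo):
Code:
# inside a checkout of pve-kernel on the bookworm-6.8 branch
git submodule update --init ubuntu-kernel   # submodule path may differ slightly
cd ubuntu-kernel
git bisect start
git bisect bad Ubuntu-6.8.0-43.43    # first known-bad tag
git bisect good Ubuntu-6.8.0-38.38   # last known-good tag
# build + boot the kernel git checks out, test the passthrough, then mark it:
#   git bisect good     (or)     git bisect bad
# repeat until git reports the first bad commit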
 
if you're motivated though, you could try to bisect the kernel

the changes from 6.8.8-4 to 6.8.12-1 correspond to ubuntu kernels Ubuntu-6.8.0-38.38 to Ubuntu-6.8.0-43.43
so you'd have to git bisect between those two tags in the submodule ubuntu-kernel there https://git.proxmox.com/?p=pve-kern...574e2d8539267232be;hb=refs/heads/bookworm-6.8
(on the bookworm-6.8 branch)
I keep running into dh: error: unable to load addon sphinxdoc: Can't locate Debian/Debhelper/Sequence/sphinxdoc.pm in @INC even after apt install devscripts dh-python on make build-dir-fresh. Could there be something missing from the README? I want to try out git bisect, but I currently cannot get the bookworm-6.8 branch to build.

EDIT: apt install sphinx-common is also needed and it got to dpkg-deb: building package 'proxmox-kernel-6.8-build-deps' in '../proxmox-kernel-6.8-build-deps_6.8.12-3_all.deb'.
EDIT2: apt install asciidoc-base bison dwarves flex libdw-dev libelf-dev libiberty-dev libnuma-dev libslang2-dev libssl-dev python3-dev xmlto zlib1g-dev is also needed.

I also don't really know how to choose the correct commits for pve-kernel after checkout of the relevant ubuntu-kernel tag in the submodule. I'll keep searching this forum for clues on building the Proxmox kernel, but this might take a while with my limited experience.

EDIT3: After a checkout of https://github.com/proxmox/pve-kernel/commit/a37a8d7a495c645f521b2fc5cb0e32493f8ae08f and removing the 6.8.12-1 kernel and header packages, I get a successful build of kernel 6.8.12-1 with Ubuntu-6.8.0-43.43. I fear I will get abicheck problems when bisecting this...
 
thanks for trying to tackle this.

I also don't really know how to choose the correct commits for pve-kernel after checkout of the relevant ubuntu-kernel tag in the submodule. I'll keep searching this forum for clues on building the Proxmox kernel, but this might take a while with my limited experience.
AFAIU you shouldn't need to change commits in the pve-kernel git itself, only in the submodule

EDIT3: After a checkout of https://github.com/proxmox/pve-kernel/commit/a37a8d7a495c645f521b2fc5cb0e32493f8ae08f and removing the 6.8.12-1 kernel and header packages, I get a successful build of kernel 6.8.12-1 with Ubuntu-6.8.0-43.43. I fear I will get abicheck problems when bisecting this...
if you don't need zfs, then you could also simply build the submodule itself

you'd need to copy a kernel config file to the submodule as '.config' , and then do a 'make olddefconfig' and 'make -j <n> bindeb-pkg' in the submodule
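
so roughly (the config file name is just an example, any recent pve kernel config from /boot works as a starting point):
Code:
# inside the ubuntu-kernel submodule checkout
cp /boot/config-6.8.12-1-pve .config   # seed with an existing kernel config
make olddefconfig                      # accept defaults for any new options
make -j"$(nproc)" bindeb-pkg           # builds .deb packages one directory up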
 
AFAIU you shouldn't need to change commits in the pve-kernel git itself, only in the submodule
I'm not so sure but I also don't know anything about compiling kernels. Last time I did that, it was on Gentoo which does that for me.
if you don't need zfs, then you could also simply build the submodule itself
I do need ZFS (and I would like to keep using this system while I bisect this, otherwise I could do a separate install of Proxmox).

I did a first bisect (start, bad and good) on the kernel submodule and make build-dir-fresh && make deb fails after some time with make[2]: *** [debian/rules:340: abicheck] Error 1.

EDIT: Doing a sudo mk-build-deps -ir proxmox-kernel-6.8.12/debian/control in between did not make a difference.
 
yes bisecting with the zfs bits intact is indeed much harder, I never did it myself yet... If you still want to do that, I'd have to go investigate a bit and come back with a short guide on how to do that, but not sure when I have time for that sadly
 
yes bisecting with the zfs bits intact is indeed much harder, I never did it myself yet... If you still want to do that, I'd have to go investigate a bit and come back with a short guide on how to do that, but not sure when I have time for that sadly
No need to rush this just for me. I worked around it, and I might try to set up a Proxmox install on a USB drive, as plain as possible with ext4, so I don't have to undo the work-around each time. But some documentation on how to handle bisecting Ubuntu changes between two Proxmox kernel versions might be helpful for others in the future as well.
 
