USB3 pass-through fails periodically with transfer event errors

nicoleise

New Member
Apr 25, 2023
Hi there,

I am struggling to pass through USB devices reliably, it seems. My server has three USB ports, each with a Google Coral USB TPU connected, and all are served by the same USB controller in the server. Everything is on-board; no expansion cards or external hubs.

The guest VM (q35, OVMF, on ZFS) was made for the sole purpose of running Frigate NVR through Docker Compose. So:

Bare metal => Debian 11 => Proxmox (and nothing else) => GuestVM => Debian 11 => Docker => Frigate
(Yes, overly complicated, but made this way due to conflicts between Docker and Proxmox when using ZFS)

Plenty of resources are available, and the error occurs regardless of whether I connect just one, two, or all three Coral USBs.


Problem
What I am experiencing depends on the manner of pass-through. When I pass through a USB device (by device ID), the guest VM sees the USB devices, but for some reason they do not function inside Docker (although they do appear there). Passing through a USB port works flawlessly, until it doesn't.

Obviously, the latter seems better, but i) Frigate reports inference speeds of around 30-50 ms, indicating an issue, as these should generally be around 10 ms at most, and ii) at random points in time, Frigate will notice that object detection processing seems stuck. This coincides with xhci messages in the guest VM and on the host.

Impressions
I've looked into it as best I can. This error apparently occurs with I/O-intensive tasks in general, such as copying files to sticks/drives, and it is relatively common, especially with Proxmox (I've seen reports of people migrating otherwise working VMs from other hypervisors and hitting this problem after switching to Proxmox).

I've found a helpful post suggesting the problem is related to the switch away from the nec-usb-xhci controller that took place around Proxmox version 7.2. So I'd like to try the nec-usb-xhci approach, but I assume I can't just do that, since Proxmox has decided otherwise? If I can, I don't understand how.

I do see a bunch of examples of apparently doing just that in the wiki page on USB devices in VMs, but I assume that page is simply outdated (?) and will not work. At least, I can't seem to make it work.
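The only way I can think of to test this myself would be via the VM's raw `args` option. Below is a sketch of what I mean, not a verified fix: the `id`, `bus`/`addr` values and host port are guesses based on my `qm showcmd` output, and I don't know whether Proxmox merges this cleanly with the `qemu-xhci` controller it already adds.

```shell
# Untested sketch: append a NEC xHCI controller plus one Coral via the VM's
# raw 'args' option. All device IDs/addresses here are assumptions for my
# setup; hostbus/hostport are taken from my existing usb-host lines.
qm set 220 --args "-device nec-usb-xhci,id=xhci2,bus=pci.1,addr=0x1c -device usb-host,bus=xhci2.0,hostbus=4,hostport=2"
```

If this is the wrong mechanism entirely, I'd appreciate a pointer to the right one.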

Other users have reported simply passing through the entire USB host controller instead, which fixed the issue for some (though others report the issue remains). I'd like to avoid this because I'm using all USB ports on the server: if I ever need to connect a keyboard, I'd first have to free up a port on the host. It seems easier to simply comment out one TPU in Frigate's config and remove it from the VM.
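For what it's worth, this is how I would locate which PCI device owns the USB bus the Corals sit on, in case controller pass-through becomes unavoidable (bus number 4 is taken from my host dmesg below; adjust as needed):

```shell
# Resolve which PCI controller hosts USB bus 4 (where the Corals appear on my host).
# The usb4 entry is a symlink into the owning PCI device's sysfs path.
readlink -f /sys/bus/usb/devices/usb4
# -> prints something like /sys/devices/pci0000:00/0000:00:14.0/usb4

# Cross-check the PCI address against the list of USB controllers:
lspci -nn | grep -i usb
```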

The most informative post I've found, relevant to my problem, is this: https://forum.proxmox.com/threads/p...urces-for-new-device-state.122792/post-536150


Logs:

Frigate:
Code:
2024-07-30 11:04:48.074441397  [INFO] Preparing Frigate...
2024-07-30 11:04:48.102673037  [INFO] Starting Frigate...
2024-07-30 11:04:49.970334735  [2024-07-30 11:04:49] frigate.app                    INFO    : Starting Frigate (0.13.2-6476f8a)
2024-07-30 11:04:50.085873354  [2024-07-30 11:04:50] peewee_migrate.logs            INFO    : Starting migrations
2024-07-30 11:04:50.092117530  [2024-07-30 11:04:50] peewee_migrate.logs            INFO    : There is nothing to migrate
2024-07-30 11:04:50.100341577  [2024-07-30 11:04:50] frigate.app                    INFO    : Recording process started: 741
2024-07-30 11:04:50.104342352  [2024-07-30 11:04:50] frigate.app                    INFO    : go2rtc process pid: 89
2024-07-30 11:04:50.152898043  [2024-07-30 11:04:50] detector.coral1                INFO    : Starting detection process: 751
2024-07-30 11:04:50.163012662  [2024-07-30 11:04:50] detector.coral2                INFO    : Starting detection process: 753
2024-07-30 11:04:50.174397356  [2024-07-30 11:04:50] frigate.app                    INFO    : Output process started: 758
2024-07-30 11:04:50.177984693  [2024-07-30 11:04:50] detector.coral3                INFO    : Starting detection process: 756
2024-07-30 11:04:50.259391736  [2024-07-30 11:04:50] frigate.app                    INFO    : Camera processor started for E04_Nordvest: 780
2024-07-30 11:04:50.259632235  [2024-07-30 11:04:50] frigate.app                    INFO    : Camera processor started for E05_Nordost: 782
2024-07-30 11:04:50.272552861  [2024-07-30 11:04:50] frigate.app                    INFO    : Camera processor started for I01_Vaerksted: 794
2024-07-30 11:04:50.285867123  [2024-07-30 11:04:50] frigate.app                    INFO    : Camera processor started for I07_Lager: 796
2024-07-30 11:04:50.298670494  [2024-07-30 11:04:50] frigate.app                    INFO    : Capture process started for E04_Nordvest: 799
2024-07-30 11:04:50.312785081  [2024-07-30 11:04:50] frigate.app                    INFO    : Capture process started for E05_Nordost: 803
2024-07-30 11:04:50.328942980  [2024-07-30 11:04:50] frigate.app                    INFO    : Capture process started for I01_Vaerksted: 807
2024-07-30 11:04:50.355022897  [2024-07-30 11:04:50] frigate.app                    INFO    : Capture process started for I07_Lager: 814
2024-07-30 11:04:53.652068704  [2024-07-30 11:04:50] frigate.detectors.plugins.edgetpu_tfl INFO    : Attempting to load TPU as usb:0
2024-07-30 11:04:53.662824376  [2024-07-30 11:04:53] frigate.detectors.plugins.edgetpu_tfl INFO    : TPU found
2024-07-30 11:04:54.506634005  [2024-07-30 11:04:50] frigate.detectors.plugins.edgetpu_tfl INFO    : Attempting to load TPU as usb:2
2024-07-30 11:04:54.516608316  [2024-07-30 11:04:54] frigate.detectors.plugins.edgetpu_tfl INFO    : TPU found
2024-07-30 11:04:54.658337538  [2024-07-30 11:04:50] frigate.detectors.plugins.edgetpu_tfl INFO    : Attempting to load TPU as usb:1
2024-07-30 11:04:54.668387055  [2024-07-30 11:04:54] frigate.detectors.plugins.edgetpu_tfl INFO    : TPU found
2024-07-30 12:05:50.820267790  [2024-07-30 12:05:50] frigate.watchdog               INFO    : Detection appears to be stuck. Restarting detection process...
2024-07-30 12:05:50.820623086  [2024-07-30 12:05:50] root                           INFO    : Waiting for detection process to exit gracefully...
2024-07-30 12:06:20.843610929  [2024-07-30 12:06:20] root                           INFO    : Detection process didnt exit. Force killing...
2024-07-30 12:06:20.861164642  [2024-07-30 12:06:20] root                           INFO    : Detection process has exited...
2024-07-30 12:06:21.038897190  [2024-07-30 12:06:21] detector.coral1                INFO    : Starting detection process: 30271
2024-07-30 12:06:21.044284639  [2024-07-30 12:06:21] frigate.detectors.plugins.edgetpu_tfl INFO    : Attempting to load TPU as usb:0
2024-07-30 12:06:23.982269564  [2024-07-30 12:06:23] frigate.detectors.plugins.edgetpu_tfl INFO    : TPU found
2024-07-30 12:22:20.930270998  [2024-07-30 12:22:20] frigate.watchdog               INFO    : Detection appears to be stuck. Restarting detection process...
2024-07-30 12:22:20.930404121  [2024-07-30 12:22:20] root                           INFO    : Waiting for detection process to exit gracefully...
2024-07-30 12:22:50.969076064  [2024-07-30 12:22:50] root                           INFO    : Detection process didnt exit. Force killing...
2024-07-30 12:22:50.992574914  [2024-07-30 12:22:50] root                           INFO    : Detection process has exited...
2024-07-30 12:22:51.170057692  [2024-07-30 12:22:51] detector.coral3                INFO    : Starting detection process: 38097
2024-07-30 12:22:54.143501636  [2024-07-30 12:22:51] frigate.detectors.plugins.edgetpu_tfl INFO    : Attempting to load TPU as usb:2
2024-07-30 12:22:54.154507469  [2024-07-30 12:22:54] frigate.detectors.plugins.edgetpu_tfl INFO    : TPU found

Note that Frigate starts up without remarks and works without issue for an hour before reporting that detection appears stuck. This can occur rarely, or as often as e.g. every 6 minutes. It looks like a watchdog of sorts, so I checked dmesg on the guest, and it seems to coincide with this error, which appears equally frequently:

Code:
xhci_hcd 0000:07:1b.0: Error Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
xhci_hcd 0000:07:1b.0: Error Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13

These will spam repeatedly, mainly comp_code 1, occasionally comp_code 13. They are also output directly to the shell, so they are very visible when using noVNC. After some (seemingly random) number of occurrences of the above, the log will also show this:

Code:
usb 6-3: reset SuperSpeed Gen 1 USB device number 7 using xhci_hcd
usb 6-3: LPM exit latency is zeroed, disabling LPM.

All of this then repeats after some interval, sometimes a second, sometimes an hour or two. It will indicate different USB devices, e.g. 6-3 on one occurrence, maybe 6-1 the next, and so on. The error is the same regardless of the number of connected Corals, and it seems to have the same (random) frequency.
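Given the "LPM exit latency is zeroed, disabling LPM" lines, one mitigation I'm considering is disabling USB power management on the host entirely. A sketch of what I have in mind; I have not verified that this helps with the Corals, so treat it as an assumption about power management being a factor:

```shell
# Sketch: disable USB autosuspend host-wide.
# Persistent variant via modprobe config (takes effect on next boot):
echo 'options usbcore autosuspend=-1' > /etc/modprobe.d/usb-autosuspend.conf

# Runtime variant for the current boot only:
echo -1 > /sys/module/usbcore/parameters/autosuspend
```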

I then checked dmesg on the host:

Code:
[72600.900070] usb 4-2: reset SuperSpeed USB device number 5 using xhci_hcd
[72600.925174] usb 4-2: LPM exit latency is zeroed, disabling LPM.
[72634.211973] usb 4-2: reset SuperSpeed USB device number 5 using xhci_hcd
[72634.233247] usb 4-2: LPM exit latency is zeroed, disabling LPM.
[72662.368201] usb 4-2: reset SuperSpeed USB device number 5 using xhci_hcd
[72662.393401] usb 4-2: LPM exit latency is zeroed, disabling LPM.
[72669.033794] perf: interrupt took too long (7802 > 7757), lowering kernel.perf_event_max_sample_rate to 25500
[72695.768495] usb 4-2: reset SuperSpeed USB device number 5 using xhci_hcd
[72695.789716] usb 4-2: LPM exit latency is zeroed, disabling LPM.
[72729.148755] usb 4-2: reset SuperSpeed USB device number 5 using xhci_hcd
[72729.170113] usb 4-2: LPM exit latency is zeroed, disabling LPM.
[72762.349002] usb 4-2: reset SuperSpeed USB device number 5 using xhci_hcd
[72762.374906] usb 4-2: LPM exit latency is zeroed, disabling LPM.

Configs:

Code:
root@server:~# pveversion -v

proxmox-ve: 7.4-1 (running kernel: 5.15.102-1-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-5.15: 7.3-3
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-3
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-1
libpve-rs-perl: 0.7.5
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.3.3-1
proxmox-backup-file-restore: 2.3.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.6.3
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20221111-1
pve-firewall: 4.3-1
pve-firmware: 3.6-4
pve-ha-manager: 3.6.0
pve-i18n: 2.11-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1

Code:
root@server:~# qm showcmd 220 --pretty

/usr/bin/kvm \
  -id 220 \
  -name 'frigate,debug-threads=on' \
  -no-shutdown \
  -chardev 'socket,id=qmp,path=/var/run/qemu-server/220.qmp,server=on,wait=off' \
  -mon 'chardev=qmp,mode=control' \
  -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
  -mon 'chardev=qmp-event,mode=control' \
  -pidfile /var/run/qemu-server/220.pid \
  -daemonize \
  -smbios 'type=1,uuid=07fcb28e-5c32-464b-bf1d-3cd1811f0cfb' \
  -drive 'if=pflash,unit=0,format=raw,readonly=on,file=/usr/share/pve-edk2-firmware//OVMF_CODE_4M.secboot.fd' \
  -drive 'if=pflash,unit=1,id=drive-efidisk0,format=raw,file=/dev/zvol/rpool/data/vm-220-disk-0,size=540672' \
  -smp '16,sockets=2,cores=8,maxcpus=16' \
  -nodefaults \
  -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
  -vnc 'unix:/var/run/qemu-server/220.vnc,password=on' \
  -cpu 'Broadwell,enforce,+kvm_pv_eoi,+kvm_pv_unhalt,vendor=GenuineIntel' \
  -m 65536 \
  -object 'iothread,id=iothread-virtioscsi0' \
  -object 'iothread,id=iothread-virtioscsi1' \
  -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg \
  -device 'vmgenid,guid=938cfbfd-be7c-4849-ab2b-a7bda8ba311b' \
  -device 'qemu-xhci,p2=15,p3=15,id=xhci,bus=pci.1,addr=0x1b' \
  -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' \
  -device 'vfio-pci,host=0000:05:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on' \
  -device 'vfio-pci,host=0000:05:00.1,id=hostpci0.1,bus=ich9-pcie-port-1,addr=0x0.1' \
  -device 'vfio-pci,host=0000:0b:00.0,id=hostpci1.0,bus=ich9-pcie-port-2,addr=0x0.0,multifunction=on' \
  -device 'vfio-pci,host=0000:0b:00.1,id=hostpci1.1,bus=ich9-pcie-port-2,addr=0x0.1' \
  -device 'usb-host,bus=xhci.0,port=1,hostbus=4,hostport=2,id=usb0' \
  -device 'usb-host,bus=xhci.0,port=2,hostbus=4,hostport=5,id=usb1' \
  -device 'usb-host,bus=xhci.0,port=3,hostbus=4,hostport=6,id=usb2' \
  -device 'VGA,id=vga,bus=pcie.0,addr=0x1' \
  -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on' \
  -iscsi 'initiator-name=iqn.1993-08.org.debian:01:fd36851ff5e6' \
  -drive 'if=none,id=drive-ide2,media=cdrom,aio=io_uring' \
  -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=101' \
  -device 'virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1,iothread=iothread-virtioscsi0' \
  -drive 'file=/dev/zvol/rpool/data/vm-220-disk-1,if=none,id=drive-scsi0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
  -device 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' \
  -device 'virtio-scsi-pci,id=virtioscsi1,bus=pci.3,addr=0x2,iothread=iothread-virtioscsi1' \
  -drive 'file=/dev/zvol/zfsraid10nvr/vm-220-disk-0,if=none,id=drive-scsi1,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
  -device 'scsi-hd,bus=virtioscsi1.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi1,id=scsi1' \
  -netdev 'type=tap,id=net0,ifname=tap220i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' \
  -device 'virtio-net-pci,mac=5a:a2:5b:7c:3e:01,netdev=net0,bus=pci.0,addr=0x12,id=net0,rx_queue_size=1024,tx_queue_size=1024,bootindex=102' \
  -machine 'type=q35+pve0'

The above config is the one where it works as described, i.e. using USB port pass-through. In addition to the Corals, I am passing through two GPUs (05: and 0b: ) without issue, so those PCIe devices are not the USB host controller(s).

Snip of Docker Compose:
Code:
devices:
  - /dev/bus/usb:/dev/bus/usb

Snip of Frigate configuration:
Code:
detectors:
  coral1:
    type: edgetpu
    device: usb:0
  coral2:
    type: edgetpu
    device: usb:1
  coral3:
    type: edgetpu
    device: usb:2
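For completeness, this is how I check that the TPUs are actually visible inside the container (assuming `lsusb` is available in the Frigate image; the vendor:product IDs are the usual Coral ones, 1a6e:089a before the runtime loads the firmware and 18d1:9302 afterwards):

```shell
# Check that the Corals show up inside the Frigate container.
# 'frigate' is my container name; adjust to yours.
docker exec frigate lsusb | grep -Ei '1a6e:089a|18d1:9302'
```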


I hope all of this makes more sense to you than it does to me. :)

Questions:
1) Obviously: have I missed something obvious that is the simple cause of all this? If so, what can I do to fix or troubleshoot it?
2) Am I able to dictate that the pass-through should use nec-usb-xhci? If so, how, and where?

Of course, any other thoughts are welcome. :)


Thanks,
Nicolai