Hi there,
I struggle to pass through USB devices reliably, it seems. My server has three ports, each with a Google Coral USB TPU connected, and all are served by the same USB controller in the server. Everything is on-board; no expansion cards or external hubs.
The guest VM (q35, OVMF, uses ZFS) was made for the sole purpose of running Frigate NVR through Docker Compose. So:
Bare metal => Debian 11 => Proxmox (and nothing else) => GuestVM => Debian 11 => Docker => Frigate
(Yes, overly complicated, but made this way due to conflicts between Docker and Proxmox when using ZFS)
Plenty of resources are available, and the error occurs regardless of whether I connect one, two, or all three Coral USBs.
Problem
What I am experiencing depends on the manner of pass-through. When I pass through a USB device (by device ID), the guest VM sees the USB devices, but for some reason they do not function inside Docker (although they do appear there). Passing through a USB port works flawlessly until it doesn't.
Obviously, the latter seems better, but i) Frigate reports inference speeds of around 30-50 ms, which indicates an issue, as these should generally be around 10 ms at most, and ii) at random points in time, Frigate will notice that object detection processing seems stuck. This coincides with xhci messages in the guest VM and on the host.
Impressions
I've looked into it as best I can. Apparently this tends to happen with I/O-intensive tasks, such as copying files to sticks/drives, and the error is relatively common, especially with Proxmox (I saw reports of people switching to Proxmox and hitting this problem on otherwise working VMs migrated from other hypervisors).
I've found a helpful post suggesting the problem is related to the move away from nec-usb-xhci that took place around Proxmox version 7.2. So I'd like to try the nec-usb-xhci approach, but I guess I can't just do that, since Proxmox has decided otherwise? If I can, I don't understand how.
I do see a number of examples of apparently doing exactly that in the wiki article on USB devices in VMs, but I assume it's simply outdated (?) and will not work. At least, I can't seem to make it work.
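For what it's worth, the only mechanism I'm aware of for overriding the emulated controller is the VM config's args: line (or qm set --args), which appends raw QEMU arguments. Whether a controller added this way coexists cleanly with the qemu-xhci that Proxmox adds on its own is an assumption on my part; the bus/addr values and the attach-by-ID approach below are untested sketches, not a known-good recipe:

```shell
# Append a NEC xHCI controller to VM 220 and attach a Coral to it by
# vendor:product ID. 1a6e:089a is the Coral before the Edge TPU runtime
# loads firmware; it re-enumerates as 18d1:9302 afterwards. The bus and
# addr values are assumptions and may clash with Proxmox's own layout.
qm set 220 --args '-device nec-usb-xhci,id=nec-xhci,bus=pci.1,addr=0x1c -device usb-host,bus=nec-xhci.0,vendorid=0x1a6e,productid=0x089a'
```

If attaching by ID breaks when the Coral re-enumerates under the new ID, attaching by hostbus/hostport (as in the generated command further down) on the nec-xhci bus might be the alternative.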
Other users have reported simply passing through the entire USB host controller instead, which fixed the issue for some (though others report the issues remain). I'd like to avoid this, as I'm using all USB ports on the server: if I ever need to connect a keyboard, I'd first have to free up one USB port for the host. That is easier if I can simply comment out one TPU in Frigate's config and remove it from the VM.
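In case it helps others weighing the same option, this is roughly how I'd work out what a full-controller passthrough would actually grab (the 0000:00:14.0 address below is an example, not my real one):

```shell
# Map host USB bus 4 (the bus my Corals sit on, per the dmesg lines below)
# back to the PCI function that owns it:
readlink /sys/bus/usb/devices/usb4
# The output ends in .../pci0000:00/0000:XX:YY.Z/usb4 - that PCI address
# is the controller a hostpci passthrough would take away from the host.

# Before committing, check what else shares its IOMMU group:
ls /sys/bus/pci/devices/0000:00:14.0/iommu_group/devices
# If the group is clean, it could then be passed through with e.g.:
# qm set 220 --hostpci2 0000:00:14.0
```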
The most informative post I've found, relevant to my problem, is this: https://forum.proxmox.com/threads/p...urces-for-new-device-state.122792/post-536150
Logs:
Frigate:
Code:
2024-07-30 11:04:48.074441397 [INFO] Preparing Frigate...
2024-07-30 11:04:48.102673037 [INFO] Starting Frigate...
2024-07-30 11:04:49.970334735 [2024-07-30 11:04:49] frigate.app INFO : Starting Frigate (0.13.2-6476f8a)
2024-07-30 11:04:50.085873354 [2024-07-30 11:04:50] peewee_migrate.logs INFO : Starting migrations
2024-07-30 11:04:50.092117530 [2024-07-30 11:04:50] peewee_migrate.logs INFO : There is nothing to migrate
2024-07-30 11:04:50.100341577 [2024-07-30 11:04:50] frigate.app INFO : Recording process started: 741
2024-07-30 11:04:50.104342352 [2024-07-30 11:04:50] frigate.app INFO : go2rtc process pid: 89
2024-07-30 11:04:50.152898043 [2024-07-30 11:04:50] detector.coral1 INFO : Starting detection process: 751
2024-07-30 11:04:50.163012662 [2024-07-30 11:04:50] detector.coral2 INFO : Starting detection process: 753
2024-07-30 11:04:50.174397356 [2024-07-30 11:04:50] frigate.app INFO : Output process started: 758
2024-07-30 11:04:50.177984693 [2024-07-30 11:04:50] detector.coral3 INFO : Starting detection process: 756
2024-07-30 11:04:50.259391736 [2024-07-30 11:04:50] frigate.app INFO : Camera processor started for E04_Nordvest: 780
2024-07-30 11:04:50.259632235 [2024-07-30 11:04:50] frigate.app INFO : Camera processor started for E05_Nordost: 782
2024-07-30 11:04:50.272552861 [2024-07-30 11:04:50] frigate.app INFO : Camera processor started for I01_Vaerksted: 794
2024-07-30 11:04:50.285867123 [2024-07-30 11:04:50] frigate.app INFO : Camera processor started for I07_Lager: 796
2024-07-30 11:04:50.298670494 [2024-07-30 11:04:50] frigate.app INFO : Capture process started for E04_Nordvest: 799
2024-07-30 11:04:50.312785081 [2024-07-30 11:04:50] frigate.app INFO : Capture process started for E05_Nordost: 803
2024-07-30 11:04:50.328942980 [2024-07-30 11:04:50] frigate.app INFO : Capture process started for I01_Vaerksted: 807
2024-07-30 11:04:50.355022897 [2024-07-30 11:04:50] frigate.app INFO : Capture process started for I07_Lager: 814
2024-07-30 11:04:53.652068704 [2024-07-30 11:04:50] frigate.detectors.plugins.edgetpu_tfl INFO : Attempting to load TPU as usb:0
2024-07-30 11:04:53.662824376 [2024-07-30 11:04:53] frigate.detectors.plugins.edgetpu_tfl INFO : TPU found
2024-07-30 11:04:54.506634005 [2024-07-30 11:04:50] frigate.detectors.plugins.edgetpu_tfl INFO : Attempting to load TPU as usb:2
2024-07-30 11:04:54.516608316 [2024-07-30 11:04:54] frigate.detectors.plugins.edgetpu_tfl INFO : TPU found
2024-07-30 11:04:54.658337538 [2024-07-30 11:04:50] frigate.detectors.plugins.edgetpu_tfl INFO : Attempting to load TPU as usb:1
2024-07-30 11:04:54.668387055 [2024-07-30 11:04:54] frigate.detectors.plugins.edgetpu_tfl INFO : TPU found
2024-07-30 12:05:50.820267790 [2024-07-30 12:05:50] frigate.watchdog INFO : Detection appears to be stuck. Restarting detection process...
2024-07-30 12:05:50.820623086 [2024-07-30 12:05:50] root INFO : Waiting for detection process to exit gracefully...
2024-07-30 12:06:20.843610929 [2024-07-30 12:06:20] root INFO : Detection process didnt exit. Force killing...
2024-07-30 12:06:20.861164642 [2024-07-30 12:06:20] root INFO : Detection process has exited...
2024-07-30 12:06:21.038897190 [2024-07-30 12:06:21] detector.coral1 INFO : Starting detection process: 30271
2024-07-30 12:06:21.044284639 [2024-07-30 12:06:21] frigate.detectors.plugins.edgetpu_tfl INFO : Attempting to load TPU as usb:0
2024-07-30 12:06:23.982269564 [2024-07-30 12:06:23] frigate.detectors.plugins.edgetpu_tfl INFO : TPU found
2024-07-30 12:22:20.930270998 [2024-07-30 12:22:20] frigate.watchdog INFO : Detection appears to be stuck. Restarting detection process...
2024-07-30 12:22:20.930404121 [2024-07-30 12:22:20] root INFO : Waiting for detection process to exit gracefully...
2024-07-30 12:22:50.969076064 [2024-07-30 12:22:50] root INFO : Detection process didnt exit. Force killing...
2024-07-30 12:22:50.992574914 [2024-07-30 12:22:50] root INFO : Detection process has exited...
2024-07-30 12:22:51.170057692 [2024-07-30 12:22:51] detector.coral3 INFO : Starting detection process: 38097
2024-07-30 12:22:54.143501636 [2024-07-30 12:22:51] frigate.detectors.plugins.edgetpu_tfl INFO : Attempting to load TPU as usb:2
2024-07-30 12:22:54.154507469 [2024-07-30 12:22:54] frigate.detectors.plugins.edgetpu_tfl INFO : TPU found
Note that Frigate starts up without remarks and works without issue for an hour before reporting that detection appears stuck. This can occur rarely, or as often as every 6 minutes. It looks like a watchdog of sorts, so I checked dmesg on the guest, and it seems to coincide with this error, which appears equally frequently:
Code:
xhci_hcd 0000:07:1b.0: Error Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
xhci_hcd 0000:07:1b.0: Error Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
These spam repeatedly, mainly comp_code 1, occasionally comp_code 13. They also print directly to the console, so they're very visible when using noVNC. After some (seemingly random) number of occurrences of the above, the log will also show this:
Code:
usb 6-3: reset SuperSpeed Gen 1 USB device number 7 using xhci_hcd
usb 6-3: LPM exit latency is zeroed, disabling LPM.
All of this then repeats after some interval: sometimes a second, sometimes an hour or two. It will indicate different USB devices, i.e. 6-3 on one occurrence, maybe 6-1 the next, and so on. The error is the same regardless of the number of connected Corals, and it seems to occur with the same (random) frequency.
I then checked dmesg on the host:
Code:
[72600.900070] usb 4-2: reset SuperSpeed USB device number 5 using xhci_hcd
[72600.925174] usb 4-2: LPM exit latency is zeroed, disabling LPM.
[72634.211973] usb 4-2: reset SuperSpeed USB device number 5 using xhci_hcd
[72634.233247] usb 4-2: LPM exit latency is zeroed, disabling LPM.
[72662.368201] usb 4-2: reset SuperSpeed USB device number 5 using xhci_hcd
[72662.393401] usb 4-2: LPM exit latency is zeroed, disabling LPM.
[72669.033794] perf: interrupt took too long (7802 > 7757), lowering kernel.perf_event_max_sample_rate to 25500
[72695.768495] usb 4-2: reset SuperSpeed USB device number 5 using xhci_hcd
[72695.789716] usb 4-2: LPM exit latency is zeroed, disabling LPM.
[72729.148755] usb 4-2: reset SuperSpeed USB device number 5 using xhci_hcd
[72729.170113] usb 4-2: LPM exit latency is zeroed, disabling LPM.
[72762.349002] usb 4-2: reset SuperSpeed USB device number 5 using xhci_hcd
[72762.374906] usb 4-2: LPM exit latency is zeroed, disabling LPM.
Configs:
Code:
root@server:~# pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.15.102-1-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-5.15: 7.3-3
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-3
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-1
libpve-rs-perl: 0.7.5
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.3.3-1
proxmox-backup-file-restore: 2.3.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.6.3
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20221111-1
pve-firewall: 4.3-1
pve-firmware: 3.6-4
pve-ha-manager: 3.6.0
pve-i18n: 2.11-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1
Code:
root@server:~# qm showcmd 220 --pretty
/usr/bin/kvm \
-id 220 \
-name 'frigate,debug-threads=on' \
-no-shutdown \
-chardev 'socket,id=qmp,path=/var/run/qemu-server/220.qmp,server=on,wait=off' \
-mon 'chardev=qmp,mode=control' \
-chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
-mon 'chardev=qmp-event,mode=control' \
-pidfile /var/run/qemu-server/220.pid \
-daemonize \
-smbios 'type=1,uuid=07fcb28e-5c32-464b-bf1d-3cd1811f0cfb' \
-drive 'if=pflash,unit=0,format=raw,readonly=on,file=/usr/share/pve-edk2-firmware//OVMF_CODE_4M.secboot.fd' \
-drive 'if=pflash,unit=1,id=drive-efidisk0,format=raw,file=/dev/zvol/rpool/data/vm-220-disk-0,size=540672' \
-smp '16,sockets=2,cores=8,maxcpus=16' \
-nodefaults \
-boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
-vnc 'unix:/var/run/qemu-server/220.vnc,password=on' \
-cpu 'Broadwell,enforce,+kvm_pv_eoi,+kvm_pv_unhalt,vendor=GenuineIntel' \
-m 65536 \
-object 'iothread,id=iothread-virtioscsi0' \
-object 'iothread,id=iothread-virtioscsi1' \
-readconfig /usr/share/qemu-server/pve-q35-4.0.cfg \
-device 'vmgenid,guid=938cfbfd-be7c-4849-ab2b-a7bda8ba311b' \
-device 'qemu-xhci,p2=15,p3=15,id=xhci,bus=pci.1,addr=0x1b' \
-device 'usb-tablet,id=tablet,bus=ehci.0,port=1' \
-device 'vfio-pci,host=0000:05:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on' \
-device 'vfio-pci,host=0000:05:00.1,id=hostpci0.1,bus=ich9-pcie-port-1,addr=0x0.1' \
-device 'vfio-pci,host=0000:0b:00.0,id=hostpci1.0,bus=ich9-pcie-port-2,addr=0x0.0,multifunction=on' \
-device 'vfio-pci,host=0000:0b:00.1,id=hostpci1.1,bus=ich9-pcie-port-2,addr=0x0.1' \
-device 'usb-host,bus=xhci.0,port=1,hostbus=4,hostport=2,id=usb0' \
-device 'usb-host,bus=xhci.0,port=2,hostbus=4,hostport=5,id=usb1' \
-device 'usb-host,bus=xhci.0,port=3,hostbus=4,hostport=6,id=usb2' \
-device 'VGA,id=vga,bus=pcie.0,addr=0x1' \
-device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on' \
-iscsi 'initiator-name=iqn.1993-08.org.debian:01:fd36851ff5e6' \
-drive 'if=none,id=drive-ide2,media=cdrom,aio=io_uring' \
-device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=101' \
-device 'virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1,iothread=iothread-virtioscsi0' \
-drive 'file=/dev/zvol/rpool/data/vm-220-disk-1,if=none,id=drive-scsi0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
-device 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' \
-device 'virtio-scsi-pci,id=virtioscsi1,bus=pci.3,addr=0x2,iothread=iothread-virtioscsi1' \
-drive 'file=/dev/zvol/zfsraid10nvr/vm-220-disk-0,if=none,id=drive-scsi1,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
-device 'scsi-hd,bus=virtioscsi1.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi1,id=scsi1' \
-netdev 'type=tap,id=net0,ifname=tap220i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' \
-device 'virtio-net-pci,mac=5a:a2:5b:7c:3e:01,netdev=net0,bus=pci.0,addr=0x12,id=net0,rx_queue_size=1024,tx_queue_size=1024,bootindex=102' \
-machine 'type=q35+pve0'
The above config is the one where it works as described, i.e. using USB port pass-through. In addition to the Corals, I am passing through two GPUs (05: and 0b:) without issue, so those PCIe devices are not the USB host controller(s).
Snip of Docker Compose:
Code:
devices:
  - /dev/bus/usb:/dev/bus/usb
Snip of Frigate configuration:
Code:
detectors:
  coral1:
    type: edgetpu
    device: usb:0
  coral2:
    type: edgetpu
    device: usb:1
  coral3:
    type: edgetpu
    device: usb:2
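For completeness, this is how I sanity-check that the TPUs are visible inside the container at all (the container name "frigate" is whatever your compose service resolves to, and the IDs are the standard Coral ones):

```shell
# A Coral shows as 1a6e:089a before the Edge TPU runtime loads firmware,
# and re-enumerates as 18d1:9302 afterwards.
docker exec frigate lsusb

# Count how many Corals the container can see, in either state:
docker exec frigate lsusb | grep -ciE '1a6e:089a|18d1:9302'
```

Note that seeing the devices here doesn't guarantee they work; in the device-ID pass-through case they appeared in this listing but still failed in Frigate.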
I hope all of this makes more sense to you than it does to me.
Questions:
1) Obviously: have I missed something obvious that is the simple cause of all this? If so, what can I do to fix or troubleshoot it?
2) Can I dictate that the pass-through should use nec-usb-xhci? If so, how and where?
Obviously any other thoughts are welcome.
Thanks,
Nicolai