Proxmox GPU Passthrough not working - 2 x NVIDIA T4 - Cisco UCS X210c M7

Witchdoc

New Member
Jun 28, 2024
Hi all - wondering if someone can help.

I have Proxmox 8.3.2 running on a Cisco UCS X-Series blade with 2 x NVIDIA T4 cards installed, and I am trying to get GPU passthrough working to a Linux 24.04 VM running on it. I have found a great deal of information in the threads on this forum and have tried multiple permutations. I have the drivers blacklisted, and I can see the 2 cards (which, interestingly, have the same vendor and device ID). The blacklisting appears to be working, as the "Kernel driver in use" line no longer appears in the output below:

root@ai:~# lspci -k | grep -E "vfio-pci | NVIDIA"
63:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
Subsystem: NVIDIA Corporation TU104GL [Tesla T4]
64:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
Subsystem: NVIDIA Corporation TU104GL [Tesla T4]
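
One caveat about the grep pattern above: because of the spaces around the "|", the two alternatives are "vfio-pci " (with a trailing space) and " NVIDIA" (with a leading space). A line that ends in "Kernel driver in use: vfio-pci" would therefore be filtered out even if the driver were bound. The sketch below shows the effect on a made-up sample; the sample text is an assumption for illustration, not output from this host.

```shell
# Hypothetical sample of `lspci -k` output for one device bound to vfio-pci.
# This text is illustrative only; it was not captured on the host in this post.
sample='63:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
    Subsystem: NVIDIA Corporation TU104GL [Tesla T4]
    Kernel driver in use: vfio-pci
    Kernel modules: nvidiafb, nouveau'

# Pattern as used in the post: alternatives are "vfio-pci " and " NVIDIA".
with_spaces=$(printf '%s\n' "$sample" | grep -cE "vfio-pci | NVIDIA")

# Same pattern without the spaces: also matches a line that ends in "vfio-pci".
without_spaces=$(printf '%s\n' "$sample" | grep -cE "vfio-pci|NVIDIA")

echo "lines matched with spaces: $with_spaces, without spaces: $without_spaces"
```

On the host itself, "lspci -nnk -s 63:00.0" (and the same for 64:00.0) prints the full entry for one card, including any "Kernel driver in use" line, without needing grep.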

I have mapped the PCI device for the first card above (0000:63:00) which is in IOMMU Group 5.
When I try to start the Linux VM I get the following error:

kvm: -device vfio-pci,host=0000:63:00.0,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0: vfio: error disconnecting group 5 from container
kvm: -device vfio-pci,host=0000:63:00.0,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0: vfio 0000:63:00.0: error getting device from group 5: No such device
Verify all devices in group 5 are bound to vfio-<bus> or pci-stub and not already in use
TASK ERROR: start failed: QEMU exited with code 1
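
The error asks to verify that all devices in group 5 are bound to vfio. A minimal sketch of that check follows, assuming the standard sysfs layout under /sys/kernel/iommu_groups. The function name list_group and the SYSFS_ROOT override are hypothetical helpers introduced here so the function can also be tried against a mock directory tree.

```shell
# Print "<pci address> <bound driver>" for every device in one IOMMU group.
# SYSFS_ROOT is a hypothetical override (default /sys) used only for testing.
list_group() {
    group="$1"
    root="${SYSFS_ROOT:-/sys}"
    for dev in "$root/kernel/iommu_groups/$group/devices"/*; do
        # Skip the unexpanded glob when the group does not exist or is empty.
        [ -e "$dev" ] || continue
        if [ -L "$dev/driver" ]; then
            # The "driver" entry is a symlink whose target ends in the driver name.
            drv=$(basename "$(readlink "$dev/driver")")
        else
            drv="(none)"
        fi
        echo "$(basename "$dev") $drv"
    done
}

# Group number taken from the error message in the post.
list_group 5
```

If every entry shows vfio-pci, the group binding itself is probably not the issue; an entry showing "(none)" or another driver would be the first thing to look at.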

It looks like the host or something else is using the card, but I can't see how or where. Does anyone know how to fix this? Any assistance is greatly appreciated!
 
Some additional information: here is the node's system log from when I start the VM:

Jan 10 07:59:59 ai kernel: vfio-pci 0000:63:00.0: Unable to change power state from D0 to D3hot, device inaccessible
Jan 10 07:59:59 ai kernel: vfio-pci 0000:63:00.0: Unable to change power state from D3cold to D0, device inaccessible
Jan 10 07:59:59 ai kernel: pcieport 0000:62:01.0: pciehp: Slot(1): Card present
Jan 10 07:59:59 ai kernel: pcieport 0000:62:01.0: pciehp: Slot(1): Link Up
Jan 10 07:59:59 ai systemd[1]: Started 101.scope.
Jan 10 07:59:59 ai kernel: pci 0000:63:00.0: [10de:1eb8] type 00 class 0x030200 PCIe Endpoint
Jan 10 07:59:59 ai kernel: pci 0000:63:00.0: BAR 0 [mem 0xc6000000-0xc6ffffff]
Jan 10 07:59:59 ai kernel: pci 0000:63:00.0: BAR 1 [mem 0x217ec0000000-0x217ecfffffff 64bit pref]
Jan 10 07:59:59 ai kernel: pci 0000:63:00.0: BAR 3 [mem 0x217fd0000000-0x217fd1ffffff 64bit pref]
Jan 10 07:59:59 ai kernel: pci 0000:63:00.0: Enabling HDA controller
Jan 10 07:59:59 ai kernel: pci 0000:63:00.0: PME# supported from D0 D3hot D3cold
Jan 10 07:59:59 ai kernel: pci 0000:63:00.0: VF BAR 0 [mem 0xc7000000-0xc703ffff]
Jan 10 07:59:59 ai kernel: pci 0000:63:00.0: VF BAR 0 [mem 0xc7000000-0xc73fffff]: contains BAR 0 for 16 VFs
Jan 10 07:59:59 ai kernel: pci 0000:63:00.0: VF BAR 1 [mem 0x217ed0000000-0x217edfffffff 64bit pref]
Jan 10 07:59:59 ai kernel: pci 0000:63:00.0: VF BAR 1 [mem 0x217ed0000000-0x217fcfffffff 64bit pref]: contains BAR 1 for 16 VFs
Jan 10 07:59:59 ai kernel: pci 0000:63:00.0: VF BAR 3 [mem 0x00000000-0x01ffffff 64bit pref]
Jan 10 07:59:59 ai kernel: pci 0000:63:00.0: VF BAR 3 [mem 0x00000000-0x1fffffff 64bit pref]: contains BAR 3 for 16 VFs
Jan 10 07:59:59 ai kernel: pci 0000:63:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:62:01.0 (capable of 126.016 Gb/s with 8.0 GT/s PCIe x16 link)
Jan 10 07:59:59 ai kernel: pci 0000:63:00.0: Adding to iommu group 5
Jan 10 07:59:59 ai kernel: pcieport 0000:62:01.0: bridge window [io 0x1000-0x0fff] to [bus 63] add_size 1000
Jan 10 07:59:59 ai kernel: pcieport 0000:62:01.0: bridge window [io size 0x1000]: can't assign; no space
Jan 10 07:59:59 ai kernel: pcieport 0000:62:01.0: bridge window [io size 0x1000]: failed to assign
Jan 10 07:59:59 ai kernel: pcieport 0000:62:01.0: bridge window [io size 0x1000]: can't assign; no space
Jan 10 07:59:59 ai kernel: pcieport 0000:62:01.0: bridge window [io size 0x1000]: failed to assign
Jan 10 07:59:59 ai kernel: pci 0000:63:00.0: BAR 1 [mem 0x217ec0000000-0x217ecfffffff 64bit pref]: assigned
Jan 10 07:59:59 ai kernel: pci 0000:63:00.0: VF BAR 1 [mem 0x217ed0000000-0x217fcfffffff 64bit pref]: assigned
Jan 10 07:59:59 ai kernel: pci 0000:63:00.0: BAR 3 [mem 0x217fd0000000-0x217fd1ffffff 64bit pref]: assigned
Jan 10 07:59:59 ai kernel: pci 0000:63:00.0: VF BAR 3 [mem 0x217fd2000000-0x217ff1ffffff 64bit pref]: assigned
Jan 10 07:59:59 ai kernel: pci 0000:63:00.0: BAR 0 [mem 0xc6000000-0xc6ffffff]: assigned
Jan 10 07:59:59 ai kernel: pci 0000:63:00.0: VF BAR 0 [mem 0xc7000000-0xc73fffff]: assigned
Jan 10 07:59:59 ai kernel: pcieport 0000:62:01.0: PCI bridge to [bus 63]
Jan 10 07:59:59 ai kernel: pcieport 0000:62:01.0: bridge window [mem 0xc6000000-0xc73fffff]
Jan 10 07:59:59 ai kernel: pcieport 0000:62:01.0: bridge window [mem 0x217ec0000000-0x217ff1ffffff 64bit pref]
Jan 10 07:59:59 ai kernel: tap101i0: entered promiscuous mode
Jan 10 07:59:59 ai kernel: vmbr0: port 2(fwpr101p0) entered blocking state
Jan 10 07:59:59 ai kernel: vmbr0: port 2(fwpr101p0) entered disabled state
Jan 10 07:59:59 ai kernel: fwpr101p0: entered allmulticast mode
Jan 10 07:59:59 ai kernel: fwpr101p0: entered promiscuous mode
Jan 10 07:59:59 ai kernel: bond2: entered promiscuous mode
Jan 10 07:59:59 ai kernel: bond0: entered promiscuous mode
Jan 10 07:59:59 ai kernel: enic 0000:1b:00.0 eno5: entered promiscuous mode
Jan 10 07:59:59 ai kernel: enic 0000:1b:00.2 eno7: entered promiscuous mode
Jan 10 07:59:59 ai kernel: vmbr0: port 2(fwpr101p0) entered blocking state
Jan 10 07:59:59 ai kernel: vmbr0: port 2(fwpr101p0) entered forwarding state
Jan 10 07:59:59 ai kernel: fwbr101i0: port 1(fwln101i0) entered blocking state
Jan 10 07:59:59 ai kernel: fwbr101i0: port 1(fwln101i0) entered disabled state
Jan 10 07:59:59 ai kernel: fwln101i0: entered allmulticast mode
Jan 10 07:59:59 ai kernel: fwln101i0: entered promiscuous mode
Jan 10 07:59:59 ai kernel: fwbr101i0: port 1(fwln101i0) entered blocking state
Jan 10 07:59:59 ai kernel: fwbr101i0: port 1(fwln101i0) entered forwarding state
Jan 10 07:59:59 ai kernel: fwbr101i0: port 2(tap101i0) entered blocking state
Jan 10 07:59:59 ai kernel: fwbr101i0: port 2(tap101i0) entered disabled state
Jan 10 07:59:59 ai kernel: tap101i0: entered allmulticast mode
Jan 10 07:59:59 ai kernel: fwbr101i0: port 2(tap101i0) entered blocking state
Jan 10 07:59:59 ai kernel: fwbr101i0: port 2(tap101i0) entered forwarding state
Jan 10 08:00:07 ai pvedaemon[1682]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Jan 10 08:00:08 ai pvestatd[1658]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Jan 10 08:00:09 ai pvestatd[1658]: status update time (8.154 seconds)
Jan 10 08:00:15 ai pvedaemon[1682]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Jan 10 08:00:15 ai pvedaemon[1682]: <root@pam> starting task UPID:ai:00077F0D:00C1D084:678038DF:vncproxy:101:root@pam:
Jan 10 08:00:15 ai pvedaemon[491277]: starting vnc proxy UPID:ai:00077F0D:00C1D084:678038DF:vncproxy:101:root@pam:
Jan 10 08:00:18 ai pvestatd[1658]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Jan 10 08:00:18 ai pvestatd[1658]: status update time (8.142 seconds)
Jan 10 08:00:21 ai qm[491279]: VM 101 qmp command failed - VM 101 qmp command 'set_password' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Jan 10 08:00:21 ai pvedaemon[491277]: Failed to run vncproxy.
Jan 10 08:00:21 ai pvedaemon[1682]: <root@pam> end task UPID:ai:00077F0D:00C1D084:678038DF:vncproxy:101:root@pam: Failed to run vncproxy.
Jan 10 08:00:28 ai pvestatd[1658]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Jan 10 08:00:28 ai pvestatd[1658]: status update time (8.134 seconds)
Jan 10 08:00:32 ai pvedaemon[1682]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Jan 10 08:00:38 ai pvestatd[1658]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Jan 10 08:00:38 ai pvestatd[1658]: status update time (8.130 seconds)
Jan 10 08:00:47 ai kernel: pcieport 0000:62:01.0: pciehp: Slot(1): Link Down
Jan 10 08:00:48 ai pvestatd[1658]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Jan 10 08:00:48 ai pvestatd[1658]: status update time (8.170 seconds)
Jan 10 08:00:50 ai kernel: pcieport 0000:62:01.0: pciehp: Slot(1): Card present
Jan 10 08:00:50 ai kernel: pcieport 0000:62:01.0: pciehp: Slot(1): Link Up
Jan 10 08:00:50 ai kernel: pci 0000:63:00.0: [10de:1eb8] type 00 class 0x030200 PCIe Endpoint
Jan 10 08:00:50 ai kernel: pci 0000:63:00.0: BAR 0 [mem 0xc6000000-0xc6ffffff]
Jan 10 08:00:50 ai kernel: pci 0000:63:00.0: BAR 1 [mem 0x217ec0000000-0x217ecfffffff 64bit pref]
Jan 10 08:00:50 ai kernel: pci 0000:63:00.0: BAR 3 [mem 0x217fd0000000-0x217fd1ffffff 64bit pref]
Jan 10 08:00:50 ai kernel: pci 0000:63:00.0: Enabling HDA controller
Jan 10 08:00:50 ai kernel: pci 0000:63:00.0: PME# supported from D0 D3hot D3cold
Jan 10 08:00:50 ai kernel: pci 0000:63:00.0: VF BAR 0 [mem 0xc7000000-0xc703ffff]
Jan 10 08:00:50 ai kernel: pci 0000:63:00.0: VF BAR 0 [mem 0xc7000000-0xc73fffff]: contains BAR 0 for 16 VFs
Jan 10 08:00:50 ai kernel: pci 0000:63:00.0: VF BAR 1 [mem 0x217ed0000000-0x217edfffffff 64bit pref]
Jan 10 08:00:50 ai kernel: pci 0000:63:00.0: VF BAR 1 [mem 0x217ed0000000-0x217fcfffffff 64bit pref]: contains BAR 1 for 16 VFs
Jan 10 08:00:50 ai kernel: pci 0000:63:00.0: VF BAR 3 [mem 0x217fd2000000-0x217fd3ffffff 64bit pref]
Jan 10 08:00:50 ai kernel: pci 0000:63:00.0: VF BAR 3 [mem 0x217fd2000000-0x217ff1ffffff 64bit pref]: contains BAR 3 for 16 VFs
Jan 10 08:00:50 ai kernel: pci 0000:63:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:62:01.0 (capable of 126.016 Gb/s with 8.0 GT/s PCIe x16 link)
Jan 10 08:00:50 ai kernel: pci 0000:63:00.0: Adding to iommu group 5
Jan 10 08:00:50 ai kernel: pcieport 0000:62:01.0: bridge window [io 0x1000-0x0fff] to [bus 63] add_size 1000
Jan 10 08:00:50 ai kernel: pcieport 0000:62:01.0: bridge window [io size 0x1000]: can't assign; no space
Jan 10 08:00:50 ai kernel: pcieport 0000:62:01.0: bridge window [io size 0x1000]: failed to assign
Jan 10 08:00:50 ai kernel: pcieport 0000:62:01.0: bridge window [io size 0x1000]: can't assign; no space
Jan 10 08:00:50 ai kernel: pcieport 0000:62:01.0: bridge window [io size 0x1000]: failed to assign
Jan 10 08:00:50 ai kernel: pci 0000:63:00.0: BAR 1 [mem 0x217ec0000000-0x217ecfffffff 64bit pref]: assigned
Jan 10 08:00:50 ai kernel: pci 0000:63:00.0: VF BAR 1 [mem 0x217ed0000000-0x217fcfffffff 64bit pref]: assigned
Jan 10 08:00:50 ai kernel: pci 0000:63:00.0: BAR 3 [mem 0x217fd0000000-0x217fd1ffffff 64bit pref]: assigned
Jan 10 08:00:50 ai kernel: pci 0000:63:00.0: VF BAR 3 [mem 0x217fd2000000-0x217ff1ffffff 64bit pref]: assigned
Jan 10 08:00:50 ai kernel: pci 0000:63:00.0: BAR 0 [mem 0xc6000000-0xc6ffffff]: assigned
Jan 10 08:00:50 ai kernel: pci 0000:63:00.0: VF BAR 0 [mem 0xc7000000-0xc73fffff]: assigned
Jan 10 08:00:50 ai kernel: pcieport 0000:62:01.0: PCI bridge to [bus 63]
Jan 10 08:00:50 ai kernel: pcieport 0000:62:01.0: bridge window [mem 0xc6000000-0xc73fffff]
Jan 10 08:00:50 ai kernel: pcieport 0000:62:01.0: bridge window [mem 0x217ec0000000-0x217ff1ffffff 64bit pref]
Jan 10 08:00:50 ai kernel: tap101i0: left allmulticast mode
Jan 10 08:00:50 ai kernel: fwbr101i0: port 2(tap101i0) entered disabled state
Jan 10 08:00:50 ai kernel: fwbr101i0: port 1(fwln101i0) entered disabled state
Jan 10 08:00:50 ai kernel: vmbr0: port 2(fwpr101p0) entered disabled state
Jan 10 08:00:50 ai kernel: fwln101i0 (unregistering): left allmulticast mode
Jan 10 08:00:50 ai kernel: fwln101i0 (unregistering): left promiscuous mode
Jan 10 08:00:50 ai kernel: fwbr101i0: port 1(fwln101i0) entered disabled state
Jan 10 08:00:50 ai kernel: fwpr101p0 (unregistering): left allmulticast mode
Jan 10 08:00:50 ai kernel: fwpr101p0 (unregistering): left promiscuous mode
Jan 10 08:00:50 ai kernel: vmbr0: port 2(fwpr101p0) entered disabled state
Jan 10 08:00:50 ai kernel: bond2: left promiscuous mode
Jan 10 08:00:50 ai kernel: bond0: left promiscuous mode
Jan 10 08:00:50 ai kernel: enic 0000:1b:00.0 eno5: left promiscuous mode
Jan 10 08:00:50 ai kernel: enic 0000:1b:00.2 eno7: left promiscuous mode
Jan 10 08:00:51 ai pvedaemon[491034]: start failed: QEMU exited with code 1
Jan 10 08:00:51 ai pvedaemon[1682]: <root@pam> end task UPID:ai:00077E1A:00C1CA02:678038CE:qmstart:101:root@pam: start failed: QEMU exited with code 1
Jan 10 08:00:51 ai systemd[1]: 101.scope: Deactivated successfully.
Jan 10 08:00:51 ai systemd[1]: 101.scope: Consumed 48.760s CPU time.
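
Most of the log above is bridge and network noise. The lines that seem most relevant are the vfio-pci power-state messages and the pciehp hotplug events on the upstream port, which may indicate the card dropping off the bus and being re-detected while the VM starts. A small filter to isolate them, as a sketch; the function name gpu_events is a hypothetical helper, and the sample input is three lines copied from the log above.

```shell
# Keep only vfio-pci messages and PCIe hotplug (pciehp) events from a log on stdin.
gpu_events() {
    grep -E 'vfio-pci|pciehp'
}

# Three lines copied from the system log in this post, used as sample input.
log='Jan 10 07:59:59 ai kernel: vfio-pci 0000:63:00.0: Unable to change power state from D0 to D3hot, device inaccessible
Jan 10 07:59:59 ai kernel: tap101i0: entered promiscuous mode
Jan 10 08:00:47 ai kernel: pcieport 0000:62:01.0: pciehp: Slot(1): Link Down'

# Prints the first and third lines; the tap101i0 line is dropped.
printf '%s\n' "$log" | gpu_events
```

On the host, the same filter can be fed from the kernel log of the current boot with "journalctl -k -b | gpu_events".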