Nvidia vGPU No Mediated devices

Adevill20

New Member
Aug 22, 2024
I need help setting up an NVIDIA RTX A5000 vGPU. I followed the instructions at https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE and get the following outputs from the NVIDIA drivers, but for some reason no mediated devices are listed. I have a supported GPU and am using the latest GRID drivers. Secure Boot is disabled, and the display output of the GPU is disabled as well:

output1.jpg

output2.jpg

output3.jpg
 
If you are on kernel 6.8, take a look at this thread [1]: newer kernels use vendor-specific VFIO instead of mediated devices, so for now VMs and vGPUs have to be configured somewhat manually rather than through the web UI. Code to adapt to this change is already in the works, so it should be usable from the web UI soon [2].

[1] https://forum.proxmox.com/threads/vgpu-with-nvidia-on-kernel-6-8.150840/
[2] https://forum.proxmox.com/threads/vgpu-with-nvidia-on-kernel-6-8.150840/post-690894
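To see which of the two interfaces a given PCI function actually exposes, a quick sysfs check can help. This is only a sketch: the function name vgpu_interface and the SYSFS_PCI override are made up for illustration, the nvidia/creatable_vgpu_types path is the one used later in this thread, and mdev_supported_types is the directory the kernel's mediated device framework creates.

```shell
#!/usr/bin/env bash
# Sketch: report which vGPU interface a PCI function exposes in sysfs.
# SYSFS_PCI is a hypothetical override so the check can be tried on a fake tree.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

vgpu_interface() {
    local dev="$SYSFS_PCI/$1"
    if [ -e "$dev/nvidia/creatable_vgpu_types" ]; then
        # Vendor-specific VFIO: vGPU types are managed under <device>/nvidia/
        echo "vendor-specific-vfio"
    elif [ -d "$dev/mdev_supported_types" ]; then
        # Classic mediated devices
        echo "mdev"
    else
        echo "none"
    fi
}
```

Usage would be something like `vgpu_interface 0000:61:00.4`, with the PCI address of one of your VFs.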
Thank you for the information. I followed these steps, but the VM still doesn't want to start. Please see the error below:

output4.jpg
 
Which steps did you follow exactly? Help me help you :)
Reboot the PVE host, as sometimes the driver/hardware just gets stuck. Then post the output of /usr/lib/nvidia/sriov-manage -e ALL so I can see all the PCI IDs of the vGPUs.
What's the output of cat /sys/bus/pci/devices/<DOMAIN>\:<BUS>\:<SLOT>.<FUNCTION>/nvidia/creatable_vgpu_types, using the PCI ID of one of the vGPUs?
Did you set the type of the vGPU using something like echo 918 > /sys/bus/pci/devices/0000\:26\:00.4/nvidia/current_vgpu_type?
Post the VM configuration (qm config VMID).
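The checks above can be gathered in one go with something like the following. Treat it as a rough sketch: show_vgpu_state, collect_vgpu_report and the SYSFS_PCI override are names invented here, and I'm assuming current_vgpu_type can be read back as well as written.

```shell
#!/usr/bin/env bash
# Sketch: collect the information requested above for one VF and one VM.
# SYSFS_PCI is a hypothetical override, handy for trying this on a fake tree.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

# Print the vGPU types a VF can be given and the type it currently has.
show_vgpu_state() {
    local dir="$SYSFS_PCI/$1/nvidia"
    if [ ! -d "$dir" ]; then
        echo "no nvidia directory for $1" >&2
        return 1
    fi
    echo "== $1 creatable_vgpu_types =="
    cat "$dir/creatable_vgpu_types"
    echo "== $1 current_vgpu_type =="
    cat "$dir/current_vgpu_type"
}

# Enable the VFs, show the vGPU state of one VF, then dump the VM config.
# Meant to be run as root on the PVE host.
collect_vgpu_report() {
    local bdf="$1" vmid="$2"
    /usr/lib/nvidia/sriov-manage -e ALL
    show_vgpu_state "$bdf"
    qm config "$vmid"
}
```

For example: `collect_vgpu_report 0000:61:00.4 8402`, replacing the PCI address and VMID with your own.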
 
I followed the exact steps of the first guide (https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE). Then the steps as per the instructions of this post (https://forum.proxmox.com/threads/vgpu-with-nvidia-on-kernel-6-8.150840/).

In summary:
I enabled IOMMU (BIOS, GRUB and modules), installed the latest NVIDIA drivers, then followed the instructions you posted. The GPU is seen by PVE, the VFs are seen, and I set the vGPU type to 8Q for the first VF (0000:61:00.4).

output5.jpg

VM Config:
root@pve1:/sys/bus/pci/devices/0000:61:00.4/nvidia# cat /etc/pve/qemu-server/8402.conf
agent: 1
args: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:61:00.4 --uuid 7396508e-201b-4911-9dd2-885be0c3c681
bios: ovmf
boot: order=scsi0
cores: 8
cpu: host,flags=+ibpb;+virt-ssbd;+amd-ssbd;+pdpe1gb;+aes
efidisk0: nas1-m2:8402/vm-8402-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
ide0: nas1-m2:iso/virtio-win-0.1.248.iso,media=cdrom,size=715188K
machine: pc-q35-9.0
memory: 8192
meta: creation-qemu=9.0.0,ctime=1721222033
name: Sandbox4-Client1
net0: virtio=BC:24:11:D2:85:C3,bridge=Sandbox4
numa: 1
onboot: 1
ostype: win10
parent: AsBuilt
scsi0: nas1-m2:8402/vm-8402-disk-1.qcow2,aio=native,cache=directsync,iothread=1,size=100G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=7396508e-201b-4911-9dd2-885be0c3c681
sockets: 1
startup: up=210
tpmstate0: nas1-m2:8402/vm-8402-disk-0.raw,size=4M,version=v2.0
vga: none
vmgenid: 14bab5f9-5eaa-4ae6-ae43-e550e1e2438e
vmstatestorage: nas1-m2

[AsBuilt]
agent: 1
args: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:61:00.4 --uuid 7396508e-201b-4911-9dd2-885be0c3c681
bios: ovmf
boot: order=scsi0
cores: 8
cpu: host,flags=+md-clear;+ibpb;+virt-ssbd;+amd-ssbd;+amd-no-ssb;+pdpe1gb;+aes
efidisk0: nas1-m2:8402/vm-8402-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
ide0: nas1-m2:iso/virtio-win-0.1.248.iso,media=cdrom,size=715188K
machine: pc-q35-9.0
memory: 8192
meta: creation-qemu=9.0.0,ctime=1721222033
name: Sandbox4-Client1
net0: virtio=BC:24:11:D2:85:C3,bridge=Sandbox4
numa: 1
onboot: 1
ostype: win10
scsi0: nas1-m2:8402/vm-8402-disk-1.qcow2,aio=native,cache=directsync,iothread=1,size=100G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=7396508e-201b-4911-9dd2-885be0c3c681
snaptime: 1721224793
sockets: 1
startup: up=210
vga: virtio
vmgenid: 14bab5f9-5eaa-4ae6-ae43-e550e1e2438e
 
Found my error: I had put two dashes ("--") in front of uuid in the args line of the VM config.

After fixing it the VM booted and the vGPU was recognized. Thank you for the help!!
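For anyone hitting the same thing, a quick grep can flag the typo. This is a sketch only: find_double_dash_uuid is a made-up helper name, and it assumes the intended form is the single-dash -uuid used in the thread linked earlier.

```shell
#!/usr/bin/env bash
# Sketch: print any "args:" line in a VM config that passes the UUID with
# two dashes. Exit status 0 means at least one such line was found.
# Assumption: the intended spelling is the single-dash "-uuid".
find_double_dash_uuid() {
    grep -nE '^args:.*[[:space:]]--uuid[[:space:]]' "$1"
}
```

Run it as `find_double_dash_uuid /etc/pve/qemu-server/8402.conf`. Note that snapshot sections in the same file carry their own copy of the args line, so more than one line may be reported.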
 
I use libvirt with the vendor-specific VFIO framework and attached the VF PCI device to VFIO, but got `error: Requested operation is not valid: Unmanaged PCI device 0000:3b:00.6 must be manually detached from the host`. I can see that both the VF and the PF are using the nvidia driver. Can anyone help?

This is my physical GPU device (PF):

Code:
3b:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
    Subsystem: Dell Device 1459
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 37
    NUMA node: 0
    IOMMU group: 4
    Region 0: Memory at ab000000 (32-bit, non-prefetchable) [size=16M]
    Region 1: Memory at 380000000000 (64-bit, prefetchable) [size=64G]
    Region 3: Memory at 382040000000 (64-bit, prefetchable) [size=32M]
    Capabilities: [60] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [68] Null
    Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
        DevCap:    MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
        DevCtl:    CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta:    CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap:    Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl:    ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
        LnkSta:    Speed 2.5GT/s (downgraded), Width x16
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
             10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS-
             AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
             AtomicOpsCtl: ReqEn-
        LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
        LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
        LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
             EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [b4] Vendor Specific Information: Len=14 <?>
    Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
        Vector table: BAR=0 offset=00b90000
        PBA: BAR=0 offset=00ba0000
    Capabilities: [100 v1] Virtual Channel
        Caps:    LPEVC=0 RefClk=100ns PATEntryBits=1
        Arb:    Fixed- WRR32- WRR64- WRR128-
        Ctrl:    ArbSelect=Fixed
        Status:    InProgress-
        VC0:    Caps:    PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
            Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
            Ctrl:    Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
            Status:    NegoPending- InProgress-
    Capabilities: [258 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
              PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
               T_CommonMode=0us LTR1.2_Threshold=0ns
        L1SubCtl2: T_PwrOn=10us
    Capabilities: [128 v1] Power Budgeting <?>
    Capabilities: [420 v2] Advanced Error Reporting
        UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt:    DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
        CEMsk:    RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
        AERCap:    First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Capabilities: [bb0 v1] Physical Resizable BAR
        BAR 0: current size: 16MB, supported: 16MB
        BAR 1: current size: 64GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB 64GB
        BAR 3: current size: 32MB, supported: 32MB
    Capabilities: [bcc v1] Single Root I/O Virtualization (SR-IOV)
        IOVCap:    Migration- 10BitTagReq+ Interrupt Message Number: 000
        IOVCtl:    Enable+ Migration- Interrupt- MSE+ ARIHierarchy+ 10BitTagReq-
        IOVSta:    Migration-
        Initial VFs: 32, Total VFs: 32, Number of VFs: 32, Function Dependency Link: 00
        VF offset: 4, stride: 1, Device ID: 2230
        Supported Page Size: 00000573, System Page Size: 00000001
        Region 0: Memory at ac000000 (32-bit, non-prefetchable)
        Region 1: Memory at 0000381000000000 (64-bit, prefetchable)
        Region 3: Memory at 0000382000000000 (64-bit, prefetchable)
        VF Migration: offset: 00000000, BIR: 0
    Capabilities: [c14 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap:    MFVC- ACS-, Next Function: 0
        ARICtl:    MFVC- ACS-, Function Group: 0
    Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
    Capabilities: [d00 v1] Lane Margining at the Receiver <?>
    Capabilities: [e00 v1] Data Link Feature <?>
    Kernel driver in use: nvidia
    Kernel modules: nouveau, nvidia_vgpu_vfio, nvidia

This is my VF:

3b:00.5 3D controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
    Subsystem: Dell Device 0000
    Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    NUMA node: 0
    IOMMU group: 133
    Region 0: Memory at ac040000 (32-bit, non-prefetchable) [virtual] [size=256K]
    Region 1: Memory at 381080000000 (64-bit, prefetchable) [virtual] [size=2G]
    Region 3: Memory at 382002000000 (64-bit, prefetchable) [virtual] [size=32M]
    Capabilities: [40] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W
        DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed unknown (downgraded), Width x0 (downgraded)
            TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
             10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS- TPHComp- ExtTPHComp-
             AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
             AtomicOpsCtl: ReqEn-
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
             EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [7c] MSI-X: Enable- Count=3 Masked-
        Vector table: BAR=0 offset=00010000
        PBA: BAR=0 offset=00020000
    Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 1
        ARICtl: MFVC- ACS-, Function Group: 0
    Kernel driver in use: nvidia
    Kernel modules: nouveau, nvidia_vgpu_vfio, nvidia
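
Since the libvirt message is about the function still being attached to the host, it may help to confirm which driver each function is bound to, including 0000:3b:00.6 from the error (the listing above is for 3b:00.5). The sketch below only reports state and does not fix anything; bound_driver and the SYSFS_PCI override are names invented for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print the kernel driver a PCI function is bound to, or "none".
# SYSFS_PCI is a hypothetical override so this can be tried on a fake tree.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

bound_driver() {
    local link="$SYSFS_PCI/$1/driver"
    if [ -L "$link" ]; then
        # The "driver" entry is a symlink into /sys/bus/pci/drivers/<name>
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}
```

For example, `bound_driver 0000:3b:00.6` should print the same name that lspci shows under "Kernel driver in use".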