Nvidia vGPU 16.6 drivers on 6.8 kernel - mdevctl list returns no values

I have two Proxmox nodes that have Nvidia Tesla P4s in them. I have had to pin kernel version 6.5 on these nodes for a while now because the vGPU kernel modules would not build successfully for newer kernels.

Today I noticed when updating the official vGPU host drivers to version 16.6 that the vGPU kernel modules build successfully for the current 6.8.8-2-pve kernel. I booted one of these nodes into this kernel and am seeing the following:

1. nvidia-smi shows the expected output.
2. nvidia-smi vgpu shows the expected output.
3. mdevctl list returns an empty line.
4. mdevctl types returns an empty line.

It seems like the drivers are almost working. Everything is working correctly on kernel version 6.5.13-5-pve on both of these nodes, so I know everything is configured correctly for that kernel version. I already have the proxmox-default-headers package installed on both nodes, as well as the proxmox-headers-6.8.8-2-pve package. I checked the logs from the nvidia installer and am not seeing any errors. I am running the stock vGPU drivers from Nvidia, so not the modified unlocked drivers.

Are there extra steps required to get mdevs working for vGPU on this newer kernel version? Or are the official 16.6 vGPU drivers still not compatible with the latest pve kernel?
 
Can you check what sysfs says?

There should be a directory listing the available mdev types under
Code:
/sys/bus/pci/devices/<pciid>/mdev_supported_types

where <pciid> is the PCI address of either the device itself or one of its virtual functions (I'm not sure which applies to the P4).

The lspci -vvv output of the device would also be interesting:

Code:
lspci -vvv -s <pciid>

as well as the output of 'dmesg'.
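
If it helps, the sysfs check could be scripted roughly like this. It is only a sketch: the function name list_mdev_types and its optional sysfs-root argument are made up for illustration, and 0000:01:00.0 in the comment is just an example address. It relies on the name and available_instances files that the kernel's mdev framework exposes for each supported type.

```shell
# list_mdev_types: print every mdev type offered by one PCI function.
#   $1 = PCI address of the GPU or one of its virtual functions
#        (example only: 0000:01:00.0)
#   $2 = sysfs root, defaults to /sys (hypothetical parameter, only
#        there so the function can be tried against a copy of the tree)
list_mdev_types() {
    addr=$1
    root=${2:-/sys}
    dir="$root/bus/pci/devices/$addr/mdev_supported_types"

    # A missing directory means the driver registered no mdev types here.
    if [ ! -d "$dir" ]; then
        echo "no mdev_supported_types for $addr"
        return 1
    fi

    for t in "$dir"/*; do
        [ -d "$t" ] || continue
        name=$(cat "$t/name" 2>/dev/null || echo "?")
        avail=$(cat "$t/available_instances" 2>/dev/null || echo "?")
        echo "$(basename "$t") name=$name available=$avail"
    done
}
```

Calling it with the address of the card (or of a virtual function) should print one line per type, or the "no mdev_supported_types" message if the directory is missing.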
 
Hey there! Thanks for the prompt response.
My system is using the A2 GPU.
The main issue I'm facing is that the GPU isn't being recognized as a mediated device, so the mdev_supported_types directory is not even created.
The output of lspci -vvv -s <pciid>:
Code:
        Subsystem: NVIDIA Corporation GA107GL [A2 / A16]
        Physical Slot: 1
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 16
        NUMA node: 0
        IOMMU group: 15
        Region 0: Memory at a8000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 207000000000 (64-bit, prefetchable) [size=16G]
        Region 3: Memory at 207c20000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] Null
        Capabilities: [78] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75W
                DevCtl: CorrErr- NonFatalErr- FatalErr+ UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (downgraded), Width x8 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq+ OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [b4] Vendor Specific Information: Len=14 <?>
        Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
                Vector table: BAR=0 offset=00b90000
                PBA: BAR=0 offset=00ba0000
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
                        Status: NegoPending- InProgress-
        Capabilities: [250 v1] Latency Tolerance Reporting
                Max snoop latency: 34326183936ns
                Max no snoop latency: 34326183936ns
        Capabilities: [258 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [420 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [bb0 v1] Physical Resizable BAR
                BAR 0: current size: 16MB, supported: 16MB
                BAR 1: current size: 16GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB
                BAR 3: current size: 32MB, supported: 32MB
        Capabilities: [bcc v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration- 10BitTagReq+ Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+ 10BitTagReq-
                IOVSta: Migration-
                Initial VFs: 16, Total VFs: 16, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 4, stride: 1, Device ID: 25b6
                Supported Page Size: 00000573, System Page Size: 00000001
                Region 0: Memory at a9000000 (32-bit, non-prefetchable)
                Region 1: Memory at 0000207400000000 (64-bit, prefetchable)
                Region 3: Memory at 0000207c00000000 (64-bit, prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Capabilities: [c14 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
        Capabilities: [d00 v1] Lane Margining at the Receiver <?>
        Capabilities: [e00 v1] Data Link Feature <?>
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia

Besides this, the following is the dmesg output:
Code:
root@beta1:~# dmesg | grep -i nvidia
[ 7.494648] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 7.564121] nvidia-nvlink: Nvlink Core is being initialized, major device number 511
[ 7.568552] nvidia 0000:17:00.0: enabling device (0140 -> 0142)
[ 7.615155] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.161.05 Thu Jan 25 17:36:41 UTC 2024
[ 7.711662] audit: type=1400 audit(1721075365.432:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1072 comm="apparmor_parser"
[ 7.711666] audit: type=1400 audit(1721075365.432:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1072 comm="apparmor_parser"
 
Did you enable the virtual functions as described here: https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE ?
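
The lspci output above reports "Initial VFs: 16, Total VFs: 16, Number of VFs: 0" and "IOVCtl: Enable-", which suggests that no virtual functions are enabled at the moment. On SR-IOV capable cards the mdev types are, as far as I know, exposed on the virtual functions rather than on the physical device. A quick way to confirm the state is to read the standard sriov_numvfs and sriov_totalvfs files in sysfs. This is a rough sketch: the function name vf_status and the optional sysfs-root argument are made up for illustration.

```shell
# vf_status: report how many SR-IOV virtual functions are enabled.
#   $1 = PCI address of the physical GPU (0000:17:00.0 in the dmesg above)
#   $2 = sysfs root, defaults to /sys (hypothetical parameter for dry runs)
# Returns 0 only if at least one virtual function is enabled.
vf_status() {
    addr=$1
    root=${2:-/sys}
    dev="$root/bus/pci/devices/$addr"

    # Devices without SR-IOV support do not have these files at all.
    if [ ! -r "$dev/sriov_numvfs" ]; then
        echo "$addr: no SR-IOV attributes in sysfs"
        return 1
    fi

    num=$(cat "$dev/sriov_numvfs")
    total=$(cat "$dev/sriov_totalvfs")
    echo "$addr: $num of $total virtual functions enabled"
    [ "$num" -gt 0 ]
}

# If this reports 0, the helper shipped with the vGPU host driver should
# enable them (please verify the exact invocation against the wiki page):
#   /usr/lib/nvidia/sriov-manage -e ALL
```

Once the virtual functions exist, mdev_supported_types should show up under their PCI addresses instead of under the physical function.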

Could you please post the whole, unfiltered dmesg? Possibly not all relevant messages contain 'nvidia' ;)
 