[TUTORIAL] NVIDIA vGPU on Proxmox VE 7.x

OK, so the module builds fine and only fails to load. Do you have the nouveau module loaded? What kind of card is it, and what's the rest of the hardware?
lspci -k output for the GPU is:

07:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev ff)
Kernel modules: nvidiafb, nouveau

The hardware is an HP ProLiant DL360p rack server (I can provide full specs if needed).

Does this mean I have the nvidiafb and nouveau modules loaded? If yes, how do I remove them, given that they are already on my blacklist?
 
No, that only means those modules could be used for the card. If one were actually loaded and in use, the output would also contain a line like:

Code:
Kernel driver in use: nvidia

07:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev ff)
That card is not officially supported for vGPU, and I really can't help with unofficial drivers that unlock vGPU on consumer hardware, sorry.
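
To tell the two cases apart in a script, one option is to look for the "Kernel driver in use:" line in the `lspci -k` output. This is only a minimal sketch: the function name `driver_in_use` is made up for illustration, and it expects the `lspci -k` text for a single device on standard input.

```shell
#!/bin/sh
# driver_in_use: read `lspci -k` output for one device on stdin and print the
# bound driver, or "none" if there is no "Kernel driver in use:" line.
# (Hypothetical helper name, not part of pciutils.)
driver_in_use() {
    awk -F': ' '
        /Kernel driver in use:/ { print $2; found = 1; exit }
        END { if (!found) print "none" }
    '
}

# Example on a real host: lspci -k -s 07:00.0 | driver_in_use
```

Fed the output posted above (which only has a "Kernel modules:" line), it reports that no driver is bound.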
 
Ah
Ok, thanks for your time.
 
I want to try something else: GPU passthrough.

My VM won't start if I select my NVIDIA GPU.

Here is the error:

swtpm_setup: Not overwriting existing state file.
kvm: -device vfio-pci,host=0000:07:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on: vfio 0000:07:00.0: failed to setup container for group 35: No available IOMMU models
stopping swtpm instance (pid 365327) due to QEMU startup error
TASK ERROR: start failed: QEMU exited with code 1

dmesg | grep -e DMAR -e IOMMU
[ 0.008813] ACPI: DMAR 0x00000000BDDAD200 00060C (v01 HP ProLiant 00000001 \xd2? 0000162E)
[ 0.008882] ACPI: Reserving DMAR table memory at [mem 0xbddad200-0xbddad80b]
[ 0.336976] DMAR: IOMMU enabled
[ 0.806763] DMAR: Host address width 46
[ 0.806765] DMAR: DRHD base: 0x000000fbefe000 flags: 0x0
[ 0.806772] DMAR: dmar0: reg_base_addr fbefe000 ver 1:0 cap d2078c106f0466 ecap f020de
[ 0.806775] DMAR: DRHD base: 0x000000dbffe000 flags: 0x1
[ 0.806779] DMAR: dmar1: reg_base_addr dbffe000 ver 1:0 cap d2078c106f0466 ecap f020de
[ 0.806781] DMAR: RMRR base: 0x000000bdffd000 end: 0x000000bdffffff
[ 0.806783] DMAR: RMRR base: 0x000000bdff6000 end: 0x000000bdffcfff
[ 0.806785] DMAR: RMRR base: 0x000000bdf83000 end: 0x000000bdf84fff
[ 0.806786] DMAR: RMRR base: 0x000000bdf7f000 end: 0x000000bdf82fff
[ 0.806787] DMAR: RMRR base: 0x000000bdf6f000 end: 0x000000bdf7efff
[ 0.806791] DMAR: RMRR base: 0x000000bdf6e000 end: 0x000000bdf6efff
[ 0.806792] DMAR: RMRR base: 0x000000000f4000 end: 0x000000000f4fff
[ 0.806794] DMAR: RMRR base: 0x000000000e8000 end: 0x000000000e8fff
[ 0.806795] DMAR: [Firmware Bug]: No firmware reserved region can cover this RMRR [0x00000000000e8000-0x00000000000e8fff], contact BIOS vendor for fixes
[ 0.806843] DMAR: [Firmware Bug]: Your BIOS is broken; bad RMRR [0x00000000000e8000-0x00000000000e8fff]
[ 0.806845] DMAR: RMRR base: 0x000000bddde000 end: 0x000000bdddefff
[ 0.806846] DMAR: ATSR flags: 0x0
[ 0.806850] DMAR-IR: IOAPIC id 10 under DRHD base 0xfbefe000 IOMMU 0
[ 0.806852] DMAR-IR: IOAPIC id 8 under DRHD base 0xdbffe000 IOMMU 1
[ 0.806853] DMAR-IR: IOAPIC id 0 under DRHD base 0xdbffe000 IOMMU 1
[ 0.806855] DMAR-IR: HPET id 0 under DRHD base 0xdbffe000
[ 0.806856] DMAR-IR: x2apic is disabled because BIOS sets x2apic opt out bit.
[ 0.806857] DMAR-IR: Use 'intremap=no_x2apic_optout' to override the BIOS setting.
[ 0.807636] DMAR-IR: Enabled IRQ remapping in xapic mode
[ 1.275575] DMAR: No SATC found
[ 1.275580] DMAR: dmar0: Using Queued invalidation
[ 1.275590] DMAR: dmar1: Using Queued invalidation
[ 1.281340] DMAR: Intel(R) Virtualization Technology for Directed I/O


Any idea?
 
Can you post your current VM config, the complete dmesg output, and your IOMMU groups? (Maybe best in a new thread, as it no longer has anything to do with the original topic.)
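
For the IOMMU groups, a common approach is to walk `/sys/kernel/iommu_groups`, where the kernel lists each group and the PCI devices in it. A minimal sketch, assuming that standard sysfs layout; the function name `list_iommu_groups` and its optional directory argument are hypothetical and only there so the function can be tried on a mock tree.

```shell
#!/bin/sh
# list_iommu_groups: print "group <n>: <pci address>" for every device that
# sysfs lists under an IOMMU group, sorted by group number.
# $1 = directory to scan (defaults to /sys/kernel/iommu_groups).
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue          # glob matched nothing
        group="${dev%/devices/*}"          # strip "/devices/<addr>"
        group="${group##*/}"               # keep only the group number
        printf 'group %s: %s\n' "$group" "${dev##*/}"
    done | sort -k2,2n
}

# Example on a real host: list_iommu_groups
```

If this prints nothing on the host, no IOMMU groups were created, which is worth mentioning when posting the output.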
 
Hi, I need some help here, please. I am currently running PVE 7.4-13 with kernel 5.15.107-2 and an RTX A5000.

I downloaded and installed the NVIDIA-Linux-x86_64-525.105.14-vgpu-kvm.run driver. Everything installed correctly.

Display mode is disabled.
Code:
# nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Sun Jun 11 20:54:38 2023
Driver Version                            : 525.105.14
CUDA Version                              : Not Found
vGPU Driver Capability
        Heterogenous Multi-vGPU           : Supported

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA RTX A5000
    Product Brand                         : NVIDIA RTX
    Product Architecture                  : Ampere
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    vGPU Device Capability
        Fractional Multi-vGPU             : Supported
        Heterogeneous Time-Slice Profiles : Supported
        Heterogeneous Time-Slice Sizes    : Not Supported
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Enabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 9999999999999
    GPU UUID                              : GPU-99999999-9999-9999-9999-999999999999
    Minor Number                          : 0
    VBIOS Version                         : 94.02.6D.00.0D
    MultiGPU Board                        : No
    Board ID                              : 0x100
    Board Part Number                     : 900-5G132-1700-000
    GPU Part Number                       : 2231-850-A1
    Module ID                             : 1
    Inforom Version
        Image Version                     : G132.0500.00.01
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Host VGPU
        Host VGPU Mode                    : SR-IOV
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x223110DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x147E10DE
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
                Device Current            : 1
                Device Max                : 4
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : 30 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 24564 MiB
        Reserved                          : 316 MiB
        Used                              : 0 MiB
        Free                              : 24248 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 1 MiB
        Free                              : 255 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Disabled
        Pending                           : Disabled
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 192 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 56 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 90 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 36.81 W
        Power Limit                       : 230.00 W
        Default Power Limit               : 230.00 W
        Enforced Power Limit              : 230.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 230.00 W
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : 1695 MHz
        Memory                            : 8001 MHz
    Default Applications Clocks
        Graphics                          : 1695 MHz
        Memory                            : 8001 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 8001 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 656.250 mV
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes                             : None

Code:
# dmesg | grep -e DMAR -e IOMMU
[    0.018196] ACPI: DMAR 0x0000000074803000 000088 (v02 INTEL  EDK2     00000002      01000013)
[    0.018245] ACPI: Reserving DMAR table memory at [mem 0x74803000-0x74803087]
[    0.152655] DMAR: IOMMU enabled
[    0.329661] DMAR: Host address width 39
[    0.329662] DMAR: DRHD base: 0x000000fed90000 flags: 0x0
[    0.329666] DMAR: dmar0: reg_base_addr fed90000 ver 4:0 cap 1c0000c40660462 ecap 29a00f0505e
[    0.329668] DMAR: DRHD base: 0x000000fed91000 flags: 0x1
[    0.329670] DMAR: dmar1: reg_base_addr fed91000 ver 5:0 cap d2008c40660462 ecap f050da
[    0.329673] DMAR: RMRR base: 0x0000007c000000 end: 0x000000803fffff
[    0.329675] DMAR-IR: IOAPIC id 2 under DRHD base  0xfed91000 IOMMU 1
[    0.329677] DMAR-IR: HPET id 0 under DRHD base 0xfed91000
[    0.329678] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[    0.331239] DMAR-IR: Enabled IRQ remapping in x2apic mode
[    0.493174] pci 0000:00:02.0: DMAR: Skip IOMMU disabling for graphics
[    0.563599] DMAR: No ATSR found
[    0.563599] DMAR: No SATC found
[    0.563601] DMAR: IOMMU feature fl1gp_support inconsistent
[    0.563602] DMAR: IOMMU feature pgsel_inv inconsistent
[    0.563603] DMAR: IOMMU feature nwfs inconsistent
[    0.563604] DMAR: IOMMU feature dit inconsistent
[    0.563605] DMAR: IOMMU feature sc_support inconsistent
[    0.563605] DMAR: IOMMU feature dev_iotlb_support inconsistent
[    0.563606] DMAR: dmar0: Using Queued invalidation
[    0.563609] DMAR: dmar1: Using Queued invalidation
[    0.564263] DMAR: Intel(R) Virtualization Technology for Directed I/O

Code:
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.14   Driver Version: 525.105.14   CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    On   | 00000000:01:00.0 Off |                  Off |
| 30%   56C    P8    37W / 230W |      0MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                        
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Code:
# nvidia-smi vgpu
Sun Jun 11 20:33:17 2023    
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.14             Driver Version: 525.105.14                |
|---------------------------------+------------------------------+------------+
| GPU  Name                       | Bus-Id                       | GPU-Util   |
|      vGPU ID     Name           | VM ID     VM Name            | vGPU-Util  |
|=================================+==============================+============|
|   0  NVIDIA RTX A5000           | 00000000:01:00.0             |   0%       |
+---------------------------------+------------------------------+------------+

However, regardless of whether I run
Code:
/usr/lib/nvidia/sriov-manage -e ALL

or not, I still don't get the 24 virtual functions in addition to the physical card (01:00.0). Instead I get the following only:

Code:
# lspci -d 10de:
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2231 (rev a1)
01:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)

What step did I miss or what did I do wrong? Any pointers would be much appreciated.
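
One way to check what the kernel itself reports is to read the standard SR-IOV attributes `sriov_totalvfs` and `sriov_numvfs` from sysfs. A rough sketch only: the function name `sriov_status` is hypothetical, the second argument exists just for testing, and the address in the example is the one from the lspci output above.

```shell
#!/bin/sh
# sriov_status: report how many virtual functions a PCI device has enabled.
# $1 = PCI address (e.g. 0000:01:00.0)
# $2 = optional sysfs device root (defaults to /sys/bus/pci/devices)
sriov_status() {
    dev_dir="${2:-/sys/bus/pci/devices}/$1"
    if [ ! -r "$dev_dir/sriov_totalvfs" ]; then
        # The attribute only exists for devices that expose SR-IOV.
        echo "$1: no SR-IOV capability exposed"
        return 1
    fi
    total=$(cat "$dev_dir/sriov_totalvfs")
    current=$(cat "$dev_dir/sriov_numvfs")
    echo "$1: $current of $total virtual functions enabled"
}

# Example on a real host: sriov_status 0000:01:00.0
```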
 
The fact that you still get an audio device means that the display mode is not set correctly.

Interestingly, when I run 'nvidia-smi -q' here, it shows:

Code:
...
GPU 00000000:01:00.0                                        
    Product Name                          : NVIDIA RTX A5000
    Product Brand                         : NVIDIA          
    Product Architecture                  : Ampere          
    Display Mode                          : Enabled         
    Display Active                        : Disabled        
    Persistence Mode                      : Enabled         
    vGPU Device Capability                                  
        Fractional Multi-vGPU             : Supported       
        Heterogeneous Time-Slice Profiles : Supported       
        Heterogeneous Time-Slice Sizes    : Not Supported   
...

So maybe the check in their tool is buggy and reversed?
 
Thanks. I figured out how to get it running at last. For those who are interested, here are the steps I took:

1/ Uninstall the nvidia drivers from Proxmox and reboot Proxmox
Code:
# ./NVIDIA-Linux-x86_64-525.105.14-vgpu-kvm.run --uninstall
(replace the version number with yours)

2/ Run displaymodeselector, choose "physical_display_disabled" to disable the physical display mode, then reboot Proxmox
Code:
./displaymodeselector --gpumode

3/ Install nvidia driver again and reboot
Code:
# ./NVIDIA-Linux-x86_64-525.105.14-vgpu-kvm.run --dkms
(replace the version number with yours)

4/ Follow install guide and enable SR-IOV

5/ Check that the driver is installed correctly: run "lspci -d 10de:" and you should get the following (yours may differ slightly depending on the card you have).
Code:
# lspci -d 10de:
01:00.0 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:00.4 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:00.5 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:00.6 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:00.7 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:01.0 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:01.1 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:01.2 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:01.3 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:01.4 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:01.5 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:01.6 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:01.7 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:02.0 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:02.1 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:02.2 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:02.3 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:02.4 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:02.5 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:02.6 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:02.7 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:03.0 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:03.1 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:03.2 3D controller: NVIDIA Corporation Device 2231 (rev a1)
01:03.3 3D controller: NVIDIA Corporation Device 2231 (rev a1)

6/ Just follow the guide for the VM set up and the rest.
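
The check in step 5 can also be scripted by counting the virtual functions in the `lspci -d 10de:` output. This is a sketch under assumptions: the function name `count_nvidia_vfs` is made up, and it assumes a single GPU whose physical function sits at `<bus>:00.0`, as in the listing above.

```shell
#!/bin/sh
# count_nvidia_vfs: read `lspci -d 10de:` output on stdin and count the lines
# classed as "3D controller", skipping function 0 of slot 00 (the physical
# card). Prints 0 when no virtual functions are listed.
count_nvidia_vfs() {
    awk '$1 !~ /:00\.0$/ && /3D controller/ { n++ } END { print n + 0 }'
}

# Example on a real host: lspci -d 10de: | count_nvidia_vfs
```

For the listing in step 5 this gives 24, while output that only shows the VGA controller and the audio device gives 0.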
 
Hi there, I just upgraded to Proxmox 8. Everything upgraded successfully. I ran the script once again to set up my A5000 vGPU.

Interestingly enough, I ran into an issue: every time the system reboots, it does not run
Code:
nvidia-sriov.service

However, things work again once I log in after each reboot and re-run:
Code:
systemctl enable --now nvidia-sriov.service
I checked my code and it is 100% the same as the guide (in fact I didn't have to change it since my 7 to 8 upgrade).

In that case, that's just as good as running
Code:
/usr/lib/nvidia/sriov-manage -e ALL

Has anyone else run into the same issue? If so, how did you overcome or fix it?
 
Tell me how to fix this error or how to disable this notification.
[nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: vGPU migration disabled

Everything works fine, vgpu profiles all work, but this error annoys me.

 
I have an A10 GPU, and this is what it shows me under Mdev. Can someone suggest what has to be done?

View attachment 54930
Hi, what exactly is the issue here? Just select a virtual function (one of those with 'yes' in the last column) and then select an mdev type.
 
Tell me how to fix this error or how to disable this notification.
[nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: vGPU migration disabled

Everything works fine, vgpu profiles all work, but this error annoys me.

That comes from the NVIDIA driver: if it detects that it cannot enable live migration, it logs that message. AFAIK you cannot disable this warning. AFAIR it should not appear on a recent kernel (e.g. 6.2).
 
I have PVE version 7.4-16.
I have had this error ever since I installed PVE 7.1.0.
 
Hi, what exactly is the issue here? Just select a virtual function (one of those with 'yes' in the last column) and then select an mdev type.
So I have selected a profile that creates a 2 GB vGPU, i.e. NVidia-787, and the profile supports a maximum of 12 instances of 2 GB each. But the PCI device list shows 24+ virtual functions. How am I supposed to know which of them support those 12 available instances?
 
When you select an mdev profile, the GUI tells you how many instances are still available for creation.

Alternatively, in PVE 8 you can use the new PCI mapping feature for exactly that: you create a new mapping containing all virtual functions and select a profile on the VM. On VM start, the first virtual function where that profile is available is used (or the start is aborted with an error).
 
