[SOLVED] Sharing Nvidia Quadro RTX 8000 to Multiple VMs in Proxmox

wastedolphine

Hi,

I have an NVIDIA Quadro RTX 8000 48 GB card.
I want to use it with Proxmox to share the same GPU across up to 4 Ubuntu virtual machines using vGPU profiles.

I installed the NVIDIA vGPU driver on the Proxmox host and it is working fine (nvidia-smi shows the card). I downloaded the evaluation vGPU drivers from nvid.nvidia.com.
The vGPU profiles show up correctly when adding a PCIe device to a VM.

I was also able to install the guest driver on the Ubuntu VM. The installation completed, but running nvidia-smi inside the guest gives an error communicating with the device.

Has anybody got this, or a similar setup, working? Any pointers would be appreciated.
Is this setup even possible with Proxmox?

NVIDIA officially supports this, but only with VMware vSphere or Citrix Hypervisor. I am not familiar with those hypervisors, so I wanted to give it a try with Proxmox.

My config:
Dell R750
2x Intel Xeon E5 (64 cores total)
196 GB RAM
Nvidia Quadro RTX 8000 48 GB - Slot 2
Nvidia A100 80 GB - Slot 1
Latest Proxmox with Latest Kernel

Thanks in advance.
 
Can you post the logs from the guest & host? Maybe there's a hint in there...

EDIT: the VM config would also be interesting: 'qm config ID'
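
For what it's worth, a sketch of how that can be collected (the VM ID and the grep pattern are assumptions here; adjust to your setup): run `qm config <ID>` on the host for the config, and filter the host journal down to the vGPU manager's error/warning lines, which usually carry the hint:

```shell
# Hypothetical helper: reduce a journal/syslog stream to the
# nvidia-vgpu-mgr error/warning lines. Feed it with e.g.:
#   journalctl -b | filter_vgpu_errors
filter_vgpu_errors() {
    grep -E 'nvidia-vgpu-mgr\[[0-9]+\]: (error|warn)'
}

# Demo on a captured line from such a log (notice lines are dropped):
echo 'Oct 31 14:43:48 pve nvidia-vgpu-mgr[604651]: error: vmiop_log: (0x0): VGPU message 1 failed, result code: 0x6a' \
    | filter_vgpu_errors
```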
 
Hi,

Sorry about the delay.
Here are the host system logs. Please let me know which logs to share from the guest OS.


Code:
root@pve:~# nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Mon Oct 31 22:25:51 2022
Driver Version                            : 460.73.01
CUDA Version                              : 11.2

Attached GPUs                             : 1
GPU 00000000:18:00.0
    Product Name                          : Quadro RTX 8000
    Product Brand                         : NVIDIA
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Enabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1323021078177
    GPU UUID                              : GPU-514545d9-1885-e6aa-99f1-ed64f96d2510
    Minor Number                          : 0
    VBIOS Version                         : 90.02.4A.00.11
    MultiGPU Board                        : No
    Board ID                              : 0x1800
    GPU Part Number                       : 900-5G150-2700-001
    Inforom Version
        Image Version                     : G150.0500.00.03
        OEM Object                        : 1.1
        ECC Object                        : 5.0
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Host VGPU
        Host VGPU Mode                    : Non SR-IOV
    vGPU Software Licensed Product
        Product Name                      : NVIDIA RTX Virtual Workstation
        License Status                    : Unlicensed
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x18
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1E3010DE
        Bus Id                            : 00000000:18:00.0
        Sub System Id                     : 0x129E10DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 33 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 49143 MiB
        Used                              : 313 MiB
        Free                              : 48830 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Disabled
        Pending                           : Disabled
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : 0
        Double Bit ECC                    : 0
        Pending Page Blacklist            : No
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 40 C
        GPU Shutdown Temp                 : 94 C
        GPU Slowdown Temp                 : 91 C
        GPU Max Operating Temp            : 89 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 39.92 W
        Power Limit                       : 260.00 W
        Default Power Limit               : 260.00 W
        Enforced Power Limit              : 260.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 260.00 W
    Clocks
        Graphics                          : 300 MHz
        SM                                : 300 MHz
        Memory                            : 405 MHz
        Video                             : 540 MHz
    Applications Clocks
        Graphics                          : 1395 MHz
        Memory                            : 7001 MHz
    Default Applications Clocks
        Graphics                          : 1395 MHz
        Memory                            : 7001 MHz
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 7001 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None

Code:
root@pve:~# nvidia-smi
Mon Oct 31 22:27:33 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:18:00.0 Off |                  Off |
| 33%   40C    P8    40W / 260W |    313MiB / 49143MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
 
Code:
less /var/log/syslog

Oct 30 16:24:11 pve systemd[1]: Starting Daily apt download activities...
Oct 30 16:24:11 pve systemd[1]: apt-daily.service: Succeeded.
Oct 30 16:24:11 pve systemd[1]: Finished Daily apt download activities.
Oct 30 16:48:18 pve kernel: [173437.500346] perf: interrupt took too long (3913 > 3911), lowering kernel.perf_event_max_sample_rate to 51000
Oct 30 16:55:01 pve systemd[1]: Starting Cleanup of Temporary Directories...
Oct 30 16:55:01 pve systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
Oct 30 16:55:01 pve systemd[1]: Finished Cleanup of Temporary Directories.
Oct 30 17:17:01 pve CRON[426631]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Oct 30 18:17:02 pve CRON[436051]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Oct 30 19:07:51 pve smartd[1393]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 73 to 74
Oct 30 19:17:01 pve CRON[445480]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Oct 30 20:17:01 pve CRON[454794]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Oct 30 20:39:42 pve pvedaemon[242158]: <root@pam> successful auth for user 'root@pam'
Oct 30 20:39:48 pve pvedaemon[458412]: stop VM 101: UPID:pve:0006FEAC:011DD73C:635E93BC:qmstop:101:root@pam:
Oct 30 20:39:48 pve pvedaemon[242158]: <root@pam> starting task UPID:pve:0006FEAC:011DD73C:635E93BC:qmstop:101:root@pam:
Oct 30 20:39:48 pve kernel: [187327.921107] fwbr101i0: port 2(tap101i0) entered disabled state
Oct 30 20:39:48 pve kernel: [187327.944850] fwbr101i0: port 1(fwln101i0) entered disabled state
Oct 30 20:39:48 pve kernel: [187327.944901] vmbr0: port 2(fwpr101p0) entered disabled state
Oct 30 20:39:48 pve kernel: [187327.945235] device fwln101i0 left promiscuous mode
Oct 30 20:39:48 pve kernel: [187327.945239] fwbr101i0: port 1(fwln101i0) entered disabled state
Oct 30 20:39:48 pve kernel: [187327.966272] device fwpr101p0 left promiscuous mode
Oct 30 20:39:48 pve kernel: [187327.966275] vmbr0: port 2(fwpr101p0) entered disabled state
Oct 30 20:39:48 pve qmeventd[1388]: read: Connection reset by peer
Oct 30 20:39:48 pve systemd[1]: 101.scope: Succeeded.
Oct 30 20:39:48 pve systemd[1]: 101.scope: Consumed 1d 9h 5min 47.663s CPU time.
Oct 30 20:39:49 pve kernel: [187328.621929] vfio_mdev 00000000-0000-0000-0000-000000000101: Removing from iommu group 148
Oct 30 20:39:49 pve kernel: [187328.621948] vfio_mdev 00000000-0000-0000-0000-000000000101: MDEV: detaching iommu
Oct 30 20:39:49 pve pvedaemon[242158]: <root@pam> end task UPID:pve:0006FEAC:011DD73C:635E93BC:qmstop:101:root@pam: OK
Oct 30 20:39:49 pve qmeventd[458434]: Starting cleanup for 101
Oct 30 10:17:01 pve CRON[359997]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Oct 30 11:17:01 pve CRON[369733]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Oct 30 12:17:01 pve CRON[379429]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Oct 30 12:56:56 pve pvestatd[1704]: auth key pair too old, rotating..
Oct 30 13:17:01 pve CRON[388800]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Oct 30 14:17:01 pve CRON[398114]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Oct 30 15:17:01 pve CRON[407893]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Oct 30 16:17:01 pve CRON[417268]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Oct 31 14:39:18 pve nvidia-vgpu-mgr[604651]: VgpuConfig {
    vgpu_type: 259,
    vgpu_name: "GRID RTX6000-4Q",
    vgpu_class: "Quadro",
    vgpu_signature: [],
    features: "Quadro-Virtual-DWS,5.0;GRID-Virtual-WS,2.0;GRID-Virtual-WS-Ext,2.0",
    max_instances: 6,
    num_heads: 4,
    max_resolution_x: 7680,
    max_resolution_y: 4320,
    max_pixels: 58982400,
    frl_config: 60,
    cuda_enabled: 1,
    ecc_supported: 1,
    mig_instance_size: 0,
    multi_vgpu_supported: 0,
    vdev_id: 0x1e301328,
    pdev_id: 0x1e30,
    fb_length: 0xec000000,
    mappable_video_size: 0x400000,
    fb_reservation: 0x14000000,
    encoder_capacity: 0x64,
    bar1_length: 0x100,
    frl_enable: 1,
    adapter_name: "GRID RTX6000-4Q",
    adapter_name_unicode: "GRID RTX6000-4Q",
    short_gpu_name_string: "TU102GL-A",
    licensed_product_name: "NVIDIA RTX Virtual Workstation",
    vgpu_extra_params: [],
}
Oct 31 14:39:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): gpu-pci-id : 0x1800
Oct 31 14:39:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): vgpu_type : Quadro
Oct 31 14:39:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Framebuffer: 0xec000000
Oct 31 14:39:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Virtual Device Id: 0x1e30:0x1328
Oct 31 14:39:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): FRL Value: 60 FPS
Oct 31 14:39:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: ######## vGPU Manager Information: ########
Oct 31 14:39:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: Driver Version: 460.73.01
Oct 31 14:39:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): vGPU supported range: (0x70001, 0x90002)
Oct 31 14:39:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
Oct 31 14:39:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): vGPU migration enabled
Oct 31 14:39:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: display_init inst: 0 successful
Oct 31 14:39:18 pve kernel: [252097.865254] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000601: vGPU migration disabled
Oct 31 14:39:18 pve pvedaemon[601591]: <root@pam> end task UPID:pve:0009394B:0180ABD5:635F90BD:qmstart:601:root@pam: OK
Oct 31 14:39:25 pve kernel: [252104.672891] kvm [604504]: ignored rdmsr: 0x10f data 0x0
Oct 31 14:39:25 pve kernel: [252104.672924] kvm [604504]: ignored rdmsr: 0x123 data 0x0
Oct 31 14:39:25 pve kernel: [252104.672946] kvm [604504]: ignored rdmsr: 0xc0011020 data 0x0
Oct 31 14:39:26 pve pvedaemon[604682]: starting vnc proxy UPID:pve:00093A0A:0180AF4D:635F90C6:vncproxy:601:root@pam:
Oct 31 14:39:26 pve pvedaemon[242158]: <root@pam> starting task UPID:pve:00093A0A:0180AF4D:635F90C6:vncproxy:601:root@pam:
Oct 31 14:40:24 pve pvedaemon[242158]: worker exit
Oct 31 14:40:24 pve pvedaemon[1732]: worker 242158 finished
Oct 31 14:40:24 pve pvedaemon[1732]: starting 1 worker(s)
Oct 31 14:40:24 pve pvedaemon[1732]: worker 604891 started
Oct 31 14:43:48 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: ######## Guest NVIDIA Driver Information: ########
Oct 31 14:43:48 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: Driver Version: 510.85.02
Oct 31 14:43:48 pve nvidia-vgpu-mgr[604651]: error: vmiop_log: (0x0): Incompatible Guest/Host drivers: Guest VGX version is newer than the maximum version supported by the Host. Disabling vGPU.
Oct 31 14:43:48 pve nvidia-vgpu-mgr[604651]: error: vmiop_log: (0x0): VGPU message 1 failed, result code: 0x6a
Oct 31 14:43:48 pve nvidia-vgpu-mgr[604651]: error: vmiop_log: (0x0):         0x1e, 0xe, 0x100, 0x100,
Oct 31 14:43:48 pve nvidia-vgpu-mgr[604651]: error: vmiop_log: (0x0):         0x100, 0x1e17ea1, '510.85.02', 'rel/gpu_drv/r510/r513_40-519', 'Private r513_40 rel/gpu_drv/r510/r513_40-519 unknown'
Oct 31 14:43:48 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Shared memory (0x5) is already valid: new PA=0x42c10d000, current PA:0x39876b000
Oct 31 14:43:48 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Shared memory (0x6) is already valid: new PA=0x100000000, current PA:0x121cfe000
Oct 31 14:43:48 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Shared memory (0x6) is already valid: new PA=0x400000000, current PA:0x121cfe000
Oct 31 14:43:48 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Shared memory (0x6) is already valid: new PA=0x42c10c000, current PA:0x121cfe000
Oct 31 14:44:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Shared memory (0x6) is already valid: new PA=0x400000000, current PA:0x121cfe000
Oct 31 14:44:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Shared memory (0x5) is already valid: new PA=0x400000000, current PA:0x39876b000
Oct 31 14:44:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Ring (0x3) is already valid: new PA=0x400000000, current PA:0x39d075000
Oct 31 14:44:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Ring (0x4) is already valid: new PA=0x400000000, current PA:0x109f24000
Oct 31 14:44:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Ring (0x2) is already valid: new PA=0x400000000, current PA:0x15dd92000
Oct 31 14:44:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Ring (0x2) is already valid: new PA=0x400000000, current PA:0x15dd92000
Oct 31 14:44:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Ring (0x2) is already valid: new PA=0x400000000, current PA:0x15dd92000
Oct 31 14:44:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Ring (0x2) is already valid: new PA=0x42c133000, current PA:0x15dd92000
Oct 31 14:44:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Ring (0x3) is already valid: new PA=0x400000000, current PA:0x39d075000
Oct 31 14:44:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Ring (0x3) is already valid: new PA=0x400000000, current PA:0x39d075000
Oct 31 14:44:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Ring (0x3) is already valid: new PA=0x42c134000, current PA:0x39d075000
Oct 31 14:44:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Ring (0x4) is already valid: new PA=0x400000000, current PA:0x109f24000
Oct 31 14:44:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Ring (0x4) is already valid: new PA=0x400000000, current PA:0x109f24000
Oct 31 14:44:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Ring (0x4) is already valid: new PA=0x42c111000, current PA:0x109f24000
Oct 31 14:44:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Shared memory (0x5) is already valid: new PA=0x400000000, current PA:0x39876b000
Oct 31 14:44:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Shared memory (0x5) is already valid: new PA=0x400000000, current PA:0x39876b000
Oct 31 14:44:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Shared memory (0x5) is already valid: new PA=0x42c110000, current PA:0x39876b000
Oct 31 14:44:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Shared memory (0x6) is already valid: new PA=0x400000000, current PA:0x121cfe000
Oct 31 14:44:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Shared memory (0x6) is already valid: new PA=0x400000000, current PA:0x121cfe000
Oct 31 14:44:18 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Shared memory (0x6) is already valid: new PA=0x42c10e000, current PA:0x121cfe000
Oct 31 14:44:48 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Shared memory (0x6) is already valid: new PA=0x400000000, current PA:0x121cfe000
Oct 31 14:44:48 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Shared memory (0x5) is already valid: new PA=0x400000000, current PA:0x39876b000
Oct 31 14:44:48 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Ring (0x3) is already valid: new PA=0x400000000, current PA:0x39d075000
Oct 31 14:44:48 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Ring (0x4) is already valid: new PA=0x400000000, current PA:0x109f24000
Oct 31 14:44:48 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Ring (0x2) is already valid: new PA=0x400000000, current PA:0x15dd92000
Oct 31 14:44:48 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: (0x0): Ring (0x2) is already valid: new PA=0x400000000, current PA:0x15dd92000
 
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |

Oct 31 14:43:48 pve nvidia-vgpu-mgr[604651]: notice: vmiop_log: Driver Version: 510.85.02
Oct 31 14:43:48 pve nvidia-vgpu-mgr[604651]: error: vmiop_log: (0x0): Incompatible Guest/Host drivers: Guest VGX version is newer than the maximum version supported by the Host. Disabling vGPU.

It seems you are using mismatched host/guest driver versions. For vGPU, you must always keep the host and guest drivers at the versions NVIDIA ships together in the same bundle.
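
A quick sanity check can be sketched like this (the two version strings below are placeholders taken from this thread; substitute the real output of `nvidia-smi --query-gpu=driver_version --format=csv,noheader` run on the host and inside the guest). The driver branch, i.e. the first component of the version, has to agree:

```shell
# Sketch: compare host and guest NVIDIA driver branches.
host_ver='460.73.01'    # vGPU manager on the Proxmox host (placeholder)
guest_ver='510.85.02'   # grid driver inside the VM (placeholder)

# branch: first dot-separated component of a version string
branch() { printf '%s' "$1" | cut -d. -f1; }

if [ "$(branch "$host_ver")" = "$(branch "$guest_ver")" ]; then
    echo "driver branches match (r$(branch "$host_ver"))"
else
    echo "mismatch: host r$(branch "$host_ver") vs guest r$(branch "$guest_ver")"
fi
# -> mismatch: host r460 vs guest r510
```

With the versions above this reports a mismatch, which is exactly the situation the `Incompatible Guest/Host drivers` log line describes.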
 
Hi,

I installed the latest version on the host as well as in the guest Ubuntu VM, but I am still not able to get nvidia-smi working. The Ubuntu VM is running the driver that came bundled with the host package, i.e. 'NVIDIA-Linux-x86_64-510.85.02-grid.run'.
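
One thing worth double-checking inside the guest is which driver the kernel actually loaded, since an older install can linger after upgrading the .run package; `cat /proc/driver/nvidia/version` reports the live module. A small sketch for pulling the version out of that banner (the sample line is constructed from the versions in this thread, not real output):

```shell
# Extract the loaded driver version from the NVRM banner line, as
# printed by:  cat /proc/driver/nvidia/version
nvrm_version() {
    sed -n 's/.*Kernel Module *\([0-9][0-9.]*\).*/\1/p'
}

# Sample banner line (versions from this thread; yours will differ):
echo 'NVRM version: NVIDIA UNIX x86_64 Kernel Module  510.85.02  Release Build' \
    | nvrm_version
```

If the version printed here does not match the guest driver that came with the host bundle, the old module is still loaded and a reboot (or module reload) is needed.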

(Three screenshots attached, taken 2022-11-03 at 4:51 AM.)

Code:
-- Journal begins at Sat 2022-10-08 12:55:53 IST, ends at Thu 2022-11-03 04:50:55 IST. --
Nov 03 02:32:44 pve kernel: #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23
Nov 03 02:32:44 pve kernel: MMIO Stale Data CPU bug present and SMT on, data leak possible. See https://www.kernel.>
Nov 03 02:32:44 pve kernel: #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35
Nov 03 02:32:44 pve kernel: ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
Nov 03 02:32:44 pve kernel: device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements >
Nov 03 02:32:44 pve kernel: platform eisa.0: EISA: Cannot allocate resource for mainboard
Nov 03 02:32:44 pve kernel: platform eisa.0: Cannot allocate resource for EISA slot 1
Nov 03 02:32:44 pve kernel: platform eisa.0: Cannot allocate resource for EISA slot 2
Nov 03 02:32:44 pve kernel: platform eisa.0: Cannot allocate resource for EISA slot 3
Nov 03 02:32:44 pve kernel: platform eisa.0: Cannot allocate resource for EISA slot 4
Nov 03 02:32:44 pve kernel: platform eisa.0: Cannot allocate resource for EISA slot 5
Nov 03 02:32:44 pve kernel: platform eisa.0: Cannot allocate resource for EISA slot 6
Nov 03 02:32:44 pve kernel: platform eisa.0: Cannot allocate resource for EISA slot 7
Nov 03 02:32:44 pve kernel: platform eisa.0: Cannot allocate resource for EISA slot 8
Nov 03 02:32:44 pve kernel: acpi PNP0C14:04: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instanc>
Nov 03 02:32:44 pve kernel: lpc_ich 0000:00:1f.0: No MFD cells added
Nov 03 02:32:44 pve kernel: i2c i2c-1: Systems with more than 4 memory slots not supported yet, not instantiating S>
Nov 03 02:32:44 pve kernel: spl: loading out-of-tree module taints kernel.
Nov 03 02:32:44 pve kernel: znvpair: module license 'CDDL' taints kernel.
Nov 03 02:32:44 pve kernel: Disabling lock debugging due to kernel taint
Nov 03 02:32:44 pve kernel: pstore: ignoring unexpected backend 'efi'
Nov 03 02:32:44 pve kernel:
Nov 03 02:32:45 pve kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 510.85.03 Thu Jul 21 17:03:06 UTC 2022
NVRM: GPU at 0000:18:00.0 has software scheduler DISABLED with policy BEST_EFFORT.
Nov 03 02:32:47 pve kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PC>
Nov 03 02:32:47 pve kernel: caller os_map_kernel_space.part.0+0x8e/0xc0 [nvidia] mapping multiple BARs
Nov 03 02:41:44 pve kernel: [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: vGPU migration disabled
Nov 03 02:41:48 pve kernel: [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000501: vGPU migration disabled
Nov 03 02:41:59 pve kernel: kvm [3116]: ignored rdmsr: 0x4e data 0x0
Nov 03 02:41:59 pve kernel: kvm [3116]: ignored wrmsr: 0x4e data 0x2
Nov 03 02:41:59 pve kernel: kvm [3116]: ignored rdmsr: 0x4e data 0x0
Nov 03 02:41:59 pve kernel: kvm [3116]: ignored rdmsr: 0x1c9 data 0x0
Nov 03 02:41:59 pve kernel: kvm [3116]: ignored wrmsr: 0x1c9 data 0x3
Nov 03 02:41:59 pve kernel: kvm [3116]: ignored rdmsr: 0x1c9 data 0x0
Nov 03 02:41:59 pve kernel: kvm [3116]: ignored rdmsr: 0x1a6 data 0x0
Nov 03 02:41:59 pve kernel: kvm [3116]: ignored wrmsr: 0x1a6 data 0x11
Nov 03 02:41:59 pve kernel: kvm [3116]: ignored rdmsr: 0x1a6 data 0x0
Nov 03 02:41:59 pve kernel: kvm [3116]: ignored rdmsr: 0x1a7 data 0x0
Nov 03 02:42:25 pve nvidia-vgpu-mgr[3094]: error: vmiop_log: (0x0): Incompatible Guest/Host drivers: Guest VGX vers>
Nov 03 02:42:25 pve nvidia-vgpu-mgr[3094]: error: vmiop_log: (0x0): VGPU message 1 failed, result code: 0x6a
Nov 03 02:42:25 pve nvidia-vgpu-mgr[3094]: error: vmiop_log: (0x0): 0x1f, 0xf, 0x100, 0x100,
Nov 03 02:42:25 pve nvidia-vgpu-mgr[3094]: error: vmiop_log: (0x0): 0x100, 0x0, '517.40', 'r515_00-323', 'D>
Nov 03 02:42:25 pve nvidia-vgpu-mgr[3094]: error: vmiop_log: (0x0): VGPU message 47 failed, result code: 0xff100002
Nov 03 02:48:54 pve kernel: [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000501: vGPU migration disabled
Nov 03 02:49:05 pve kernel: kvm_msr_ignored_check: 108 callbacks suppressed
Nov 03 02:49:05 pve kernel: kvm [4430]: ignored rdmsr: 0x4e data 0x0
Nov 03 02:49:05 pve kernel: kvm [4430]: ignored wrmsr: 0x4e data 0x2
Nov 03 02:49:05 pve kernel: kvm [4430]: ignored rdmsr: 0x4e data 0x0
Nov 03 02:49:05 pve kernel: kvm [4430]: ignored rdmsr: 0x1c9 data 0x0
Nov 03 02:49:05 pve kernel: kvm [4430]: ignored wrmsr: 0x1c9 data 0x3
Nov 03 02:49:05 pve kernel: kvm [4430]: ignored rdmsr: 0x1c9 data 0x0
Nov 03 02:49:05 pve kernel: kvm [4430]: ignored rdmsr: 0x1a6 data 0x0
Nov 03 02:49:05 pve kernel: kvm [4430]: ignored wrmsr: 0x1a6 data 0x11
Nov 03 02:49:05 pve kernel: kvm [4430]: ignored rdmsr: 0x1a6 data 0x0
Nov 03 02:49:05 pve kernel: kvm [4430]: ignored rdmsr: 0x1a7 data 0x0
Nov 03 02:49:10 pve kernel: kvm_msr_ignored_check: 85 callbacks suppressed
Nov 03 02:49:10 pve kernel: kvm [4430]: ignored rdmsr: 0xb0 data 0x0
Nov 03 02:49:10 pve kernel: kvm [4430]: ignored rdmsr: 0xb1 data 0x0
Nov 03 02:49:10 pve kernel: kvm [4430]: ignored rdmsr: 0x128 data 0x0
Nov 03 02:49:10 pve kernel: kvm [4430]: ignored rdmsr: 0x53 data 0x0
Nov 03 02:49:10 pve kernel: kvm [4430]: ignored rdmsr: 0x128 data 0x0
Nov 03 02:49:10 pve kernel: kvm [4430]: ignored rdmsr: 0x53 data 0x0
Nov 03 02:49:10 pve kernel: kvm [4430]: ignored rdmsr: 0x128 data 0x0
Nov 03 02:49:10 pve kernel: kvm [4430]: ignored rdmsr: 0x53 data 0x0
Nov 03 02:49:10 pve kernel: kvm [4430]: ignored rdmsr: 0x128 data 0x0
Nov 03 02:49:10 pve kernel: kvm [4430]: ignored rdmsr: 0x53 data 0x0
Nov 03 02:52:00 pve nvidia-vgpu-mgr[3094]: error: vmiop_log: (0x0): RING_SIZE > 4K is not supported
Nov 03 02:52:00 pve nvidia-vgpu-mgr[3094]: error: vmiop_log: (0x0): RING_SIZE > 4K is not supported
Nov 03 02:52:00 pve nvidia-vgpu-mgr[3094]: error: vmiop_log: (0x0): RING_SIZE > 4K is not supported
Nov 03 02:52:00 pve nvidia-vgpu-mgr[3094]: error: vmiop_log: (0x0): RING_SIZE > 4K is not supported
Nov 03 02:52:00 pve nvidia-vgpu-mgr[3094]: error: vmiop_log: (0x0): RING_SIZE > 4K is not supported
Nov 03 02:52:00 pve nvidia-vgpu-mgr[3094]: error: vmiop_log: (0x0): RING_SIZE > 4K is not supported
Nov 03 02:52:00 pve nvidia-vgpu-mgr[3094]: error: vmiop_log: (0x0): RING_SIZE > 4K is not supported
Nov 03 02:52:00 pve nvidia-vgpu-mgr[3094]: error: vmiop_log: (0x0): RING_SIZE > 4K is not supported
Nov 03 02:52:00 pve nvidia-vgpu-mgr[3094]: error: vmiop_log: (0x0): RING_SIZE > 4K is not supported


Thank you.
 
Not sure which logs belong to which VM (it seems you started VM 100 and VM 501), but there is still a driver problem:

Nov 03 02:42:25 pve nvidia-vgpu-mgr[3094]: error: vmiop_log: (0x0): Incompatible Guest/Host drivers: Guest VGX vers>
Nov 03 02:42:25 pve nvidia-vgpu-mgr[3094]: error: vmiop_log: (0x0): VGPU message 1 failed, result code: 0x6a
Nov 03 02:42:25 pve nvidia-vgpu-mgr[3094]: error: vmiop_log: (0x0): 0x1f, 0xf, 0x100, 0x100,
Nov 03 02:42:25 pve nvidia-vgpu-mgr[3094]: error: vmiop_log: (0x0): 0x100, 0x0, '517.40', 'r515_00-323', 'D>
Nov 03 02:42:25 pve nvidia-vgpu-mgr[3094]: error: vmiop_log: (0x0): VGPU message 47 failed, result code: 0xff100002
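The "Incompatible Guest/Host drivers" message usually means the guest GRID driver comes from a different release branch than the host vGPU manager (both report their version via `nvidia-smi`). A minimal sketch of the comparison — the guest version used here is hypothetical, only the host version comes from the journal above:

```shell
#!/bin/sh
# The release branch is the major component of the driver version string,
# e.g. 510.85.03 -> r510. The guest GRID driver must come from the same
# branch as the host vGPU manager.
branch() { printf '%s\n' "$1" | cut -d. -f1; }

host=$(branch "510.85.03")   # host vGPU manager version from the journal above
guest=$(branch "525.60.13")  # hypothetical guest driver version (assumption)
if [ "$host" = "$guest" ]; then
    echo "branches match (r$host)"
else
    echo "branch mismatch: host r$host vs guest r$guest"
    # -> branch mismatch: host r510 vs guest r525
fi
```

On real machines, `nvidia-smi --query-gpu=driver_version --format=csv,noheader` on host and guest gives the two strings to compare.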
 
Hi,

Thanks for the reply.

Here are the latest logs from a fresh boot, running only one Ubuntu 22.04 VM:


root@pve:~# journalctl -p warning -b
-- Journal begins at Sat 2022-10-08 12:55:53 IST, ends at Thu 2022-11-03 15:29:17 IST. --
Nov 03 15:24:44 pve kernel: secureboot: Secure boot could not be determined (mode 0)
Nov 03 15:24:44 pve kernel: secureboot: Secure boot could not be determined (mode 0)
Nov 03 15:24:44 pve kernel: #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23
Nov 03 15:24:44 pve kernel: MMIO Stale Data CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/processor_mmio_stale_data.html for more details.
Nov 03 15:24:44 pve kernel: #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35
Nov 03 15:24:44 pve kernel: ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
Nov 03 15:24:44 pve kernel: device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
Nov 03 15:24:44 pve kernel: platform eisa.0: EISA: Cannot allocate resource for mainboard
Nov 03 15:24:44 pve kernel: platform eisa.0: Cannot allocate resource for EISA slot 1
Nov 03 15:24:44 pve kernel: platform eisa.0: Cannot allocate resource for EISA slot 2
Nov 03 15:24:44 pve kernel: platform eisa.0: Cannot allocate resource for EISA slot 3
Nov 03 15:24:44 pve kernel: platform eisa.0: Cannot allocate resource for EISA slot 4
Nov 03 15:24:44 pve kernel: platform eisa.0: Cannot allocate resource for EISA slot 5
Nov 03 15:24:44 pve kernel: platform eisa.0: Cannot allocate resource for EISA slot 6
Nov 03 15:24:44 pve kernel: platform eisa.0: Cannot allocate resource for EISA slot 7
Nov 03 15:24:44 pve kernel: platform eisa.0: Cannot allocate resource for EISA slot 8
Nov 03 15:24:44 pve kernel: acpi PNP0C14:04: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:03)
Nov 03 15:24:44 pve kernel: lpc_ich 0000:00:1f.0: No MFD cells added
Nov 03 15:24:44 pve kernel: i2c i2c-1: Systems with more than 4 memory slots not supported yet, not instantiating SPD
Nov 03 15:24:44 pve kernel: spl: loading out-of-tree module taints kernel.
Nov 03 15:24:44 pve kernel: znvpair: module license 'CDDL' taints kernel.
Nov 03 15:24:44 pve kernel: Disabling lock debugging due to kernel taint
Nov 03 15:24:44 pve kernel: pstore: ignoring unexpected backend 'efi'
Nov 03 15:24:44 pve kernel:
Nov 03 15:24:45 pve kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 510.85.03 Thu Jul 21 17:03:06 UTC 2022
Nov 03 15:24:47 pve kernel: NVRM: GPU at 0000:18:00.0 has software scheduler DISABLED with policy BEST_EFFORT.
Nov 03 15:24:47 pve kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c4000-0x000c7fff window]
Nov 03 15:24:47 pve kernel: caller os_map_kernel_space.part.0+0x8e/0xc0 [nvidia] mapping multiple BARs
Nov 03 15:25:09 pve kernel: [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000101: vGPU migration disabled
Nov 03 15:25:18 pve kernel: kvm [1810]: ignored rdmsr: 0x10f data 0x0
Nov 03 15:25:18 pve kernel: kvm [1810]: ignored rdmsr: 0x123 data 0x0
Nov 03 15:25:18 pve kernel: kvm [1810]: ignored rdmsr: 0xc0011020 data 0x0
root@pve:~#


root@qvm1:/home/poornima# journalctl -p err -b
Nov 03 15:25:19 qvm1 kernel: shpchp 0000:05:01.0: pci_hp_register failed with error -16
Nov 03 15:25:19 qvm1 kernel: shpchp 0000:05:01.0: Slot initialization failed
Nov 03 15:25:19 qvm1 kernel: shpchp 0000:05:02.0: pci_hp_register failed with error -16
Nov 03 15:25:19 qvm1 kernel: shpchp 0000:05:02.0: Slot initialization failed
Nov 03 15:25:19 qvm1 kernel: shpchp 0000:05:03.0: pci_hp_register failed with error -16
Nov 03 15:25:20 qvm1 kernel: shpchp 0000:05:03.0: Slot initialization failed
Nov 03 15:25:20 qvm1 kernel: shpchp 0000:05:04.0: pci_hp_register failed with error -16
Nov 03 15:25:20 qvm1 kernel: shpchp 0000:05:04.0: Slot initialization failed
Nov 03 15:25:20 qvm1 kernel: snd_hda_intel 0000:00:1b.0: no codecs found!
Nov 03 15:25:21 qvm1 systemd[1]: Failed to start Process error reports when automatic reporting is enabled.
Nov 03 15:25:22 qvm1 nvidia-gridd[764]: Failed to initialise RM client
Nov 03 15:25:22 qvm1 nvidia-gridd[764]: Failed to initialise RM client
Nov 03 15:25:22 qvm1 nvidia-gridd[764]: Failed to unlock PID file: Bad file descriptor
Nov 03 15:25:22 qvm1 nvidia-gridd[764]: Failed to close PID file: Bad file descriptor
Nov 03 15:25:42 qvm1 gdm-launch-environment][843]: GLib-GObject: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
Nov 03 15:25:42 qvm1 gdm-launch-environment][891]: GLib-GObject: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
Nov 03 15:25:42 qvm1 pulseaudio[857]: Failed to find a working profile.
Nov 03 15:25:42 qvm1 pulseaudio[857]: Failed to load module "module-alsa-card" (argument: "device_id="0" name="pci-0000_00_1b.0" card_name="alsa_card.pci-0000_00_1b.0" namereg_fail=false tsched=yes fixed_latency_range=no ignore_dB=no deferred_volume=yes use_ucm=yes avoid_resamplin>
Nov 03 15:25:42 qvm1 pulseaudio[857]: Failed to find a working profile.
Nov 03 15:25:42 qvm1 pulseaudio[857]: Failed to load module "module-alsa-card" (argument: "device_id="0" name="pci-0000_00_1b.0" card_name="alsa_card.pci-0000_00_1b.0" namereg_fail=false tsched=yes fixed_latency_range=no ignore_dB=no deferred_volume=yes use_ucm=yes avoid_resamplin>
Nov 03 15:25:42 qvm1 gdm-launch-environment][937]: GLib-GObject: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
Nov 03 15:25:43 qvm1 gdm-launch-environment][1024]: GLib-GObject: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
Nov 03 15:25:43 qvm1 gdm-launch-environment][1033]: GLib-GObject: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
Nov 03 15:25:43 qvm1 gdm-launch-environment][1049]: GLib-GObject: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
Nov 03 15:25:44 qvm1 systemd[848]: Failed to start Service for snap application snapd-desktop-integration.snapd-desktop-integration.
Nov 03 15:26:14 qvm1 xrdp[1148]: [ERROR] SSL_read: I/O error
Nov 03 15:26:14 qvm1 xrdp[1148]: [ERROR] libxrdp_force_read: header read error
Nov 03 15:26:14 qvm1 xrdp[1148]: [ERROR] Processing [ITU-T T.125] Connect-Initial failed
Nov 03 15:26:14 qvm1 xrdp[1148]: [ERROR] [MCS Connection Sequence] receive connection request failed
Nov 03 15:26:14 qvm1 xrdp[1148]: [ERROR] xrdp_sec_incoming: xrdp_mcs_incoming failed
Nov 03 15:26:14 qvm1 xrdp[1148]: [ERROR] xrdp_rdp_incoming: xrdp_sec_incoming failed
Nov 03 15:26:14 qvm1 xrdp[1148]: [ERROR] xrdp_process_main_loop: libxrdp_process_incoming failed
Nov 03 15:26:14 qvm1 xrdp[1148]: [ERROR] xrdp_iso_send: trans_write_copy_s failed
Nov 03 15:26:14 qvm1 xrdp[1148]: [ERROR] Sending [ITU T.125] DisconnectProviderUltimatum failed
Nov 03 15:26:20 qvm1 xrdp-sesman[700]: [ERROR] sesman_data_in: scp_process_msg failed
Nov 03 15:26:20 qvm1 xrdp-sesman[700]: [ERROR] sesman_main_loop: trans_check_wait_objs failed, removing trans
Nov 03 15:26:21 qvm1 systemd[1152]: Failed to start Application launched by gnome-session-binary.
Nov 03 15:26:22 qvm1 systemd[1152]: Failed to start Application launched by gnome-session-binary.
Nov 03 15:26:22 qvm1 systemd[1152]: Failed to start IBus Daemon for GNOME.
Nov 03 15:26:22 qvm1 systemd[1152]: Failed to start Application launched by gnome-session-binary.
Nov 03 15:26:45 qvm1 pulseaudio[1161]: GetManagedObjects() failed: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or th>

Is there anything I can try, like a specific GPU driver version or the drivers supplied by Google Cloud, on the guest OS?
Is this related to the guest OS drivers, or am I still missing something on the Proxmox host?

Thank you so much for all the support.
 
Would it be possible for you to access my system and check what I am doing wrong? The system is on a static IP and Proxmox can be accessed directly.
If you can take a look, I can provide you the Proxmox root password.

Thank You.
 
If you want to continue here, please post the whole, non-truncated log (all messages, not only warnings/errors; sometimes the full log contains messages that help) from both the host and the guest.

Also, did you enable Above 4G Decoding etc. in the BIOS (as described here: https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE_7.x )?
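Besides the BIOS settings, the kernel side from that wiki article can be checked from the host shell. A small sketch of the check — the sample command lines below are hypothetical; on a real host you would feed it `"$(cat /proc/cmdline)"`:

```shell
#!/bin/sh
# Check whether intel_iommu=on is present on a kernel command line, as the
# wiki article requires for Intel hosts (AMD hosts use amd_iommu instead).
has_iommu() {
    case " $1 " in
        *" intel_iommu=on "*) echo "enabled" ;;
        *) echo "missing" ;;
    esac
}

# Hypothetical cmdlines for illustration:
has_iommu "BOOT_IMAGE=/boot/vmlinuz-5.15 root=/dev/mapper/pve-root ro quiet"
# -> missing
has_iommu "BOOT_IMAGE=/boot/vmlinuz-5.15 root=/dev/mapper/pve-root ro quiet intel_iommu=on"
# -> enabled
```

If it reports "missing", add the option to the kernel command line (GRUB or systemd-boot, as described in the wiki) and reboot.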
 
@dcsapak I have another query regarding this vGPU setup.

I have another machine with an NVIDIA A100 80 GB. Will I be able to use Proxmox for a similar setup, with the GPU distributed across multiple VMs? Note that I only need compute capabilities in the VMs, no graphics.

I tried the same tutorial; nvidia-smi works fine on the Proxmox host, but mdevctl is not showing any vGPU profiles.

Thanks.
 
According to https://docs.nvidia.com/grid/gpus-supported-by-vgpu.html the A100 should be supported by the vGPU 14 drivers, and in theory the procedure should be the same as with our RTX A5000 card (it is the same architecture/generation).
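One Ampere-specific detail worth checking when `mdevctl types` comes back empty (an assumption based on NVIDIA's vGPU documentation, not something confirmed in this thread): on Ampere and newer GPUs such as the A100, the mediated device types only appear after the SR-IOV virtual functions have been enabled. A sketch of the steps on the host:

```shell
# Enable the SR-IOV virtual functions that back the vGPU profiles on
# Ampere-and-newer GPUs (the script ships with the NVIDIA vGPU host driver;
# these commands act on real hardware and are shown as a recipe only):
/usr/lib/nvidia/sriov-manage -e ALL

# The vGPU profiles should then be listed:
mdevctl types
```

Note that `sriov-manage` does not persist across reboots, so it is typically wired into a systemd unit or similar.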
 
