NVIDIA A100 GPU works in both host and guest VM but not really

morik_proxmox

Dear experts,
I'm in need of your help once again.

Desire: use the NVIDIA A100 GPU in guest VMs running Docker containers

Problem: the VM sees the NVIDIA vGPU (created by the host), but CUDA calls don't find a GPU. From the logs:
Code:
NVRM: GPU 0000:01:00.0: RmInitAdapter failed!

Setup:
  • 2-node PVE 8.2 cluster (+1 voting member)
  • 1x NVIDIA A100 GPU
  • vGPU host driver (550.90.05, per nvidia-smi below) installed on the host with sriov_numvfs=20 (VF enablement sketched right after this list)
  • matching guest/gridd driver (550.90.07) installed on the Ubuntu 22.04 VM
  • CUDA 12.4 installed on the same VM
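
For reference, a minimal sketch of how the VFs were enabled on the host: the sriov-manage helper ships with the NVIDIA vGPU host driver, and the sysfs line is roughly what it does under the hood (the PCI address is my card's, adjust to yours):
Code:
# enable the virtual functions on the A100 via the NVIDIA helper
/usr/lib/nvidia/sriov-manage -e 0000:8a:00.0

# roughly equivalent low-level route: expose 20 VFs on the physical function
echo 20 > /sys/bus/pci/devices/0000:8a:00.0/sriov_numvfs

# verify the VFs showed up (PF plus 20 VFs)
lspci -d 10de:
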
Please note: the same hardware was, until recently, running ESXi 8.0 where everything worked, so a hardware failure is unlikely.

pveversion:
Code:
pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.12-4-pve)
pve-manager: 8.2.8 (running version: 8.2.8/a577cfa684c7476d)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-4
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
proxmox-kernel-6.8.12-3-pve-signed: 6.8.12-3
proxmox-kernel-6.8.12-2-pve-signed: 6.8.12-2
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx9
intel-microcode: 3.20240813.1~deb12u1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.8
libpve-cluster-perl: 8.0.8
libpve-common-perl: 8.2.8
libpve-guest-common-perl: 5.1.4
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.11
libpve-storage-perl: 8.2.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.2.9-1
proxmox-backup-file-restore: 3.2.9-1
proxmox-firewall: 0.5.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.0
pve-cluster: 8.0.8
pve-container: 5.2.1
pve-docs: 8.2.4
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.0.7
pve-firmware: 3.14-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.4
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1


nvidia-smi on the host, with 1 MIG GPU instance (GI) + 1 MIG compute instance (CI):
Code:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.05              Driver Version: 550.90.05      CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          On  |   00000000:8A:00.0 Off |                   On |
| N/A   30C    P0             95W /  300W |    9478MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0   13   0   0  |            9478MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   13    0    1436240    C+G   vgpu                                         9472MiB |
+-----------------------------------------------------------------------------------------+
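
For completeness, this is how such a GI/CI pair is typically created on the host with stock nvidia-smi commands (assuming MIG mode is already enabled; profile ID 19 is the usual 1g.10gb profile on an A100 80GB, but verify it with -lgip on your card):
Code:
# enable MIG mode on GPU 0 (may need a GPU reset or reboot to take effect)
nvidia-smi -i 0 -mig 1

# list the GPU-instance profiles and their IDs
nvidia-smi mig -lgip

# create one 1g.10gb GPU instance plus its default compute instance (-C)
nvidia-smi mig -cgi 19 -C

# confirm the instances exist
nvidia-smi mig -lgi
nvidia-smi mig -lci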

Relevant outputs from host:
Code:
lspci -kd 10de:
8a:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
    Subsystem: NVIDIA Corporation GA100 [A100 PCIe 80GB]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
8a:00.4 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
    Subsystem: NVIDIA Corporation GA100 [A100 PCIe 80GB]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
.... (and so on) .....
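
To confirm that the nvidia-699 mdev type referenced in the VM config below is actually offered by the VF, the standard mdev sysfs paths can be queried (the type directory name is taken from my VM config; adjust the VF address as needed):
Code:
# mdev types exposed by the VF that the VM uses
ls /sys/bus/pci/devices/0000:8a:00.4/mdev_supported_types/

# human-readable name and remaining instances for that type
cat /sys/bus/pci/devices/0000:8a:00.4/mdev_supported_types/nvidia-699/name
cat /sys/bus/pci/devices/0000:8a:00.4/mdev_supported_types/nvidia-699/available_instances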

--

VM config:
(mdev UUID generated via the built-in GUI feature)
Code:
cat /etc/pve/qemu-server/8013.conf
agent: 1,fstrim_cloned_disks=1
bios: ovmf
boot: order=sata0;scsi0;scsi1
cicustom: vendor=local:snippets/vendor.yaml,user=local:snippets/users-machine1.yaml,network=local:snippets/metadata.yaml
cores: 1
cpu: x86-64-v4
efidisk0: vm-data:vm-8013-disk-0,format=raw,size=128K
hostpci0: 0000:8a:00.4,mdev=nvidia-699,pcie=1
machine: q35,viommu=virtio
memory: 65536
meta: creation-qemu=9.0.2,ctime=1729726853
name: <something>
nameserver: 192.168.100.36 192.168.100.35 192.168.0.1
net0: virtio=00:50:56:88:e9:8d,bridge=vmbr4
onboot: 1
ostype: l26
scsi0: vm-data:vm-8013-disk-1,format=raw,size=150G
scsi1: vm-data:vm-8013-cloudinit,media=cdrom,size=4M
scsihw: virtio-scsi-pci
searchdomain: <domain>
smbios1: uuid=4208fd12-12f9-9b4d-beb5-2b8a00ebf73a
sockets: 8
startup: order=10
tags: 22.04;docker;ubuntu
vga: std
vmgenid: bf13e566-8051-4035-ad0a-592cc15916de
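
As a quick sanity check (not from any manual): once the VM is running, the mediated device that hostpci0 points at should be visible on the host; the UUID is whatever Proxmox generated for this VM:
Code:
# mediated devices currently instantiated on the host
ls -l /sys/bus/mdev/devices/

# each entry links back to its parent VF and its mdev type
readlink /sys/bus/mdev/devices/*/mdev_type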

nvidia-smi on VM:
Code:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  GRID A100D-1-10C               On  |   00000000:01:00.0 Off |                   On |
| N/A   N/A    P0             N/A /  N/A  |       0MiB /  10240MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  No MIG devices found                                                                   |
+-----------------------------------------------------------------------------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

nvidia-gridd activation on VM:
Code:
systemctl status nvidia-gridd
● nvidia-gridd.service - NVIDIA Grid Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-gridd.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2024-11-22 02:15:50 PST; 6h ago
    Process: 1212 ExecStart=/usr/bin/nvidia-gridd (code=exited, status=0/SUCCESS)
   Main PID: 1215 (nvidia-gridd)
      Tasks: 3 (limit: 77022)
     Memory: 6.2M
        CPU: 48ms
     CGroup: /system.slice/nvidia-gridd.service
             └─1215 /usr/bin/nvidia-gridd

Nov 22 02:15:50 cpai.esco.ghaar systemd[1]: Starting NVIDIA Grid Daemon...
Nov 22 02:15:50 cpai.esco.ghaar systemd[1]: Started NVIDIA Grid Daemon.
Nov 22 02:15:50 cpai.esco.ghaar nvidia-gridd[1215]: Started (1215)
Nov 22 02:15:50 cpai.esco.ghaar nvidia-gridd[1215]: vGPU Software package (0)
Nov 22 02:15:50 cpai.esco.ghaar nvidia-gridd[1215]: Ignore service provider and node-locked licensing
Nov 22 02:15:50 cpai.esco.ghaar nvidia-gridd[1215]: NLS initialized
Nov 22 02:15:50 cpai.esco.ghaar nvidia-gridd[1215]: Acquiring license. (Info: api.cls.licensing.nvidia.com; NVIDIA Virtual Compute Server)
Nov 22 02:15:53 cpai.esco.ghaar nvidia-gridd[1215]: License acquired successfully. (Info: api.cls.licensing.nvidia.com, NVIDIA Virtual Compute Server; Expiry: 2025-11-21 10:15:52 GMT)
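
The license state can also be cross-checked from inside the VM with nvidia-smi (the grep pattern is just a convenience; the exact section name can differ between driver branches):
Code:
nvidia-smi -q | grep -i -A 2 "licen"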

Other relevant info from VM:
Code:
cat /proc/driver/nvidia/gpus/0000:01:00.0/information
Model:          GRID A100D-1-10C
IRQ:            0
GPU UUID:      GPU-2a79a458-a8b5-11ef-b601-f45ad91f937d
Video BIOS:      00.00.00.00.00
Bus Type:      PCI
DMA Size:      47 bits
DMA Mask:      0x7fffffffffff
Bus Location:      0000:01:00.0
Device Minor:      0
GPU Excluded:     No

--

CUDA not detecting the GPU:
Code:
cuda-samples/Samples/1_Utilities/deviceQuery# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 100
-> no CUDA-capable device is detected
Result = FAIL
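
For what it's worth, the usual quick sanity checks in the guest look like this (just confirming the device nodes and the loaded driver before blaming CUDA itself):
Code:
# device nodes created by the driver
ls -l /dev/nvidia*

# loaded kernel module version
cat /proc/driver/nvidia/version

# any NVRM errors around GPU initialization
dmesg | grep -i nvrm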

logs from VM:
Code:
--
dmesg | tail -n 100
[    6.236427] snd_hda_intel 0000:00:1b.0: Adding to iommu group 10
[    6.236433] iommu: Failed to allocate default IOMMU domain of type 4 for group (null) - Falling back to IOMMU_DOMAIN_DMA
[    6.241153] snd_hda_intel 0000:00:1b.0: no codecs found!
[    6.494903] nvidia: loading out-of-tree module taints kernel.
[    6.494913] nvidia: module license 'NVIDIA' taints kernel.
[    6.494914] Disabling lock debugging due to kernel taint
[    6.520226] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    6.531727] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
[    6.533102] nvidia 0000:01:00.0: Adding to iommu group 11
[    6.533886] iommu: Failed to allocate default IOMMU domain of type 4 for group (null) - Falling back to IOMMU_DOMAIN_DMA
NVRM: loading NVIDIA UNIX x86_64 Kernel Module  550.90.07  Fri May 31 09:35:42 UTC 2024
[    6.550041] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[    6.555814] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  550.90.07  Fri May 31 09:30:47 UTC 2024
[    6.559092] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    6.559094] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[    6.597161] audit: type=1400 audit(1732270547.596:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=705 comm="apparmor_parser"
[    6.597166] audit: type=1400 audit(1732270547.596:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=705 comm="apparmor_parser"
[    6.597315] audit: type=1400 audit(1732270547.596:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=704 comm="apparmor_parser"
[    6.599720] audit: type=1400 audit(1732270547.600:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=709 comm="apparmor_parser"
[    6.599723] audit: type=1400 audit(1732270547.600:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=709 comm="apparmor_parser"
[    6.599725] audit: type=1400 audit(1732270547.600:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=709 comm="apparmor_parser"
[    6.601872] audit: type=1400 audit(1732270547.600:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=711 comm="apparmor_parser"
[    6.601875] audit: type=1400 audit(1732270547.600:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=711 comm="apparmor_parser"
[    6.603242] audit: type=1400 audit(1732270547.604:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="tcpdump" pid=710 comm="apparmor_parser"
[    6.603537] audit: type=1400 audit(1732270547.604:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=706 comm="apparmor_parser"
...
[    8.285208] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1589)
[    8.286209] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    8.385782] loop6: detected capacity change from 0 to 8
[    8.588329] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[    8.593520] nvidia-uvm: Loaded the UVM driver, major device number 510.
[   26.692893] Initializing XFRM netlink socket
[   26.742248] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
 
Thank you for taking a look at my post! My understanding is that the recommended procedure for the NVIDIA vGPU drivers no longer uses `mdevctl` via the CLI; the GUI handles the mdev mapping, which also matches NVIDIA's driver documentation.

Nonetheless, I get a "command not found" error for `mdevctl`.
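
In case anyone wants to cross-check with it anyway: `mdevctl` is not installed by default on PVE, but it is packaged in Debian, so (assuming the standard repos) it can be pulled in like this:
Code:
apt install mdevctl

# then, on the host:
mdevctl types   # supported mdev types per parent device
mdevctl list    # currently active/defined mediated devices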