NVIDIA A100 GPU works on both host and guest VM, but not really

morik_proxmox

New Member
Jan 2, 2023
Dear experts,
I'm in need of your help once again.

Goal: use the NVIDIA A100 GPU in guest VMs running Docker containers

Problem: the VM sees the NVIDIA vGPU (created by the host), but CUDA calls don't find a GPU. From the logs:
Code:
NVRM: GPU 0000:01:00.0: RmInitAdapter failed!

Setup:
  • 2-node PVE 8.2 cluster (plus one extra voting member)
  • 1x NVIDIA A100 GPU
  • vGPU host drivers (550.90.07) installed on the host with sriov_numvfs=20
  • matching GRID guest drivers (550.90.07) installed on the Ubuntu 22.04 VM
  • CUDA Version: 12.4 installed on the same VM
Please note: the same hardware setup was, until recently, running ESXi 8.0, where everything worked. So a hardware failure is unlikely.
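For reference, host-side SR-IOV enablement on the A100 typically looks like the sketch below. The PCI address 0000:8a:00.0 is taken from the lspci output further down; the sriov-manage path is the usual location shipped with NVIDIA's vGPU host driver, but treat the exact paths as assumptions for your install:

```shell
# Enable the A100's SR-IOV virtual functions on the host (sketch, paths assumed).
# NVIDIA's vGPU host driver ships a helper script for this:
/usr/lib/nvidia/sriov-manage -e 0000:8a:00.0
# Equivalent low-level route via sysfs (20 VFs, matching sriov_numvfs=20 above):
echo 20 > /sys/bus/pci/devices/0000:8a:00.0/sriov_numvfs
# The VFs should then show up as extra NVIDIA PCI functions:
lspci -d 10de:
```

This is a config fragment for a host with the A100 present; it is not runnable elsewhere.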

Documentation used:
pveversion:
Code:
pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.12-4-pve)
pve-manager: 8.2.8 (running version: 8.2.8/a577cfa684c7476d)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-4
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
proxmox-kernel-6.8.12-3-pve-signed: 6.8.12-3
proxmox-kernel-6.8.12-2-pve-signed: 6.8.12-2
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx9
intel-microcode: 3.20240813.1~deb12u1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.8
libpve-cluster-perl: 8.0.8
libpve-common-perl: 8.2.8
libpve-guest-common-perl: 5.1.4
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.11
libpve-storage-perl: 8.2.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.2.9-1
proxmox-backup-file-restore: 3.2.9-1
proxmox-firewall: 0.5.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.0
pve-cluster: 8.0.8
pve-container: 5.2.1
pve-docs: 8.2.4
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.0.7
pve-firmware: 3.14-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.4
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1


nvidia-smi on the host, with 1 MIG GPU instance (GI) + 1 compute instance (CI):
Code:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.05              Driver Version: 550.90.05      CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          On  |   00000000:8A:00.0 Off |                   On |
| N/A   30C    P0             95W /  300W |    9478MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0   13   0   0  |            9478MiB /  9728MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   13    0    1436240    C+G   vgpu                                         9472MiB |
+-----------------------------------------------------------------------------------------+
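For context, the single GI/CI pair shown above would typically be created on the host along these lines. This is a hedged sketch: profile ID 19 for the 1g.10gb slice is an assumption based on A100 80GB profile tables, so list your board's profiles first.

```shell
# Enable MIG mode on GPU 0 (may require draining workloads and a GPU reset):
nvidia-smi -i 0 -mig 1
# List the GPU-instance profiles this board offers (profile IDs vary by model):
nvidia-smi mig -lgip
# Create one GPU instance (1g.10gb, profile ID 19 assumed) plus its default compute instance:
nvidia-smi mig -cgi 19 -C
# Verify the resulting GI/CI IDs (compare with the "GI 13 / CI 0" row above):
nvidia-smi mig -lgi
```

These commands only run on a host with the GPU and MIG-capable driver present.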

Relevant outputs from host:
Code:
lspci -kd 10de:
8a:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
    Subsystem: NVIDIA Corporation GA100 [A100 PCIe 80GB]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
8a:00.4 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
    Subsystem: NVIDIA Corporation GA100 [A100 PCIe 80GB]
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
.... (and so on) .....
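To sanity-check the mdev side on the host, the vGPU types each VF exposes can be inspected via sysfs. This is a sketch assuming the standard mdev sysfs layout; nvidia-699 is the type referenced in the VM config below:

```shell
# List the vGPU (mdev) types exposed by one VF:
ls /sys/bus/pci/devices/0000:8a:00.4/mdev_supported_types/
# Check the human-readable name and remaining capacity of the chosen type:
cat /sys/bus/pci/devices/0000:8a:00.4/mdev_supported_types/nvidia-699/name
cat /sys/bus/pci/devices/0000:8a:00.4/mdev_supported_types/nvidia-699/available_instances
```

If available_instances reads 0 while no VM is using the type, the mdev allocation itself may be the problem.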

--

VM config:
(mdev uuid generation via built-in GUI feature)
Code:
cat /etc/pve/qemu-server/8013.conf
agent: 1,fstrim_cloned_disks=1
bios: ovmf
boot: order=sata0;scsi0;scsi1
cicustom: vendor=local:snippets/vendor.yaml,user=local:snippets/users-machine1.yaml,network=local:snippets/metadata.yaml
cores: 1
cpu: x86-64-v4
efidisk0: vm-data:vm-8013-disk-0,format=raw,size=128K
hostpci0: 0000:8a:00.4,mdev=nvidia-699,pcie=1
machine: q35,viommu=virtio
memory: 65536
meta: creation-qemu=9.0.2,ctime=1729726853
name: <something>
nameserver: 192.168.100.36 192.168.100.35 192.168.0.1
net0: virtio=00:50:56:88:e9:8d,bridge=vmbr4
onboot: 1
ostype: l26
scsi0: vm-data:vm-8013-disk-1,format=raw,size=150G
scsi1: vm-data:vm-8013-cloudinit,media=cdrom,size=4M
scsihw: virtio-scsi-pci
searchdomain: <domain>
smbios1: uuid=4208fd12-12f9-9b4d-beb5-2b8a00ebf73a
sockets: 8
startup: order=10
tags: 22.04;docker;ubuntu
vga: std
vmgenid: bf13e566-8051-4035-ad0a-592cc15916de
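One variable worth isolating, given the "Failed to allocate default IOMMU domain" messages in the guest dmesg further down, is the virtual IOMMU in the machine line (machine: q35,viommu=virtio). A hedged experiment via the PVE CLI, using VM ID 8013 from this config:

```shell
# Temporarily drop the virtio vIOMMU and see whether RmInitAdapter still fails (sketch):
qm set 8013 --machine q35
qm shutdown 8013 && qm start 8013
```

This is a diagnostic config change for this specific VM, not a recommended permanent setting.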

nvidia-smi on VM:
Code:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  GRID A100D-1-10C               On  |   00000000:01:00.0 Off |                   On |
| N/A   N/A    P0             N/A /  N/A  |       0MiB /  10240MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  No MIG devices found                                                                   |
+-----------------------------------------------------------------------------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
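Worth noting: the host nvidia-smi above reports driver 550.90.05, the guest reports 550.90.07, and the setup section lists 550.90.07 for the host. For NVIDIA vGPU, the host manager and guest driver normally carry different build numbers within one release, so the .05/.07 split is not necessarily an error, but both must belong to the same vGPU release branch. A trivial sketch, with the version strings copied from the outputs in this post:

```shell
# Check that host and guest driver builds sit on the same 550.90.x branch (sketch;
# version strings below are copied from the nvidia-smi outputs above):
host_ver="550.90.05"
guest_ver="550.90.07"
host_branch="${host_ver%.*}"    # strip the build number -> 550.90
guest_branch="${guest_ver%.*}"  # strip the build number -> 550.90
if [ "$host_branch" = "$guest_branch" ]; then
    echo "same branch: $host_branch"
else
    echo "branch mismatch: host $host_branch vs guest $guest_branch"
fi
```

If the branches differ, reinstalling a matched host/guest pair from one vGPU release is the first thing to try.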

nvidia-gridd licensing status in the VM:
Code:
systemctl status nvidia-gridd
● nvidia-gridd.service - NVIDIA Grid Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-gridd.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2024-11-22 02:15:50 PST; 6h ago
    Process: 1212 ExecStart=/usr/bin/nvidia-gridd (code=exited, status=0/SUCCESS)
   Main PID: 1215 (nvidia-gridd)
      Tasks: 3 (limit: 77022)
     Memory: 6.2M
        CPU: 48ms
     CGroup: /system.slice/nvidia-gridd.service
             └─1215 /usr/bin/nvidia-gridd

Nov 22 02:15:50 cpai.esco.ghaar systemd[1]: Starting NVIDIA Grid Daemon...
Nov 22 02:15:50 cpai.esco.ghaar systemd[1]: Started NVIDIA Grid Daemon.
Nov 22 02:15:50 cpai.esco.ghaar nvidia-gridd[1215]: Started (1215)
Nov 22 02:15:50 cpai.esco.ghaar nvidia-gridd[1215]: vGPU Software package (0)
Nov 22 02:15:50 cpai.esco.ghaar nvidia-gridd[1215]: Ignore service provider and node-locked licensing
Nov 22 02:15:50 cpai.esco.ghaar nvidia-gridd[1215]: NLS initialized
Nov 22 02:15:50 cpai.esco.ghaar nvidia-gridd[1215]: Acquiring license. (Info: api.cls.licensing.nvidia.com; NVIDIA Virtual Compute Server)
Nov 22 02:15:53 cpai.esco.ghaar nvidia-gridd[1215]: License acquired successfully. (Info: api.cls.licensing.nvidia.com, NVIDIA Virtual Compute Server; Expiry: 2025-11-21 10:15:52 GMT)

Other relevant info from VM:
Code:
cat /proc/driver/nvidia/gpus/0000:01:00.0/information
Model:          GRID A100D-1-10C
IRQ:            0
GPU UUID:      GPU-2a79a458-a8b5-11ef-b601-f45ad91f937d
Video BIOS:      00.00.00.00.00
Bus Type:      PCI
DMA Size:      47 bits
DMA Mask:      0x7fffffffffff
Bus Location:      0000:01:00.0
Device Minor:      0
GPU Excluded:     No

--

CUDA not detecting the GPU:
Code:
cuda-samples/Samples/1_Utilities/deviceQuery# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 100
-> no CUDA-capable device is detected
Result = FAIL
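When deviceQuery fails with error 100 while nvidia-smi looks healthy, a few driver-level checks can narrow down whether the device nodes and kernel module are actually usable. A generic sketch, not specific to this setup:

```shell
# Do the NVIDIA device nodes exist in the guest and are they readable?
ls -l /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm
# Does the kernel ring buffer show NVRM / RmInitAdapter errors?
dmesg | grep -i nvrm | tail -n 20
# What does the driver itself report about the GPU and MIG state?
nvidia-smi -q -i 0 | grep -iE "mig|state" | head
```

These commands only produce meaningful output inside the VM with the vGPU attached.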

logs from VM:
Code:
dmesg | tail -n 100
[    6.236427] snd_hda_intel 0000:00:1b.0: Adding to iommu group 10
[    6.236433] iommu: Failed to allocate default IOMMU domain of type 4 for group (null) - Falling back to IOMMU_DOMAIN_DMA
[    6.241153] snd_hda_intel 0000:00:1b.0: no codecs found!
[    6.494903] nvidia: loading out-of-tree module taints kernel.
[    6.494913] nvidia: module license 'NVIDIA' taints kernel.
[    6.494914] Disabling lock debugging due to kernel taint
[    6.520226] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    6.531727] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
[    6.533102] nvidia 0000:01:00.0: Adding to iommu group 11
[    6.533886] iommu: Failed to allocate default IOMMU domain of type 4 for group (null) - Falling back to IOMMU_DOMAIN_DMA
NVRM: loading NVIDIA UNIX x86_64 Kernel Module  550.90.07  Fri May 31 09:35:42 UTC 2024
[    6.550041] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[    6.555814] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  550.90.07  Fri May 31 09:30:47 UTC 2024
[    6.559092] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    6.559094] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[    6.597161] audit: type=1400 audit(1732270547.596:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=705 comm="apparmor_parser"
[    6.597166] audit: type=1400 audit(1732270547.596:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=705 comm="apparmor_parser"
[    6.597315] audit: type=1400 audit(1732270547.596:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=704 comm="apparmor_parser"
[    6.599720] audit: type=1400 audit(1732270547.600:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=709 comm="apparmor_parser"
[    6.599723] audit: type=1400 audit(1732270547.600:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=709 comm="apparmor_parser"
[    6.599725] audit: type=1400 audit(1732270547.600:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=709 comm="apparmor_parser"
[    6.601872] audit: type=1400 audit(1732270547.600:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=711 comm="apparmor_parser"
[    6.601875] audit: type=1400 audit(1732270547.600:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=711 comm="apparmor_parser"
[    6.603242] audit: type=1400 audit(1732270547.604:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="tcpdump" pid=710 comm="apparmor_parser"
[    6.603537] audit: type=1400 audit(1732270547.604:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=706 comm="apparmor_parser"
...
[    8.285208] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1589)
[    8.286209] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    8.385782] loop6: detected capacity change from 0 to 8
[    8.588329] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[    8.593520] nvidia-uvm: Loaded the UVM driver, major device number 510.
[   26.692893] Initializing XFRM netlink socket
[   26.742248] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
