Dear experts,
I'm in need of your help once again.
Desire: Use an NVIDIA A100 GPU in guest VMs with Docker containers
Problem: The VM sees the NVIDIA vGPU (created by the host), but CUDA calls don't find a GPU. From the logs:
Code:
NVRM: GPU 0000:01:00.0: RmInitAdapter failed!
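For context, the end goal inside the VM is roughly the following (assuming nvidia-container-toolkit is installed; the image tag is just an example):
Code:
# any CUDA image should do; 12.4.x picked to match the VM's CUDA version
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi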
Setup:
- 2-node PVE 8.2 cluster (+1 voting member for quorum)
- 1x NVIDIA A100 GPU
- vGPU drivers (550.90.07) installed on host w/ sriov_numvfs=20 (VF/MIG enablement sketched after this list)
- matching gridd (550.90.07) drivers installed on Ubuntu 22.04 VM
- CUDA Version: 12.4 installed on the same VM
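The host-side SR-IOV/MIG enablement was along these lines (a sketch from memory; the MIG profile ID should be verified locally with nvidia-smi mig -lgip):
Code:
# enable the VFs on the A100 via NVIDIA's helper (writes sriov_numvfs)
/usr/lib/nvidia/sriov-manage -e 0000:8a:00.0
# turn on MIG mode, then create one GPU instance + compute instance
nvidia-smi -i 0 -mig 1
nvidia-smi mig -lgip        # list available GPU instance profiles
nvidia-smi mig -cgi 19 -C   # 19 = 1g.10gb on A100 80GB (verify with -lgip)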
Documentation used:
pveversion:
Code:
pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.12-4-pve)
pve-manager: 8.2.8 (running version: 8.2.8/a577cfa684c7476d)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-4
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
proxmox-kernel-6.8.12-3-pve-signed: 6.8.12-3
proxmox-kernel-6.8.12-2-pve-signed: 6.8.12-2
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx9
intel-microcode: 3.20240813.1~deb12u1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.8
libpve-cluster-perl: 8.0.8
libpve-common-perl: 8.2.8
libpve-guest-common-perl: 5.1.4
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.11
libpve-storage-perl: 8.2.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.2.9-1
proxmox-backup-file-restore: 3.2.9-1
proxmox-firewall: 0.5.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.0
pve-cluster: 8.0.8
pve-container: 5.2.1
pve-docs: 8.2.4
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.0.7
pve-firmware: 3.14-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.4
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1
nvidia-smi on host w/ 1 MIG GI + 1 MIG CI:
Code:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.05 Driver Version: 550.90.05 CUDA Version: N/A |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:8A:00.0 Off | On |
| N/A 30C P0 95W / 300W | 9478MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| 0 13 0 0 | 9478MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 13 0 1436240 C+G vgpu 9472MiB |
+-----------------------------------------------------------------------------------------+
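For reference, host-side vGPU status can also be queried directly (I can paste this output too if useful):
Code:
nvidia-smi vgpu       # one line per running vGPU
nvidia-smi vgpu -q    # detailed per-vGPU query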
Relevant outputs from host:
Code:
lspci -kd 10de:
8a:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
Subsystem: NVIDIA Corporation GA100 [A100 PCIe 80GB]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
8a:00.4 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
Subsystem: NVIDIA Corporation GA100 [A100 PCIe 80GB]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
.... (and so on) .....
--
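The mdev type referenced in the VM config can be cross-checked on the VF it is created from (standard mdev sysfs layout; nvidia-699 is the type from 8013.conf below):
Code:
cat /sys/bus/pci/devices/0000:8a:00.4/mdev_supported_types/nvidia-699/name
cat /sys/bus/pci/devices/0000:8a:00.4/mdev_supported_types/nvidia-699/available_instances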
VM config (mdev UUID generated via the built-in GUI feature):
Code:
cat /etc/pve/qemu-server/8013.conf
agent: 1,fstrim_cloned_disks=1
bios: ovmf
boot: order=sata0;scsi0;scsi1
cicustom: vendor=local:snippets/vendor.yaml,user=local:snippets/users-machine1.yaml,network=local:snippets/metadata.yaml
cores: 1
cpu: x86-64-v4
efidisk0: vm-data:vm-8013-disk-0,format=raw,size=128K
hostpci0: 0000:8a:00.4,mdev=nvidia-699,pcie=1
machine: q35,viommu=virtio
memory: 65536
meta: creation-qemu=9.0.2,ctime=1729726853
name: <something>
nameserver: 192.168.100.36 192.168.100.35 192.168.0.1
net0: virtio=00:50:56:88:e9:8d,bridge=vmbr4
onboot: 1
ostype: l26
scsi0: vm-data:vm-8013-disk-1,format=raw,size=150G
scsi1: vm-data:vm-8013-cloudinit,media=cdrom,size=4M
scsihw: virtio-scsi-pci
searchdomain: <domain>
smbios1: uuid=4208fd12-12f9-9b4d-beb5-2b8a00ebf73a
sockets: 8
startup: order=10
tags: 22.04;docker;ubuntu
vga: std
vmgenid: bf13e566-8051-4035-ad0a-592cc15916de
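The CLI equivalent of the GUI-generated hostpci0 line would be roughly:
Code:
qm set 8013 --hostpci0 0000:8a:00.4,mdev=nvidia-699,pcie=1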
nvidia-smi on VM:
Code:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 GRID A100D-1-10C On | 00000000:01:00.0 Off | On |
| N/A N/A P0 N/A / N/A | 0MiB / 10240MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| No MIG devices found |
+-----------------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
nvidia-gridd activation on VM:
Code:
systemctl status nvidia-gridd
● nvidia-gridd.service - NVIDIA Grid Daemon
Loaded: loaded (/lib/systemd/system/nvidia-gridd.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2024-11-22 02:15:50 PST; 6h ago
Process: 1212 ExecStart=/usr/bin/nvidia-gridd (code=exited, status=0/SUCCESS)
Main PID: 1215 (nvidia-gridd)
Tasks: 3 (limit: 77022)
Memory: 6.2M
CPU: 48ms
CGroup: /system.slice/nvidia-gridd.service
└─1215 /usr/bin/nvidia-gridd
Nov 22 02:15:50 cpai.esco.ghaar systemd[1]: Starting NVIDIA Grid Daemon...
Nov 22 02:15:50 cpai.esco.ghaar systemd[1]: Started NVIDIA Grid Daemon.
Nov 22 02:15:50 cpai.esco.ghaar nvidia-gridd[1215]: Started (1215)
Nov 22 02:15:50 cpai.esco.ghaar nvidia-gridd[1215]: vGPU Software package (0)
Nov 22 02:15:50 cpai.esco.ghaar nvidia-gridd[1215]: Ignore service provider and node-locked licensing
Nov 22 02:15:50 cpai.esco.ghaar nvidia-gridd[1215]: NLS initialized
Nov 22 02:15:50 cpai.esco.ghaar nvidia-gridd[1215]: Acquiring license. (Info: api.cls.licensing.nvidia.com; NVIDIA Virtual Compute Server)
Nov 22 02:15:53 cpai.esco.ghaar nvidia-gridd[1215]: License acquired successfully. (Info: api.cls.licensing.nvidia.com, NVIDIA Virtual Compute Server; Expiry: 2025-11-21 10:15:52 GMT)
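Licensing itself looks fine; it can also be confirmed from inside the VM with:
Code:
# "License Status" should read "Licensed" once nvidia-gridd has a lease
nvidia-smi -q | grep -A1 -i license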
Other relevant info from VM:
Code:
cat /proc/driver/nvidia/gpus/0000:01:00.0/information
Model: GRID A100D-1-10C
IRQ: 0
GPU UUID: GPU-2a79a458-a8b5-11ef-b601-f45ad91f937d
Video BIOS: 00.00.00.00.00
Bus Type: PCI
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:01:00.0
Device Minor: 0
GPU Excluded: No
--
CUDA not detecting the GPU:
Code:
cuda-samples/Samples/1_Utilities/deviceQuery# ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 100
-> no CUDA-capable device is detected
Result = FAIL
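Quick sanity checks inside the VM alongside deviceQuery (the dmesg output below is where the actual failure shows up):
Code:
nvidia-smi -L               # does the driver enumerate the vGPU?
ls -l /dev/nvidia*          # device nodes present?
sudo dmesg | grep -i nvrm   # surfaces the RmInitAdapter error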
logs from VM:
Code:
--
dmesg | tail -n 100
[ 6.236427] snd_hda_intel 0000:00:1b.0: Adding to iommu group 10
[ 6.236433] iommu: Failed to allocate default IOMMU domain of type 4 for group (null) - Falling back to IOMMU_DOMAIN_DMA
[ 6.241153] snd_hda_intel 0000:00:1b.0: no codecs found!
[ 6.494903] nvidia: loading out-of-tree module taints kernel.
[ 6.494913] nvidia: module license 'NVIDIA' taints kernel.
[ 6.494914] Disabling lock debugging due to kernel taint
[ 6.520226] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 6.531727] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
[ 6.533102] nvidia 0000:01:00.0: Adding to iommu group 11
[ 6.533886] iommu: Failed to allocate default IOMMU domain of type 4 for group (null) - Falling back to IOMMU_DOMAIN_DMA
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.90.07 Fri May 31 09:35:42 UTC 2024
[ 6.550041] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[ 6.555814] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 550.90.07 Fri May 31 09:30:47 UTC 2024
[ 6.559092] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 6.559094] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[ 6.597161] audit: type=1400 audit(1732270547.596:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=705 comm="apparmor_parser"
[ 6.597166] audit: type=1400 audit(1732270547.596:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=705 comm="apparmor_parser"
[ 6.597315] audit: type=1400 audit(1732270547.596:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=704 comm="apparmor_parser"
[ 6.599720] audit: type=1400 audit(1732270547.600:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=709 comm="apparmor_parser"
[ 6.599723] audit: type=1400 audit(1732270547.600:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=709 comm="apparmor_parser"
[ 6.599725] audit: type=1400 audit(1732270547.600:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=709 comm="apparmor_parser"
[ 6.601872] audit: type=1400 audit(1732270547.600:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=711 comm="apparmor_parser"
[ 6.601875] audit: type=1400 audit(1732270547.600:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=711 comm="apparmor_parser"
[ 6.603242] audit: type=1400 audit(1732270547.604:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="tcpdump" pid=710 comm="apparmor_parser"
[ 6.603537] audit: type=1400 audit(1732270547.604:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=706 comm="apparmor_parser"
...
[ 8.285208] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1589)
[ 8.286209] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 8.385782] loop6: detected capacity change from 0 to 8
[ 8.588329] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[ 8.593520] nvidia-uvm: Loaded the UVM driver, major device number 510.
[ 26.692893] Initializing XFRM netlink socket
[ 26.742248] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
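If host-side logs would help, I can pull the vGPU manager journal as well, e.g.:
Code:
journalctl -b -u nvidia-vgpu-mgr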