Hello,
My issue is that I cannot see my vGPUs in nvidia-smi, but I can see them in the Proxmox GUI and add them to my VM config. The VM then fails to boot with "TASK ERROR: pci device '0000:01:00.4' has no available instances of 'nvidia-528'".
Hardware:
CPU(s): 32 x AMD Ryzen Threadripper PRO 3955WX 16-Cores (1 Socket)
Kernel Version: Linux 5.15.39-1-pve #1 SMP PVE 5.15.39-1 (Wed, 22 Jun 2022 17:22:00 +0200)
PVE Manager Version: pve-manager/7.2-7/d0dd0e85
GPU: Nvidia A6000
My first question: should I be using the Linux KVM or the Ubuntu version of the Nvidia vGPU 14.1 installer set? I've tried both and had similar results with each. I've strictly followed the Proxmox vGPU docs (which are minimal) and the Nvidia GRID 14.1 docs. All BIOS options should* be good; the hardware is a bit new (Supermicro AS-2114GT-DNR).
As I understand it, with this hardware I should be using mediated devices (https://pve.proxmox.com/pve-docs/pve-admin-guide.html#qm_virtual_machines_settings) and then Proxmox handles SR-IOV to the VMs?
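For context, this is roughly how I understand a mediated device gets attached to a VM from the CLI. It is only a sketch: the helper name `build_qm_mdev_cmd` and the VM ID 100 are made up for illustration, and the helper prints the `qm set` command instead of running it. The PCI address and type are the ones from the error message.

```shell
# Build (but do not run) the qm command that attaches a mediated device.
# build_qm_mdev_cmd is a hypothetical helper; it only prints the command.
#   $1 = VM ID, $2 = PCI address of the VF, $3 = mdev type
build_qm_mdev_cmd() {
    vmid="$1"
    bdf="$2"
    mdev_type="$3"
    printf 'qm set %s -hostpci0 %s,mdev=%s\n' "$vmid" "$bdf" "$mdev_type"
}
```

For example, `build_qm_mdev_cmd 100 0000:01:00.4 nvidia-528` prints the command I would expect to end up as the `hostpci0` line in the VM config.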
After a fresh install of Proxmox, I did the following:
- Set up the no-subscription repositories https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_package_repositories
- apt-get update
- apt-get dist-upgrade
- reboot
- apt install build-essential
- apt install pve-headers
- echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
- apt install libvirt-daemon-system
- reboot
- apt install unzip
- Upload nvidia drivers to Proxmox host - scp NVIDIA-GRID-Ubuntu-KVM-510.73.06-510.73.08-512.78.zip root@10.1.2.30:/root
- unzip NVIDIA-GRID-Ubuntu-KVM-510.73.06-510.73.08-512.78.zip
- sudo apt install ./nvidia-vgpu-ubuntu-510_510.73.06_amd64.deb
- /usr/lib/nvidia/sriov-manage -e 0000:01:00.0
- cd /sys/class/mdev_bus/0000\:01\:00.4/mdev_supported_types
- echo "37a54373-4813-443e-9261-5c0a05ede1ab"> nvidia-528/create
- reboot
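To double-check the state after these steps, something like the following could list how many instances of each vGPU type can still be created on each virtual function. As far as I can tell, `available_instances` is the value the task error is complaining about. This is only a sketch: the function name `list_mdev_instances` and its optional root-directory argument are my own (the argument exists so it can be pointed at a test tree), and it assumes the standard kernel mdev sysfs layout under /sys/class/mdev_bus.

```shell
# Print "<pci-address> <mdev-type> <available_instances>" for every mdev type.
# $1 (optional) = root directory to scan; defaults to the kernel's mdev_bus class.
list_mdev_instances() {
    root="${1:-/sys/class/mdev_bus}"
    for type_dir in "$root"/*/mdev_supported_types/*; do
        # Skip the unexpanded glob and anything without the counter file.
        [ -r "$type_dir/available_instances" ] || continue
        dev=$(basename "$(dirname "$(dirname "$type_dir")")")
        mdev_type=$(basename "$type_dir")
        count=$(cat "$type_dir/available_instances")
        printf '%s %s %s\n' "$dev" "$mdev_type" "$count"
    done
}
```

Run as `list_mdev_instances` on the host, the line to look at would be the one for 0000:01:00.4 and nvidia-528.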
Anyone got any ideas? I'm fresh out. Someone please tell me I'm missing something stooooopid.
TIA
Code:
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# systemctl status nvidia-vgpud.service
● nvidia-vgpud.service - NVIDIA vGPU Daemon
Loaded: loaded (/lib/systemd/system/nvidia-vgpud.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Thu 2022-07-21 01:26:30 BST; 6min ago
Process: 3687 ExecStart=/usr/bin/nvidia-vgpud (code=exited, status=0/SUCCESS)
Process: 3689 ExecStopPost=/bin/rm -rf /var/run/nvidia-vgpud (code=exited, status=0/SUCCESS)
Main PID: 3688 (code=exited, status=0/SUCCESS)
CPU: 103ms
Jul 21 01:26:30 pve nvidia-vgpud[3688]: Number of Displays: 1
Jul 21 01:26:30 pve nvidia-vgpud[3688]: Max pixels: 8847360
Jul 21 01:26:30 pve nvidia-vgpud[3688]: Display: width 4096, height 2160
Jul 21 01:26:30 pve nvidia-vgpud[3688]: GPU Direct supported: 0x1
Jul 21 01:26:30 pve nvidia-vgpud[3688]: NVLink P2P supported: 0x1
Jul 21 01:26:30 pve nvidia-vgpud[3688]: License: NVIDIA-vComputeServer,9.0;Quadro-Virtual-DWS,5.0
Jul 21 01:26:30 pve nvidia-vgpud[3688]: PID file unlocked.
Jul 21 01:26:30 pve nvidia-vgpud[3688]: PID file closed.
Jul 21 01:26:30 pve nvidia-vgpud[3688]: Shutdown (3688)
Jul 21 01:26:30 pve systemd[1]: nvidia-vgpud.service: Succeeded.
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# systemctl status nvidia-vgpu-mgr.service
● nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon
Loaded: loaded (/lib/systemd/system/nvidia-vgpu-mgr.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2022-07-21 01:12:42 BST; 20min ago
Process: 1006 ExecStart=/usr/bin/nvidia-vgpu-mgr (code=exited, status=0/SUCCESS)
Main PID: 1010 (nvidia-vgpu-mgr)
Tasks: 1 (limit: 154345)
Memory: 532.0K
CPU: 2.430s
CGroup: /system.slice/nvidia-vgpu-mgr.service
└─1010 /usr/bin/nvidia-vgpu-mgr
Jul 21 01:12:42 pve systemd[1]: Starting NVIDIA vGPU Manager Daemon...
Jul 21 01:12:42 pve systemd[1]: Started NVIDIA vGPU Manager Daemon.
Jul 21 01:12:43 pve nvidia-vgpu-mgr[1010]: notice: vmiop_env_log: nvidia-vgpu-mgr daemon started
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# mdevctl list
37a54373-4813-443e-9261-5c0a05ede1ab 0000:01:00.4 nvidia-528 (defined)
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# ls -l /sys/bus/mdev/devices/
total 0
lrwxrwxrwx 1 root root 0 Jul 21 01:15 37a54373-4813-443e-9261-5c0a05ede1ab -> ../../../devices/pci0000:00/0000:00:01.3/0000:01:00.4/37a54373-4813-443e-9261-5c0a05ede1ab
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# nvidia-smi
Thu Jul 21 01:36:20 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.06    Driver Version: 510.73.06    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000000:01:00.0 Off |                    0 |
| 30%   28C    P8    26W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# nvidia-smi vgpu
Thu Jul 21 01:36:30 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.06              Driver Version: 510.73.06                 |
|---------------------------------+------------------------------+------------+
| GPU  Name                       | Bus-Id                       | GPU-Util   |
|      vGPU ID     Name           | VM ID     VM Name            | vGPU-Util  |
|=================================+==============================+============|
|   0  NVIDIA RTX A6000           | 00000000:01:00.0             |   0%       |
+---------------------------------+------------------------------+------------+
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# lsmod | grep nvidia
nvidia_vgpu_vfio 61440 0
nvidia 39124992 11
mdev 28672 1 nvidia_vgpu_vfio
vfio 40960 3 nvidia_vgpu_vfio,vfio_iommu_type1,mdev
drm 602112 7 drm_kms_helper,drm_vram_helper,ast,nvidia,drm_ttm_helper,ttm
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# dmesg | grep -E "NVRM|nvidia"
[ 4.031106] nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[ 4.033842] nvidia 0000:01:00.0: enabling device (0000 -> 0002)
[ 4.118372] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 510.73.06 Mon May 9 08:06:24 UTC 2022
[ 5.311479] audit: type=1400 audit(1658362362.052:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=992 comm="apparmor_parser"
[ 5.311482] audit: type=1400 audit(1658362362.052:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=992 comm="apparmor_parser"
[ 5.325582] NVRM: GPU at 0000:01:00.0 has software scheduler DISABLED with policy BEST_EFFORT.
[ 122.485737] NVRM: GPU 0000:01:00.0: UnbindLock acquired
[ 123.206289] NVRM: Aborting probe for VF 0000:01:00.4 since PF is not bound to nvidia driver.
[ 123.206291] nvidia: probe of 0000:01:00.4 failed with error -1
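The last two dmesg lines are about driver binding, so a quick way to see which driver each PCI function is bound to, and how many VFs are enabled on the physical function, might look like the sketch below. The function names `pci_bound_driver` and `pci_num_vfs` and the optional root-directory argument are my own (the argument is only there so the functions can be tried against a test tree). The `driver` symlink and `sriov_numvfs` file are the standard PCI sysfs entries.

```shell
# Print the kernel driver bound to a PCI function, or "(none)" if unbound.
#   $1 = PCI address, $2 (optional) = sysfs devices root
pci_bound_driver() {
    bdf="$1"
    root="${2:-/sys/bus/pci/devices}"
    link="$root/$bdf/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/nvidia
        basename "$(readlink "$link")"
    else
        echo "(none)"
    fi
}

# Print how many VFs are currently enabled on a physical function,
# or "n/a" if the device exposes no sriov_numvfs file.
pci_num_vfs() {
    bdf="$1"
    root="${2:-/sys/bus/pci/devices}"
    if [ -r "$root/$bdf/sriov_numvfs" ]; then
        cat "$root/$bdf/sriov_numvfs"
    else
        echo "n/a"
    fi
}
```

On the host this would be called as `pci_bound_driver 0000:01:00.0`, `pci_bound_driver 0000:01:00.4` and `pci_num_vfs 0000:01:00.0`, which should show whether the PF and the VF from the error are both bound to the nvidia driver.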