PCI Passthrough with NVIDIA HGX 8 x A100 40GB - VM does not start

whyer

New Member
Jan 5, 2024
Hi, I'm a new member of the forum, but I have been working with Proxmox for a while. I have run into a rather confusing problem and need advice. Below I describe the problem and the hardware I am working on.

The host is a Supermicro AS 4124GO-NART with an NVIDIA DGX A100 8-GPU 40GB module, running at my university. The machine has:

2 x AMD EPYC 7702,
1 TB of RAM, and
NVIDIA DGX A100 8-GPU 40GB. The compute module contains NVLink and 2 x NVSwitch between the GPUs.

PCI passthrough is configured following the official documentation (https://pve.proxmox.com/wiki/PCI_Passthrough) and verified against (https://pve.proxmox.com/wiki/PCI(e)_Passthrough). I also found this thread (https://forum.proxmox.com/threads/p...a100-80gb-4-vms-gpu-only-works-on-one.127114/), checked my configuration against it step by step, and tested the fixes proposed there.

The host is also part of a cluster with 2 other nodes.

The host itself works fine: it runs VMs, I can migrate them to the other cluster nodes, and so on. The problem begins when I pass through the DGX subsystem. If I add any number of A100 GPUs, the VM does not start and the start task keeps running until the host is rebooted (screenshot below). If I remove the added GPUs, the VM starts normally. I also tried adding the PCIe device 'Broadcom / LSI PCIe Switch management endpoint', thinking it might be the NVLink/NVSwitch device, but the VM still won't start. In addition, I found some information about disabling NVIDIA GPUs in this thread (https://unix.stackexchange.com/ques...ble-and-later-re-enable-one-of-my-nvidia-gpus) and tried disabling the A100 GPUs before starting the VM with passthrough, but this does not help. Removing the PCIe devices allows the VM to start.

This is the journalctl output after starting the VM with a GPU attached:

Code:
Jan 12 12:14:39 deimos login[5848]: ROOT LOGIN  on '/dev/pts/0' from '*.*.*.*'
Jan 12 12:14:54 deimos chronyd[5308]: Selected source 212.160.106.226 (2.debian.pool.ntp.org)
Jan 12 12:14:54 deimos chronyd[5308]: System clock TAI offset set to 37 seconds
Jan 12 12:15:01 deimos kernel: nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
Jan 12 12:15:01 deimos kernel: nvidia-uvm: Loaded the UVM driver, major device number 504.
Jan 12 12:16:00 deimos chronyd[5308]: Selected source 80.50.102.114 (2.debian.pool.ntp.org)
Jan 12 12:16:27 deimos kernel: [drm] [nvidia-drm] [GPU ID 0x00008d00] Unloading driver
Jan 12 12:16:27 deimos kernel: [drm] [nvidia-drm] [GPU ID 0x00008700] Unloading driver
Jan 12 12:16:27 deimos kernel: [drm] [nvidia-drm] [GPU ID 0x0000ca00] Unloading driver
Jan 12 12:16:27 deimos kernel: [drm] [nvidia-drm] [GPU ID 0x0000c700] Unloading driver
Jan 12 12:16:27 deimos kernel: [drm] [nvidia-drm] [GPU ID 0x00000a00] Unloading driver
Jan 12 12:16:27 deimos kernel: [drm] [nvidia-drm] [GPU ID 0x00000700] Unloading driver
Jan 12 12:16:27 deimos kernel: [drm] [nvidia-drm] [GPU ID 0x00004d00] Unloading driver
Jan 12 12:16:27 deimos kernel: [drm] [nvidia-drm] [GPU ID 0x00004700] Unloading driver
Jan 12 12:17:01 deimos CRON[6432]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 12 12:17:01 deimos CRON[6433]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jan 12 12:17:01 deimos CRON[6432]: pam_unix(cron:session): session closed for user root
Jan 12 12:18:06 deimos kernel: [drm] [nvidia-drm] [GPU ID 0x00004700] Loading driver
Jan 12 12:18:06 deimos kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:47:00.0 on minor 1
Jan 12 12:18:06 deimos kernel: [drm] [nvidia-drm] [GPU ID 0x00004d00] Loading driver
Jan 12 12:18:06 deimos kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:4d:00.0 on minor 2
Jan 12 12:18:06 deimos kernel: [drm] [nvidia-drm] [GPU ID 0x00000700] Loading driver
Jan 12 12:18:06 deimos kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:07:00.0 on minor 3
Jan 12 12:18:06 deimos kernel: [drm] [nvidia-drm] [GPU ID 0x00000a00] Loading driver
Jan 12 12:18:06 deimos kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:0a:00.0 on minor 4
Jan 12 12:18:06 deimos kernel: [drm] [nvidia-drm] [GPU ID 0x0000c700] Loading driver
Jan 12 12:18:06 deimos kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:c7:00.0 on minor 5
Jan 12 12:18:06 deimos kernel: [drm] [nvidia-drm] [GPU ID 0x0000ca00] Loading driver
Jan 12 12:18:06 deimos kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:ca:00.0 on minor 6
Jan 12 12:18:06 deimos kernel: [drm] [nvidia-drm] [GPU ID 0x00008700] Loading driver
Jan 12 12:18:06 deimos kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:87:00.0 on minor 7
Jan 12 12:18:06 deimos kernel: [drm] [nvidia-drm] [GPU ID 0x00008d00] Loading driver
Jan 12 12:18:06 deimos kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:8d:00.0 on minor 8
Jan 12 12:19:04 deimos pvedaemon[6983]: start VM 124: UPID:deimos:00001B47:0000785C:65A12028:qmstart:124:root@pam:
Jan 12 12:19:04 deimos pvedaemon[5622]: <root@pam> starting task UPID:deimos:00001B47:0000785C:65A12028:qmstart:124:root@pam:
Jan 12 12:19:04 deimos kernel: NVRM: Attempting to remove device 0000:47:00.0 with non-zero usage count!
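
To make the log above easier to read, here is a minimal sketch that pulls out which PCI addresses nvidia-drm initialized and which device NVRM complained about. The function name `summarize` and the two regular expressions are my own assumptions, written only against the message formats shown in this log; other driver versions may word these lines differently.

```python
import re

# Patterns written against the exact kernel messages in the log above (assumption:
# other driver versions may format these lines differently).
INIT_RE = re.compile(
    r"Initialized nvidia-drm \S+ \S+ for "
    r"([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) on minor (\d+)"
)
BUSY_RE = re.compile(
    r"NVRM: Attempting to remove device ([0-9a-f:.]+) with non-zero usage count"
)


def summarize(lines):
    """Return ({pci_address: drm_minor}, [addresses NVRM refused to remove])."""
    gpus = {}
    busy = []
    for line in lines:
        match = INIT_RE.search(line)
        if match:
            gpus[match.group(1)] = int(match.group(2))
            continue
        match = BUSY_RE.search(line)
        if match:
            busy.append(match.group(1))
    return gpus, busy


# Three lines copied from the log above, used as sample input.
SAMPLE = [
    "Jan 12 12:18:06 deimos kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:47:00.0 on minor 1",
    "Jan 12 12:18:06 deimos kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:4d:00.0 on minor 2",
    "Jan 12 12:19:04 deimos kernel: NVRM: Attempting to remove device 0000:47:00.0 with non-zero usage count!",
]

if __name__ == "__main__":
    gpus, busy = summarize(SAMPLE)
    print("GPUs claimed by nvidia-drm:", gpus)
    print("Devices still in use at removal:", busy)
```

The same function could be fed the full kernel log instead of `SAMPLE`, for example by passing it the lines of a saved journalctl dump.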

Today I ran into "NVRM: Attempting to remove device 0000:47:00.0 with non-zero usage count!", which I understand to mean that the GPU is still in use by the driver. Additionally, I wonder whether the nvidia-drm module is doing something wrong here.
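
One way to check this reading is to look at which kernel driver each GPU is bound to before the VM starts. The sketch below is only an assumption about how to do that: it reads the standard sysfs `driver` symlink for each PCI address taken from the log above. The helper name `bound_driver` is hypothetical. As far as I understand the passthrough docs, a device meant for a VM should show `vfio-pci` here rather than `nvidia`.

```python
import os
from pathlib import Path

# PCI addresses of the eight A100 GPUs, copied from the nvidia-drm lines in the log.
GPU_ADDRESSES = [
    "0000:07:00.0", "0000:0a:00.0", "0000:47:00.0", "0000:4d:00.0",
    "0000:87:00.0", "0000:8d:00.0", "0000:c7:00.0", "0000:ca:00.0",
]


def bound_driver(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to a PCI device, or None.

    sysfs exposes the binding as a symlink named 'driver' inside the device
    directory; the last path component of its target is the driver name.
    None means no driver is bound, or the device is not present at all.
    """
    link = Path(sysfs_root) / pci_address / "driver"
    if not link.is_symlink():
        return None
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    for address in GPU_ADDRESSES:
        driver = bound_driver(address)
        print(address, driver if driver else "no driver bound (or device not present)")
```

If this prints `nvidia` for a GPU that is assigned to the VM, that would be consistent with the "non-zero usage count" message: the host driver still holds the device when the VM start task tries to take it over.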

Any thoughts? Please feel free to ask for any additional information.

I am going back to studying the NVIDIA docs.
 
