GPU passthrough on HPE Cray XD670

Ignauus

Dec 18, 2024
Hi,

Has anyone tried and successfully enabled GPU passthrough on this beast of a server?
It has 8 x NVIDIA H200 GPUs with Intel Xeon 8570 CPUs.

I've tried various settings and tutorials from the web and this forum, but none of them end well.

The closest I've got is a failed driver install, with errors like the one below in the CUDA installer log:
NVRM: This PCI I/O region assigned to your NVIDIA device is invalid

With CPU type "host", the VM hangs with "NO VNC" in the console, and on the host I see it get stuck on:
vfio-pci-pci 0000:0a:00.0:Enabling HDA controller

I use a UEFI VM with q35 and have tried all the settings I can think of when adding a PCI device. The IOMMU groups look fine.
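
For reference, this is roughly how the IOMMU groups can be listed on the host. It is only a minimal sketch, assuming the standard Linux sysfs layout under /sys/kernel/iommu_groups; the function name list_iommu_groups is made up for illustration and is not part of any Proxmox tooling.

```python
import os


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group_number: [pci_address, ...]} read from the sysfs tree.

    `root` is a parameter only so the function can be pointed at a test
    directory; on a real host the default path is the one to use.
    """
    groups = {}
    if not os.path.isdir(root):
        # IOMMU is disabled or the kernel does not expose any groups.
        return groups
    for name in os.listdir(root):
        dev_dir = os.path.join(root, name, "devices")
        if not name.isdigit() or not os.path.isdir(dev_dir):
            continue
        # Each entry in devices/ is named after a PCI address (domain:bus:dev.fn).
        groups[int(name)] = sorted(os.listdir(dev_dir))
    return groups


if __name__ == "__main__":
    for group, devices in sorted(list_iommu_groups().items()):
        print(f"group {group}: {' '.join(devices)}")
```

Ideally each GPU sits in its own group, or shares it only with its own functions, so it can be handed to a VM without dragging other devices along.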

We also had problems like this on an older HPE server with 8 x A100 and ended up using libvirt/KVM. I'm not quite ready to give up just yet.
Any suggestions on where to begin?

--Peter
 
So, after some more testing...
GPU passthrough sort of almost works, but I'm only able to pass 1 of the 8 GPUs to a VM. If I try more than 1, all of them are visible with lspci, but only the last one works and the NVIDIA driver fails on the rest.

I've set up 2 hosts in parallel with the exact same spec and BIOS settings: one with Proxmox and the other with a RHEL9-based OS and libvirt. The RHEL9 host works with multiple GPUs per VM.