Multi-GPU CUDA setup issues

daulky

New Member
Aug 17, 2022
So I have four RTX 3090 GPUs in a Proxmox server. I am having real trouble getting them all to work together.

I have done the passthrough and can't seem to get them to work in a VM (not a container). Would I need to
install the drivers on the Proxmox host first before installing CUDA on any of the VMs?

I have installed the GPU driver for the RTX 3090 and then used the official Docker image with TensorFlow, but I can still only use 1 GPU,
or it just crashes.

Anyone got a similar config working?

Thanks.
 
I have done the passthrough and can't seem to get them to work in a VM (not a container). Would I need to
install the drivers on the Proxmox host first before installing CUDA on any of the VMs?
No, not when using PCIe passthrough. In principle you want nothing to touch the devices before the drivers inside the VM do. Therefore use early binding to vfio-pci and boot the system with another GPU (using options vfio-pci ids= as per the Wiki; a sketch follows below). Maybe you need to dump the device ROMs and pass them?
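
For reference, early binding usually looks something like this (a sketch; the 10de:2204/10de:1aef IDs are my assumption for a 3090's GPU and audio functions, so verify yours with lspci -nn first):

# /etc/modprobe.d/vfio.conf -- claim the cards for vfio-pci before any other driver loads
# 10de:2204 (GPU) and 10de:1aef (audio) are assumed IDs; check with: lspci -nn | grep -i nvidia
options vfio-pci ids=10de:2204,10de:1aef
softdep nvidia pre: vfio-pci
# then rebuild the initramfs and reboot:
# update-initramfs -u -k all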
I have installed the GPU driver for the RTX 3090 and then used the official Docker image with TensorFlow, but I can still only use 1 GPU,
or it just crashes.
Maybe disable Resizable BAR, which I think is not supported under virtualization? Disable above-4G decoding? Is your power supply powerful enough? Otherwise the additional (peak) load from another GPU might trip a reset. Are the devices in IOMMU groups isolated from the host, or did you use pcie_acs_override (can you try without it, or show all groups without it, e.g. with the one-liner below)?
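
To dump all groups, something like this standard sysfs loop should work:

# list every PCI device together with its IOMMU group
for d in /sys/kernel/iommu_groups/*/devices/*; do
    g=$(basename $(dirname $(dirname $d)))
    echo -n "group $g: "; lspci -nns "${d##*/}"
done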
What does a crash look like? Does it happen when starting the VM or when using the GPU? Anything relevant in the system logs when it crashed?
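
As a quick sanity check inside the VM (assuming the stock tensorflow/tensorflow:latest-gpu image and the NVIDIA container toolkit; adjust to whatever image you actually use), you could confirm how many GPUs the container sees before TensorFlow does anything:

# expose all GPUs to the container and print what TensorFlow detects
docker run --rm --gpus all tensorflow/tensorflow:latest-gpu \
    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

If that prints fewer than four devices, the problem is below Docker (passthrough or driver), not in TensorFlow.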
Anyone got a similar config working?
Sorry, I don't even have experience with NVidia devices (due to them preventing passthrough in the past), so I can give only very generic advice.
 
According to igorslab, the 3090 spikes to 500W power consumption under heavy load. So the PSU should be able to handle 2000W just for the GPUs without tripping.
 
According to igorslab, the 3090 spikes to 500W power consumption under heavy load. So the PSU should be able to handle 2000W just for the GPUs without tripping.
We have 4kW in 2 x 2kW PSUs, so we should be good.
 
Okay so I did get it working.

I checked the host machine for the VM and the logs said it all, really. Something was using one of the graphics cards' memory and causing an issue.

12000000000-13801ffffff : PCI Bus 0000:40
12000000000-12801ffffff : PCI Bus 0000:42
12000000000-127ffffffff : 0000:42:00.0
12000000000-127ffffffff : vfio-pci
12800000000-12801ffffff : 0000:42:00.0
12800000000-12801ffffff : vfio-pci
13000000000-13801ffffff : PCI Bus 0000:41
13000000000-137ffffffff : 0000:41:00.0
13000000000-130002fffff : BOOTFB
13000000000-130002fffff : simplefb
13800000000-13801ffffff : 0000:41:00.0

The BOOTFB and simplefb entries show something on the host was still using the graphics card's memory instead of it being set to a vfio-pci device. I'm using kernel 5.15, which apparently has this as a bug, whereas 5.12 and 5.4 don't.
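
If anyone wants to check for the same symptom, it shows up directly in /proc/iomem on the host (run as root, otherwise the addresses are masked):

# show which driver claims each range; BOOTFB/simplefb where you expect vfio-pci is the problem
grep -iE 'bootfb|simplefb|vfio-pci' /proc/iomem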

I found a hack to fix it for the 5.15 kernel here.

https://forum.proxmox.com/threads/problem-with-gpu-passthrough.55918/page-2

So far it seems to have worked.
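
A quick way to confirm the fix took (using my card's 41:00.0 address from the listing above) is to check which driver is bound now:

# the card that previously showed BOOTFB/simplefb should now report vfio-pci
lspci -nnk -s 41:00.0
# expected: "Kernel driver in use: vfio-pci"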
 
Indeed, you need the kernel parameter initcall_blacklist=sysfb_init when doing passthrough on kernel 5.15 or higher with an NVIDIA GPU that is used for output during POST/boot/host console.
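
For example, on a Proxmox host that boots via GRUB (a sketch; systemd-boot installs edit /etc/kernel/cmdline and run proxmox-boot-tool refresh instead):

# /etc/default/grub -- append the parameter to the existing kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet initcall_blacklist=sysfb_init"
# then apply and reboot:
# update-grub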
 
