Multi-GPU CUDA setup issues

daulky

New Member
Aug 17, 2022
So I have four RTX 3090 GPUs in a Proxmox server. I am having real trouble getting them all to work together.

I have done the passthrough and can't seem to get them to work in a VM (not a container). Would I need to
install the drivers on the Proxmox host first before installing CUDA on any of the VMs?

I have installed the GPU driver for the RTX 3090 and then used the official Docker image with TensorFlow, but I can still only use 1 GPU,
or it just crashes.

Anyone got a similar config working?

Thanks.
 
I have done the passthrough and can't seem to get them to work in a VM (not a container). Would I need to
install the drivers on the Proxmox host first before installing CUDA on any of the VMs?
No, not when using PCIe passthrough. In principle you want nothing to touch the devices before the drivers inside the VM do. Therefore use early binding to vfio-pci and boot the system with another GPU (using options vfio-pci ids= as per the Wiki; a sketch follows below). Maybe you need to dump the device ROMs and pass them?
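
For reference, early binding usually looks something like this (a sketch; the 10de:2204/10de:1aef IDs are my assumption for a 3090's GPU and audio functions, so verify yours with lspci -nn first):

# /etc/modprobe.d/vfio.conf -- claim the cards for vfio-pci before any other driver loads
# 10de:2204 (GPU) and 10de:1aef (audio) are assumed IDs; check with: lspci -nn | grep -i nvidia
options vfio-pci ids=10de:2204,10de:1aef
softdep nvidia pre: vfio-pci
# then rebuild the initramfs and reboot:
# update-initramfs -u -k all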
I have installed the GPU driver for the RTX 3090 and then used the official Docker image with TensorFlow, but I can still only use 1 GPU,
or it just crashes.
Maybe disable Resizable BAR, which I think is not supported under virtualization? Disable above-4G decoding? Is your power supply powerful enough? Otherwise the additional (peak) load from another GPU might trip a reset. Are the devices in IOMMU groups isolated from the host, or did you use pcie_acs_override (can you try without it, or show all groups without it, e.g. with the one-liner below)?
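
To dump all groups, something like this standard sysfs loop should work:

# list every PCI device together with its IOMMU group
for d in /sys/kernel/iommu_groups/*/devices/*; do
    g=$(basename $(dirname $(dirname $d)))
    echo -n "group $g: "; lspci -nns "${d##*/}"
done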
What does a crash look like? Does it happen when starting the VM or when using the GPU? Anything relevant in the system logs when it crashed?
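
As a quick sanity check inside the VM (assuming the stock tensorflow/tensorflow:latest-gpu image and the NVIDIA container toolkit; adjust to whatever image you actually use), you could confirm how many GPUs the container sees before TensorFlow does anything:

# expose all GPUs to the container and print what TensorFlow detects
docker run --rm --gpus all tensorflow/tensorflow:latest-gpu \
    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

If that prints fewer than four devices, the problem is below Docker (passthrough or driver), not in TensorFlow.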
Anyone got a similar config working?
Sorry, I don't even have experience with NVidia devices (due to them preventing passthrough in the past), so I can give only very generic advice.
 
According to igorslab, the 3090 spikes to 500W power consumption under heavy load. So the PSU should be able to handle 2000W just for the GPUs without tripping.
 
According to igorslab, the 3090 spikes to 500W power consumption under heavy load. So the PSU should be able to handle 2000W just for the GPUs without tripping.
We have 4kW in 2 x 2kW PSUs, so we should be good.
 
Okay so I did get it working.

I checked the host machine for the VM and the logs said it all, really. Something was using one of the graphics cards' memory and causing an issue.

12000000000-13801ffffff : PCI Bus 0000:40
12000000000-12801ffffff : PCI Bus 0000:42
12000000000-127ffffffff : 0000:42:00.0
12000000000-127ffffffff : vfio-pci
12800000000-12801ffffff : 0000:42:00.0
12800000000-12801ffffff : vfio-pci
13000000000-13801ffffff : PCI Bus 0000:41
13000000000-137ffffffff : 0000:41:00.0
13000000000-130002fffff : BOOTFB
13000000000-130002fffff : simplefb
13800000000-13801ffffff : 0000:41:00.0

The BOOTFB and simplefb entries show something on the host was still using the graphics card's memory instead of it being set to a vfio-pci device. I'm using kernel 5.15, which apparently has this as a bug, whereas 5.12 and 5.4 don't.
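
If anyone wants to check for the same symptom, it shows up directly in /proc/iomem on the host (run as root, otherwise the addresses are masked):

# show which driver claims each range; BOOTFB/simplefb where you expect vfio-pci is the problem
grep -iE 'bootfb|simplefb|vfio-pci' /proc/iomem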

I found a hack to fix it for the 5.15 kernel here.

https://forum.proxmox.com/threads/problem-with-gpu-passthrough.55918/page-2

So far it seems to have worked.
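
A quick way to confirm the fix took (using my card's 41:00.0 address from the listing above) is to check which driver is bound now:

# the card that previously showed BOOTFB/simplefb should now report vfio-pci
lspci -nnk -s 41:00.0
# expected: "Kernel driver in use: vfio-pci"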
 
Indeed, you need the kernel parameter initcall_blacklist=sysfb_init when doing passthrough on kernel 5.15 or higher with an NVIDIA GPU that is used for output during POST/boot/host console.
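
For example, on a Proxmox host that boots via GRUB (a sketch; systemd-boot installs edit /etc/kernel/cmdline and run proxmox-boot-tool refresh instead):

# /etc/default/grub -- append the parameter to the existing kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet initcall_blacklist=sysfb_init"
# then apply and reboot:
# update-grub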
 
