Running a GPU cluster with Proxmox: a good idea, or should I stick with bare metal?

Sandbo

At the moment my team at work has some expiring budget, and if there isn't other equipment we need, we may be able to use it to purchase additional GPUs.
Currently, we have this system:
GPU A+ Server AS-4125GS-TNRT
https://www.supermicro.com/ja/products/system/gpu/4u/as-4125gs-tnrt
which runs bare-metal Linux with 2 CPUs, 1.5 TB of RAM, and one GPU (A100 80 GB).

We are now wondering if we can add 7x H100 NVL 94 GB (the vendor suggests this part will work in the system) and then manage the box through Proxmox.
The plan would be to run a few VMs:

1. Running numerical simulation with 1 GPU
2. Running measurement optimization with 1 GPU
3. Running a local large language model with RAG with 6 GPUs

I have plenty of experience with Proxmox: I have used it since 4.0, installed and maintained over 10 machines myself, and set up GPU passthrough before, but I have never worked at this scale. I want to use Proxmox with static GPU allocation, so I think I can avoid NVIDIA's vGPU licenses and the extra work they involve by simply passing each GPU through to its VM directly.
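For the static allocation, my understanding is that it's just the standard VFIO passthrough setup with one hostpci line per GPU in each VM's config, roughly like the sketch below. The VM ID and PCI addresses are placeholders; the real ones would come from lspci -nn | grep -i NVIDIA on the host.

Code:
# /etc/modules -- load the VFIO modules at boot
vfio
vfio_iommu_type1
vfio_pci

# /etc/pve/qemu-server/103.conf (excerpt) -- e.g. the 6-GPU LLM VM
# one hostpci line per passed-through GPU; addresses below are placeholders
machine: q35
bios: ovmf
cpu: host
hostpci0: 0000:01:00.0,pcie=1
hostpci1: 0000:02:00.0,pcie=1
hostpci2: 0000:41:00.0,pcie=1
hostpci3: 0000:42:00.0,pcie=1
hostpci4: 0000:81:00.0,pcie=1
hostpci5: 0000:82:00.0,pcie=1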

I wonder if I am being too naive about this; any insights or concerns would be appreciated.
 
We are currently setting up GPU passthrough with a pair of H100 NVL cards. So far functionality and performance look good, but the VM takes a very long time to boot. This appears to be an issue with VFIO and memory mapping; as far as I can tell, QEMU has to pin and map all of the guest's RAM for device DMA before the VM starts, so the delay grows with the amount of memory assigned, although the specifics are a bit out of my depth.

I've tried some workarounds with hugepages and NUMA pinning, but nothing seems to have had a noticeable impact.
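For anyone searching later, this is roughly what I've been experimenting with; the VM ID, memory size, CPU ranges, and PCI addresses below are only illustrative, not our exact machine.

Code:
# Host kernel cmdline (set in /etc/default/grub, then update-grub and reboot):
#   amd_iommu=on iommu=pt default_hugepagesz=1G hugepagesz=1G hugepages=512

# /etc/pve/qemu-server/100.conf (excerpt)
cpu: host
sockets: 1
cores: 32
memory: 524288
numa: 1
hugepages: 1024
numa0: cpus=0-31,hostnodes=0,memory=524288,policy=bind
hostpci0: 0000:c1:00.0,pcie=1
hostpci1: 0000:c2:00.0,pcie=1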

TL;DR: from recent experience it should work, but maybe with some tradeoffs.