Running a GPU cluster with Proxmox, a good idea or should I stick with bare metal?

Sandbo · Sep 1, 2024

At the moment my team at work has some expiring budget, and if there isn't other equipment to spend it on, we may be able to use it to purchase additional GPUs.
Currently, we have this system:
GPU A+ Client System AS-4125GS-TNRT
https://www.supermicro.com/ja/products/system/gpu/4u/as-4125gs-tnrt
which runs bare metal Linux, with 2 CPUs, 1.5 TB RAM and one GPU (A100 80GB).

We are now wondering if we can get an additional 7x H100 NVL 94 GB (vendor suggests this part can work in the system), and then maintain it through Proxmox.
The plan is to have a few VMs:

1. Running numerical simulation with 1 GPU
2. Running measurement optimization with 1 GPU
3. Running a local large language model with RAG with 6 GPUs
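
To make the static split above concrete, here is a rough Python sketch that maps 8 GPUs onto the three VMs (1 + 1 + 6) and prints the matching `qm set ... -hostpciN` commands. The VM IDs (101 to 103) and the PCI addresses are made up for illustration; the real addresses would come from `lspci` on the host. It also assumes the VMs use the q35 machine type, since `pcie=1` needs it.

```python
from typing import Dict, List


def plan_static_allocation(gpus: List[str], demand: Dict[int, int]) -> Dict[int, List[str]]:
    """Hand out GPU PCI addresses to VM IDs in order, one fixed set per VM."""
    needed = sum(demand.values())
    if needed > len(gpus):
        raise ValueError(f"need {needed} GPUs, only {len(gpus)} available")
    plan: Dict[int, List[str]] = {}
    cursor = 0
    for vmid, count in demand.items():
        plan[vmid] = gpus[cursor:cursor + count]
        cursor += count
    return plan


def qm_commands(plan: Dict[int, List[str]]) -> List[str]:
    """Render one `qm set` command per VM, one hostpciN entry per GPU."""
    cmds = []
    for vmid, addrs in plan.items():
        # pcie=1 assumes the VM is configured with the q35 machine type.
        opts = " ".join(f"-hostpci{i} {addr},pcie=1" for i, addr in enumerate(addrs))
        cmds.append(f"qm set {vmid} {opts}")
    return cmds


if __name__ == "__main__":
    # Hypothetical PCI addresses for the 8 GPUs (1x A100 + 7x H100 NVL).
    gpus = [f"0000:{bus:02x}:00.0" for bus in range(1, 9)]
    # Hypothetical VM IDs: simulation, optimization, LLM + RAG.
    demand = {101: 1, 102: 1, 103: 6}
    for cmd in qm_commands(plan_static_allocation(gpus, demand)):
        print(cmd)
```

Which physical card goes to which VM (for example, whether the A100 ends up in one of the single-GPU VMs) is decided only by the order of the `gpus` list here, so that ordering is the part to adjust by hand.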

I have lots of experience with Proxmox since 4.0: I have installed and maintained over 10 machines by myself, and I have done GPU passthrough. But I have never worked on something at this scale. I want to use Proxmox and will likely have static GPU allocation, so I think I don't need Nvidia's vGPU licenses or any additional work beyond directly passing through the GPUs one by one.
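
Since passing GPUs through one by one depends on how the host groups them under IOMMU, a small check like the following could be run on the Proxmox host first. It is only a sketch: it walks the standard sysfs layout (`/sys/kernel/iommu_groups/<group>/devices/<address>/vendor`) and lists devices with the NVIDIA PCI vendor ID `0x10de`. The `sys_root` parameter is my own addition so the function can be pointed at a test directory.

```python
import os
from typing import Dict, List

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID for NVIDIA


def nvidia_iommu_groups(sys_root: str = "/sys") -> Dict[int, List[str]]:
    """Return {iommu_group: [pci_address, ...]} for NVIDIA devices only."""
    groups_dir = os.path.join(sys_root, "kernel", "iommu_groups")
    result: Dict[int, List[str]] = {}
    if not os.path.isdir(groups_dir):
        # IOMMU is disabled or not exposed; nothing to report.
        return result
    group_ids = sorted(int(g) for g in os.listdir(groups_dir) if g.isdigit())
    for group in group_ids:
        dev_dir = os.path.join(groups_dir, str(group), "devices")
        if not os.path.isdir(dev_dir):
            continue
        for addr in sorted(os.listdir(dev_dir)):
            vendor_file = os.path.join(dev_dir, addr, "vendor")
            try:
                with open(vendor_file) as fh:
                    vendor = fh.read().strip().lower()
            except OSError:
                continue  # device vanished or is unreadable; skip it
            if vendor == NVIDIA_VENDOR_ID:
                result.setdefault(group, []).append(addr)
    return result


if __name__ == "__main__":
    for group, addrs in nvidia_iommu_groups().items():
        print(f"IOMMU group {group}: {', '.join(addrs)}")
```

If two GPUs meant for different VMs show up in the same group, they generally cannot be split between those VMs without further work, so this is worth checking before committing to the static layout.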

I wonder if I am being too naive about this; any insights and concerns will be appreciated.
