Running a GPU cluster with Proxmox: a good idea, or should I stick with bare metal?

Sandbo

Well-Known Member
Jul 4, 2019
My team at work has some budget that is about to expire, and if there is no other equipment to spend it on, we may be able to use it to purchase additional GPUs.
Currently, we have this system:
GPU A+ Client System AS -4125GS-TNRT
https://www.supermicro.com/ja/products/system/gpu/4u/as -4125gs-tnrt
which runs bare metal Linux, with 2 CPUs, 1.5 TB RAM and one GPU (A100 80GB).

We are now wondering whether we can add 7x H100 NVL 94 GB (the vendor suggests this part can work in the system) and then manage the machine through Proxmox.
The plan is to have a few VMs:

1. Running numerical simulation with 1 GPU
2. Running measurement optimization with 1 GPU
3. Running a local large language model with RAG with 6 GPUs

I have a lot of experience with Proxmox going back to 4.0: I have installed and maintained more than 10 machines by myself and have done GPU passthrough before. But I have never worked on something at this scale. I want to use Proxmox and will likely have a static GPU allocation, so I think I don't need Nvidia's vGPU licenses or any additional work beyond passing the GPUs through directly, one by one.
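To make the static allocation concrete, here is a rough sketch of how the GPU-to-VM mapping could be written down and turned into `qm set` passthrough commands. The VM IDs (101 to 103) and the PCI addresses are placeholders I made up for illustration; the real addresses would have to be read from `lspci` on the host, and `pcie=1` assumes the VMs use the q35 machine type.

```python
from typing import Dict, List

# Hypothetical VM IDs mapped to GPU counts, mirroring the plan above (1 + 1 + 6).
PLAN: Dict[int, int] = {101: 1, 102: 1, 103: 6}

# Placeholder PCI addresses; the real ones must come from `lspci` on the host.
GPU_ADDRS: List[str] = [f"0000:{bus:02x}:00.0" for bus in range(0x10, 0x18)]


def build_qm_commands(plan: Dict[int, int], addrs: List[str]) -> List[str]:
    """Give each GPU to exactly one VM and emit one `qm set` line per GPU."""
    if sum(plan.values()) > len(addrs):
        raise ValueError("plan requests more GPUs than the host has")
    pool = iter(addrs)
    commands = []
    for vmid, count in plan.items():
        for slot in range(count):
            # hostpciN is the per-VM passthrough slot; pcie=1 assumes a q35 VM.
            commands.append(f"qm set {vmid} -hostpci{slot} {next(pool)},pcie=1")
    return commands


if __name__ == "__main__":
    # Only prints the commands; nothing is applied to the host.
    for line in build_qm_commands(PLAN, GPU_ADDRS):
        print(line)
```

The script only prints the commands so the mapping can be reviewed before anything touches the host. Keeping the plan in one dictionary also makes it easy to check that no GPU is assigned twice.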

I wonder if I am being too naive about this. Any insights or concerns would be appreciated.
 
