Powerful virtual machine

GoXLd

New Member
Jan 18, 2025
Hello, everyone!

I have multiple nodes and would like to combine their computing resources to create a single powerful virtual machine. However, after searching the forum, I found old discussions dating back to 2017, and it seems that this feature is still not available. Has anything changed?

While reading the Proxmox Datacenter Manager roadmap, I noticed the following points:

  • First-class SDN integration, for example, for setting up EVPN between different clusters. Probably the most important feature in the long run.
  • Stretching EVPN VNets across clusters
  • Support for multiple VRFs across clusters
  • Automatic configuration of RT Import/Export
This means that there are plans for EVPN implementation, but as I understand it, there is still no native way to pool compute resources across nodes into a single VM.

I might be missing something—does Proxmox currently support any approach for aggregating CPU and RAM across multiple nodes to create a single high-performance VM? If not, are there any workarounds or alternative solutions within Proxmox?

I would appreciate any insights, as I would prefer not to switch to another hypervisor.
 
Hi GoXLd,

is there a specific hypervisor you are thinking about? I don't know of any that claims to interconnect CPU and RAM across physical server boundaries via network connections. That's the main problem I see with this endeavour: RAM speeds handled on the mainboard far exceed what typical NICs can do, and the same goes for CPUs regarding latency. As a rough comparison, a single DDR4-3200 channel moves around 25 GB/s at roughly 100 ns latency, while even a 100 GbE link tops out near 12.5 GB/s with latencies in the microseconds. These seem like limits that are hard to bypass in software, no matter how well it is written.

Storage is another matter: you can easily span Ceph across your nodes. But here too, decent networking hardware is key for performance!

Depending on your computing needs, you could use a clustering workload manager like Slurm [0], which is built for distributed workloads. This is quite different from a 'single' VM, though: it is a way to distribute highly parallel workloads onto distinct computers/servers/VMs, as in the sketch below. Another project that comes to mind in this respect is BOINC [1].
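If it helps, here is a minimal, purely hypothetical Python sketch of how a workload typically meets Slurm: the scheduler starts one copy of the script per task (e.g. via srun) and sets a few environment variables, and each copy then works on its own slice of the problem. The work items and the process() function are just placeholders, not anything Proxmox-specific.

```python
# Hypothetical task script that Slurm would launch once per task,
# e.g.:  srun --nodes=4 python worker.py
import os

def process(item):
    # placeholder for the real per-item computation
    return item * item

def main():
    rank = int(os.environ.get("SLURM_PROCID", 0))   # index of this task
    world = int(os.environ.get("SLURM_NTASKS", 1))  # total number of tasks
    work_items = range(1000)                        # hypothetical workload
    my_items = [i for i in work_items if i % world == rank]
    results = [process(i) for i in my_items]
    print(f"task {rank}/{world} processed {len(results)} items")

if __name__ == "__main__":
    main()
```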

Best regards,
Daniel

[0] https://slurm.schedmd.com/overview.html
[1] https://boinc.berkeley.edu/
 
Hi Daniel,

Thank you for your response!

Virtualization is new to me: I've been working in networking for the past few years and only started exploring virtualization recently. To add more context, I wanted to try something different: creating one big machine out of multiple nodes, similar to link aggregation (EtherChannel) in networking, or to how resources are pooled in distributed computing.

However, the more I read, the more I understand why this is not feasible—primarily due to speed differences between RAM and network interfaces. The latency and bandwidth limitations make it nearly impossible to share CPU and RAM between nodes in real-time.

But what about distributing the workload of a single large task?

My idea is to run heavyweight AI models for training and inference. Instead of trying to aggregate hardware resources into a single VM, would it be possible to distribute one AI workload across multiple nodes in a cluster?

I've read about Slurm, Kubernetes, Ray, and DeepSpeed, but I'm still trying to figure out the best approach for running a large AI model (400GB in size) efficiently across multiple nodes.
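To show roughly what I have in mind (and please correct me if this is the wrong direction), here is a hypothetical Ray sketch. To be clear about the assumptions: it only spreads independent tasks across the nodes of an already running Ray cluster; actually sharding a single 400GB model across nodes would need the model-parallel features of something like DeepSpeed, which I haven't figured out yet. heavy_task() is just a placeholder.

```python
# Rough sketch: spreading independent work units over a Ray cluster.
# Assumes `pip install ray` and a cluster started with `ray start` on each node;
# heavy_task() is a hypothetical placeholder for the real computation.
import ray

ray.init(address="auto")  # attach to the existing cluster instead of starting a local one

@ray.remote
def heavy_task(chunk_id):
    # placeholder, e.g. processing one shard of a dataset or one batch of requests
    return f"chunk {chunk_id} done"

# Ray schedules these tasks across all nodes that joined the cluster
futures = [heavy_task.remote(i) for i in range(16)]
print(ray.get(futures))
```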

Would love to hear your thoughts on this!

Best regards,
 
Usually such things are done at the application level (like the distributed computing you mentioned; one prominent example would be the SETI@home project), where an application is designed to distribute its tasks onto a fleet of several clients/nodes. Another example is what large research institutions do for their simulations and calculations. For instance, I would expect that CERN knows a lot about this topic (their papers on storage, e.g. their Ceph cluster, are quite insightful; maybe they have something on their compute clusters as well).
So if you want to do this, you need to use software (e.g. an AI/LLM framework) that is designed for such use cases, or develop your own.

Kubernetes, OpenStack or Proxmox VE in the end only give you the infrastructure; for your applications to actually do the work, you still need to develop and set up the applications, whether you develop them yourself or use already existing ones. At least that holds if we talk about virtualization architectures like VMware and KVM (the kernel-level hypervisor of Linux which is utilized by Proxmox VE); maybe supercomputers or other platforms for the computational sciences are designed differently (I have no idea, to be honest, and haven't met my former coworker who now works at the computing center of our old university for a long time, otherwise I would ask him ;) )
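To illustrate what "designed for such use cases" means at the application level, here is a minimal, hypothetical PyTorch sketch. It assumes PyTorch is installed and that a launcher such as torchrun starts one copy of the script per node and sets the rank environment variables; the tiny model is just a stand-in for a real one.

```python
# Minimal sketch of an application that is *designed* to be distributed:
# a launcher (e.g. torchrun) starts one process per node/GPU and sets
# RANK / WORLD_SIZE / MASTER_ADDR; the script then trains cooperatively.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="gloo")    # "nccl" on GPU nodes
    rank = dist.get_rank()

    model = torch.nn.Linear(10, 1)             # toy stand-in for a real model
    ddp_model = DDP(model)                     # gradients are synced across all processes
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for step in range(10):
        inputs = torch.randn(32, 10)           # each rank trains on its own data shard
        loss = ddp_model(inputs).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()                        # gradient all-reduce happens here
        optimizer.step()

    if rank == 0:
        print("finished distributed toy training")

if __name__ == "__main__":
    main()
```

The same script is then launched on every node by the launcher, and PyTorch synchronises the gradients over the network; the infrastructure layer (Proxmox VE, Kubernetes, bare metal) does not need to know anything about it.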
 