Small Datacenter Setup - What is the maximum number of PVE servers supported in a cluster?

hahmed15

New Member
Apr 22, 2026
Hello,

We are running a small data center with 30 servers. So far we have added 14 PVE servers, and we intend to add 15 more.

We read somewhere that we need to make some "tweaks" to the PVE cluster management software, otherwise it may run into issues.

Could someone kindly point us towards the issue and its resolution? We are in production, so we are very hesitant to make changes that could bring our entire cluster down.

Thank you
 
Hi @hahmed15 , welcome to the forum.

You may want to get familiar with arguments expressed in this discussion:

Cheers


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Co-worker here,

We're running Cisco UCS in an environment where our specific use case requires large amounts of compute at a moment's notice. Ordinarily smaller clusters would be just fine, but we need to deploy large topologies of VMs and have them distributed across the cluster quickly and efficiently. Doing so across multiple clusters would be burdensome, especially if we want stakeholders to be able to manage their relevant infrastructure for basic troubleshooting.

We've been using proxlb for DRS successfully up to this point with our compute cluster. We also have a separate cluster for management functions.

Current cluster stats:
- 544 CPUs - 30% utilized
- 5.1 TB RAM - 35% utilized
- 170TB Storage - 30% utilized

These stats vary wildly depending on the time of year and our needs.

Our hardware at a glance:
- 14x Cisco UCS M3, with many more to be added
- ~300 GB RAM per blade
- Redundant SSDs in RAID 1 for the boot disk
- Primary storage: Dell ME5045 via LVM on iSCSI
- Secondary storage: Dell EMC PS6100 cluster via LVM on iSCSI
- Redundant Nexus 5K for the core

Networking at a glance:
- 6 vNICs
- Management is a redundant bond with failover between fabrics
- VM data uses a similar setup
- Corosync lives on management bond (only one loop for now)
- Storage lives on 2 dedicated interfaces

TL;DR
Our use case requires that we scale up, or face burdensome architectural difficulty in managing large topologies of VMs.


Our core questions:
- In one instance a node crashed due to CPU/RAM resource contention during a deployment storm. Are there any safeguards we can put in place to prevent that from happening again?

- At what point do we need to start worrying about corosync beyond the default settings?
 
Did you read the thread linked by @bbgeek17? It contains two posts by @Maximiliano. Basically, up to around 24 nodes the defaults are fine if your hardware and network setup follows the documentation recommendations (e.g. dedicated redundant networks for corosync).
For more nodes you need to tweak the corosync defaults; this is explained by Maximiliano on the thread's second page.
It's probably a good idea to get a Proxmox Partner on board to aid in the architecture and setup of your cluster.
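For context, those tweaks live in the totem section of /etc/corosync/corosync.conf. The values below are illustrative placeholders only, not recommendations; confirm concrete numbers with Proxmox support or a partner before touching a production cluster:

```
totem {
  # Base token timeout in ms; corosync adds more per node at runtime.
  # Raising it lets a larger ring tolerate brief latency spikes, but it
  # also slows down failure detection. Placeholder value for illustration.
  token: 10000
  # How many token retransmits before a node is declared lost
  # (upstream default is 4).
  token_retransmits_before_loss_const: 10
}
```

Note that pmxcfs/pve-cluster manages corosync.conf, so edits should go through the documented procedure for changing cluster configuration rather than a bare text edit on one node.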
 
We're running Cisco UCS in an environment where our specific use case requires large amounts of compute at a moment's notice. Ordinarily smaller clusters would be just fine, but we need to deploy large topologies of VMs and have them distributed across the cluster quickly and efficiently. Doing so across multiple clusters would be burdensome, especially if we want stakeholders to be able to manage their relevant infrastructure for basic troubleshooting.
Read the link @bbgeek17 referenced. When you're done, you should realize that the problem you will run into isn't just how many NODES are in the cluster, but also how many virtual resources. PVE's solution for cluster metadata coordination is clever but does not scale very well; when you have thousands of VMs in the cluster, pvestatd processes get so big that they time out, which causes the respective nodes' proxy services (the API) to become unavailable. It's a bad time.

In practice, you have choices. You don't HAVE to federate all the nodes together; just keep them in islands. If you need cross-island migrations, use PDM, or you can home-grow code using API calls - I've been doing this for years, even before PDM existed.
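As a rough sketch of what such home-grown tooling can look like against the Proxmox VE HTTP API: the hostnames and token credentials below are placeholders, and this only shows the request plumbing, not a full cross-island workflow.

```python
import json
import urllib.request


def api_get(host, path, token_id, token_secret):
    """GET an /api2/json path using a PVE API token (hostnames/tokens are placeholders)."""
    req = urllib.request.Request(
        f"https://{host}:8006/api2/json{path}",
        headers={"Authorization": f"PVEAPIToken={token_id}={token_secret}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["data"]


def migrate_request(node, vmid, target_node, online=True):
    """Build the API path and form payload for an in-cluster VM migration.

    Kept as a pure function so the request shape can be inspected/tested
    before POSTing it with the credentials above.
    """
    path = f"/nodes/{node}/qemu/{vmid}/migrate"
    payload = {"target": target_node, "online": 1 if online else 0}
    return path, payload


# Example: enumerate VMs on one island, e.g. via /cluster/resources,
# then build a migration request for one of them:
# vms = api_get("pve01.example.com", "/cluster/resources?type=vm",
#               "automation@pve!tok", "SECRET")
path, payload = migrate_request("pve01", 100, "pve02")
```

Moving a VM between separate clusters is a different operation from in-cluster migration (typically backup/restore or the remote-migration tooling), so a real cross-island script would layer that on top of plumbing like this.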

HOWEVER, given the scope of your project, PVE might not be the best choice, as this isn't really what it's designed for. Have you looked at CloudStack?
 
This is not a big cluster, but I usually recommend that my clients split clusters into something like 15-20 nodes if they are unsure about their stability.
You now have PDM, and live migration works well.
 
This is not a big cluster, but I usually recommend that my clients split clusters into something like 15-20 nodes if they are unsure about their stability.
You now have PDM, and live migration works well.
I recall one Reddit thread where somebody ran an infrastructure with around 30 nodes. They mentioned that their Proxmox partner and Proxmox support also recommended splitting up their cluster. Unlike Maximiliano's post, the recommendation there was to limit to around 20 nodes instead of 24. Four nodes more doesn't change much though ;) Since they had reasons to avoid the split, they went the route with the tweaks mentioned by @fweber in the same thread as Maximiliano's: https://forum.proxmox.com/threads/proxmox-with-48-nodes.174684/post-825826

According to fweber the defaults are sufficient for around 25 nodes. With more nodes one needs to change corosync parameters; these basically allow more nodes but also make corosync even more sensitive to latency on the corosync network. In other words: you need to design your network accordingly and be careful in choosing the values. The mentioned Reddit user in /r/proxmox pointed out that although this worked for their environment, they would recommend only doing this after getting confirmation from Proxmox support or a partner. The Proxmox developers (according to the Reddit user and the posts by fweber and Maximiliano) are working on recommendations for such cases in the documentation. If I recall correctly, they are also investigating possible improvements to corosync for large clusters together with the corosync developers, but my memory might be wrong.
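To make the latency sensitivity concrete: per the corosync.conf(5) man page, the effective token timeout grows with node count, which is why larger rings get touchier. A small sketch, assuming upstream corosync defaults (the values PVE actually ships in your corosync.conf may differ):

```python
def effective_token_ms(nodes, token=1000, token_coefficient=650):
    """Effective corosync token timeout in ms, per corosync.conf(5):
    token + (nodes - 2) * token_coefficient for clusters of more than
    2 nodes. Defaults here are the documented upstream defaults."""
    if nodes <= 2:
        return token
    return token + (nodes - 2) * token_coefficient


# With upstream defaults, a 24-node ring already waits 15300 ms for the
# token before starting membership recalculation:
for n in (3, 14, 24, 30, 48):
    print(n, effective_token_ms(n))
```

Raising token (or token_coefficient) to accommodate more nodes stretches this window further, so the network carrying corosync has to stay consistently low-latency for the tuning to be safe.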
 