Proxmox with 48 nodes

Nathan Stratton

Well-Known Member
Dec 28, 2018
I know the recommended maximum number of nodes in a cluster used to be 32, but is that still the case? My boxes are all dual E5-2690v4 with dual 40 Gig Ethernet. I would like to have one cluster with 48 nodes, but is that a bad idea? Should I go with two clusters of 24 nodes instead?
 
so before any answer would be applicable...

why? what is your use case?

Also, 40G for cluster traffic is effectively the same as 10G (same latency), so you should be fine, but depending on the REST of your system architecture it will likely not be enough as your cluster gets larger (more prone to contention issues). Not in the sense that the individual links aren't "fast" enough, but that there is insufficient separation of traffic.

Lastly, you might want to consider whether your design is penny wise and pound foolish. Broadwell isn't the most power-efficient architecture in 2025. If you're designing something of actual size, you may be better off with fewer nodes of higher power/performance density. It will cost more to deploy but will pay for itself in a matter of months (18-24 is common) in power and cooling savings.
 
Yes, latency is the same for 40/10, but with dual ports and VLANs for traffic separation, I thought I would be ok. As for why, the hardware was reclaimed and that is what we have... You're right about Broadwell, but again, it's what we have, and I'm not so sure about the cost savings. We compared the E5-2690v4 to EPYC 9575F, based on spec numbers, we can absolutely do it with fewer servers, but surprisingly, the power numbers are not as far apart as I thought.
 
We compared the E5-2690v4 to EPYC 9575F, based on spec numbers, we can absolutely do it with fewer servers, but surprisingly, the power numbers are not as far apart as I thought.
E5-2690v4 is 14c@2.6GHz, 135W.
Epyc 9575F is 64c@3.3GHz, 400W.

Even if we ignore the MUCH newer process node, WAY faster memory, and newer PCIe generations, and just count each at equal IPC:
AMD (64*3300)/400= 528 instructions per watt
Intel (14*2600)/135= 269.

You would literally need half as many. I suppose I don't know what you thought...
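
For anyone who wants to rerun those back-of-the-envelope numbers, here is a minimal Python sketch of the same calculation, using only the core counts, base clocks, and CPU TDPs quoted above (equal IPC assumed; turbo, hyperthreading, and everything outside the CPU package ignored):

```python
# Rough sanity check of the per-watt numbers above, counting only
# cores * base clock (MHz) per watt of CPU TDP, with equal IPC assumed.

def units_per_watt(cores: int, base_mhz: int, tdp_w: int) -> float:
    """Naive 'cpu units' (cores * MHz) per watt of CPU package TDP."""
    return cores * base_mhz / tdp_w

xeon = units_per_watt(cores=14, base_mhz=2600, tdp_w=135)   # E5-2690 v4 as quoted above
epyc = units_per_watt(cores=64, base_mhz=3300, tdp_w=400)   # EPYC 9575F as quoted above

print(f"Xeon E5-2690 v4: {xeon:.0f} units/W")   # ~270
print(f"EPYC 9575F:      {epyc:.0f} units/W")   # ~528
print(f"ratio:           {epyc / xeon:.2f}x")   # ~1.96x
```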
 
2690v4 is actually 22 cores, and you're neglecting the power of the FANS, GPUs, hard drives, etc. But you're right, you need fewer than half as many; still, the power consumption is not that much different between the two in our workloads. Also, when you factor in the price of the used E5-2690v4 versus new Epyc 9575F systems, the break-even is a LOT longer than 18-24 months.
 
Or maybe only a quarter (10-12) is already enough?
A dual-socket system with 2x Intel Xeon E5-2690v4 has the potential for 72800 "cpu units" of performance (more with turbo + hyperthreading, but let's leave that for now).
A dual-socket system with 2x Epyc 9575F has the potential for 422400 "cpu units" (same comment applies) and is roughly twice as power efficient. A single system can replace almost SIX of the Xeon ones, with faster memory, more memory channels, and faster buses for everything from networking to storage.

Technology doesn't stand still. The Xeon part is nearly 10 years old.
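
A similar sketch for the per-node comparison, again counting nothing but cores × base clock (so it ignores memory bandwidth, I/O, and the real-world power draw of fans, drives, and GPUs raised above):

```python
# Same toy "cpu units" metric (cores * base MHz), now per dual-socket node,
# to see how many of the old nodes one new node could replace on paper.
# CPU throughput only: memory bandwidth, I/O, fans, drives, GPUs ignored.

XEON_NODE = 2 * 14 * 2600    # 2x E5-2690 v4  ->  72,800 units
EPYC_NODE = 2 * 64 * 3300    # 2x EPYC 9575F  -> 422,400 units

replacement_ratio = EPYC_NODE / XEON_NODE        # ~5.8 Xeon nodes per EPYC node
epyc_nodes_for_48 = 48 * XEON_NODE / EPYC_NODE   # ~8.3 EPYC nodes

print(f"one EPYC node ~ {replacement_ratio:.1f} Xeon nodes")
print(f"48 Xeon nodes ~ {epyc_nodes_for_48:.1f} EPYC nodes (CPU throughput only)")
```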
 
But the real question still hasn't been answered ... maybe someone would like to build 48 nodes out of AMD EPYC™ 9575F ...
What would the actual limit be, and how could it be reached, e.g. with what corosync tuning? That question is still in the room ...
 
What would the actual limit be, and how could it be reached
The reason you can't find the answer is that it's not something you can answer in a vacuum. As I alluded to above, it depends on just how dependable the network is, and how spammy/sensitive the service using it is.

"conventional wisdom" has been that you don't want to climb beyond 32 nodes for pve use. I would say that unless there is a USE CASE to challenge this limitation, I dont bother investing the time and effort to figure it out.
 
Hi @Nathan Stratton and all,

You need clear guidance here: do not do that unless you have a very compelling reason to.

a) Your hardware is discontinued and past the end of service, which significantly increases the likelihood of component failure.

b) As the number of virtual machines grows, the pmxcfs payload becomes larger and more demanding to synchronize across nodes.

c) Software updates introduce additional complexity and coordination challenges at this scale.

We work with several customers operating at this scale, including those who have tested beyond 32 nodes. None of them found significant value in deploying a single, monolithic cluster. You will be far better off with separate failure domains for fault isolation, simpler management, and easier maintenance.

I always encourage people to ask themselves: "How much of my infrastructure can I tolerate losing if there were an unexpected failure?" If that answer is not 100%, start dividing.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox