Hi @hahmed15, welcome to the forum.
You may want to familiarize yourself with the arguments made in this discussion:
I know in the past that the recommended max number of nodes in a cluster was 32, but is this still the case? My boxes are all dual E5-2690v4 with dual 40 Gig Ethernet. I would like to have one cluster with 48 nodes, but is that a bad idea? Should I go two with 24 nodes?
Cheers
Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
Co-worker here,
We're running a Cisco UCS environment, and our specific use case is that we need large amounts of compute at a moment's notice. Ordinarily, smaller clusters would be just fine, but we need to deploy large topologies of VMs and have them distributed across the cluster quickly and efficiently. Doing so across clusters would be burdensome, especially if we want stakeholders to be able to manage their relevant infrastructure in terms of basic troubleshooting.
We've been using proxlb for DRS successfully up to this point with our compute cluster. We also have a separate cluster for management functions.
Current cluster stats:
- 544 CPUs - 30% utilized
- 5.1 TB RAM - 35% utilized
- 170 TB storage - 30% utilized
These stats vary wildly depending on the time of year and our needs.
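For a rough sense of scale, the percentages above can be turned into absolute numbers. This is only a back-of-the-envelope sketch using the totals and utilization figures quoted in the list; the dictionary layout and the helper name `used_and_free` are made up for illustration.

```python
# Cluster totals and utilization fractions, copied from the stats above.
cluster = {
    "cpus": (544, 0.30),
    "ram_tb": (5.1, 0.35),
    "storage_tb": (170, 0.30),
}

def used_and_free(total, utilization):
    """Return (used, free) in the same unit as total."""
    used = total * utilization
    return used, total - used

# Map each resource to its (used, free) pair.
summary = {name: used_and_free(total, util) for name, (total, util) in cluster.items()}

for name, (used, free) in summary.items():
    print(f"{name}: used ~{used:.1f}, free ~{free:.1f}")
```

Since the stats vary a lot over the year, the same calculation is more useful when run against peak figures rather than the snapshot above.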
Our hardware at a glance:
- 14x Cisco UCS M3 + many more to be added.
- ~300 GB RAM per blade
- Redundant SSDs in RAID 1 for boot disk
- Primary storage is Dell ME5045 via LVM on iSCSI
- Secondary storage is a Dell EMC PS6100 cluster via LVM on iSCSI
- Redundant Nexus 5k for the core
Networking at a glance:
- 6 vNICs
- Management is a redundant bond with failover between fabrics
- VMdata uses a similar setup
- Corosync lives on the management bond (only one loop for now)
- Storage lives on 2 dedicated interfaces
TL;DR
Our use case requires that we scale up; otherwise we face burdensome architectural difficulties in managing large topologies of VMs.
Our core questions:
- In one instance a node crashed due to CPU/RAM resource contention during a deployment storm. Are there any safeguards we can put in place to keep that from happening again?
- At what point do we need to start worrying about corosync beyond the default settings?
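To make the first question more concrete, here is the kind of safeguard we have in mind, as a minimal sketch only. It is not proxlb's or Proxmox's API; the `Node` class, `place_vm`, and the 85% RAM cap are all hypothetical. The idea is to refuse any placement that would push a node past a cap, and to reserve the RAM in our own bookkeeping immediately, so a burst of deployments does not all land on the same node based on stale usage numbers.

```python
from dataclasses import dataclass

@dataclass
class Node:
    # Hypothetical bookkeeping record, not a Proxmox object.
    name: str
    ram_total_gb: float
    ram_used_gb: float

def place_vm(nodes, vm_ram_gb, max_ram_fraction=0.85):
    """Pick the node with the lowest RAM usage after placement.

    Returns None when every node would exceed max_ram_fraction,
    so the caller can queue or reject the deployment instead.
    """
    best = None
    best_after = None
    for node in nodes:
        after = (node.ram_used_gb + vm_ram_gb) / node.ram_total_gb
        if after > max_ram_fraction:
            continue  # placement would overcommit this node
        if best_after is None or after < best_after:
            best, best_after = node, after
    if best is not None:
        # Reserve immediately so the next placement sees the updated usage.
        best.ram_used_gb += vm_ram_gb
    return best

# Example with two blades of roughly the size listed above.
nodes = [Node("blade-a", 300, 250), Node("blade-b", 300, 100)]
first = place_vm(nodes, 32)    # blade-a would exceed the cap, so blade-b is chosen
second = place_vm(nodes, 200)  # no node has room under the cap
print(first.name if first else None, second)
```

The same check could be extended to CPU, but the cap value and where the usage numbers come from are assumptions that would need to match the real environment.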