there are clusters out there with 20-30 nodes at least, but it really depends on the hardware and network setup whether that works in practice or not. there is no hard limit enforced anywhere in our or corosync's code.
Just an update from my side. I ran several tests, such as skipping a node and changing switch ports, but I was not able to cluster more than 12 nodes. Interestingly, the load is quite high for a moment when adding a node after a few already exist (wild guess: does the join effort grow exponentially with the number of nodes?), and for a few seconds a node may be slow. I only noticed because I had interactive shells open and saw "top -d 0.1" lagging. There are resources that say PVE sync is time-critical and that network latency should be below 5 ms, and if I can visibly notice a lag in top, the latency (of top) is much larger than that. So maybe you need a lot of computing power to add nodes when you already have a few (for example, if corosync or its algorithm is not well optimized or not suited to scale beyond 10 or so). I think that in the past, when I had issues adding even fewer nodes, it was possibly caused by other running processes. So this time I kept the load low (i.e. not a single instance running whatsoever) during the join phase, and I was able to join 12 nodes.
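For what it's worth, the network latency part can be measured directly rather than guessed from a lagging top. Below is a minimal sketch in Python, assuming Linux iputils-style ping output and taking the 5 ms figure mentioned above at face value (it is not verified here as an official limit). The function names and the sample output are made up for illustration.

```python
import re
import subprocess

# Matches the "time=0.412 ms" field printed by Linux iputils ping.
RTT_RE = re.compile(r"time[=<]([\d.]+)\s*ms")


def parse_rtts(ping_output):
    """Extract round-trip times (in milliseconds) from ping output text."""
    return [float(value) for value in RTT_RE.findall(ping_output)]


def within_budget(rtts, limit_ms=5.0):
    """Return True if there is at least one sample and all are below limit_ms.

    The 5 ms default is the figure quoted in this thread, an assumption,
    not a documented hard limit.
    """
    return bool(rtts) and max(rtts) < limit_ms


def ping_rtts(host, count=20):
    """Run the system ping (`-c` sets the packet count) and return the RTTs."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True, text=True, check=False,
    )
    return parse_rtts(result.stdout)


# Made-up sample output, for illustration only (not from a real cluster).
SAMPLE = """\
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.212 ms
64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=0.187 ms
64 bytes from 10.0.0.2: icmp_seq=3 ttl=64 time=6.40 ms
"""

samples = parse_rtts(SAMPLE)
print(samples, "within budget:", within_budget(samples))
```

On a real node you would call `ping_rtts()` with the address of another node on the cluster network, ideally once while idle and once during a join, and pass the result to `within_budget()`.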
As I was tired of reinstalling nodes and entering the hostname and everything else every time, we decided to stick with a 12-node cluster and see if it is stable (so far it is).
Interestingly, putting load on the nodes after the join fortunately does not seem to cause issues; only the join itself seems to need a lot of CPU and a low load for a short moment. Maybe someone could run gprof on the join code; in the best case some loop is just missing a break or something like that.
yeah, joining is for sure an intensive task on bigger clusters, since it needs to sync up the state from scratch. I will see if I can find some way to improve matters there