Stuck on large cluster deployment

mav

Member
Nov 3, 2022
Apologies for the long story.

I work for a university, and my department inherited an old parallel computing cluster (thankfully we're not paying for its power). There were 32 nodes; some had hardware failures, but most were salvageable. We wanted to turn it into a large Proxmox cluster for deploying cybersecurity lab environments.

I had some students deploy Proxmox on each of the nodes and begin joining them all to a single cluster. When they added the 17th node, all of the nodes in the cluster began misbehaving and refused to cluster up, even after they were all restarted. The students started over from scratch, but hit the same problem a second time. That was the point at which they enlisted my help.

They had already gotten the base Proxmox install onto the nodes, so I tried adding the 17th node myself. With 16 nodes joined and operational, there did not appear to be any notable cluster errors on the head node or on any of the 2-3 individual nodes I interrogated. When I added the 17th node, I got the same behavior they did: all of the nodes fell out of the cluster except node 1 and node 17. pvecm said that nodes 1 and 17 were both online, but all the other nodes were not.

I picked a node and restarted corosync, and at that point it behaved like node 17 did: it appeared to show up in the cluster, and basic management tasks like viewing system status through the GUI worked. However, the shell was unavailable, and any connection to the node from the head node was extremely slow. Today I tried restarting both corosync and pve-cluster.
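
For reference, this is roughly the sequence I have been using on a misbehaving node (just the standard service restarts plus a status and log check):

Code:
# restart the cluster stack on the affected node
systemctl restart corosync
systemctl restart pve-cluster

# then check whether it rejoined and what the logs say
pvecm status
journalctl -u corosync -u pve-cluster --since "1 hour ago"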

Important notes: we have _not_ tried replacing the network switch, though I have no indication that it is faulty. Each node has only a single network connection, and I am wondering whether network traffic is the problem; however, this network is used only by these nodes, and ping times between the nodes are still in the sub-half-millisecond range.
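
For the network side, the checks I have to go on are along these lines (the NIC name is a placeholder for whatever the interface is actually called):

Code:
# knet link state as corosync sees it (per-node connection status)
corosync-cfgtool -s

# latency/loss spot check against another node
ping -c 100 -q 10.201.0.102

# NIC error/drop counters; replace eno1 with the actual interface name
ip -s link show eno1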

I put the corosync and pve-cluster logs on my Google Drive since they were too large for the forum. You can see that, aside from the day it was set up, the cluster was idling fine until yesterday, when I tried to add node 17.

I'd love to get any ideas or thoughts or suggestions about this.

At this point the head node (pve31) doesn't recognize any other nodes as present:

Code:
root@pve31:~# pvecm status
Cluster information
-------------------
Name:             manticore-2022
Config Version:   19
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Nov  3 12:32:17 2022
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.8a63
Quorate:          No

Votequorum information
----------------------
Expected votes:   17
Highest expected: 17
Total votes:      1
Quorum:           9 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.201.0.131 (local)

But the last node added (pve20, the 17th node to join) thinks the ones I brought online should all be there:


Code:
root@pve20:~# pvecm status
Cluster information
-------------------
Name:             manticore-2022
Config Version:   19
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Nov  3 12:30:07 2022
Quorum provider:  corosync_votequorum
Nodes:            17
Node ID:          0x00000011
Ring ID:          1.8211
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   17
Highest expected: 17
Total votes:      17
Quorum:           9
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.201.0.131
0x00000002          1 10.201.0.102
0x00000003          1 10.201.0.103
0x00000004          1 10.201.0.105
0x00000005          1 10.201.0.107
0x00000006          1 10.201.0.108
0x00000007          1 10.201.0.110
0x00000008          1 10.201.0.111
0x00000009          1 10.201.0.112
0x0000000a          1 10.201.0.113
0x0000000b          1 10.201.0.114
0x0000000c          1 10.201.0.115
0x0000000d          1 10.201.0.116
0x0000000e          1 10.201.0.117
0x0000000f          1 10.201.0.118
0x00000010          1 10.201.0.119
0x00000011          1 10.201.0.120 (local)

I'm really confused.
 
Generally speaking, having a 32-node cluster is already quite a lot. The hardware (especially the switches, but also the nodes) would need to be quite performant. What kind of hardware are you using (nodes as well as switches, ...)? I think it might be a better idea to run 2 x 16-node clusters, or even 4 x 8-node clusters. Even 16 is quite a lot of nodes already for a cluster.

That being said, some stuff that could go wrong off the top of my head:

  • Is it always the same node you are trying to join as the 17th, or a different one each time? Maybe there is an issue with a specific node.
  • Is the time synchronized on all nodes? (Quick checks for this and the next two points are sketched after this list.)
  • Do you have HA enabled on any of the nodes?
  • How are the ping times and network traffic when you have 16 nodes joined; is there any way for you to monitor that? Does any packet loss occur?
  • Did you completely reinstall Proxmox before trying again? Starting completely from scratch might be a good idea, in case there is some funky business left over. Just using pvecm delnode might not be sufficient.
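
For the time, HA, and membership questions, something along these lines on each node should answer them (this assumes chrony as the NTP client; adjust if you use a different one):

Code:
# clock state; chrony is assumed as the time sync daemon here
timedatectl
chronyc tracking

# HA manager state (should list no resources if HA is not in use)
ha-manager status

# cluster membership as this particular node sees it
pvecm nodes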

But, as I said, I think it might be a better idea to create several smaller clusters, since Corosync is the limiting factor. Some people are running 30 node clusters, but on very specialized hardware. It might just be that you are hitting the limits of Corosync with your particular setup.
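
If you do go with several smaller clusters, the procedure is just the normal cluster creation repeated per group; a minimal sketch, with a made-up cluster name and freshly reinstalled nodes:

Code:
# on the designated head node of one group
pvecm create lab-cluster-a

# on every other node of that group (fresh install, no guests configured yet),
# pointing at that group's head node IP
pvecm add 10.201.0.131

# repeat with a different head node and cluster name for the next group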
 
I appreciate the detail. Probably the best bet is to break it down into at least two clusters. Really, all I wanted was a unified management interface, so it won't be the end of the world if we have to break it up. This is ultimately just an edu lab, not a production setup, but I still want it to be stable enough that students can use it :)
  • We tried this in a couple of different orders; it's not always the same node.
  • The time appears to be synchronized to within <1 sec.
  • HA is not enabled.
  • I didn't actually check how much network traffic the nodes were generating; ping times between the nodes seem to be consistently sub-half-millisecond, though, with no observed loss.
  • The nodes were all reinstalled from scratch before trying the second time.
I guess this tells me what I really wanted to know, though: I don't want to deal with ongoing cluster instability, so we will probably just break things up into two or three sets of systems.
Thanks again for your info.

I don't suppose there's a way to have some kind of unified management interface for just a few administrative things like authentication without having to build out a full cluster?
 
I don't suppose there's a way to have some kind of unified management interface for just a few administrative things like authentication without having to build out a full cluster?
No, but we will be providing a way to do multi-cluster management in the foreseeable future; I cannot give any guarantees on the arrival date, though.
 
Even 16 is quite a lot of nodes already for a cluster.
The problem of noise and latency grows with the node count: corosync's knet transport maintains a link between every pair of nodes, so 16 nodes already means 120 node-to-node links and 32 would mean 496. The more nodes you add, the more your network has to be able to deal with.

What is the interconnect? How many links does each node have and at what speed, and how are they configured? In general, you really want a MINIMUM of two 10 Gbit links per node, on two separate subnets (and switches, ideally) for a two-ring corosync setup. Public/management interfaces should be on different NICs in a perfect world too, and so should Ceph if you're using it.
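
For what it's worth, a second corosync link is specified at cluster creation/join time. Roughly like this, where the cluster name and the second subnet are made up for illustration:

Code:
# first node: create the cluster with two corosync links on separate subnets
pvecm create examplecluster --link0 10.201.0.131 --link1 10.202.0.131

# joining nodes: point at an existing member and give the local address for each link
pvecm add 10.201.0.131 --link0 10.201.0.102 --link1 10.202.0.102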

The content of your /etc/network/interfaces and your switch make/topology would help.
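
For reference, a node with a dedicated corosync NIC typically has an /etc/network/interfaces along these lines (interface names, addresses, and the gateway here are examples only):

Code:
auto lo
iface lo inet loopback

# first NIC feeds the management/VM bridge
iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
        address 10.201.0.102/24
        gateway 10.201.0.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0

# second NIC dedicated to corosync, separate subnet, no gateway
auto eno2
iface eno2 inet static
        address 10.202.0.102/24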
 
What is the interconnect? How many links does each node have and at what speed, and how are they configured? In general, you really want a MINIMUM of two 10 Gbit links per node, on two separate subnets (and switches, ideally) for a two-ring corosync setup.
Uh
I ... was not aware of any of that. The documentation basically just says "hey, maybe you might want to have a dedicated interface for cluster traffic"; it does not indicate that you need a very large amount of money's worth of equipment to set up a cluster. It just says you need to keep the latency low, which you can easily do with 1 Gbit, so I'm a little confused about why the cluster itself would need such high-bandwidth interfaces.

If that is the case, then clustering is effectively out of reach for anyone but very large businesses.

To say that we neither have nor can get such a network is an understatement. The cost of dual 10 Gbit links to each of these systems would exceed the budget of my entire department, including personnel.

I guess I'll try something else.
 
This is of course the reality of any system building.

The documentation provides enough for the lowest common denominator, but beyond that it is increasingly up to the knowledge, experience, and expertise of the architect/sysadmin putting it all together to understand the finer points, which is how they make their living :) This isn't limited to Proxmox; it's basically a function of the profession. The more moving parts your system has, the smaller your tolerances.

Ultimately, what is your goal with these systems? As an educational tool, reaching the breaking point of the cluster and challenging the students to figure out why and how to fix it is a fantastic use, imo. Using them as a production cluster is really pointless without application, storage, and networking commensurate with the scale of operation.
 
