Apologies for the long story.
I work for a university, and my department inherited an old parallel computing cluster (thankfully we're not paying for its power). There were 32 nodes; some had hardware failures, but most were salvageable. We wanted to use them to build a large Proxmox cluster for deploying cybersecurity lab environments.
I had some students deploy Proxmox on each of the nodes and begin joining them all to a single cluster. At the 17th node, all of the nodes in the cluster began misbehaving and refused to cluster up, even after they were all restarted. The students started over from scratch but hit the same problem a second time; that was the point at which they enlisted my help.
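For completeness, the join procedure was the standard Proxmox one; I wasn't watching every step the students took, so take this as a rough sketch of the equivalent CLI commands rather than an exact transcript:
Code:
# on the first node: create the cluster
pvecm create manticore-2022

# on each additional node: join using the first node's IP
pvecm add 10.201.0.131

# check membership after each join
pvecm status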
They had already put the base Proxmox install on a whole bunch of nodes, so I tried adding the 17th node myself. With 16 nodes joined and operational, there did not appear to be any notable cluster errors on the head node or on any of the 2-3 individual nodes I interrogated. When I added the 17th node, I got the same behavior they did: all of the nodes fell out of the cluster except node 1 and node 17. pvecm said that nodes 1 and 17 were both online, but that all the other nodes were not.
I picked a node and restarted corosync, and at that point the node behaved like node 17 did: it appeared to show up in the cluster, and basic management tasks like viewing the system status through the GUI worked. However, the shell was unavailable, and any connection to that node from the head node was extremely slow. Today I tried restarting corosync & pve-cluster as well.
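For clarity, the restarts were just the usual systemd units, roughly:
Code:
# restart cluster communication and the Proxmox cluster filesystem
systemctl restart corosync
systemctl restart pve-cluster

# then see what this node thinks the membership is
pvecm status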
Important notes: we have _not_ tried replacing the network switch, though I have no indication that it is faulty. Each node has only a single network connection, and I am wondering whether network traffic is the problem; however, this network is used only by these nodes, and ping times between the nodes are still under half a millisecond.
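The ping figures are plain ICMP between node addresses; if it would help, I can also grab corosync's own per-link view with something like:
Code:
# round-trip time between two nodes on the cluster network
ping -c 10 10.201.0.102

# knet link status as corosync sees it from this node
corosync-cfgtool -s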
I put the corosync and pve-cluster logs on my Google Drive since they were too large for the forum. You can see that, aside from the day it was set up, the cluster was idling along fine until yesterday, when I tried to add node 17.
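If the Drive link is inconvenient, the same logs can be regenerated on any node with something like (date approximate):
Code:
journalctl -u corosync --since "2022-11-02" > corosync.log
journalctl -u pve-cluster --since "2022-11-02" > pve-cluster.log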
I'd love any ideas, thoughts, or suggestions about this.
At this point the head node (pve31) doesn't recognize any other nodes as present:
Code:
root@pve31:~# pvecm status
Cluster information
-------------------
Name:             manticore-2022
Config Version:   19
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Nov 3 12:32:17 2022
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.8a63
Quorate:          No

Votequorum information
----------------------
Expected votes:   17
Highest expected: 17
Total votes:      1
Quorum:           9 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.201.0.131 (local)
But the last node added (pve17) thinks the ones I brought online should all be there:
Code:
root@pve20:~# pvecm status
Cluster information
-------------------
Name:             manticore-2022
Config Version:   19
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Nov 3 12:30:07 2022
Quorum provider:  corosync_votequorum
Nodes:            17
Node ID:          0x00000011
Ring ID:          1.8211
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   17
Highest expected: 17
Total votes:      17
Quorum:           9
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.201.0.131
0x00000002          1 10.201.0.102
0x00000003          1 10.201.0.103
0x00000004          1 10.201.0.105
0x00000005          1 10.201.0.107
0x00000006          1 10.201.0.108
0x00000007          1 10.201.0.110
0x00000008          1 10.201.0.111
0x00000009          1 10.201.0.112
0x0000000a          1 10.201.0.113
0x0000000b          1 10.201.0.114
0x0000000c          1 10.201.0.115
0x0000000d          1 10.201.0.116
0x0000000e          1 10.201.0.117
0x0000000f          1 10.201.0.118
0x00000010          1 10.201.0.119
0x00000011          1 10.201.0.120 (local)
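One detail I did notice: both outputs show Config Version 19, so at least these two nodes agree on the corosync config. If it's useful I can run a quick version check on the other nodes as well, something like:
Code:
# compare the cluster-wide config with the local copy corosync actually reads
grep config_version /etc/pve/corosync.conf /etc/corosync/corosync.conf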
I'm really confused.