[SOLVED] Cluster died, can't get it going again

CRCinAU

Well-Known Member
May 4, 2020
crc.id.au
Hi all,

I have a two-node cluster that started misbehaving a couple of days ago. No matter what I did, I couldn't get it back up and healthy again. The second node always hung while starting pveproxy, in the `pvecm updatecerts` command. Once that hung, I wasn't able to recover.

As the second node only has two small VMs for testing on it, I did a `dd` backup of the VMs, copied them out and did a full, clean install of PVE 8.0.3. I then manually removed the cluster configuration from the first node, following the directions under the heading 'Separate a Node Without Reinstalling' here: https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node

This removed the cluster configuration and left me back with the first node being standalone and a freshly installed second node.

I cleaned up the `authorized_keys` and `known_hosts` files in `/etc/pve/priv/` and cleaned up in `/root/.ssh` as well. So far so good.

Now, when I try to join the two nodes together again in a new cluster, things get wacky. Corosync says that both nodes are fine, but the PVE layers are not happy at all. There's strange behaviour when accessing things in `/etc/pve/`, and on the clean second node pveproxy fails to launch, again stuck on the `updatecerts` script.

I have the two nodes running in standalone at the moment, and everything is operating just fine.

What am I missing?

EDIT: Some other strange observations:
* in certain subdirectories in /etc/pve/, I can't even bring up a directory listing on the second node - but can without issue on the first.
* On the newly joined node that is misbehaving, I get errors in dmesg about a hung task, which looks like an underlying filesystem has gone away.
* In the Web UI on the first node, I can see the second, and can even get stats from it - but no details pages work (probably because pveproxy is failing on the second node). It does however show a green tick to show it's online.
* I can open a shell via the web UI on the first node to the second node perfectly.
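
A directory listing that hangs under /etc/pve can be probed without wedging the terminal you are typing in. This is only a minimal sketch, assuming GNU coreutils `timeout` is installed; `probe_dir` is a hypothetical helper name (not a Proxmox tool) and the 5-second default is arbitrary:

```shell
# probe_dir DIR [SECONDS]: hypothetical helper, not part of PVE.
# Lists DIR under a time limit so a stuck mount does not block the shell.
# Caveat: a process stuck in uninterruptible sleep may not react to the
# signal, in which case timeout itself may not return promptly.
probe_dir() {
    dir=$1
    limit=${2:-5}
    if timeout "$limit" ls -A "$dir" >/dev/null 2>&1; then
        echo "ok: $dir"
    else
        echo "FAILED or timed out: $dir"
        return 1
    fi
}
```

It could be run on both nodes as `for d in /etc/pve /etc/pve/nodes/*; do probe_dir "$d"; done` to compare which directories respond.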
 
Trying again to join the cluster together - after adding the Join information to the second node, the output I get is:

Code:
Establishing API connection with host '172.31.1.1'
Login succeeded.
check cluster join API version
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service

Then nothing more...

`pvecm status` on node #1 says:
Code:
Cluster information
-------------------
Name:             Home
Config Version:   2
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sat Jul 15 13:07:00 2023
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.9
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 <ipv6_prefix>::1%32576 (local)
0x00000002          1 <ipv6_prefix>::2%32576

corosync-quorumtool:
Code:
Quorum information
------------------
Date:             Sat Jul 15 13:08:10 2023
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1.9
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
         1          1 mel-pm (local)
         2          1 mel-pm2

All the Web UI components accessed via mel-pm to mel-pm2 time out.

At the moment, I can't seem to log in via SSH to mel-pm2 either.
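
Both outputs report the cluster as quorate even though the second node is unusable, so the quorum field on its own says nothing about the layers above corosync. As a rough sketch for pulling a single field out of that kind of output when comparing nodes, the following assumes the `Name:   value` layout shown in this thread; `status_field` is a hypothetical helper name:

```shell
# status_field NAME: hypothetical helper. Reads `pvecm status` or
# `corosync-quorumtool` style text on stdin and prints the value of the
# first line that starts with "NAME:", with surrounding blanks trimmed.
status_field() {
    awk -v name="$1" '
        index($0, name ":") == 1 {
            v = substr($0, length(name) + 2)
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            print v
            exit
        }'
}
```

For example, `pvecm status | status_field Quorate` would print `Yes` for the output in this post, which only reflects corosync's vote count.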
 
More fun and broken things - when trying to view the Cluster information on the Web UI of mel-pm (aka node 1):

pm-cluster-error.png

Also, when looking at /etc/pve/nodes/mel-pm2/ on mel-pm, there's zero content:
1689391348015.png
 
As I can only get into mel-pm2 via screen + keyboard at the moment, this is some of the fun in journalctl on that node:
1689392175361.png
 
Interestingly, this seems to come down to some kind of network adapter problem. I'm stuck using a USB3 Ethernet adapter on this Intel NUC, as the e1000 driver has many problems that I can't seem to overcome. Simply switching the USB port that the Ethernet adapter was plugged into fixed it, and now everything functions perfectly.

I'm just about lost for words...

Just goes to show what a layer 1 network problem can cause!
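
Since the cause turned out to be at the physical layer, the kernel's per-interface counters are one place where such a fault might show up before the cluster stack complains. A rough sketch that reads the standard Linux sysfs statistics files; `nic_counters` and its optional base-directory argument are hypothetical names for illustration, and the interface name has to be replaced with the real one:

```shell
# nic_counters IFACE [BASE]: hypothetical helper. Prints link-flap and
# error counters for IFACE from sysfs. BASE defaults to /sys/class/net.
# Files that do not exist on this kernel are silently skipped.
nic_counters() {
    iface=$1
    base=${2:-/sys/class/net}
    for f in carrier_changes statistics/rx_errors statistics/tx_errors \
             statistics/rx_dropped statistics/tx_dropped; do
        path="$base/$iface/$f"
        if [ -r "$path" ]; then
            printf '%s=%s\n' "${f##*/}" "$(cat "$path")"
        fi
    done
}
```

Running it twice a few minutes apart and comparing the numbers is the idea: counters that keep climbing, or a rising `carrier_changes`, would suggest the link is dropping or corrupting traffic.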