[SOLVED] Cluster died, can't get it going again

CRCinAU

Hi all,

I have a two-node cluster that started misbehaving a couple of days ago. No matter what I did, I couldn't get things back up and happy again. The second node always hung while starting pveproxy, in the `pvecm updatecerts` command. Once that hung, I wasn't able to recover.

As the second node only has two small test VMs on it, I took a `dd` backup of the VMs, copied them off, and did a full, clean install of PVE 8.0.3. I then manually removed the cluster configuration from the first node, following the directions under the heading 'Separate a Node Without Reinstalling' here: https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node

This removed the cluster configuration and left me back with the first node being standalone and a freshly installed second node.

I cleaned up the `authorized_keys` and `known_hosts` files in `/etc/pve/priv/` and cleaned up in `/root/.ssh` as well. So far so good.

Now, when I try to join the two nodes together again in a new cluster, things get wacky. Corosync says that both nodes are fine, but the PVE layers are not happy at all. There's strange behaviour when accessing things in /etc/pve/, and on the second (clean) node pveproxy fails to launch, again stuck on the updatecerts script.

I have the two nodes running in standalone at the moment, and everything is operating just fine.

What am I missing?

EDIT: Some other strange observations:
* In certain subdirectories of /etc/pve/, I can't even bring up a directory listing on the second node, but can without issue on the first.
* On the newly joined, misbehaving node I get errors in dmesg about a hung task, which looks like an underlying filesystem having gone away.
* In the Web UI on the first node, I can see the second and can even get stats from it, but no details pages work (probably because pveproxy is failing on the second node). It does, however, show a green tick to indicate it's online.
* I can open a shell via the web UI on the first node to the second node perfectly.
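
For anyone chasing the same symptom, here is a minimal sketch of how the "directory listing hangs" observation could be checked without locking up the terminal. The function name `probe_dir` and the 5-second default are made up for illustration. It relies on `timeout` from GNU coreutils, which exits with status 124 when the time limit is hit. One caveat: if the `ls` process ends up in uninterruptible sleep (which the hung-task messages in dmesg suggest can happen), `timeout` may not be able to kill it, so treat this as a rough probe only.

```shell
#!/bin/sh
# probe_dir DIR [SECONDS]
# Hypothetical helper: report whether listing DIR finishes within the limit.
probe_dir() {
    dir=$1
    limit=${2:-5}   # assumed default of 5 seconds
    if timeout "$limit" ls -A "$dir" >/dev/null 2>&1; then
        echo "ok: $dir"
    else
        status=$?   # exit status of the timeout/ls command above
        if [ "$status" -eq 124 ]; then
            # 124 is what GNU timeout returns when the limit was reached
            echo "hang: $dir (no answer within ${limit}s)"
        else
            echo "error: $dir (ls exit status $status)"
        fi
    fi
}
```

Running something like `probe_dir /etc/pve/nodes/mel-pm2` on both nodes and comparing the results would show whether only one side is affected.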
 
Trying again to join the cluster together - after adding the Join information to the second node, the output I get is:

Code:
Establishing API connection with host '172.31.1.1'
Login succeeded.
check cluster join API version
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service

Then nothing more...

`pvecm status` on node #1 says:
Code:
Cluster information
-------------------
Name:             Home
Config Version:   2
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sat Jul 15 13:07:00 2023
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.9
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 <ipv6_prefix>::1%32576 (local)
0x00000002          1 <ipv6_prefix>::2%32576

corosync-quorumtool:
Code:
Quorum information
------------------
Date:             Sat Jul 15 13:08:10 2023
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1.9
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
         1          1 mel-pm (local)
         2          1 mel-pm2

All the Web UI components accessed via mel-pm to mel-pm2 time out.

At the moment, I can't seem to log in via SSH to mel-pm2 either.
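
Both outputs above report `Quorate: Yes` even though mel-pm2 is unreachable, so the quorum flag alone clearly doesn't prove the node is healthy. As a rough sketch, the flag can be pulled out of the status text so it can be compared with other checks in a script. The function name `quorate_from_status` is made up here, and the parsing assumes the `Quorate:` line looks like the output shown above.

```shell
#!/bin/sh
# quorate_from_status: hypothetical helper.
# Reads status text on stdin and prints the value of the "Quorate:" line.
# The "Flags: Quorate" line is ignored because it does not start with "Quorate:".
quorate_from_status() {
    awk -F: '/^Quorate:/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}
```

Usage would be along the lines of `pvecm status | quorate_from_status`. A "Yes" here only says corosync has quorum, as this thread shows.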
 
More fun and broken things - when trying to view the Cluster information on the Web UI of mel-pm (aka node 1):

[Attachment: pm-cluster-error.png]

Also, when looking at /etc/pve/nodes/mel-pm2/ on mel-pm, there's zero content:
[Attachment: 1689391348015.png]
 
As I can only get into mel-pm2 via screen + keyboard at the moment, this is some of the fun in journalctl on that node:
[Attachment: 1689392175361.png]
 
Interestingly, this seems to come down to some kind of network adapter problem. I'm stuck using a USB3 Ethernet adapter on this Intel NUC, as the e1000 driver has many problems that I can't seem to overcome. I simply switched the USB port that the USB Ethernet adapter was plugged into, and now everything functions perfectly.

I'm just about lost for words...

Just goes to show what a layer 1 network problem can cause!
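
In hindsight, one cheap thing to look at early is the kernel's per-interface error counters, which Linux exposes under `/sys/class/net/<iface>/statistics/`. The sketch below just prints a few of them. The function name `link_counters` is made up, and the optional second argument (a different base directory) exists only so the function can be tried against test data. There is no guarantee that a flaky USB port would show up in these counters, so clean numbers don't rule out the physical layer.

```shell
#!/bin/sh
# link_counters IFACE [BASE]
# Hypothetical helper: print selected error/drop counters for one interface.
# BASE defaults to /sys/class/net and is overridable for testing.
link_counters() {
    iface=$1
    base=${2:-/sys/class/net}
    for name in rx_errors tx_errors rx_dropped tx_dropped rx_crc_errors; do
        file="$base/$iface/statistics/$name"
        if [ -r "$file" ]; then
            echo "$name=$(cat "$file")"
        else
            # Counter file missing or unreadable for this interface
            echo "$name=unavailable"
        fi
    done
}
```

Calling it twice a minute or so apart, with the name of the USB adapter's interface, and comparing the numbers would show whether errors are still climbing.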
 
