Broken cluster, don't know what to do.

Dunsparth

New Member
Feb 18, 2024
Hi, I had 5 nodes, all in one cluster.

I ran
systemctl stop pve-cluster
systemctl stop corosync
pmxcfs -l
rm /etc/pve/corosync.conf
rm -r /etc/corosync/*
killall pmxcfs
systemctl start pve-cluster

on all my nodes, and 3 of them are gone from the cluster, which is how I wanted it.
But the 1st and 5th ones still appear in the GUI even though they aren't clustered, and I can't access them unless I connect to separate GUI instances.

pvecm status
returns
Error: Corosync config '/etc/pve/corosync.conf' does not exist - is this node part of a cluster?

pvecm expected 1
returns
Error: Corosync config '/etc/pve/corosync.conf' does not exist - is this node part of a cluster?

If I type
# pvecm delnode pve05
it returns
Node/IP: pve05 is not a known host of the cluster.

And vice versa if I run
pvecm delnode pve
Node/IP: pve is not a known host of the cluster.

Something is messed up somewhere and I have no idea what to do.
I've gone through and read
https://pve.proxmox.com/wiki/Cluster_Manager

and checked many forum posts, and I still can't find a fix for this. I need some professional help, thanks.
 
So, here are the steps:

1. Turn everything off.
2. Turn on one of your member nodes. Wait till it's up.
3. Turn on another member node. Wait till it's up. Are the two nodes able to ping each other on their COROSYNC ADDRESS?
4. If yes, is the cluster "up"?
4b. If no, fix your networking issues and repeat 3.
5. If the cluster is up, repeat with the next node until all expected nodes are up.
5b. If the cluster is still not up, you have two options: continue to fiddle with corosync and cluster configuration files, OR make copies of all your VMs, blow away ALL the nodes and set everything up again. The latter is the safer (and likely faster) option.
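
For step 3, the corosync addresses are normally the ring0_addr entries in the nodelist section of corosync.conf. The sketch below assumes a copy of that file still exists somewhere (on nodes where it has already been deleted, the addresses have to come from your own notes). The helper names corosync_addrs and ping_corosync_addrs are made up for illustration, and the ping flags assume Linux iputils.

```shell
# Print every ring0_addr value found in a corosync.conf-style file.
corosync_addrs() {
    awk '$1 == "ring0_addr:" { print $2 }' "$1"
}

# Ping each address once (2 second timeout) and report the result.
ping_corosync_addrs() {
    for addr in $(corosync_addrs "$1"); do
        if ping -c 1 -W 2 "$addr" >/dev/null 2>&1; then
            echo "$addr reachable"
        else
            echo "$addr NOT reachable"
        fi
    done
}
```

On a node that still has its config, this would be called as ping_corosync_addrs /etc/pve/corosync.conf.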
 
Going to try this now. What command do I use to ping the corosync addresses? I'll report back as soon as I go through these steps.
 
I'm able to use the ping command on both nodes using their IPs,
and they talk back and forth to each other that way.

Even if I try to create a new cluster, I get this error:

corosync-keygen: Could not create /etc/corosync/authkey: No such file or directory
Corosync Cluster Engine Authentication key generator.
Gathering 2048 bits for key from /dev/urandom.
TASK ERROR: command '/usr/sbin/corosync-keygen -lk /etc/corosync/authkey' failed: exit code 2
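
The "No such file or directory" part of that message suggests the /etc/corosync directory itself may be missing, not just its contents. That is only a reading of the error text, not something confirmed in this thread. A minimal sketch to check it, using a made-up helper name check_dir:

```shell
# Report whether a directory exists: prints "present: DIR" or "missing: DIR".
check_dir() {
    if [ -d "$1" ]; then
        echo "present: $1"
    else
        echo "missing: $1"
    fi
}
```

Calling check_dir /etc/corosync shows which case applies. If it reports "missing", recreating the directory with mkdir /etc/corosync before retrying the cluster creation may be worth a try.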
 
Well, you may try to remove the cluster and the corresponding data. I didn't do it myself, so no guarantee at all.
List and delete all nodes:

pvecm nodes
pvecm delnode pve2
pvecm delnode pve3

Wait for several minutes and then stop cluster services
systemctl stop pvestatd pvedaemon pve-cluster corosync
Now we need to remove cluster config:
sqlite3 /var/lib/pve-cluster/config.db
> DELETE FROM tree WHERE name = 'corosync.conf';
> .quit
rm -f /var/lib/pve-cluster/.pmxcfs.lockfile
rm /etc/pve/corosync.conf
rm /etc/corosync/*
rm /var/lib/corosync/*
systemctl start pvestatd pvedaemon pve-cluster corosync
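
Since the steps above edit config.db directly, it may be worth keeping a copy of the file first. A minimal sketch, assuming the services are already stopped so the file is not being written to; backup_file is a made-up helper name, not a Proxmox tool.

```shell
# Copy FILE to FILE.bak-YYYYmmdd-HHMMSS (preserving mode and timestamps)
# and print the path of the copy.
backup_file() {
    src=$1
    dest="$src.bak-$(date +%Y%m%d-%H%M%S)"
    cp -p -- "$src" "$dest" && echo "$dest"
}
```

Run as backup_file /var/lib/pve-cluster/config.db after the systemctl stop line and before the sqlite3 step.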
 
pvecm nodes won't work; all it does is return
Error: Corosync config '/etc/pve/corosync.conf' does not exist - is this node part of a cluster?

I'm going to have to wipe these things and start fresh. I hooked up a monitor to the server and rebooted it earlier, and I'm getting all these errors in the CLI.

So now I have to figure out how to back up the containers I have on the server without losing everything.
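
For backing up containers before a wipe, Proxmox ships the vzdump tool. The sketch below only prints the commands so they can be reviewed first; nothing is executed. The helper name print_backup_cmds, the dump directory and the IDs are placeholders, and the actual IDs would come from pct list.

```shell
# Print one vzdump command per container/VM ID (dry run, nothing is executed).
# Usage: print_backup_cmds DUMPDIR ID...
print_backup_cmds() {
    dumpdir=$1
    shift
    for id in "$@"; do
        echo "vzdump $id --dumpdir $dumpdir --mode stop --compress zstd"
    done
}
```

For example, print_backup_cmds /mnt/backup 100 101 prints two commands. The dump directory should be on a disk that survives the reinstall, and --mode stop shuts each guest down for the duration of its backup.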
 

Attachment: 429624058_379018591551994_4848892295491272171_n.jpg
But the 1st and 5th ones still appear in the GUI even though they aren't clustered

Just go and delete them from /etc/pve/nodes/$nodename

A regular rm -rf would do - as you are not clustered, you might need to do this individually on each standalone node.
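
Before running rm -rf, it can help to list which directories would be candidates. A read-only sketch, assuming the local node's directory is named after its hostname; stale_node_dirs is a made-up helper name.

```shell
# List subdirectories of NODES_DIR whose name differs from LOCAL_NAME.
# Read-only: it prints candidates and deletes nothing.
stale_node_dirs() {
    nodes_dir=$1
    local_name=$2
    for d in "$nodes_dir"/*/; do
        [ -d "$d" ] || continue
        name=$(basename "$d")
        [ "$name" = "$local_name" ] || echo "$name"
    done
}
```

Called as stale_node_dirs /etc/pve/nodes "$(hostname)", it prints the leftover node names. Check that none of them still hold guest configs you need before deleting anything.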

and I can't access them unless I connect to separate GUI instances.

That's what you wanted though, correct?
 
Well, that worked. Thank you.

Now I'm going to need to figure out why I am getting all these errors in the console.
 
Have a look at journalctl -b when they happen, i.e. what precedes them. Do you have failing Samba shares?
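
One way to narrow the journal down to share-related messages is to filter it. A small sketch; filter_share_errors is a made-up helper name, and the search pattern is only a guess at what failing Samba/CIFS mounts would log.

```shell
# Keep only lines that mention CIFS/SMB/Samba (case-insensitive).
# Reads from standard input.
filter_share_errors() {
    grep -i -E 'cifs|smb|samba'
}
```

Typical use would be journalctl -b -p err | filter_share_errors, where -b limits output to the current boot and -p err to messages of priority "err" and worse.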

Thank you so much, man, it's all back up and running again.
I'm not going to be able to cluster it all together, but at least it's running, and I can maybe get a backup server going to at least back everything up, then install Proxmox on this box.
 
You saved my life. I had been searching for this for a couple of days. Thank you very much.
 
