Recover broken PVE node

Oct 10, 2022
I had a single node which I turned into a two-node cluster, and while the secondary node had no machines on it yet, I tried to disconnect it from the cluster.
However, I ran the following commands on the main node instead of the secondary node, which caused issues:

Bash:
systemctl stop pve-cluster     # stop the cluster filesystem service
systemctl stop corosync        # stop cluster communication
pmxcfs -l                      # start pmxcfs in local mode
rm /etc/pve/corosync.conf      # remove the cluster config from the cluster filesystem
rm /etc/corosync/*             # remove the local corosync configuration
killall pmxcfs                 # stop the local-mode pmxcfs
systemctl start pve-cluster    # restart the cluster filesystem service

Trying to fix it, I made things even worse by reinstalling the corosync-pve package, which seems to have deleted the virtual machines' configurations on the main node.

I can see the configurations are still available on the secondary node, which now cannot connect to the broken node.

Can I just copy these configurations from the secondary node to the main node in order to restore `/etc/pve` there? If so, are there any other configurations I should be aware of?
 

You can, but before you (potentially) cause yourself even more damage, do yourself a favour and copy the /var/lib/pve-cluster/config.db* files from the node which still has your configs to a backup location. It is this database that actually holds the virtual "filesystem" that gets mounted at /etc/pve at runtime.

Ideally you want to do this with pmxcfs killed (that empty node is not doing any writes there anyhow).
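A minimal sketch of that backup step, assuming you can stop pve-cluster (which runs pmxcfs) on the node with the intact configs for a moment; /root/pve-cluster-backup and the scp destination are just example locations:

Bash:
# on the node that still has the good configs
systemctl stop pve-cluster                       # stops pmxcfs, so the database is not being written
mkdir -p /root/pve-cluster-backup
cp -a /var/lib/pve-cluster/config.db* /root/pve-cluster-backup/
# also keep a copy off the node, just in case (example host)
scp /var/lib/pve-cluster/config.db user@workstation:/some/safe/place/
systemctl start pve-cluster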
 
@esi_y thank you for the advice! So, as far as I understand, I just have to:
- Make a backup of the database directory from the main node
- Make a backup of the database directory from the secondary node
- Stop the services
- Copy the database backup from the secondary (working) node to the main node

Maybe the virtual machines' configurations will be recovered automatically. If not, copy the configurations from the secondary node to the main node, roughly as in the sketch below.
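Something like this, if I understood correctly (assuming pve-cluster is the only service I need to stop, and that secondary-node is just a placeholder for the working node's hostname):

Bash:
# on the broken main node
systemctl stop pve-cluster                              # pmxcfs may already be down
cp -a /var/lib/pve-cluster /root/pve-cluster.broken     # keep the broken database aside, just in case
# pull the good database over from the secondary node
scp root@secondary-node:/var/lib/pve-cluster/config.db /var/lib/pve-cluster/config.db
systemctl start pve-cluster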

Is it correct? Anything else I'm missing?
 

It's a bit of a risk to reply with delay across multiple threads and get the terms right, but if I understood correctly that you have an intact database on the secondary node, back up the /var/lib/pve-cluster/config.db (if you killall pmxcfs on that node beforehand, there will be only one file).

Put that file somewhere aside (off the nodes, just in case - you know that holds your good configs). Then do whatever you want: you can manually copy the individual VM configs, or you can stop the services and implant that good config.db into the broken (primary?) node, etc. Restart and see. Worst case of all, you reinstall PVE and put your config.db back right after (this assumes you had no intricate extra setup on the node that would need manual recovery as well).
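If you go the route of copying the individual VM configs instead of the whole database, a rough sketch, assuming the guests are QEMU VMs and that pve1 is just a placeholder for the broken node's name (adjust paths to your actual node names):

Bash:
# on the node that still has the good /etc/pve
ls /etc/pve/nodes/pve1/qemu-server/        # one *.conf file per VM
# copy them onto the broken node once its /etc/pve is mounted again
scp /etc/pve/nodes/pve1/qemu-server/*.conf root@pve1:/etc/pve/qemu-server/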
 
@esi_y, thank you very much for all the advice! At the moment, I'm a bit busy dealing with some personal priorities, hence the delayed replies here.

I will recover the cluster as soon as possible, and I will post back here if I find any information that may help others in similar situations.
 
