Recover broken PVE node

Oct 10, 2022
I had a single node which I turned into a two-node cluster, and while the secondary node had no machines on it yet, I tried to disconnect it from the cluster.
However, I ran the following commands on the main node instead of the secondary node, which caused issues:

Bash:
systemctl stop pve-cluster     # stop the cluster filesystem service
systemctl stop corosync        # stop cluster communication
pmxcfs -l                      # start pmxcfs in local mode
rm /etc/pve/corosync.conf      # remove the cluster config from the cluster filesystem
rm /etc/corosync/*             # remove the local corosync configuration
killall pmxcfs                 # stop the local-mode pmxcfs
systemctl start pve-cluster    # restart the cluster filesystem service

Trying to fix it, I made things even worse by reinstalling the corosync-pve package, which seems to have deleted the virtual machines' configurations on the main node.

I can see the configurations are still available on the secondary node, which now cannot connect to the broken node.

Can I just copy these configurations from the secondary node to the main node in order to restore `/etc/pve` there? If so, are there any other configurations I should be aware of?
 

You can, but before you (potentially) cause yourself even more damage, do yourself a favour and copy the /var/lib/pve-cluster/config.db* files from the node which still has your configs to a backup location. It is this database that actually holds the virtual "filesystem" that gets mounted at /etc/pve at runtime.

Ideally you want to do this with pmxcfs killed (that empty node is not doing any writes there anyhow).
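A minimal sketch of that backup step, assuming you can stop pve-cluster (which runs pmxcfs) on the node with the intact configs for a moment; /root/pve-cluster-backup and the scp destination are just example locations:

Bash:
# on the node that still has the good configs
systemctl stop pve-cluster                       # stops pmxcfs, so the database is not being written
mkdir -p /root/pve-cluster-backup
cp -a /var/lib/pve-cluster/config.db* /root/pve-cluster-backup/
# also keep a copy off the node, just in case (example host)
scp /var/lib/pve-cluster/config.db user@workstation:/some/safe/place/
systemctl start pve-cluster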
 
@esi_y thank you for the advice! So, as far as I understand, I just have to:
- Make a backup of the database directory from the main node
- Make a backup of the database directory from the secondary node
- Stop the services
- Copy the database backup from the secondary (working) node to the main node

Maybe the virtual machines' configurations will be recovered automatically. If not, copy the configurations from the secondary node to the main node, roughly as in the sketch below.
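Something like this, if I understood correctly (assuming pve-cluster is the only service I need to stop, and that secondary-node is just a placeholder for the working node's hostname):

Bash:
# on the broken main node
systemctl stop pve-cluster                              # pmxcfs may already be down
cp -a /var/lib/pve-cluster /root/pve-cluster.broken     # keep the broken database aside, just in case
# pull the good database over from the secondary node
scp root@secondary-node:/var/lib/pve-cluster/config.db /var/lib/pve-cluster/config.db
systemctl start pve-cluster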

Is it correct? Anything else I'm missing?
 

It's a bit of a risk to reply with delay across multiple threads and get the terms right, but if I understood correctly that you have an intact database on the secondary node, back up the /var/lib/pve-cluster/config.db (if you killall pmxcfs on that node beforehand, there will be only one file).

Put that file somewhere aside (off the nodes, just in case - you know that holds your good configs). Then do whatever you want: you can manually copy the individual VM configs, or you can stop the services and implant that good config.db into the broken (primary?) node, etc. Restart and see. Worst case of all, you reinstall PVE and put your config.db back right after (this assumes you had no intricate extra setup on the node that would need manual recovery as well).
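If you go the route of copying the individual VM configs instead of the whole database, a rough sketch, assuming the guests are QEMU VMs and that pve1 is just a placeholder for the broken node's name (adjust paths to your actual node names):

Bash:
# on the node that still has the good /etc/pve
ls /etc/pve/nodes/pve1/qemu-server/        # one *.conf file per VM
# copy them onto the broken node once its /etc/pve is mounted again
scp /etc/pve/nodes/pve1/qemu-server/*.conf root@pve1:/etc/pve/qemu-server/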
 
@esi_y, thank you very much for all the advice! At the moment, I'm a bit busy dealing with some personal priorities, hence the delayed replies here.

I will recover the cluster as soon as possible, and I will post back here if I find any information that may help others in similar situations.
 
