Accidentally deleted /etc/pve/nodes on the main node of a cluster: how to recover?

feradz · May 21, 2024

Hi all,

I have a cluster of 3 nodes: node1, node2, node3
I accidentally ran `rm -rf /etc/pve/nodes` on the primary node.

After that, I cannot log in to node1 through the web console.

I can still SSH into all three nodes.

None of the CTs are visible right now.

I don't have a backup of /etc/pve/nodes.

How can I recover the cluster node?

Kingneutron · May 22, 2024

https://forum.proxmox.com/threads/etc-pve-nodes-accidentally-deleted.55617/

https://github.com/kneutron/ansitest/tree/master/proxmox

If you get back up and running, set up and run the bkpcrit script, and do that frequently. Especially before ANY system changes.

PROTIP - I recommend installing Midnight Commander. Get to know it, and start using it for deleting files and directories recursively instead of rm. MC will show you exactly what you're doing and ask you to confirm deletion.

justinclift · May 22, 2024

@feradz Do you have anything useful in the /var/lib/pve-cluster/backup directory on any of your nodes?

For example, this is on one of my nodes and seems to be set up/created automatically:

Bash:

# ls -la /var/lib/pve-cluster/backup/
total 36
drwxr-xr-x 2 root root     3 May 16 13:07 .
drwxr-xr-x 3 root root     7 May 21 23:18 ..
-rw-r--r-- 1 root root 14473 May 16 13:07 config-1715864850.sql.gz

That file looks like it's a compressed dump of the SQLite commands needed to recreate the cluster configuration:

Bash:

# zmore /var/lib/pve-cluster/backup/config-1715864850.sql.gz
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE tree (  inode INTEGER PRIMARY KEY NOT NULL,  parent INTEGER NOT NULL CHECK(typeof(parent)=='integer'),  version INTEGER NOT NULL CHECK(typeof(version)=='integer'),  writer INTEGER NOT NULL CHECK(typeof(writer)=='integer'),  mtime INTEGER NOT NULL CHECK(typeof(mtime)=='integer'),  type INTEGER NOT NULL CHECK(typeof(type)=='integer'),  name TEXT NOT NULL,  data BLOB);
INSERT INTO tree VALUES(0,0,1392,0,1715864846,8,'__version__',NULL);
INSERT INTO tree VALUES(2,0,3,0,1715862102,8,'datacenter.cfg',X'6b6579626f6172643a20656e2d75730a');
(etc)

So if you have a file in the backup directory, there's a chance it might be from before you nuked the /etc/pve/nodes directory. If that's the case, then it shouldn't be tooooo hard to recover.
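To see what a dump actually holds, the `X'...'` blob literals in the INSERT lines can be decoded by hand. The following is only a minimal sketch, assuming Bash; `hex_to_text` is a hypothetical helper name (not a Proxmox tool), and the hex string is the `datacenter.cfg` value from the dump shown above.

```shell
#!/usr/bin/env bash
# hex_to_text: hypothetical helper that turns the hex payload of an
# X'...' blob literal from the SQL dump back into text.
hex_to_text() {
    local hex="$1" i
    # Walk the string two hex digits (one byte) at a time.
    for ((i = 0; i < ${#hex}; i += 2)); do
        # Bash's builtin printf understands \xHH escapes.
        printf "\\x${hex:i:2}"
    done
}

# The datacenter.cfg payload from the dump above.
hex_to_text 6b6579626f6172643a20656e2d75730a
# prints: keyboard: en-us
```

Searching the decompressed dump for a known container or VM config name before attempting any restore is a cheap way to tell whether the dump predates the deletion.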

feradz · May 22, 2024

Thanks for the feedback.
The cluster configuration was a total mess. I reinstalled Proxmox from scratch and manually reconfigured the LXCs. It turned out to be easy, but before I understood that, I had set up a test environment, broken it in different ways, and tried restoring it until I worked out how to recover the existing LXC subvolumes.

After this experience I have learnt that this SQLite-mapped file system can create a big mess. I came across many users who ran into this problem after an update/upgrade of Proxmox.

I have also learnt that it is necessary to set up backups to easily recover VMs/LXCs.

I want to thank the professional support from Proxmox, who guided me on how to proceed.

Ironically, I created this mess accidentally while trying to add a backup server to make proper backups.

bashizip · Sep 25, 2024

Hi all, listen carefully.

If this happens to you too, DO NOT restart any node!

Follow these steps to recover your cluster (copy and paste):

cp /var/lib/pve-cluster/config.db /var/lib/clusterconfig.db
systemctl stop pve-cluster.service && systemctl stop corosync.service
cp /var/lib/clusterconfig.db /var/lib/pve-cluster/config.db
systemctl start pve-cluster.service && systemctl start corosync.service

This helped me, I hope it will help you too.

esi_y · Sep 25, 2024

bashizip said:
Hi all, listen carefully.

If this happens to you too, DO NOT restart any node!

Follow these steps to recover your cluster (copy and paste):

cp /var/lib/pve-cluster/config.db /var/lib/clusterconfig.db
systemctl stop pve-cluster.service && systemctl stop corosync.service
cp /var/lib/clusterconfig.db /var/lib/pve-cluster/config.db
systemctl start pve-cluster.service && systemctl start corosync.service

This helped me, I hope it will help you too.

This looks wrong. First you copy out config.db (the SQLite database backing the /etc/pve filesystem) AFTER you have already deleted the files from it and while everything is still running; then you stop the services and ... copy it back, then start them again?

I can only imagine it works because your delete had not yet been checkpointed from the write-ahead log into the main database file. Very hacky.

Actually, you should keep proper backups, e.g.:
https://forum.proxmox.com/threads/backup-cluster-config-pmxcfs-etc-pve.154569/
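As a rough illustration of the point about the write-ahead log, a copy taken while the database is in use may need the `-wal` and `-shm` sibling files as well as `config.db` itself. The sketch below is an assumption-laden example, not the method from the linked thread: `backup_config_db` is a hypothetical helper name, the destination directory is arbitrary, and stopping `pve-cluster` first (shown only in the comments) is the more cautious option.

```shell
#!/usr/bin/env bash
# backup_config_db: hypothetical helper that copies a SQLite database file
# plus any -wal/-shm siblings into a new timestamped directory.
backup_config_db() {
    local src="$1" dest_root="$2" stamp dest f
    [ -f "$src" ] || { echo "no such database: $src" >&2; return 1; }
    stamp="$(date +%Y%m%d-%H%M%S)"
    dest="${dest_root}/config-${stamp}"
    mkdir -p "$dest" || return 1
    for f in "$src" "${src}-wal" "${src}-shm"; do
        # The -wal/-shm files only exist while a write-ahead log is present.
        if [ -e "$f" ]; then
            cp -p "$f" "$dest/" || return 1
        fi
    done
    printf '%s\n' "$dest"   # report where the copy went
}

# Intended use on a node, as root (not executed here):
#   systemctl stop pve-cluster.service
#   backup_config_db /var/lib/pve-cluster/config.db /root/pmxcfs-backups
#   systemctl start pve-cluster.service
```

Taking such a copy BEFORE making changes is what makes it useful; a copy taken after an accidental delete only helps by luck, as described above.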

esi_y · Sep 25, 2024

justinclift said:
@feradz Do you have anything useful in the /var/lib/pve-cluster/backup directory on any of your nodes?

For example, this is on one of my nodes and seems to be set up/created automatically:

Bash:

# ls -la /var/lib/pve-cluster/backup/
total 36
drwxr-xr-x 2 root root     3 May 16 13:07 .
drwxr-xr-x 3 root root     7 May 21 23:18 ..
-rw-r--r-- 1 root root 14473 May 16 13:07 config-1715864850.sql.gz

This is normally created before joining a cluster. Unfortunately, it is not created automatically or on a regular schedule.
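Because the dump is not refreshed regularly, it is worth checking how old it is before relying on it. The number in the file name looks like a Unix timestamp; that is an assumption, though it agrees with the file's modification time in the listing. A small sketch, assuming Bash and GNU `date`, with `dump_epoch` and `dump_time` as hypothetical helper names:

```shell
#!/usr/bin/env bash
# dump_epoch: hypothetical helper that extracts the number embedded in a
# backup file name such as config-1715864850.sql.gz.
dump_epoch() {
    local name="${1##*/}"        # drop the directory part
    name="${name#config-}"       # drop the "config-" prefix
    printf '%s\n' "${name%%.*}"  # drop ".sql.gz"
}

# dump_time: render that number as a UTC date (needs GNU date for -d @N).
dump_time() {
    date -u -d "@$(dump_epoch "$1")" '+%Y-%m-%d %H:%M:%S UTC'
}

dump_epoch /var/lib/pve-cluster/backup/config-1715864850.sql.gz
# prints: 1715864850
```

Calling `dump_time` on the same path should print `2024-05-16 13:07:30 UTC`, which in this thread is several days before the deletion was reported.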

zlpw91d · Dec 22, 2024

bashizip said:
Hi all, listen carefully.

If this happens to you too, DO NOT restart any node!

Follow these steps to recover your cluster (copy and paste):

cp /var/lib/pve-cluster/config.db /var/lib/clusterconfig.db
systemctl stop pve-cluster.service && systemctl stop corosync.service
cp /var/lib/clusterconfig.db /var/lib/pve-cluster/config.db
systemctl start pve-cluster.service && systemctl start corosync.service

This helped me, I hope it will help you too.

Thank you very much! It works!
