Proxmox 6.1-2

w4rh0und

Hi everyone

I just encountered the biggest possible issue on a cluster made out of 3 Proxmox machines.

I ran a daily backup script which had a stray space in it, so instead of removing files older than 1 day from /backup it removed them from / on one of the servers.

The server itself is now useless. I still have a session open on it, and all the VMs on it are still running.

The issue is that all the files from /etc/pve were also removed and the other proxmox servers are affected as well.

All the virtual machines are still running on all physical hosts, but we can't back them up or manage them in any way. Their config files are missing, and we cannot log in to the web interface.

We only have the .qcow2 files, which are currently in use on all machines.

How can we recover from this? Can we re-create the config files? Can we copy the qcow2 files to another machine and create a new VM using each disk?

Please help me

thank you
 
Can we re-create the config files?
Yes, you can, AFAIK.
As long as the data is present you can rebuild the configs.
I'd do the following: get the VM IDs from the disk names and the MACs from the NICs inside the guests, then rebuild the config files in /etc/pve/qemu-server.
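
A rough sketch of the "VM IDs from the disk names" step, assuming the disks follow the vm-<VMID>-disk-<N> naming seen elsewhere in this thread. The function names are made up for illustration, and the default directory /var/lib/vz/images is an assumption about where the "local" storage keeps its images; pass your own path if it differs.

```shell
# Print the VM ID encoded in a disk file name such as
# vm-306-disk-0.qcow2. Exits non-zero if the name does not match.
vmid_from_disk() {
    basename "$1" |
        sed -n 's/^vm-\([0-9][0-9]*\)-disk-[0-9][0-9]*\..*$/\1/p' |
        grep .
}

# List the unique VM IDs found below an image directory.
# The default path is an assumption; pass the real one as $1.
list_vmids() {
    find "${1:-/var/lib/vz/images}" -type f -name 'vm-*-disk-*' |
        while IFS= read -r f; do vmid_from_disk "$f"; done |
        sort -un
}
```

Running list_vmids on each node should give the set of IDs that need a config file on that node.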

Good luck!
 
thank you

Do I need to stop all cluster services on each machine in order to do this?

Is there a guide on how to re-create the config file for each machine?

Thank you
 
I'd start by dumping the command line of currently running VMs:
Code:
for pid in $(pidof kvm); do cat /proc/$pid/cmdline | tr '\0' '\n'; done

When recreating your VM configs, you can compare that dump to the output of qm showcmd VMID --pretty
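
To make those dumps easier to turn back into config values, something like the following could pull out the interesting options. This is only a sketch: the function name is made up, and the option names (-id, -name, -m, -smp, and a mac= key on the network device) are what I'd expect to see in the dump, so check them against your own output first.

```shell
# Read a NUL-separated kvm command line on stdin and print the
# values that follow a few selected options, one per line.
# Assumption: the VM was started with "-id", "-name", "-m" and "-smp"
# as separate arguments, and NIC devices carry a "mac=..." key.
kvm_args_summary() {
    tr '\0' '\n' | awk '
        prev == "-id"   { print "vmid: " $0 }
        prev == "-name" { print "name: " $0 }
        prev == "-m"    { print "memory: " $0 }
        prev == "-smp"  { print "smp: " $0 }
        match($0, /mac=[0-9A-Fa-f:]+/) {
            print "mac: " substr($0, RSTART + 4, RLENGTH - 4)
        }
        { prev = $0 }
    '
}

# Example (run as root on the host):
#   for pid in $(pidof kvm); do kvm_args_summary < /proc/$pid/cmdline; echo; done
```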
 
Hi Fabian,

I already dumped the settings; unfortunately the dump does not show the disk size.

The questions I still have:

Since these machines were clustered, how can I remove each of them from the cluster so they run standalone, and regain access to the web console without stopping the running VMs?
Can I just create the folders and the "id.conf" files by hand?

I've created another VM via the web interface on a new server and copied over the qcow2 file.

The conf file generated by Proxmox on the new server is shown below. Would this suffice for most images?

306.conf
bootdisk: ide0
cores: 2
ide0: local:306/vm-306-disk-0.qcow2,backup=0,size=50G
ide2: none,media=cdrom
memory: 4096
name: PC06
net0: virtio=36:9B:35:A5:5E:B3,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=4e141ea1-4911-40e8-9484-f52493e29d95
sockets: 1
vmgenid: 88be9f65-04bf-4de3-8505-e3a95abf138d
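
For the remaining VMs, a small generator along these lines might save some typing. It is a sketch built only from the keys in the 306.conf above; the function name and argument order are made up, and it assumes every disk sits on the "local" storage as vm-<VMID>-disk-0.qcow2 on ide0 with a virtio NIC on vmbr0. If a guest was installed with a different disk bus or NIC model, adjust the template before using it.

```shell
# Print a minimal VM config modelled on the 306.conf above.
# Arguments: vmid name memory_mb cores mac disk_size
# Assumptions: "local" storage, one qcow2 disk on ide0, Linux guest,
# virtio NIC on bridge vmbr0.
make_vm_conf() {
    cat <<EOF
bootdisk: ide0
cores: $4
ide0: local:$1/vm-$1-disk-0.qcow2,size=$6
memory: $3
name: $2
net0: virtio=$5,bridge=vmbr0
ostype: l26
sockets: 1
EOF
}

# Example:
#   make_vm_conf 306 PC06 4096 2 36:9B:35:A5:5E:B3 50G > /etc/pve/qemu-server/306.conf
```

Comparing qm showcmd for the generated config against the dumped command line of the running VM should show whether the template matches.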
 
the size should be recoverable from the images themselves.
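
qemu-img info on the image is the usual way to see the virtual size. As a dependency-free sketch, the size can also be read straight from the qcow2 header, where it is stored as a big-endian 64-bit byte count at offset 24. The function name below is made up, and it only reads from the file, so it should be harmless on an image that is in use.

```shell
# Print the virtual disk size in bytes stored in a qcow2 header.
# Header layout used here: 4-byte magic "QFI\xfb" at offset 0,
# 8-byte big-endian virtual size at offset 24.
qcow2_virtual_size() {
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    if [ "$magic" != "514649fb" ]; then
        echo "not a qcow2 image: $1" >&2
        return 1
    fi
    hex=$(od -An -tx1 -j24 -N8 "$1" | tr -d ' \n')
    echo $((0x$hex))
}

# Example:
#   bytes=$(qcow2_virtual_size /var/lib/vz/images/306/vm-306-disk-0.qcow2)
#   echo "$((bytes / 1024 / 1024 / 1024))G"
```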

if you want to reset your pmxcfs (/etc/pve) to a clean slate like on a freshly installed, standalone node with no configuration done yet, that should work with the following sequence of commands, but I haven't tested them so maybe try them out inside a test VM first:
Code:
systemctl stop pve-cluster corosync
rm -rf /etc/corosync/*
rm -rf /var/lib/pve-cluster/*
systemctl start pve-cluster
pvecm updatecerts

obviously only run this on nodes where /etc/pve is already completely hosed, not on other production nodes ;)
 
/etc/pve is already lost on all cluster nodes. It was removed on Server1, and now it is gone everywhere.
 
Any idea what to do with the main server?

The VMs are still running on it, but we can't even log in at the physical console.

Should we shut down, boot from a USB stick and copy over the qcow2 files, or are there better options?
 
depending on what was deleted there, a clean reinstall might be the best option.. if the qcow2 files are actually still there and not just not-yet-deleted because the running guests are using them, then copying them via some live environment should be possible. I'd do in-VM backups as well before shutting down the host though..
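
One way to check whether an image is only being kept alive by its running guest is to look at the process's open file descriptors: on Linux, the link target of an unlinked file ends in " (deleted)". A sketch (the function name is made up, and it relies on /proc and readlink being available):

```shell
# List files that process $1 still holds open although they have
# already been unlinked. Linux marks these in /proc/PID/fd by
# appending " (deleted)" to the link target.
deleted_open_files() {
    for fd in /proc/"$1"/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *' (deleted)') printf '%s\n' "$target" ;;
        esac
    done
}

# Example (run as root on the host):
#   for pid in $(pidof kvm); do deleted_open_files "$pid"; done
```

Any qcow2 path that shows up here exists only for as long as the guest keeps running, so it has to be copied out from inside the VM or via /proc/PID/fd before the process stops.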
 
We managed to recover completely by creating the conf files for all the VMs.

The first server in the cluster, where everything under / had been removed, we re-installed. We then reused the qcow2 files, which worked great once we had created the .conf files. All in all we got lucky.

Thank you everyone for the hints provided.
 