[SOLVED] Huge screw up - Deleted filesystem

RocketRammer

New Member
Mar 31, 2022
Hi All,

I've made a massive screw-up: in a script I was testing on a host inside a cluster, I somehow deleted a lot of files from the host, which instantly replicated across to the other host in the cluster.

I don't know what got deleted. All I got was this output suggesting it was trying everything:

INFO: /bin/rm: cannot remove './etc/pve/.debug': Permission denied
INFO: /bin/rm: cannot remove './etc/pve/.vmlist': Permission denied
INFO: /bin/rm: cannot remove './etc/pve/.members': Permission denied
INFO: /bin/rm: cannot remove './etc/pve/.rrd': Permission denied
INFO: /bin/rm: cannot remove './etc/pve/.version': Permission denied
INFO: /bin/rm: cannot remove './etc/pve/.clusterlog': Permission denied
INFO: /bin/rm: cannot remove './run/rpc_pipefs/gssd/clntXX/info': Operation not permitted
INFO: find: '/bin/rm': No such file or directory
INFO: find: '/bin/rm': No such file or directory

In the console I still had connected, that host pretty much disappeared, and whilst the VMs were still up, it wasn't reporting. I rebooted that host and it's dead, failing to boot. The remaining host I can no longer connect to in the GUI; on this host I had a backup of /etc/pve and /root, which I've restored, but it hasn't automatically come back online. Whilst I can't connect to the GUI, it's still up and is running critical items like my pfSense router, domain controllers, etc. If I reboot it and it doesn't come back up, I am absolutely up the river without a paddle.

I can still access the current host via SSH. Massively in need of help. Anything is appreciated.
 
You could try to boot a PVE ISO and enter rescue mode, or boot an Ubuntu Live ISO and try to access your backups of /etc/pve.
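Roughly, from a live session it would look something like this (a sketch only; the partition name is a placeholder, so check lsblk first):

Code:
# identify the old root partition
lsblk -f
# mount it somewhere temporary (adjust /dev/sdX3 to the real root partition)
mkdir -p /mnt/oldroot
mount /dev/sdX3 /mnt/oldroot
# then copy out whatever backups are stored on it
ls /mnt/oldroot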
 
Thanks @Dunuin. Haven't used the rescue mode before but I like the sound of it! Downloading the latest ISO now and going to stick it on a USB, ready for the inevitable reboot of the remaining host. I really hope it'll come back up, and then worst case I have to rebuild the other one (and then figure out how to import all the VMs, which I hope to God are still there).
 
Jeepers. Well, the one which held on surprisingly came back online after a very long reboot! So I'm only up a creek, but I do at least have a paddle. Had to override the quorum with 'pvecm expected 1' as it wouldn't start while waiting for the other host, but it seems to be up and the VMs are running.
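For anyone finding this later, the quorum override is just this (a sketch; check the status before and after):

Code:
# with only one of two nodes alive, /etc/pve stays read-only until quorum is satisfied
pvecm status        # shows Quorate: No while the other node is down
pvecm expected 1    # tell corosync to expect a single vote, so this node is quorate on its own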

Now to investigate the dead one that's failing to boot. Will start with rescue mode and see where we get to.

Whilst I'm doing that, a question for the forum: if I get this other server back up, which we should assume has dodgy /etc/pve files, and given I've now overridden the quorum (I think?), what will happen when it starts? What direction will the sync take? Will it even sync, or do I recreate the cluster, etc.?

Thanks!
 
Tried Boot-Repair and GParted, which both see the partitions and are seemingly happy with the boot config, and yet it refuses to boot.

Think it's dead, so now I need to think about reinstalling on this host. All the VMs are on separate disks and I think they're all there, but of course I don't know about their configs. My current server is still referencing the old one, which is of course offline.

Before I wipe the dead one, how best am I going to stand this back up and import the VMs/disks/configs?


[screenshot]
 
Right - I've rebuilt the host and got it to rejoin the cluster by copying the /etc/corosync folder. Manually configured networking, and it's synced up all the PVE stuff from the working host. Good news.
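Roughly, for anyone else, that amounted to something like this (the IP is a placeholder, and it assumes the reinstall kept the old hostname):

Code:
# on the rebuilt node
systemctl stop pve-cluster corosync
scp root@192.0.2.10:/etc/corosync/corosync.conf /etc/corosync/   # from the working node
scp root@192.0.2.10:/etc/corosync/authkey /etc/corosync/
systemctl start corosync pve-cluster
pvecm status   # both nodes should be listed again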

Now, however, I've plugged all my disks back in but I can't get them to mount(?).

[screenshot]

I question this because in /mnt/pve above I do have the folders there, but they are all empty. How can I tell if these are actually mounted and pointing at the disks, or if I'm royally screwed and in actual fact they are mounted fine but everything inside was deleted?

A further image which may give me some hope: 'MOUNTPOINT' is empty against all the disks, which leads me to think they are not mounted. Now the question is why, and I hope they still contain the VMs.

[screenshot]
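For reference, the sort of checks involved (a sketch; the storage name and UUID are placeholders):

Code:
lsblk -o NAME,SIZE,FSTYPE,UUID,MOUNTPOINT           # empty MOUNTPOINT = not mounted
findmnt /mnt/pve/RaidArray                          # prints nothing if nothing is mounted there
mount /dev/disk/by-uuid/<UUID> /mnt/pve/RaidArray   # test-mount one by hand and look inside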
 
Updating as I go, just in case anyone cares...

This host is missing the mount units in /etc/systemd/system. There should be a 'mnt-pve-NAMEOFDISK.mount' in there for each disk, with contents like:

Code:
[Install]
WantedBy=multi-user.target

[Mount]
Options=defaults
Type=ext4
What=/dev/disk/by-uuid/c6673330-46c8-488b-9b6f-dc418616dd22
Where=/mnt/pve/RaidArray

[Unit]
Description=Mount storage 'RaidArray' under /mnt/pve

Going to manually create these for each of my disks and then likely try to find how to register them, assuming it won't just figure itself out... COME ON.
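To fill in the What= lines I need the UUIDs, which (as far as I know) can be read off with either of these:

Code:
blkid                              # lists every filesystem with its UUID
lsblk -o NAME,SIZE,FSTYPE,UUID     # same info, tree layout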
 
Yep, so I manually created them, and they now show under the directories, but they are still not mounted. Feels like something hasn't picked up that it needs to run them. I don't know enough about this, so I'm just aimlessly googling currently!

[screenshot]

I've seen people say it needs to be in /etc/fstab, but my working host has no entries for its drives there outside the root and swap entries. I must be missing something!..
 
systemctl enable mnt-pve-NAMEOFDISK.mount

To remind me later (I don't know if this is factual): it seems fstab is the old way and systemd mount units are the new, current way of doing it, hence old Google posts throwing confusion. Need to enable each unit like above. There's probably a way to just restart a service instead, but I'm having to reboot... waiting for it to come back up.
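For next time: I believe something like this picks up and starts a new unit without a full reboot (untested just now; RaidArray is the example name from earlier):

Code:
systemctl daemon-reload                          # pick up new/edited .mount files
systemctl enable --now mnt-pve-RaidArray.mount   # enable at boot and mount it immediately
findmnt /mnt/pve/RaidArray                       # confirm it's actually mounted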

This is officially where I think out loud and jot my progress. I've had a lot of coffee. Today can go to hell.
 
Friggin' A! Screw the coffee, it's whiskey time!!

[screenshot]
Right - Notes before I drink....

When creating the .mount files, the NAMEOFDISK in 'mnt-pve-NAMEOFDISK.mount' MUST match the final part of the Where= in the file, AND it's case-sensitive, i.e.:
Code:
[Install]
WantedBy=multi-user.target

[Mount]
Options=defaults
Type=ext4
What=/dev/disk/by-uuid/c6673330-46c8-488b-9b6f-dc418616dd22
Where=/mnt/pve/RaiDArraY

[Unit]
Description=Mount storage 'RaidArray' under /mnt/pve

That file MUST be named: 'mnt-pve-RaiDArraY.mount'.

Next, in the screenshot above I have one named SSD-DS-RAID... guess what, it hates hyphens. I ran this command:
Code:
systemd-escape -p --suffix=mount "SSD-DS-RAID"
which output this escaped string:
Code:
SSD\x2dDS\x2dRAID.mount
This is then the file name: 'mnt-pve-SSD\x2dDS\x2dRAID.mount', and inside it's still Where=/mnt/pve/SSD-DS-RAID. Whatever.
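Side note: as far as I can tell, you can skip assembling the name by hand and feed the whole path to systemd-escape instead:

Code:
systemd-escape -p --suffix=mount "/mnt/pve/SSD-DS-RAID"
# -> mnt-pve-SSD\x2dDS\x2dRAID.mount, i.e. the complete unit file name in one go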

Reboot, and it's all mostly up! Looks like everything that was powered on at the time of my absolutely stupid but very impressive deletion script got denied, and thus it survived. The couple of VMs I had switched off look to have disappeared, but as per the screenshot, only 1 of them was really important and it's nothing I can't quickly remake. Nothing production, unlike the domain controller, CCTV and TrueNAS instance running on this.

It's over. It's done. I hope to never have to do that again.

Thank you for attending my TED talk.
 
/etc/fstab is parsed by systemd at boot, so it is always supported.
IMO, fstab is simpler because it's just one file to check / back up / update, together with storage.cfg.
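For comparison, the fstab line for the RaidArray example above would be something like:

Code:
# /etc/fstab -- systemd generates the equivalent .mount unit from this at boot
UUID=c6673330-46c8-488b-9b6f-dc418616dd22  /mnt/pve/RaidArray  ext4  defaults  0  2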
 
