Proxmox Support

wifipilot

New Member
Jul 22, 2022
Guys, please can you help me?

I have searched for help but could not find what I need to do.


I have a Proxmox 7 server whose SSD boot drive has crashed.

My problem is that the server has 2 x 120GB SSD boot drives (I am not sure if only one 120GB drive is used to boot) and 4 x 1TB SSDs in RAID.

I need to install a new SSD and then mount the RAID system in Proxmox to restore the VMs.

I have installed a new SSD and need to understand how to recover the VMs.

Any help greatly appreciated.
 
What is sde for? You've got two 120GB SSDs, and PVE was installed to both, but not as a mirror: one 120GB SSD uses ZFS and the other uses LVM-thin. So you are basically running two separate PVE installations on the same machine, on different disks? Why?
 
And the disk/PVE installation that failed was the ZFS one, so the machine now boots from the LVM disk/installation instead, because the ZFS installation that would otherwise have booted is missing?
 
What data was on your old SSD? Virtual disks/backups you care about too, or just the PVE system and config files?

What I would do is boot from a PVE ISO, enter rescue mode and try to import the failed ZFS pool: run zpool import to see the status of the failed pool and zpool import rpool to try to mount it. If that doesn't work you can try zpool import -R /restore -o readonly=on -f rpool. If mounting still fails you could try zpool import -R /restore -F rpool to discard the latest changes - roughly the sequence sketched below.
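A rough sketch of that rescue-mode sequence (the pool name rpool and the mountpoint /restore are just the ones from the commands above; adjust to whatever zpool import actually shows on your box):

# list importable pools and their state without importing anything
zpool import

# try a normal import of the old root pool
zpool import rpool

# if that fails: force the import, read-only, mounted under /restore
zpool import -R /restore -o readonly=on -f rpool

# last resort: discard the latest uncommitted transactions
zpool import -R /restore -F rpool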

It's also a good idea to back up your failed SSD before trying to recover it, so you can restore it in case the recovery fails. You could boot into Clonezilla for that.
 
The problem probably is that your PVE system disk stored all your config files, including the guest config files. In case your virtual disks were stored on the other ZFS pool (the 4x 1TB one), you only got the data that your guests stored, but there is no guest to run, because the config files that define what your VMs/LXCs should look like were lost with the system disk.
So you either:
1.) Somehow restore your configs from the old PVE system disk, which can be hard because those aren't directly accessible if your PVE can't boot (the configs are actually stored in a SQLite DB and mounted as files in /etc/pve, but if your PVE services can't start it won't mount the database there and the config folder will be empty; see the sketch right after this list).
2.) You have vzdump/PBS backups of all your guests and extract the config files from there, or you manually backed up your /etc/pve folder previously.
3.) Try to remember what options these guests used, create new VMs/LXCs from scratch matching the old guests (wrong options and the VMs won't boot) and attach the old virtual disks to them.
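
For option 1.), once you can read the old system disk (e.g. via the read-only import under /restore from above), something like this should pull the configs out of the pmxcfs database with the sqlite3 CLI. This assumes the usual pmxcfs layout (a 'tree' table with 'name' and 'data' columns) and VMID 100 is just an example; check your own config.db first:

# copy the database off the old system disk first
cp /restore/var/lib/pve-cluster/config.db /tmp/config.db

# list all files stored inside the cluster filesystem database
sqlite3 /tmp/config.db "SELECT name FROM tree;"

# dump one guest config, e.g. VM 100, back into a plain .conf file
sqlite3 /tmp/config.db "SELECT data FROM tree WHERE name = '100.conf';" > /tmp/100.conf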
 
I would unplug it. That way you won't get confused with multiple PVE installations.
 
And what you probably want to back up is "/etc/pve" (which most likely will be empty, so not useful), "/etc/network/interfaces", "/etc/vzdump.conf", "/etc/hosts", "/etc/resolv.conf", "/root", "/home", "/var/lib/vz" (which stores your ISOs, snippets, templates and backups) and the SQLite DB that stores the pmxcfs configs at "/var/lib/pve-cluster/config.db".

In case you got that backed up, you could install PVE with ZFS to the SSD where you previously installed PVE with LVM. Then boot from that new PVE installation, stop the pve-cluster service (systemctl stop pve-cluster), replace the files with the ones you backed up, reboot the server and see if it works again.
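
Roughly like this, assuming the backed-up files were copied to a USB stick mounted at /mnt/usb (that mountpoint and the file names on the stick are just placeholders):

# stop the cluster filesystem so config.db is not in use
systemctl stop pve-cluster

# put the old pmxcfs database and plain config files back in place
cp /mnt/usb/config.db   /var/lib/pve-cluster/config.db
cp /mnt/usb/interfaces  /etc/network/interfaces
cp /mnt/usb/hosts       /etc/hosts
cp /mnt/usb/resolv.conf /etc/resolv.conf

# reboot and check whether pmxcfs mounts /etc/pve with the old configs again
reboot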
 
LnxBil - it was a Samsung 120GB SSD
ZFS should only be used with enterprise/datacenter grade SSDs with powerloss protection. So LnxBil is right that you used an SSD that is too cheap and will lose data on a power outage.
So now I need to COPY ALL the folders you mentioned? Yes / no?
Onto a USB flash drive, to restore later?
Yes, but verify that you really copy the files from your old SSD and not the files from the rescue mode system.


So the following drives are using ZFS?

NAME                                          STATE   READ WRITE CKSUM
sfast10                                       ONLINE     0     0     0
  mirror-0                                    ONLINE     0     0     0
    ata-WDC_WDS400T1R0A-68A4W0_21271F800149   ONLINE     0     0     0
    ata-WDC_WDS400T1R0A-68A4W0_212862800192   ONLINE     0     0     0
  mirror-1                                    ONLINE     0     0     0
    ata-WDC_WDS400T1R0A-68A4W0_212862800316   ONLINE     0     0     0
    ata-WDC_WDS400T1R0A-68A4W0_212862800046   ONLINE     0     0     0

Would that be correct?
I wouldn't touch that sfast10 pool if that is still working.

I thought you only had a single-disk ZFS pool as your system drive, but you've got a mirror with a missing member. Why is that member missing?
 
You still haven't explained what actually happened. To me it looks like you previously had two 120GB Samsung EVOs (one 750 EVO + one 850 EVO) in a ZFS mirror. Then the 850 EVO failed, but the pool was still working because of the mirroring and the working 750 EVO.
But instead of adding a new disk, replacing the failed disk and resilvering the pool, you added another 120GB SSD (an Apacer 120GB) and just reinstalled PVE with LVM to that, so you got two PVE installations in parallel.

So in my opinion all that stuff with the PVE on LVM was useless, and you should have followed the guide "Replacing a failed boot device" in the wiki to replace the failed disk and get your rpool from "degraded" back to "online" status.
 
Dunuin, thank you for your explanation.

Yes, you are correct, there were 2 x 120GB Samsung SSD drives and one drive is faulty, so the system will not boot.

So all I need to do is replace the one drive?

Then how do I get the system back up, as the system will not boot?

Sorry for the noob questions. This is the first time I am trying to recover from a faulty Proxmox system.

I just need the system back up ASAP.

zpool replace -f <pool> <old device> <new device> - what pool and what new device?

Do I have both the old 120GB Samsung and the new 120GB connected, and then boot with the Proxmox USB in "rescue mode"?
No, you should read the wiki paragraph that I linked. You can't just run "zpool replace -f <pool> <old device> <new device>", as you also have to clone the partition table from the healthy pool member and copy the bootloader first, as described there.
But if your remaining pool member isn't able to boot, you would first have to get it booting again, for example by writing a new GRUB to it or by using proxmox-boot-tool to write a new systemd-boot bootloader.
So first I would try to use the rescue mode to write the bootloader so that your remaining SSD is able to boot again.
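
For reference, a rough sketch of the wiki's procedure for replacing a failed bootable ZFS device. /dev/sdX (healthy member) and /dev/sdY (new disk) are placeholders, and the partition layout (ESP on partition 2, ZFS on partition 3) is only the default of a PVE ZFS install - check the wiki and your own layout before running anything:

# copy the partition table from the healthy disk to the new one, then randomize its GUIDs
sgdisk /dev/sdX -R /dev/sdY
sgdisk -G /dev/sdY

# let ZFS resilver onto the new disk (partition 3 carries the pool data by default)
zpool replace -f rpool <old-failed-disk>-part3 /dev/disk/by-id/<new-disk>-part3

# make the new disk bootable - with proxmox-boot-tool / systemd-boot:
proxmox-boot-tool format /dev/sdY2
proxmox-boot-tool init /dev/sdY2

# or, on a legacy BIOS/GRUB install:
grub-install /dev/sdY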
 
This is very likely not all that helpful, just an observation/comment.

My opinion on the topic of ZFS (or unsupported Linux SW RAID) versus vanilla hardware RAID, or no RAID at all:

RAID is important for production, generally.
Data backups / VM backups are really important.
Part of RAID being a good thing is that you know how to use it and how to recover from it.
This means either testing a full recovery cycle on dev/test, or on prod while it is still pre-prod and tests are OK/doable.
proxmox-backup-server (or vanilla scheduled VM dumps to NFS or other storage, etc., as an old-style solution) is a great way to make sure your VM data is backed up and easily accessible.
The fact that the Proxmox appliance installer supports a ZFS install config is nice, but it also means there are possible situations where you have set something up and don't really know how it works, how to support it, or how to recover from a failure.

IMHO, if you are uncomfortable with ZFS then you should not use it. Personally I am not a big fan of ZFS, so I don't use it. Just my own preference.

I have some Proxmox boxes deployed on "unsupported Linux SW RAID" because I am happy/comfortable supporting that and familiar with making it work.

Similarly, I have some Proxmox boxes deployed on "hardware RAID" because situations with a failed drive are generally fairly easy to handle there.

At the end of the day, it is important that if you have big problems with a Proxmox server in production, you always have the option of:

Blow away the server
Clean install to your preferred config / good hardware / etc
Restore VMs from the most recent good backup
Life goes on.


In your situation, this could possibly mean:

-- attach an external USB HDD with several TB of storage
-- manually back up your config files and then the raw VM data files, if you don't have good backups
-- make sure you are good. If you are not sure, use a new set of disks in your server and set the old disks aside; that way you have a fallback in case you screw up and destroy things by accident. Disks are cheap in some situations compared to "oops, I destroyed my only copy of the server VM". Disks do cost money, but destroying the only copy of a VM by accident can quite possibly cost more.
-- once you know for sure you are good, get a clean Proxmox in place, on a config you are comfortable with and know how to support
-- bring your VMs back to life, i.e. this might be "manually create one VM, do nothing with this placeholder, now drop your config and raw VM disk files into suitable places, so that your old VM exists and is bootable". Once that is done successfully for one VM, rinse and repeat for all your VMs.
-- once all VMs are alive and life is good, make sure to set up good backups (Proxmox Backup Server is recommended) on a suitable schedule
-- then life moves forward smoothly, and hopefully there is less drama.


At the end of the day, hard drive failure is an inevitable part of server deployment, so it is important that the platform is set up to anticipate this inevitable situation and move through it gracefully. Cars run out of gas, tires wear out, hard drives fail. We generally avoid running out of gas on the highway, and we generally remove and replace bad/worn-out car tires before they kill us, not after the tread is gone.

Sometimes we don't learn how important good, reliable backups are until we learn the hard way, i.e. we lose something that is really important, or end up in an avoidable situation that could have been far less painful if we had good backups that worked well.


Hope this is a situation you can recover from without too much pain, loss or frustration,
and that long term you end up in a better place.


Tim
 
To wifipilot: I just wanted to say I hope this all works out for you. I think we have all been there at some stage of our lives.
 
ZFS is very robust when running on suitable hardware. One problem here is that consumer SSDs were used instead of proper enterprise-grade SSDs with powerloss protection.
Had you used proper SSDs in the first place, it is very unlikely that your SSDs would have stopped booting, as no data gets corrupted on a power outage. Usually when people write here that PVE won't boot anymore, they had a power outage right before, but were unwilling to spend on a UPS or enterprise SSDs.

Also keep in mind that the bootloader isn't part of ZFS. GRUB and the ESP are separate partitions outside of the ZFS pool.

But you are right: with proper backups and a worked-out recovery plan we wouldn't need to have this discussion.


Is that what you recommend?

By using the USB in rescue mode?
Yep, but find out whether you are using GRUB or systemd-boot to boot. You only need one of them.
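
One quick way to check from the rescue shell (just a rule of thumb, not guaranteed for every setup): UEFI firmware shows up as /sys/firmware/efi, and a PVE ZFS install on UEFI normally boots via proxmox-boot-tool/systemd-boot, while legacy BIOS means GRUB:

# present on a UEFI boot, absent on legacy BIOS
ls /sys/firmware/efi && echo "UEFI (likely systemd-boot)" || echo "legacy BIOS (GRUB)"

# if proxmox-boot-tool is in use, this lists the ESPs it manages and the boot mode
proxmox-boot-tool status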
 
It's no problem to import that 4-disk pool you stored your virtual disks on using a new ZFS installation. But the virtual disks won't help much without the VMs' config files. Did you rescue all the configs in the meantime?

Did you try what you posted?
proxmox-boot-tool format <new disk's ESP>
proxmox-boot-tool init <new disk's ESP>
Your ESP should be partition 2 of the system disk, so "/dev/sda2".

But I'm not sure if you need additional steps. I once repaired a Debian VM that didn't want to boot anymore, and for that I had to boot a Debian live CD and chroot into the non-bootable Debian to be able to write a new bootloader.
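
For completeness, a generic sketch of that live-CD/chroot approach (device names are placeholders, and on a ZFS root you would import and mount the pool instead of the plain mount in the first line):

# mount the root filesystem of the broken install, then bind the virtual filesystems
mount /dev/sdXn /mnt
mount --bind /dev  /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys  /mnt/sys

# chroot into it and write a new bootloader (GRUB shown as an example)
chroot /mnt
grub-install /dev/sdX
update-grub
exit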
 
But I'm not sure if you need additional steps. I once repaired a Debian VM that didn't want to boot anymore, and for that I had to boot a Debian live CD and chroot into the non-bootable Debian to be able to write a new bootloader.
Yeah, also my tool of choice.

The problem is: why can't your second disk boot? Have you tried browsing the EFI boot menu in the BIOS to search for the other EFI entry? Can you just plug in your one working SSD, live boot and do an lsblk?
 
BTW: There is a "reply" button that will quote the selected text, so you can just answer like (almost) all other people do.

1. "It's no problem to import that 4-disk pool you stored your virtual disks on using a new ZFS installation." So if I create a new 120GB Proxmox boot drive with LVM, I can "IMPORT" the VMs?
No, you need the configuration; without it, it's almost useless.

"But the virtual disks won't help much without the VMs' config files." - All the VMs are basically Windows 10 with 250GB HDD space and 4GB RAM. So can I not just create them and then "import" the VMs?
You can, but you will then have additional network devices and will probably have to activate everything again.

"Did you rescue all the configs in the meantime?" I have tried from the old second drive, but where will the config files be for me to use?
After booting up, you have them in /etc/pve/. One way to rescue them would also be to copy the folder /var/lib/pve-cluster and extract the configuration from the SQLite database yourself.

Again: I'd try to fix your non-booting but supposedly working copy of your data.
 
