[SOLVED] Corrupted drive - how to repair?

Ninjas · Aug 4, 2021

This is a long story, so I'll just summarize it. This week my Proxmox server started having issues with its SSD. I managed to fix it by running fsck a few times plus lvconvert --repair. I thought it was a one-time issue, so I left it alone. Today the issues returned.

The server has 3 drives, call them sda (the SSD), sdb (HDD), and sdc (HDD), plus 2 VMs.

Figuring sda was a lost cause, I installed a fresh copy of Proxmox 7 on sdc.
During the install, the volume group on sda was renamed to pve-OLD-059360C7 by the installer.

Now I'm on the Proxmox install on sdc and I want to recover as much from sda as possible. For instance, I'm pretty sure I can salvage most of the 50GB Windows install. I created an LVM-Thin storage with the pve-OLD-059360C7 volume group and its data thin pool, attached the old Windows disk to my instance, and then tried to "Move" the disk over to my sdc HDD, where I know it will be safe from further corruption.

About 42% into the transfer, qemu-img failed with an input/output error. Understandable, the storage is corrupt after all:

Code:

qemu-img: error while reading at byte 21967662080: Input/output error
TASK ERROR: storage migration failed: copy failed: command '/usr/bin/qemu-img convert -p -n -f raw -O raw /dev/pve-OLD-059360C7/vm-200-disk-0 zeroinit:/dev/pve/vm-200-disk-1' failed: exit code 1

So I figured I could try to repair the pve-OLD-059360C7/data thin pool:

Code:

# lvchange -an pve-OLD-059360C7/data -ff
# lvconvert --repair pve-OLD-059360C7/data
  Active pools cannot be repaired.  Use lvchange -an first.

Repairing pve-OLD-059360C7/vm-200-disk-0 doesn't work either:

Code:

# lvconvert --repair pve-OLD-059360C7/vm-200-disk-0
  Command on LV pve-OLD-059360C7/vm-200-disk-0 does not accept LV type thin.
  Command not permitted on LV pve-OLD-059360C7/vm-200-disk-0.

I tried many other commands - fsck, e2fsck, mount, and others I can't remember - and almost none of them ran successfully.

At this point I'm out of ideas for what to try. While I have most of my important data backed up, I would like to regain access to the files on the sda drive. If anything, it might be easier to repair the storage and clone it to sdc than to re-install and re-configure the OSes. It would also help with the files that I don't have good backups for.

Any ideas?

fabian · Aug 4, 2021

well the repair command failed because the pool was active (did you disable the storage first? else it might get re-activated automatically).
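
A minimal sketch of that order of operations, assuming the old pool was added to Proxmox under the storage ID `old-ssd-thin` (a hypothetical name; use whatever ID was chosen when the LVM-Thin storage was created). The `run` helper is only there for illustration: by default it prints each command instead of executing it, so the sequence can be reviewed before anything touches the failing disk.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 to actually run them (as root, on the Proxmox host).
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

STORAGE_ID="old-ssd-thin"   # hypothetical storage ID for the old pool
VG="pve-OLD-059360C7"       # volume group name from this thread

repair_old_pool() {
    # 1. Disable the storage so Proxmox does not re-activate the volumes.
    run pvesm set "$STORAGE_ID" --disable 1 &&
    # 2. Deactivate every LV in the old VG (thin volumes and the pool).
    run lvchange -an "$VG" &&
    # 3. List activation state to confirm nothing is still active.
    run lvs -o lv_name,lv_active "$VG" &&
    # 4. Only now attempt the thin pool repair.
    run lvconvert --repair "$VG/data"
}

repair_old_pool
```

Deactivating the whole volume group rather than just the pool is a deliberate choice here: a thin pool stays in use as long as any of its thin volumes is active.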

there are more specialized recovery tools - if your volumes with data are still properly exposed as block devices by LVM, something like ddrescue will allow you to make a copy and control how errors are handled. you need a raw target file or block device (you can allocate one with pvesm alloc, on your storage that is on the new hard disk).
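
As a rough sketch of that approach: allocate a raw volume of at least the source size on the new disk, then let ddrescue copy into it while recording progress in a mapfile. The storage ID `local-lvm`, the target name `vm-200-disk-1`, the mapfile path, and the 50 GiB size are all assumptions for illustration; on a real system the size would come from `blockdev --getsize64` on the source volume. The `run` helper prints commands by default instead of executing them.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

SRC="/dev/pve-OLD-059360C7/vm-200-disk-0"  # damaged volume, from the error log above
DST_STORAGE="local-lvm"                    # assumption: storage on the new disk
VMID=200
DST_NAME="vm-200-disk-1"                   # assumption: name for the new volume
DST="/dev/pve/$DST_NAME"                   # assumption: where that volume shows up
MAPFILE="/root/vm-200-disk-0.map"          # keep the mapfile OFF the failing disk
# Placeholder size (50 GiB). Real value: blockdev --getsize64 "$SRC"
SRC_BYTES="${SRC_BYTES:-53687091200}"

# pvesm alloc takes a plain number as KiB; round up so the target is never smaller.
bytes_to_kib() { echo $(( ($1 + 1023) / 1024 )); }

copy_volume() {
    run pvesm alloc "$DST_STORAGE" "$VMID" "$DST_NAME" "$(bytes_to_kib "$SRC_BYTES")" &&
    # -f (--force) is needed because the output is an existing block device.
    run ddrescue -f "$SRC" "$DST" "$MAPFILE"
}

copy_volume
```

The mapfile is what makes the copy resumable: an interrupted run can be restarted with the same three arguments and ddrescue continues where it stopped.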

Ninjas · Aug 4, 2021

ddrescue worked amazingly well for the Windows volume, and I have that VM back up and running now. Thanks for that suggestion!

I have ddrescue going for the 400GB Linux VM now too. It'll take a few more hours but I'm hoping for a similar successful result.

Update: that worked well too, with 99.9% of the data recovered.

MrPete · Mar 31, 2024

I realize this is an old issue, but I just ran across it for the first time. I have extensive ddrescue experience, having helped the author add some improvements. Here are a few suggestions, based on more experience than I would wish for.

For both rotating and SSD storage, if you have any concern that the drive may be dying, consider these suggested priority goals:
* Preserve as much data as possible, as quickly as possible
* Avoid heating the dying drive

To do this, take advantage of optional ddrescue switches:
* Set a minimum "good" data transfer rate. I usually use ~20kB/sec -- if going slower than that, something is terribly wrong
* Set a sizeable jump-on-failure distance. I usually use 1GB. You can go back later to grab more data, but to start with, you want to find the good areas and grab them.
* Set a sizeable read size to start. I usually use 128kB to 256kB.
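
The post does not name the switches. One plausible mapping, offered here as an assumption, is `--min-read-rate` for the minimum rate, `--skip-size` for the jump distance, and `--cluster-size` for the read size (given in sectors, so 128 KiB is 256 sectors at ddrescue's default 512-byte sector size). How a given ddrescue version rounds or caps these values is worth checking in its manual. Paths are placeholders, and the `run` helper prints the command by default rather than executing it.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

SRC="/dev/pve-OLD-059360C7/vm-200-disk-0"  # volume on the failing drive
DST="/dev/pve/vm-200-disk-1"               # placeholder target on a healthy drive
MAPFILE="/root/vm-200-disk-0.map"          # placeholder, stored on a healthy drive
SECTOR_BYTES=512                           # ddrescue's default sector size
READ_BYTES=131072                          # 128 KiB per read

first_pass() {
    # 20k   = skip ahead when the read rate drops below ~20 kB/s
    # 1Gi   = initial distance to jump after an error or slow read
    # 256   = sectors per read (READ_BYTES / SECTOR_BYTES)
    run ddrescue -f \
        --min-read-rate=20k \
        --skip-size=1Gi \
        --cluster-size="$(( READ_BYTES / SECTOR_BYTES ))" \
        "$SRC" "$DST" "$MAPFILE"
}

first_pass
```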

Next, pay attention during your first ddrescue run on a given machine and drive. The big question: if there is a failure, does the controller recover properly without being fully reset? Particularly with the above switches set, you do NOT want to see endless failures after the first one... but it is quite possible, not because the drive is so bad, but because the controller gets confused. There's a ddrescue switch to force a reset-on-fail. Don't use it unless needed (it slows things down a lot).
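
The switch is not named in the post. The closest ddrescue option appears to be `--reopen-on-error` (`-O`), which closes and reopens the input after every read error; treating it as the "reset-on-fail" switch is an assumption. A small sketch that keeps it off unless explicitly requested, with placeholder paths and a `run` helper that prints rather than executes by default:

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

SRC="/dev/pve-OLD-059360C7/vm-200-disk-0"  # volume on the failing drive
DST="/dev/pve/vm-200-disk-1"               # placeholder target
MAPFILE="/root/vm-200-disk-0.map"          # placeholder mapfile
# Off by default; enable only if errors keep cascading after the first failure.
REOPEN_ON_ERROR="${REOPEN_ON_ERROR:-0}"

rescue() {
    extra=""
    if [ "$REOPEN_ON_ERROR" = 1 ]; then extra="--reopen-on-error"; fi
    # $extra is left unquoted on purpose so an empty value adds no argument.
    run ddrescue -f $extra "$SRC" "$DST" "$MAPFILE"
}

rescue
```

Because the same mapfile is reused, switching the option on partway through does not discard what was already copied.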

Final hint: if you have time to pay attention, don't just let it run with the above. Get through the first pass with the above settings, but stop before the second pass: by default, ddrescue on pass 2 starts reading backwards in small increments. Instead, start a new run with a reduced jump size (maybe 100MB), and so on. It's up to you what increments to use. Just remember: reading forwards is a LOT faster than sector-at-a-time backwards.

Hope that helps someone out there.
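
One way to script that staged approach, sketched under several assumptions: that the installed ddrescue supports `--cpass` to restrict a run to the first (forward) copying pass, that `--no-trim` and `--no-scrape` keep it from moving on to the slow phases, and that 1Gi, 100Mi, 10Mi is a sensible sequence of jump sizes (only the first two come from the post). Paths are placeholders, and the `run` helper prints commands by default instead of executing them.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

SRC="/dev/pve-OLD-059360C7/vm-200-disk-0"  # volume on the failing drive
DST="/dev/pve/vm-200-disk-1"               # placeholder target
MAPFILE="/root/vm-200-disk-0.map"          # same mapfile for every run

staged_runs() {
    # Forward-only runs with a shrinking jump size. Each run picks up the
    # areas the previous one skipped, because the mapfile is shared.
    for skip in 1Gi 100Mi 10Mi; do
        run ddrescue -f --cpass=1 --no-trim --no-scrape \
            --min-read-rate=20k --skip-size="$skip" \
            "$SRC" "$DST" "$MAPFILE" || return 1
    done
    # Last run with default behaviour, letting ddrescue work on what is left.
    run ddrescue -f "$SRC" "$DST" "$MAPFILE"
}

staged_runs
```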
