[SOLVED] How to recover failed Raid / access storage

Fladi

Renowned Member
Feb 27, 2015
Hi all,

A friend of a friend had a Proxmox server running, and since I sometimes play around with Proxmox, I was asked for help.

The situation is the following.

The server has 3 disks: one SATA DOM with Proxmox (3.x *argh*) installed, and two spinning drives which are in a RAID somehow. Now one of the spinners has died completely, and the SATA DOM has I/O errors and is not able to boot. It seems to be set up with ZFS.

So basically a server which seems to have run for many years without any love given to it.

I plugged in another drive and installed the latest Proxmox 5.3 on it. I thought I could perhaps just import the pool.

A normal import was not possible due to the I/O errors, but with my limited skills I managed to import the rpool read-only.

Code:
root@pve001:~# zpool status
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0h3m with 0 errors on Thu Jan 14 08:46:13 2016
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          sda3      ONLINE       0     0     0

errors: 3 data errors, use '-v' for a list


The problem is that I can't see the "data" disk in the pool, and I have no clue how it was set up. I can only see the data disk with fdisk:

Code:
Disk /dev/sdd: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: F53023F3-D600-5240-AF69-4FAEDB529D9A

Device          Start        End    Sectors  Size Type
/dev/sdd1        2048 3907012607 3907010560  1.8T Solaris /usr & Apple ZFS
/dev/sdd9  3907012608 3907028991      16384    8M Solaris reserved 1


So there seems to be some ZFS stuff on the disk which holds the VM images. The 3 data errors are from older image files and can be ignored.

I even tried to find out how it was configured, but I could not access the PVE installation on the rpool, because its mountpoint is / and / is already in use by the running system. The dataset should be available in general, but I'm not sure how to mount it at another path.

Code:
root@pve001:~# zfs get all | grep mount
rpool             mounted     yes          -
rpool             mountpoint  /rpool       default
rpool             canmount    on           default
rpool/ROOT        mounted     yes          -
rpool/ROOT        mountpoint  /rpool/ROOT  default
rpool/ROOT        canmount    on           default
rpool/ROOT/pve-1  mounted     no           -
rpool/ROOT/pve-1  mountpoint  /            local
rpool/ROOT/pve-1  canmount    on           default
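
For readers stuck at the same point: a dataset whose mountpoint is / can usually be mounted somewhere else by importing the pool under an alternate root. The following is only a sketch, assuming the pool is currently exported (run "zpool export rpool" first if it is still imported) and that /mnt/oldroot is unused. ALTROOT, RUN and the function names are made up for illustration and do not come from the thread. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: import a pool read-only under an alternate root, so that a dataset
# with mountpoint=/ ends up below ALTROOT instead of colliding with the
# running system. ALTROOT and RUN are illustrative names, not from the thread.
POOL="${POOL:-rpool}"
DATASET="${DATASET:-rpool/ROOT/pve-1}"
ALTROOT="${ALTROOT:-/mnt/oldroot}"
RUN="${RUN:-0}"    # 0 = print the commands only, 1 = execute them

run() {
    if [ "$RUN" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

import_under_altroot() {
    # -N: do not mount anything yet, -R: prefix all mountpoints with ALTROOT
    run zpool import -f -N -o readonly=on -R "$ALTROOT" "$POOL"
    # mount only the old root dataset; it should appear at ALTROOT itself
    run zfs mount "$DATASET"
}

import_under_altroot
```

With RUN=1 the old installation should then be browsable below /mnt/oldroot, without touching the / of the rescue system.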


Long story short: any idea how I can access the data on this sdd1?

best regards
Tim
 
Hi,

As the zpool status output suggests, you should try

zpool status -v
so you can see what the errors are.

As a guess, I think that even the unbroken disk has problems, as zpool status suggests. You could also check whether the SMART statistics are OK by running a long self-test.
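
A minimal sketch of both checks, assuming smartmontools is installed and that /dev/sdX stands in for the surviving disk (a placeholder, not a device named in the thread). The RUN switch and the function names are illustrative only; by default the commands are printed, not executed.

```shell
#!/bin/sh
# Sketch: list the damaged files, then start and later read a SMART long
# self-test. DISK is a placeholder; RUN=0 (default) only prints the commands.
DISK="${DISK:-/dev/sdX}"
RUN="${RUN:-0}"

run() {
    if [ "$RUN" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

start_checks() {
    run zpool status -v             # names the files or volumes with errors
    run smartctl -t long "$DISK"    # returns at once; the drive tests itself
}

show_results() {
    run smartctl -l selftest "$DISK"    # self-test log, read after the test ends
    run smartctl -A "$DISK"             # attribute table of the drive
}

start_checks
```

The long test can take hours on a large disk, so show_results is meant to be called separately once it has finished.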

Good luck!
 
Hi guys,
thanks for the hint. It helped me look at the old PVE installation, which on the other hand didn't help me access the data on the other drive.

But the solution is quite easy (although it took me a lot of time to figure out).

So, for others with similar problems:

1. Use "fdisk -l" to check which disks are present in your system.
2. Use "zdb -l /dev/sddX" (where X is your ZFS partition). This gives you some information, such as the pool name.
3. Now run "zpool import -d /dev/disk/by-id/".

Now the system should see the pool. It may mention that the pool was last used by another system (this happens when you run the import from another computer or from a fresh Proxmox installation).

4. Do the import with "zpool import -f -d /dev/disk/by-id/ -F <poolname>" (the pool name can be taken from the zdb command above). You may add "-o readonly=on" between -f and -d ;-)

After this, the pool was available for me (degraded, of course), and I could copy the data off it.
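
Steps 2 to 4 can be sketched as a small script. This is only an illustration: PART is a placeholder for the ZFS partition found with "fdisk -l", the function names and the RUN switch are invented here, and the import commands are printed rather than executed unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch of steps 2-4. PART is a placeholder for the ZFS partition;
# RUN=0 (default) only prints the import commands.
PART="${PART:-/dev/sdd1}"
RUN="${RUN:-0}"

run() {
    if [ "$RUN" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Read "zdb -l" output on stdin and print the pool name of the first label.
# Only a line whose key is exactly "name:" matches, so "hostname:" is skipped.
pool_name_from_label() {
    awk -F"'" '$1 ~ /^[ \t]*name:[ \t]*$/ { print $2; exit }'
}

recover_pool() {
    pool="$1"
    # step 3: list the pools that can be found under by-id
    run zpool import -d /dev/disk/by-id/
    # step 4: forced, read-only import with recovery mode (-F)
    run zpool import -f -o readonly=on -d /dev/disk/by-id/ -F "$pool"
}

# Only touch real devices when zdb exists and PART is a block device.
if command -v zdb >/dev/null 2>&1 && [ -b "$PART" ]; then
    recover_pool "$(zdb -l "$PART" | pool_name_from_label)"
fi
```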

Hope this might help others :)
 
How did you copy the data out after mounting read-only?
 
Hi @novafreak69,

It's been a while since I wrote that, and I can't fully remember. But basically it depends on what you have stored. If you have the pool mounted, you can just copy the files to another target.
 
I have an upgrade gone bad: the server crashed after a dist-upgrade, and the only way I can get the pool to mount is by booting the previous version from the GRUB menu. Trying to copy the read-only data off from there fails... :/
 
Code:
root@novafreakVM:~# zdb -l /dev/sdb1
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'Pool_1'
    state: 0
    txg: 11716518
    pool_guid: 5884478477829533997
    errata: 0
    hostid: 2851753585
    hostname: 'novafreakVM'
    top_guid: 15039349642334934681
    guid: 15039349642334934681
    vdev_children: 1
    vdev_tree:
        type: 'disk'
        id: 0
        guid: 15039349642334934681
        path: '/dev/sdb1'
        devid: 'scsi-3690b11c049f962002571f9dd0dd43fe8-part1'
        phys_path: 'pci-0000:03:00.0-scsi-0:2:1:0'
        whole_disk: 1
        metaslab_array: 256
        metaslab_shift: 34
        ashift: 12
        asize: 5997906624512
        is_log: 0
        DTL: 2432
        create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
    labels = 0 1 2 3
root@novafreakVM:~# df -h /mnt
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/pve-root   33G  9.4G   22G  30% /
root@novafreakVM:~# zpool import -d /dev/disk/by-id/
no pools available to import
root@novafreakVM:~# zpool import -f -o readonly=on  -d /dev/disk/by-id/ -F Pool_1

This fails with the current version selected in GRUB...

Well, I should not say it fails, but it never finishes importing. I let it run overnight and it was still stuck at this point.
 
I'm not sure, but there might be something missing in my instructions. I would expect the actual disk ID to appear after /dev/disk/by-id/. But it seems you did not build your pool with "by-id" paths but with /dev/... (see the path entry in the zdb output).

So try zpool import with /dev/sdb1 (or just /dev).
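
As a rough sketch of that idea, the search directory for "zpool import -d" can be read from the path entry of the label instead of being guessed. The function names are hypothetical, and the sample lines are copied from the zdb output earlier in the thread. Nothing is imported here; the script only prints a suggested command.

```shell
#!/bin/sh
# Read "zdb -l" output on stdin and print the directory part of the first
# vdev path. Only a key that is exactly "path:" matches, not "phys_path:".
label_search_dir() {
    awk -F"'" '$1 ~ /^[ \t]*path:[ \t]*$/ { print $2; exit }' |
        sed 's|/[^/]*$||'
}

# Print (not run) a read-only import that searches the given directory.
suggest_import() {
    dir="$1"
    pool="$2"
    printf 'zpool import -f -o readonly=on -d %s %s\n' "$dir" "$pool"
}

# Sample label lines taken from the zdb output in this thread.
dir="$(printf "        path: '/dev/sdb1'\n        phys_path: 'pci-0000:03:00.0-scsi-0:2:1:0'\n" | label_search_dir)"
suggest_import "$dir" Pool_1
# prints: zpool import -f -o readonly=on -d /dev Pool_1
```

On a live system the sample would be replaced by "zdb -l /dev/sdb1 | label_search_dir".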
 
Code:
root@novafreakVM:~# zpool status -v
  pool: Pool_1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 380K in 06:08:38 with 1 errors on Sun Oct 24 18:06:30 2021
config:

        NAME        STATE     READ WRITE CKSUM
        Pool_1      ONLINE       0     0     0
          sdb       ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        Pool_1/vm-110-disk-0:<0x1>
root@novafreakVM:~# cd /Pool_1
root@novafreakVM:/Pool_1# ls
root@novafreakVM:/Pool_1#

I would expect to see some VM disk files here, yes?


Code:
root@novafreakVM:/Pool_1# lsblk
NAME               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                  8:0    0 136.1G  0 disk
├─sda1               8:1    0  1007K  0 part
├─sda2               8:2    0   512M  0 part
└─sda3               8:3    0 135.6G  0 part
  ├─pve-swap       253:0    0     8G  0 lvm  [SWAP]
  ├─pve-root       253:1    0  33.8G  0 lvm  /
  ├─pve-data_tmeta 253:2    0     1G  0 lvm
  │ └─pve-data     253:4    0  75.9G  0 lvm
  └─pve-data_tdata 253:3    0  75.9G  0 lvm
    └─pve-data     253:4    0  75.9G  0 lvm
sdb                  8:16   0   5.5T  0 disk
├─sdb1               8:17   0   5.5T  0 part
└─sdb9               8:25   0     8M  0 part
sr0                 11:0    1   3.7G  0 rom
sr1                 11:1    1  1024M  0 rom
zd0                230:0    0   100G  1 disk
├─zd0p1            230:1    0     1M  1 part
└─zd0p2            230:2    0   100G  1 part
zd16               230:16   0   100G  1 disk
├─zd16p1           230:17   0     1M  1 part
└─zd16p2           230:18   0   100G  1 part
zd32               230:32   0   100G  1 disk
├─zd32p1           230:33   0   579M  1 part
└─zd32p2           230:34   0  99.4G  1 part
zd48               230:48   0   4.4T  1 disk

Code:
root@novafreakVM:/# zpool status -v
  pool: Pool_1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 380K in 06:08:38 with 1 errors on Sun Oct 24 18:06:30 2021
config:

        NAME        STATE     READ WRITE CKSUM
        Pool_1      ONLINE       0     0     0
          sdb       ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        Pool_1/vm-110-disk-0:<0x1>
root@novafreakVM:/# zpool list
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
Pool_1  5.45T  4.57T   901G        -         -     0%    83%  1.00x    ONLINE  -
root@novafreakVM:/# zfs list
NAME                   USED  AVAIL     REFER  MOUNTPOINT
Pool_1                4.83T   459G       96K  /Pool_1
Pool_1/vm-100-disk-0   103G   471G     90.9G  -
Pool_1/vm-110-disk-0   103G   471G     90.6G  -
Pool_1/vm-110-disk-1  4.53T   645G     4.35T  -
Pool_1/vm-120-disk-0   103G   518G     44.2G  -
 
The zfs list output shows your VM disks. Now try to rescue them (e.g. with zfs send and receive to another machine). Running ZFS on only a single disk is not good practice.
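
A hedged sketch of what such a transfer could look like. DEST_HOST and DEST_POOL are hypothetical names for the receiving machine and its pool, and the function name is invented for illustration. The OpenZFS zfs-send documentation describes sending a volume directly when the pool is imported read-only; this is untested here, and older releases may insist on a snapshot, which needs a writable import. The script only prints the pipelines.

```shell
#!/bin/sh
# Sketch: print one send/receive pipeline per volume.
# DEST_HOST and DEST_POOL are hypothetical; nothing is executed here.
SRC_POOL="${SRC_POOL:-Pool_1}"
DEST_HOST="${DEST_HOST:-newhost}"
DEST_POOL="${DEST_POOL:-tank}"

send_pipeline() {
    vol="$1"
    printf 'zfs send %s/%s | ssh root@%s zfs receive %s/%s\n' \
        "$SRC_POOL" "$vol" "$DEST_HOST" "$DEST_POOL" "$vol"
}

# Volume names as shown by "zfs list" in this thread; vm-110-disk-0 is left
# out because zpool status reports permanent errors in it.
for vol in vm-100-disk-0 vm-110-disk-1 vm-120-disk-0; do
    send_pipeline "$vol"
done
```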
 
I was able to use the copy command. Will this be enough to restore these machines on a new host?
Code:
root@novafreakVM:/dev/zvol/Pool_1# ls
vm-100-disk-0  vm-100-disk-0-part1  vm-100-disk-0-part2  vm-110-disk-0  vm-110-disk-0-part1  vm-110-disk-0-part2  vm-110-disk-1  vm-120-disk-0  vm-120-disk-0-part1  vm-120-disk-0-part2
root@novafreakVM:/dev/zvol/Pool_1# cp vm-100-disk-0 /mnt/pve/Backups/backups/
root@novafreakVM:/dev/zvol/Pool_1# cp vm-100-disk-0-part1 /mnt/pve/Backups/backups/
root@novafreakVM:/dev/zvol/Pool_1# cp vm-100-disk-0-part2 /mnt/pve/Backups/backups/
root@novafreakVM:/dev/zvol/Pool_1# mkdir /mnt/pve/Backups/backups/
root@novafreakVM:/dev/zvol/Pool_1# cp vm-101-disk-0 /mnt/pve/Backups/backups/
cp: cannot stat 'vm-101-disk-0': No such file or directory
root@novafreakVM:/dev/zvol/Pool_1# cp vm-110-disk-0 /mnt/pve/Backups/backups/
cp: error reading 'vm-110-disk-0': Input/output error
root@novafreakVM:/dev/zvol/Pool_1# cp vm-110-disk-0-part1 /mnt/pve/Backups/backups/
root@novafreakVM:/dev/zvol/Pool_1# cp vm-110-disk-0-part2 /mnt/pve/Backups/backups/
cp: error reading 'vm-110-disk-0-part2': Input/output error
root@novafreakVM:/dev/zvol/Pool_1# cp vm-120-disk-0 /mnt/pve/Backups/backups/
root@novafreakVM:/dev/zvol/Pool_1#
root@novafreakVM:/dev/zvol/Pool_1# mkdir /mnt/pve/Backups/backups/
root@novafreakVM:/dev/zvol/Pool_1# cp vm-120-disk-0-part1 /mnt/pve/Backups/backups/
root@novafreakVM:/dev/zvol/Pool_1# cp vm-120-disk-0-part2 /mnt/pve/Backups/backups/
root@novafreakVM:/dev/zvol/Pool_1# cp vm-110-disk-1 /mnt/pve/Backups/backups/


Apart from the Input/output errors on vm-110-disk-0, they all copied successfully. But I worry these may not be the right files: they appear to be the right size, but they have no file extension??
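
A note on the listing above: the entries under /dev/zvol/Pool_1 are block devices (zvols), so cp reads the whole virtual disk and writes it out as a raw image. That is presumably why the copies have the right size but no extension. The -partN entries are partitions inside the same disk, so the whole-disk copy should already contain them. For a volume that throws Input/output errors, dd can be told to carry on past unreadable blocks. A minimal sketch, assuming GNU dd; the function name and the .raw target name are made up for illustration.

```shell
#!/bin/sh
# Copy SRC to DST block by block. With conv=noerror,sync a block that cannot
# be read (or is read short) is filled up with zeros and the copy continues
# instead of aborting.
image_copy() {
    src="$1"
    dst="$2"
    dd if="$src" of="$dst" bs=64k conv=noerror,sync
}

# Intended use (not executed here):
#   image_copy /dev/zvol/Pool_1/vm-110-disk-0 \
#       /mnt/pve/Backups/backups/vm-110-disk-0.raw
```

The small block size is a deliberate choice: every unreadable block is replaced by zeros as a whole, so smaller blocks lose less data around a bad spot, at the cost of speed.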
 
Seems like you didn't use zvol block storage for your VMs? So if these are just raw or qcow files, you should be fine. Just give it a try. You can copy your Proxmox VM config from /etc/pve/nodes/....<your-node-name>../qemu-server/100.conf to the new server. Is your Proxmox running? If so, create a new VM, replace the created disk with the one from your backup, and see if it works.
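
One possible way to do the "replace the disk" part from the shell, sketched under assumptions: VM ID 100 and the backup path come from this thread, while the target storage name local-lvm is a guess that has to be adapted. The function name is hypothetical. The script only prints the commands.

```shell
#!/bin/sh
# Sketch: print the commands to inspect a copied image and hand it to a VM.
# The storage name is an assumption; nothing is executed here.
import_cmds() {
    vmid="$1"
    image="$2"
    storage="$3"
    # shows the detected format (raw, qcow2, ...) and the virtual size
    printf 'qemu-img info %s\n' "$image"
    # copies the image into the storage and adds it to the VM as an unused disk
    printf 'qm importdisk %s %s %s\n' "$vmid" "$image" "$storage"
}

import_cmds 100 /mnt/pve/Backups/backups/vm-100-disk-0 local-lvm
```

After the import, the disk still has to be attached to the VM (it shows up as an unused disk) and selected in the boot order.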
 
I am not sure which file is the raw image... It looks like each "disk" has a file for each partition, maybe?
 

Attachment: disk images.JPG
 
Thought I would also add that I figured out a way to mount the disks read-only and copy data from them...

Code:
root@novafreakVM:/dev/zvol/Pool_1# ls -w 1
vm-100-disk-0
vm-100-disk-0-part1
vm-100-disk-0-part2
vm-110-disk-0
vm-110-disk-0-part1
vm-110-disk-0-part2
vm-110-disk-1
vm-120-disk-0
vm-120-disk-0-part1
vm-120-disk-0-part2
root@novafreakVM:/dev/zvol/Pool_1# ls -l --block-size=M
total 0M
lrwxrwxrwx 1 root root 1M Oct 26 15:25 vm-100-disk-0 -> ../../zd0
lrwxrwxrwx 1 root root 1M Oct 26 15:25 vm-100-disk-0-part1 -> ../../zd0p1
lrwxrwxrwx 1 root root 1M Oct 26 15:25 vm-100-disk-0-part2 -> ../../zd0p2
lrwxrwxrwx 1 root root 1M Oct 26 15:25 vm-110-disk-0 -> ../../zd16
lrwxrwxrwx 1 root root 1M Oct 26 15:25 vm-110-disk-0-part1 -> ../../zd16p1
lrwxrwxrwx 1 root root 1M Oct 26 15:25 vm-110-disk-0-part2 -> ../../zd16p2
lrwxrwxrwx 1 root root 1M Oct 26 15:25 vm-110-disk-1 -> ../../zd48
lrwxrwxrwx 1 root root 1M Oct 26 15:25 vm-120-disk-0 -> ../../zd32
lrwxrwxrwx 1 root root 1M Oct 26 15:25 vm-120-disk-0-part1 -> ../../zd32p1
lrwxrwxrwx 1 root root 1M Oct 26 15:25 vm-120-disk-0-part2 -> ../../zd32p2

root@novafreakVM:/dev# mkdir /mnt/data
root@novafreakVM:/dev# mount /dev/zd48 /mnt/data
mount: /mnt/data: cannot mount /dev/zd48 read-only.
root@novafreakVM:/dev# mount -o ro,noload /dev/zd48 /mnt/data
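
The plain mount probably fails because the zvol is read-only (RO is 1 in the lsblk output) and the filesystem wants to replay its journal, which needs write access; noload tells ext4 to skip that. A small sketch that picks a matching option per filesystem type. The function names are hypothetical, and the xfs case is an assumption that does not come from the thread.

```shell
#!/bin/sh
# Print mount options for a read-only mount that avoids any write attempt.
ro_mount_opts() {
    case "$1" in
        ext3|ext4) echo "ro,noload" ;;       # skip journal replay
        xfs)       echo "ro,norecovery" ;;   # skip log recovery
        *)         echo "ro" ;;
    esac
}

# Detect the filesystem on DEV with blkid and mount it read-only at DIR.
mount_zvol_ro() {
    dev="$1"
    dir="$2"
    fstype="$(blkid -o value -s TYPE "$dev")"
    mkdir -p "$dir"
    mount -o "$(ro_mount_opts "$fstype")" "$dev" "$dir"
}

# Intended use (not executed here):
#   mount_zvol_ro /dev/zd48 /mnt/data
```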
 
Well... I tried something off the cuff here. After importing my pool read-only, I was able to clone my VMs :D and they start up just fine. Now I will blow away the old ones and do a fresh install of Proxmox, start over, and not make the same mistakes again. Like not having backups... :)
 
Better not to use ZFS on top of hardware RAID. Get a cheap HBA controller instead, so that ZFS can access the disks directly.
 