All ZFS disks degraded, is it possible to recover?

dewangga

Member
May 2, 2020
I have a situation related to ZFS; if this is the wrong forum, please let me know.

The context: my engineer accidentally removed a healthy disk while the pool was in a degraded state. This is the last state before the healthy disk was removed.
[Screenshot: 1631766394562.jpeg]

My engineer removed sdg, so the pool state changed to SUSPENDED.

[Screenshot: 1631766428085.jpeg]

And since the removal caused the disk names to shift from sdg to sdh, I tried to bring sdc online by invoking 'zpool clear vmdata sdc'. The pool state changed to DEGRADED, sdc changed to ONLINE, but the rest of the disks changed to DEGRADED.
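For reference, these are roughly the commands I ran (pool and device names as above):

Code:
# clear the errors on sdc so ZFS tries to bring it back and resilver
zpool clear vmdata sdc
# check the resulting pool and per-disk state
zpool status vmdata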

[Screenshot: Screen Shot 2021-09-16 at 11.31.17.png]

My question is: since zpool status shows a resilver in progress, is it possible to get the pool HEALTHY again?

Thanks.
 
A raidz1 can handle one failed drive (sdc), but it appears that another drive (sdg) was removed. It cannot recover from two non-working drives: sdc failed and the original sdg is now missing (replaced with an empty drive). Maybe putting the original sdg drive back in can help recover some of the data? I fear that this pool is broken and you'll need to restore from your backups, sorry.
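If you still have the original sdg disk, a rough sketch of what I would try (pool name taken from your output; whether the pool can actually be exported and re-imported in this state is not guaranteed):

Code:
# export the suspended pool, physically reinsert the original sdg disk, then
# re-import by persistent IDs so the sdg/sdh renaming doesn't matter
zpool export -f vmdata
zpool import -d /dev/disk/by-id vmdata
# clear the error state and check whether the resilver can continue
zpool clear vmdata
zpool status vmdata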
 
So that means even with a zfs scrub there is no possibility to bring the data back, is that right?
 
Yes, you basically have more failed drives than the pool can handle. Maybe for the future it would make sense to...
1.) use a raidz2 if you have 6 disks, so it can handle another failing disk (and it should cause less overhead, because you can use a 16K blocksize instead of 32K. If you are limited to a max 16K blocksize anyway, because of 16K SQL queries, both raidz1 and raidz2 only let you effectively use 4 of the 6 drives: with raidz1 you lose one drive to parity and one drive to padding overhead, while with raidz2 you lose two drives to parity but get no padding overhead at the same blocksize.)
2.) use "/dev/disk/by-id/proto-vendor_model_serial" instead of "/dev/sdX" when creating the pool, so you work with unique IDs and the disks won't switch names. And because the serial of the drive is in the disk's path, you can easily verify that you have the right drive before physically removing it. (A sketch follows below.)
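A minimal sketch combining both points (the pool name "tank" and the by-id paths are placeholders; your real paths will look something like ata-VENDOR_MODEL_SERIAL):

Code:
# create a 6-disk raidz2 using persistent by-id paths instead of sdX names
zpool create -o ashift=12 tank raidz2 \
  /dev/disk/by-id/ata-VENDOR_MODEL_SERIAL1 \
  /dev/disk/by-id/ata-VENDOR_MODEL_SERIAL2 \
  /dev/disk/by-id/ata-VENDOR_MODEL_SERIAL3 \
  /dev/disk/by-id/ata-VENDOR_MODEL_SERIAL4 \
  /dev/disk/by-id/ata-VENDOR_MODEL_SERIAL5 \
  /dev/disk/by-id/ata-VENDOR_MODEL_SERIAL6
# an existing sdX-based pool can also be switched over by exporting it and
# re-importing it from the by-id directory
zpool export tank
zpool import -d /dev/disk/by-id tank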
 
Thanks for your feedback @Dunuin.

Related to this incident, the ZFS pool is confirmed broken, but... I am still able to see the partition from grub rescue, and the partition shown by ls (hd0,1) contains the data. Is it still possible to recover it and export it to an external hard drive or something?
[Screenshot: Screen Shot 2021-09-20 at 19.19.25.png]

I tried to export it using the snapshot method to NFS / shared storage, but it throws an error.
[Screenshot: 2021-09-20 19.23.43.jpg]
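For reference, the snapshot export I attempted looks roughly like this (the snapshot name and the external mount path are placeholders):

Code:
# snapshot the VM zvol and stream it to a file on external/shared storage
zfs snapshot vmdata/vm-301-disk-0@rescue
zfs send vmdata/vm-301-disk-0@rescue > /mnt/external/vm-301-disk-0.zfs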
I also tried this method, but no luck:

Code:
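# offset 32256 = 63 sectors x 512 bytes, assuming an MBR partition table
# with the first partition starting at sector 63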
mount -o loop,offset=32256,ro /dev/vmdata/vm-301-disk-0 /mnt/vmid301
mount: /mnt/vmid301: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.
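
If the offset guess is wrong, an alternative I might try is letting the kernel scan the zvol's partition table instead of guessing the offset (the loop device number is whatever losetup assigns):

Code:
# attach the zvol read-only as a loop device and scan its partition table
losetup --find --show --read-only -P /dev/vmdata/vm-301-disk-0
# the partitions then appear as /dev/loopXpN and can be mounted read-only,
# e.g. assuming loop0 was assigned:
mount -o ro /dev/loop0p1 /mnt/vmid301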

Is there still any possibility to get the data back?
Any advice and feedback are appreciated.