ZFS pool lost after power outage

VartKat

Member
Apr 1, 2021
Hello,

I've installed a PBS server as a virtual machine on PVE and attached an external hard drive as a ZFS pool. It was working fine until we suffered a power outage.
On reboot the ZFS pool was gone.


Here is what I tried:

Code:
root@pbs:~# dmesg |grep sdb
[    1.372620] sd 2:0:0:2: [sdb] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
[    1.372656] sd 2:0:0:2: [sdb] Write Protect is off
[    1.372658] sd 2:0:0:2: [sdb] Mode Sense: 63 00 00 08
[    1.372769] sd 2:0:0:2: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.375823]  sdb: sdb1 sdb9
[    1.376512] sd 2:0:0:2: [sdb] Attached SCSI disk

root@pbs:~# lsblk
NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda            8:0    0    32G  0 disk
├─sda1         8:1    0  1007K  0 part
├─sda2         8:2    0   512M  0 part
└─sda3         8:3    0  31.5G  0 part
  ├─pbs-swap 253:0    0   3.9G  0 lvm  [SWAP]
  └─pbs-root 253:1    0  23.8G  0 lvm  /
sdb            8:16   0 931.5G  0 disk
├─sdb1         8:17   0 931.5G  0 part
└─sdb9         8:25   0     8M  0 part
sr0           11:0    1  1024M  0 rom


root@pbs:~# ls -la /dev/disk/by-id/
total 0
drwxr-xr-x 2 root root 300 Sep 15 14:20 .
drwxr-xr-x 8 root root 160 Sep 15 14:20 ..
lrwxrwxrwx 1 root root   9 Sep 15 14:20 ata-QEMU_DVD-ROM_QM00003 -> ../../sr0
lrwxrwxrwx 1 root root  10 Sep 15 14:20 dm-name-pbs-root -> ../../dm-1
lrwxrwxrwx 1 root root  10 Sep 15 14:20 dm-name-pbs-swap -> ../../dm-0
lrwxrwxrwx 1 root root  10 Sep 15 14:20 dm-uuid-LVM-I7Md2lgFHPqfbQfb9CcdRPHsDPAvotSx2uYkVxK3ghF7qzpVisRmUR5C4W6x0akA -> ../../dm-1
lrwxrwxrwx 1 root root  10 Sep 15 14:20 dm-uuid-LVM-I7Md2lgFHPqfbQfb9CcdRPHsDPAvotSxJsDAm2chlroWXaV8QBY1fgTMdf7wJdH4 -> ../../dm-0
lrwxrwxrwx 1 root root  10 Sep 15 14:20 lvm-pv-uuid-qB4eVV-RZ7S-QT7w-WK8p-G81T-Gh9t-23KUwY -> ../../sda3
lrwxrwxrwx 1 root root   9 Sep 15 14:20 scsi-0QEMU_QEMU_HARDDISK_drive-scsi0 -> ../../sda
lrwxrwxrwx 1 root root  10 Sep 15 14:20 scsi-0QEMU_QEMU_HARDDISK_drive-scsi0-part1 -> ../../sda1
lrwxrwxrwx 1 root root  10 Sep 15 14:20 scsi-0QEMU_QEMU_HARDDISK_drive-scsi0-part2 -> ../../sda2
lrwxrwxrwx 1 root root  10 Sep 15 14:20 scsi-0QEMU_QEMU_HARDDISK_drive-scsi0-part3 -> ../../sda3
lrwxrwxrwx 1 root root   9 Sep 15 14:20 scsi-0QEMU_QEMU_HARDDISK_drive-scsi2 -> ../../sdb
lrwxrwxrwx 1 root root  10 Sep 15 14:20 scsi-0QEMU_QEMU_HARDDISK_drive-scsi2-part1 -> ../../sdb1
lrwxrwxrwx 1 root root  10 Sep 15 14:20 scsi-0QEMU_QEMU_HARDDISK_drive-scsi2-part9 -> ../../sdb9


root@pbs:~# zpool import -a -d /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi2-part1
cannot import 'BckDsk1T': I/O error
    Destroy and re-create the pool from
    a backup source.

root@pbs:~# zpool status -x
no pools available

root@pbs:~# zpool status -P
no pools available

root@pbs:~# zpool import -F
   pool: BckDsk1T
     id: 6864508705939350378
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

    BckDsk1T    ONLINE
      sdb       ONLINE

root@pbs:~# zpool list
no pools available

root@pbs:~# zpool import BckDsk1T
cannot import 'BckDsk1T': I/O error
    Destroy and re-create the pool from
    a backup source.
root@pbs:~# zpool status -x
no pools available
root@pbs:~# zpool upgrade -a
This system supports ZFS pool feature flags.

All pools are already formatted using feature flags.

Every feature flags pool already has all supported features enabled.
root@pbs:~# zpool import -F BckDsk1T
cannot import 'BckDsk1T': I/O error
    Destroy and re-create the pool from
    a backup source.


root@pbs:~# ls -la /mnt/datastore/
total 12
drwxr-xr-x 3 root root 4096 May 13 12:07 .
drwxr-xr-x 4 root root 4096 May 13 12:07 ..
drwxr-xr-x 2 root root 4096 May 13 12:07 BckDsk1T
root@pbs:~# ls -la /mnt/datastore/BckDsk1T/
total 8
drwxr-xr-x 2 root root 4096 May 13 12:07 .
drwxr-xr-x 3 root root 4096 May 13 12:07 ..


root@pbs:~# zpool import -nfFX -R /mnt/datastore/BckDsk1T/
   pool: BckDsk1T
     id: 6864508705939350378
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

    BckDsk1T    ONLINE
      sdb       ONLINE


root@pbs:~# zpool import -a -d /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi2-part1
cannot import 'BckDsk1T': I/O error
    Destroy and re-create the pool from
    a backup source.
root@pbs:~# zpool import -fFX
   pool: BckDsk1T
     id: 6864508705939350378
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

    BckDsk1T    ONLINE
      sdb       ONLINE


root@pbs:~# zdb -l /dev/sdb[1-9]
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'BckDsk1T'
    state: 0
    txg: 4085812
    pool_guid: 6864508705939350378
    errata: 0
    hostid: 3121281629
    hostname: 'pbs'
    top_guid: 10634939957079973197
    guid: 10634939957079973197
    vdev_children: 1
    vdev_tree:
        type: 'disk'
        id: 0
        guid: 10634939957079973197
        path: '/dev/sdb1'
        devid: 'scsi-0QEMU_QEMU_HARDDISK_drive-scsi2-part1'
        phys_path: 'pci-0000:00:05.0-scsi-0:0:0:2'
        whole_disk: 1
        metaslab_array: 131
        metaslab_shift: 33
        ashift: 12
        asize: 1000189984768
        is_log: 0
        DTL: 14642
        create_txg: 4
        degraded: 1
        aux_state: 'err_exceeded'
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
    labels = 0 1 2 3
root@pbs:~# zpool import -fFX BckDsk1T
cannot import 'BckDsk1T': one or more devices is currently unavailable


root@pbs:~# iostat -m /dev/sd?
Linux 5.4.106-1-pve (pbs)     09/16/2021     _x86_64_    (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.05    0.00    0.03    0.07    0.02   99.83

Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda               1.28         0.02         0.01        883        537
sdb               0.47         0.01         0.00        598          2

root@pbs:~# zpool import -F -d /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi2-part1 BckDsk1T
cannot import 'BckDsk1T': I/O error
    Destroy and re-create the pool from
    a backup source.


root@pbs:~# zpool import -F -d /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi2-part1 BckDsk1Tb
cannot import 'BckDsk1Tb': no such pool available

root@pbs:~# systemctl status zfs-import-cache.service
● zfs-import-cache.service - Import ZFS pools by cache file
   Loaded: loaded (/lib/systemd/system/zfs-import-cache.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2021-09-15 14:20:49 CDT; 15h ago
     Docs: man:zpool(8)
 Main PID: 475 (code=exited, status=1/FAILURE)

Sep 15 14:20:48 pbs systemd[1]: Starting Import ZFS pools by cache file...
Sep 15 14:20:49 pbs zpool[475]: cannot import 'BckDsk1T': I/O error
Sep 15 14:20:49 pbs zpool[475]:         Destroy and re-create the pool from
Sep 15 14:20:49 pbs zpool[475]:         a backup source.
Sep 15 14:20:49 pbs systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/FAILURE
Sep 15 14:20:49 pbs systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.
Sep 15 14:20:49 pbs systemd[1]: Failed to start Import ZFS pools by cache file.

I tried smartctl to see if there was any physical problem, but as the disk is attached through an ASMedia USB bridge I can't see the SMART info.

If anybody has an idea on how to get my data back, that would be wonderful. If it's impossible, I think I will destroy the partitions on the disk using fdisk and redo it from scratch, but that means redoing all my backups.
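One thing I haven't tried yet, and might attempt before wiping the disk, is a read-only import so nothing gets written to the pool. This is only a sketch based on the zpool-import man page, reusing the same by-id path and mount point as above:

Code:
# attempt a read-only import so no data is written to the pool
zpool import -o readonly=on -d /dev/disk/by-id -R /mnt/datastore/BckDsk1T BckDsk1T
# if that works, check the pool and copy the backups off immediately
zpool status BckDsk1T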

Thanks for your help

Regards

V.
 
I managed to get a SMART diagnosis from the PVE server:

Code:
root@pve:~# smartctl -a -d sat /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.128-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue Mobile (SMR)
Device Model:     WDC WD10SPZX-00Z10T0
Serial Number:    WD-WX61A38NLPCH
LU WWN Device Id: 5 0014ee 65dee7b10
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Sep 17 05:21:11 2021 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (11280) seconds.
Offline data collection
capabilities:                    (0x71) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 182) minutes.
Conveyance self-test routine
recommended polling time:        (   3) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   195   192   021    Pre-fail  Always       -       1216
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4229
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       22
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       8
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       4558
194 Temperature_Celsius     0x0022   112   100   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Seems the disk is fine.
 
Your drive uses SMR (shingled magnetic recording). SMR HDDs shouldn't be used with ZFS: once the drive's cache fills up, writes get horribly slow, and the latency becomes so bad that ZFS thinks the drive is dead because it isn't answering in time. ZFS then treats those missing answers as I/O errors and your pool degrades. I've seen average latencies go from milliseconds to seconds or even MINUTES when writing big chunks of data to SMR HDDs.
So first, you need CMR HDDs instead of SMR. Second, ZFS doesn't make much sense with only one drive. You would need at least a second drive to create a mirror so you have some redundancy. Without redundancy, ZFS won't help you if the drive dies, and even if the drive doesn't die but some data degrades, ZFS won't be able to repair that degraded data. It will work with one drive, but you are missing most of the great features ZFS offers to ensure data integrity. And many ZFS features like encryption, compression, deduplication and checksumming aren't really needed anyway, because PBS already does this on the application layer, so you should disable them for ZFS in any case.
And ZFS is only as reliable as the hardware it is running on. A USB-to-SATA controller isn't really great if you want your storage to be dependable.
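If you later rebuild this on two CMR disks, the idea would be something like the sketch below. The pool name and the by-id paths are just placeholders for your real disks, and the property settings only reflect the point above that PBS already compresses and deduplicates on the application layer:

Code:
# create a mirrored pool from two CMR disks (replace the by-id paths with your own)
zpool create -o ashift=12 backup mirror \
    /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B
# PBS already compresses and deduplicates its chunks, so leave these off for the datastore dataset
zfs create -o compression=off -o dedup=off backup/pbs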
 

I knew most of what you're telling me. The only thing I wasn't aware of was SMR vs CMR drives. This drive was simply the only one I had on hand when I built the pool, and I didn't expect much of it, just that it would be as reliable as a standard ext4 drive, i.e. that smartctl would warn me of imminent failure and I'd replace it the day that happens.

Many thanks for this information, but you're not telling me whether there's any way to avoid these I/O errors if they're due to some kind of timeout. If this is a timeout, there must be a way to increase the timeout limit, no?

Lastly, if you're positive this pool is dead, I will break it by deleting the partitions using fdisk and redoing it from scratch.
If I understand correctly, your advice is either to buy two CMR drives or to make it a single ext4 drive (in that case I'll keep an offsite copy to be more reliable).
 
Many thanks for this information, but you're not telling me whether there's any way to avoid these I/O errors if they're due to some kind of timeout. If this is a timeout, there must be a way to increase the timeout limit, no?
As far as I know there is no fix. SMR drives are just not meant for write-intensive workloads and shouldn't be used with ZFS. It's simply bad hardware by design. It's similar to SLC vs QLC for SSDs, where you sacrifice latency and write performance for more capacity so the drives can be manufactured more cheaply.
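For completeness, there are a couple of timeout-related knobs you could look at, but raising them only hides the symptom and doesn't change how slow the SMR area is once the CMR cache is full. The device name below is just an example:

Code:
# per-device SCSI command timeout in seconds (default is usually 30)
cat /sys/block/sdb/device/timeout
echo 180 > /sys/block/sdb/device/timeout
# ZFS "deadman" thresholds (milliseconds) after which hung I/O is reported
cat /sys/module/zfs/parameters/zfs_deadman_ziotime_ms
cat /sys/module/zfs/parameters/zfs_deadman_synctime_ms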
If I understand correctly, your advice is either to buy two CMR drives or to make it a single ext4 drive (in that case I'll keep an offsite copy to be more reliable).
I would go with ext4, because ZFS doesn't really offer a great benefit when used as a PBS datastore. And an additional offsite backup is always a good idea.
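If you go that route, a rough sketch of setting the disk up as an ext4 datastore could look like this (using sgdisk rather than fdisk). The partition, mount point and datastore name are placeholders, so adjust them to your setup and double-check against the PBS docs before wiping anything:

Code:
# WARNING: this destroys everything on /dev/sdb
sgdisk --zap-all /dev/sdb          # remove the old ZFS partition table
sgdisk -n 1:0:0 /dev/sdb           # one partition spanning the whole disk
mkfs.ext4 /dev/sdb1
mkdir -p /mnt/datastore/BckDsk1T
mount /dev/sdb1 /mnt/datastore/BckDsk1T
# register the mount point as a PBS datastore
proxmox-backup-manager datastore create BckDsk1T /mnt/datastore/BckDsk1T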
 
Also, wouldn't the background activity of SMR drives be risky in terms of power loss as well?

When you write data to an SMR drive, it first lands in a fast CMR area (much like TLC/QLC SSDs with their SLC cache), and later, when the drive is under low load, it moves that data to the SMR areas. I assume that during this kind of activity you are extra vulnerable to a power loss.
 
I don't know, but I would guess it only deletes the CMR copy after the data has been written to the SMR area, so there is always one valid copy even if the RAM cache is lost on a power outage. But exactly that is the problem: the CMR area isn't that big, and as soon as it is full you can only write directly to the SMR area, where you only get a few KB/s of write performance and everything becomes unusably slow with horrible latencies. SMR is really only meant for light office use, where you store small amounts of data at a time with long idle intervals between writes, so the firmware has time to free up the CMR area again and slowly move data to the SMR area in the background. With a PBS datastore, on the other hand, you want to store dozens or hundreds of GB as fast as possible, in really big bursts.
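If you want to see this effect for yourself, a sustained sequential write test usually shows it: throughput looks fine for the first tens of GB and then collapses once the CMR cache is exhausted. This is just a sketch with fio against a scratch file; the path and sizes are arbitrary, and don't run it against data you care about:

Code:
# sustained sequential write; watch the bandwidth drop once the drive's CMR cache fills
fio --name=smr-write-test --filename=/mnt/datastore/BckDsk1T/fio-test.bin \
    --rw=write --bs=1M --size=100G --ioengine=libaio --direct=1 --iodepth=4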
 
