[SOLVED] ZFS Pool lost after disk failure

citgot
OK, I know this is a long read, but I think the background is worth knowing.

I'm running Proxmox with a few VMs. Proxmox itself runs on an NVMe disk and the VMs run from a ZFS pool consisting of 2 x 3TB mirrored disks. I also have a 1TB disk that one of the VMs uses for storage ("BlueIrisData"). As the 1TB disk showed SMART errors, I decided to replace it with a new 1TB disk. I mounted the new disk, edited fstab, partitioned/formatted/labeled the disk (same as before), made it available as storage in the GUI and assigned it as a disk to the VM. I then rebooted, and that is where the problem started.

When I boot I get plenty of error messages and Proxmox enters emergency mode. One of the disks in the ZFS pool sounds horrible (loud ticking noises), and the only way to get out of emergency mode is to comment out the fstab entry for the new 1TB disk. But then I have no ZFS pool ("ZFSDrives") and no "BlueIrisData" drive. Proxmox can't find them in the GUI and times out looking for them.
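(For reference, the fstab line for the new disk looked roughly like the one below, except without the nofail option; the UUID, mount point and filesystem here are placeholders, not my actual values. Without nofail, a disk that fails to mount at boot drops systemd into emergency mode.)

# /etc/fstab entry for the new 1TB disk (UUID, mount point and fs type are placeholders)
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /mnt/BlueIrisData  ext4  defaults,nofail  0  2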

These are my disks and partitions.

root@proxmox:~# lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                    8:0    0 931.5G  0 disk
└─sda1                 8:1    0 931.5G  0 part
sdb                    8:16   0   2.7T  0 disk
├─sdb1                 8:17   0   2.7T  0 part
└─sdb9                 8:25   0     8M  0 part
sdc                    8:32   0   2.7T  0 disk
├─sdc1                 8:33   0   2.7T  0 part
└─sdc9                 8:41   0     8M  0 part
sr0                   11:0    1  1024M  0 rom
nvme0n1              259:0    0 119.2G  0 disk
├─nvme0n1p1          259:1    0  1007K  0 part
├─nvme0n1p2          259:2    0   512M  0 part /boot/efi
└─nvme0n1p3          259:3    0 118.7G  0 part
  ├─pve-swap         253:0    0     8G  0 lvm  [SWAP]
  ├─pve-root         253:1    0  29.5G  0 lvm  /
  ├─pve-data_tmeta   253:2    0     1G  0 lvm
  │ └─pve-data-tpool 253:4    0  64.5G  0 lvm
  │   └─pve-data     253:5    0  64.5G  1 lvm
  └─pve-data_tdata   253:3    0  64.5G  0 lvm
    └─pve-data-tpool 253:4    0  64.5G  0 lvm
      └─pve-data     253:5    0  64.5G  1 lvm

sda1 can be mounted manually from the CLI. The ZFS pool, however, is lost.

zpool import results in the following
root@proxmox:~# zpool import
no pools available to import

zpool import ZFSDrives renders, after a minute's wait:
root@proxmox:~# zpool import ZFSDrives
cannot import 'ZFSDrives': one or more devices is currently unavailable

I also tested root@proxmox:~# zpool import ZFSDrives -f with the same result

zpool status -v was next
root@proxmox:~# zpool status -v
no pools available


zdb -l /dev/sdb1 renders
root@proxmox:~# zdb -l /dev/sdb1
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'ZFSDrives'
    state: 0
    txg: 3240347
    pool_guid: 171953915263981592
    errata: 0
    hostid: 3243471785
    hostname: 'proxmox'
    top_guid: 10039642933645888486
    guid: 13533403846860154412
    vdev_children: 1
    vdev_tree:
        type: 'mirror'
        id: 0
        guid: 10039642933645888486
        metaslab_array: 132
        metaslab_shift: 34
        ashift: 12
        asize: 3000578342912
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 13533403846860154412
            path: '/dev/disk/by-id/ata-ST3000VN000-1H4167_Z300L7WE-part1'
            devid: 'ata-ST3000VN000-1H4167_Z300L7WE-part1'
            phys_path: 'pci-0000:00:17.0-ata-2.0'
            whole_disk: 1
            DTL: 788
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 5455137475863193922
            path: '/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0E5L7P3-part1'
            devid: 'ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0E5L7P3-part1'
            phys_path: 'pci-0000:00:17.0-ata-4.0'
            whole_disk: 1
            DTL: 446
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
    labels = 0 1 2 3
 
journalctl -xe renders a lot of info (I've copied and pasted the lines regarding the disks and ZFS):

May 31 21:10:18 proxmox udevadm[455]: systemd-udev-settle.service is deprecated. Please fix zfs-import-cache.service, zfs-import-scan.service not to pull it in.
May 31 21:10:19 proxmox systemd[1]: Starting Import ZFS pools by cache file...
May 31 21:10:19 proxmox systemd[1]: Condition check resulted in Import ZFS pools by device scanning being skipped.
May 31 21:10:19 proxmox systemd[1]: Starting Import ZFS pool ZFSDrives...
May 31 21:10:19 proxmox zpool[713]: cannot import 'ZFSDrives': no such pool available
May 31 21:10:19 proxmox systemd[1]: zfs-import@ZFSDrives.service: Main process exited, code=exited, status=1/FAILURE
May 31 21:10:19 proxmox systemd[1]: zfs-import@ZFSDrives.service: Failed with result 'exit-code'.
░░ The unit zfs-import@ZFSDrives.service has entered the 'failed' state with result 'exit-code'.
May 31 21:10:19 proxmox systemd[1]: Failed to start Import ZFS pool ZFSDrives.
░░ Subject: A start job for unit zfs-import@ZFSDrives.service has failed
░░ Defined-By: systemd
May 31 21:10:26 proxmox kernel: ata2.00: exception Emask 0x0 SAct 0xc008 SErr 0x0 action 0x0
May 31 21:10:26 proxmox kernel: ata2.00: irq_stat 0x40000008
May 31 21:10:26 proxmox kernel: ata2.00: failed command: READ FPDMA QUEUED
May 31 21:10:26 proxmox kernel: ata2.00: cmd 60/00:18:30:ac:20/01:00:4a:01:00/40 tag 3 ncq dma 131072 in res 41/40:00:30:ac:20/00:01:4a:01:00/00 Emask 0x409 (media error) <F>
May 31 21:10:26 proxmox kernel: ata2.00: status: { DRDY ERR }
May 31 21:10:26 proxmox kernel: ata2.00: error: { UNC }
May 31 21:10:26 proxmox kernel: ata2.00: configured for UDMA/133
May 31 21:10:26 proxmox kernel: sd 1:0:0:0: [sdb] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=4s
May 31 21:10:26 proxmox kernel: sd 1:0:0:0: [sdb] tag#3 Sense Key : Medium Error [current]
May 31 21:10:26 proxmox kernel: sd 1:0:0:0: [sdb] tag#3 Add. Sense: Unrecovered read error - auto reallocate failed
May 31 21:10:26 proxmox kernel: sd 1:0:0:0: [sdb] tag#3 CDB: Read(16) 88 00 00 00 00 01 4a 20 ac 30 00 00 01 00 00 00
May 31 21:10:26 proxmox kernel: blk_update_request: I/O error, dev sdb, sector 5538622512 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 0
May 31 21:10:26 proxmox kernel: zio pool=ZFSDrives vdev=/dev/disk/by-id/ata-ST3000VN000-1H4167_Z300L7WE-part1 error=5 type=1 offset=2835773677568 size=131>
May 31 21:10:26 proxmox kernel: ata2: EH complete
May 31 21:11:54 proxmox systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.
May 31 21:11:54 proxmox systemd[1]: Failed to start Import ZFS pools by cache file.
May 31 21:11:54 proxmox zvol_wait[1491]: No zvols found, nothing to do.
May 31 21:11:54 proxmox systemd[1]: Finished Mount ZFS filesystems.
May 31 21:11:54 proxmox systemd[1]: Finished Wait for ZFS Volume (zvol) links in /dev.
May 31 21:11:54 proxmox systemd[1]: Reached target ZFS volumes are ready.
May 31 21:11:54 proxmox systemd[1]: Starting ZFS file system shares...
May 31 21:11:54 proxmox systemd[1]: Started ZFS Event Daemon (zed).
May 31 21:11:54 proxmox systemd[1]: Finished ZFS file system shares.
May 31 21:11:54 proxmox systemd[1]: Reached target ZFS startup target.
May 31 21:11:54 proxmox watchdog-mux[1548]: Watchdog driver 'Software Watchdog', version 0
May 31 21:11:54 proxmox zed[1552]: ZFS Event Daemon 2.1.4-pve1 (PID 1552)
May 31 21:11:54 proxmox kernel: softdog: initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
May 31 21:11:54 proxmox kernel: softdog: soft_reboot_cmd=<not set> soft_active_on_boot=0
May 31 21:11:54 proxmox dbus-daemon[1533]: [system] AppArmor D-Bus mediation is enabled
May 31 21:11:54 proxmox zed[1552]: Processing events since eid=0
May 31 21:11:54 proxmox zed[1581]: eid=4 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=4096 offset=1>
May 31 21:11:54 proxmox zed[1592]: eid=5 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=4096 offset=9>
May 31 21:11:54 proxmox rsyslogd[1541]: imuxsock: Acquired UNIX socket '/run/systemd/journal/syslog' (fd 3) from systemd. [v8.2102.0]
May 31 21:11:54 proxmox rsyslogd[1541]: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="1541" x-info="https://www.rsyslog.com"] start
May 31 21:11:54 proxmox systemd[1]: Started System Logging Service.
May 31 21:11:54 proxmox zed[1603]: eid=9 class=log_replay pool='ZFSDrives'
May 31 21:11:54 proxmox zed[1602]: eid=11 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=4096 offset=>
May 31 21:11:54 proxmox zed[1594]: eid=7 class=io pool='ZFSDrives' size=131072 offset=2835769483264 priority=0 err=5 flags=0x100190 bookmark=387:0:-2:7413
May 31 21:11:54 proxmox zed[1612]: eid=6 class=io pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 size=131072 offset=2835773677568 priority=0 >
May 31 21:11:54 proxmox zed[1587]: eid=3 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=12288 offset=>
May 31 21:11:54 proxmox zed[1616]: eid=15 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=4096 offset=>
May 31 21:11:54 proxmox zed[1618]: eid=10 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=4096 offset=>
May 31 21:11:54 proxmox zed[1619]: eid=13 class=data pool='ZFSDrives' priority=0 err=5 flags=0x1808991 bookmark=387:0:-2:7413
May 31 21:11:54 proxmox zed[1614]: eid=2 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=4096 offset=2>
May 31 21:11:54 proxmox zed[1620]: eid=14 class=log_replay pool='ZFSDrives'
May 31 21:11:54 proxmox zed[1621]: eid=16 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=4096 offset=>
May 31 21:11:54 proxmox zed[1613]: eid=8 class=data pool='ZFSDrives' priority=0 err=5 flags=0x1808991 bookmark=387:0:-2:7413
May 31 21:11:54 proxmox zed[1611]: eid=12 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=12288 offset>
May 31 21:11:54 proxmox zed[1605]: eid=1 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=4096 offset=8>
May 31 21:11:54 proxmox smartd[1544]: smartd 7.2 2020-12-30 r5155 [x86_64-linux-5.13.19-6-pve] (local build)
May 31 21:11:54 proxmox smartd[1544]: Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
May 31 21:11:54 proxmox smartd[1544]: Opened configuration file /etc/smartd.conf
May 31 21:11:54 proxmox smartd[1544]: Drive: DEVICESCAN, implied '-a' Directive on line 21 of file /etc/smartd.conf
May 31 21:11:54 proxmox smartd[1544]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
May 31 21:11:54 proxmox systemd-logind[1545]: New seat seat0.
May 31 21:11:54 proxmox zed[1657]: eid=17 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=12288 offset>
May 31 21:11:54 proxmox zed[1658]: eid=18 class=data pool='ZFSDrives' priority=0 err=5 flags=0x1808991 bookmark=387:0:-2:7413
May 31 21:11:54 proxmox zed[1660]: eid=19 class=log_replay pool='ZFSDrives'
May 31 21:11:54 proxmox zed[1667]: eid=21 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=4096 offset=>
May 31 21:11:54 proxmox zed[1668]: eid=20 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=4096 offset=>
May 31 21:11:54 proxmox zed[1672]: eid=22 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=12288 offset>
May 31 21:11:54 proxmox zed[1675]: eid=23 class=data pool='ZFSDrives' priority=0 err=5 flags=0x1808991 bookmark=387:0:-2:7413
May 31 21:11:54 proxmox zed[1677]: eid=24 class=log_replay pool='ZFSDrives'
May 31 21:11:54 proxmox zed[1683]: eid=25 class=data pool='ZFSDrives' priority=0 err=5 flags=0x1808991 bookmark=387:0:-2:7413
May 31 21:11:54 proxmox zed[1687]: eid=26 class=log_replay pool='ZFSDrives'
May 31 21:11:54 proxmox zed[1686]: eid=28 class=log_replay pool='ZFSDrives'
May 31 21:11:54 proxmox zed[1692]: eid=27 class=data pool='ZFSDrives' priority=0 err=5 flags=0x1808991 bookmark=387:0:-2:7413
May 31 21:11:54 proxmox zed[1693]: eid=30 class=log_replay pool='ZFSDrives'
May 31 21:11:54 proxmox zed[1695]: eid=29 class=data pool='ZFSDrives' priority=0 err=5 flags=0x1808991 bookmark=387:0:-2:7413
May 31 21:11:54 proxmox zed[1694]: eid=32 class=log_replay pool='ZFSDrives'
May 31 21:11:54 proxmox zed[1697]: eid=31 class=data pool='ZFSDrives' priority=0 err=5 flags=0x1808991 bookmark=387:0:-2:7413
May 31 21:11:54 proxmox smartd[1544]: Device: /dev/sda [SAT], is SMART capable. Adding to "monitor" list.
May 31 21:11:54 proxmox smartd[1544]: Device: /dev/sda [SAT], state read from /var/lib/smartmontools/smartd.ST1000LX015_1U7172-WDEWWJ80.ata.state
May 31 21:11:54 proxmox smartd[1544]: Device: /dev/sdb, type changed from 'scsi' to 'sat'
May 31 21:11:54 proxmox smartd[1544]: Device: /dev/sdb [SAT], opened
May 31 21:11:54 proxmox smartd[1544]: Device: /dev/sdb [SAT], ST3000VN000-1H4167, S/N:Z300L7WE, WWN:5-000c50-063c94a8e, FW:SC42, 3.00 TB
May 31 21:11:54 proxmox smartd[1544]: Device: /dev/sdb [SAT], found in smartd database: Seagate NAS HDD
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdb [SAT], is SMART capable. Adding to "monitor" list.
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdb [SAT], state read from /var/lib/smartmontools/smartd.ST3000VN000_1H4167-Z300L7WE.ata.state
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdc, type changed from 'scsi' to 'sat'
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdc [SAT], opened
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdc [SAT], WDC WD30EFRX-68EUZN0, S/N:WD-WMC4N0E5L7P3, WWN:5-0014ee-604fc1cea, FW:82.00A82, 3.00 TB
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdc [SAT], found in smartd database: Western Digital Red
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdc [SAT], is SMART capable. Adding to "monitor" list.
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdc [SAT], state read from /var/lib/smartmontools/smartd.WDC_WD30EFRX_68EUZN0-WD_WMC4N0E5L7P3.ata.state
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/nvme0, opened
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/nvme0, GIGABYTE GP-GSM2NE3128GNTD, S/N:SN210408943744, FW:EDFMB0.5, 128 GB
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/nvme0, is SMART capable. Adding to "monitor" list.
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/nvme0, state read from /var/lib/smartmontools/smartd.GIGABYTE_GP_GSM2NE3128GNTD-SN210408943744.nvme.sta>
May 31 21:11:55 proxmox smartd[1544]: Monitoring 3 ATA/SATA, 0 SCSI/SAS and 1 NVMe devices
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sda [SAT], 200 Currently unreadable (pending) sectors
May 31 21:11:55 proxmox smartd[1544]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
May 31 21:11:55 proxmox smartd[1544]: Warning via /usr/share/smartmontools/smartd-runner to root: successful
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sda [SAT], 200 Offline uncorrectable sectors
May 31 21:11:55 proxmox smartd[1544]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
May 31 21:11:55 proxmox smartd[1544]: Warning via /usr/share/smartmontools/smartd-runner to root: successful
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 50 to 66
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sda [SAT], Self-Test Log error count increased from 0 to 1
May 31 21:11:55 proxmox smartd[1544]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
May 31 21:11:55 proxmox smartd[1544]: Warning via /usr/share/smartmontools/smartd-runner to root: successful
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdb [SAT], 34928 Currently unreadable (pending) sectors
May 31 21:11:55 proxmox smartd[1544]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdb [SAT], 34928 Offline uncorrectable sectors
May 31 21:11:55 proxmox smartd[1544]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 77 to 76
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 91 to 96
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 7 Seek_Error_Rate changed from 52 to 50
May 31 21:11:55 proxmox postfix/qmgr[2010]: 91B62120F39: from=<root@proxmox.lan>, size=1010, nrcpt=1 (queue active)
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdb [SAT], Self-Test Log error count increased from 2 to 5
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdb [SAT], ATA error count increased from 2204 to 2578
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/nvme0, number of Error Log entries increased from 19 to 23

When testing the health of the sda drive in the BIOS it shows up as faulty, and the sound it makes backs that up. So I guess I have to replace that disk. But how can I get a new disk into the pool if there is no pool to import it to? Any suggestions on what to do next?
 
It does look like /dev/sdb has an issue, but you should still have a working member of the pool in /dev/sdc...

what does the output of zdb -l /dev/sdc1 look like?

try mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.backup and reboot, then try zpool import again
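Roughly, the full sequence would be something like this (assuming the cache file is in the default location):

mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.backup   # set the stale cache aside
reboot
# after the reboot, let ZFS scan the disks instead of trusting the cache:
zpool import              # should list ZFSDrives if the labels are readable
zpool import ZFSDrives    # then import it by name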
 
It does look like /dev/sdb has an issue, but you should still have a working member of the pool in /dev/sdc...

what does the output of zdb -l /dev/sdc1 look like?

try mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.backup and reboot, then try zpool import again
Thanks for your time

After the reboot, zpool import shows the same as before:
root@proxmox:~# zpool import
no pools available to import

zdb -l /dev/sdc1 results in
root@proxmox:~# zdb -l /dev/sdc1
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'ZFSDrives'
    state: 0
    txg: 3240393
    pool_guid: 171953915263981592
    errata: 0
    hostid: 3243471785
    hostname: 'proxmox'
    top_guid: 10039642933645888486
    guid: 5455137475863193922
    vdev_children: 1
    vdev_tree:
        type: 'mirror'
        id: 0
        guid: 10039642933645888486
        metaslab_array: 132
        metaslab_shift: 34
        ashift: 12
        asize: 3000578342912
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 13533403846860154412
            path: '/dev/disk/by-id/ata-ST3000VN000-1H4167_Z300L7WE-part1'
            devid: 'ata-ST3000VN000-1H4167_Z300L7WE-part1'
            phys_path: 'pci-0000:00:17.0-ata-2.0'
            whole_disk: 1
            DTL: 788
            create_txg: 4
            faulted: 1
        children[1]:
            type: 'disk'
            id: 1
            guid: 5455137475863193922
            path: '/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0E5L7P3-part1'
            devid: 'ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0E5L7P3-part1'
            phys_path: 'pci-0000:00:17.0-ata-4.0'
            whole_disk: 1
            DTL: 446
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
    labels = 0 1 2 3

It is strange how the pool just disappeared. Any more ideas to try?
 
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sda [SAT], 200 Currently unreadable (pending) sectors
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdb [SAT], 34928 Currently unreadable (pending) sectors
and some other errors for both drives in the log:
* check/replace the cables
* check/replace the hba/controller
* check/replace the power-supply

and I'd agree with @bobmc - try plugging the drives in another system with ZFS and try importing the pool read-only
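On the other system, the read-only import attempt would look something like this (add -f only if zpool complains that the pool was last used by another system):

zpool import -d /dev/disk/by-id -o readonly=on ZFSDrives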
 
I agree, it is strange. At this stage I'd be inclined to try the drive in a new system or under a live cd boot - e.g https://openzfs.github.io/openzfs-docs/Getting Started/Debian/Debian Buster Root on ZFS.html and see if the pool is recognised there
Thanks, a great idea. I tested on the server itself using a live USB, as I have no other systems that can take 3.5" HDDs.

lsblk

$ lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0                  7:0    0   1.7G  1 loop /live/linux
loop1                  7:1    0   1.6G  0 loop /home
sda                    8:0    0 931.5G  0 disk
└─sda1                 8:1    0 931.5G  0 part
sdb                    8:16   0   2.7T  0 disk
├─sdb1                 8:17   0   2.7T  0 part
└─sdb9                 8:25   0     8M  0 part
sdc                    8:32   0   2.7T  0 disk
├─sdc1                 8:33   0   2.7T  0 part
└─sdc9                 8:41   0     8M  0 part
sdd                    8:48   1  14.5G  0 disk
├─sdd1                 8:49   1  14.4G  0 part /live/boot-dev
└─sdd2                 8:50   1    49M  0 part
sr0                   11:0    1  1024M  0 rom
nvme0n1              259:0    0 119.2G  0 disk
├─nvme0n1p1          259:1    0  1007K  0 part
├─nvme0n1p2          259:2    0   512M  0 part
└─nvme0n1p3          259:3    0 118.7G  0 part
  ├─pve-swap         254:0    0     8G  0 lvm
  ├─pve-root         254:1    0  29.5G  0 lvm
  ├─pve-data_tmeta   254:2    0     1G  0 lvm
  │ └─pve-data-tpool 254:4    0  64.5G  0 lvm
  │   └─pve-data     254:5    0  64.5G  1 lvm
  └─pve-data_tdata   254:3    0  64.5G  0 lvm
    └─pve-data-tpool 254:4    0  64.5G  0 lvm
      └─pve-data     254:5    0  64.5G  1 lvm

So I looked for the pool

$ zpool status
no pools available

So I try to import in readonly mode

$ sudo zpool import -o readonly=on ZFSDrives
cannot import 'ZFSDrives': pool was previously in use from another system.
Last accessed by proxmox (hostid=c15373a9) at Mon May 30 20:21:25 2022
The pool can be imported, use 'zpool import -f' to import the pool.

So it looks like the pool is there!
What can I take from this revelation? How would I go about to get it back to proxmox? What would my next step be?
 
and some other errors for both drives in the log:
* check/replace the cables
* check/replace the hba/controller
* check/replace the power-supply

and I'd agree with @bobmc - try plugging the drives in another system with ZFS and try importing the pool read-only
Thanks

I have switched the cables and the problem stays with the same disk regardless. All cables are firmly seated. I'm thinking about the health of the PSU, but HP workstations usually have rather bulletproof PSUs. Still, something is making my Proxmox system fail at boot with disk errors. Maybe I have to buy another PSU to check whether that sorts everything out.
 
Good news then - did you try zpool import -f and then zpool export ZFSDrives ?

If the drive has logged errors due to a bad cable - usually CRC errors rather than uncorrectable sectors for cable faults, then the errors remain in the log and the drive's smart status will never be clear again. So swapping the cables around will not always be conclusive.

In any case, for peace of mind, I would be looking at backing up my data asap and replacing /dev/sdb
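Spelled out, the import/export idea above is roughly the following, run on whichever system can currently see the pool:

zpool import -f ZFSDrives    # force the import despite the "last used by another system" warning
zpool export ZFSDrives       # cleanly export it, so the original host can import it without -f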
 
Good news then - did you try zpool import -f and then zpool export ZFSDrives ?

If the drive has logged errors due to a bad cable - usually CRC errors rather than uncorrectable sectors for cable faults, then the errors remain in the log and the drive's smart status will never be clear again. So swapping the cables around will not always be conclusive.

In any case, for peace of mind, I would be looking at backing up my data asap and replacing /dev/sdb

Yes, and I'm afraid it won't work. So, some good news and some bad.

$ sudo zpool import -f
   pool: ZFSDrives
     id: 171953915263981592
  state: ONLINE
 status: One or more devices were being resilvered.
 action: The pool can be imported using its name or numeric identifier.
 config:

        ZFSDrives                                     ONLINE
          mirror-0                                    ONLINE
            ata-ST3000VN000-1H4167_Z300L7WE           ONLINE
            ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0E5L7P3  ONLINE

and...

$ sudo zpool import -f ZFSDrives
cannot import 'ZFSDrives': one or more devices is currently unavailable

At least the live USB system can see the pool. But if I can't import it, I can't replace the faulty disk either. Or will zpool be able to import the pool if I disconnect the sdb disk? My replacement disk is incoming, so I have a day or two to prepare.

I also must find the reason why Proxmox won't boot properly. I don't want this to happen again.
 
Went at it with fresh eyes this morning and realised that I tried to mount the pool in read-write mode yesterday. So I tried again with the live USB system.

$ sudo zpool import -f -o readonly=on ZFSDrives
returned no output. Good news! So I checked the status of the pool:

$ zpool status
  pool: ZFSDrives
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon May 30 19:52:52 2022
        0B scanned at 0B/s, 0B issued at 0B/s, 184G total
        0B resilvered, 0.00% done, no estimated completion time
config:

        NAME                                          STATE     READ WRITE CKSUM
        ZFSDrives                                     ONLINE       0     0     0
          mirror-0                                    ONLINE       0     0     0
            ata-ST3000VN000-1H4167_Z300L7WE           ONLINE       2     0     0
            ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0E5L7P3  ONLINE       0     0     0

errors: No known data errors

OK, so a resilver has been running since the disk failure. Ten minutes later...

$ zpool status
  pool: ZFSDrives
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon May 30 19:52:52 2022
        0B scanned at 0B/s, 0B issued at 0B/s, 184G total
        0B resilvered, 0.00% done, no estimated completion time
config:

        NAME                                          STATE     READ WRITE CKSUM
        ZFSDrives                                     ONLINE       0     0     0
          mirror-0                                    ONLINE       0     0     0
            ata-ST3000VN000-1H4167_Z300L7WE           ONLINE       2     0     0
            ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0E5L7P3  ONLINE       0     0     0

errors: No known data errors

I tried looking at zpool events:

$ sudo zpool events
TIME                           CLASS
Jun  2 2022 09:29:46.865208419 ereport.fs.zfs.checksum
Jun  2 2022 09:29:46.865208419 ereport.fs.zfs.checksum
Jun  2 2022 09:29:46.869208402 ereport.fs.zfs.checksum
Jun  2 2022 09:29:46.885208327 ereport.fs.zfs.checksum
Jun  2 2022 09:29:46.901208255 ereport.fs.zfs.checksum
Jun  2 2022 09:29:55.329169793 ereport.fs.zfs.io
Jun  2 2022 09:29:55.329169793 ereport.fs.zfs.io
Jun  2 2022 09:30:00.497146253 ereport.fs.zfs.data
Jun  2 2022 09:30:00.525146123 ereport.fs.zfs.log_replay
Jun  2 2022 09:30:00.769145008 ereport.fs.zfs.checksum
Jun  2 2022 09:30:00.789144919 ereport.fs.zfs.checksum
Jun  2 2022 09:30:00.797144884 ereport.fs.zfs.checksum
Jun  2 2022 09:30:14.317083442 ereport.fs.zfs.data
Jun  2 2022 09:30:14.317083442 ereport.fs.zfs.log_replay
Jun  2 2022 09:30:14.649081941 ereport.fs.zfs.checksum
Jun  2 2022 09:30:14.665081866 ereport.fs.zfs.checksum
Jun  2 2022 09:30:14.697081722 ereport.fs.zfs.checksum
Jun  2 2022 09:30:28.193020615 ereport.fs.zfs.data
Jun  2 2022 09:30:28.193020615 ereport.fs.zfs.log_replay
Jun  2 2022 09:30:28.513019167 ereport.fs.zfs.checksum
Jun  2 2022 09:30:28.553018986 ereport.fs.zfs.checksum
Jun  2 2022 09:30:28.565018932 ereport.fs.zfs.checksum
Jun  2 2022 09:30:42.532955914 ereport.fs.zfs.data
Jun  2 2022 09:30:42.532955914 ereport.fs.zfs.log_replay
Jun  2 2022 09:32:47.544371894 ereport.fs.zfs.io
Jun  2 2022 09:32:51.412352375 ereport.fs.zfs.io
Jun  2 2022 09:32:51.520351829 sysevent.fs.zfs.pool_import

Doesn't tell me anything more than that there are errors.

zpool events -v is attached in the file below

Should I wait or try to export or something completely different? I don't want to create more data loss at this moment.
 

Attachments

  • events.txt (46.9 KB)
I would wait for the rebuild, but as soon as a replacement drive is available I would be replacing /dev/sdb. I would also do a zpool export followed by a zpool import -d /dev/disk/by-id ZFSDrives, as this makes the pool more tolerant of drive letters being reassigned by the system. Make a note of the drive serial numbers and the matching ids, which you can query using smartctl -a /dev/sdb, so you know which is which should you need to swap a drive again.
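In terms of commands, that would be roughly:

zpool export ZFSDrives
zpool import -d /dev/disk/by-id ZFSDrives
smartctl -a /dev/sdb    # note the model and serial number and match it to the by-id name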
 
Thanks. I'll wait and see if the new drive solves it all. I managed to import the pool on the Proxmox system this afternoon; however, after the next reboot it wouldn't import again.

At least I solved most of the other boot failures. Proxmox referenced a directory from the old broken sda disk and threw errors on boot. When I deleted the corresponding .mount file in /etc/systemd/system and rebooted, all errors were gone except the ones from the sdb drive. Hopefully the new disk will get rid of them.
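(For anyone finding this later: the unit name below is just an example of the kind of file I removed; the actual name will match whatever mount point referenced the dead disk.)

rm /etc/systemd/system/mnt-blueirisdata.mount   # example unit name only
systemctl daemon-reload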

Thanks @bobmc for all the help.
 
Got the new disk, and Proxmox did everything for me except the replace disk command, which was the only thing I had to do myself. That worked like a charm and now almost everything is back to normal. It looks like there was no data loss, which is great! Case closed!
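For reference, the replace step is a single command along these lines (the old-disk id is the one from earlier in the thread; the new-disk id is a placeholder):

zpool replace ZFSDrives ata-ST3000VN000-1H4167_Z300L7WE /dev/disk/by-id/ata-NEWDISK-SERIAL
zpool status ZFSDrives    # watch the resilver onto the new disk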
 
