[SOLVED] ZFS Pool lost after disk failure

citgot
OK, I know this is a long read, but I think the background is worth knowing.

I'm running Proxmox with a few VMs. Proxmox itself runs on an NVMe disk and the VMs run from a ZFS pool consisting of 2 x 3TB mirrored disks. I also have a 1TB disk that one of the VMs uses for storage ("BlueIrisData"). As the 1TB disk showed SMART errors, I decided to replace it with a new 1TB disk. I mounted the new disk, edited fstab, partitioned/formatted/labeled the disk (same as before), made it available as storage in the GUI and assigned it as a disk to the VM. I then rebooted, and that is where the problem started.

When I boot I get plenty of error messages and Proxmox enters emergency mode. One of the disks in the ZFS pool sounds horrible (loud ticking noises), and the only way to get out of emergency mode is to comment out the fstab entry for the new 1TB disk. But then I have no ZFS pool ("ZFSDrives") and no "BlueIrisData" drive. Proxmox can't find them in the GUI and times out looking for them.
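(For reference, the fstab line for the new disk looked roughly like the one below, except without the nofail option; the UUID, mount point and filesystem here are placeholders, not my actual values. Without nofail, a disk that fails to mount at boot drops systemd into emergency mode.)

# /etc/fstab entry for the new 1TB disk (UUID, mount point and fs type are placeholders)
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /mnt/BlueIrisData  ext4  defaults,nofail  0  2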

These are my disks and partitions.

root@proxmox:~# lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                    8:0    0 931.5G  0 disk
└─sda1                 8:1    0 931.5G  0 part
sdb                    8:16   0   2.7T  0 disk
├─sdb1                 8:17   0   2.7T  0 part
└─sdb9                 8:25   0     8M  0 part
sdc                    8:32   0   2.7T  0 disk
├─sdc1                 8:33   0   2.7T  0 part
└─sdc9                 8:41   0     8M  0 part
sr0                   11:0    1  1024M  0 rom
nvme0n1              259:0    0 119.2G  0 disk
├─nvme0n1p1          259:1    0  1007K  0 part
├─nvme0n1p2          259:2    0   512M  0 part /boot/efi
└─nvme0n1p3          259:3    0 118.7G  0 part
  ├─pve-swap         253:0    0     8G  0 lvm  [SWAP]
  ├─pve-root         253:1    0  29.5G  0 lvm  /
  ├─pve-data_tmeta   253:2    0     1G  0 lvm
  │ └─pve-data-tpool 253:4    0  64.5G  0 lvm
  │   └─pve-data     253:5    0  64.5G  1 lvm
  └─pve-data_tdata   253:3    0  64.5G  0 lvm
    └─pve-data-tpool 253:4    0  64.5G  0 lvm
      └─pve-data     253:5    0  64.5G  1 lvm

sda1 can be mounted manually from the CLI. The ZFS pool, however, is lost.

zpool import results in the following
root@proxmox:~# zpool import
no pools available to import

zpool import ZFSDrives renders, after a minute's wait:
root@proxmox:~# zpool import ZFSDrives
cannot import 'ZFSDrives': one or more devices is currently unavailable

I also tested root@proxmox:~# zpool import ZFSDrives -f with the same result

zpool status -v was next
root@proxmox:~# zpool status -v
no pools available


zdb -l /dev/sdb1 renders
root@proxmox:~# zdb -l /dev/sdb1
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'ZFSDrives'
    state: 0
    txg: 3240347
    pool_guid: 171953915263981592
    errata: 0
    hostid: 3243471785
    hostname: 'proxmox'
    top_guid: 10039642933645888486
    guid: 13533403846860154412
    vdev_children: 1
    vdev_tree:
        type: 'mirror'
        id: 0
        guid: 10039642933645888486
        metaslab_array: 132
        metaslab_shift: 34
        ashift: 12
        asize: 3000578342912
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 13533403846860154412
            path: '/dev/disk/by-id/ata-ST3000VN000-1H4167_Z300L7WE-part1'
            devid: 'ata-ST3000VN000-1H4167_Z300L7WE-part1'
            phys_path: 'pci-0000:00:17.0-ata-2.0'
            whole_disk: 1
            DTL: 788
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 5455137475863193922
            path: '/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0E5L7P3-part1'
            devid: 'ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0E5L7P3-part1'
            phys_path: 'pci-0000:00:17.0-ata-4.0'
            whole_disk: 1
            DTL: 446
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
    labels = 0 1 2 3
 
journalctl -xe renders a lot of info (I've copied and pasted the lines regarding the disks and ZFS):

May 31 21:10:18 proxmox udevadm[455]: systemd-udev-settle.service is deprecated. Please fix zfs-import-cache.service, zfs-import-scan.service not to pull it in.
May 31 21:10:19 proxmox systemd[1]: Starting Import ZFS pools by cache file...
May 31 21:10:19 proxmox systemd[1]: Condition check resulted in Import ZFS pools by device scanning being skipped.
May 31 21:10:19 proxmox systemd[1]: Starting Import ZFS pool ZFSDrives...
May 31 21:10:19 proxmox zpool[713]: cannot import 'ZFSDrives': no such pool available
May 31 21:10:19 proxmox systemd[1]: zfs-import@ZFSDrives.service: Main process exited, code=exited, status=1/FAILURE
May 31 21:10:19 proxmox systemd[1]: zfs-import@ZFSDrives.service: Failed with result 'exit-code'.
░░ The unit zfs-import@ZFSDrives.service has entered the 'failed' state with result 'exit-code'.
May 31 21:10:19 proxmox systemd[1]: Failed to start Import ZFS pool ZFSDrives.
░░ Subject: A start job for unit zfs-import@ZFSDrives.service has failed
░░ Defined-By: systemd
May 31 21:10:26 proxmox kernel: ata2.00: exception Emask 0x0 SAct 0xc008 SErr 0x0 action 0x0
May 31 21:10:26 proxmox kernel: ata2.00: irq_stat 0x40000008
May 31 21:10:26 proxmox kernel: ata2.00: failed command: READ FPDMA QUEUED
May 31 21:10:26 proxmox kernel: ata2.00: cmd 60/00:18:30:ac:20/01:00:4a:01:00/40 tag 3 ncq dma 131072 in res 41/40:00:30:ac:20/00:01:4a:01:00/00 Emask 0x409 (media error) <F>
May 31 21:10:26 proxmox kernel: ata2.00: status: { DRDY ERR }
May 31 21:10:26 proxmox kernel: ata2.00: error: { UNC }
May 31 21:10:26 proxmox kernel: ata2.00: configured for UDMA/133
May 31 21:10:26 proxmox kernel: sd 1:0:0:0: [sdb] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=4s
May 31 21:10:26 proxmox kernel: sd 1:0:0:0: [sdb] tag#3 Sense Key : Medium Error [current]
May 31 21:10:26 proxmox kernel: sd 1:0:0:0: [sdb] tag#3 Add. Sense: Unrecovered read error - auto reallocate failed
May 31 21:10:26 proxmox kernel: sd 1:0:0:0: [sdb] tag#3 CDB: Read(16) 88 00 00 00 00 01 4a 20 ac 30 00 00 01 00 00 00
May 31 21:10:26 proxmox kernel: blk_update_request: I/O error, dev sdb, sector 5538622512 op 0x0:(READ) flags 0x700 phys_seg 2 prio class 0
May 31 21:10:26 proxmox kernel: zio pool=ZFSDrives vdev=/dev/disk/by-id/ata-ST3000VN000-1H4167_Z300L7WE-part1 error=5 type=1 offset=2835773677568 size=131>
May 31 21:10:26 proxmox kernel: ata2: EH complete
May 31 21:11:54 proxmox systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.
May 31 21:11:54 proxmox systemd[1]: Failed to start Import ZFS pools by cache file.
May 31 21:11:54 proxmox zvol_wait[1491]: No zvols found, nothing to do.
May 31 21:11:54 proxmox systemd[1]: Finished Mount ZFS filesystems.
May 31 21:11:54 proxmox systemd[1]: Finished Wait for ZFS Volume (zvol) links in /dev.
May 31 21:11:54 proxmox systemd[1]: Reached target ZFS volumes are ready.
May 31 21:11:54 proxmox systemd[1]: Starting ZFS file system shares...
May 31 21:11:54 proxmox systemd[1]: Started ZFS Event Daemon (zed).
May 31 21:11:54 proxmox systemd[1]: Finished ZFS file system shares.
May 31 21:11:54 proxmox systemd[1]: Reached target ZFS startup target.
May 31 21:11:54 proxmox watchdog-mux[1548]: Watchdog driver 'Software Watchdog', version 0
May 31 21:11:54 proxmox zed[1552]: ZFS Event Daemon 2.1.4-pve1 (PID 1552)
May 31 21:11:54 proxmox kernel: softdog: initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
May 31 21:11:54 proxmox kernel: softdog: soft_reboot_cmd=<not set> soft_active_on_boot=0
May 31 21:11:54 proxmox dbus-daemon[1533]: [system] AppArmor D-Bus mediation is enabled
May 31 21:11:54 proxmox zed[1552]: Processing events since eid=0
May 31 21:11:54 proxmox zed[1581]: eid=4 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=4096 offset=1>
May 31 21:11:54 proxmox zed[1592]: eid=5 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=4096 offset=9>
May 31 21:11:54 proxmox rsyslogd[1541]: imuxsock: Acquired UNIX socket '/run/systemd/journal/syslog' (fd 3) from systemd. [v8.2102.0]
May 31 21:11:54 proxmox rsyslogd[1541]: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="1541" x-info="https://www.rsyslog.com"] start
May 31 21:11:54 proxmox systemd[1]: Started System Logging Service.
May 31 21:11:54 proxmox zed[1603]: eid=9 class=log_replay pool='ZFSDrives'
May 31 21:11:54 proxmox zed[1602]: eid=11 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=4096 offset=>
May 31 21:11:54 proxmox zed[1594]: eid=7 class=io pool='ZFSDrives' size=131072 offset=2835769483264 priority=0 err=5 flags=0x100190 bookmark=387:0:-2:7413
May 31 21:11:54 proxmox zed[1612]: eid=6 class=io pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 size=131072 offset=2835773677568 priority=0 >
May 31 21:11:54 proxmox zed[1587]: eid=3 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=12288 offset=>
May 31 21:11:54 proxmox zed[1616]: eid=15 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=4096 offset=>
May 31 21:11:54 proxmox zed[1618]: eid=10 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=4096 offset=>
May 31 21:11:54 proxmox zed[1619]: eid=13 class=data pool='ZFSDrives' priority=0 err=5 flags=0x1808991 bookmark=387:0:-2:7413
May 31 21:11:54 proxmox zed[1614]: eid=2 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=4096 offset=2>
May 31 21:11:54 proxmox zed[1620]: eid=14 class=log_replay pool='ZFSDrives'
May 31 21:11:54 proxmox zed[1621]: eid=16 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=4096 offset=>
May 31 21:11:54 proxmox zed[1613]: eid=8 class=data pool='ZFSDrives' priority=0 err=5 flags=0x1808991 bookmark=387:0:-2:7413
May 31 21:11:54 proxmox zed[1611]: eid=12 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=12288 offset>
May 31 21:11:54 proxmox zed[1605]: eid=1 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=4096 offset=8>
May 31 21:11:54 proxmox smartd[1544]: smartd 7.2 2020-12-30 r5155 [x86_64-linux-5.13.19-6-pve] (local build)
May 31 21:11:54 proxmox smartd[1544]: Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
May 31 21:11:54 proxmox smartd[1544]: Opened configuration file /etc/smartd.conf
May 31 21:11:54 proxmox smartd[1544]: Drive: DEVICESCAN, implied '-a' Directive on line 21 of file /etc/smartd.conf
May 31 21:11:54 proxmox smartd[1544]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
May 31 21:11:54 proxmox systemd-logind[1545]: New seat seat0.
May 31 21:11:54 proxmox zed[1657]: eid=17 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=12288 offset>
May 31 21:11:54 proxmox zed[1658]: eid=18 class=data pool='ZFSDrives' priority=0 err=5 flags=0x1808991 bookmark=387:0:-2:7413
May 31 21:11:54 proxmox zed[1660]: eid=19 class=log_replay pool='ZFSDrives'
May 31 21:11:54 proxmox zed[1667]: eid=21 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=4096 offset=>
May 31 21:11:54 proxmox zed[1668]: eid=20 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=4096 offset=>
May 31 21:11:54 proxmox zed[1672]: eid=22 class=checksum pool='ZFSDrives' vdev=ata-ST3000VN000-1H4167_Z300L7WE-part1 algorithm=fletcher4 size=12288 offset>
May 31 21:11:54 proxmox zed[1675]: eid=23 class=data pool='ZFSDrives' priority=0 err=5 flags=0x1808991 bookmark=387:0:-2:7413
May 31 21:11:54 proxmox zed[1677]: eid=24 class=log_replay pool='ZFSDrives'
May 31 21:11:54 proxmox zed[1683]: eid=25 class=data pool='ZFSDrives' priority=0 err=5 flags=0x1808991 bookmark=387:0:-2:7413
May 31 21:11:54 proxmox zed[1687]: eid=26 class=log_replay pool='ZFSDrives'
May 31 21:11:54 proxmox zed[1686]: eid=28 class=log_replay pool='ZFSDrives'
May 31 21:11:54 proxmox zed[1692]: eid=27 class=data pool='ZFSDrives' priority=0 err=5 flags=0x1808991 bookmark=387:0:-2:7413
May 31 21:11:54 proxmox zed[1693]: eid=30 class=log_replay pool='ZFSDrives'
May 31 21:11:54 proxmox zed[1695]: eid=29 class=data pool='ZFSDrives' priority=0 err=5 flags=0x1808991 bookmark=387:0:-2:7413
May 31 21:11:54 proxmox zed[1694]: eid=32 class=log_replay pool='ZFSDrives'
May 31 21:11:54 proxmox zed[1697]: eid=31 class=data pool='ZFSDrives' priority=0 err=5 flags=0x1808991 bookmark=387:0:-2:7413
May 31 21:11:54 proxmox smartd[1544]: Device: /dev/sda [SAT], is SMART capable. Adding to "monitor" list.
May 31 21:11:54 proxmox smartd[1544]: Device: /dev/sda [SAT], state read from /var/lib/smartmontools/smartd.ST1000LX015_1U7172-WDEWWJ80.ata.state
May 31 21:11:54 proxmox smartd[1544]: Device: /dev/sdb, type changed from 'scsi' to 'sat'
May 31 21:11:54 proxmox smartd[1544]: Device: /dev/sdb [SAT], opened
May 31 21:11:54 proxmox smartd[1544]: Device: /dev/sdb [SAT], ST3000VN000-1H4167, S/N:Z300L7WE, WWN:5-000c50-063c94a8e, FW:SC42, 3.00 TB
May 31 21:11:54 proxmox smartd[1544]: Device: /dev/sdb [SAT], found in smartd database: Seagate NAS HDD
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdb [SAT], is SMART capable. Adding to "monitor" list.
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdb [SAT], state read from /var/lib/smartmontools/smartd.ST3000VN000_1H4167-Z300L7WE.ata.state
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdc, type changed from 'scsi' to 'sat'
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdc [SAT], opened
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdc [SAT], WDC WD30EFRX-68EUZN0, S/N:WD-WMC4N0E5L7P3, WWN:5-0014ee-604fc1cea, FW:82.00A82, 3.00 TB
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdc [SAT], found in smartd database: Western Digital Red
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdc [SAT], is SMART capable. Adding to "monitor" list.
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdc [SAT], state read from /var/lib/smartmontools/smartd.WDC_WD30EFRX_68EUZN0-WD_WMC4N0E5L7P3.ata.state
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/nvme0, opened
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/nvme0, GIGABYTE GP-GSM2NE3128GNTD, S/N:SN210408943744, FW:EDFMB0.5, 128 GB
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/nvme0, is SMART capable. Adding to "monitor" list.
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/nvme0, state read from /var/lib/smartmontools/smartd.GIGABYTE_GP_GSM2NE3128GNTD-SN210408943744.nvme.sta>
May 31 21:11:55 proxmox smartd[1544]: Monitoring 3 ATA/SATA, 0 SCSI/SAS and 1 NVMe devices
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sda [SAT], 200 Currently unreadable (pending) sectors
May 31 21:11:55 proxmox smartd[1544]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
May 31 21:11:55 proxmox smartd[1544]: Warning via /usr/share/smartmontools/smartd-runner to root: successful
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sda [SAT], 200 Offline uncorrectable sectors
May 31 21:11:55 proxmox smartd[1544]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
May 31 21:11:55 proxmox smartd[1544]: Warning via /usr/share/smartmontools/smartd-runner to root: successful
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 50 to 66
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sda [SAT], Self-Test Log error count increased from 0 to 1
May 31 21:11:55 proxmox smartd[1544]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
May 31 21:11:55 proxmox smartd[1544]: Warning via /usr/share/smartmontools/smartd-runner to root: successful
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdb [SAT], 34928 Currently unreadable (pending) sectors
May 31 21:11:55 proxmox smartd[1544]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdb [SAT], 34928 Offline uncorrectable sectors
May 31 21:11:55 proxmox smartd[1544]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 77 to 76
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 91 to 96
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 7 Seek_Error_Rate changed from 52 to 50
May 31 21:11:55 proxmox postfix/qmgr[2010]: 91B62120F39: from=<root@proxmox.lan>, size=1010, nrcpt=1 (queue active)
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdb [SAT], Self-Test Log error count increased from 2 to 5
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdb [SAT], ATA error count increased from 2204 to 2578
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/nvme0, number of Error Log entries increased from 19 to 23

When testing the health of the sda drive in the BIOS it shows up as faulty, and the sound it makes backs that up. So I guess I have to replace that disk. But how can I get a new disk into the pool if there is no pool to import it to? Any suggestions on what to do next?
 
It does look like /dev/sdb has an issue, but you should still have a working member of the pool in /dev/sdc...

what does the output of zdb -l /dev/sdc1 look like?

try mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.backup and reboot, then try zpool import again
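Roughly, the full sequence would be something like this (assuming the cache file is in the default location):

mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.backup   # set the stale cache aside
reboot
# after the reboot, let ZFS scan the disks instead of trusting the cache:
zpool import              # should list ZFSDrives if the labels are readable
zpool import ZFSDrives    # then import it by name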
 
It does look like /dev/sdb has an issue, but you should still have a working member of the pool in /dev/sdc...

what does the output of zdb -l /dev/sdc1 look like?

try mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.backup and reboot, then try zpool import again
Thanks for your time

After the reboot, zpool import shows the same as before:
root@proxmox:~# zpool import
no pools available to import

zdb -l /dev/sdc1 results in
root@proxmox:~# zdb -l /dev/sdc1
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'ZFSDrives'
    state: 0
    txg: 3240393
    pool_guid: 171953915263981592
    errata: 0
    hostid: 3243471785
    hostname: 'proxmox'
    top_guid: 10039642933645888486
    guid: 5455137475863193922
    vdev_children: 1
    vdev_tree:
        type: 'mirror'
        id: 0
        guid: 10039642933645888486
        metaslab_array: 132
        metaslab_shift: 34
        ashift: 12
        asize: 3000578342912
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 13533403846860154412
            path: '/dev/disk/by-id/ata-ST3000VN000-1H4167_Z300L7WE-part1'
            devid: 'ata-ST3000VN000-1H4167_Z300L7WE-part1'
            phys_path: 'pci-0000:00:17.0-ata-2.0'
            whole_disk: 1
            DTL: 788
            create_txg: 4
            faulted: 1
        children[1]:
            type: 'disk'
            id: 1
            guid: 5455137475863193922
            path: '/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0E5L7P3-part1'
            devid: 'ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0E5L7P3-part1'
            phys_path: 'pci-0000:00:17.0-ata-4.0'
            whole_disk: 1
            DTL: 446
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
    labels = 0 1 2 3

It is strange how the pool just disappeared. Any more ideas to try?
 
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sda [SAT], 200 Currently unreadable (pending) sectors
May 31 21:11:55 proxmox smartd[1544]: Device: /dev/sdb [SAT], 34928 Currently unreadable (pending) sectors
and some other errors for both drives in the log:
* check/replace the cables
* check/replace the hba/controller
* check/replace the power-supply

and I'd agree with @bobmc - try plugging the drives in another system with ZFS and try importing the pool read-only
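On the other system, the read-only import attempt would look something like this (add -f only if zpool complains that the pool was last used by another system):

zpool import -d /dev/disk/by-id -o readonly=on ZFSDrives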
 
I agree, it is strange. At this stage I'd be inclined to try the drive in a new system or under a live cd boot - e.g https://openzfs.github.io/openzfs-docs/Getting Started/Debian/Debian Buster Root on ZFS.html and see if the pool is recognised there
Thanks, a great idea. I tested on the server itself using a live USB, as I have no other systems that can take 3.5" HDDs.

lsblk

$ lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0                  7:0    0   1.7G  1 loop /live/linux
loop1                  7:1    0   1.6G  0 loop /home
sda                    8:0    0 931.5G  0 disk
└─sda1                 8:1    0 931.5G  0 part
sdb                    8:16   0   2.7T  0 disk
├─sdb1                 8:17   0   2.7T  0 part
└─sdb9                 8:25   0     8M  0 part
sdc                    8:32   0   2.7T  0 disk
├─sdc1                 8:33   0   2.7T  0 part
└─sdc9                 8:41   0     8M  0 part
sdd                    8:48   1  14.5G  0 disk
├─sdd1                 8:49   1  14.4G  0 part /live/boot-dev
└─sdd2                 8:50   1    49M  0 part
sr0                   11:0    1  1024M  0 rom
nvme0n1              259:0    0 119.2G  0 disk
├─nvme0n1p1          259:1    0  1007K  0 part
├─nvme0n1p2          259:2    0   512M  0 part
└─nvme0n1p3          259:3    0 118.7G  0 part
  ├─pve-swap         254:0    0     8G  0 lvm
  ├─pve-root         254:1    0  29.5G  0 lvm
  ├─pve-data_tmeta   254:2    0     1G  0 lvm
  │ └─pve-data-tpool 254:4    0  64.5G  0 lvm
  │   └─pve-data     254:5    0  64.5G  1 lvm
  └─pve-data_tdata   254:3    0  64.5G  0 lvm
    └─pve-data-tpool 254:4    0  64.5G  0 lvm
      └─pve-data     254:5    0  64.5G  1 lvm

So I looked for the pool

$ zpool status
no pools available

So I try to import in readonly mode

$ sudo zpool import -o readonly=on ZFSDrives
cannot import 'ZFSDrives': pool was previously in use from another system.
Last accessed by proxmox (hostid=c15373a9) at Mon May 30 20:21:25 2022
The pool can be imported, use 'zpool import -f' to import the pool.

So it looks like the pool is there!
What can I take from this revelation? How would I go about to get it back to proxmox? What would my next step be?
 
and some other errors for both drives in the log:
* check/replace the cables
* check/replace the hba/controller
* check/replace the power-supply

and I'd agree with @bobmc - try plugging the drives in another system with ZFS and try importing the pool read-only
Thanks

I have switched the cables and the problem stays with the same disk regardless. All cables are firmly seated. I'm thinking about the health of the PSU, but HP workstations usually have rather bulletproof PSUs. Still, something is making my Proxmox system fail at boot with disk errors. Maybe I have to buy another PSU to check whether that sorts everything out.
 
Good news then - did you try zpool import -f and then zpool export ZFSDrives ?

If the drive has logged errors due to a bad cable - usually CRC errors rather than uncorrectable sectors for cable faults, then the errors remain in the log and the drive's smart status will never be clear again. So swapping the cables around will not always be conclusive.

In any case, for peace of mind, I would be looking at backing up my data asap and replacing /dev/sdb
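Spelled out, the import/export idea above is roughly the following, run on whichever system can currently see the pool:

zpool import -f ZFSDrives    # force the import despite the "last used by another system" warning
zpool export ZFSDrives       # cleanly export it, so the original host can import it without -f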
 
Good news then - did you try zpool import -f and then zpool export ZFSDrives ?

If the drive has logged errors due to a bad cable - usually CRC errors rather than uncorrectable sectors for cable faults, then the errors remain in the log and the drive's smart status will never be clear again. So swapping the cables around will not always be conclusive.

In any case, for peace of mind, I would be looking at backing up my data asap and replacing /dev/sdb

Yes, and I'm afraid it won't work. So, some good news and some bad.

$ sudo zpool import -f
   pool: ZFSDrives
     id: 171953915263981592
  state: ONLINE
 status: One or more devices were being resilvered.
 action: The pool can be imported using its name or numeric identifier.
 config:

        ZFSDrives                                     ONLINE
          mirror-0                                    ONLINE
            ata-ST3000VN000-1H4167_Z300L7WE           ONLINE
            ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0E5L7P3  ONLINE

and...

$ sudo zpool import -f ZFSDrives
cannot import 'ZFSDrives': one or more devices is currently unavailable

At least the live USB system can see the pool. But if I can't import it, I can't replace the faulty disk either. Or will zpool be able to import the pool if I disconnect the sdb disk? My replacement disk is incoming, so I have a day or two to prepare.

I also must find the reason why Proxmox won't boot properly. I don't want this to happen again.
 
Went at it with fresh eyes this morning and realised that I tried to mount the pool in read-write mode yesterday. So I tried again with the live USB system.

$ sudo zpool import -f -o readonly=on ZFSDrives
returned no output. Good news! So I checked the status of the pool:

$ zpool status
  pool: ZFSDrives
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon May 30 19:52:52 2022
        0B scanned at 0B/s, 0B issued at 0B/s, 184G total
        0B resilvered, 0.00% done, no estimated completion time
config:

        NAME                                          STATE     READ WRITE CKSUM
        ZFSDrives                                     ONLINE       0     0     0
          mirror-0                                    ONLINE       0     0     0
            ata-ST3000VN000-1H4167_Z300L7WE           ONLINE       2     0     0
            ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0E5L7P3  ONLINE       0     0     0

errors: No known data errors

OK, so a resilver has been running since the disk failure. Ten minutes later...

$ zpool status
  pool: ZFSDrives
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon May 30 19:52:52 2022
        0B scanned at 0B/s, 0B issued at 0B/s, 184G total
        0B resilvered, 0.00% done, no estimated completion time
config:

        NAME                                          STATE     READ WRITE CKSUM
        ZFSDrives                                     ONLINE       0     0     0
          mirror-0                                    ONLINE       0     0     0
            ata-ST3000VN000-1H4167_Z300L7WE           ONLINE       2     0     0
            ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0E5L7P3  ONLINE       0     0     0

errors: No known data errors

I tried looking at zpool events:

$ sudo zpool events
TIME                           CLASS
Jun  2 2022 09:29:46.865208419 ereport.fs.zfs.checksum
Jun  2 2022 09:29:46.865208419 ereport.fs.zfs.checksum
Jun  2 2022 09:29:46.869208402 ereport.fs.zfs.checksum
Jun  2 2022 09:29:46.885208327 ereport.fs.zfs.checksum
Jun  2 2022 09:29:46.901208255 ereport.fs.zfs.checksum
Jun  2 2022 09:29:55.329169793 ereport.fs.zfs.io
Jun  2 2022 09:29:55.329169793 ereport.fs.zfs.io
Jun  2 2022 09:30:00.497146253 ereport.fs.zfs.data
Jun  2 2022 09:30:00.525146123 ereport.fs.zfs.log_replay
Jun  2 2022 09:30:00.769145008 ereport.fs.zfs.checksum
Jun  2 2022 09:30:00.789144919 ereport.fs.zfs.checksum
Jun  2 2022 09:30:00.797144884 ereport.fs.zfs.checksum
Jun  2 2022 09:30:14.317083442 ereport.fs.zfs.data
Jun  2 2022 09:30:14.317083442 ereport.fs.zfs.log_replay
Jun  2 2022 09:30:14.649081941 ereport.fs.zfs.checksum
Jun  2 2022 09:30:14.665081866 ereport.fs.zfs.checksum
Jun  2 2022 09:30:14.697081722 ereport.fs.zfs.checksum
Jun  2 2022 09:30:28.193020615 ereport.fs.zfs.data
Jun  2 2022 09:30:28.193020615 ereport.fs.zfs.log_replay
Jun  2 2022 09:30:28.513019167 ereport.fs.zfs.checksum
Jun  2 2022 09:30:28.553018986 ereport.fs.zfs.checksum
Jun  2 2022 09:30:28.565018932 ereport.fs.zfs.checksum
Jun  2 2022 09:30:42.532955914 ereport.fs.zfs.data
Jun  2 2022 09:30:42.532955914 ereport.fs.zfs.log_replay
Jun  2 2022 09:32:47.544371894 ereport.fs.zfs.io
Jun  2 2022 09:32:51.412352375 ereport.fs.zfs.io
Jun  2 2022 09:32:51.520351829 sysevent.fs.zfs.pool_import

Doesn't tell me anything more than that there are errors.

zpool events -v is attached in the file below

Should I wait or try to export or something completely different? I don't want to create more data loss at this moment.
 

Attachments

  • events.txt (46.9 KB)
I would wait for the rebuild, but as soon as a replacement drive is available I would be replacing /dev/sdb. I would also do a zpool export followed by a zpool import -d /dev/disk/by-id ZFSDrives, as this makes the pool more tolerant of drive letters being reassigned by the system. Make a note of the drive serial numbers and the matching ids, which you can query using smartctl -a /dev/sdb, so you know which is which should you need to swap a drive again.
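In terms of commands, that would be roughly:

zpool export ZFSDrives
zpool import -d /dev/disk/by-id ZFSDrives
smartctl -a /dev/sdb    # note the model and serial number and match it to the by-id name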
 
Thanks. I'll wait and see if the new drive solves it all. I managed to import the pool on the Proxmox system this afternoon; however, after the next reboot it wouldn't import again.

At least I solved most of the other boot failures. Proxmox referenced a directory from the old broken sda disk and threw errors on boot. When I deleted the corresponding .mount file in /etc/systemd/system and rebooted, all errors were gone except the ones from the sdb drive. Hopefully the new disk will get rid of them.
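(For anyone finding this later: the unit name below is just an example of the kind of file I removed; the actual name will match whatever mount point referenced the dead disk.)

rm /etc/systemd/system/mnt-blueirisdata.mount   # example unit name only
systemctl daemon-reload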

Thanks @bobmc for all the help.
 
Got the new disk, and Proxmox did everything for me except the replace disk command, which was the only thing I had to do myself. That worked like a charm and now almost everything is back to normal. It looks like there was no data loss, which is great! Case closed!
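For reference, the replace step is a single command along these lines (the old-disk id is the one from earlier in the thread; the new-disk id is a placeholder):

zpool replace ZFSDrives ata-ST3000VN000-1H4167_Z300L7WE /dev/disk/by-id/ata-NEWDISK-SERIAL
zpool status ZFSDrives    # watch the resilver onto the new disk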
 
