ZFS pool won't mount, all disks healthy

kyeotic

I have (had?) a ZFS pool set up through Proxmox that stopped working. All the disks show up in Node > Disks with "SMART: PASSED" but "Mounted: No".

[screenshot: Node > Disks view]


Bash:
root@homelab:~# zpool status
no pools available

root@homelab:~# zpool history
no pools available

root@homelab:~# zpool import tank
cannot import 'tank': I/O error
        Destroy and re-create the pool from
        a backup source.

root@homelab:~# ls /dev/disk/by-id
ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105328K                                   nvme-TEAM_TM8FP6512G_TPBF2310170070320761
ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105328K-part1                             nvme-TEAM_TM8FP6512G_TPBF2310170070320761_1
ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105328K-part9                             nvme-TEAM_TM8FP6512G_TPBF2310170070320761_1-part1
ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105332H                                   nvme-TEAM_TM8FP6512G_TPBF2310170070320761_1-part2
ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105332H-part1                             nvme-TEAM_TM8FP6512G_TPBF2310170070320761_1-part3
ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105332H-part9                             nvme-TEAM_TM8FP6512G_TPBF2310170070320761-part1
ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105404L                                   nvme-TEAM_TM8FP6512G_TPBF2310170070320761-part2
ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105404L-part1                             nvme-TEAM_TM8FP6512G_TPBF2310170070320761-part3
ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105404L-part9                             wwn-0x5002538f33112e23
dm-name-pve-root                                                              wwn-0x5002538f33112e23-part1
dm-name-pve-swap                                                              wwn-0x5002538f33112e23-part9
dm-uuid-LVM-R6n6naJnN0MGscNHJMQliRntP3DadNPO3W8zZskqP5j8BDVtr2CLOlYH4JDVeuBz  wwn-0x5002538f33112e27
dm-uuid-LVM-R6n6naJnN0MGscNHJMQliRntP3DadNPOJ0bU7RRhVlJBUNsR4GvNBk09kxghMlvV  wwn-0x5002538f33112e27-part1
lvm-pv-uuid-1JMPCW-SdpW-Gmfq-srge-NVyN-rdvN-D6ndtC                            wwn-0x5002538f33112e27-part9
nvme-eui.6479a784e0000385                                                     wwn-0x5002538f33112e6f
nvme-eui.6479a784e0000385-part1                                               wwn-0x5002538f33112e6f-part1
nvme-eui.6479a784e0000385-part2                                               wwn-0x5002538f33112e6f-part9
nvme-eui.6479a784e0000385-part3

I would like to recover this pool if possible, but if not I would love to know what happened. This was using RAIDZ1, and my understanding was that a disk failure should be recoverable. How did the entire pool fail?

In case this isn't an issue with the pool but with PVE itself, this is the error all the containers report when starting:
TASK ERROR: activating LV 'pve/data' failed: Check of pool pve/data failed (status:1). Manual repair required!
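For anyone searching later: the usual manual-repair route for an LVM thin pool like pve/data looks roughly like this (I have not run it on this box; the VG/LV name is just taken from the error above):

Code:
# deactivate the thin pool first (fails if any LV on it is still active)
lvchange -an pve/data
# rebuild the thin-pool metadata (runs thin_repair onto a spare metadata area)
lvconvert --repair pve/data
# reactivate and see if the check passes now
lvchange -ay pve/data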

The summary shows lots of free space

[screenshot: storage summary]

but Disks > LVM shows 97% usage. I am not sure what this means.

[screenshot: Disks > LVM]

---
EDIT 1

I found this post, set thin_check_options = [ "-q", "--skip-mappings" ], and rebooted. The pve/data error is gone and some containers started, but the ZFS pool is still not showing up and importing it still fails with the same error.
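For reference, that option goes in the global section of /etc/lvm/lvm.conf, so the edit looks roughly like this (excerpt only; the file already contains a global section with other settings):

Code:
# /etc/lvm/lvm.conf (excerpt)
global {
    # skip the expensive mapping check so the thin pool can activate
    thin_check_options = [ "-q", "--skip-mappings" ]
}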
 

Hi, the space assigned to LVs is 100% on mine too; it's the amount allocated to the LVs, not the amount actually used inside them. From the last time my system went wrong I've learned one thing: never store any VMs or backups on the boot drive. That way, if it gets totally messed up, it's sometimes easier to nuke from orbit and start again from the backups than to try to get it working again.
It should be hard to mess up a ZFS array, because it should scan the drives and tell you which ones are missing.
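If you want to see how much is actually used inside the thin pool rather than how much is allocated to it, something like this shows it (assuming the default Proxmox volume group name pve):

Bash:
# Data%/Meta% is the real usage inside the thin pool; LSize is what is allocated
lvs -o lv_name,lv_size,data_percent,metadata_percent pve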
 
I’m happy to delete all the VMs and backups from the boot drive if that gets me access to the pool again. That doesn’t appear to work, though; I still get the I/O error when trying to import the pool.

If I can’t get the pool back I would at least like to understand how this happened. Does the boot drive filling up cause the pool to get corrupted?
 
Can you just post "zpool import" without any additional parameters? I am guessing that there are read/write/checksum errors on the disks, or that the disks show up but are read-only for some reason because something on the disk is broken that SMART does not report.

You might be lucky importing the pool in read-only mode and without mounting the datasets (zpool import -o readonly=on -N) and then performing a ZFS send to recover the data. You could also try the recovery options zpool import -F or zpool import -F -X, but be aware that these might lead to data loss, as the pool is rolled back to a functioning state, if possible.
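Spelled out, that attempt might look roughly like this (pool name taken from this thread; the copy target is just an example, and rsync is shown as a fallback because new snapshots cannot be created on a read-only pool, so zfs send only works if a snapshot already exists):

Bash:
# import read-only and without mounting any datasets
zpool import -o readonly=on -N tank

# mount the datasets (they come up read-only) and copy the data somewhere safe
zfs mount -a
rsync -a /tank/ /mnt/rescue-disk/

# last-resort recovery imports (may roll the pool back and lose recent writes):
# zpool import -F tank
# zpool import -F -X tank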
 
Here is the result

Code:
root@homelab:~# zpool import
   pool: tank
     id: 12367558582491436533
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

        tank                                             ONLINE
          raidz1-0                                       ONLINE
            ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105328K  ONLINE
            ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105332H  ONLINE
            ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105404L  ONLINE


I also went through several rounds of debugging with a user on Reddit, in this thread. It has the results of many other commands.
 
What you're describing appears consistent with firmware/timing issues.

The first order of business is defining the worst acceptable outcome: do you need the data?

If so, install your drives on a generic SATA HBA and see if the pool will import. If it does, proceed to check/update the firmware of both your HBA and SSDs.

If not, wipe the drives and try again. Proceed to produce sample data until the problem replicates, and then post the relevant entries from dmesg.
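For the dmesg part, something along these lines usually pulls out the interesting bits (the filter is just a starting point):

Bash:
# kernel messages about the SATA/NVMe layer, block devices, ZFS and I/O errors
dmesg -T | grep -iE 'ata[0-9]|nvme|sd[a-z]|zfs|i/o error'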
 
I do not need the data. I would like it, since reconfiguring everything on it will take hours, but I do not need it. I just want to understand what happened so I can avoid it in the future.

Can you elaborate on "relevant entries from dmesg"? I do not know what that is.
 
Currently that returns nothing. Would you expect it to return anything in the current state?
 
Egg on my face; I didn't realize this isn't trivial on a PVE 8 system. I tried looking at journalctl on a PVE 8 system for disk events and couldn't find any. Perhaps someone from the Proxmox team can help?

In the meantime, it might be good to install rsyslog and reboot (apt install rsyslog).
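That is, roughly (the grep afterwards is just an example filter to run once the problem recurs):

Bash:
apt install rsyslog
systemctl enable --now rsyslog
reboot
# after the next failure, look through the classic syslog file
grep -iE 'ata|nvme|zfs|error' /var/log/syslog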
 
