ZFS pool won't mount, all disks healthy

kyeotic

I have (had?) a ZFS pool set up through Proxmox that stopped working. All the disks show up in Node > Disks with "SMART: PASSED" but "Mounted: No".

[screenshot: Node > Disks view]


Bash:
root@homelab:~# zpool status
no pools available

root@homelab:~# zpool history
no pools available

root@homelab:~# zpool import tank
cannot import 'tank': I/O error
        Destroy and re-create the pool from
        a backup source.

root@homelab:~# ls /dev/disk/by-id
ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105328K                                   nvme-TEAM_TM8FP6512G_TPBF2310170070320761
ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105328K-part1                             nvme-TEAM_TM8FP6512G_TPBF2310170070320761_1
ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105328K-part9                             nvme-TEAM_TM8FP6512G_TPBF2310170070320761_1-part1
ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105332H                                   nvme-TEAM_TM8FP6512G_TPBF2310170070320761_1-part2
ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105332H-part1                             nvme-TEAM_TM8FP6512G_TPBF2310170070320761_1-part3
ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105332H-part9                             nvme-TEAM_TM8FP6512G_TPBF2310170070320761-part1
ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105404L                                   nvme-TEAM_TM8FP6512G_TPBF2310170070320761-part2
ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105404L-part1                             nvme-TEAM_TM8FP6512G_TPBF2310170070320761-part3
ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105404L-part9                             wwn-0x5002538f33112e23
dm-name-pve-root                                                              wwn-0x5002538f33112e23-part1
dm-name-pve-swap                                                              wwn-0x5002538f33112e23-part9
dm-uuid-LVM-R6n6naJnN0MGscNHJMQliRntP3DadNPO3W8zZskqP5j8BDVtr2CLOlYH4JDVeuBz  wwn-0x5002538f33112e27
dm-uuid-LVM-R6n6naJnN0MGscNHJMQliRntP3DadNPOJ0bU7RRhVlJBUNsR4GvNBk09kxghMlvV  wwn-0x5002538f33112e27-part1
lvm-pv-uuid-1JMPCW-SdpW-Gmfq-srge-NVyN-rdvN-D6ndtC                            wwn-0x5002538f33112e27-part9
nvme-eui.6479a784e0000385                                                     wwn-0x5002538f33112e6f
nvme-eui.6479a784e0000385-part1                                               wwn-0x5002538f33112e6f-part1
nvme-eui.6479a784e0000385-part2                                               wwn-0x5002538f33112e6f-part9
nvme-eui.6479a784e0000385-part3

I would like to recover this pool if possible, but if not I would love to know what happened. This was using RAIDZ1, and my understanding was that a disk failure should be recoverable. How did the entire pool fail?

In case this isn't an issue with the pool but with PVE itself, this is the error all the containers report when starting:
TASK ERROR: activating LV 'pve/data' failed: Check of pool pve/data failed (status:1). Manual repair required!
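For anyone searching later: the usual manual-repair route for an LVM thin pool like pve/data looks roughly like this (I have not run it on this box; the VG/LV name is just taken from the error above):

Code:
# deactivate the thin pool first (fails if any LV on it is still active)
lvchange -an pve/data
# rebuild the thin-pool metadata (runs thin_repair onto a spare metadata area)
lvconvert --repair pve/data
# reactivate and see if the check passes now
lvchange -ay pve/data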

The summary shows lots of free space

[screenshot: storage summary]

but Disks > LVM shows 97% usage. I am not sure what this means.

[screenshot: Disks > LVM]

---
EDIT 1

I found this post, set thin_check_options = [ "-q", "--skip-mappings" ], and rebooted. The pve/data error is gone and some containers started, but the ZFS pool is still not showing up and importing it still fails with the same error.
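For reference, that option goes in the global section of /etc/lvm/lvm.conf, so the edit looks roughly like this (excerpt only; the file already contains a global section with other settings):

Code:
# /etc/lvm/lvm.conf (excerpt)
global {
    # skip the expensive mapping check so the thin pool can activate
    thin_check_options = [ "-q", "--skip-mappings" ]
}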
 

Hi, the space assigned to LVs is 100% on mine too; it's the amount allocated to the LVs, not the amount actually used inside them. From the last time my system went wrong I've learned one thing: never store any VMs or backups on the boot drive. That way, if it gets totally messed up, it's sometimes easier to nuke from orbit and start again from the backups than to try to get it working again.
It should be hard to mess up a ZFS array, because it should scan the drives and tell you which ones are missing.
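If you want to see how much is actually used inside the thin pool rather than how much is allocated to it, something like this shows it (assuming the default Proxmox volume group name pve):

Bash:
# Data%/Meta% is the real usage inside the thin pool; LSize is what is allocated
lvs -o lv_name,lv_size,data_percent,metadata_percent pve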
 
I’m happy to delete all the VMs and backups from the boot drive if that gets me access to the pool again. That doesn’t appear to work, though; I still get the I/O error when trying to import the pool.

If I can’t get the pool back I would at least like to understand how this happened. Does the boot drive filling up cause the pool to get corrupted?
 
Can you just post "zpool import" without any additional parameters? I am guessing that there are read/write/checksum errors on the disks, or that the disks show up but are read-only for some reason because something on the disk is broken that SMART does not report.

You might be lucky importing the pool in read-only mode and without mounting the datasets (zpool import -o readonly=on -N) and then performing a ZFS send to recover the data. You could also try the recovery options zpool import -F or zpool import -F -X, but be aware that these might lead to data loss, as the pool is rolled back to a functioning state, if possible.
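Spelled out, that attempt might look roughly like this (pool name taken from this thread; the copy target is just an example, and rsync is shown as a fallback because new snapshots cannot be created on a read-only pool, so zfs send only works if a snapshot already exists):

Bash:
# import read-only and without mounting any datasets
zpool import -o readonly=on -N tank

# mount the datasets (they come up read-only) and copy the data somewhere safe
zfs mount -a
rsync -a /tank/ /mnt/rescue-disk/

# last-resort recovery imports (may roll the pool back and lose recent writes):
# zpool import -F tank
# zpool import -F -X tank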
 
Here is the result

Code:
root@homelab:~# zpool import
   pool: tank
     id: 12367558582491436533
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

        tank                                             ONLINE
          raidz1-0                                       ONLINE
            ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105328K  ONLINE
            ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105332H  ONLINE
            ata-Samsung_SSD_870_EVO_2TB_S6PNNS0W105404L  ONLINE


I also went through several rounds of debugging with a user on Reddit, in this thread. It has the results of many other commands.
 
What you're describing appears consistent with firmware/timing issues.

The first order of business is defining the worst acceptable outcome: do you need the data?

If so, install your drives on a generic SATA HBA and see if the pool will import. If it does, proceed to check/update the firmware of both your HBA and SSDs.

If not, wipe the drives and try again. Proceed to produce sample data until the problem replicates, and then post the relevant entries from dmesg.
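For the dmesg part, something along these lines usually pulls out the interesting bits (the filter is just a starting point):

Bash:
# kernel messages about the SATA/NVMe layer, block devices, ZFS and I/O errors
dmesg -T | grep -iE 'ata[0-9]|nvme|sd[a-z]|zfs|i/o error'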
 
I do not need the data. I would like it, since reconfiguring everything on it will take hours, but I do not need it. I just want to understand what happened so I can avoid it in the future.

Can you elaborate on "relevant entries from dmesg"? I do not know what that is.
 
Currently that returns nothing. Would you expect it to return anything in the current state?
 
Egg on my face; I didn't realize this isn't trivial on a PVE 8 system. I tried looking at journalctl on a PVE 8 system for disk events and couldn't find any. Perhaps someone from the Proxmox team can help?

In the meantime, it might be good to install rsyslog and reboot (apt install rsyslog).
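That is, roughly (the grep afterwards is just an example filter to run once the problem recurs):

Bash:
apt install rsyslog
systemctl enable --now rsyslog
reboot
# after the next failure, look through the classic syslog file
grep -iE 'ata|nvme|zfs|error' /var/log/syslog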
 
