ZFS pool disappeared after upgrade from 1.0.8 to 1.0.11

DynFi User

I had a pool in a degraded state on a server with the following config:

Code:
root@tremoctopus:~# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:00:04 with 0 errors on Sun Mar 14 00:24:05 2021
config:

    NAME                                                  STATE     READ WRITE CKSUM
    rpool                                                 ONLINE       0     0     0
      mirror-0                                            ONLINE       0     0     0
        ata-2.5__SATA_SSD_3MG2-P_BCA11812100260002-part3  ONLINE       0     0     0
        ata-2.5__SATA_SSD_3MG2-P_BCA11812100260001-part3  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: resilvered 4.11G in 0 days 00:01:39 with 0 errors on Wed Feb 17 11:25:00 2021
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            sdc     ONLINE       0     1     0
            sdd     ONLINE       0     0     0
            sde     UNAVAIL      4 1.45K     0
            sdf     ONLINE       0     1     0
        logs
          nvme0n1   ONLINE       0     0     0

errors: No known data errors

I rebooted the server, and once it had finished booting, the tank pool had simply vanished!


I think there are a couple of problems here:

  1. there should be a BIG RED FLAG in the GUI if a pool has one unit broken / failed (today nothing is displayed in the GUI; you have to drill all the way down to the disks and ZFS to figure out what is going on)
  2. this has to be tied to some e-mail / warning system somehow (see the ZED sketch right after this list)
  3. I can't understand what happened to the system for the pool to be simply wiped with no prior warning or anything like it. The pool may be in a failed state, or corrupted, or something else status-wise, but it has vanished, so something must be really wrong somewhere…
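
(For point 2: as a stopgap, ZED, the ZFS Event Daemon that ships with ZFS on Linux, can already send e-mail on pool events if its config is adjusted. A rough sketch below; the address is a placeholder and the path is the usual Debian/Proxmox default.)

Code:
# /etc/zfs/zed.d/zed.rc -- relevant excerpt
ZED_EMAIL_ADDR="admin@example.com"   # placeholder address for notifications
ZED_NOTIFY_VERBOSE=1                 # also notify on non-fault events such as finished scrubs
ZED_NOTIFY_INTERVAL_SECS=3600        # rate-limit repeated notifications for the same event

# restart the daemon so it re-reads zed.rc
systemctl restart zfs-zed.service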

Thanks for your help recovering this lost pool
 
OK, so there is definitely a bug here!

I was put on the right track thanks to this post: https://forum.proxmox.com/threads/zfs-pool-disappears-after-reboot.54736/


The ZFS import service can't seem to start because of this:

Code:
-- Subject: A start job for unit zfs-import-cache.service has begun execution
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A start job for unit zfs-import-cache.service has begun execution.
--
-- The job identifier is 530.
Apr 01 13:08:01 tremoctopus proxmox-backup-proxy[1482]: GET /api2/json/admin/datastore/BackupPMX/status: 400 Bad Request: [client [::ffff:192.168.210.11]:38556] unable to open chunk store 'BackupPMX' at
Apr 01 13:08:02 tremoctopus proxmox-backup-proxy[1482]: GET /api2/json/admin/datastore/BackupPMX/status: 400 Bad Request: [client [::ffff:192.168.210.12]:49214] unable to open chunk store 'BackupPMX' at
Apr 01 13:08:02 tremoctopus zpool[15730]: cannot import 'tank': one or more devices is currently unavailable
Apr 01 13:08:02 tremoctopus systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/FAILURE
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- An ExecStart= process belonging to unit zfs-import-cache.service has exited.
--
-- The process' exit code is 'exited' and its exit status is 1.
Apr 01 13:08:02 tremoctopus systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- The unit zfs-import-cache.service has entered the 'failed' state with result 'exit-code'.
Apr 01 13:08:02 tremoctopus systemd[1]: Failed to start Import ZFS pools by cache file.
-- Subject: A start job for unit zfs-import-cache.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A start job for unit zfs-import-cache.service has finished with a failure.
--
-- The job identifier is 530 and the job result is failed.
Apr 01 13:08:02 tremoctopus zed[15842]: eid=9 class=statechange pool_guid=0xFF21611A82023449 vdev_path=/dev/sde1 vdev_state=UNAVAIL
Apr 01 13:08:03 tremoctopus zed[15846]: eid=10 class=vdev.no_replicas pool_guid=0xFF21611A82023449
Apr 01 13:08:03 tremoctopus zed[15942]: eid=11 class=zpool pool_guid=0xFF21611A82023449
Apr 01 13:08:03 tremoctopus proxmox-backup-proxy[1482]: read disk_usage on "/tank/BckpPmxSrv" failed - ENOENT: No such file or directory
Apr 01 13:08:03 tremoctopus proxmox-backup-proxy[1482]: find_mounted_device failed - ENOENT: No such file or directory
Apr 01 13:08:03 tremoctopus proxmox-backup-proxy[1482]: read disk_usage on "/tank/BackupCT" failed - ENOENT: No such file or directory
Apr 01 13:08:03 tremoctopus proxmox-backup-proxy[1482]: find_mounted_device failed - ENOENT: No such file or directory
Apr 01 13:08:03 tremoctopus proxmox-backup-proxy[1482]: read disk_usage on "/tank/BackupMontmartre" failed - ENOENT: No such file or directory
Apr 01 13:08:03 tremoctopus proxmox-backup-proxy[1482]: find_mounted_device failed - ENOENT: No such file or directory
Apr 01 13:08:04 tremoctopus proxmox-backup-proxy[1482]: GET /api2/json/admin/datastore/BackupPMX/status: 400 Bad Request: [client [::ffff:192.168.210.10]:38318] unable to open chunk store 'BackupPMX' at
Apr 01 13:08:05 tremoctopus proxmox-backup-proxy[1482]: GET /api2/json/admin/datastore/BackupPMX/status: 400 Bad Request: [client [::ffff:192.168.210.13]:50230] unable to open chunk store 'BackupPMX' at


Would you mind letting me know what to do from here?
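
In the meantime, a rough sketch of things that might be worth trying (pool name tank is taken from the status output above; /etc/zfs/zpool.cache is the stock cachefile path, adjust if yours differs):

Code:
# state of the failing import unit
systemctl status zfs-import-cache.service

# scan devices and list pools that could be imported, without importing anything
zpool import

# attempt an import using stable by-id device names instead of sdX
zpool import -d /dev/disk/by-id tank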
 
I just bought my subscription to PBS hoping that it will help solve the issue!
 
I had to import the pool manually:

Code:
zpool import mypool

This worked, and I was then able to start the zfs-import-cache.service:

Code:
systemctl enable zfs-import-cache.service
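
A follow-up step that is usually recommended after a manual import (a sketch, assuming the default cachefile path and the tank pool from the status output earlier) is to refresh the pool's entry in the cachefile so zfs-import-cache.service can find it again on the next boot:

Code:
# re-register the pool in the cachefile read by zfs-import-cache.service
zpool set cachefile=/etc/zfs/zpool.cache tank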


I'll need to pay a visit to my DC to change the failed disk!
 
I wanted to mark this thread as solved, but there is still a problem: it is not normal that zfs-import-cache.service can't be started when a pool has a failed disk.

So I am leaving it as is, hoping that there is a cleaner way to solve this.
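
For anyone debugging the same thing, one way to check what the cachefile itself thinks exists (a sketch, assuming the default path) is zdb:

Code:
# with no arguments zdb dumps the pool configurations found in the cachefile
zdb
# or point it at an explicit cachefile
zdb -U /etc/zfs/zpool.cache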
 
