zpool mirror faulted

Romainp

Hello!
So I would like to have your input on an issue I have. I have a PBS 3.1 setup with these disks:
Code:
NAME         FSTYPE      FSVER    LABEL  UUID                                   FSAVAIL FSUSE% MOUNTPOINTS
sda
├─sda1
├─sda2       vfat        FAT32           399C-565D                               510.7M     0% /boot/efi
└─sda3       LVM2_member LVM2 001        kG1e8u-qINS-TtPR-Mqvu-bVlj-II68-VBEEYF
  ├─pbs-swap swap        1               0d6b2516-fc82-4a58-8c27-d62a039f88c4                  [SWAP]
  └─pbs-root ext4        1.0             c79ccb95-242e-4bfd-a38f-3b4332001c40    376.4G     4% /
sdb
├─sdb1       zfs_member  5000     pool01 9173327591722584431
└─sdb9
sdc
├─sdc1       zfs_member  5000     pool01 9173327591722584431
└─sdc9
sdd
├─sdd1       zfs_member  5000     pool01 9173327591722584431
└─sdd9
sde
├─sde1       zfs_member  5000     pool01 9173327591722584431
└─sde9
sdf
├─sdf1       zfs_member  5000     pool01 9173327591722584431
└─sdf9
sdg
├─sdg1       zfs_member  5000     pool01 9173327591722584431
└─sdg9
sdh
├─sdh1       zfs_member  5000     pool01 9173327591722584431
└─sdh9
sdi
├─sdi1       zfs_member  5000     pool01 9173327591722584431
└─sdi9
sdj
├─sdj1       zfs_member  5000     pool01 9173327591722584431
└─sdj9
sdk
├─sdk1       zfs_member  5000     pool01 9173327591722584431
└─sdk9
sdl
├─sdl1       zfs_member  5000     pool01 9173327591722584431
└─sdl9
sdm
├─sdm1       zfs_member  5000     pool01 9173327591722584431
└─sdm9
sdn
├─sdn1       zfs_member  5000     pool01 9173327591722584431
└─sdn9
sdo
├─sdo1       zfs_member  5000     pool01 9173327591722584431
└─sdo9
sdp
├─sdp1       zfs_member  5000     pool01 9173327591722584431
└─sdp9
sdq
├─sdq1       zfs_member  5000     pool01 9173327591722584431
└─sdq9
sdr
├─sdr1       zfs_member  5000     pool01 9173327591722584431
└─sdr9
sds
├─sds1       zfs_member  5000     pool01 9173327591722584431
└─sds9
sdt
├─sdt1       zfs_member  5000     pool01 9173327591722584431
└─sdt9
sdu
├─sdu1       zfs_member  5000     pool01 9173327591722584431
└─sdu9
sdv
├─sdv1       zfs_member  5000     pool01 9173327591722584431
└─sdv9
sdw
├─sdw1       zfs_member  5000     pool01 9173327591722584431
└─sdw9
sdx
├─sdx1       zfs_member  5000     pool01 9173327591722584431
└─sdx9
sdy
├─sdy1       zfs_member  5000     pool01 9173327591722584431
└─sdy9

All the disks are ok:
Code:
root@srv4:~# ssacli ctrl slot=3 pd all show


Smart Array P440 in Slot 3 (HBA Mode)


   HBA Drives


      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:13 (port 1I:box 1:bay 13, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:14 (port 1I:box 1:bay 14, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:15 (port 1I:box 1:bay 15, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:16 (port 1I:box 1:bay 16, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:17 (port 1I:box 1:bay 17, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:18 (port 1I:box 1:bay 18, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:19 (port 1I:box 1:bay 19, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:20 (port 1I:box 1:bay 20, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:21 (port 1I:box 1:bay 21, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:22 (port 1I:box 1:bay 22, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:23 (port 1I:box 1:bay 23, SAS HDD, 1.8 TB, OK)
      physicaldrive 1I:1:24 (port 1I:box 1:bay 24, SAS HDD, 1.8 TB, OK)

After a reboot I got some mirror issues (no big deal, I have a synced PBS for those backups).


Code:
root@srv4:~# zpool status -x
  pool: pool01
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 672M in 00:04:14 with 107062 errors on Wed Jan 31 18:02:53 2024
config:


        NAME                      STATE     READ WRITE CKSUM
        pool01                    DEGRADED     0     0     0
          mirror-0                ONLINE       0     0     0
            sdb                   ONLINE       0     0     0
            sdc                   ONLINE       0     0     0
          mirror-1                ONLINE       0     0     0
            sdd                   ONLINE       0     0     0
            sde                   ONLINE       0     0     0
          mirror-2                ONLINE       0     0     0
            sdf                   ONLINE       0     0     0
            sdg                   ONLINE       0     0     0
          mirror-3                ONLINE       0     0     0
            sdh                   ONLINE       0     0     0
            sdi                   ONLINE       0     0     0
          mirror-4                DEGRADED  105K     0     0
            sdj                   DEGRADED     0     0  209K  too many errors
            8662495910803711342   FAULTED      0     0     0  was /dev/sdj1
          mirror-5                DEGRADED     0     0     0
            15846917852308872208  FAULTED      0     0     0  was /dev/sdk1
            sdm                   ONLINE       0     0     0
          mirror-6                ONLINE       0     0     0
            sdn                   ONLINE       0     0     0
            sdo                   ONLINE       0     0     0
          mirror-7                ONLINE       0     0     0
            sdp                   ONLINE       0     0     0
            sdq                   ONLINE       0     0     0
          mirror-8                ONLINE       0     0     0
            sdr                   ONLINE       0     0     0
            sds                   ONLINE       0     0     0
          mirror-9                ONLINE       0     0     0
            sdt                   ONLINE       0     0     0
            sdu                   ONLINE       0     0     0
          mirror-10               ONLINE       0     0     0
            sdv                   ONLINE       0     0     0
            sdw                   ONLINE       0     0     0
          mirror-11               ONLINE       0     0     0
            sdx                   ONLINE       0     0     0
            sdy                   ONLINE       0     0     0

Problem is that I am not sure what to do at this point...
* It seems the reboot "renamed" some disks, so the device names may have changed (only a guess)
* I am not sure which disk has been renamed/lost in the config, i.e. how do I identify which physical disk the "was /dev/sdj1" entry refers to?
* What could be the steps to fix this if possible?
* Since this sdX naming is not really reliable, some posts on the internet mention using the /dev/disk/by-id naming instead, but how can I do this?


Any help will be more than welcome :)
 
* It seems the reboot "renamed" some disks, so the device names may have changed (only a guess)
ZFS doesn't care about that. It identifies the disks by the metadata stored on the disks. So changed names shouldn't be a problem.
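
For reference (this also touches the /dev/disk/by-id question above): since ZFS finds its members via the on-disk metadata, you can switch a pool over to stable by-id names simply by re-importing it. A minimal sketch, assuming the pool name pool01 from this thread and that nothing (e.g. the PBS datastore) is using the pool during the export:
Code:
# map the current kernel names (sdb, sdc, ...) to their stable by-id aliases
ls -l /dev/disk/by-id/

# re-import the pool using the stable by-id paths
zpool export pool01
zpool import -d /dev/disk/by-id pool01
The by-id symlinks encode the drive serial/WWN and point at the current sdX name, which also helps when you need to locate a physical disk in its bay.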

What could be the steps to fix this if possible?
Your striped mirror got 2 mirrors with missing/failed disks, and on one of those mirrors the remaining disk has around 210,000 corrupted records. Once a mirror is running on only a single disk, such errors can no longer be repaired. So you best follow what ZFS is telling you:
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
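
To find out which files "the file in question" actually are, the verbose status output lists them; a small sketch for the pool in this thread:
Code:
# -v additionally prints the files with permanent (unrecoverable) errors,
# so you know what would need to be restored from the synced PBS
zpool status -v pool01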
 
Thanks for the reply :)
OK, so I will rebuild the pool then. Thanks for the advice!
 
By the way... it is also possible to stripe 3-disk mirrors for additional reliability, so ZFS could still repair corrupted records even if one disk of that mirror completely failed. But you will of course lose some capacity.
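
For illustration, a creation-time sketch of such a layout; the disk-a ... disk-f names are placeholders, not real devices from this thread:
Code:
# striped three-way mirrors: every block lives on 3 disks, so a mirror can
# still self-heal a corrupted copy even with one member completely gone
zpool create pool01 \
  mirror /dev/disk/by-id/disk-a /dev/disk/by-id/disk-b /dev/disk/by-id/disk-c \
  mirror /dev/disk/by-id/disk-d /dev/disk/by-id/disk-e /dev/disk/by-id/disk-f
With the 24 disks from this setup that would give 8 three-way mirrors instead of 12 two-way mirrors, i.e. the usable capacity of 8 disks instead of 12.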