[SOLVED] How to remove dead ZIL/SLOG from ZFS?

yurtesen

Active Member
Nov 14, 2020
Hello,

I am testing ZFS and I created a raidz2 vdev with 2 log devices. For testing purposes I pulled the log devices, and ZFS shows them as FAULTED, which is all fine.
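Adding the log devices to the pool was roughly along these lines; the device names below are placeholders, not the actual disks from my setup:

Code:
# add two separate (striped, not mirrored) log devices to an existing pool
zpool add rpool log /dev/disk/by-id/<log-disk-1> /dev/disk/by-id/<log-disk-2>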

I then tried to remove them using `zpool remove rpool device1 device2`, but it caused an "rpool has encountered an uncorrectable I/O failure and has been suspended" error. I also tried `zpool offline rpool device1 device2`. These are the log devices which were pulled and marked as FAULTED.

How can one remove a log device if it does not exist anymore? Is that not possible?
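For comparison, removing a log device that is still healthy seems to work without any drama, something like (placeholder name):

Code:
# removing an ONLINE log device completes immediately
zpool remove rpool <log-device>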

One other interesting thing is that after a while the alloc goes down on the pulled log devices. One device's alloc goes to 0 while the other one stays at 128K. I am able to remove the device with 0 alloc without the I/O failure problem!
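I was watching the allocation drain with something like:

Code:
# re-run zpool iostat every few seconds and watch the alloc column of the log devices
watch -n 5 'zpool iostat -v rpool'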

If I restart Proxmox with the log devices pulled, it refuses to boot, with an error that a device is unavailable. I can force import the pool. Once the system boots, I see the drive as UNAVAIL and I am able to remove it.
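If I understand the man page correctly, `zpool import` has a flag for exactly this situation, so the forced import probably amounts to something like:

Code:
# -m allows importing a pool with a missing log device (recent transactions in the log are discarded)
# -f may additionally be needed if the pool looks like it is still in use
zpool import -f -m rpool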

So how does one deal with a broken ZIL/SLOG in a pool?

Thanks!
 
What's the output of `zpool status`?
 
Hi @aaron

It is entirely possible that I was too hasty and did not wait long enough for ZFS to figure out what is going on. After all, it was a test, so I pulled the drives and after a few seconds I was already trying to remove the log devices :)

When I pull the drives I see:

Code:
root@proxmox1:~# zpool status
  pool: rpool
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 0 days 01:54:58 with 0 errors on Fri Mar  5 23:23:03 2021
config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             DEGRADED     0     0     0
          raidz2-0                        ONLINE       0     0     0
            scsi-35000c5009ef10d73-part3  ONLINE       0     0     0
            scsi-35000c5009ef13a7b-part3  ONLINE       0     0     0
            scsi-35000c5009ee0739f-part3  ONLINE       0     0     0
            scsi-35000c5009ef12caf-part3  ONLINE       0     0     0
            scsi-35000c5009ef10dbb-part3  ONLINE       0     0     0
            scsi-35000c5009ef16347-part3  ONLINE       0     0     0
            scsi-35000c5009ee08af3-part3  ONLINE       0     0     0
        logs
          scsi-351402ec00188c3ce-part2    FAULTED      6     2     0  too many errors
          scsi-351402ec00188c3cf-part2    FAULTED      6     4     0  too many errors

errors: No known data errors
root@proxmox1:~#

`zpool iostat -v` shows:

Code:
root@proxmox1:~# zpool iostat -v
                                    capacity     operations     bandwidth
pool                              alloc   free   read  write   read  write
--------------------------------  -----  -----  -----  -----  -----  -----
rpool                              206G  11.3T  48.8K    231   285M  13.9M
  raidz2                           206G  11.3T  48.8K    173   285M  13.5M
    scsi-35000c5009ef10d73-part3      -      -  7.01K     26  40.6M  1.92M
    scsi-35000c5009ef13a7b-part3      -      -  6.95K     23  40.7M  1.92M
    scsi-35000c5009ee0739f-part3      -      -  6.97K     24  40.7M  1.92M
    scsi-35000c5009ef12caf-part3      -      -  6.99K     25  40.6M  1.92M
    scsi-35000c5009ef10dbb-part3      -      -  6.95K     23  40.7M  1.92M
    scsi-35000c5009ef16347-part3      -      -  7.00K     25  40.6M  1.92M
    scsi-35000c5009ee08af3-part3      -      -  6.90K     23  40.9M  1.92M
logs                                  -      -      -      -      -      -
  scsi-351402ec00188c3ce-part2        0    83G      0     37    985   307K
  scsi-351402ec00188c3cf-part2     216K  83.0G      0     44    985   362K
--------------------------------  -----  -----  -----  -----  -----  -----
root@proxmox1:~#

I think the issue is that last 216K which is not cleared. I found out that I can run `zpool clear` on the device to get that last bit sorted out...

Code:
root@proxmox1:~# zpool clear rpool scsi-351402ec00188c3cf-part2
root@proxmox1:~# zpool iostat -v
                                    capacity     operations     bandwidth
pool                              alloc   free   read  write   read  write
--------------------------------  -----  -----  -----  -----  -----  -----
rpool                              205G  11.3T  44.0K    210   257M  12.6M
  raidz2                           205G  11.3T  44.0K    157   257M  12.1M
    scsi-35000c5009ef10d73-part3      -      -  6.32K     24  36.6M  1.73M
    scsi-35000c5009ef13a7b-part3      -      -  6.27K     21  36.8M  1.74M
    scsi-35000c5009ee0739f-part3      -      -  6.29K     22  36.7M  1.74M
    scsi-35000c5009ef12caf-part3      -      -  6.30K     23  36.7M  1.74M
    scsi-35000c5009ef10dbb-part3      -      -  6.28K     21  36.8M  1.74M
    scsi-35000c5009ef16347-part3      -      -  6.31K     22  36.7M  1.74M
    scsi-35000c5009ee08af3-part3      -      -  6.23K     21  36.9M  1.74M
logs                                  -      -      -      -      -      -
  scsi-351402ec00188c3ce-part2        0    83G      0     32    855   267K
  scsi-351402ec00188c3cf-part2        0    83G      0     38    855   315K
--------------------------------  -----  -----  -----  -----  -----  -----
root@proxmox1:~#


The `zpool status` actually says:
Code:
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
But I found this in some Oracle documentation: https://docs.oracle.com/cd/E36784_01/html/E36835/gbbvf.html
Code:
status: One or more of the intent logs could not be read.
Waiting for administrator intervention to fix the faulted pool.
action: Either restore the affected device(s) and run 'zpool online',
or ignore the intent log records by running 'zpool clear'.

So apparently `zpool clear` can be used to flush the intent log. In any case, I then tried waiting quite a bit, and eventually the `alloc` went to zero by itself.
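For reference, the two options from that Oracle page map to roughly these commands on my pool (reusing the device name from above):

Code:
# option 1: the device is physically back, bring it online again
zpool online rpool scsi-351402ec00188c3cf-part2
# option 2: ignore/discard the unreplayed intent log records
zpool clear rpool scsi-351402ec00188c3cf-part2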

When the `alloc` is zero, removing the log device is just a matter of running `zpool remove`. But if `alloc` is NOT zero, it fails. So one has to be careful...
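So, to summarize the sequence that ended up working for me, with the device names from my pool:

Code:
# 1. check that alloc has drained to 0 on the faulted log devices
zpool iostat -v rpool
# 2. if some allocation is stuck, clear the device to drop the remaining intent log records
zpool clear rpool scsi-351402ec00188c3cf-part2
# 3. once alloc is 0, removing the log devices works
zpool remove rpool scsi-351402ec00188c3ce-part2 scsi-351402ec00188c3cf-part2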
 