[SOLVED] How to remove a dead ZIL/SLOG from ZFS?

yurtesen

Member
Nov 14, 2020
Hello,

I am testing ZFS and I created a raidz2 vdev with 2 log devices. For testing purposes I pulled the log devices, and ZFS shows them as FAULTED, which is all fine.

I then try to remove them using `zpool remove rpool device1 device2`, but it causes an "rpool has encountered an uncorrectable I/O failure and has been suspended" error. I also tried `zpool offline rpool device1 device2`. These are the log devices which were pulled and marked as FAULTED.

How can one remove a log device if it does not exist anymore? Is that not possible?

One other interesting thing is that after a while the alloc goes down on the pulled log devices, but while one device's alloc goes to 0, the other one stays at 128K. I am able to remove the device with 0 alloc without the I/O failure problem!

If I restart Proxmox with the log devices pulled, it refuses to boot with an error that the device is unavailable. I can force-import the pool; when the system boots I see the drive as UNAVAIL and I am able to remove it.
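
Roughly, the forced import looks something like this (a sketch; `-f` forces the import, and the `-m` option of `zpool import` is documented to allow importing when a log device is missing):

Code:
zpool import -f -m rpool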

So how does one deal with a broken ZIL/SLOG device in a pool?

Thanks!
 
What's the output of `zpool status`?
 
Hi @aaron

It is entirely possible that I was too hasty and did not wait long enough for ZFS to figure out what was going on. After all, it was a test, so I pulled the drives and was already trying to remove the log devices a few seconds later :)

When I pull the drives I see:

Code:
root@proxmox1:~# zpool status
  pool: rpool
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 0 days 01:54:58 with 0 errors on Fri Mar  5 23:23:03 2021
config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             DEGRADED     0     0     0
          raidz2-0                        ONLINE       0     0     0
            scsi-35000c5009ef10d73-part3  ONLINE       0     0     0
            scsi-35000c5009ef13a7b-part3  ONLINE       0     0     0
            scsi-35000c5009ee0739f-part3  ONLINE       0     0     0
            scsi-35000c5009ef12caf-part3  ONLINE       0     0     0
            scsi-35000c5009ef10dbb-part3  ONLINE       0     0     0
            scsi-35000c5009ef16347-part3  ONLINE       0     0     0
            scsi-35000c5009ee08af3-part3  ONLINE       0     0     0
        logs
          scsi-351402ec00188c3ce-part2    FAULTED      6     2     0  too many errors
          scsi-351402ec00188c3cf-part2    FAULTED      6     4     0  too many errors

errors: No known data errors
root@proxmox1:~#

`zpool iostat -v` shows:

Code:
root@proxmox1:~# zpool iostat -v
                                    capacity     operations     bandwidth
pool                              alloc   free   read  write   read  write
--------------------------------  -----  -----  -----  -----  -----  -----
rpool                              206G  11.3T  48.8K    231   285M  13.9M
  raidz2                           206G  11.3T  48.8K    173   285M  13.5M
    scsi-35000c5009ef10d73-part3      -      -  7.01K     26  40.6M  1.92M
    scsi-35000c5009ef13a7b-part3      -      -  6.95K     23  40.7M  1.92M
    scsi-35000c5009ee0739f-part3      -      -  6.97K     24  40.7M  1.92M
    scsi-35000c5009ef12caf-part3      -      -  6.99K     25  40.6M  1.92M
    scsi-35000c5009ef10dbb-part3      -      -  6.95K     23  40.7M  1.92M
    scsi-35000c5009ef16347-part3      -      -  7.00K     25  40.6M  1.92M
    scsi-35000c5009ee08af3-part3      -      -  6.90K     23  40.9M  1.92M
logs                                  -      -      -      -      -      -
  scsi-351402ec00188c3ce-part2        0    83G      0     37    985   307K
  scsi-351402ec00188c3cf-part2     216K  83.0G      0     44    985   362K
--------------------------------  -----  -----  -----  -----  -----  -----
root@proxmox1:~#

I think the issue is that last 216K of alloc which has not been cleared. I found out that I can run `zpool clear` on the device to get that last bit sorted out...

Code:
root@proxmox1:~# zpool clear rpool scsi-351402ec00188c3cf-part2
root@proxmox1:~# zpool iostat -v
                                    capacity     operations     bandwidth
pool                              alloc   free   read  write   read  write
--------------------------------  -----  -----  -----  -----  -----  -----
rpool                              205G  11.3T  44.0K    210   257M  12.6M
  raidz2                           205G  11.3T  44.0K    157   257M  12.1M
    scsi-35000c5009ef10d73-part3      -      -  6.32K     24  36.6M  1.73M
    scsi-35000c5009ef13a7b-part3      -      -  6.27K     21  36.8M  1.74M
    scsi-35000c5009ee0739f-part3      -      -  6.29K     22  36.7M  1.74M
    scsi-35000c5009ef12caf-part3      -      -  6.30K     23  36.7M  1.74M
    scsi-35000c5009ef10dbb-part3      -      -  6.28K     21  36.8M  1.74M
    scsi-35000c5009ef16347-part3      -      -  6.31K     22  36.7M  1.74M
    scsi-35000c5009ee08af3-part3      -      -  6.23K     21  36.9M  1.74M
logs                                  -      -      -      -      -      -
  scsi-351402ec00188c3ce-part2        0    83G      0     32    855   267K
  scsi-351402ec00188c3cf-part2        0    83G      0     38    855   315K
--------------------------------  -----  -----  -----  -----  -----  -----
root@proxmox1:~#


The `zpool status` actually says:
Code:
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
But I found this in some Oracle documentation: https://docs.oracle.com/cd/E36784_01/html/E36835/gbbvf.html
Code:
status: One or more of the intent logs could not be read.
Waiting for adminstrator intervention to fix the faulted pool.
action: Either restore the affected device(s) and run 'zpool online',
or ignore the intent log records by running 'zpool clear'.

So `zpool clear` can apparently be used to discard the outstanding intent log records. In any case, I now tried waiting quite a bit, and eventually the `alloc` went to zero by itself.
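
Per that documentation there are two ways out of this state. A rough sketch, using one of the log devices from my pool:

Code:
# Option 1: the device comes back - bring it online again
zpool online rpool scsi-351402ec00188c3cf-part2

# Option 2: the device is gone for good - ignore/discard the intent log records
zpool clear rpool scsi-351402ec00188c3cf-part2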

When the `alloc` is zero, removing the log device is simply a matter of running `zpool remove`. But if `alloc` is NOT zero, then it fails. So one has to be careful...
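
To sum up, once the intent log records are cleared, the removal is roughly this (a sketch using the device names from my pool):

Code:
# confirm that alloc on the pulled log devices has dropped to 0
zpool iostat -v rpool

# then the dead log devices can be removed from the pool
zpool remove rpool scsi-351402ec00188c3ce-part2 scsi-351402ec00188c3cf-part2

# verify that the logs section is gone
zpool status rpool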
 
