[SOLVED] How to fix a degraded ZFS pool?

louie1961

Active Member
Jan 25, 2023
We had a power outage, and as a result I decided to check my ZFS pools with "zpool status -v". Proxmox (or really ZFS) reports the following:

Code:
  pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 1.50M in 00:02:48 with 0 errors on Fri Jul 14 22:15:08 2023
config:

        NAME                                             STATE     READ WRITE CKSUM
        rpool                                            DEGRADED     0     0     0
          mirror-0                                       DEGRADED     0     0     0
            ata-TEAM_T2532TB_TPBF2209160130700352-part3  ONLINE       0     0     0
            ata-TEAM_T2532TB_TPBF2301040020500887-part3  FAULTED     12     0     0  too many errors

errors: No known data errors
root@pve:~#


Yet when I check SMART on the affected drive, I see no errors. How do I fix this? Based on the SMART report, I'm guessing I don't need to replace the drive? Do I wipe the drive and let it resilver? If so, how is that done?
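For reference, the SMART data can also be pulled from the shell with smartmontools; /dev/sdX below is a placeholder for the affected drive:

Bash:
# Overall health verdict, then full attributes and the drive's error log
smartctl -H /dev/sdX
smartctl -a /dev/sdX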

(attached screenshot: SMART report for the affected drive)
 
Well, I *THINK* I fixed it. I ran a zpool offline command for the affected partition, then a zpool online command, and the disk resilvered automatically (pretty damn fast too, I might add). Upon checking status, it still showed disk errors, so I ran a clear command and re-scrubbed the pool. Hopefully this is the correct way to fix it?
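For anyone finding this later, the sequence described above boils down to roughly this (a sketch; the device name is the faulted partition from the first post):

Bash:
# Take the faulted partition offline, then bring it back online;
# ZFS resilvers it automatically when it rejoins the mirror
zpool offline rpool ata-TEAM_T2532TB_TPBF2301040020500887-part3
zpool online rpool ata-TEAM_T2532TB_TPBF2301040020500887-part3

# Reset the recorded error counters, then verify with a fresh scrub
zpool clear rpool
zpool scrub rpool
zpool status -v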

Code:
root@pve:~# zpool status -v
  pool: VMstorage
 state: ONLINE
  scan: scrub repaired 0B in 00:02:35 with 0 errors on Fri Jul 14 22:42:54 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        VMstorage                                 ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            nvme-TEAM_TM8FP4001T_112302230071602  ONLINE       0     0     0
            nvme-TEAM_TM8FP4001T_112302230071598  ONLINE       0     0     0

errors: No known data errors
 
Should I replace the disk in question?
In my opinion: no.

Keep an eye on the pool's status, either automatically with a monitoring tool or by checking manually often enough. Replace the device if similar behavior repeats.
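One low-effort way to automate that on Proxmox is the ZFS event daemon (zfs-zed), which is normally installed already; a minimal sketch of /etc/zfs/zed.rc, assuming a working mail setup on the host:

Bash:
# /etc/zfs/zed.rc -- ZED mails on pool events (faults, resilvers, scrubs)
ZED_EMAIL_ADDR="root"           # where notifications go
ZED_NOTIFY_VERBOSE=1            # also mail on scrub/resilver completion
ZED_NOTIFY_INTERVAL_SECS=3600   # rate-limit repeated notifications

Restart the daemon afterwards with systemctl restart zfs-zed.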

On the other hand: having a spare disk in the closet is always a good idea.

Just my 2€¢...
 
Cool, thanks. So far, so good. I really need to get a UPS for my home lab. I am thinking of getting a smallish UPS and pairing it with a power station as backup, so that if the power goes out my server, router, and such can stay up for a few hours.
 
Sidenote:
In your initial post, you show your degraded rpool:
Code:
        rpool                                            DEGRADED     0     0     0
          mirror-0                                       DEGRADED     0     0     0
            ata-TEAM_T2532TB_TPBF2209160130700352-part3  ONLINE       0     0     0
            ata-TEAM_T2532TB_TPBF2301040020500887-part3  FAULTED     12     0     0  too many errors

errors: No known data errors
but in your second post (where you think you fixed it), you show a completely different pool:
Code:
        VMstorage                                 ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            nvme-TEAM_TM8FP4001T_112302230071602  ONLINE       0     0     0
            nvme-TEAM_TM8FP4001T_112302230071598  ONLINE       0     0     0

errors: No known data errors

Just a copy/paste mistake?
 
I decided to recheck this morning, and after another scrub I have errors again. My VMs are stored on the VMstorage pool on NVMe drives and seem fine. My system is installed on the rpool pool, along with backups, ISOs, etc. I am going to replace one of the SATA SSD drives. Here's the question (see my drive layout at the bottom): since this mirror contains more than just this pool (i.e., the BIOS boot and EFI partitions are there as well), what is the proper procedure for replacing this disk? It's in a hot-swap bay; sde is the drive that needs to be replaced.

(attached screenshot: drive layout)
Code:
root@pve:~# zpool status -v
  pool: VMstorage
 state: ONLINE
  scan: scrub repaired 0B in 00:02:37 with 0 errors on Sun Jul 16 09:31:23 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        VMstorage                                 ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            nvme-TEAM_TM8FP4001T_112302230071602  ONLINE       0     0     0
            nvme-TEAM_TM8FP4001T_112302230071598  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 164M in 00:02:51 with 0 errors on Sun Jul 16 09:45:18 2023
config:

        NAME                                             STATE     READ WRITE CKSUM
        rpool                                            DEGRADED     0     0     0
          mirror-0                                       DEGRADED     0     0     0
            ata-TEAM_T2532TB_TPBF2209160130700352-part3  ONLINE       0     0     0
            ata-TEAM_T2532TB_TPBF2301040020500887-part3  FAULTED     12     0 1.55K  too many errors

errors: No known data errors
 
If you installed in UEFI mode, systemd-boot should already be installed; at least for me it was/is.
Check with apt list systemd-boot; it should say [installed].
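For the actual swap, the usual flow for a bootable ZFS mirror on Proxmox looks roughly like this (a sketch based on the "ZFS on Linux" chapter of the admin guide; /dev/sdX is the remaining healthy disk and /dev/sdE the replacement, so adjust to your own layout):

Bash:
# 1. Copy the partition table from the healthy disk to the new one,
#    then randomize the new disk's GUIDs
sgdisk /dev/sdX -R /dev/sdE
sgdisk -G /dev/sdE

# 2. Rebuild the mirror onto the new disk's ZFS partition (part3 here)
zpool replace -f rpool ata-TEAM_T2532TB_TPBF2301040020500887-part3 /dev/sdE3

# 3. Make the new disk bootable by re-initializing its ESP (part2 here)
proxmox-boot-tool format /dev/sdE2
proxmox-boot-tool init /dev/sdE2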
 
I have the exact same issue: "zpool clear" clears everything, but after a few days it pops up again. What's also interesting is what the SSD itself looks like:

Bash:
# fdisk -l /dev/sda
Disk /dev/sda: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: CT1000MX500SSD1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: DEF9CF3F-4151-4686-9CC4-DCD2108DE232

Device       Start        End    Sectors  Size Type
/dev/sda1       34       2047       2014 1007K BIOS boot
/dev/sda2     2048    1050623    1048576  512M EFI System
/dev/sda3  1050624 1953525134 1952474511  931G Solaris /usr & Apple ZFS

Partition 1 does not start on physical sector boundary.
 
I took the drive in question out, formatted it, and tried it in a different system. SMART tests all came back clean and the drive worked perfectly in the other system. I then re-installed the drive into the same hot-swap bay, got it all reinstalled and resilvered without any issues, but I was still getting the same errors on scrub. So I popped the drive out and put it into a different hot-swap bay. I ran a scrub; it found and corrected a couple of errors. I then re-ran the scrub and checked status, and everything is normal. I am assuming it is either the SATA port or the cable supporting the original hot-swap bay.
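If you want to pin that down further, the kernel log usually shows link resets or bus errors tied to a specific ATA port; a quick check (using sde from the earlier post as the suspect drive):

Bash:
# Link resets / I/O errors naming an ata port point at cable or backplane
dmesg | grep -iE 'ata[0-9]|i/o error'

# The drive's own error log; if this stays clean while ZFS sees errors,
# the path (port, cable, bay) is the more likely culprit
smartctl -l error /dev/sde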
 
I need to replace a disk too.
Besides other disks, I had two NVMe drives with BIOS boot, UEFI, and ZFS partitions. One is operational, but I am not sure whether it is currently bootable. The second disk is gone; ZFS is in a degraded state.

I assume the system is bootable from the degraded ZFS pool. Now, how do I rebuild the BIOS boot and EFI partitions on the new disk?


Code:
[7886195.495022] nvme nvme3: I/O 280 (I/O Cmd) QID 2 timeout, aborting
[7886195.495039] nvme nvme3: I/O 10 (I/O Cmd) QID 3 timeout, aborting
[7886195.495044] nvme nvme3: I/O 11 (I/O Cmd) QID 3 timeout, aborting
[7886195.495050] nvme nvme3: I/O 410 (I/O Cmd) QID 9 timeout, aborting
[7886195.495056] nvme nvme3: I/O 54 (I/O Cmd) QID 13 timeout, aborting
[7886226.220661] nvme nvme3: I/O 27 QID 0 timeout, reset controller
[7886307.631830] nvme nvme3: Device not ready; aborting reset, CSTS=0x1
[7886307.654856] nvme nvme3: Abort status: 0x371
[7886307.654862] nvme nvme3: Abort status: 0x371
[7886307.654866] nvme nvme3: Abort status: 0x371
[7886307.654870] nvme nvme3: Abort status: 0x371
[7886307.654874] nvme nvme3: Abort status: 0x371
[7886327.676848] nvme nvme3: Device not ready; aborting reset, CSTS=0x1
[7886327.676958] nvme nvme3: Disabling device after reset failure: -19
[7886327.708868] I/O error, dev nvme3n1, sector 1347108840 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[7886327.708868] I/O error, dev nvme3n1, sector 693710520 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[7886327.708868] I/O error, dev nvme3n1, sector 1847867416 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 2
[7886327.708872] zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.00253845314156d2-part3 error=5 type=2 offset=354104995840 size=4096 flags=1572992
[7886327.708872] zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.00253845314156d2-part3 error=5 type=2 offset=945033326592 size=8192 flags=1572992
[7886327.708875] zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.00253845314156d2-part3 error=5 type=2 offset=688644935680 size=4096 flags=1572992
[7886327.708874] I/O error, dev nvme3n1, sector 1847875864 op 0x1:(WRITE) flags 0x0 phys_seg 7 prio class 2
[7886327.708875] I/O error, dev nvme3n1, sector 1855055440 op 0x1:(WRITE) flags 0x0 phys_seg 15 prio class 2
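Once the new disk is partitioned and the pool has resilvered (see the sgdisk/zpool replace sketch earlier in the thread), rebuilding the boot chain usually comes down to the following; the device names are placeholders, and which step applies depends on how the box actually boots:

Bash:
# Show how the system boots (uefi vs. grub) and which ESPs are registered
proxmox-boot-tool status

# Re-create and register the ESP on the new disk (partition 2 here)
proxmox-boot-tool format /dev/nvmeXn1p2
proxmox-boot-tool init /dev/nvmeXn1p2

# Older legacy/BIOS installs that don't use proxmox-boot-tool instead
# need GRUB written to the new disk directly:
# grub-install /dev/nvmeXn1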
 
