[SOLVED] How to fix a degraded ZFS pool?

louie1961

Active Member
Jan 25, 2023
We had a power outage, and as a result I decided to check my ZFS pools with "zpool status -v". Proxmox (or really ZFS) reports the following:

Code:
  pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 1.50M in 00:02:48 with 0 errors on Fri Jul 14 22:15:08 2023
config:

        NAME                                             STATE     READ WRITE CKSUM
        rpool                                            DEGRADED     0     0     0
          mirror-0                                       DEGRADED     0     0     0
            ata-TEAM_T2532TB_TPBF2209160130700352-part3  ONLINE       0     0     0
            ata-TEAM_T2532TB_TPBF2301040020500887-part3  FAULTED     12     0     0  too many errors

errors: No known data errors
root@pve:~#


Yet when I check SMART on the affected drive, I see no errors. How do I fix this? Based on the SMART report, I'm guessing I don't need to replace the drive? Do I wipe the drive and let it resilver? If so, how is that done?
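For reference, the SMART data can also be pulled from the shell with smartmontools; /dev/sdX below is a placeholder for the affected drive:

Bash:
# Overall health verdict, then full attributes and the drive's error log
smartctl -H /dev/sdX
smartctl -a /dev/sdX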

(attached screenshot: SMART report for the affected drive)
 
Well, I *THINK* I fixed it. I ran a zpool offline command for the affected partition, then a zpool online command, and the disk resilvered automatically (pretty damn fast too, I might add). Upon checking status, it still showed disk errors, so I ran a clear command and re-scrubbed the pool. Hopefully this is the correct way to fix it?
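For anyone finding this later, the sequence described above boils down to roughly this (a sketch; the device name is the faulted partition from the first post):

Bash:
# Take the faulted partition offline, then bring it back online;
# ZFS resilvers it automatically when it rejoins the mirror
zpool offline rpool ata-TEAM_T2532TB_TPBF2301040020500887-part3
zpool online rpool ata-TEAM_T2532TB_TPBF2301040020500887-part3

# Reset the recorded error counters, then verify with a fresh scrub
zpool clear rpool
zpool scrub rpool
zpool status -v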

Code:
root@pve:~# zpool status -v
  pool: VMstorage
 state: ONLINE
  scan: scrub repaired 0B in 00:02:35 with 0 errors on Fri Jul 14 22:42:54 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        VMstorage                                 ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            nvme-TEAM_TM8FP4001T_112302230071602  ONLINE       0     0     0
            nvme-TEAM_TM8FP4001T_112302230071598  ONLINE       0     0     0

errors: No known data errors
 
Should I replace the disk in question?
In my opinion: no.

Keep an eye on the pool's status, either automatically with a monitoring tool or by checking manually often enough. Replace the device if similar behavior repeats.
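One low-effort way to automate that on Proxmox is the ZFS event daemon (zfs-zed), which is normally installed already; a minimal sketch of /etc/zfs/zed.rc, assuming a working mail setup on the host:

Bash:
# /etc/zfs/zed.rc -- ZED mails on pool events (faults, resilvers, scrubs)
ZED_EMAIL_ADDR="root"           # where notifications go
ZED_NOTIFY_VERBOSE=1            # also mail on scrub/resilver completion
ZED_NOTIFY_INTERVAL_SECS=3600   # rate-limit repeated notifications

Restart the daemon afterwards with systemctl restart zfs-zed.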

On the other hand: having a spare disk in the closet is always a good idea.

Just my 2€¢...
 
Cool, thanks. So far, so good. I really need to get a UPS for my home lab. I am thinking of getting a smallish UPS and pairing it with a power station as backup, so that if the power goes out my server, router, and such can stay up for a few hours.
 
Sidenote:
In your initial post, you show your degraded rpool:
Code:
        rpool                                            DEGRADED     0     0     0
          mirror-0                                       DEGRADED     0     0     0
            ata-TEAM_T2532TB_TPBF2209160130700352-part3  ONLINE       0     0     0
            ata-TEAM_T2532TB_TPBF2301040020500887-part3  FAULTED     12     0     0  too many errors

errors: No known data errors
but in your second post (where you think you fixed it), you show a completely different pool:
Code:
        VMstorage                                 ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            nvme-TEAM_TM8FP4001T_112302230071602  ONLINE       0     0     0
            nvme-TEAM_TM8FP4001T_112302230071598  ONLINE       0     0     0

errors: No known data errors

Just a copy/paste mistake?
 
I decided to recheck this morning, and after another scrub I have errors again. My VMs are stored on the VMstorage pool on NVMe drives and seem fine. My system is installed on the rpool pool, along with backups, ISOs, etc. I am going to replace one of the SATA SSD drives. Here's the question (see my drive layout at the bottom): since this mirror contains more than just this pool (i.e., the BIOS boot and EFI partitions are there as well), what is the proper procedure for replacing this disk? It's in a hot-swap bay; sde is the drive that needs to be replaced.

(attached screenshot: drive layout)
Code:
root@pve:~# zpool status -v
  pool: VMstorage
 state: ONLINE
  scan: scrub repaired 0B in 00:02:37 with 0 errors on Sun Jul 16 09:31:23 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        VMstorage                                 ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            nvme-TEAM_TM8FP4001T_112302230071602  ONLINE       0     0     0
            nvme-TEAM_TM8FP4001T_112302230071598  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 164M in 00:02:51 with 0 errors on Sun Jul 16 09:45:18 2023
config:

        NAME                                             STATE     READ WRITE CKSUM
        rpool                                            DEGRADED     0     0     0
          mirror-0                                       DEGRADED     0     0     0
            ata-TEAM_T2532TB_TPBF2209160130700352-part3  ONLINE       0     0     0
            ata-TEAM_T2532TB_TPBF2301040020500887-part3  FAULTED     12     0 1.55K  too many errors

errors: No known data errors
 
If you installed in UEFI mode, systemd-boot should already be installed; at least for me it was/is.
Check with apt list systemd-boot; it should say [installed].
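For the actual swap, the usual flow for a bootable ZFS mirror on Proxmox looks roughly like this (a sketch based on the "ZFS on Linux" chapter of the admin guide; /dev/sdX is the remaining healthy disk and /dev/sdE the replacement, so adjust to your own layout):

Bash:
# 1. Copy the partition table from the healthy disk to the new one,
#    then randomize the new disk's GUIDs
sgdisk /dev/sdX -R /dev/sdE
sgdisk -G /dev/sdE

# 2. Rebuild the mirror onto the new disk's ZFS partition (part3 here)
zpool replace -f rpool ata-TEAM_T2532TB_TPBF2301040020500887-part3 /dev/sdE3

# 3. Make the new disk bootable by re-initializing its ESP (part2 here)
proxmox-boot-tool format /dev/sdE2
proxmox-boot-tool init /dev/sdE2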
 
I have the exact same issue: "zpool clear" clears everything, but after a few days it pops up again. What's also interesting is what the SSD itself looks like:

Bash:
# fdisk -l /dev/sda
Disk /dev/sda: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: CT1000MX500SSD1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: DEF9CF3F-4151-4686-9CC4-DCD2108DE232

Device       Start        End    Sectors  Size Type
/dev/sda1       34       2047       2014 1007K BIOS boot
/dev/sda2     2048    1050623    1048576  512M EFI System
/dev/sda3  1050624 1953525134 1952474511  931G Solaris /usr & Apple ZFS

Partition 1 does not start on physical sector boundary.
 
I took the drive in question out, formatted it, and tried it in a different system. SMART tests all came back clean and the drive worked perfectly in the other system. I then re-installed the drive into the same hot-swap bay, got it all reinstalled and resilvered without any issues, but I was still getting the same errors on scrub. So I popped the drive out and put it into a different hot-swap bay. I ran a scrub; it found and corrected a couple of errors. I then re-ran the scrub and checked status, and everything is normal. I am assuming it is either the SATA port or the cable supporting the original hot-swap bay.
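If you want to pin that down further, the kernel log usually shows link resets or bus errors tied to a specific ATA port; a quick check (using sde from the earlier post as the suspect drive):

Bash:
# Link resets / I/O errors naming an ata port point at cable or backplane
dmesg | grep -iE 'ata[0-9]|i/o error'

# The drive's own error log; if this stays clean while ZFS sees errors,
# the path (port, cable, bay) is the more likely culprit
smartctl -l error /dev/sde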
 
I need to replace a disk too.
Besides other disks, I had two NVMe drives with BIOS boot, UEFI, and ZFS partitions. One is operational, but I am not sure whether it is currently bootable. The second disk is gone; ZFS is in a degraded state.

I assume the system is bootable from the degraded ZFS pool. Now, how do I rebuild the BIOS boot and EFI partitions on the new disk?


Code:
[7886195.495022] nvme nvme3: I/O 280 (I/O Cmd) QID 2 timeout, aborting
[7886195.495039] nvme nvme3: I/O 10 (I/O Cmd) QID 3 timeout, aborting
[7886195.495044] nvme nvme3: I/O 11 (I/O Cmd) QID 3 timeout, aborting
[7886195.495050] nvme nvme3: I/O 410 (I/O Cmd) QID 9 timeout, aborting
[7886195.495056] nvme nvme3: I/O 54 (I/O Cmd) QID 13 timeout, aborting
[7886226.220661] nvme nvme3: I/O 27 QID 0 timeout, reset controller
[7886307.631830] nvme nvme3: Device not ready; aborting reset, CSTS=0x1
[7886307.654856] nvme nvme3: Abort status: 0x371
[7886307.654862] nvme nvme3: Abort status: 0x371
[7886307.654866] nvme nvme3: Abort status: 0x371
[7886307.654870] nvme nvme3: Abort status: 0x371
[7886307.654874] nvme nvme3: Abort status: 0x371
[7886327.676848] nvme nvme3: Device not ready; aborting reset, CSTS=0x1
[7886327.676958] nvme nvme3: Disabling device after reset failure: -19
[7886327.708868] I/O error, dev nvme3n1, sector 1347108840 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[7886327.708868] I/O error, dev nvme3n1, sector 693710520 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[7886327.708868] I/O error, dev nvme3n1, sector 1847867416 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 2
[7886327.708872] zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.00253845314156d2-part3 error=5 type=2 offset=354104995840 size=4096 flags=1572992
[7886327.708872] zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.00253845314156d2-part3 error=5 type=2 offset=945033326592 size=8192 flags=1572992
[7886327.708875] zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.00253845314156d2-part3 error=5 type=2 offset=688644935680 size=4096 flags=1572992
[7886327.708874] I/O error, dev nvme3n1, sector 1847875864 op 0x1:(WRITE) flags 0x0 phys_seg 7 prio class 2
[7886327.708875] I/O error, dev nvme3n1, sector 1855055440 op 0x1:(WRITE) flags 0x0 phys_seg 15 prio class 2
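Once the new disk is partitioned and the pool has resilvered (see the sgdisk/zpool replace sketch earlier in the thread), rebuilding the boot chain usually comes down to the following; the device names are placeholders, and which step applies depends on how the box actually boots:

Bash:
# Show how the system boots (uefi vs. grub) and which ESPs are registered
proxmox-boot-tool status

# Re-create and register the ESP on the new disk (partition 2 here)
proxmox-boot-tool format /dev/nvmeXn1p2
proxmox-boot-tool init /dev/nvmeXn1p2

# Older legacy/BIOS installs that don't use proxmox-boot-tool instead
# need GRUB written to the new disk directly:
# grub-install /dev/nvmeXn1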
 
