ZFS issues

AlBundy

Member
Sep 26, 2017
Hi All,

I have run into a ZFS issue twice in the last two weeks, where ZFS reports an I/O fault.

"The number of I/O errors associated with a ZFS device exceeded
acceptable levels. ZFS has marked the device as faulted.

impact: Fault tolerance of the pool may be compromised.
eid: 42
class: statechange
state: FAULTED"

This Sunday it ran a scrub by itself and I got the following email:
"
ZFS has finished a scrub:

eid: 26
class: scrub_finish
host: new-stan
time: 2022-05-08 00:32:04+0200
pool: rpool
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
scan: scrub repaired 0B in 00:08:03 with 0 errors on Sun May 8 00:32:04 2022
config:

NAME                                                 STATE   READ WRITE CKSUM
rpool                                                ONLINE     0     0     0
  mirror-0                                           ONLINE     0     0     0
    nvme-eui.00000000000000008ce38ee20dba8601-part3  ONLINE     0    16     0
    nvme-eui.00000000000000008ce38ee20dba8801-part3  ONLINE     0     1     0

errors: No known data errors"
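
For reference, the "action" line from that email boils down to these commands (just a sketch; pool and device names are taken from the output above, and the replacement disk path is only a placeholder):

zpool status -v rpool      # per-device READ/WRITE/CKSUM counters and any affected files
zpool clear rpool          # reset the error counters once the cause is understood
# only if a drive really has to be swapped out:
zpool replace rpool nvme-eui.00000000000000008ce38ee20dba8801-part3 /dev/disk/by-id/<new-disk>-part3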

After that, both drives were tested and they appear to be in perfect shape (they are a month old). I cleared the pool and ran a manual scrub today, which finished without errors.
The other weird thing is that the usage reported by nvme list differs between the two drives: one shows 1.88 TB used and the other 1.95 TB.

Do you guys have some tips on how to further diagnose this issue?
 
Which NVMes do you use? (Vendor and Model)
Sometimes a firmware update can help.
 
The models and firmware versions are as follows:

Node          SN   Model                Namespace  Usage              Format       FW Rev
/dev/nvme0n1  XXX  KIOXIA KCD71RUG3T84  1          1.90 TB / 3.84 TB  512 B + 0 B  0104
/dev/nvme1n1  XXX  KIOXIA KCD71RUG3T84  1          1.96 TB / 3.84 TB  512 B + 0 B  0104
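
For completeness, the firmware revision can also be read straight from the controllers, roughly like this (a sketch):

nvme fw-log /dev/nvme0                   # firmware slot log page (active slot + revisions)
nvme id-ctrl /dev/nvme0 | grep -i '^fr'  # firmware revision field from Identify Controller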
 
Check for firmware updates. If there's one available, maybe it fixes the issue.
If not, I'd try to replace them. Write errors shouldn't be happening, especially on new drives.

Do you have other disks available you could try? Just to see if those also have write errors.
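
Before swapping hardware, it might also be worth correlating the write errors with the kernel log and the drives' own error counters, roughly like this (a sketch; device paths assumed from your nvme list output):

dmesg -T | grep -i nvme        # controller resets / timeouts around the time of the fault
smartctl -a /dev/nvme0         # health and error counters (smartmontools understands NVMe)
nvme smart-log /dev/nvme0n1    # media errors, error log entries, temperature
zpool events -v                # timestamped ZED events for the READ/WRITE/CKSUM errors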
 
I asked for firmware updates, but there aren't any.

Last time, I did a clear again and the hosting provider tested the drives and said they are OK. Yesterday, I got the same error again.
It only seems to happen on Sundays. Any thoughts on what I can do to solve it?
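
Since it always seems to hit on a Sunday, I also want to check when the automatic scrub runs. On Debian-based systems the zfsutils-linux package usually ships a monthly scrub cron job (a sketch, paths may differ per version):

cat /etc/cron.d/zfsutils-linux     # default: scrub pools on the second Sunday of the month
systemctl list-timers '*scrub*'    # in case a systemd timer is used instead
zpool history rpool | grep scrub   # when scrubs actually ran on this pool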

My other servers have 0 errors after 1.5 years, while this server has errors on both drives:
rpool                                                DEGRADED   0     0     0
  mirror-0                                           DEGRADED   0     2     0
    nvme-eui.00000000000000008ce38ee20dba8601-part3  FAULTED    0    35     0  too many errors
    nvme-eui.00000000000000008ce38ee20dba8801-part3  ONLINE     0     3     0

So it's not just one of the two drives, which makes me think the drives are not fully compatible, or that it's a motherboard issue or a software bug.
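
For now I can only clear the faulted device again to get the mirror back to ONLINE (a sketch; the device name is the faulted one from the status above):

zpool clear rpool nvme-eui.00000000000000008ce38ee20dba8601-part3
zpool status rpool      # the device should come back ONLINE after a short resilver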
 
