ZFS issues

AlBundy

Member
Sep 26, 2017
Hi All,

I have run into a ZFS issue twice in the last two weeks, where ZFS reports an I/O fault.

"The number of I/O errors associated with a ZFS device exceeded
acceptable levels. ZFS has marked the device as faulted.

impact: Fault tolerance of the pool may be compromised.
eid: 42
class: statechange
state: FAULTED"

This Sunday it ran a scrub by itself and I got the following email:
"
ZFS has finished a scrub:

eid: 26
class: scrub_finish
host: new-stan
time: 2022-05-08 00:32:04+0200
pool: rpool
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
scan: scrub repaired 0B in 00:08:03 with 0 errors on Sun May 8 00:32:04 2022
config:

NAME                                                 STATE   READ WRITE CKSUM
rpool                                                ONLINE     0     0     0
  mirror-0                                           ONLINE     0     0     0
    nvme-eui.00000000000000008ce38ee20dba8601-part3  ONLINE     0    16     0
    nvme-eui.00000000000000008ce38ee20dba8801-part3  ONLINE     0     1     0

errors: No known data errors"
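
For reference, the "action" line from that email boils down to these commands (just a sketch; pool and device names are taken from the output above, and the replacement disk path is only a placeholder):

zpool status -v rpool      # per-device READ/WRITE/CKSUM counters and any affected files
zpool clear rpool          # reset the error counters once the cause is understood
# only if a drive really has to be swapped out:
zpool replace rpool nvme-eui.00000000000000008ce38ee20dba8801-part3 /dev/disk/by-id/<new-disk>-part3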

After that, both drives were tested and they appear to be in perfect shape (they are a month old). I cleared the pool and ran a manual scrub today, which finished without errors.
The other weird thing is that the usage reported by nvme list differs between the two drives: one shows 1.88 TB used and the other 1.95 TB.

Do you guys have some tips on how to further diagnose this issue?
 
Which NVMes do you use? (Vendor and Model)
Sometimes a firmware update can help.
 
The models and firmware versions are as follows:

Node          SN   Model                Namespace  Usage              Format       FW Rev
/dev/nvme0n1  XXX  KIOXIA KCD71RUG3T84  1          1.90 TB / 3.84 TB  512 B + 0 B  0104
/dev/nvme1n1  XXX  KIOXIA KCD71RUG3T84  1          1.96 TB / 3.84 TB  512 B + 0 B  0104
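
For completeness, the firmware revision can also be read straight from the controllers, roughly like this (a sketch):

nvme fw-log /dev/nvme0                   # firmware slot log page (active slot + revisions)
nvme id-ctrl /dev/nvme0 | grep -i '^fr'  # firmware revision field from Identify Controller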
 
Check for firmware updates. If there's one available, maybe it fixes the issue.
If not, I'd try to replace them. Write errors shouldn't be happening, especially on new drives.

Do you have other disks available you could try? Just to see if those also have write errors.
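
Before swapping hardware, it might also be worth correlating the write errors with the kernel log and the drives' own error counters, roughly like this (a sketch; device paths assumed from your nvme list output):

dmesg -T | grep -i nvme        # controller resets / timeouts around the time of the fault
smartctl -a /dev/nvme0         # health and error counters (smartmontools understands NVMe)
nvme smart-log /dev/nvme0n1    # media errors, error log entries, temperature
zpool events -v                # timestamped ZED events for the READ/WRITE/CKSUM errors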
 
I asked for firmware updates, but there aren't any.

Last time, I did a clear again and the hosting provider tested the drives and said they are OK. Yesterday, I got the same error again.
It only seems to happen on Sundays. Any thoughts on what I can do to solve it?
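
Since it always seems to hit on a Sunday, I also want to check when the automatic scrub runs. On Debian-based systems the zfsutils-linux package usually ships a monthly scrub cron job (a sketch, paths may differ per version):

cat /etc/cron.d/zfsutils-linux     # default: scrub pools on the second Sunday of the month
systemctl list-timers '*scrub*'    # in case a systemd timer is used instead
zpool history rpool | grep scrub   # when scrubs actually ran on this pool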

My other servers have 0 errors after 1.5 years, while this server has errors on both drives:
rpool                                                DEGRADED   0     0     0
  mirror-0                                           DEGRADED   0     2     0
    nvme-eui.00000000000000008ce38ee20dba8601-part3  FAULTED    0    35     0  too many errors
    nvme-eui.00000000000000008ce38ee20dba8801-part3  ONLINE     0     3     0

So it's not just one of the two drives, which makes me think the drives are not fully compatible, or that it's a motherboard issue or a software bug.
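
For now I can only clear the faulted device again to get the mirror back to ONLINE (a sketch; the device name is the faulted one from the status above):

zpool clear rpool nvme-eui.00000000000000008ce38ee20dba8601-part3
zpool status rpool      # the device should come back ONLINE after a short resilver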
 
