Making sense of NVMe zfs and SMART errors

jackdaw

Member
Jun 23, 2022
Hope you’re all well.

I have a question that’s been wrecking my head for months now.
I have a 3 node cluster (Dell PowerEdge R7525) with the following drive configuration:
Node 1: 10x 2TB NVMe KINGSTON SKC2500M82000G in raidz2
Node 2: 10x 2TB NVMe KINGSTON SKC2500M82000G in raidz2
Node 3: 8x 4TB Crucial MX500 in raidz2

A little while ago I got an email about a zfs scrub_finish for the pool on Node 2. The issue was with two drives.
The error mentioned: One or more devices are faulted in response to persistent errors.

The array became degraded but still accessible. I moved the VMs off of it immediately. SMART didn’t show any issues for the two problematic drives.

I then left it as I was going to remove the two drives, and rebuild the pool when I upgrade the node to PVE 9. I didn’t have any spare drives, so I planned to create a new smaller pool.

Today I did just that, and as soon as I upgraded to PVE 9, the degraded state went away, and I got an email saying that a zfs resilver finished with 0 errors. The pool changed to online and the drives were fine. I then ran: zpool upgrade as suggested.

But then I got an email saying the two drives (the ones that were marked as faulty prior to upgrading to PVE 9), had the following error:
Media and Data Integrity Errors changed from 0 to 12.

The zpool was still online, and I did a scrub that reported 0 errors. I then ran short SMART tests, and they did not report any errors.

I then upgraded Node 1 to PVE 9. After the upgrade I received an email saying one of the drives reported an error: Media and Data Integrity Errors changed from 0 to 16

The pool is still online, and the drive in question has passed a short SMART test.

Would anyone have any idea what could be happening? If the drives are bad, that’s not a problem, I can just remove them and create a new smaller pool. But I’m a bit hesitant to do that if these are just false positives. Has anyone encountered this? Could this just be some strange behaviour related to the Dell servers? I know the drives should be enterprise class, but we could not afford them at the time, and even less so now.

Thank you so much!
 
I suspect that the upgrade changed something about the SMART monitoring, either resetting the tracked stats or starting to track stats that weren't tracked before. That would explain the jump from 0 to 12 or 0 to 16, and it's likely that you had those 12 or 16 errors for a while.

The resilver probably didn't take long to fix the raid because you had moved all the data off to other nodes. Unlike hardware raid, ZFS knows which parts of the raid are free space, and can skip rebuilding those empty parts. For the same reason, the scrub doesn't tell you anything about the health of the empty parts of the disks. Scrub reads the data from the disks and verifies that it matches the checksums recorded when that data was written, but empty space isn't checked.
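To illustrate why a clean scrub says nothing about free space, here is a toy Python model. It is not how ZFS is implemented: the `disk` and `allocated` dicts and the `scrub` function are made up for illustration, and SHA-256 simply stands in for whatever checksum the pool actually uses.

```python
import hashlib

def checksum(data: bytes) -> str:
    # Stand-in checksum for the toy model.
    return hashlib.sha256(data).hexdigest()

def scrub(disk: dict, allocated: dict) -> list:
    """Return block numbers whose contents no longer match the recorded checksum.

    Only blocks listed in `allocated` are read; free blocks are never touched.
    """
    errors = []
    for block, expected in allocated.items():
        if checksum(disk[block]) != expected:
            errors.append(block)
    return sorted(errors)

# Hypothetical 4-block disk: blocks 0 and 1 hold data, 2 and 3 are free.
disk = {0: b"vm-disk-a", 1: b"vm-disk-b", 2: b"", 3: b""}
allocated = {0: checksum(disk[0]), 1: checksum(disk[1])}

# Damage in free space: the scrub never reads block 3, so it finds nothing.
disk[3] = b"garbage from a bad cell"
free_space_result = scrub(disk, allocated)

# Damage in allocated data: the checksum mismatch is caught.
disk[1] = b"vm-disk-X"
allocated_result = scrub(disk, allocated)

print(free_space_result, allocated_result)
```

With the VMs moved off, most of the pool looks like block 3 in this model, which is why the scrub coming back clean is weaker evidence than it sounds.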

The short SMART test passing is a good sign, and suggests you might be able to keep using these disks for a while. I would first do the full long test on all of the disks in that first system, and if that passes that's an even better sign, since that should test even the empty parts of the drives.
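If it helps, this small Python sketch just prints the smartctl commands for starting the long test on every drive and for reading the self-test log afterwards. The device names are an assumption (ten controllers named `/dev/nvme0` to `/dev/nvme9`), so check `ls /dev/nvme*` on the node first; `-t long` and `-l selftest` are the standard smartctl options, though NVMe self-test support depends on the smartmontools version installed.

```python
import shlex

def long_test_commands(devices):
    """Build smartctl invocations to start an extended self-test on each
    device and, later, to read back each device's self-test log."""
    start = [["smartctl", "-t", "long", dev] for dev in devices]
    check = [["smartctl", "-l", "selftest", dev] for dev in devices]
    return start, check

# Assumed device names; adjust to match the node.
devices = [f"/dev/nvme{i}" for i in range(10)]
start, check = long_test_commands(devices)

# Print the commands rather than running them, so they can be reviewed first.
for cmd in start + check:
    print(shlex.join(cmd))
```

The long test runs in the background on the drive itself, so the log only shows the result once the test has finished.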

SMART is a weird tech. Different manufacturers (or indeed, different models from the same manufacturer) seem to implement it differently. I have a drive in use with ~20 (I don't remember the exact number) of those "Media and Data Integrity Errors", and it has had them for more than a year. So far there's been no data loss, and that number hasn't gone up at all. So either the drive or ZFS recovered from those 20-ish problems. I don't know exactly what event caused them. Was it a bad sector that was swapped out for a reserve one, or just a read that failed but worked when retried? I don't know, but the drive passed a long self-test, and the number hasn't gone up since I started watching it, so I've stopped worrying about it until the number does go up.

It's hard to suggest you treat it as I did. I chose that because I don't have strict SLAs, and do have strict budgets. If that drive did fail and I had to restore a few VMs from backups, it would be fine as long as I got that started within an hour. So I can't say that I'd do the same if I were in your place, but my experience suggests that as long as you keep a watch out for notifications of that number starting to climb again, it's probably fine to keep using those drives.
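For keeping an eye on that number yourself, a minimal Python sketch of the idea is below. It assumes the JSON from `smartctl -j -a <device>` carries the count under `nvme_smart_health_information_log` / `media_errors`, which is how I understand recent smartmontools versions report it; the sample JSON here is trimmed and made up, and `media_errors` and `climbed` are just illustrative helper names.

```python
import json

def media_errors(smartctl_json: str) -> int:
    """Extract the NVMe media error count from smartctl JSON output.

    Assumes the key layout described above; raises KeyError otherwise.
    """
    report = json.loads(smartctl_json)
    return report["nvme_smart_health_information_log"]["media_errors"]

def climbed(baseline: int, current: int) -> bool:
    """True only if the count has gone up since the recorded baseline."""
    return current > baseline

# Made-up sample standing in for real `smartctl -j -a /dev/nvme0` output.
sample = '{"nvme_smart_health_information_log": {"media_errors": 12}}'

baseline = 12                     # the count noted down when first seen
current = media_errors(sample)    # the count read today
print("climbing" if climbed(baseline, current) else "stable")
```

The smartd emails already flag changes, so this is mostly useful for writing down a baseline per drive and comparing against it after a long test or a few weeks of use.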