Zpool Errors - Help Please

Hi all,

Not off to a great start. I've had my PVE host up for a couple of weeks now from a fresh install and setup, and I'm already seeing zpool errors. :(

I have a striped mirror zpool using 4x Samsung M.2 NVMe drives. All were purchased brand new a matter of weeks ago. I configured them into a "RAID10" using the PVE web interface, and this is what I am using for VM disk storage.
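
For reference, as far as I understand it the "RAID10" layout the web UI created is a stripe of two mirrors, so roughly something like this on the command line (pool and device names below are just placeholders, PVE itself references the drives by their /dev/disk/by-id paths):

Code:
# Rough CLI equivalent of the striped-mirror ("RAID10") pool -
# pool name and device names are placeholders only.
zpool create -o ashift=12 vmdata \
    mirror /dev/nvme0n1 /dev/nvme1n1 \
    mirror /dev/nvme2n1 /dev/nvme3n1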

Yesterday I got an email saying that one of the drives is showing too many errors.

I'm not sure what to make of it. Are there any diagnostics I can do?

Any help and advice would be greatly appreciated.

Thanks,

FS

[Attachment: Screenshot 2021-02-09 at 12.04.49.png]
 
Hi,

All were purchased brand new a matter of weeks ago.
Did those drives get "burned in"? Seems like an early "infant mortality" failure.

Components have a significantly higher chance of failing when brand new; the failure rate rises again once they get quite old, and in between is, so to speak, the "good stable age".
https://en.wikipedia.org/wiki/Bathtub_curve

I'm not sure what to make of it. Are there any diagnostics I can do?
Check its SMART health (you can do so through the PVE web interface, Node -> Disks).
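
Or from a shell on the node, for example like this (the device name is a placeholder, use whatever the affected drive actually is):

Code:
# Read out the SMART / NVMe health data of the suspect drive
# (/dev/nvme0 is a placeholder).
smartctl -a /dev/nvme0

# The kernel log is also worth a look for NVMe / IO errors:
dmesg | grep -i nvme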

But in any case, I'd replace that drive in the pool for now to be sure:
https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_zfs_change_failed_dev
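
In short it boils down to something like this (pool and device names are placeholders; the linked chapter has the details, e.g. for pools built on partitions rather than whole disks):

Code:
# See which device is FAULTED and how many errors ZFS has counted.
zpool status -v <pool>

# Swap the faulted device for the new one and let ZFS resilver.
zpool replace -f <pool> <old-device> <new-device>

# Watch the resilver progress until it finishes.
zpool status -v <pool>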

Then I'd put it in a test bed and run some standard IO tests on it separately - for example, write a big image to it with dd, read it back, and compare checksums (and check the kernel log for IO errors).
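
Something along these lines would do, assuming the drive shows up as /dev/nvme0n1 in the test bed (destructive, it overwrites the drive):

Code:
# WARNING: destructive - this overwrites the whole test drive.
# /dev/nvme0n1 and test.img are placeholders.

# Note the checksum of a known image, then write it to the drive.
sha256sum test.img
dd if=test.img of=/dev/nvme0n1 bs=1M status=progress conv=fsync

# Read the same number of bytes back and compare the checksum.
dd if=/dev/nvme0n1 bs=1M count=$(stat -c%s test.img) iflag=count_bytes | sha256sum

# Check the kernel log for IO errors afterwards.
dmesg | grep -iE 'nvme|i/o error'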

Note also that those are consumer-grade SSDs; a datacenter-grade SSD may be a better fit for a server (depending on usage).
 
Thanks for the info and advice. I will take a look later.

No, I didn't burn them in. I wasn't aware that was a thing, to be honest. I'm kind of learning the ropes as I go. I just installed them and then created the zpool with them. I will have to look into how I do an RMA on the faulty one.

The server is in a home office/lab environment. Nothing mission-critical.

Thanks,

FS
 
No, I didn't burn them in. I wasn't aware that was a thing, to be honest.
No worries, lots of people aren't. It's not a must (this kind of failure has a rather low probability of happening in the first place), and it's also not a 100% guarantee that all drives will stay fine for years, but it can help the bad apples float up quickly and thus make the return/exchange procedure easier.
It also avoids putting a bad drive into a production setup in the first place, so the cost-to-gain ratio of checking first is quite good.
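
For future drives, a burn-in can be as simple as one full destructive write/verify pass plus a SMART self-test afterwards, for example (device name is a placeholder, and this wipes the drive):

Code:
# WARNING: destructive burn-in - wipes the drive completely.
# /dev/nvme0n1 is a placeholder.

# One full write + read-back verification pass over the whole drive.
badblocks -wsv -b 4096 /dev/nvme0n1

# Long SMART self-test (if the drive supports it), then review results.
smartctl -t long /dev/nvme0n1
smartctl -a /dev/nvme0n1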

The server is in a home office/lab environment. Nothing mission-critical.
Ack, then those drive models are definitely fine.
 
I’ve got the ball rolling on an RMA for the faulty drive.

This is my first drive failure with ZFS. Is there anything I need to do before I remove the hardware? Do I need to offline the drive or anything like that before taking it out?
 
