Zpool Errors - Help Please

Hi all,

Not off to a great start. I've had my PVE host up for a couple of weeks now from a fresh install and setup, and I'm already seeing zpool errors. :(

I have a striped mirror zpool using 4x Samsung M.2 NVMe drives. All were purchased brand new a matter of weeks ago. I configured them into a "RAID10" using the PVE web interface, and this is what I am using for VM disk storage.
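
For reference, as far as I understand it the "RAID10" layout the web UI created is a stripe of two mirrors, so roughly something like this on the command line (pool and device names below are just placeholders, PVE itself references the drives by their /dev/disk/by-id paths):

Code:
# Rough CLI equivalent of the striped-mirror ("RAID10") pool -
# pool name and device names are placeholders only.
zpool create -o ashift=12 vmdata \
    mirror /dev/nvme0n1 /dev/nvme1n1 \
    mirror /dev/nvme2n1 /dev/nvme3n1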

Yesterday I got an email saying that one of the drives is showing too many errors.

I'm not sure what to make of it. Are there any diagnostics I can do?

Any help and advice would be greatly appreciated.

Thanks,

FS

[Attachment: Screenshot 2021-02-09 at 12.04.49.png]
 
Hi,

All were purchased brand new a matter of weeks ago.
Did those drives get "burned in"? Seems like an early "infant mortality" failure.

Components have a significantly higher chance of failing when brand new; the failure rate rises again once they get quite old, and in between is, so to speak, the "good stable age".
https://en.wikipedia.org/wiki/Bathtub_curve

I'm not sure what to make of it. Are there any diagnostics I can do?
Check its SMART health (you can do so through the PVE web interface, Node -> Disks).
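
Or from a shell on the node, for example like this (the device name is a placeholder, use whatever the affected drive actually is):

Code:
# Read out the SMART / NVMe health data of the suspect drive
# (/dev/nvme0 is a placeholder).
smartctl -a /dev/nvme0

# The kernel log is also worth a look for NVMe / IO errors:
dmesg | grep -i nvme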

But in any case, I'd replace that drive in the pool for now to be sure:
https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_zfs_change_failed_dev
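
In short it boils down to something like this (pool and device names are placeholders; the linked chapter has the details, e.g. for pools built on partitions rather than whole disks):

Code:
# See which device is FAULTED and how many errors ZFS has counted.
zpool status -v <pool>

# Swap the faulted device for the new one and let ZFS resilver.
zpool replace -f <pool> <old-device> <new-device>

# Watch the resilver progress until it finishes.
zpool status -v <pool>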

Then I'd put it in a test bed and run some standard IO tests on it separately - for example, write a big image to it with dd, read it back, and compare checksums (and check the kernel log for IO errors).
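
Something along these lines would do, assuming the drive shows up as /dev/nvme0n1 in the test bed (destructive, it overwrites the drive):

Code:
# WARNING: destructive - this overwrites the whole test drive.
# /dev/nvme0n1 and test.img are placeholders.

# Note the checksum of a known image, then write it to the drive.
sha256sum test.img
dd if=test.img of=/dev/nvme0n1 bs=1M status=progress conv=fsync

# Read the same number of bytes back and compare the checksum.
dd if=/dev/nvme0n1 bs=1M count=$(stat -c%s test.img) iflag=count_bytes | sha256sum

# Check the kernel log for IO errors afterwards.
dmesg | grep -iE 'nvme|i/o error'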

Note also that those are consumer-grade SSDs; a datacenter-grade SSD may be a better fit for a server (depending on usage).
 
Thanks for the info and advice. I will take a look later.

No, I didn't burn them in. I wasn't aware that was a thing, to be honest. I'm kind of learning the ropes as I go. I just installed them and then created the zpool with them. I will have to look into how I do an RMA on the faulty one.

The server is in a home office/lab environment. Nothing mission-critical.

Thanks,

FS
 
No, I didn't burn them in. I wasn't aware that was a thing, to be honest.
No worries, lots of people aren't. It's not a must (this kind of failure has a rather low probability of happening in the first place), and it's also not a 100% guarantee that all drives will stay fine for years, but it can help the bad apples float up quickly and thus make the return/exchange procedure easier.
It also avoids putting a bad drive into a production setup in the first place, so the cost-to-gain ratio of checking first is quite good.
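
For future drives, a burn-in can be as simple as one full destructive write/verify pass plus a SMART self-test afterwards, for example (device name is a placeholder, and this wipes the drive):

Code:
# WARNING: destructive burn-in - wipes the drive completely.
# /dev/nvme0n1 is a placeholder.

# One full write + read-back verification pass over the whole drive.
badblocks -wsv -b 4096 /dev/nvme0n1

# Long SMART self-test (if the drive supports it), then review results.
smartctl -t long /dev/nvme0n1
smartctl -a /dev/nvme0n1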

The server is in a home office/lab environment. Nothing mission-critical.
Ack, then those drive models are definitely fine.
 
I’ve got the ball rolling on an RMA for the faulty drive.

This is my first drive failure with ZFS. Is there anything I need to do before I remove the hardware? Do I need to offline the drive or anything like that before taking it out?
 
