[SOLVED] ZFS device fault for pool rpool on backup

SoporteIB

Hello, we have received this error message from one of the disks in the ZFS system. I have run a "zpool status -v name" and I don't see any error. Can I take any more action? Thanks, best regards.

The number of I/O errors associated with a ZFS device exceeded
acceptable levels. ZFS has marked the device as faulted.

impact: Fault tolerance of the pool may be compromised.
eid: 29
class: statechange
state: FAULTED
host: backup
time: 2024-02-28 15:03:46+0100
vpath: /dev/disk/by-id/ata-INTEL_SSDSC2BW120A4_PHDA435001511207GN-part3
vguid: 0x00C8769FC869172D
pool: rpool (0x8A4E837096270D08)

Captura.PNG
 
Hi,

I have run a "zpool status -v name" and I don't see any error.
pool: rpool (0x8A4E837096270D08)
In the screenshot provided you only show the status of another pool, but not the actual faulted pool rpool.

Please provide the output of zpool status -v rpool and smartctl -x /dev/disk/by-id/ata-INTEL_SSDSC2BW120A4_PHDA435001511207GN.
You can also look into the system journal using journalctl -b and look for any errors. Usually on disk errors there will be something printed by the kernel.
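The checks above can be run in one go. The device path is copied from the fault notification; the grep pattern is only a suggestion for common disk-error strings, not an exhaustive filter:

```shell
# Pool status, including per-device read/write/checksum error counters:
zpool status -v rpool

# Full SMART attributes and the device error log for the faulted SSD
# (path taken from the fault notification above):
smartctl -x /dev/disk/by-id/ata-INTEL_SSDSC2BW120A4_PHDA435001511207GN

# Kernel messages from the current boot, filtered for typical disk-error
# strings (the pattern is only an example, adjust as needed):
journalctl -b --no-pager | grep -iE 'ata[0-9]|i/o error|blk_update_request'
```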
 
True, I had forgotten about the other one! :O Thank you
Capture attached. They are SSD hard drives.
They are the operating system disks.
Captura.PNG
 
Well, your best move now would be to replace the failed drive.
Please see our guide for Changing a failed bootable device, as special precautions must be taken when replacing a device from the rpool.

Again, you might want to check out the output of smartctl -x /dev/disk/by-id/ata-INTEL_SSD[..], whatever the full path is exactly (since it's cut off in the screenshot you provided). The kernel log (dmesg -H) might provide some more information on what kind of I/O errors are occurring, but the disk probably really is just failing.

Going by the model name, these Intel SSDs don't appear to be enterprise drives with power-loss protection (PLP). ZFS on consumer SSDs wears them out pretty quickly; just search the forum for the many posts on that topic.
I'd advise you to get some actual enterprise SSDs ("power-loss protection" is a good search term). They have far better endurance and, importantly, keep their performance up over time.
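As a rough outline of what the linked guide covers, the replacement of a bootable rpool mirror member looks like the sketch below. All device names here are placeholders (/dev/sdX = healthy mirror member, /dev/sdY = new disk); partition numbers assume a default PVE layout (partition 2 = ESP, partition 3 = ZFS). Always follow the official guide and double-check devices before running anything:

```shell
# 1. Copy the partition table from the healthy disk to the new one,
#    then randomize the new disk's partition GUIDs:
sgdisk /dev/sdX -R /dev/sdY
sgdisk -G /dev/sdY

# 2. Replace the faulted ZFS partition with the new one
#    (identify the old device via 'zpool status rpool'):
zpool replace -f rpool /dev/sdX3 /dev/sdY3

# 3. Re-initialize booting from the new disk's ESP:
proxmox-boot-tool format /dev/sdY2
proxmox-boot-tool init /dev/sdY2

# 4. Watch the resilver until it completes:
zpool status rpool
```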
 
Thank you, I managed to replace the system hard drive with the guide you sent me.
Best regards
 
Great to hear!

Please just mark the thread as SOLVED then, so others can find it more easily in the future! This can be done by editing the first post; there should be a dropdown for this near the title field :)
 
Hi,

I also sometimes receive these messages in my mailbox, but when I check the drives' SMART status, everything seems to be fine. It's happening on the storage pool, and every time I manually start a scrub on that pool, it reports nothing to correct, always showing zero errors.

If I check the pool's status in the web interface, it appears as degraded with a few errors, but they are only read errors, nothing about writes. So I wondered if there's really something to worry about. I admit I'm not extremely comfortable with ZFS. I've been running this machine for around four or five years; it's an OVH-hosted server, and the storage pool runs on two big 6 TB spinning disks.

I installed Proxmox manually using the installation CD via OVH's remote access (KVM over IP), so it's a standard installation without any modifications from OVH. I'm not too worried because I use it as a standalone Proxmox host and have a very frequent backup strategy (several times a day) to a different server with PBS, along with basic daily classical backups to an NFS share provided by OVH, plus a third local copy on the host itself. There's another local ZFS pool, the rpool, made up of two professional (I hope so, considering the price) NVMe disks, and it shows no errors.

It's currently running the latest PVE version, and I only receive these messages once in a while, maybe once every other month. To be honest (but maybe I'm kidding myself), I was thinking it might be a software issue rather than an actual hardware issue, since every time I check the SMART status or run a scrub, it reports zero errors and everything looks fine.

For sure, when I run the `zpool status` command, it shows the same degraded status as the web interface, indicating only a few read errors, no write errors, and suggests running `zpool clear` if I think everything is fine. My confidence has never been high enough to actually run it, but it's always proposed. Since it's only read errors and the disks' SMART status looks good, I almost forgot to mention that I also ran extended tests with OVH's PXE live "rescue" system, which always came back clean.

Let's be honest: try asking OVH support for a replacement disk when the drive has an OK SMART status and even their own "homemade" rescue system, with the strongest test settings (I forget what they call them, maybe just read and write modes), reports no issue at all! I dare you to succeed; if you do, you instantly become my god.

So, would you recommend running some more tests (and if so, I'm all ears, because as I mentioned, I'm not a ZFS specialist)? Or would you recommend running the `zpool clear` command, crossing my fingers and hoping as much as I can that it never pops up again? As I said, maybe it's an old piece of information stuck somewhere, maybe a one-off RAM error (though it's supposed to be ECC), and if I clear my zpool, it might disappear and never come back again!



Anyway, thank you very much for your insights,

Best regards,
 
Check the output from dmesg after you have encountered one of these errors. Sometimes that can shed some light on what type of temporary failure is happening and whether it is something to worry about.
 
Hi, thanks for your answer. As you suggested, I checked, and there are indeed three mentions of an error, but to my untrained eye it doesn't seem too scary. I've attached the output to this message; maybe someone will understand more than I do…
Thank you for your help.
 

Attachments

I would reach out to customer support and give them this information. It's really hard to tell what is going on without knowing a lot more about your hardware, and the hosting provider should have experience with this.

You are experiencing intermittent problems accessing the disks. This could be an incompatibility of Linux with the SATA controller, it could be a bug in the SATA controller's firmware, it could be an inadequate power supply, it could be defective SATA cables, it could be insufficient cooling, it could be faulty hard drives, or any number of other issues.

At the very least, it'll result in your machine occasionally stopping for a while and waiting for an I/O operation to complete. That's not great, if you need prompt responses from your server. But it also means that you have probably lost redundancy. You are paying for a second disk, but you are effectively only using one. If that one dies too, you might lose data.

You are paying for support and for hardware that works without problems, so you should make sure you get what you are paying for. At the very least, you should regularly zpool scrub your pools, as that can help fix things if these are just intermittent random faults (not that those are good either).
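For reference, that routine looks like the sketch below. The pool name "tank" is an example; substitute your storage pool's actual name. Note that Debian/PVE installs typically already ship a monthly scrub cron job from the zfsutils-linux package:

```shell
# Start a scrub and check its progress / result:
zpool scrub tank
zpool status tank

# If the scrub comes back clean and you believe the earlier read errors
# were transient, you can reset the error counters. Note that this only
# clears the counters; it does not repair anything on disk:
zpool clear tank
```

If the error counters come back after a clear and a clean scrub, that points at a recurring hardware or controller issue rather than stale state.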
 