[Drive problem] HDD CKSUM errors rising, but smartctl shows no problems

J0rdone

Greetings friends,

This is my first post on here, so please bear with me - still learning a lot!

Context
I fired up a PVE server a few months ago with spare PC parts, running the OS on an SSD and using two 12TB HDDs configured in RAID1 for storage (via the ZFS wizard in PVE).
All has been fine for months. I wanted to expand my storage and just installed two more 12TB drives. (It's worth noting that these two new drives are connected to the motherboard via an M.2 to SATA adapter, which has its own RAID management software... still figuring that out.)

Problem
I opened the PVE GUI to add these drives, and found under ZFS that my ORIGINAL drives are "DEGRADED".
I should note here that my power went out yesterday, and I didn't check on the drives after power came back on (dumb). Could that be the source of the problem?

Analysis
zpool status -v shows CKSUM errors on BOTH drives (different counts), and the counts increase every few seconds when I re-run the command.
smartctl on each disk shows no problems...?
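For reference, this is roughly what I ran on each disk (same device names as in the pool output below):
Code:
root@pve:~# smartctl -a /dev/sdb
root@pve:~# smartctl -a /dev/sdc
From what I can tell, attributes like Reallocated_Sector_Ct, Current_Pending_Sector and UDMA_CRC_Error_Count are the ones to watch, and none of them are flagged here.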

Output of zpool status -vLP:
Code:
root@pve:~# zpool status -vLP
  pool: zpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 594M in 00:00:41 with 1 errors on Fri May 17 12:44:37 2024
config:

        NAME           STATE     READ WRITE CKSUM
        zpool          DEGRADED     0     0     0
          mirror-0     DEGRADED     0     0     0
            /dev/sdb1  DEGRADED     0     0    87  too many errors
            /dev/sdc1  DEGRADED     0     0    84  too many errors

errors: Permanent errors have been detected in the following files:
(2 files are listed here)

Note: the data itself on these drives is not extremely critical, so no need to panic about backing anything up immediately
 

Check the SMART data. It may confirm trouble inside the disks. (Oh, you already did that...)

Or it may not. Having both drives show up with a similar number of errors on the same day is not really expected. In that case the transport path from the physical disks to ZFS is not reliable - it may have had a "hiccup".

Check/replace the cables. At least unplug and re-plug them - both ends, including the power cables. Check the controller, if it is not baked into the chipset. Check the RAM of the system; "memtest86" is always a good idea. It needs to run a long time (overnight), and it must be booted cleanly (not started like an application from inside the running primary OS).
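A quick way to look for transport trouble, as a sketch (exact log messages vary by controller):
Code:
# kernel log: look for ATA link resets or bus errors
root@pve:~# dmesg | grep -iE 'ata[0-9]+|link|reset'
# SMART attribute 199 (UDMA_CRC_Error_Count) increments on cable/transport errors
root@pve:~# smartctl -A /dev/sdb | grep -i crc
A rising CRC count with otherwise clean SMART data points at the cable or controller rather than the platters.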

Follow floh8's advice and run a scrub. Wait for it to finish. "zpool clear" the registered errors first, so you start with zeroed counters.
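Something like this, using the pool name from your output:
Code:
root@pve:~# zpool clear zpool    # reset the READ/WRITE/CKSUM counters
root@pve:~# zpool scrub zpool    # read and verify every block in the pool
root@pve:~# zpool status zpool   # re-run until the scrub reports it finished
If the counters stay at zero after a full scrub, the pool is healthy again.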

Usually we recommend making a backup first, but you stated the data is not really important, so this is up to you.
 
I'm so grateful for your help!

So I thought I had fixed the problem by wiping the drives. I had heard from other sources that power blips can corrupt ZFS pools in ways that never resolve themselves, so I wiped.
Assuming that had fixed things, I added my two new drives.
Here we are a few days later and I'm experiencing the same problem, even on the new drives (see screenshot). I like your theory that this isn't directly drive-related. I'll re-plug all the cables and then run memtest86 tonight. Then I can follow up with a scrub and clear.
 
> I fired up a PVE server a few months ago with spare PC parts, running the OS on an SSD and using two 12TB HDDs configured in RAID1 for storage (via the ZFS wizard in PVE).

> All has been fine for months. I wanted to expand my storage and just installed two more 12TB drives. (It's worth noting that these two new drives are connected to the motherboard via an M.2 to SATA adapter, which has its own RAID management software... still figuring that out.)

> Problem
> I should note here that my power went out yesterday, and I didn't check on the drives after power came back on

I'm assuming these are not NAS-rated drives? If they're desktop-class, they have different firmware and are likely not great candidates for ZFS. NAS / SAS / enterprise drives are designed to kick themselves out of a RAID early so they can be replaced, and the firmware also handles vibration effects. Desktop drives will try reading and re-reading a bad sector over and over to recover it, which introduces unstable behavior in a RAID.
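You can ask the drive about time-limited error recovery (ERC/TLER) with smartctl - a sketch; many desktop drives refuse the setting, and it usually does not survive a power cycle:
Code:
# query the current SCT Error Recovery Control values
root@pve:~# smartctl -l scterc /dev/sdb
# if supported: cap read/write recovery at 7.0 seconds (values are tenths of a second)
root@pve:~# smartctl -l scterc,70,70 /dev/sdb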

I see from the drive models that these are EXOS drives, although 2 different models, and you rebuilt as a raidz1. This maaay be part of the issue, not certain. A pool of mirrors with the same drive models in each column may be better.
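For comparison, a pool of mirrors is created roughly like this - the pool name and by-id paths are placeholders, not your actual disks:
Code:
root@pve:~# zpool create tank \
      mirror /dev/disk/by-id/ata-MODEL_A-SERIAL1 /dev/disk/by-id/ata-MODEL_A-SERIAL2 \
      mirror /dev/disk/by-id/ata-MODEL_B-SERIAL1 /dev/disk/by-id/ata-MODEL_B-SERIAL2
Each mirror vdev pairs two identical drives, and a resilver only has to touch the vdev that lost a disk.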

Connecting 2 new drives with an adapter that has RAID software is likely problematic. Especially if you didn't create a separate pool. ZFS could be massively confused by this setup if it doesn't have 100% full disk control.
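A quick sanity check, as a sketch: confirm each new drive shows up as a plain individual SATA disk with its real model and serial, not as a logical volume assembled by the adapter's RAID layer:
Code:
root@pve:~# lsblk -o NAME,SIZE,MODEL,SERIAL
root@pve:~# ls -l /dev/disk/by-id/ | grep -i ata
If the adapter hides the drives behind its own RAID volume, both SMART and ZFS lose direct access to them.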

If you're running a server and you want stability:

A) Make sure everything is on UPS power
B) Set up NUT (see the sketch after this list)
C) Use NAS-or-better rated drives with ZFS
D) Switch from a jackleg setup to a proper HBA in IT mode, actively cooled.
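For B), a minimal standalone NUT setup on the PVE host might look like this - the driver and port are assumptions for a common USB-attached UPS, so adjust for your model:
Code:
root@pve:~# apt install nut

# /etc/nut/nut.conf
MODE=standalone

# /etc/nut/ups.conf - usbhid-ups covers most USB UPS models
[myups]
        driver = usbhid-ups
        port = auto
With upsd/upsmon configured on top of that, the host shuts down cleanly before the battery runs out instead of eating another hard power cut.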

If you follow best-practice recommendations, your sysadmin life should be much easier.
 
