ZFS cksum errors

Jun 24, 2021
Hi guys,

For some weeks now I've been trying to tackle an issue with one of my PBS nodes.
It worked perfectly fine for a while, but a month or two ago I discovered that CKSUM errors were occurring on the zpool across all SAS HDDs,
with permanent data damage as a result.

Our pool consists of:
8 mirrors of 12TB Toshiba MG07SCA12TE SAS HDDs
1 special device consisting of 2 Samsung NVMe drives in a mirror, PM983 iirc
1 cache device, a partition on 2 normal SATA SSDs

The server is a Dell R740xd with an HBA330 mini (previously a PERC 740 in enhanced HBA mode)
160GB DDR4 ECC RAM
2x Xeon Silver 4210R

We already did the following to narrow down the cause:
- first swapped the PERC 740 for the HBA330, issue still exists;
- updated all the firmware and BIOSes we could find;
- full memtest, ran fine. Also no ECC errors;
- Dell diagnostic tool, also no errors;
- updated PBS fully, including reboots to load the new kernel; we're running the no-subscription repository of 2.x;
- replaced the whole chassis, excluding the drives (SAS, SATA and SSD), issue still remains...

Long SMART tests complete without errors. ZFS is only reporting CKSUM errors; the READ and WRITE counters are consistently at 0.
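For anyone wanting to track which devices accumulate CKSUM errors over time, here is a minimal sketch that parses `zpool status` text and lists the rows with a nonzero CKSUM column. The sample text, pool name and device names are made up for illustration and are not the output from this server. It assumes the counters are printed as plain integers; rows with abbreviated counts are skipped and would need extra handling.

```python
def cksum_errors(status_text):
    """Return {name: cksum_count} for config rows with a nonzero CKSUM column."""
    errors = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        # The device table starts right after this header row.
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True
            continue
        if not in_config or len(fields) < 5:
            continue
        name, _state, read, write, cksum = fields[:5]
        # Assumption: counters are plain integers; anything else is skipped.
        if not (read.isdigit() and write.isdigit() and cksum.isdigit()):
            continue
        if int(cksum) > 0:
            errors[name] = int(cksum)
    return errors


# Illustrative sample only (hypothetical pool and device names).
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda     ONLINE       0     0    12
            sdb     ONLINE       0     0     7
          mirror-1  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     3

errors: No known data errors
"""

if __name__ == "__main__":
    for device, count in cksum_errors(SAMPLE).items():
        print(device, count)
```

Running this against saved `zpool status` output after each scrub would show whether the errors stay on the same disks or spread across all mirrors.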

Is there anything I can provide or try in order to solve this issue and be done with it?

Currently on my phone, can add screenshots later if needed.

Kind regards,
David

zpool status at a random moment. It has been worse; this is after a clear and after swapping the PERC for the HBA.
CKSUM errors only appear on the SAS mirrors; the cache and special devices are fine.
WhatsApp Image 2022-07-25 at 4.53.29 PM.jpeg
 
It seems that the problem is related to the disk model...
https://www.truenas.com/community/threads/disc-degraded.90037/

The disks aren't broken, but something is not right. Sadly the topic at TrueNAS was never solved,
yet all the possible causes mentioned there have already been excluded in our setup.

Swapping the disks soon and will report back when done.
 
Swapped the disks last Tuesday and the CKSUM errors are gone, so in the end this specific Toshiba model doesn't seem to play well with ZFS.
 
Hi,

I would check the firmware version of all of these Toshiba HDDs. Maybe the faulty ones have an older version, which could explain the checksum errors.
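A rough sketch of how such a firmware comparison could be done: it groups drives by the firmware revision found in the text printed by `smartctl -i`, so that a drive with a different revision stands out. It assumes the revision appears on a `Revision:` line (as smartctl prints for SAS drives) or a `Firmware Version:` line (SATA). The device paths and revision strings below are hypothetical sample data, not values from this server.

```python
from collections import defaultdict


def firmware_revision(info_text):
    """Extract the firmware revision from `smartctl -i` style text, or None."""
    for line in info_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("Revision", "Firmware Version"):
            return value.strip()
    return None


def group_by_firmware(info_by_device):
    """Map each firmware revision to the sorted list of devices reporting it."""
    groups = defaultdict(list)
    for device, text in sorted(info_by_device.items()):
        groups[firmware_revision(text)].append(device)
    return dict(groups)


# Hypothetical sample data; the revision strings are invented for illustration.
SAMPLE_INFO = {
    "/dev/sda": "Vendor:    TOSHIBA\nProduct:   MG07SCA12TE\nRevision:  0101\n",
    "/dev/sdb": "Vendor:    TOSHIBA\nProduct:   MG07SCA12TE\nRevision:  0103\n",
    "/dev/sdc": "Vendor:    TOSHIBA\nProduct:   MG07SCA12TE\nRevision:  0101\n",
}

if __name__ == "__main__":
    for revision, devices in group_by_firmware(SAMPLE_INFO).items():
        print(revision, devices)
```

In practice the text for each device would come from running `smartctl -i` on it and saving the output; comparing the groups against the disks that show CKSUM errors would confirm or rule out a firmware link.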

I would also try cleaning all the HDD connectors with alcohol! I do this once a year!

Good luck / Bafta!