Hi guys,
For some weeks now I've been trying to tackle an issue with one of my PBS nodes.
It worked perfectly fine for a while, but a month or two ago I discovered that CKSUM errors were happening on the zpool across all SAS HDDs,
with permanent data damage as a result.
Our pool consists of:
8 mirrors of 12TB Toshiba MG07SCA12TE SAS HDDs
1 special device consisting of 2 Samsung NVMe drives in a mirror (PM983, iirc)
1 cache device, a partition on 2 normal SATA SSDs
The server is a Dell R740xd with an HBA330 mini (previously a PERC 740 in enhanced HBA mode)
160GB DDR4 ECC RAM
2x Xeon Silver 4210R
We already did the following to narrow down the cause:
- first swapped the PERC 740 for the HBA330, issue still exists;
- updated all the firmware and BIOS versions we could find;
- full memtest, ran fine. Also no ECC errors;
- Dell diagnostic tool, also no errors;
- updated PBS fully, including reboots to reload the kernel; we're running the no-subscription repository of 2.x;
- replaced the whole chassis, excluding the drives (SAS, SATA and SSD), issue still remains.
SMART tests (long) complete without errors. ZFS only reports CKSUM errors; the read and write counters are consistently at 0.
Is there anything I can provide or try to solve this issue and get it over with?
Currently on my phone, can add screenshots later if needed.
Kind regards,
David
zpool status at a random moment; it has been worse. This is after a clear and after swapping the PERC for the HBA.
CKSUM errors only appear on the SAS mirrors; the cache and special device are fine.
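
In case it helps with tracking this over time, here is a minimal sketch of how the per-device CKSUM counters could be pulled out of saved zpool status text. The sample text below is made up for illustration and is not output from our pool, and cksum_errors is a hypothetical helper name. It assumes the counters are plain integers (as far as I know, zpool status -p prints exact numbers); abbreviated values such as 1.2K are simply skipped.

```python
def cksum_errors(status_text):
    """Return {device_name: cksum_count} for rows with a non-zero CKSUM column."""
    errors = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        # The device table starts right after this header row.
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True
            continue
        # Skip everything before the table and rows without counters
        # (for example the bare "special" or "cache" group headings).
        if not in_config or len(fields) < 5:
            continue
        name, _state, _read, _write, cksum = fields[:5]
        # Assumption: only plain integer counters are handled here.
        if cksum.isdigit() and int(cksum) > 0:
            errors[name] = int(cksum)
    return errors


# Invented sample, NOT real output from the pool described above.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda     ONLINE       0     0    12
            sdb     ONLINE       0     0     7
        special
          mirror-8  ONLINE       0     0     0
            nvme0n1 ONLINE       0     0     0

errors: No known data errors
"""

if __name__ == "__main__":
    for device, count in cksum_errors(SAMPLE).items():
        print(f"{device}: {count} checksum errors")
```

Running something like this against periodic snapshots of the status output would show whether the errors grow evenly across all SAS disks or cluster on particular ones.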
