zfs, losing raidz members due to reboot

May 6, 2021
Hello
Unclear behavior, most probably hardware related, but possibly config related.

Supermicro barebone with 8 x 3.5 " SAS disks on LSI SAS 3008 in IT-mode.

During a reboot, the raidz loses the devices on phy 6 and 7. They appear as "FAULTED" with a number (the GUID) instead of a device name.
Example:
Bash:
config:

    NAME                      STATE     READ WRITE CKSUM
    backup-01-pool01          DEGRADED     0     0     0
      raidz2-0                DEGRADED     0     0     0
        sda                   ONLINE       0     0     0
        sdb                   ONLINE       0     0     0
        sdc                   ONLINE       0     0     0
        sdf                   ONLINE       0     0     0
        sdd                   ONLINE       0     0     0
        sdg1                  DEGRADED     0     0    51  too many errors
        sde                   ONLINE       0     0     0
        18086653234219275637  FAULTED      0     0     0  was /dev/sdh1

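As a side note, the non-healthy members can be pulled out of such a status dump with a one-liner; a sketch using awk on a saved copy of the output (the /tmp path and the trimmed sample are just for illustration):

```shell
# Sample lines from the `zpool status` output above, saved to a file
cat > /tmp/zpool_status.txt <<'EOF'
        sdg1                  DEGRADED     0     0    51  too many errors
        sde                   ONLINE       0     0     0
        18086653234219275637  FAULTED      0     0     0  was /dev/sdh1
EOF

# Print every member whose state is not ONLINE, with its state
awk '$2 ~ /DEGRADED|FAULTED|UNAVAIL|OFFLINE/ {print $1, $2}' /tmp/zpool_status.txt
# -> sdg1 DEGRADED
# -> 18086653234219275637 FAULTED
```

On a full status dump this also matches the pool and vdev lines (they are DEGRADED too), which is usually what you want to see anyway.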
The disks appear at boot; at least they were seen in dmesg with no problems.
I was able to clear the disk and rejoin it with
Bash:
zpool labelclear -f /dev/sdg1
zpool replace backup-01-pool01 /dev/sdg1
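One thing that can cause members to show up as a bare GUID after a reboot is the sdX names shuffling between boots. Re-importing the pool via the stable /dev/disk/by-id paths avoids that; a sketch, assuming the pool can briefly be exported (the by-id name in the comment is made up, use your own):

```shell
# Export the pool, then re-import it scanning by-id instead of sdX names.
# Illustrative only: the pool must be idle while exported.
zpool export backup-01-pool01
zpool import -d /dev/disk/by-id backup-01-pool01

# Future replaces can then use the stable name directly, e.g.:
# zpool replace backup-01-pool01 sdh /dev/disk/by-id/wwn-0x5000cca0123456789
```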

The device was OK for some hours and now shows CKSUM errors again, even though the disk itself has been replaced.

One possible problem: the write cache of the disks is "ON" and I don't know how to switch it to "OFF".
The behavior only started after we had a power outage, so it might also be a problem with the backplane.
Or perhaps PBS does an unclean shutdown? I don't think so.
 
What stands out is that the two problematic drives are also the only ones that use a partition (sdg1 and sdh1) instead of the raw device.
Whether this mixing within a vdev could lead to such behavior, I unfortunately don't know, sorry.
 
Why are you using partitions instead of whole disks in the first place?
I guess there is something else running on those disks on another partition? Maybe that is somehow causing problems.
 
IMHO, those disks have the usual partition layouts. This must be a numbering artifact, as the original /dev/sdg and /dev/sdh were still visible when the first replace began.
I would need to double-check tomorrow, but lsblk shows a small first partition on every disk (used for metadata, IMHO) and a larger second partition for the data itself (or vice versa?).
 
Usually it's a big 1st partition and a small 9th partition, and disks are added as "/dev/sdg", not "/dev/sdg1". I guess you did the latter when replacing the disks, but usually that should be fine too. So they are probably not causing errors because they are called sdg1 and sdh1; they are just called that because you replaced them using partitions instead of whole disks.
 
Might be, but one disk was newly added to make sure the first CKSUM errors were not due to a disk fault; the other one was just zapped (label-cleared) and re-added through zpool replace. I did not remove the existing ZFS partitions, nor did I wipe the whole disk.
Destroying the label of a zpool member seems odd. We thought perhaps the system shuts down uncleanly, but it is always these two bays/disks, so there is also the possibility that the backplane has problems.

Does anyone know how to turn off the write cache on such a SAS disk? I think there are utilities for LSI adapters, but not in IT mode. A generic SCSI command?
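For what it's worth, on SAS disks the write cache is the WCE bit in the SCSI Caching mode page, and sdparm (from the sdparm package) can read and change it directly through the HBA, independent of IT/IR mode; a sketch, assuming sdparm is installed and the disk is /dev/sdg:

```shell
# Read the current Write Cache Enable (WCE) bit
sdparm --get=WCE /dev/sdg

# Clear it (write cache off) and save it so it persists across power cycles
sdparm --clear=WCE --save /dev/sdg
```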
 
