zfs, losing raidz members after reboot

May 6, 2021
Hello
The behavior is unclear; it is most probably hardware related, but may be config related.

Supermicro barebone with 8 x 3.5" SAS disks on an LSI SAS 3008 in IT mode.

During a reboot, the raidz loses the devices on phy 6 and 7. They appear as "faulted" with a numeric ID instead of a device name.
Example:
Bash:
config:

    NAME                      STATE     READ WRITE CKSUM
    backup-01-pool01          DEGRADED     0     0     0
      raidz2-0                DEGRADED     0     0     0
        sda                   ONLINE       0     0     0
        sdb                   ONLINE       0     0     0
        sdc                   ONLINE       0     0     0
        sdf                   ONLINE       0     0     0
        sdd                   ONLINE       0     0     0
        sdg1                  DEGRADED     0     0    51  too many errors
        sde                   ONLINE       0     0     0
        18086653234219275637  FAULTED      0     0     0  was /dev/sdh1
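
To spot members like the two above without reading the whole table, the status output can be filtered. This is only a minimal sketch: `zpool_unhealthy` is a made-up helper name, and it assumes the "NAME STATE READ WRITE CKSUM" column layout shown in the output above. It prints every row whose state is not ONLINE, which includes the pool and vdev rows when they are degraded.

```shell
#!/bin/sh
# Hypothetical helper: list rows of `zpool status` (read from stdin)
# whose STATE column is not ONLINE.
# Assumption: the config table starts with a "NAME STATE ..." header row
# and ends at the next blank line, as in the output above.
zpool_unhealthy() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_cfg = 1; next }  # header row opens the table
        in_cfg && NF == 0             { in_cfg = 0 }        # blank line closes it
        in_cfg && $2 != "ONLINE"      { print $1, $2, $5 }  # name, state, CKSUM count
    '
}
```

It would be used as `zpool status backup-01-pool01 | zpool_unhealthy`.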

The disks do appear at boot; at least they were seen in dmesg without problems.
I was able to clear the disk and rejoin it with
Bash:
zpool labelclear -f /dev/sdg1
zpool replace backup-01-pool01 /dev/sdg1

The device was fine for a few hours, but now it shows CKSUM errors again, even though the disk itself has been replaced.

One possible problem: the write cache of the disks is "ON", and I don't know how to set it to "OFF".
This behavior only started after a power outage, so it might also be a problem with the backplane.
Or perhaps PBS does an unclean shutdown? I don't think so.
 
What stands out is that the two problematic drives are also the only ones that use a partition (sdg1 and sdh1) instead of the raw device.
Unfortunately, I do not know whether mixing these in a vdev could lead to such behavior, sorry.
 
Why are you using partitions instead of whole disks in the first place?
I guess there is something else running on those disks on another partition? Maybe that is somehow causing problems.
 
IMHO, these disks have the usual partition layout. The sdg1/sdh1 names must be a naming artifact, as the original sdg and sdh were still visible when the first replace began.
I will need to double-check tomorrow, but lsblk shows a small first partition on every disk (used for metadata, IMHO) and a larger second partition for the data itself (or vice versa?).
 
Usually it's a big 1st partition and a small 9th partition. Disks are normally added as "/dev/sdg" and not as "/dev/sdg1", but I guess you did the latter when replacing the disks. That should usually be fine too. So the errors are probably not caused by the names sdg1 and sdh1; they are only called that because you replaced them using partitions instead of whole disks.
 
Might be, but one disk was newly added to make sure the first CRC errors were not due to a disk error. The other one was just zapped (cleared) and re-added through zpool replace. I neither removed the existing ZFS partitions nor wiped the whole disk.
Destroying the label of a zpool member seems odd. We thought the system might be shutting down uncleanly, but it is always these two bays / disks, so the backplane may also be at fault.

Does anyone know how to turn off the write cache on such a SAS disk? I think there are utilities for LSI adapters, but not for IT mode. Is there a generic SCSI command?
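
One generic route that does not depend on the HBA is the WCE (write cache enable) bit in the SCSI caching mode page, which `sdparm` can read and change. The following is only a sketch, assuming the sdparm package is installed and the commands run as root. The function names and the `RUN` variable are made up for this example (set `RUN=echo` to print the commands instead of running them), and `--save` only takes effect if the drive supports saved mode pages.

```shell
#!/bin/sh
# Sketch: query or disable the volatile write cache (WCE bit) of a SCSI/SAS disk.
# Assumptions: sdparm is installed; run as root.
# RUN is a hypothetical knob of this sketch: RUN=echo prints instead of executing.
RUN=${RUN:-}

wcache_get() {
    # Show the current WCE value (1 = write cache on, 0 = off).
    $RUN sdparm --get=WCE "$1"
}

wcache_off() {
    # Clear WCE and ask the drive to save the setting across power cycles.
    $RUN sdparm --clear=WCE --save "$1"
}
```

For example, `wcache_get /dev/sdg` to check a disk and `wcache_off /dev/sdg` to switch its cache off.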
 
