Fault tolerance of the pool may be compromised

Good morning!

We have a PBS that is unfortunately operated through a RAID controller in HBA mode (the customer insisted on this despite our warning).
A ZFS RAIDZ2 was created from 15 × 4 TB SSDs.

Two days ago the system rebooted. Two disks (sdr, sds) were presumably declared unavailable during boot, and ZFS now reports the pool as degraded.
The disks are present and their SMART values appear to be OK.

In your opinion, what would be the best approach?
Replace the disks anyway, or re-attach them to the RAID-Z array?
If re-attaching, simply a zpool replace zfs01 /dev/sdr /dev/sdr?
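Something along these lines, I suppose (only a sketch; the numeric IDs are the vdev GUIDs that zpool status shows below):

Code:
# Option A: just try to bring the disk back online
zpool online zfs01 13141262203304221197

# Option B: replace the vdev in place with the same physical disk
# (-f because the disk still carries its old ZFS label)
zpool replace -f zfs01 13141262203304221197 /dev/sdr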


* Alert via mail:
Code:
ZFS has detected that a device was removed.

 impact: Fault tolerance of the pool may be compromised.
    eid: 8
  class: statechange
  state: UNAVAIL
   host: slbkpp01
   time: 2024-06-22 13:13:18+0200
  vpath: /dev/sdr1
  vphys: pci-0000:03:00.0-scsi-0:0:18:0
  vguid: 0xB65F245938371A0D
  devid: scsi-35002538b71a22d30-part1
   pool: zfs01 (0xF689F90C783BE902)

* zpool status

Code:
zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:43 with 0 errors on Sun Jun  9 00:24:45 2024
config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             ONLINE       0     0     0
          mirror-0                        ONLINE       0     0     0
            scsi-35000cca059b6cccc-part3  ONLINE       0     0     0
            scsi-35000cca059b6cd2c-part3  ONLINE       0     0     0

errors: No known data errors

  pool: zfs01
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 01:33:46 with 0 errors on Sun Jun  9 01:57:52 2024
config:

        NAME                      STATE     READ WRITE CKSUM
        zfs01                     DEGRADED     0     0     0
          raidz2-0                DEGRADED     0     0     0
            sda                   ONLINE       0     0     0
            sdb                   ONLINE       0     0     0
            sdc                   ONLINE       0     0     0
            sdd                   ONLINE       0     0     0
            sdl                   ONLINE       0     0     0
            sdm                   ONLINE       0     0     0
            sdo                   ONLINE       0     0     0
            sdn                   ONLINE       0     0     0
            sdp                   ONLINE       0     0     0
            sdq                   ONLINE       0     0     0
            13141262203304221197  FAULTED      0     0     0  was /dev/sdr1
            18003793698104801029  FAULTED      0     0     0  was /dev/sds1
            sdt                   ONLINE       0     0     0
            sdu                   ONLINE       0     0     0
            sdv                   ONLINE       0     0     0

errors: No known data errors



* SMART values of one of the disks
Code:
 smartctl -x /dev/sds
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.13-1-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SAMSUNG
Product:              MZILT3T8HBLS/007
Revision:             GXA0
Compliance:           SPC-5
User Capacity:        3,840,755,982,336 bytes [3.84 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x5002538b71a22d30
Serial number:        S5G0NC0RA03581
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Mon Jun 24 08:31:46 2024 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 0%
Current temperature = 43
Lifetime maximum temperature = 44
Lifetime minimum temperature = 19
Maximum temperature since power on = 44
Minimum temperature since power on = 42
Manufactured in week 42 of year 2021
Accumulated start-stop cycles:  23
Specified load-unload count over device lifetime:  0
Accumulated load-unload cycles:  0
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     143616.723           0
write:         0        0         0         0          0      83042.590           0
verify:        0        0         0         0          0          0.008           0

Non-medium error count:      189

  Pending defect count:0 Pending Defects
No Self-tests have been logged

Background scan results log
  Status: scan is active
    Accumulated power on time, hours:minutes 22628:16 [1357696 minutes]
    Number of background scans performed: 33,  scan progress: 58.80%
    Number of background medium scans performed: 33


Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 3
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: SMP phy control function
    reason: loss of dword synchronization
    negotiated logical link rate: phy enabled; 12 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5002538b71a22d32
    attached SAS address = 0x500056b378b913ff
    attached phy identifier = 18
    Invalid DWORD count = 4
    Running disparity error count = 4
    Loss of DWORD synchronization count = 1
    Phy reset problem count = 0
    Phy event descriptors:
     Received ERROR count: 0
     Received address frame error count: 0
     Received abandon-class OPEN_REJECT count: 0
     Received retry-class OPEN_REJECT count: 17
     Received SSP frame error count: 0
relative target port id = 2
  generation code = 3
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5002538b71a22d33
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization count = 0
    Phy reset problem count = 0
    Phy event descriptors:
     Received ERROR count: 0
     Received address frame error count: 0
     Received abandon-class OPEN_REJECT count: 0
     Received retry-class OPEN_REJECT count: 0
     Received SSP frame error count: 0


Thanks for your feedback, best regards
Roland
 
I re-attached the disks one after the other as follows. Finally, I will also start a verify.


Code:
zpool labelclear -f /dev/sdX1                      # clear the stale ZFS label first
zpool replace zfs01 <ID from zpool status> /dev/sdX
zpool status                                       # wait for the resilver to finish
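To follow the resilver and do the final check, something like this should do (a sketch; as I recall, proxmox-backup-manager verify can trigger the verification on the PBS side, with <datastore> being the datastore name):

Code:
# follow the resilver progress
watch -n 60 zpool status zfs01

# after the resilver: read-check the whole pool
zpool scrub zfs01

# and/or verify the datastore contents on the PBS side
proxmox-backup-manager verify <datastore>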

Best regards
Roland
 
This may happen again on a restart. The pool should have been created using the disks' "by-id" values instead of their /dev/sdX paths, as the latter may change at the kernel's or controller's discretion.

I would replace the drives one by one, removing each /dev/sdX device and adding it back with its "by-id" ID, so that no matter how the system enumerates the drives, ZFS will use the right disks.
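A sketch of that conversion. Note that instead of per-disk replaces (each of which triggers a full resilver), exporting the pool and re-importing it while scanning /dev/disk/by-id renames all vdevs in one step without any resilver:

Code:
# stop any services using the pool first, then
# convert the whole pool to stable by-id device names in one go
zpool export zfs01
zpool import -d /dev/disk/by-id zfs01

# vdevs should now show scsi-.../wwn-... names instead of sdX
zpool status zfs01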
 
Hi,

thanks for the response.
It happened again after a system crash; I think one RAM module is malfunctioning.

The pool was created via the Proxmox GUI; I will try to check how exactly it was created.
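zpool history keeps the exact commands, so the original create line from the GUI should be right at the top:

Code:
# the first entries include the original 'zpool create'
zpool history zfs01 | head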

Please give me some time, I'm currently on vacation.

Kind regards
Roland
 
Hi,
we have pointed this out to the customer several times. He is technically not the worst, and he says Dell's RAID controllers can be operated in HBA mode without the cache being active. They are running in HBA mode.
Thanks for the hint.
I will check ZFS and the commands when I'm back, e.g. as sketched below.
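Something like this should confirm the controller personality and cache state (assuming Dell's perccli, a rebranded storcli, is installed; output differs by controller generation):

Code:
# show controller mode ("personality") and cache-related settings
perccli64 /c0 show | grep -i -e personality -e cache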

Best regards
 
