Good morning!
We have a Proxmox Backup Server (PBS) that unfortunately runs behind a RAID controller in HBA mode. (The customer wanted it this way despite our warning.)
A ZFS RAIDZ2 pool was created from 15 4 TB SSDs.
Two days ago the system rebooted. Two disks (sdr, sds) were presumably marked as unavailable during boot, and the pool is now in the DEGRADED state.
The disks are present and their SMART values appear to be OK.
What would you consider the best way to proceed?
Replace the disks anyway, or re-attach them to the pool?
If re-attaching, is it simply a zpool replace /dev/sdr /dev/sdr?
* Alert received by email:
Code:
ZFS has detected that a device was removed.
impact: Fault tolerance of the pool may be compromised.
eid: 8
class: statechange
state: UNAVAIL
host: slbkpp01
time: 2024-06-22 13:13:18+0200
vpath: /dev/sdr1
vphys: pci-0000:03:00.0-scsi-0:0:18:0
vguid: 0xB65F245938371A0D
devid: scsi-35002538b71a22d30-part1
pool: zfs01 (0xF689F90C783BE902)
* zpool status:
Code:
zpool status
pool: rpool
state: ONLINE
scan: scrub repaired 0B in 00:00:43 with 0 errors on Sun Jun 9 00:24:45 2024
config:
NAME                              STATE     READ WRITE CKSUM
rpool                             ONLINE       0     0     0
  mirror-0                        ONLINE       0     0     0
    scsi-35000cca059b6cccc-part3  ONLINE       0     0     0
    scsi-35000cca059b6cd2c-part3  ONLINE       0     0     0
errors: No known data errors
pool: zfs01
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
scan: scrub repaired 0B in 01:33:46 with 0 errors on Sun Jun 9 01:57:52 2024
config:
NAME                      STATE     READ WRITE CKSUM
zfs01                     DEGRADED     0     0     0
  raidz2-0                DEGRADED     0     0     0
    sda                   ONLINE       0     0     0
    sdb                   ONLINE       0     0     0
    sdc                   ONLINE       0     0     0
    sdd                   ONLINE       0     0     0
    sdl                   ONLINE       0     0     0
    sdm                   ONLINE       0     0     0
    sdo                   ONLINE       0     0     0
    sdn                   ONLINE       0     0     0
    sdp                   ONLINE       0     0     0
    sdq                   ONLINE       0     0     0
    13141262203304221197  FAULTED      0     0     0  was /dev/sdr1
    18003793698104801029  FAULTED      0     0     0  was /dev/sds1
    sdt                   ONLINE       0     0     0
    sdu                   ONLINE       0     0     0
    sdv                   ONLINE       0     0     0
errors: No known data errors
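For reference, this is roughly how I picture the re-attach option. It is only a sketch: it prints the commands instead of running them, and it assumes the ZFS labels on the two disks are still intact, which I have not verified. The helper names `faulted_guids` and `plan_reattach` are made up for this post; `zdb -l`, `zpool online` and `zpool clear` are the regular OpenZFS commands.

```shell
#!/bin/sh
# Print "<guid> <former device path>" for every FAULTED/UNAVAIL vdev
# found in `zpool status` output read from stdin.
faulted_guids() {
    awk '($2 == "FAULTED" || $2 == "UNAVAIL") && $(NF-1) == "was" { print $1, $NF }'
}

# Dry run: print (do not execute) the commands one might try per vdev.
# $1 = pool name; the vdev list is read from stdin as "<guid> <old path>".
plan_reattach() {
    pool=$1
    while read -r guid path; do
        # First check whether the ZFS labels on the partition are readable.
        echo "zdb -l $path"
        # Then try to bring the vdev back, addressed by its GUID.
        echo "zpool online $pool $guid"
    done
    # Finally reset the error state and look at the result.
    echo "zpool clear $pool"
    echo "zpool status $pool"
}

# Usage (prints only):
#   zpool status zfs01 | faulted_guids | plan_reattach zfs01
```

If `zdb -l` showed no valid labels, I assume a `zpool replace` would be the way to go instead, hence my question.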
* SMART values of one disk:
Code:
smartctl -x /dev/sds
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.13-1-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: SAMSUNG
Product: MZILT3T8HBLS/007
Revision: GXA0
Compliance: SPC-5
User Capacity: 3,840,755,982,336 bytes [3.84 TB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x5002538b71a22d30
Serial number: S5G0NC0RA03581
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Mon Jun 24 08:31:46 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
Read Cache is: Enabled
Writeback Cache is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Percentage used endurance indicator: 0%
Current temperature = 43
Lifetime maximum temperature = 44
Lifetime minimum temperature = 19
Maximum temperature since power on = 44
Minimum temperature since power on = 42
Manufactured in week 42 of year 2021
Accumulated start-stop cycles: 23
Specified load-unload count over device lifetime: 0
Accumulated load-unload cycles: 0
Elements in grown defect list: 0
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     143616.723           0
write:         0        0         0         0          0      83042.590           0
verify:        0        0         0         0          0          0.008           0
Non-medium error count: 189
Pending defect count:0 Pending Defects
No Self-tests have been logged
Background scan results log
Status: scan is active
Accumulated power on time, hours:minutes 22628:16 [1357696 minutes]
Number of background scans performed: 33, scan progress: 58.80%
Number of background medium scans performed: 33
Protocol Specific port log page for SAS SSP
relative target port id = 1
generation code = 3
number of phys = 1
phy identifier = 0
attached device type: expander device
attached reason: SMP phy control function
reason: loss of dword synchronization
negotiated logical link rate: phy enabled; 12 Gbps
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=1
SAS address = 0x5002538b71a22d32
attached SAS address = 0x500056b378b913ff
attached phy identifier = 18
Invalid DWORD count = 4
Running disparity error count = 4
Loss of DWORD synchronization count = 1
Phy reset problem count = 0
Phy event descriptors:
Received ERROR count: 0
Received address frame error count: 0
Received abandon-class OPEN_REJECT count: 0
Received retry-class OPEN_REJECT count: 17
Received SSP frame error count: 0
relative target port id = 2
generation code = 3
number of phys = 1
phy identifier = 1
attached device type: no device attached
attached reason: unknown
reason: unknown
negotiated logical link rate: phy enabled; unknown
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x5002538b71a22d33
attached SAS address = 0x0
attached phy identifier = 0
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization count = 0
Phy reset problem count = 0
Phy event descriptors:
Received ERROR count: 0
Received address frame error count: 0
Received abandon-class OPEN_REJECT count: 0
Received retry-class OPEN_REJECT count: 0
Received SSP frame error count: 0
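The phy attached to the expander shows a few link errors (Invalid DWORD count = 4, one loss of DWORD synchronization), and the non-medium error count is 189. To see whether sdr and sds stand out, these counters could be compared across all disks. A minimal sketch: `link_summary` is a hypothetical helper name, and I cannot judge whether the values above are unusual for this enclosure.

```shell
#!/bin/sh
# Condense `smartctl -x` output (stdin) into one line containing the
# non-medium error count and the SAS link error counters, summed over
# all phys of the disk.
link_summary() {
    awk -F'[:=]' '
        /Non-medium error count/              { nonmedium = $2 + 0 }
        /Invalid DWORD count/                 { dword += $2 }
        /Running disparity error count/       { disparity += $2 }
        /Loss of DWORD synchronization count/ { sync_loss += $2 }
        END { printf "non-medium=%d invalid-dword=%d disparity=%d sync-loss=%d\n",
                     nonmedium, dword, disparity, sync_loss }'
}

# Usage: one summary line per disk
#   for d in /dev/sd?; do printf "%s " "$d"; smartctl -x "$d" | link_summary; done
```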
Thanks for your feedback, and best regards
Roland