ZFS device fault for pool

dmq

Member
Apr 24, 2022
Hello,

I hope someone can help me with this.

I have a fairly new Intel NUC installation with two Micron Pro SSDs (1x Micron_5300_MTFDDAK480TDS == /dev/sda, 1x Micron_5300_MTFDDAV480TDS == /dev/sdb). Both are combined in a mirrored (RAID1) zpool named rpool.

For a short while now, Proxmox has been reporting the following:

zpool status -v:

root@prox:~# zpool status -v
  pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 00:01:42 with 0 errors on Thu Sep 8 23:26:34 2022
config:

        NAME                                                  STATE     READ WRITE CKSUM
        rpool                                                 DEGRADED     0     0     0
          mirror-0                                            DEGRADED     0     0     0
            ata-Micron_5300_MTFDDAK480TDS_2017284987F6-part3  ONLINE       0     0     0
            ata-Micron_5300_MTFDDAV480TDS_194424E6BC02-part3  FAULTED     37   695   524  too many errors
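To keep an eye on the per-device counters without reading the table by hand, the config section can be parsed. This is only a minimal sketch: the sample text is copied from the output above, the function names are made up for illustration, and it assumes plain integer counters (larger counts may be abbreviated by zpool, e.g. with a K suffix, which this sketch does not handle).

```python
import re

# Config rows copied from the `zpool status -v` output in this post.
SAMPLE = """\
  pool: rpool
 state: DEGRADED
config:

        NAME                                                  STATE     READ WRITE CKSUM
        rpool                                                 DEGRADED     0     0     0
          mirror-0                                            DEGRADED     0     0     0
            ata-Micron_5300_MTFDDAK480TDS_2017284987F6-part3  ONLINE       0     0     0
            ata-Micron_5300_MTFDDAV480TDS_194424E6BC02-part3  FAULTED     37   695   524  too many errors
"""

# A config row is: name, state keyword, then three integer counters.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)


def device_rows(text):
    """Return one dict per pool/vdev/device row in the config section."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            name, state, read, write, cksum = m.groups()
            rows.append({
                "name": name,
                "state": state,
                "read": int(read),
                "write": int(write),
                "cksum": int(cksum),
            })
    return rows


def devices_with_errors(text):
    """Rows whose READ, WRITE or CKSUM counter is non-zero."""
    return [r for r in device_rows(text)
            if r["read"] or r["write"] or r["cksum"]]


if __name__ == "__main__":
    for row in devices_with_errors(SAMPLE):
        print(row["name"], row["state"], row["read"], row["write"], row["cksum"])
```

On a live system the text would come from the real `zpool status -v` output instead of the embedded sample.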

Based on that, I ran a long SMART self-test (smartctl -t long /dev/sdb).

Afterwards, here is the output of smartctl -a:

root@prox:~# smartctl -a /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.35-3-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Micron 5100 Pro / 52x0 / 5300 SSDs
Device Model:     Micron_5300_MTFDDAV480TDS
Serial Number:    194424E6ANON
LU WWN Device Id: 5 00a075 124e6bc02
Firmware Version: D3MU001
User Capacity:    480,103,981,056 bytes [480 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      M.2
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Sep 8 23:41:15 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                ( 4642) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (   6) minutes.
Conveyance self-test routine
recommended polling time:        (   3) minutes.
SCT capabilities:              (0x0035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   001    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1703
 12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       54
170 Reserved_Block_Pct      0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   001    Old_age   Always       -       0
173 Avg_Block-Erase_Count   0x0032   100   100   000    Old_age   Always       -       3
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       49
183 SATA_Int_Downshift_Ct   0x0032   100   100   000    Old_age   Always       -       2
184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       19
194 Temperature_Celsius     0x0022   018   015   000    Old_age   Always       -       82 (Min/Max 26/85)
195 Hardware_ECC_Recovered  0x0032   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   100   100   001    Old_age   Offline      -       0
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       3471634576
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       108488401
248 Bckgnd_Program_Page_Cnt 0x0032   100   100   000    Old_age   Always       -       52082279
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   100   100   000    Pre-fail  Always       -       2161
210 RAIN_Success_Recovered  0x0032   100   100   000    Old_age   Always       -       0
211 Integ_Scan_Complete_Cnt 0x0032   100   100   000    Old_age   Always       -       36
212 Integ_Scan_Folding_Cnt  0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1702         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Completed [00% left] (61473024-61538559)
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
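The health result says PASSED and the reallocation counters are zero, so the interesting rows are easy to miss in a table this long. Below is a rough sketch that pulls a few raw values out of the attribute table. The rows are copied from the output above; the choice of which counters to watch (`WATCH`) is my own assumption, not a Micron or smartmontools recommendation, and raw values are vendor-specific.

```python
# A few attribute rows copied from the smartctl output in this post.
SAMPLE = """\
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       49
183 SATA_Int_Downshift_Ct   0x0032   100   100   000    Old_age   Always       -       2
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       19
194 Temperature_Celsius     0x0022   018   015   000    Old_age   Always       -       82 (Min/Max 26/85)
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
"""

# Assumption: counters that may hint at link or command-level trouble.
WATCH = ("SATA_Int_Downshift_Ct", "Command_Timeout", "UDMA_CRC_Error_Count")


def parse_attributes(text):
    """Map attribute name -> parsed columns of a smartctl attribute row."""
    attrs = {}
    for line in text.splitlines():
        # Ten columns: ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE,
        # UPDATED, WHEN_FAILED, RAW_VALUE (the last may contain spaces).
        parts = line.split(None, 9)
        if len(parts) == 10 and parts[0].isdigit():
            attrs[parts[1]] = {
                "id": int(parts[0]),
                "value": int(parts[3]),
                "worst": int(parts[4]),
                "thresh": int(parts[5]),
                "raw": parts[9].strip(),
            }
    return attrs


def raw_int(attr):
    """First integer of the raw value, e.g. 82 from '82 (Min/Max 26/85)'."""
    return int(attr["raw"].split()[0])


def nonzero_counters(attrs, names=WATCH):
    """Watched attributes whose raw counter is above zero."""
    return {n: raw_int(attrs[n]) for n in names
            if n in attrs and raw_int(attrs[n]) > 0}


if __name__ == "__main__":
    attrs = parse_attributes(SAMPLE)
    print("non-zero counters:", nonzero_counters(attrs))
    print("temperature:", raw_int(attrs["Temperature_Celsius"]), "C")
```

For this drive the sketch picks out the 19 command timeouts and 2 interface downshifts, next to the 82 °C reading. None of these is proof of a defect on its own.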

I then ran "zpool clear rpool". Unfortunately, after another scrub the error comes back :(

I also see the following in the boot log / dmesg:

root@prox:~# dmesg | grep sdb
[ 1.905401] sd 2:0:0:0: [sdb] 937703088 512-byte logical blocks: (480 GB/447 GiB)
[ 1.905405] sd 2:0:0:0: [sdb] 4096-byte physical blocks
[ 1.905416] sd 2:0:0:0: [sdb] Write Protect is off
[ 1.905421] sd 2:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[ 1.905439] sd 2:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 1.925311] sdb: sdb1 sdb2 sdb3
[ 1.950166] sd 2:0:0:0: [sdb] Attached SCSI disk
[ 93.672102] sd 2:0:0:0: [sdb] tag#22 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[ 93.672120] sd 2:0:0:0: [sdb] tag#22 Sense Key : Illegal Request [current]
[ 93.672135] sd 2:0:0:0: [sdb] tag#22 Add. Sense: Unaligned write command
[ 93.672149] sd 2:0:0:0: [sdb] tag#22 CDB: Read(10) 28 00 06 a1 33 48 00 00 f8 00
[ 93.672737] blk_update_request: I/O error, dev sdb, sector 111227720 op 0x0:(READ) flags 0x700 phys_seg 19 prio class 0
[ 93.674014] sd 2:0:0:0: [sdb] tag#24 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[ 93.674587] sd 2:0:0:0: [sdb] tag#24 Sense Key : Illegal Request [current]
[ 93.675155] sd 2:0:0:0: [sdb] tag#24 Add. Sense: Unaligned write command
[ 93.675716] sd 2:0:0:0: [sdb] tag#24 CDB: Read(10) 28 00 06 a1 31 48 00 01 00 00
[ 93.676281] blk_update_request: I/O error, dev sdb, sector 111227208 op 0x0:(READ) flags 0x700 phys_seg 16 prio class 0
[ 93.677683] sd 2:0:0:0: [sdb] tag#25 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[ 93.678316] sd 2:0:0:0: [sdb] tag#25 Sense Key : Illegal Request [current]
[ 93.678985] sd 2:0:0:0: [sdb] tag#25 Add. Sense: Unaligned write command
[ 93.679642] sd 2:0:0:0: [sdb] tag#25 CDB: Read(10) 28 00 06 a1 32 48 00 01 00 00
[ 93.680275] blk_update_request: I/O error, dev sdb, sector 111227464 op 0x0:(READ) flags 0x700 phys_seg 16 prio class 0
[ 93.821093] sd 2:0:0:0: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[ 93.821761] sd 2:0:0:0: [sdb] tag#16 Sense Key : Illegal Request [current]
[ 93.822427] sd 2:0:0:0: [sdb] tag#16 Add. Sense: Unaligned write command
[ 93.823091] sd 2:0:0:0: [sdb] tag#16 CDB: Read(10) 28 00 06 a1 34 40 00 01 00 00
[ 93.823820] blk_update_request: I/O error, dev sdb, sector 111227968 op 0x0:(READ) flags 0x700 phys_seg 16 prio class 0
[ 94.404627] sd 2:0:0:0: [sdb] tag#22 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[ 94.405157] sd 2:0:0:0: [sdb] tag#22 Sense Key : Illegal Request [current]
[ 94.405682] sd 2:0:0:0: [sdb] tag#22 Add. Sense: Unaligned write command
[ 94.406206] sd 2:0:0:0: [sdb] tag#22 CDB: Read(10) 28 00 00 10 0a 10 00 00 10 00
[ 94.406728] blk_update_request: I/O error, dev sdb, sector 1051152 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 94.496488] sd 2:0:0:0: [sdb] tag#15 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[ 94.497072] sd 2:0:0:0: [sdb] tag#15 Sense Key : Illegal Request [current]
[ 94.497639] sd 2:0:0:0: [sdb] tag#15 Add. Sense: Unaligned write command
[ 94.498201] sd 2:0:0:0: [sdb] tag#15 CDB: Read(10) 28 00 37 e4 32 10 00 00 10 00
[ 94.498768] blk_update_request: I/O error, dev sdb, sector 937701904 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 94.552994] sd 2:0:0:0: [sdb] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[ 94.553608] sd 2:0:0:0: [sdb] tag#21 Sense Key : Illegal Request [current]
[ 94.554224] sd 2:0:0:0: [sdb] tag#21 Add. Sense: Unaligned write command
[ 94.554835] sd 2:0:0:0: [sdb] tag#21 CDB: Read(10) 28 00 37 e4 34 10 00 00 10 00
[ 94.555494] blk_update_request: I/O error, dev sdb, sector 937702416 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 94.625343] sd 2:0:0:0: [sdb] tag#30 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[ 94.626002] sd 2:0:0:0: [sdb] tag#30 Sense Key : Illegal Request [current]
[ 94.626658] sd 2:0:0:0: [sdb] tag#30 Add. Sense: Unaligned write command
[ 94.627313] sd 2:0:0:0: [sdb] tag#30 CDB: Read(10) 28 00 06 a1 35 40 00 01 00 00
[ 94.628029] blk_update_request: I/O error, dev sdb, sector 111228224 op 0x0:(READ) flags 0x700 phys_seg 16 prio class 0
[ 95.359876] sd 2:0:0:0: [sdb] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[ 95.360418] sd 2:0:0:0: [sdb] tag#11 Sense Key : Illegal Request [current]
[ 95.361021] sd 2:0:0:0: [sdb] tag#11 Add. Sense: Unaligned write command
[ 95.361546] sd 2:0:0:0: [sdb] tag#11 CDB: Read(10) 28 00 00 10 0a 10 00 00 10 00
[ 95.362071] blk_update_request: I/O error, dev sdb, sector 1051152 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 95.400573] sd 2:0:0:0: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[ 95.401143] sd 2:0:0:0: [sdb] tag#16 Sense Key : Illegal Request [current]
[ 95.401708] sd 2:0:0:0: [sdb] tag#16 Add. Sense: Unaligned write command
[ 95.402273] sd 2:0:0:0: [sdb] tag#16 CDB: Read(10) 28 00 37 e4 32 10 00 00 10 00
[ 95.402866] blk_update_request: I/O error, dev sdb, sector 937701904 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
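The "Unaligned write command" text is confusing here because every failing command is a read. One way to sanity-check it is to decode the CDB bytes from the log: in a READ(10) command, byte 0 is the opcode 0x28, bytes 2 to 5 hold the big-endian LBA, and bytes 7 to 8 hold the block count. The sketch below does that for three CDBs copied from the dmesg output; the function names are made up, and the 512/4096 sizes are taken from the "Sector Sizes" line of the smartctl output.

```python
def decode_read10(cdb_hex):
    """Decode a READ(10) CDB given as hex bytes, e.g. '28 00 06 a1 ...'.

    Returns (lba, blocks). Raises ValueError for anything that is not a
    10-byte command with opcode 0x28.
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length
    return lba, blocks


def is_4k_aligned(lba, blocks, logical=512, physical=4096):
    """True if start and length are multiples of the physical sector size."""
    per_physical = physical // logical  # 8 logical blocks per 4 KiB sector
    return lba % per_physical == 0 and blocks % per_physical == 0


# CDBs copied from the dmesg lines in this post.
CDBS = [
    "28 00 06 a1 33 48 00 00 f8 00",
    "28 00 00 10 0a 10 00 00 10 00",
    "28 00 37 e4 32 10 00 00 10 00",
]

if __name__ == "__main__":
    for cdb in CDBS:
        lba, blocks = decode_read10(cdb)
        print(f"lba={lba} blocks={blocks} 4k_aligned={is_4k_aligned(lba, blocks)}")
```

The decoded LBAs match the sector numbers in the `blk_update_request` lines, and all three reads start and end on 4 KiB boundaries. So the message probably should not be taken literally; it looks more like a generic label the kernel attaches to a lower-level error from the drive or the link, though that reading is an interpretation and not something the log states.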


I hope the drive isn't defective; I deliberately spent a bit more and bought a Micron Pro (and it's the M.2 variant at that). On the smartctl side I don't see any errors either.

Does anyone have an idea what I can or should do?

Many thanks in advance
dmq
 
194 Temperature_Celsius 0x0022 018 015 000 Old_age Always - 82 (Min/Max 26/85)
82 degrees Celsius doesn't sound good either, though. Maybe the SSD is having problems because it is overheating?
 
@Dunuin thanks for your reply.

That value had already given me an uneasy feeling too. The other drive in the system shows different temperatures:

194 Temperature_Celsius 0x0022 047 042 000 Old_age Always - 53 (Min/Max 20/58)
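For comparing the two drives, the temperature rows can be parsed directly. This is a small sketch using the two lines quoted in this thread. The 70 °C limit is only a placeholder I picked for illustration; the real operating limit should come from the drive's datasheet.

```python
import re

# Matches attribute 194 rows of the form
# '194 Temperature_Celsius ... - 82 (Min/Max 26/85)'.
TEMP_LINE = re.compile(
    r"^\s*194\s+Temperature_Celsius\b.*?-\s+(\d+)\s+\(Min/Max\s+(\d+)/(\d+)\)"
)


def parse_temperature(line):
    """Return current/min/max in degrees Celsius, or None if no match."""
    m = TEMP_LINE.match(line)
    if m is None:
        return None
    current, lowest, highest = (int(g) for g in m.groups())
    return {"current": current, "min": lowest, "max": highest}


# The two rows quoted in this thread.
M2_DRIVE = "194 Temperature_Celsius 0x0022 018 015 000 Old_age Always - 82 (Min/Max 26/85)"
OTHER_DRIVE = "194 Temperature_Celsius 0x0022 047 042 000 Old_age Always - 53 (Min/Max 20/58)"

# Assumption: placeholder limit, not taken from any datasheet.
ASSUMED_LIMIT_C = 70

if __name__ == "__main__":
    for label, line in (("M.2 drive (sdb)", M2_DRIVE), ("other drive", OTHER_DRIVE)):
        t = parse_temperature(line)
        status = "above assumed limit" if t["current"] > ASSUMED_LIMIT_C else "ok"
        print(f"{label}: {t['current']} C (min {t['min']}, max {t['max']}) -> {status}")
```

The M.2 drive is running 29 °C hotter than the other one and has already peaked at 85 °C.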

Physically, the M.2 card is of course much closer to the chips that produce a lot of waste heat. The fan is already set to high; I'll try to do something more about it.
 
@Dunuin :) I took the NUC apart and cleaned out every speck of dust (there was a bit to be found ;-)). I also reseated everything (M.2 etc.) and adjusted the cooling settings in the Visual BIOS. After a reboot and some load, the temperature sensor readings (lm-sensors, smartctl) looked much better. I then dared to run another scrub. It completes without errors.

The temperature was definitely worrying, but I wouldn't really have connected it with the integrity check. In hindsight it makes sense: a scrub generates a lot of I/O, which pushes the temperature up a good deal further.

A thousand thanks to you once again!
 
