ZFS device fault for pool BUT no SMART errors on drive

Jota V.

Hi, on my Proxmox cluster one node sent this email:

Code:
The number of I/O errors associated with a ZFS device exceeded acceptable levels. ZFS has marked the device as faulted.

 impact: Fault tolerance of the pool may be compromised.
    eid: 445648
  class: statechange
  state: FAULTED
   host: node01
   time: 2020-04-12 07:04:42+0200
  vpath: /dev/sda1
  vphys: pci-0000:00:1f.2-ata-1
  vguid: 0x7F878357D147A4A5
  devid: ata-WDC_WD2004FBYZ-01YCBB1_xxxxxxxxxxxxxxxxxxxxx-part1
   pool: 0xF0A3E89FFACE50B7
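
The devid already identifies the physical disk; it can be matched against /dev/disk/by-id, e.g. (grepping for the model string from the devid above):

Code:
ls -l /dev/disk/by-id/ | grep WD2004FBYZ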

We've run a long SMART test on this drive and got no errors:

Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   175   174   021    Pre-fail  Always       -       4250
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       21
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   057   057   000    Old_age   Always       -       31593
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       21
 16 Total_LBAs_Read         0x0022   014   186   000    Old_age   Always       -       758610618042
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       9
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       389
194 Temperature_Celsius     0x0022   123   108   000    Old_age   Always       -       24
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     31583         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
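
For reference, the test was started and read back with smartctl, roughly like this (assuming the drive is /dev/sda as in the ZED mail):

Code:
smartctl -t long /dev/sda    # start the extended (long) self-test; it runs inside the drive
smartctl -a /dev/sda         # once done, print attributes, error log and self-test log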

And zpool status gives:

Code:
root@node01:~# zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 0 days 03:02:37 with 0 errors on Sun Apr 12 03:26:39 2020
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            sda     FAULTED      0    51     0  too many errors
            sdb     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

errors: No known data errors
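
For reference, if we do end up replacing it, I understand it would be along these lines (with the new disk given by its stable /dev/disk/by-id path):

Code:
zpool replace rpool sda /dev/disk/by-id/<new-disk>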

Should we replace this drive, or is it a false alarm?

Thanks.
 
Reading up on this, I ran

zpool clear rpool

and got this:

Code:
ZFS has finished a resilver:

   eid: 478916
 class: resilver_finish
  host: node01
  time: 2020-04-14 08:38:33+0200
  pool: rpool
 state: ONLINE
  scan: resilvered 118M in 0 days 00:00:41 with 0 errors on Tue Apr 14 08:38:33 2020
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

errors: No known data errors
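
Side note: the raw ZFS events behind these notifications (including the eids above) can also be listed directly, e.g.:

Code:
zpool events -v rpool | less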
 
Should we replace this drive, or is it a false alarm?
Difficult to tell - especially if you don't know where the error originated. Did `dmesg` or the journal (`journalctl -r`) show any errors?
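
For example, something along these lines (assuming the disk is still sda, on the ata-1 port from the ZED mail):

Code:
dmesg -T | grep -iE 'ata|sda'
journalctl -r -k | grep -i sda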

Apart from that, it depends on how important the data on the pool is.

It is definitely a good moment to check your backups and verify that they are working.

Otherwise: I had one occasion where a disk in a zpool threw a few errors, but it went back to normal after a `zpool clear` (and has stayed that way for over 6 months since).

I hope this helps!
 
Looking into /var/log/daemon.log around the time the email was received, I got many lines like this:

Code:
Apr 12 07:00:04 node01 pmxcfs[1759]: [status] notice: received log
Apr 12 07:00:05 node01 pmxcfs[1759]: [status] notice: received log
Apr 12 07:00:05 node01 pmxcfs[1759]: [status] notice: received log
Apr 12 07:00:05 node01 systemd[1]: Started Session 1398760 of user root.
Apr 12 07:00:08 node01 zed: eid=445474 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:10 node01 zed: eid=445475 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:10 node01 systemd[1]: session-1398760.scope: Succeeded.
Apr 12 07:00:11 node01 zed: eid=445476 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:11 node01 zed: eid=445477 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:11 node01 zed: eid=445478 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:12 node01 systemd[1]: Started Session 1398761 of user root.
Apr 12 07:00:13 node01 zed: eid=445479 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:15 node01 systemd[1]: session-1398761.scope: Succeeded.
Apr 12 07:00:15 node01 systemd[1]: Started Session 1398762 of user root.
Apr 12 07:00:16 node01 systemd[1]: session-1398762.scope: Succeeded.
Apr 12 07:00:16 node01 systemd[1]: Started Session 1398763 of user root.
Apr 12 07:00:17 node01 systemd[1]: session-1398763.scope: Succeeded.
Apr 12 07:00:17 node01 systemd[1]: Started Session 1398764 of user root.
Apr 12 07:00:19 node01 zed: eid=445480 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:21 node01 zed: eid=445481 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:21 node01 zed: eid=445482 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:21 node01 systemd[1]: session-1398764.scope: Succeeded.
Apr 12 07:00:21 node01 zed: eid=445483 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:22 node01 zed: eid=445484 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:22 node01 systemd[1]: Started Session 1398765 of user root.
Apr 12 07:00:23 node01 zed: eid=445485 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:24 node01 systemd[1]: session-1398765.scope: Succeeded.
Apr 12 07:00:24 node01 systemd[1]: Started Session 1398766 of user root.
Apr 12 07:00:25 node01 systemd[1]: session-1398766.scope: Succeeded.
Apr 12 07:00:25 node01 systemd[1]: Started Session 1398767 of user root.
Apr 12 07:00:26 node01 systemd[1]: session-1398767.scope: Succeeded.
Apr 12 07:00:27 node01 systemd[1]: Started Session 1398768 of user root.
Apr 12 07:00:29 node01 zed: eid=445486 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:31 node01 zed: eid=445487 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:31 node01 zed: eid=445488 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:31 node01 systemd[1]: session-1398768.scope: Succeeded.
Apr 12 07:00:31 node01 zed: eid=445489 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:32 node01 zed: eid=445490 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:32 node01 systemd[1]: Started Session 1398769 of user root.
Apr 12 07:00:33 node01 zed: eid=445491 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:34 node01 systemd[1]: session-1398769.scope: Succeeded.
Apr 12 07:00:34 node01 systemd[1]: Started Session 1398770 of user root.
Apr 12 07:00:34 node01 systemd[1]: session-1398770.scope: Succeeded.
Apr 12 07:00:34 node01 systemd[1]: Started Session 1398771 of user root.
Apr 12 07:00:35 node01 systemd[1]: session-1398771.scope: Succeeded.
Apr 12 07:00:36 node01 systemd[1]: Started Session 1398772 of user root.
Apr 12 07:00:38 node01 zed: eid=445492 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:39 node01 zed: eid=445493 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:40 node01 zed: eid=445494 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:40 node01 systemd[1]: session-1398772.scope: Succeeded.
Apr 12 07:00:40 node01 zed: eid=445495 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:40 node01 zed: eid=445496 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:41 node01 systemd[1]: Started Session 1398773 of user root.
Apr 12 07:00:42 node01 zed: eid=445497 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:42 node01 systemd[1]: session-1398773.scope: Succeeded.
Apr 12 07:00:42 node01 systemd[1]: Started Session 1398774 of user root.
Apr 12 07:00:43 node01 systemd[1]: session-1398774.scope: Succeeded.
Apr 12 07:00:43 node01 systemd[1]: Started Session 1398775 of user root.
Apr 12 07:00:44 node01 systemd[1]: session-1398775.scope: Succeeded.
Apr 12 07:00:49 node01 systemd[1]: Started Session 1398776 of user root.
Apr 12 07:00:51 node01 zed: eid=445498 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:53 node01 zed: eid=445499 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:53 node01 zed: eid=445500 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:53 node01 zed: eid=445501 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:53 node01 zed: eid=445502 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:53 node01 zed: eid=445503 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:53 node01 systemd[1]: session-1398776.scope: Succeeded.
Apr 12 07:00:53 node01 zed: eid=445504 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:54 node01 zed: eid=445505 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:54 node01 systemd[1]: Started Session 1398777 of user root.
Apr 12 07:00:54 node01 zed: eid=445506 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:54 node01 zed: eid=445507 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:55 node01 zed: eid=445508 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:55 node01 systemd[1]: session-1398777.scope: Succeeded.
Apr 12 07:00:55 node01 systemd[1]: Started Session 1398778 of user root.
Apr 12 07:00:56 node01 systemd[1]: session-1398778.scope: Succeeded.
Apr 12 07:00:56 node01 systemd[1]: Started Session 1398779 of user root.
Apr 12 07:00:57 node01 systemd[1]: session-1398779.scope: Succeeded.
Apr 12 07:00:57 node01 systemd[1]: Started Session 1398780 of user root.
Apr 12 07:00:59 node01 zed: eid=445509 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:01:00 node01 systemd[1]: Starting Proxmox VE replication runner...
Apr 12 07:01:01 node01 zed: eid=445510 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:01:01 node01 systemd[1]: session-1398780.scope: Succeeded.
Apr 12 07:01:01 node01 zed: eid=445511 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:01:01 node01 systemd[1]: pvesr.service: Succeeded.
Apr 12 07:01:01 node01 systemd[1]: Started Proxmox VE replication runner.
Apr 12 07:01:01 node01 zed: eid=445512 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:01:01 node01 zed: eid=445513 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:01:01 node01 systemd[1]: Started Session 1398804 of user root.
Apr 12 07:01:03 node01 zed: eid=445514 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:01:03 node01 systemd[1]: session-1398804.scope: Succeeded.
Apr 12 07:01:03 node01 systemd[1]: Started Session 1398817 of user root.
Apr 12 07:01:04 node01 systemd[1]: session-1398817.scope: Succeeded.
Apr 12 07:01:04 node01 systemd[1]: Started Session 1398818 of user root.
 
That looks quite alright - but as written, the information should rather be in `dmesg`, the journal (`journalctl -r`), or the kernel log (`/var/log/kern.log`).
 
dmesg and journalctl -r only give me results from two hours ago - I've rebooted the server :-(

And there is no info for that time in /var/log/kern.log either.
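
For next time: the journal can be made persistent so it survives reboots (a minimal sketch; the defaults are described in journald.conf(5)):

Code:
mkdir -p /var/log/journal            # with Storage=auto (the default), this directory enables persistent logging
systemctl restart systemd-journald
journalctl --list-boots              # earlier boots should be listed here from now on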
 
Also, run a manual scrub and see if the errors come back. If not, maybe everything is running fine again.
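
For example (assuming the pool is named rpool, as above):

Code:
zpool scrub rpool
zpool status rpool    # check the READ/WRITE/CKSUM counters once the scrub has finished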

I had similar problems with drives that ran too hot; after they cooled down a bit, they worked flawlessly.
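
If you suspect heat, the current drive temperature is part of the SMART attributes, e.g. (again assuming /dev/sda):

Code:
smartctl -A /dev/sda | grep -i temperature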