ZFS device fault for pool BUT no SMART errors on drive

Jota V.

Hi, on my Proxmox cluster one node threw this email:

Code:
The number of I/O errors associated with a ZFS device exceeded acceptable levels. ZFS has marked the device as faulted.

impact: Fault tolerance of the pool may be compromised.
    eid: 445648
  class: statechange
  state: FAULTED
   host: node01
   time: 2020-04-12 07:04:42+0200
  vpath: /dev/sda1
  vphys: pci-0000:00:1f.2-ata-1
  vguid: 0x7F878357D147A4A5
  devid: ata-WDC_WD2004FBYZ-01YCBB1_xxxxxxxxxxxxxxxxxxxxx-part1
   pool: 0xF0A3E89FFACE50B7
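For reference, the individual events behind a mail like this can be inspected on the node itself with the standard OpenZFS event log:

Code:
zpool events -v    # list recent ZFS events with full details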

We've run a long (extended) SMART self-test on this drive and got no errors.
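That presumably means something along these lines (device name taken from the mail above):

Code:
smartctl -t long /dev/sda    # start the extended self-test
smartctl -a /dev/sda         # read back attributes and the self-test log once it finishes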

Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   175   174   021    Pre-fail  Always       -       4250
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       21
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   057   057   000    Old_age   Always       -       31593
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       21
 16 Total_LBAs_Read         0x0022   014   186   000    Old_age   Always       -       758610618042
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       9
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       389
194 Temperature_Celsius     0x0022   123   108   000    Old_age   Always       -       24
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     31583         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

And `zpool status` gives:

Code:
root@node01:~# zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 0 days 03:02:37 with 0 errors on Sun Apr 12 03:26:39 2020
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            sda     FAULTED      0    51     0  too many errors
            sdb     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

errors: No known data errors

Should we replace this drive, or is it a false alarm?
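If we do end up replacing it, I assume it would be the usual procedure, roughly like this (the by-id path is just a placeholder, not our actual disk):

Code:
zpool replace rpool sda /dev/disk/by-id/ata-NEW_DISK    # swap the faulted vdev for the new disk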

Thanks.
 
Reading up on this, I ran a

zpool clear rpool

and got this:

Code:
ZFS has finished a resilver:

   eid: 478916
 class: resilver_finish
  host: node01
  time: 2020-04-14 08:38:33+0200
  pool: rpool
 state: ONLINE
  scan: resilvered 118M in 0 days 00:00:41 with 0 errors on Tue Apr 14 08:38:33 2020
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

errors: No known data errors
 
Should we replace this drive, or is it a false alarm?
Difficult to tell - especially if you don't know where the error originated... Did `dmesg`/the journal (`journalctl -r`) show any errors?
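If the logs are still there, something along these lines should narrow it down (time window and ATA port taken from the mail above):

Code:
journalctl -k --since "2020-04-12 06:50" --until "2020-04-12 07:15"    # kernel messages around the fault
dmesg | grep -iE 'ata1|sda'                                            # the disk sits on ata-1 per the vphys field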

Apart from that, it depends on how important the data on the pool is.

It is definitely a moment to check your backups and see if they are working.

That said, I had one occasion where a disk in a zpool threw a few errors but went back to normal after a `zpool clear` (and stayed that way for over 6 months since).

I hope this helps!
 
Looking into /var/log/daemon.log around the time the email was received, I got many lines like this:

Code:
Apr 12 07:00:04 node01 pmxcfs[1759]: [status] notice: received log
Apr 12 07:00:05 node01 pmxcfs[1759]: [status] notice: received log
Apr 12 07:00:05 node01 pmxcfs[1759]: [status] notice: received log
Apr 12 07:00:05 node01 systemd[1]: Started Session 1398760 of user root.
Apr 12 07:00:08 node01 zed: eid=445474 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:10 node01 zed: eid=445475 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:10 node01 systemd[1]: session-1398760.scope: Succeeded.
Apr 12 07:00:11 node01 zed: eid=445476 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:11 node01 zed: eid=445477 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:11 node01 zed: eid=445478 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:12 node01 systemd[1]: Started Session 1398761 of user root.
Apr 12 07:00:13 node01 zed: eid=445479 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:15 node01 systemd[1]: session-1398761.scope: Succeeded.
Apr 12 07:00:15 node01 systemd[1]: Started Session 1398762 of user root.
Apr 12 07:00:16 node01 systemd[1]: session-1398762.scope: Succeeded.
Apr 12 07:00:16 node01 systemd[1]: Started Session 1398763 of user root.
Apr 12 07:00:17 node01 systemd[1]: session-1398763.scope: Succeeded.
Apr 12 07:00:17 node01 systemd[1]: Started Session 1398764 of user root.
Apr 12 07:00:19 node01 zed: eid=445480 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:21 node01 zed: eid=445481 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:21 node01 zed: eid=445482 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:21 node01 systemd[1]: session-1398764.scope: Succeeded.
Apr 12 07:00:21 node01 zed: eid=445483 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:22 node01 zed: eid=445484 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:22 node01 systemd[1]: Started Session 1398765 of user root.
Apr 12 07:00:23 node01 zed: eid=445485 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:24 node01 systemd[1]: session-1398765.scope: Succeeded.
Apr 12 07:00:24 node01 systemd[1]: Started Session 1398766 of user root.
Apr 12 07:00:25 node01 systemd[1]: session-1398766.scope: Succeeded.
Apr 12 07:00:25 node01 systemd[1]: Started Session 1398767 of user root.
Apr 12 07:00:26 node01 systemd[1]: session-1398767.scope: Succeeded.
Apr 12 07:00:27 node01 systemd[1]: Started Session 1398768 of user root.
Apr 12 07:00:29 node01 zed: eid=445486 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:31 node01 zed: eid=445487 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:31 node01 zed: eid=445488 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:31 node01 systemd[1]: session-1398768.scope: Succeeded.
Apr 12 07:00:31 node01 zed: eid=445489 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:32 node01 zed: eid=445490 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:32 node01 systemd[1]: Started Session 1398769 of user root.
Apr 12 07:00:33 node01 zed: eid=445491 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:34 node01 systemd[1]: session-1398769.scope: Succeeded.
Apr 12 07:00:34 node01 systemd[1]: Started Session 1398770 of user root.
Apr 12 07:00:34 node01 systemd[1]: session-1398770.scope: Succeeded.
Apr 12 07:00:34 node01 systemd[1]: Started Session 1398771 of user root.
Apr 12 07:00:35 node01 systemd[1]: session-1398771.scope: Succeeded.
Apr 12 07:00:36 node01 systemd[1]: Started Session 1398772 of user root.
Apr 12 07:00:38 node01 zed: eid=445492 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:39 node01 zed: eid=445493 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:40 node01 zed: eid=445494 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:40 node01 systemd[1]: session-1398772.scope: Succeeded.
Apr 12 07:00:40 node01 zed: eid=445495 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:40 node01 zed: eid=445496 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:41 node01 systemd[1]: Started Session 1398773 of user root.
Apr 12 07:00:42 node01 zed: eid=445497 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:42 node01 systemd[1]: session-1398773.scope: Succeeded.
Apr 12 07:00:42 node01 systemd[1]: Started Session 1398774 of user root.
Apr 12 07:00:43 node01 systemd[1]: session-1398774.scope: Succeeded.
Apr 12 07:00:43 node01 systemd[1]: Started Session 1398775 of user root.
Apr 12 07:00:44 node01 systemd[1]: session-1398775.scope: Succeeded.
Apr 12 07:00:49 node01 systemd[1]: Started Session 1398776 of user root.
Apr 12 07:00:51 node01 zed: eid=445498 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:53 node01 zed: eid=445499 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:53 node01 zed: eid=445500 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:53 node01 zed: eid=445501 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:53 node01 zed: eid=445502 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:53 node01 zed: eid=445503 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:53 node01 systemd[1]: session-1398776.scope: Succeeded.
Apr 12 07:00:53 node01 zed: eid=445504 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:54 node01 zed: eid=445505 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:54 node01 systemd[1]: Started Session 1398777 of user root.
Apr 12 07:00:54 node01 zed: eid=445506 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:54 node01 zed: eid=445507 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:55 node01 zed: eid=445508 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:00:55 node01 systemd[1]: session-1398777.scope: Succeeded.
Apr 12 07:00:55 node01 systemd[1]: Started Session 1398778 of user root.
Apr 12 07:00:56 node01 systemd[1]: session-1398778.scope: Succeeded.
Apr 12 07:00:56 node01 systemd[1]: Started Session 1398779 of user root.
Apr 12 07:00:57 node01 systemd[1]: session-1398779.scope: Succeeded.
Apr 12 07:00:57 node01 systemd[1]: Started Session 1398780 of user root.
Apr 12 07:00:59 node01 zed: eid=445509 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:01:00 node01 systemd[1]: Starting Proxmox VE replication runner...
Apr 12 07:01:01 node01 zed: eid=445510 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:01:01 node01 systemd[1]: session-1398780.scope: Succeeded.
Apr 12 07:01:01 node01 zed: eid=445511 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:01:01 node01 systemd[1]: pvesr.service: Succeeded.
Apr 12 07:01:01 node01 systemd[1]: Started Proxmox VE replication runner.
Apr 12 07:01:01 node01 zed: eid=445512 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:01:01 node01 zed: eid=445513 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:01:01 node01 systemd[1]: Started Session 1398804 of user root.
Apr 12 07:01:03 node01 zed: eid=445514 class=history_event pool_guid=0xF0A3E89FFACE50B7  
Apr 12 07:01:03 node01 systemd[1]: session-1398804.scope: Succeeded.
Apr 12 07:01:03 node01 systemd[1]: Started Session 1398817 of user root.
Apr 12 07:01:04 node01 systemd[1]: session-1398817.scope: Succeeded.
Apr 12 07:01:04 node01 systemd[1]: Started Session 1398818 of user root.
 
That looks quite alright - but as written, the information should rather be in `dmesg`, the journal (`journalctl -r`), or the kernel log (`/var/log/kern.log`).
 
dmesg and journalctl -r only give me results from the last two hours; I had rebooted the server :-(

No info for that time in /var/log/kern.log either.
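For next time: the journal can be made to survive reboots by enabling persistent storage (standard systemd setup, sketched here):

Code:
mkdir -p /var/log/journal                              # presence of this dir switches journald to persistent storage
systemd-tmpfiles --create --prefix /var/log/journal    # fix up ownership and ACLs
systemctl restart systemd-journald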
 
Also run a manual scrub and see if the errors come back. If not, maybe everything is running fine again.
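Something like:

Code:
zpool scrub rpool     # start the scrub
zpool status rpool    # check progress and, later, the result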

I had similar problems with drives that got too hot; after they cooled down a bit, they worked flawlessly.
 
