SMART warnings & disk usage identification

loneboat · Aug 5, 2020

My email notifications have been broken since I set up my cluster a year ago. I recently fixed them, and was immediately notified that one node in my cluster has a failed sector on one of its disks. The email reads:

Code:

This message was generated by the smartd daemon running on:

   host name:  XXXX
   DNS domain: XXXX.XXXX

The following warning/error was logged by the smartd daemon:

Device: /dev/sdc [SAT], 1 Offline uncorrectable sectors

Device info:
MB1000EBNCF, S/N:WCAW34980376, WWN:5-0014ee-25d3aa5a4, FW:HPG2, 1.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Fri Jul 24 03:55:44 2020 CDT
Another message will be sent in 24 hours if the problem persists.

Additionally, I have a notification about a "CurrentPendingSector" on the same drive:

Code:

This message was generated by the smartd daemon running on:

   host name:  XXXX
   DNS domain: XXXX.XXXX

The following warning/error was logged by the smartd daemon:

Device: /dev/sdc [SAT], 2 Currently unreadable (pending) sectors

Device info:
MB1000EBNCF, S/N:WCAW34980376, WWN:5-0014ee-25d3aa5a4, FW:HPG2, 1.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Fri Mar 13 13:43:20 2020 CDT
Another message will be sent in 24 hours if the problem persists.
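Both emails point to smartctl for further investigation. A minimal sketch of that follow-up, assuming the drive reports the usual ATA attribute names (vendors can differ) and using a hypothetical helper name, smart_sector_counts:

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints the raw
# values of the sector-health attributes that smartd typically warns about.
# In the attribute table, column 2 is the name and column 10 is the raw value.
smart_sector_counts() {
  awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ { print $2, $10 }'
}

# Usage (needs root and the smartmontools package):
#   smartctl -A /dev/sdc | smart_sector_counts
#   smartctl -t long /dev/sdc      # start an extended self-test
#   smartctl -l selftest /dev/sdc  # read the self-test log once it finishes
```

Non-zero raw values that keep growing between checks are usually the stronger signal that the drive should be replaced.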

It has been quite a while since I set up this node, and I do not recall how I initially configured the drives, so I'm trying to figure out how this drive is used by the system (if at all) before I proceed with replacing it.

(Note that my knowledge of ZFS and filesystems in general is very limited, so I'm happy to be corrected if I'm doing something silly/wrong.)

Below is the output of zpool status. I don't see the reported drive's serial number (WCAW34980376) among the members of rpool, which is the only ZFS pool on this node.

Code:

root@XXXX:~# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 06:39:43 with 0 errors on Sun Jul 12 07:03:44 2020
config:

        NAME                                    STATE     READ WRITE CKSUM
        rpool                                   ONLINE       0     0     0
          mirror-0                              ONLINE       0     0     0
            ata-MB1000EBNCF_WCAW32027589-part3  ONLINE       0     0     0
            ata-MB1000GCEEK_WMAW31652546-part3  ONLINE       0     0     0
            ata-MB1000EBNCF_WCAW32016279-part3  ONLINE       0     0     0
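The serial-number comparison above can be scripted rather than done by eye. This is a rough sketch with a hypothetical helper name, serial_in_pool. It only works when the pool lists its members by /dev/disk/by-id names, as it does here:

```shell
# Hypothetical helper: reads `zpool status` output on stdin and reports
# whether the given serial number appears in any vdev name.
serial_in_pool() {
  if grep -qF -- "$1"; then
    echo "$1: listed in a pool"
  else
    echo "$1: not listed in any pool"
  fi
}

# Usage:
#   zpool status | serial_in_pool WCAW34980376
#   ls -l /dev/disk/by-id/ | grep WCAW34980376   # shows which /dev/sdX carries that serial
```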

Below is the output of fdisk -l. I note that this particular device (/dev/sdc) does not even show any partitions:

Code:

root@XXXX:~# fdisk -l
Disk /dev/sda: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: MB1000EBNCF
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: F5B532A3-E8C4-440F-85E2-2F657A60433B

Device       Start        End    Sectors  Size Type
/dev/sda1       34       2047       2014 1007K BIOS boot
/dev/sda2     2048    1050623    1048576  512M EFI System
/dev/sda3  1050624 1953525134 1952474511  931G Solaris /usr & Apple ZFS


Disk /dev/sdb: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: MB1000GCEEK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: EE6173E1-CC5F-451B-A2ED-0566C6E6F1A3

Device       Start        End    Sectors  Size Type
/dev/sdb1       34       2047       2014 1007K BIOS boot
/dev/sdb2     2048    1050623    1048576  512M EFI System
/dev/sdb3  1050624 1953525134 1952474511  931G Solaris /usr & Apple ZFS


Disk /dev/sdd: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: MB1000EBNCF
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: AFEBF8B2-7B03-40AF-AC51-957823D19213

Device       Start        End    Sectors  Size Type
/dev/sdd1       34       2047       2014 1007K BIOS boot
/dev/sdd2     2048    1050623    1048576  512M EFI System
/dev/sdd3  1050624 1953525134 1952474511  931G Solaris /usr & Apple ZFS


Disk /dev/sdc: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: MB1000EBNCF
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes




Disk /dev/zd0: 256 GiB, 274877906944 bytes, 536870912 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 8192 bytes
I/O size (minimum/optimal): 8192 bytes / 8192 bytes
Disklabel type: dos
Disk identifier: 0xb66b8038

Device     Boot   Start       End   Sectors   Size Id Type
/dev/zd0p1 *       2048   1187839   1185792   579M  7 HPFS/NTFS/exFAT
/dev/zd0p2      1187840 536868863 535681024 255.4G  7 HPFS/NTFS/exFAT


Disk /dev/zd16: 32 GiB, 34359738368 bytes, 67108864 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 8192 bytes
I/O size (minimum/optimal): 8192 bytes / 8192 bytes
Disklabel type: dos
Disk identifier: 0x539910db

Device      Boot    Start      End  Sectors Size Id Type
/dev/zd16p1 *        2048 62916607 62914560  30G 83 Linux
/dev/zd16p2      62918654 67106815  4188162   2G  5 Extended
/dev/zd16p5      62918656 67106815  4188160   2G 82 Linux swap / Solaris

Partition 2 does not start on physical sector boundary.

Does the above indicate that I'm not even using this particular disk? If so, I can just remove it for now, and replace it later.
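A missing partition table is a good hint, but a disk can also be used whole (for example as an LVM physical volume). One possible cross-check is to ask lsblk whether the disk carries any filesystem signature or mount point. The helper name report_usage is hypothetical, and an empty result is not proof that nothing uses the disk (raw passthrough to a VM, for instance, would not show up):

```shell
# Hypothetical helper: reads `lsblk -P -o NAME,FSTYPE,MOUNTPOINT` output on
# stdin and prints every line that has a filesystem signature or mount point.
# If no such line exists, it prints a short note instead.
report_usage() {
  grep -v 'FSTYPE="" MOUNTPOINT=""' || echo "no filesystem signature or mount point found"
}

# Usage:
#   lsblk -P -o NAME,FSTYPE,MOUNTPOINT /dev/sdc | report_usage
```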

Does this imply that this entire node is running on top of the rpool ZFS pool (including the OS itself)?
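One way to check this directly is to look at what is mounted at /. A small sketch, with a hypothetical helper name, root_fs_report:

```shell
# Hypothetical helper: reads `findmnt -n -o SOURCE,FSTYPE /` output on stdin
# (one line: source device or dataset, then filesystem type) and summarizes it.
root_fs_report() {
  awk '{
    if ($2 == "zfs") print "root is on ZFS dataset " $1
    else             print "root is on " $2 " (" $1 ")"
  }'
}

# Usage:
#   findmnt -n -o SOURCE,FSTYPE / | root_fs_report
```

If the source is a dataset under rpool, the OS itself lives on that pool.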

Thanks for any help!

H4R0 · Aug 6, 2020

From a quick look, the reported bad disk has no partitions, so it's not used. Looks like a hot spare?

You have 3 drives in a RAID1 (mirror) configuration. ZFS shows no errors and the scrub is OK, so everything is fine.

I would not use a hot spare in this case; you will have enough time to replace a bad disk. Just make sure zed is configured to send mails for bad disks, which is much more important than smartmon.
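For reference, zed reads its mail settings from /etc/zfs/zed.d/zed.rc. A minimal sketch of the relevant lines follows; the values are illustrative assumptions and should be adjusted to your own mail setup:

```shell
# /etc/zfs/zed.d/zed.rc (excerpt; values are examples, not recommendations)

# Address that receives ZFS event mails; "root" relies on local mail
# forwarding already being set up.
ZED_EMAIL_ADDR="root"

# Minimum number of seconds between notifications for similar events.
ZED_NOTIFY_INTERVAL_SECS=3600

# Also notify on events where the pool is still healthy,
# such as a scrub that finished without errors.
ZED_NOTIFY_VERBOSE=1
```

After editing the file, the daemon needs a restart. On Debian-based systems the unit is typically called zfs-zed, but treat that name as an assumption and check your own install.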

loneboat · Aug 6, 2020

H4R0 said:
From a quick look, the reported bad disk has no partitions, so it's not used. Looks like a hot spare?

You have 3 drives in a RAID1 (mirror) configuration. ZFS shows no errors and the scrub is OK, so everything is fine.

I would not use a hot spare in this case; you will have enough time to replace a bad disk. Just make sure zed is configured to send mails for bad disks, which is much more important than smartmon.

Aah ok. I did not know about zed, I'll look that up. Thanks!

And yes, I powered down, removed that drive, and powered back up. Everything seems fine, so I guess I wasn't using that drive after all.

Thanks again!
