Help! I'm Being Spammed with SMART Email Notifications!

mattlach

Renowned Member
Mar 23, 2016
Hey all,

First let me explain my setup. My Proxmox box boots off a ZFS mirror of two 500GB SSDs. I also have a secondary ZFS pool for data storage, consisting of 12 spinning disks, two SSDs in a mirrored SLOG (ZIL), and two SSDs as L2ARC. I am in the middle of a slow project to replace my old 4TB drives, one by one, with new 10TB drives and resilver so the pool grows.

On average I replace one disk per week. ~6 weeks in, I am halfway through this project.
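For reference, each swap is just the standard ZFS replace-and-resilver routine; roughly something like this, using the pool name from below but made-up device IDs:

Code:
  # let the pool grow automatically once every disk in a vdev is bigger
  zpool set autoexpand=on zfshome

  # swap the old drive for the new one and start the resilver
  zpool replace zfshome ata-WDC_WD40EFRX-68WT0N0_WD-XXXXXXXX ata-ST10000NM0016-1TT101_ZAXXXXXX

  # watch resilver progress
  zpool status zfshome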

Original configuration prior to start of project:

Code:
  pool: rpool
config:

    NAME                                STATE     READ WRITE CKSUM
    rpool                               ONLINE       0     0     0
      mirror-0                          ONLINE       0     0     0
        ata-Samsung_SSD_850_EVO_500GB   ONLINE       0     0     0
        ata-Samsung_SSD_850_EVO_500GB   ONLINE       0     0     0


  pool: zfshome
config:

    NAME                                STATE     READ WRITE CKSUM
    zfshome                             ONLINE       0     0     0
      raidz2-0                          ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0        ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0        ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0        ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0        ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0        ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0        ONLINE       0     0     0
      raidz2-1                          ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0        ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0        ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0        ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0        ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0        ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0        ONLINE       0     0     0
    logs
      mirror-2                          ONLINE       0     0     0
        ata-INTEL_SSDSC2BA100G3         ONLINE       0     0     0
        ata-INTEL_SSDSC2BA100G3         ONLINE       0     0     0
    cache
      ata-Samsung_SSD_850_PRO_512GB     ONLINE       0     0     0
      ata-Samsung_SSD_850_PRO_512GB     ONLINE       0     0     0

Current configuration of storage pool now that I am halfway through:

Code:
  pool: zfshome
config:

    NAME                                STATE     READ WRITE CKSUM
    zfshome                             ONLINE       0     0     0
      raidz2-0                          ONLINE       0     0     0
        ata-ST10000NM0016-1TT101        ONLINE       0     0     0
        ata-ST10000NM0016-1TT101        ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0        ONLINE       0     0     0
        ata-ST10000NM0016-1TT101        ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0        ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0        ONLINE       0     0     0
      raidz2-1                          ONLINE       0     0     0
        ata-ST10000NM0016-1TT101        ONLINE       0     0     0
        ata-ST10000NM0016-1TT101        ONLINE       0     0     0
        ata-ST10000NM0016-1TT101        ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0        ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0        ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0        ONLINE       0     0     0
    logs
      mirror-2                          ONLINE       0     0     0
        ata-INTEL_SSDSC2BA100G3         ONLINE       0     0     0
        ata-INTEL_SSDSC2BA100G3         ONLINE       0     0     0
    cache
      ata-Samsung_SSD_850_PRO_512GB     ONLINE       0     0     0
      ata-Samsung_SSD_850_PRO_512GB     ONLINE       0     0     0

As you can see, six of the old WD Red 4TB drives have been replaced with 10TB Seagate Enterprise drives.

I've gotten a few spurious email notifications of SMART errors in the past, but this morning it really took off. My phone kept buzzing with SMART error notifications:

[Screenshot: a burst of SMART error notification emails on my phone]

Every single one of the SMART notifications is for a WD drive with a serial number that is no longer in the system; some of those drives haven't been in the system for six weeks! When I log on to the Proxmox server and look at drive status and pool status, everything is fine.

The ZFS resilver notification seems legit, but for some reason it arrived about three days late.

Any clue what is going on here? Is smartd getting confused because I hot-swapped the drives and Linux is reusing the same /dev/sdX device names? Can I restart smartd to force it to take a fresh look at which drives are actually in the system?

Also, why are the email notifications that ARE legitimate usually about three days late?

Appreciate any help!

Thanks,
Matt
 
Hi,

Maybe those emails could not be sent at the right moment (when the disks first started showing SMART errors). Then you replaced the disks, and only after that was your SMTP server able to send the mails. Check your mail server log and you should find confirmation of this.
A few side notes about replacing old HDDs with new ones:
- before a replace, I test my new disks with badblocks for several days, plus SMART self-tests, and only if a disk passes both tests is it ready to use (see the sketch at the end of this post)
- after a successful replace I run a scrub
- in your case (a raid10-like zpool of two raidz2 vdevs) you can replace two disks at the same time, one in each raidz2
- because ZFS is smart, during the replacement (without removing the old disk first) the parity/redundant data is actually better protected; as a simple example, the metadata will be present on all your old disks plus the new disk until the resilver finishes
- and maybe the most important thing: do not use identical disks from a single manufacturer. In my case I use 50% from manufacturer A and the rest from manufacturer B, even for a 2-disk pool; with 4 or more disks I also try to buy from different vendors.

Also be careful: this pool layout is somewhat risky. Big HDDs (10 TB) will have read errors (as the manufacturers state in their data sheets), so the probability of hitting another bad disk while you are still replacing the first one is higher.
Purely from a safety perspective, it is better to use small vdevs with smaller disks.
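
Roughly, the test-then-scrub steps above look like this (device and pool names are only placeholders, and badblocks in write mode will destroy any data on the disk):

Code:
  # long SMART self-test on the new disk, then check the results when it finishes
  smartctl -t long /dev/sdX
  smartctl -a /dev/sdX

  # destructive write/read surface test -- only on an empty, unused disk!
  badblocks -wsv /dev/sdX

  # after the replace has finished resilvering, verify everything with a scrub
  zpool scrub zfshome
  zpool status zfshome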
 
Hi,

Maybe those emails could not be sent at the right moment (when the disks first started showing SMART errors). Then you replaced the disks, and only after that was your SMTP server able to send the mails. Check your mail server log and you should find confirmation of this.

I'll take a look, thanks, but I don't think this is it. It keeps resending warnings about the same disks (that are no longer in the system), with newer date stamps.

I suspect what is happening is that smartd hasn't noticed the drives have been replaced because it doesn't handle hot swaps well, so it is reading data from the same /dev/sdX without realizing the disk behind it has changed, and then flagging the drive as bad, since Seagate and WD use very different formats for their raw SMART attributes.
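
An easy way to check what is actually sitting behind a given /dev/sdX name right now (the device name here is just an example):

Code:
  # which physical disk (by id/serial) currently maps to /dev/sdc?
  ls -l /dev/disk/by-id/ | grep sdc

  # ask the drive itself for its model and serial number
  smartctl -i /dev/sdc | grep -iE 'model|serial'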

A few side notes about replacing old HDDs with new ones:
- before a replace, I test my new disks with badblocks for several days, plus SMART self-tests, and only if a disk passes both tests is it ready to use
- after a successful replace I run a scrub
- in your case (a raid10-like zpool of two raidz2 vdevs) you can replace two disks at the same time, one in each raidz2
- because ZFS is smart, during the replacement (without removing the old disk first) the parity/redundant data is actually better protected; as a simple example, the metadata will be present on all your old disks plus the new disk until the resilver finishes
- and maybe the most important thing: do not use identical disks from a single manufacturer. In my case I use 50% from manufacturer A and the rest from manufacturer B, even for a 2-disk pool; with 4 or more disks I also try to buy from different vendors.

Also be careful: this pool layout is somewhat risky. Big HDDs (10 TB) will have read errors (as the manufacturers state in their data sheets), so the probability of hitting another bad disk while you are still replacing the first one is higher. Purely from a safety perspective, it is better to use small vdevs with smaller disks.

I agree with most of these and disagree with others.

This is not a particularly risky pool, regardless of how large the disks get. With dual parity on both vdevs, the risk of either identical UREs in three places or of three disks failing in the same vdev is next to infinitesimal. Besides, that's why we have backups.

I also would not run with mismatched drives. You want the timing and performance of the drives to match as closely as possible, so identical drive part numbers are a must.

It is generally a good idea, though, to spread purchases of the same model across different retailers and over several weeks to a few months, to get as much diversity in the date codes as possible, so that one bad batch doesn't take out multiple drives at the same time.

I have been buying the disks two at a time from different retailers, with two weeks in between each purchase, so the overall process will take 12 weeks. Each group of two gets tested with a full barrage of badblocks and SMART tests before being resilvered in, one into each vdev to minimize risk.

I am very happy with this method. The chance of data loss is vanishingly small, and if it does occur the worst that will happen is the annoyance of having to restore from backup.

Remember, RAID is not backup. You still have to back your data up, regardless of what RAID configuration you use. RAID only protects against disk failure and UREs, but there are SO many more things that can go wrong with data storage: bad RAM, a bad controller writing bad data across all drives, file system errors, ransomware, an accidental "rm -fr /", fire, flood, etc.

The only way to secure your data is to have an offsite backup. All RAID protects against is the inconvenience of having to restore from that offsite backup.
 
I'd suggest restarting smartd with "systemctl restart smartd"; that should be enough for smartd to pick up the new drives.
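
For example (the journal check is just one way to confirm it picked them up):

Code:
  systemctl restart smartd

  # the startup log lines show which devices smartd is now monitoring
  journalctl -u smartd -n 50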
 
This is not a particularly risky pool, regardless of how large the disks get. With dual parity on both vdevs, the risk of either identical UREs in three places or of three disks failing in the same vdev is next to infinitesimal. Besides, that's why we have backups.


My fault, I was under the impression it was raidz1 ;) With raidz2 it is OK.
 
I'd suggest restarting smartd with "systemctl restart smartd"; that should be enough for smartd to pick up the new drives.

This seems to have done the trick, thank you.

Regarding the delayed notifications, I did some research and it turns out both my ISP and my VPN provider block port 25 to reduce the risk of email spamming, so I am not sure how they are getting through at all...
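
For anyone else running into this, the stuck notifications should show up in the Postfix queue and mail log (assuming the default Postfix setup Proxmox ships with; paths may differ):

Code:
  # anything still sitting in the outgoing mail queue?
  postqueue -p

  # look for deferred smartd/ZFS notification mails
  grep -i 'status=deferred' /var/log/mail.log | tail -n 20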
 
Maybe they only block the SMTP port for traffic coming in to your host and not for what you send out (a bad decision if you ask me, and one of the biggest causes of spam in my opinion).
 
