No ZFS Degraded Notifications

pbo10

New Member
Aug 30, 2019
I have ZED configured to send notifications, and I successfully receive one when a scrub completes, so I know the email settings are fine. However, I get no notification when a ZFS volume is degraded.

As far as I can tell from reading about this, they should be sent. In the /etc/zfs/zed.d folder there is the script (scrub_finish-notify.sh) for the notification on scrub completion, which works fine. I also see one called "statechange-notify.sh", which appears to be designed to send a notification when the volume becomes "DEGRADED", "FAULTED" or "REMOVED", but I'm not receiving those notifications.

I've tested this today by creating a new ZFS volume and just pulling out one of the drives so the volume becomes degraded and still got nothing.

Does anyone know how to get them working? Are there any changes I need to make or should this work with default settings already?
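
As a first check, it can help to see which notification settings are actually active. The following is only a minimal sketch: it assumes the settings live in /etc/zfs/zed.d/zed.rc and that the relevant variables start with ZED_EMAIL_ or ZED_NOTIFY_ (for example ZED_EMAIL_ADDR, ZED_NOTIFY_INTERVAL_SECS and ZED_NOTIFY_VERBOSE). The function name and the demo values are made up for illustration, and this does not by itself explain the missing DEGRADED e-mails.

```shell
#!/bin/bash
# Sketch: print the active (uncommented) ZED e-mail/notify settings.
# The default path is an assumption; pass another file as the first argument.
zed_notify_settings() {
  rc="${1:-/etc/zfs/zed.d/zed.rc}"
  grep -E '^[[:space:]]*ZED_(EMAIL|NOTIFY)_[A-Z_]+=' "$rc"
}

# Demo against a throwaway file with made-up values, so it runs anywhere.
demo_rc=$(mktemp)
cat > "$demo_rc" <<'EOF'
ZED_EMAIL_ADDR="root"
#ZED_NOTIFY_INTERVAL_SECS=3600
ZED_NOTIFY_VERBOSE=1
EOF
active=$(zed_notify_settings "$demo_rc")
printf '%s\n' "$active"
rm -f "$demo_rc"
```

On a real host, calling zed_notify_settings with no argument reads the default file. Commented-out lines are skipped on purpose, since they fall back to ZED's built-in defaults.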
 
Hi, that looks like it would be great, but it seems to be written for FreeBSD, so I'm not sure it would run on a Proxmox host. I'm surprised there's nothing built in that works out of the box; I'm sure almost everyone would want alerts if drives fail or are removed from the system.
 
Hi,

I am using the script on the link I sent you and it works great on Proxmox.

I made some customizations for email alert.

If you want I can send you the version I'm using.

I also "customized" another script, based on this one I sent you but for BTrFS so I can receive alerts from the various file systems I'm using on different machines.

Regards,

Ricardo Jorge
 
OK, great, thanks for letting me know. I'll try out the one you linked to tonight; I think that will do enough for me. I basically just want email alerts any time a ZFS pool changes state to DEGRADED, no matter the reason, so a disk failure or even a disk being removed should send me an alert.

I think that script should do all that, but I'll test it out tonight. Thanks again.
 
I don't want to hijack this thread, but @ricardoj, I'm also using that script and it works great. Is there a way to only get an email when the pool degrades? I'm currently getting an email every day, even if everything is fine.
 
Hi,

The "version" I'm using sends e-mail only when there is a "problem".

The crontab entry looks like this:

7 * * * * /root/ZFS/zfs_test.sh

I have been using this script to send e-mail for years.

The bash code:

Bash:
#!/bin/bash
#
# https://gist.github.com/petervanderdoes/bd6660302404ed5b094d
#
problems=0
emailSubject="`hostname` - ZFS pool - HEALTH check"
emailMessage=""

#
ZFS_LOG="/root/ZFS/ZFS-LOG.txt"
#

# Health - Check if all zfs volumes are in good condition. We are looking for
# any keyword signifying a degraded or broken array.

condition=$(/sbin/zpool status | egrep -i '(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED|corrupt|cannot|unrecover)')

if [ "${condition}" ]; then
  emailSubject="$emailSubject - fault"
  problems=1
fi

#

# Capacity - Make sure pool capacities are below 80% for best performance. The
# percentage really depends on how large your volume is. If you have a 128GB
# SSD then 80% is reasonable. If you have a 60TB raid-z2 array then you can
# probably set the warning closer to 95%.
#
# ZFS uses a copy-on-write scheme. The file system writes new data to
# sequential free blocks first and when the uberblock has been updated the new
# inode pointers become valid. This method is true only when the pool has
# enough free sequential blocks. If the pool is at capacity and space limited,
# ZFS will have to write blocks randomly. This means ZFS can not create an
# optimal set of sequential writes and write performance is severely impacted.

maxCapacity=80

if [ ${problems} -eq 0 ]; then
  capacity=$(/sbin/zpool list -H -o capacity)
  for line in ${capacity//%/}
  do
    if [ $line -ge $maxCapacity ]; then
      emailSubject="$emailSubject - Capacity Exceeded"
      problems=1
    fi
  done
fi

# Errors - Check the columns for READ, WRITE and CKSUM (checksum) drive errors
# on all volumes and all drives using "zpool status". If any non-zero errors
# are reported an email will be sent out. You should then look to replace the
# faulty drive and run "zpool scrub" on the affected volume after resilvering.

if [ ${problems} -eq 0 ]; then
  errors=$(/sbin/zpool status | grep ONLINE | grep -v state | awk '{print $3 $4 $5}' | grep -v 000)
  if [ "${errors}" ]; then
    emailSubject="$emailSubject - Drive Errors"
    problems=1
  fi
fi

# Scrub Expired - Check if all volumes have been scrubbed in at least the last
# 8 days. The general guide is to scrub volumes on desktop quality drives once
# a week and volumes on enterprise class drives once a month. You can always
# use cron to schedule "zpool scrub" in off hours. We scrub our volumes every
# Sunday morning for example.
#
# Scrubbing traverses all the data in the pool once and verifies all blocks can
# be read. Scrubbing proceeds as fast as the devices allow, though the
# priority of any I/O remains below that of normal calls. This operation might
# negatively impact performance, but the file system will remain usable and
# responsive while scrubbing occurs. To initiate an explicit scrub, use the
# "zpool scrub" command.
#
# The scrubExpire variable is in seconds. So for 8 days we calculate 8 days
# times 24 hours times 3600 seconds to equal 691200 seconds.

##scrubExpire=691200
#
# 2764800 => 32 days
#
scrubExpire=2764800

if [ ${problems} -eq 0 ]; then
  currentDate=$(date +%s)
  zfsVolumes=$(/sbin/zpool list -H -o name)

  for volume in ${zfsVolumes}
  do
    if [ $(/sbin/zpool status $volume | egrep -c "none requested") -ge 1 ]; then
      echo "ERROR: You need to run \"zpool scrub $volume\" before this script can monitor the scrub expiration time."
      break
    fi
##    if [ $(/sbin/zpool status $volume | egrep -c "scrub in progress|resilver") -ge 1 ]; then
    if [ $(/sbin/zpool status $volume | egrep -c "scrub in progress") -ge 1 ]; then
      break
    fi

    ### FreeBSD with *nix supported date format
    #scrubRawDate=$(/sbin/zpool status $volume | grep scrub | awk '{print $15 $12 $13}')
    #scrubDate=$(date -j -f '%Y%b%e-%H%M%S' $scrubRawDate'-000000' +%s)

    ### Ubuntu with GNU supported date format
    scrubRawDate=$(/sbin/zpool status $volume | grep scrub | awk '{print $13" "$14" " $15" " $16" "$17}')
    scrubDate=$(date -d "$scrubRawDate" +%s)

    if [ $(($currentDate - $scrubDate)) -ge $scrubExpire ]; then
      if [ ${problems} -eq 0 ]; then
        emailSubject="$emailSubject - Scrub Time Expired. Scrub Needed on Volume(s)"
      fi
      problems=1
      emailMessage="${emailMessage}Pool: $volume needs scrub \n"
    fi
  done
fi

# Notifications - On any problems send email with drive status information and
# capacities including a helpful subject line to root. Also use logger to write
# the email subject to the local logs. This is the place you may want to put
# any other notifications like:
#
# + Update an anonymous twitter account with your ZFS status (https://twitter.com/zfsmonitor)
# + Playing a sound file or beep the internal speaker
# + Update Nagios, Cacti, Zabbix, Munin or even BigBrother


if [ "$problems" -ne 0 ]; then
  logger $emailSubject
  echo -e "$emailSubject\t$emailMessage" >> $ZFS_LOG
  # Notify via email
  #
  /bin/bash /root/ZFS/notifica-zfs.sh
fi

Regards,

Ricardo Jorge
 
I've made small changes to the Notification section of your script: using /usr/bin/pvemailforward makes it a bit more `standard` in Proxmox, and I added a tag (-t) for logger, which helps when looking for the messages in syslog.
Bash:
if [ "$problems" -ne 0 ]; then
  logger -t "zfs status notifier" "$emailSubject  (see $ZFS_LOG for more info.)"
  echo -e "$emailSubject\t$emailMessage" >> $ZFS_LOG
  # Notify via email
  echo -e "Subject: $emailSubject\n\n$emailMessage" | /usr/bin/pvemailforward
fi
 
When I run this, I get:
date: invalid date ‘errors on Sun Nov 8’
./zfs_health.sh: line 114: 1606208467 - : syntax error: operand expected (error token is "- ")

Are you saying that the exact and complete error message is :
Code:
date: invalid date ‘errors on Sun Nov 8’
./zfs_health.sh: line 114: 1606208467 - : syntax error: operand expected (error token is "- ")
That would not make any kind of sense...
 
@ozgurerdogan, this makes no sense to me, because date: invalid date ‘errors on Sun Nov 8’ isn't an error message.
Those lines, and in fact the entire script works just fine on a Proxmox system.

So, once again, is what you pasted the exact and complete error message ?

EDIT :
did some more digging, it looks as if line 105 is the one that is actually the culprit in your case.
Line 105 should be :
scrubRawDate=$(/sbin/zpool status $volume | grep scrub | awk '{print $13" "$14" " $15" " $16" "$17}')
Please verify on your system.
If that matches, then what is the output of :
echo $(/sbin/zpool status rpool | grep scrub | awk '{print $13" "$14" " $15" " $16" "$17}')
Take care to replace `rpool` with an actually existing poolname.

EDIT2 : just guessing, do you have funky pool names, perhaps including spaces ?
please show output of echo $(/sbin/zpool list -H -o name)
 
OK, I re-saved the bash script without a BOM and it ran without errors. I also replaced the notification section with the "/usr/bin/pvemailforward" version above.

But to test it, I set maxCapacity=10 and ran it.
I got an email with subject of: s8 - ZFS pool - HEALTH check - Capacity Exceeded - Capacity Exceeded - Capacity Exceeded
with empty mail content. It does not say which pool has exceeded capacity. Is this normal?
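
Looking at the script as posted, both symptoms follow from the capacity loop: it appends " - Capacity Exceeded" to the subject once per pool over the limit, and it never writes anything to emailMessage (only the scrub check does). The following is a rough sketch of one way to name the pools and set the subject once. The function name check_capacity and the sample pool names are made up for illustration, and it assumes the tab-separated output of `zpool list -H -o name,capacity`.

```shell
#!/bin/bash
# Sketch: report each pool over the threshold once, by name.
# Reads tab-separated "name<TAB>NN%" lines on stdin, the layout expected
# from `zpool list -H -o name,capacity`, so it can be tried without a pool.
maxCapacity=80
tab=$(printf '\t')

check_capacity() {
  message=""
  while IFS="$tab" read -r name cap; do
    cap="${cap%\%}"                          # strip the trailing percent sign
    case "$cap" in ''|*[!0-9]*) continue ;; esac   # skip unexpected values
    if [ "$cap" -ge "$maxCapacity" ]; then
      message="${message}Pool: $name is at ${cap}% capacity\n"
    fi
  done
  printf '%s' "$message"
}

# Made-up sample data; on a real host use:
#   emailMessage=$(/sbin/zpool list -H -o name,capacity | check_capacity)
emailSubject="ZFS pool - HEALTH check"
emailMessage=$(printf 'rpool\t12%%\ntank\t85%%\nbackup\t91%%\n' | check_capacity)
if [ -n "$emailMessage" ]; then
  emailSubject="$emailSubject - Capacity Exceeded"   # appended once only
fi
printf '%b\n' "$emailSubject" "$emailMessage"
```

The message keeps the literal \n separators used elsewhere in the script, so it can be passed through the existing echo -e lines unchanged.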
 
Hi, I'm bumping this thread as I'm also interested in this, but I must have missed something, as I can't find any ZFS log file. Should I enable the debug log in zed.rc?
 
I don't find any zfs log file
There is no zfs log file.
The script mentioned above _creates_ a log file, at /root/ZFS/ZFS-LOG.txt, but only if any problems were detected.
If you edit the script to include the changes I proposed, then you will also get logging on problems in /var/log/syslog.
 
Mmmh :rolleyes: Yes.... Sorry, I was completely wrong, looking for actions at the beginning of the script ! Thank you...
 
It works fine, great ! One last word... Here I need to use egrep without -i because of the following output matching "UNAVAIL".
status: Some supported features are not enabled on the pool. The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'.
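
An alternative to tuning the grep pattern is to avoid matching the free-text status and action lines at all, and to look only at each pool's health property. This is just a sketch, not what the script above does: the function name unhealthy_pools and the sample data are made up, and it assumes the tab-separated output of `zpool list -H -o name,health`. It also drops the extra keywords (corrupt, cannot, unrecover) that the original pattern looks for in the status text.

```shell
#!/bin/bash
# Sketch: list pools whose health is anything other than ONLINE.
# Expects tab-separated "name<TAB>health" lines on stdin, the layout
# expected from `zpool list -H -o name,health`.
unhealthy_pools() {
  awk -F'\t' '$2 != "ONLINE" { print $1 ": " $2 }'
}

# Made-up sample data; on a real host use:
#   condition=$(/sbin/zpool list -H -o name,health | unhealthy_pools)
condition=$(printf 'rpool\tONLINE\ntank\tDEGRADED\n' | unhealthy_pools)
printf '%s\n' "$condition"
```

Because the comparison is against the exact health value, a notice such as "some features are unavailable" can no longer trigger a false alarm.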
 
Hello, I tried this script on the latest version of Proxmox, and after Proxmox ran a scrub, I began to get the following error from cron:

date: invalid date ‘10 00:26:03 2022 ’
/root/script/checkZFS.sh: line 108: 1649578501 - : syntax error: operand expected (error token is "- ")

This looks to be due to a change in the date format in the output of the command:
/sbin/zpool status $volume | grep scrub | awk '{print $13" "$14" " $15" " $16" "$17}'

So line 105 of the script should be changed from:
scrubRawDate=$(/sbin/zpool status $volume | grep scrub | awk '{print $13" "$14" " $15" " $16" "$17}')

to:
scrubRawDate=$(/sbin/zpool status $volume | grep scrub | awk '{print $12" "$13" "$14" " $15" " $16" "$17}')
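
Since the field positions have shifted between versions, a less fragile option might be to take everything after the last " on " in the scan line instead of counting awk fields. This is only a sketch: the function name is made up, and the two sample lines are assumptions modeled on the error messages quoted in this thread, not output captured from a real pool.

```shell
#!/bin/bash
# Sketch: extract the completion timestamp from a "scan:" line of
# `zpool status` by taking the text after the last " on ".
# Returns non-zero if the line has no " on " (e.g. scrub in progress).
scrub_date_from_scan_line() {
  case "$1" in
    *" on "*) printf '%s\n' "${1##* on }" ;;
    *) return 1 ;;
  esac
}

# Illustrative lines in the older and newer layouts (assumed values).
old_fmt="  scan: scrub repaired 0B in 0 days 00:24:03 with 0 errors on Sun Nov  8 00:48:05 2020"
new_fmt="  scan: scrub repaired 0B in 00:02:01 with 0 errors on Sun Apr 10 00:26:03 2022"

scrub_date_from_scan_line "$old_fmt"
scrub_date_from_scan_line "$new_fmt"
```

In the script, the result could then be fed to GNU date as before, for example scrubDate=$(date -d "$(scrub_date_from_scan_line "$scanLine")" +%s), where scanLine is a hypothetical variable holding the scan line for the pool.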
 
