No email notification for zfs status degraded

monokular

New Member
Apr 9, 2021
Hello,
I installed zfs-zed via # apt-get install zfs-zed and edited /etc/zfs/zed.d/zed.rc to uncomment ZED_EMAIL_ADDR="root" as described here: https://pve.proxmox.com/wiki/ZFS_on_Linux#_activate_e_mail_notification

zpool status shows the pool as DEGRADED, but I do not get an email. Other notifications from PVE arrive fine. I'm using postfix for email.

# zpool status
  pool: zpool1
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 04:37:39 with 0 errors on Sun Mar 14 05:01:41 2021
config:

        NAME        STATE     READ WRITE CKSUM
        zpool1      DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            sdb     ONLINE       0     0     0
            sdc     FAULTED      3     0     0  too many errors

errors: No known data errors
 
Was the pool already degraded when you installed and configured zed?
 
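zed is an event daemon, so it only sends a mail when the event itself occurs. A pool that was already degraded before zed was installed and configured will not trigger a notification.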
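Since an event daemon stays silent about a state that existed before it was set up, a one-off check right after configuring zed can close that gap. The following is only a minimal sketch: the helper name pools_need_attention is made up for illustration, and it assumes that zpool status -x prints exactly "all pools are healthy" when there is nothing to report.

```shell
#!/bin/sh
# Hypothetical helper: decide from the text of `zpool status -x` (read on
# stdin) whether any pool needs attention.
# Assumption: a healthy system prints exactly "all pools are healthy".
# Anything else, including empty input, is treated as "needs attention".
pools_need_attention() {
    status_text=$(cat)
    case "$status_text" in
        "all pools are healthy") return 1 ;;   # nothing to report
        *)                       return 0 ;;   # degraded, faulted, or unknown
    esac
}

# Usage on a real host (not executed here):
#   if zpool status -x | pools_need_attention; then
#       zpool status -x | mail -s "ZFS pool problem on $(hostname)" root
#   fi
```

The function only looks at text, so it can be tried with pasted output before it is wired to the real command.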
 
Hello, I can confirm that. I was testing my ZFS setup. When a disk was physically removed from its slot, nothing happened: the pool became DEGRADED without any notification. Only after the disk was replaced and the pool was resilvered did I get an email notification. For me that notification comes too late.
 
Wow, I just ran into this while testing Proxmox, as I am considering switching to it.

1. I am shocked that this has not been resolved in zed in over a year.
2. I am also disappointed that Proxmox does not come with a drive health alert/report system out of the box. You'd think this is kind of essential to have... o_O

I have now applied these changes: https://github.com/cbane/zfs/commit/f4f16389413061ed0b670df1cbd17954518a3096
to this file: /usr/lib/zfs-linux/zed.d/statechange-notify.sh
With that, I at least get a notification when a disk is disconnected or drops out.
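For readers who have not looked inside a zedlet: whether a state change produces a mail typically comes down to a filter on the new vdev state. The sketch below only illustrates the general shape of such a filter. The function name should_notify_state is hypothetical, and the list of states is an assumption for illustration, not a copy of the linked commit or of the shipped statechange-notify.sh.

```shell
#!/bin/sh
# Illustrative only: the set of states below is an assumption and is NOT
# taken from the linked commit or from the packaged zedlet.
should_notify_state() {
    case "$1" in
        FAULTED|DEGRADED|REMOVED|UNAVAIL) return 0 ;;   # worth a mail
        *)                                return 1 ;;   # ignore the rest
    esac
}

# zed passes event details to zedlets as ZEVENT_* environment variables.
# ZEVENT_VDEV_STATE_STR is assumed here to hold the new vdev state.
# Inside a zedlet the filter could be used like this (not executed here):
#   should_notify_state "${ZEVENT_VDEV_STATE_STR}" || exit 0
```

If a state you care about is missing from the real filter on your system, the event is dropped before any mail is attempted, which would match the behaviour described in this thread.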
 
Wow, this is crazy. I think I'm still running into this bug in 2022. Is there a way to get notifications when a drive goes dead?
 
Hi,
Still the issue... also on PBS.. whenever a ZFS degradation is detected there's no alert of any kind (other than checking on the host itself)
It should be fixed since ZFS 2.1.3 with this commit. What version are you using? Please also check your zed configuration.
 
Still the issue... also on PBS.. whenever a ZFS degradation is detected there's no alert of any kind (other than checking on the host itself)
It should work, but you need to make sure that mailutils is installed.

> apt install mailutils

Most of the time we use bsd-mailx instead of mailutils:

> apt install bsd-mailx

And check your /root/.forward settings.
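Before blaming zed, it can help to confirm that a command-line mailer exists at all and that a hand-sent test mail reaches root. A minimal sketch, assuming either package provides a mail or mailx command that accepts -s for the subject; the helper name first_available is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the first of the given command names that is
# found in PATH, or fail if none of them is.
first_available() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Usage on a real host (not executed here):
#   if mailer=$(first_available mail mailx); then
#       echo "test body" | "$mailer" -s "test from $(hostname)" root
#   else
#       echo "no mail command found; install mailutils or bsd-mailx" >&2
#   fi
```

If the test mail never arrives, the problem is in mail delivery (postfix, /root/.forward) rather than in zed.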
 
Hmm... actually, after a resilver (with the same disk on the same port) the Rz2 does seem to function for a couple of weeks.
Like now: it has been working flawlessly again for a week, but I'm sure the problem will turn up again.

I'm running PBS 2.2-5 ...
Code:
root@pbsu01:~# cat /sys/module/zfs/version
2.1.5-pve1
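Since the fix mentioned earlier in the thread is said to be in ZFS 2.1.3 and later, a version string like the one above can be compared against that threshold. This is a rough sketch: the helper name version_at_least is hypothetical, and it assumes sort -V from GNU coreutils is available and that the packaging suffix (such as -pve1) can simply be dropped.

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 is at least version $2.
# Everything from the first "-" on (e.g. "-pve1") is stripped first.
# Assumption: `sort -V` (GNU coreutils version sort) is available.
version_at_least() {
    have=${1%%-*}
    want=${2%%-*}
    # The smaller of the two versions sorts first; if that is the wanted
    # version, the installed one is equal or newer.
    lowest=$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}

# Usage on a real host (not executed here):
#   if version_at_least "$(cat /sys/module/zfs/version)" 2.1.3; then
#       echo "running ZFS module is 2.1.3 or newer"
#   fi
```

Note that 2.1.5-pve1 passes this check, so a version-only explanation would not cover the report above.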
 
I have "mixed results" ... some ZFS notifications work, some not. Here's what I tested:

Preparation
  • Using Proxmox VE 7.3-3
  • Ran "apt install mailutils" (as per the above suggestion)
  • Created a ZFS pool "local-zfs" with 3 disks using the PVE GUI
  • Migrated a VM disk to the pool (just to have some data there)
  • Tested the below 3 scenarios, all of which end in a degraded pool
Scenario 1 (working)
  • Command "zpool offline -f local-zfs ata-QEMU_HARDDISK_QM00005"
    --> Email with subject "ZFS device fault for pool 0xBDE81065A6D18BCB on pve-vm1"
  • Command "zpool clear local-zfs"
    --> Email with subject "ZFS resilver_finish event for local-zfs on pve-vm1"
Scenario 2 (not working)
  • Command "zpool offline local-zfs ata-QEMU_HARDDISK_QM00005"
    --> No email (even though pool shows as "degraded")
  • Command "zpool online local-zfs ata-QEMU_HARDDISK_QM00005"
    --> Email with subject "ZFS resilver_finish event for local-zfs on pve-vm1"
Scenario 3 (not working)
  • Shut down PVE node
  • Unplug one of 3 hard disks of the pool
  • Start the PVE node and modify some data on the degraded pool (to force resilvering)
    --> No email (even though pool shows degraded)
  • Shut down PVE node
  • Replug the unplugged hard disk
  • Start the PVE node
  • Command "zpool status" shows "scan: resilvered 464K in 00:00:00 with 0 errors on <timestamp of just now>"
    --> No email (even though resilvering finished for the pool as in scenario 1 and 2)
Conclusions
  • Device failures (done with zpool offline -f) do trigger a mail alert --> Good!
  • The fact that a pool is degraded does not trigger an alert
  • Missing member disks of a pool (e.g. after a reboot) do not trigger an alert
  • Resilvering completed in very short time right after a reboot does not trigger an alert
Question: Any ideas on how to solve that inconsistent behaviour?
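One workaround that does not depend on events at all is to poll the pool health on a schedule and mail whenever a pool is not ONLINE. The sketch below is only a suggestion and has not been tested against the three scenarios above. It assumes zpool list -H -o name,health prints one tab-separated "name, health" line per pool; the helper name unhealthy_pools is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "name<TAB>health" lines on stdin (the shape
# assumed for `zpool list -H -o name,health`) and print every pool whose
# health is anything other than ONLINE.
unhealthy_pools() {
    awk -F '\t' '$2 != "ONLINE" { print $1 ": " $2 }'
}

# Usage from a cron job on a real host (not executed here):
#   report=$(zpool list -H -o name,health | unhealthy_pools)
#   if [ -n "$report" ]; then
#       printf '%s\n' "$report" | mail -s "ZFS pool not ONLINE on $(hostname)" root
#   fi
```

Because this looks at the current state rather than at events, it would repeat the mail on every run until the pool is healthy again, which may or may not be what you want.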
 
I have ZED_NOTIFY_VERBOSE=1 in /etc/zfs/zed.d/zed.rc on my PVE node, and degrading the pool by unplugging a disk (Scenario 3 above) doesn't trigger an email notification on my system either.

On the other hand, I do receive the resilvering finished notification after the unplugged disk is plugged back in, but only if resilvering took longer than a few seconds. In my experience the resilvering finished notification is not sent if the resilvering process completed very quickly.
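Several posts here rely on settings in /etc/zfs/zed.d/zed.rc, and a setting that is still commented out is easy to overlook. As a small sketch, the hypothetical helper zed_setting below prints the value of an uncommented KEY=VALUE line; it assumes the file uses plain shell-style assignments with optional double quotes.

```shell
#!/bin/sh
# Hypothetical helper: print the value of an uncommented KEY=VALUE line in
# a zed.rc-style file, with surrounding double quotes removed.
# Fails if the key is missing or only appears in a commented-out line.
zed_setting() {
    file=$1
    key=$2
    # Take the last matching assignment, as a later line would override
    # an earlier one when the file is sourced.
    line=$(grep -E "^[[:space:]]*${key}=" "$file" | tail -n 1)
    [ -n "$line" ] || return 1
    value=${line#*=}
    value=${value#\"}
    value=${value%\"}
    printf '%s\n' "$value"
}

# Usage on a real host (not executed here):
#   zed_setting /etc/zfs/zed.d/zed.rc ZED_EMAIL_ADDR || echo "ZED_EMAIL_ADDR not set"
#   zed_setting /etc/zfs/zed.d/zed.rc ZED_NOTIFY_VERBOSE
```

After changing zed.rc, zed presumably needs a restart to pick up the new values; I am assuming the service is called zfs-zed, as the package name suggests.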
 
I have email notifications enabled and generally receive notifications (e.g. when backups complete, etc.) so there is nothing wrong with notifications and/or emails from proxmox in general.

BUT I have a problematic disk that has put the array into a degraded state several times. Not once was a notification sent. It was by sheer luck that I clicked the specific node, then Disks, then ZFS, and saw a tiny yellow "degraded" icon.

Judging from the thread above; it is crazy that something like sending a notification (and displaying a red banner in the GUI on all pages/globally) when a disk array is in a degraded state is not enabled by default or requires tinkering with config files. Storage failure is one of the worst things that can happen to a proxmox node. Next time I might not be as lucky to find the error in time.

You should not need to change a verbosity level. Besides, I already have ZED_NOTIFY_VERBOSE=1 set in zed.rc, and it still does not give me any notifications.
 
I ran those 3 tests with:
- proxmox-ve: 8.0.2
- zfs: zfs-2.1.12-pve1
- also mailutils and ZED_NOTIFY_VERBOSE=1 set as suggested above

I can confirm the very same results on Proxmox VE 8: for the degraded state, mails are sent only in Scenario 1, while for resilver events I get a mail every time.
I believe this is not a Proxmox problem but rather one of the underlying ZFS configuration. If someone knows how to solve this issue, it would help a lot.

Cheers!

 
I ran into this scenario today: one of our 14 nodes has a degraded ZFS array. I had made a habit of checking them daily, and today I got a hit. There was no e-mail notification of the event, even though the nodes do send notifications about updates etc.

Looks like I will have to do some digging on ZFS settings so it can send out e-mail notification. I will post it here if I find it.

That said, it would have been nice if there were a yellow icon on the node in the WebGUI datacenter view to give me a heads-up that there is an issue on that node. That is something VMware vCenter is able to do (I used it before I switched to Proxmox).

I also use Observium to monitor the ProxMox nodes and no degraded alerts there either.
 