[ZFS + ZED] No Notification for OFFLINE (removed) drives.

koyax

New Member
System: Proxmox VE 6.0-6 (upgraded from 5.4)

Hey guys!
I had to set up ZED again, since the config file got overwritten during the upgrade. Notifications work fine, except that I don't get an email when I remove a drive. Finished scrubs and resilvers trigger an email, but a hard drive going offline does not.
In the zedlet statechange-notify.sh it seems there are only triggers for 'DEGRADED', 'FAULTED' or 'REMOVED' drives.

Is that on purpose, or am I just missing something? A drive going offline in my system is quite critical (imho).

Code:
Aug 30 2019 14:47:17.040921375 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool = "hdd"
        pool_guid = 0x79d541d265310a54
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0xedfa949ff5bf4746
        vdev_state = "OFFLINE" (0x2)
        vdev_path = "/dev/sdb1"
        vdev_devid = "ata-WDC_WD5000AAKX-XXXXXXXXXXX-part1"
        vdev_physpath = "pci-0000:00:1f.2-ata-2"
        vdev_laststate = "ONLINE" (0x7)
        time = 0x5d691ad5 0x270691f
        eid = 0x5c
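
For reference, this is the statechange event as shown by zpool events -v. To watch events arrive live while reproducing, the follow flag can be used, optionally limited to the affected pool:

Code:
# Follow new ZFS events verbosely as they arrive (Ctrl-C to stop)
zpool events -vf

# Optionally restrict the output to one pool
zpool events -vf hdd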

Part of statechange-notify.sh
Code:
#
# Send notification in response to a fault induced statechange
#
# ZEVENT_SUBCLASS: 'statechange'
# ZEVENT_VDEV_STATE_STR: 'DEGRADED', 'FAULTED' or 'REMOVED'              <==== !!
#
# Exit codes:
#   0: notification sent
#   1: notification failed
#   2: notification not configured
#   3: statechange not relevant
#   4: statechange string missing (unexpected)
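
The check itself is a plain string comparison near the top of the zedlet, so adding 'OFFLINE' there would be one possible local workaround. A sketch, based on the 0.8.x version of /etc/zfs/zed.d/statechange-notify.sh (the exact wording may differ in your copy, and a package upgrade can overwrite the edit, just like it overwrote my zed config):

Code:
# As shipped (roughly): anything that is not FAULTED, DEGRADED or REMOVED
# exits early with code 3 ("statechange not relevant").
# A local modification could also let OFFLINE through:
if [ "${ZEVENT_VDEV_STATE_STR}" != "FAULTED" ] \
        && [ "${ZEVENT_VDEV_STATE_STR}" != "DEGRADED" ] \
        && [ "${ZEVENT_VDEV_STATE_STR}" != "REMOVED" ] \
        && [ "${ZEVENT_VDEV_STATE_STR}" != "OFFLINE" ]; then
    exit 3
fi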
 
AFAIK the OFFLINE state is considered a manual state change only and therefore not relevant. What was the exit code reported?
 
Thanks for your reply!
> What was the exit code reported?

Where does zed log the exit codes? Unfortunately, the /var/log/syslog file from that day has already been deleted, and /var/log/messages only shows that the device got removed.
 
I guess /var/log/syslog
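
If the old syslog is gone, one way to catch it next time (assuming the usual Debian/Proxmox service name zfs-zed): stop the service briefly and run the daemon in the foreground with verbose output while you reproduce the state change. The exact output varies by version, but it should give more detail about what zed does with each event.

Code:
# Show what the zed unit has logged so far
journalctl -u zfs-zed

# One-off debugging session: run the daemon in the foreground, verbose
systemctl stop zfs-zed
zed -Fv
# ... pull / offline the drive, watch the output, Ctrl-C, then:
systemctl start zfs-zed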
 
I removed the drive once and plugged it back in (I only pulled the power plug). The result is the same: I got an email only for the finished resilver process, but not for the offline drive.

Here is my output from syslog:

Code:
Sep  6 17:23:40 pve-01 kernel: [717197.753965] ata1: SATA link down (SStatus 0 SControl 300)
Sep  6 17:23:45 pve-01 kernel: [717203.130051] ata1: SATA link down (SStatus 0 SControl 300)
Sep  6 17:23:45 pve-01 kernel: [717203.130058] ata1.00: disabled
Sep  6 17:23:45 pve-01 kernel: [717203.130077] sd 0:0:0:0: rejecting I/O to offline device
Sep  6 17:23:45 pve-01 kernel: [717203.131161] sd 0:0:0:0: rejecting I/O to offline device
Sep  6 17:23:45 pve-01 kernel: [717203.132444] print_req_error: I/O error, dev sda, sector 1648097520 flags 701
Sep  6 17:23:45 pve-01 kernel: [717203.134841] print_req_error: I/O error, dev sda, sector 67637448 flags 701
Sep  6 17:23:45 pve-01 kernel: [717203.137156] zio pool=hdd vdev=/dev/sda1 error=5 type=2 offset=843824881664 size=8192 flags=180880
Sep  6 17:23:45 pve-01 kernel: [717203.139375] zio pool=hdd vdev=/dev/sda1 error=5 type=2 offset=34629324800 size=8192 flags=180880
Sep  6 17:23:45 pve-01 kernel: [717203.139381] ata1.00: detaching (SCSI 0:0:0:0)
Sep  6 17:23:45 pve-01 kernel: [717203.139411] print_req_error: I/O error, dev sda, sector 70343328 flags 701
Sep  6 17:23:45 pve-01 kernel: [717203.139437] print_req_error: I/O error, dev sda, sector 118106544 flags 701
Sep  6 17:23:45 pve-01 kernel: [717203.141895] zio pool=hdd vdev=/dev/sda1 error=5 type=2 offset=36014735360 size=4096 flags=180880
Sep  6 17:23:45 pve-01 kernel: [717203.141916] zio pool=hdd vdev=/dev/sda1 error=5 type=1 offset=270336 size=8192 flags=b08c1
Sep  6 17:23:45 pve-01 kernel: [717203.141927] zio pool=hdd vdev=/dev/sda1 error=5 type=1 offset=2000389152768 size=8192 flags=b08c1
Sep  6 17:23:45 pve-01 zed: eid=226 class=io pool_guid=0x79D541D265310A54 vdev_path=/dev/sda1
Sep  6 17:23:46 pve-01 zed: eid=227 class=io pool_guid=0x79D541D265310A54 vdev_path=/dev/sda1
Sep  6 17:23:46 pve-01 zed: eid=228 class=io pool_guid=0x79D541D265310A54 vdev_path=/dev/sda1
Sep  6 17:23:46 pve-01 zed: eid=229 class=probe_failure pool_guid=0x79D541D265310A54 vdev_path=/dev/sda1
Sep  6 17:23:46 pve-01 zed: eid=230 class=statechange pool_guid=0x79D541D265310A54 vdev_path=/dev/sda1 vdev_state=OFFLINE
Sep  6 17:23:47 pve-01 zed: eid=231 class=config_sync pool_guid=0x79D541D265310A54 
Sep  6 17:24:00 pve-01 systemd[1]: Starting Proxmox VE replication runner...
Sep  6 17:24:01 pve-01 systemd[1]: pvesr.service: Succeeded.
Sep  6 17:24:01 pve-01 systemd[1]: Started Proxmox VE replication runner.
Sep  6 17:24:36 pve-01 kernel: [717253.954842] ata1: link is slow to respond, please be patient (ready=0)
Sep  6 17:24:40 pve-01 kernel: [717258.222916] ata1: COMRESET failed (errno=-16)
Sep  6 17:24:42 pve-01 kernel: [717259.726825] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep  6 17:24:42 pve-01 kernel: [717259.727758] ata1.00: ATA-8: ST2000NM0011, SN03, max UDMA/133
Sep  6 17:24:42 pve-01 kernel: [717259.727760] ata1.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 32)
Sep  6 17:24:42 pve-01 kernel: [717259.728706] ata1.00: configured for UDMA/133
Sep  6 17:24:42 pve-01 kernel: [717259.728897] scsi 0:0:0:0: Direct-Access     ATA      ST2000NM0011     SN03 PQ: 0 ANSI: 5
Sep  6 17:24:42 pve-01 kernel: [717259.729161] sd 0:0:0:0: Attached scsi generic sg0 type 0
Sep  6 17:24:42 pve-01 kernel: [717259.729195] sd 0:0:0:0: [sda] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
Sep  6 17:24:42 pve-01 kernel: [717259.729220] sd 0:0:0:0: [sda] Write Protect is off
Sep  6 17:24:42 pve-01 kernel: [717259.729223] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Sep  6 17:24:42 pve-01 kernel: [717259.729274] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Sep  6 17:24:42 pve-01 kernel: [717259.771585]  sda: sda1 sda9
Sep  6 17:24:42 pve-01 kernel: [717259.772638] sd 0:0:0:0: [sda] Attached SCSI disk
Sep  6 17:24:42 pve-01 zed: eid=232 class=statechange pool_guid=0x79D541D265310A54 vdev_path=/dev/sda1 vdev_state=ONLINE
Sep  6 17:24:43 pve-01 zed: eid=233 class=vdev_online pool_guid=0x79D541D265310A54 vdev_path=/dev/sda1 vdev_state=ONLINE
Sep  6 17:24:43 pve-01 zed: eid=234 class=resilver_start pool_guid=0x79D541D265310A54 
Sep  6 17:24:43 pve-01 zed: eid=235 class=history_event pool_guid=0x79D541D265310A54 
Sep  6 17:24:43 pve-01 zed: eid=236 class=history_event pool_guid=0x79D541D265310A54 
Sep  6 17:24:43 pve-01 zed: eid=237 class=resilver_finish pool_guid=0x79D541D265310A54 
Sep  6 17:24:44 pve-01 postfix/pickup[27060]: 2B12A4641: uid=0 from=<root>
Sep  6 17:24:44 pve-01 postfix/cleanup[20247]: 2B12A4641: message-id=<20190906152444.2B12A4641@pve-01.int.#####.org>
Sep  6 17:24:44 pve-01 postfix/qmgr[2635]: 2B12A4641: from=<root@pve-01.int.#######.org>, size=1264, nrcpt=1 (queue active)
Sep  6 17:24:44 pve-01 zed: Starting scrub after resilver on hdd
Sep  6 17:24:45 pve-01 postfix/smtp[20253]: 2B12A4641: to=<#####@protonmail.com>, relay=mail.protonmail.ch[185.70.40.103]:25, delay=1.6, delays=0.02/0.01/0.36/1.2, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 908C8303B442)
Sep  6 17:24:45 pve-01 postfix/qmgr[2635]: 2B12A4641: removed
Sep  6 17:24:47 pve-01 zed: eid=238 class=config_sync pool_guid=0x79D541D265310A54  
Sep  6 17:24:47 pve-01 zed: eid=239 class=scrub_start pool_guid=0x79D541D265310A54  
Sep  6 17:24:47 pve-01 zed: eid=240 class=history_event pool_guid=0x79D541D265310A54
 
Hey!

I'm actually having the same issue on Proxmox 6.1-3. If I manually remove a drive from the array, it becomes "UNAVAIL" and the pool is shown as degraded, but no alert email is sent. Just like the OP, I get emails for scrubs, resilvers, etc., but no email when the pool becomes degraded.

I've opened an issue on GitHub (#10123) and also linked a similar issue that was opened, fixed, and closed back in 2017 (#4653).

Has anyone found a workaround for this, aside from using a DIY shell script?
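
(For reference, the DIY route would be something like the sketch below: a cron job that mails whenever zpool status -x reports anything other than "all pools are healthy". The path and recipient are placeholders, and it assumes a working mail command / local MTA.)

Code:
#!/bin/sh
# Hypothetical /usr/local/bin/zpool-health-mail.sh, run e.g. every 15 minutes
# from cron. "zpool status -x" prints "all pools are healthy" when nothing is
# wrong; anything else triggers a mail with the full pool status.
MAILTO="root"

status="$(zpool status -x)"
if [ "$status" != "all pools are healthy" ]; then
    zpool status | mail -s "ZFS pool problem on $(hostname)" "$MAILTO"
fi

# Example /etc/cron.d entry:
# */15 * * * * root /usr/local/bin/zpool-health-mail.sh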

My setup:
PowerEdge R515
128 GB ECC RAM
8x WD RE 2 TB drives in RAID-Z2
PERC H310 (flashed to IT mode)
 
I ran another test and loaded 4 VMs onto the pool to generate enough I/O requests for ZED to catch the drive, as suggested in #4653. However, after moving over 1 TB worth of VMs onto the pool and letting them run for the last 5 days, there is still no alert from ZED.
 
Another observation: even though the pool status is degraded, the icon for the node on the left side of the screen still shows green. However, under Node -> Disks -> ZFS, the health status shows degraded with a yellow warning. The node icon should also turn yellow to indicate a possible issue.

[Screenshot: zfs.jpg]
 
I installed the most current packages available this afternoon, which upgraded ZFS and ZED from 0.8.2 to 0.8.3; however, the issue still persists.

Has anyone been able to get ZED to alert if your pool is in a degraded state?

Code:
root@pve:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.18-2-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-7
pve-kernel-5.3: 6.1-5
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-22
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-6
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 
