[SOLVED] ZFS device fault for pool emails

vincentp

New Member
May 23, 2023
Hi

Newbie with Proxmox 8.x (migrating from xcp-ng). I have a ZFS pool with three 2-way mirrors, all NVMe (4 Intel, 2 Samsung), and every time I reboot the machine I get multiple emails from it like this:

subject : ZFS device fault for pool vmdata on raptor

ZFS has detected that a device was removed.

impact: Fault tolerance of the pool may be compromised.
eid: 16
class: statechange
state: UNAVAIL
host: raptor
time: 2023-11-19 14:52:24+1100
vpath: /dev/nvme8n1p1
vphys: pci-0000:04:00.0-nvme-1
vguid: 0x0F35CF947EF24215
devid: nvme-INTEL_SSDPF2KX038T1_PHAX3243005A3P8CGN-part1
pool: vmdata (0xC2489D11C4F84674)

Both the GUI and the shell show the pool as fine once the machine has booted:

zpool status vmdata

Code:
  pool: vmdata
 state: ONLINE
config:

        NAME                                           STATE     READ WRITE CKSUM
        vmdata                                         ONLINE       0     0     0
          mirror-0                                     ONLINE       0     0     0
            nvme-eui.01000000000000005cd2e47348445651  ONLINE       0     0     0
            nvme-eui.01000000000000005cd2e4420c455651  ONLINE       0     0     0
          mirror-1                                     ONLINE       0     0     0
            nvme-eui.01000000000000005cd2e406ce445651  ONLINE       0     0     0
            nvme-eui.01000000000000005cd2e499f0445651  ONLINE       0     0     0
          mirror-2                                     ONLINE       0     0     0
            nvme-eui.36344830574205540025384e00000001  ONLINE       0     0     0
            nvme-eui.36344830574205680025384e00000001  ONLINE       0     0     0

errors: No known data errors


Any ideas what is going on? I don't want to disable email alerts but false alarms cause complacency.

Thx.
 
You might want to find out why the Intel NVMe device temporarily disconnects. Maybe it's a hardware or PCIe signal integrity issue, or it has trouble negotiating a link speed?
See if there are related errors with a clue in journalctl (scroll with the arrow keys) around the time before the e-mail is sent.
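
A sketch of how journalctl can be narrowed down (the boot offset and filter pattern here are illustrative, not specific to this server):

```shell
# Show which boots the journal knows about
journalctl --list-boots

# Kernel messages from the previous boot, filtered to NVMe/PCIe/AER lines
journalctl -b -1 -k --no-pager | grep -iE 'nvme|pcie|aer'
```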
 
journalctl -u zfs-zed.service

Last 2 boots - the most recent boot didn't result in emails, the previous ones did


Code:
-- Boot fb1692768c334c1c9b03b303dce26fa9 --
Nov 19 14:52:24 raptor systemd[1]: Started zfs-zed.service - ZFS Event Daemon (zed).
Nov 19 14:52:24 raptor zed[4587]: ZFS Event Daemon 2.1.13-pve1 (PID 4587)
Nov 19 14:52:24 raptor zed[4587]: Processing events since eid=0
Nov 19 14:52:24 raptor zed[4606]: eid=2 class=config_sync pool='rpool'
Nov 19 14:52:24 raptor zed[4607]: eid=3 class=pool_import pool='rpool'
Nov 19 14:52:24 raptor zed[4615]: eid=5 class=config_sync pool='rpool'
Nov 19 14:52:24 raptor zed[4621]: eid=7 class=config_sync pool='scratch'
Nov 19 14:52:24 raptor zed[4624]: eid=8 class=pool_import pool='scratch'
Nov 19 14:52:24 raptor zed[4634]: eid=10 class=config_sync pool='scratch'
Nov 19 14:52:24 raptor zed[4637]: eid=11 class=statechange pool='vmdata' vdev=nvme7n1p1 vdev_state=UNAVAIL
Nov 19 14:52:24 raptor zed[4646]: eid=12 class=vdev.no_replicas pool='vmdata'
Nov 19 14:52:24 raptor zed[4649]: eid=13 class=statechange pool='vmdata' vdev=nvme1n1p1 vdev_state=UNAVAIL
Nov 19 14:52:24 raptor zed[4681]: eid=15 class=vdev.no_replicas pool='vmdata'
Nov 19 14:52:24 raptor zed[4689]: eid=16 class=statechange pool='vmdata' vdev=nvme8n1p1 vdev_state=UNAVAIL
Nov 19 14:52:24 raptor zed[4712]: eid=17 class=zpool pool='vmdata'
Nov 19 14:52:25 raptor zed[4814]: vdev nvme-SAMSUNG_MZQL27T6HBLA-00A07_S6CKNE0T833976 '' doesn't exist
Nov 19 14:52:33 raptor zed[6735]: eid=19 class=config_sync pool='vmdata'
Nov 19 14:52:33 raptor zed[6750]: eid=20 class=pool_import pool='vmdata'
Nov 19 14:52:33 raptor zed[7103]: vdev nvme-eui.01000000000000005cd2e4420c455651 '' doesn't exist
Nov 19 14:52:33 raptor zed[7348]: vdev nvme-eui.01000000000000005cd2e406ce445651 '' doesn't exist
Nov 19 14:52:33 raptor zed[7865]: vdev nvme-eui.36344830574205680025384e00000001 '' doesn't exist
Nov 19 16:54:36 raptor zed[4587]: Exiting
Nov 19 16:54:36 raptor systemd[1]: Stopping zfs-zed.service - ZFS Event Daemon (zed)...
Nov 19 16:54:36 raptor systemd[1]: zfs-zed.service: Deactivated successfully.
Nov 19 16:54:36 raptor systemd[1]: Stopped zfs-zed.service - ZFS Event Daemon (zed).
-- Boot 8ac6c5890f2c435592b515ab794e4c0d --
Nov 19 16:59:38 raptor systemd[1]: Started zfs-zed.service - ZFS Event Daemon (zed).
Nov 19 16:59:38 raptor zed[3717]: ZFS Event Daemon 2.1.13-pve1 (PID 3717)
Nov 19 16:59:38 raptor zed[3717]: Processing events since eid=0
Nov 19 16:59:38 raptor zed[3734]: eid=3 class=pool_import pool='rpool'
Nov 19 16:59:38 raptor zed[3732]: eid=2 class=config_sync pool='rpool'
Nov 19 16:59:38 raptor zed[3742]: eid=5 class=config_sync pool='rpool'
Nov 19 16:59:38 raptor zed[3748]: eid=8 class=pool_import pool='scratch'
Nov 19 16:59:38 raptor zed[3746]: eid=7 class=config_sync pool='scratch'
Nov 19 16:59:38 raptor zed[3754]: eid=10 class=config_sync pool='scratch'
Nov 19 16:59:38 raptor zed[3838]: vdev nvme-SAMSUNG_MZQL27T6HBLA-00A07_S6CKNE0T833976 '' doesn't exist
Nov 19 16:59:47 raptor zed[5783]: eid=12 class=config_sync pool='vmdata'
Nov 19 16:59:47 raptor zed[5789]: eid=13 class=pool_import pool='vmdata'
Nov 19 16:59:47 raptor zed[6447]: vdev nvme-eui.01000000000000005cd2e406ce445651 '' doesn't exist
Nov 19 16:59:47 raptor zed[6566]: vdev nvme-eui.01000000000000005cd2e499f0445651 '' doesn't exist
Nov 19 16:59:47 raptor zed[6743]: vdev nvme-eui.36344830574205540025384e00000001 '' doesn't exist
Nov 19 16:59:47 raptor zed[6852]: vdev nvme-eui.36344830574205680025384e00000001 '' doesn't exist

It does seem to be random - different NVMe drives each time. 4 of the drives are Intel and the other 2 are Samsung, so it's not specific to the Intel drives.

This is a new server I set up last week. I did create the pool, delete it and recreate it - looking further back in the logs, I see this was happening then too.
 
Can you please look in journalctl for errors about the PCIe/NVMe devices? If it is a PCIe signal issue, then it might help to limit the PCIe or M.2 slots to Gen3 (or whatever the drives support, or lower) in the BIOS; otherwise a BIOS update or another motherboard.

EDIT: Or it could be a CPU pin connection issue. More information about the motherboard, the PCIe lane layout and how you connected the drives might help.
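
One way to spot a link training down is to compare the negotiated link speed/width against the maximum for each NVMe controller (a sketch; 0108 is the standard PCI class code for NVMe, and `-d ::<class>` needs a reasonably recent pciutils):

```shell
# For every NVMe-class PCI device, print link capability vs. current status;
# LnkSta showing a lower speed/width than LnkCap suggests signal problems
lspci -d ::0108 -vv 2>/dev/null | grep -E '^[0-9a-f]{2}:|LnkCap:|LnkSta:'
```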
 
This is a brand new dual EPYC 7542 server. All the drives are U.2 (apart from the 2 SATA boot drives). The server is in a data center a 3-hour drive away, otherwise I would go and reseat things (I don't trust the DC staff to do this). I have another identical server currently running xcp-ng (with mdraid, not ZFS) and have not had any issues with NVMe etc. (I just checked its logs to be sure).

No errors showed up in journalctl for PCIe; however, I did find these for NVMe:

Code:
Nov 10 19:01:31 raptor kernel: nvme 0000:02:00.0: AER: aer_status: 0x00002000, aer_mask: 0x00000000
Nov 10 19:01:31 raptor kernel: nvme 0000:02:00.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
Nov 10 19:01:31 raptor kernel: nvme 0000:03:00.0: AER: aer_status: 0x00002000, aer_mask: 0x00000000
Nov 10 19:01:31 raptor kernel: nvme 0000:03:00.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
Nov 10 19:01:31 raptor kernel: nvme 0000:04:00.0: AER: aer_status: 0x00002000, aer_mask: 0x00000000
Nov 10 19:01:31 raptor kernel: nvme 0000:04:00.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID

Nov 15 17:57:12 raptor kernel: nvme 0000:03:00.0: AER: aer_status: 0x00002000, aer_mask: 0x00000000
Nov 15 17:57:12 raptor kernel: nvme 0000:03:00.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
Nov 15 17:57:12 raptor kernel: nvme 0000:04:00.0: AER: aer_status: 0x00002000, aer_mask: 0x00000000
Nov 15 17:57:12 raptor kernel: nvme 0000:04:00.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
...
Nov 15 18:39:09 raptor kernel: nvme 0000:02:00.0: AER: aer_status: 0x00002000, aer_mask: 0x00000000
Nov 15 18:39:09 raptor kernel: nvme 0000:02:00.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
Nov 15 18:39:09 raptor kernel: nvme 0000:04:00.0: AER: aer_status: 0x00002000, aer_mask: 0x00000000
Nov 15 18:39:09 raptor kernel: nvme 0000:04:00.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID

...
Nov 19 14:52:19 raptor kernel: nvme 0000:02:00.0: AER: aer_status: 0x00002000, aer_mask: 0x00000000
Nov 19 14:52:19 raptor kernel: nvme 0000:02:00.0: [13] NonFatalErr
Nov 19 14:52:19 raptor kernel: nvme 0000:02:00.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
...


Not sure if they are related, but it kind of marries up with the random emails (this is just before zed kicks in during boot).

I think I will push this back to the server vendor tomorrow (will keep googling though).
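
To see how often this happens and to which devices, the AER lines can be tallied per PCI address across the whole journal (a sketch; `--grep` needs a reasonably recent systemd):

```shell
# Count AER status messages per PCI address across all stored boots
journalctl -k --no-pager --grep 'AER: aer_status' \
  | grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9]' \
  | sort | uniq -c | sort -rn
```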
 
Have a look at this: https://forum.proxmox.com/threads/l...ly-related-to-overheating.136565/#post-605782

Try checking the disks' SMART temperature and other parameters:
smartctl -a /dev/nvmeXnY or
nvme smart-log /dev/nvmeXnY
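
A quick loop over all controllers could look like this (a sketch; it assumes the usual /dev/nvme0 .. /dev/nvme9 device naming):

```shell
# Print the composite temperature reported by each NVMe controller
for d in /dev/nvme[0-9]; do
  echo "== $d =="
  nvme smart-log "$d" | grep -i temperature
done
```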

Google Bard says:


The error message "AER: aer_status: 0x00002000, aer_mask: 0x00000000" indicates that an Advanced Error Reporting (AER) error has occurred on a PCI Express (PCIe) device. The error status is 0x00002000, which means that there was a Non-Fatal Error (NFE) on the Transaction Layer. The error mask is 0x00000000, which means that all error sources are masked.

This error is not necessarily cause for concern, as it may be a transient error that has been corrected by the hardware. However, if you see this error frequently, it may indicate a hardware problem. If you are concerned about this error, you should contact your hardware vendor for support.

Here is a breakdown of the error message:

  • AER: Advanced Error Reporting
  • aer_status: The error status code
  • 0x00002000: The error status code for a Non-Fatal Error (NFE) on the Transaction Layer
  • aer_mask: The error mask
  • 0x00000000: All error sources are masked
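
As a sanity check, 0x00002000 really is just bit 13 set - the bit the kernel labels [13] NonFatalErr in the log above:

```shell
# Enumerate which bits are set in aer_status 0x00002000
status=$((0x00002000))
for bit in $(seq 0 31); do
  if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
    echo "bit $bit set"
  fi
done
# → bit 13 set
```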
 
The drives are all sitting at around 25-27 °C, so I don't think it's a temperature issue.

Further googling (at 5:30am - this issue kept me awake!) turned up some references to ASPM and pcie_aspm=off. I added this to the kernel cmdline and rebooted, and so far so good:

Code:
-- Boot fb1692768c334c1c9b03b303dce26fa9 --
Nov 20 06:39:35 raptor systemd[1]: Started zfs-zed.service - ZFS Event Daemon (zed).
Nov 20 06:39:35 raptor zed[3809]: ZFS Event Daemon 2.1.13-pve1 (PID 3809)
Nov 20 06:39:35 raptor zed[3809]: Processing events since eid=0
Nov 20 06:39:35 raptor zed[3827]: eid=2 class=config_sync pool='rpool'
Nov 20 06:39:35 raptor zed[3829]: eid=3 class=pool_import pool='rpool'
Nov 20 06:39:35 raptor zed[3840]: eid=5 class=config_sync pool='rpool'
Nov 20 06:39:35 raptor zed[3844]: eid=7 class=config_sync pool='scratch'
Nov 20 06:39:35 raptor zed[3852]: eid=10 class=config_sync pool='scratch'
Nov 20 06:39:43 raptor zed[5900]: eid=12 class=config_sync pool='vmdata'
Nov 20 06:39:43 raptor zed[5911]: eid=13 class=pool_import pool='vmdata'
Nov 20 06:39:43 raptor zed[5979]: eid=15 class=config_sync pool='vmdata'

Also no AER entries

I'll do further testing today but hopefully that was it.
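
For reference, where that flag goes depends on the bootloader (a sketch; ZFS-root Proxmox installs booting via systemd-boot use /etc/kernel/cmdline, GRUB installs use /etc/default/grub):

```shell
# GRUB: append pcie_aspm=off to GRUB_CMDLINE_LINUX_DEFAULT in
# /etc/default/grub, then regenerate the config:
#   update-grub
#
# systemd-boot (managed by proxmox-boot-tool): append pcie_aspm=off to the
# single line in /etc/kernel/cmdline, then re-sync the ESPs:
#   proxmox-boot-tool refresh
```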
 
