I need some help understanding the alert I received today. I'll give some background, but I'll hide it as a spoiler in case the extra detail isn't relevant.
I have multiple nodes in a cluster, and one of them hosts a TrueNAS VM. That VM has three NVMe drives and four spinners passed to it via PCIe passthrough, giving TrueNAS total control. These are set up as two ZFS arrays, one per drive type. This has been the case since I built the node; those drives have never been used by pve. I stayed current on pve8 for some time and only recently started upgrading in place to pve9. Each weekend I did one node, going from the least to the most important. Yesterday I did the last node, which is the NAS. In each case I ran pve8to9 --full and addressed any callouts, but none were major (install microcode updates, turn off autoactivation on lvm-thin volumes, etc.).
Today I received the following alert:
Code:
DegradedArray event detected on md device /dev/md126
The /proc/mdstat file currently contains the following:
Personalities : [raid0] [raid1] [raid4] [raid5] [raid6] [raid10] [linear]
md126 : broken (auto-read-only) raid1 sdd1[0]
2095040 blocks super 1.2 [3/1] [U__]
md127 : broken (auto-read-only) raid1 nvme2n1p1[1]
2095040 blocks super 1.2 [2/1] [_U]
unused devices: <none>
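From what I can tell, the bracketed fields are the key part of those mdstat lines: [3/1] should mean 3 configured members with only 1 working, and [U__] shows which slots are up versus down. Here's a quick sketch of that reading, applied to the md126 line above (the format interpretation is my assumption, not something I've confirmed in the docs):

```shell
#!/bin/sh
# Sketch: decode the bracketed fields in the /proc/mdstat lines above.
# Assumption: "[total/working]" is configured vs. active members, and
# each U/_ character is one member slot, up or down.
mdstat='md126 : broken (auto-read-only) raid1 sdd1[0]
      2095040 blocks super 1.2 [3/1] [U__]'

# Pull out "3/1" and split it.
counts=$(printf '%s\n' "$mdstat" | grep -o '\[[0-9]*/[0-9]*\]' | tr -d '[]')
total=${counts%/*}
working=${counts#*/}

# Pull out "U__" and count the up slots.
slots=$(printf '%s\n' "$mdstat" | grep -o '\[[U_][U_]*\]' | tr -d '[]')
up=$(printf '%s' "$slots" | tr -cd 'U' | wc -c)

echo "md126: $working of $total members working, $up slot(s) up: $slots"
```

If that reading is right, md126 expects three members but can only see one (sdd1), and md127 likewise expects two and sees one.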
sdd and nvme2n1p1 are both drives that are passed completely to TrueNAS's control. In TrueNAS, I have no alerts and it says all disks are healthy.
Code:
nvme zfs raidz1 array
Pool Status: ONLINE
Used Space: 63.54%
Disks with Errors: 0 of 3
spinner zfs raidz1 array
Pool Status: ONLINE
Used Space: 13.91%
Disks with Errors: 0 of 4
Natively on the host, if I run mdadm --detail /dev/md126 and mdadm --detail /dev/md127, I get the following. Sadly, I am not well versed in mdadm (hence why I used TrueNAS to do the heavy lifting for me!). Could someone help me understand the results? Do I really have a problem? It seems odd that 24 hours after the upgrade I'm getting this alert from pve, but nothing from TrueNAS.
mdadm --detail /dev/md126
Code:
/dev/md126:
Version : 1.2
Creation Time : Sat Jun 24 13:57:47 2023
Raid Level : raid1
Array Size : 2095040 (2045.94 MiB 2145.32 MB)
Used Dev Size : 2095040 (2045.94 MiB 2145.32 MB)
Raid Devices : 3
Total Devices : 1
Persistence : Superblock is persistent
Update Time : Sat Mar 21 15:48:26 2026
State : clean, FAILED
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Consistency Policy : resync
Number Major Minor RaidDevice State
0 8 49 0 active sync missing
- 0 0 1 removed
- 0 0 2 removed
mdadm --detail /dev/md127
Code:
/dev/md127:
Version : 1.2
Creation Time : Mon Jun 12 22:28:44 2023
Raid Level : raid1
Array Size : 2095040 (2045.94 MiB 2145.32 MB)
Used Dev Size : 2095040 (2045.94 MiB 2145.32 MB)
Raid Devices : 2
Total Devices : 1
Persistence : Superblock is persistent
Update Time : Sat Mar 21 15:48:26 2026
State : clean, FAILED
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Consistency Policy : resync
Number Major Minor RaidDevice State
- 0 0 0 removed
1 259 5 1 active sync missing
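For my own understanding, the fields I keep coming back to in that output are Raid Devices versus Active Devices, and the fact that Failed Devices is 0. A small sketch of that reading against the md127 output above (field names copied from the pasted output; the interpretation is my assumption):

```shell
#!/bin/sh
# Sketch: pull the expected vs. present member counts out of the
# `mdadm --detail /dev/md127` output pasted above.
detail='Raid Devices : 2
Total Devices : 1
Active Devices : 1
Failed Devices : 0'

# Helper: print the value after "<field> : ".
get() { printf '%s\n' "$detail" | sed -n "s/^$1 : //p"; }

expected=$(get "Raid Devices")
active=$(get "Active Devices")
failed=$(get "Failed Devices")

echo "md127 expects $expected members, sees $active active, $failed marked failed"
# i.e. the other member is simply absent ("removed"), not failed --
# which would fit a partition living inside the VM rather than on the host.
```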
This node is running pveversion:
Code:
pve-manager/9.1.7/16b139a017452f16 (running kernel: 6.17.13-2-pve)
smartctl -H on the devices complains there is no such disk, which makes sense, as the VM is active and owns the disks.