zpool gets deadman status after trying to use it for a while

Matgoeth

Aug 16, 2025
Hi guys!

This forum has been extremely helpful multiple times; I hope you can point me in the right direction this time too.

My setup (host for PROXMOX):
- Motherboard: Z390 AORUS ELITE (it has 6 SATA slots which I make use of)
- CPU: Intel i9-9900K @ 3.60 GHz
- RAM: G.Skill RipjawsV DDR4 32GB (4x8GB) 3200MHz CL16 rev2 XMP2 Black (F4-3200C16D-16GVKB)
- Storage: 5x Seagate IronWolf Pro 16 TB 256MB 3.5" SATA (RAID-Z1) + 1x Crucial MX500 SSD (for the system)

My problem concerns the ZFS pool I created on the IronWolfs (RAID-Z1). The drives are all connected to the motherboard and PSU via standard cables. I made the pool into NAS storage for a container, but that shouldn't really matter. What matters is that zed reports deadman events whenever I try to copy files there:
Code:
Oct 12 16:33:26 washington zed[119688]: eid=150 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALAG-part1 size=196608 offset=1838406885376 priority=2 err=0 flags=0x40080480 delay=23542441ms
Oct 12 16:33:26 washington zed[119692]: eid=151 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALBK-part1 size=196608 offset=1838409179136 priority=2 err=0 flags=0x40080480 delay=23542441ms
Oct 12 16:33:26 washington zed[119696]: eid=152 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALNC-part1 size=1048576 offset=1838409572352 priority=2 err=0 flags=0x40080480 delay=23542434ms
Oct 12 16:33:26 washington zed[119698]: eid=153 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIAL29-part1 size=1048576 offset=1838412587008 priority=2 err=0 flags=0x40080480 delay=23542438ms
Oct 12 16:33:26 washington zed[119700]: eid=154 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALQ3-part1 size=1048576 offset=1838412587008 priority=2 err=0 flags=0x40080480 delay=23542436ms
Oct 12 16:34:28 washington zed[119945]: eid=155 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALAG-part1 size=196608 offset=1838406885376 priority=2 err=0 flags=0x40080480 delay=23542441ms
Oct 12 16:34:28 washington zed[119949]: eid=156 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALBK-part1 size=196608 offset=1838409179136 priority=2 err=0 flags=0x40080480 delay=23542441ms
Oct 12 16:34:28 washington zed[119952]: eid=157 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALNC-part1 size=1048576 offset=1838409572352 priority=2 err=0 flags=0x40080480 delay=23542434ms
Oct 12 16:34:28 washington zed[119955]: eid=158 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIAL29-part1 size=1048576 offset=1838412587008 priority=2 err=0 flags=0x40080480 delay=23542438ms
Oct 12 16:34:28 washington zed[119957]: eid=159 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALQ3-part1 size=1048576 offset=1838412587008 priority=2 err=0 flags=0x40080480 delay=23542436ms
Oct 12 16:35:29 washington zed[120174]: eid=160 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALAG-part1 size=196608 offset=1838406885376 priority=2 err=0 flags=0x40080480 delay=23542441ms
Oct 12 16:35:29 washington zed[120178]: eid=161 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALBK-part1 size=196608 offset=1838409179136 priority=2 err=0 flags=0x40080480 delay=23542441ms
Oct 12 16:35:29 washington zed[120181]: eid=162 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALNC-part1 size=1048576 offset=1838409572352 priority=2 err=0 flags=0x40080480 delay=23542434ms
Oct 12 16:35:29 washington zed[120184]: eid=163 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIAL29-part1 size=1048576 offset=1838412587008 priority=2 err=0 flags=0x40080480 delay=23542438ms
Oct 12 16:35:29 washington zed[120186]: eid=164 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALQ3-part1 size=1048576 offset=1838412587008 priority=2 err=0 flags=0x40080480 delay=23542436ms
Oct 12 16:36:31 washington zed[120405]: eid=165 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALAG-part1 size=196608 offset=1838406885376 priority=2 err=0 flags=0x40080480 delay=23542441ms
Oct 12 16:36:31 washington zed[120409]: eid=166 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALBK-part1 size=196608 offset=1838409179136 priority=2 err=0 flags=0x40080480 delay=23542441ms
Oct 12 16:36:31 washington zed[120413]: eid=167 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALNC-part1 size=1048576 offset=1838409572352 priority=2 err=0 flags=0x40080480 delay=23542434ms
Oct 12 16:36:31 washington zed[120415]: eid=168 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIAL29-part1 size=1048576 offset=1838412587008 priority=2 err=0 flags=0x40080480 delay=23542438ms
Oct 12 16:36:31 washington zed[120417]: eid=169 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALQ3-part1 size=1048576 offset=1838412587008 priority=2 err=0 flags=0x40080480 delay=23542436ms
Oct 12 16:37:20 washington pmxcfs[1535]: [status] notice: received log
Oct 12 16:37:32 washington zed[120633]: eid=170 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALAG-part1 size=196608 offset=1838406885376 priority=2 err=0 flags=0x40080480 delay=23542441ms
Oct 12 16:37:32 washington zed[120637]: eid=171 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALBK-part1 size=196608 offset=1838409179136 priority=2 err=0 flags=0x40080480 delay=23542441ms
Oct 12 16:37:32 washington zed[120641]: eid=172 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALNC-part1 size=1048576 offset=1838409572352 priority=2 err=0 flags=0x40080480 delay=23542434ms
Oct 12 16:37:32 washington zed[120643]: eid=173 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIAL29-part1 size=1048576 offset=1838412587008 priority=2 err=0 flags=0x40080480 delay=23542438ms
Oct 12 16:37:32 washington zed[120645]: eid=174 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALQ3-part1 size=1048576 offset=1838412587008 priority=2 err=0 flags=0x40080480 delay=23542436ms
Oct 12 16:38:33 washington zed[120863]: eid=175 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALAG-part1 size=196608 offset=1838406885376 priority=2 err=0 flags=0x40080480 delay=23542441ms
Oct 12 16:38:33 washington zed[120867]: eid=176 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALBK-part1 size=196608 offset=1838409179136 priority=2 err=0 flags=0x40080480 delay=23542441ms
Oct 12 16:38:33 washington zed[120870]: eid=177 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALNC-part1 size=1048576 offset=1838409572352 priority=2 err=0 flags=0x40080480 delay=23542434ms
Oct 12 16:38:33 washington zed[120873]: eid=178 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIAL29-part1 size=1048576 offset=1838412587008 priority=2 err=0 flags=0x40080480 delay=23542438ms
Oct 12 16:38:33 washington zed[120875]: eid=179 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALQ3-part1 size=1048576 offset=1838412587008 priority=2 err=0 flags=0x40080480 delay=23542436ms
Oct 12 16:39:35 washington zed[121092]: eid=180 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALAG-part1 size=196608 offset=1838406885376 priority=2 err=0 flags=0x40080480 delay=23542441ms
Oct 12 16:39:35 washington zed[121096]: eid=181 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALBK-part1 size=196608 offset=1838409179136 priority=2 err=0 flags=0x40080480 delay=23542441ms
Oct 12 16:39:35 washington zed[121099]: eid=182 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALNC-part1 size=1048576 offset=1838409572352 priority=2 err=0 flags=0x40080480 delay=23542434ms
Oct 12 16:39:35 washington zed[121102]: eid=183 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIAL29-part1 size=1048576 offset=1838412587008 priority=2 err=0 flags=0x40080480 delay=23542438ms
Oct 12 16:39:35 washington zed[121104]: eid=184 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALQ3-part1 size=1048576 offset=1838412587008 priority=2 err=0 flags=0x40080480 delay=23542436ms
Oct 12 16:40:36 washington zed[121323]: eid=185 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALAG-part1 size=196608 offset=1838406885376 priority=2 err=0 flags=0x40080480 delay=23542441ms
Oct 12 16:40:36 washington zed[121327]: eid=186 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALBK-part1 size=196608 offset=1838409179136 priority=2 err=0 flags=0x40080480 delay=23542441ms
Oct 12 16:40:36 washington zed[121330]: eid=187 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALNC-part1 size=1048576 offset=1838409572352 priority=2 err=0 flags=0x40080480 delay=23542434ms
Oct 12 16:40:36 washington zed[121333]: eid=188 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIAL29-part1 size=1048576 offset=1838412587008 priority=2 err=0 flags=0x40080480 delay=23542438ms
Oct 12 16:40:36 washington zed[121335]: eid=189 class=deadman pool='tank' vdev=ata-ST16000NT001-3LV101_SERIALQ3-part1 size=1048576 offset=1838412587008 priority=2 err=0 flags=0x40080480 delay=23542436ms

I tried scrubbing multiple times, to no avail: each scrub finds some checksum errors or damaged files, which I try to replace cleanly, and after a reboot it happens again. I checked the cables and nothing obvious seems bad, and since all of the drives end up in a deadman state (not just one), the chance that every cable is broken is very low.

Could you please point me in some direction for troubleshooting and fixing this? Something is clearly wrong, but I can't figure out what.

Thanks in advance!
 
I would migrate the pool to an actively cooled HBA card in IT mode and rebuild it as a 6-disk RAIDZ2. You do not want large hard drives in RAID 5: when a disk fails, the whole pool is at risk during the resilver. RAIDZ2 gives you a whole extra disk to lean on in DEGRADED mode, and more peace of mind.

Also, do burn-in testing before putting drives into use: a full dd write (note: this destroys all data on the drive, so only do it before the drive holds anything you need) followed by a SMART long test. This weeds out shipping damage. A SMART long test on a 16 TB drive takes almost 24 hours, so run the drives in parallel.

https://github.com/kneutron/ansitest/blob/master/SMART/scandisk-bigdrive-2tb+.sh
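A minimal sketch of that burn-in sequence, wrapped in a function so nothing runs until you pass a device (the /dev/sdX argument is a placeholder; the dd pass destroys everything on the target disk):

```shell
#!/bin/bash
# Destructive burn-in for one new drive. Nothing executes until you call
# the function with a device argument, e.g.: burnin /dev/sdX

burnin() {
    local disk="$1"
    # 1) Full sequential write pass - surfaces early media defects.
    #    WARNING: destroys all data on $disk.
    dd if=/dev/zero of="$disk" bs=1M oflag=direct status=progress
    # 2) SMART extended self-test - runs inside the drive firmware,
    #    roughly 24 h on a 16 TB disk.
    smartctl -t long "$disk"
    # 3) Check progress and the final result later with:
    #    smartctl -a "$disk" | grep -A1 'Self-test execution'
}

# Parallel burn-in of several drives, one log per drive:
#   for d in /dev/sd{b,c,d,e,f}; do burnin "$d" > "burnin-${d##*/}.log" 2>&1 & done; wait
```

The parallel loop matters here: run the long tests back to back and five 16 TB drives would take most of a week.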
 
ZFS “deadman” usually points to I/O stalls — the drives stop responding in time.

Are all disks connected through the chipset SATA ports (no port multipliers)?

Try checking for slow links with dmesg | grep ata or zpool status -v.

Also, any desktop board power-saving (ASPM, C-states) or mixed SATA cables could be triggering timeouts — worth testing with another PSU rail or HBA.
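A quick way to gather those signals in one place (a sketch; the pool name tank comes from the log above, and the commands need root on the Proxmox host):

```shell
#!/bin/bash
# Collect I/O-stall evidence for one pool (default: 'tank', as in the log).
check_stalls() {
    local pool="${1:-tank}"
    # Kernel-side ATA errors, link resets, and speed downshifts
    # (e.g. "SATA link up 1.5 Gbps") point at cables or ports.
    dmesg | grep -iE 'ata[0-9]+.*(err|fail|reset|link)' || true
    # Per-vdev READ/WRITE/CKSUM counters plus any unrepairable files.
    zpool status -v "$pool"
    # Recent ZFS events, including the deadman reports zed logged.
    zpool events -v | tail -n 60
}
```

Run check_stalls tank right after a copy that triggers the stalls, while the evidence is still fresh in the kernel ring buffer.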
 
Quote:
ZFS “deadman” usually points to I/O stalls — the drives stop responding in time.

Are all disks connected through the chipset SATA ports (no port multipliers)?

Try checking for slow links with dmesg | grep ata or zpool status -v.

Also, any desktop board power-saving (ASPM, C-states) or mixed SATA cables could be triggering timeouts — worth testing with another PSU rail or HBA.
Thank you! Yeah, my motherboard has exactly 6 SATA ports and I use them all (5 for my pool, one for the host).
I managed to find out that one of my SATA cables was (likely) faulty and was bringing down the rest of the pool (as you said, there were signals about ata ports timing out). I replaced it.

Right now the CKSUM counts stay mostly at 0; occasionally a random error brings one drive to 1, whereas with the old cable they were around 30-120. Is that normal behavior?

Quote:
I would migrate the pool to an actively cooled HBA card in IT mode and rebuild it as a 6-disk RAIDZ2. You do not want large hard drives in RAID 5: when a disk fails, the whole pool is at risk during the resilver. RAIDZ2 gives you a whole extra disk to lean on in DEGRADED mode, and more peace of mind.
Problem is: I have EXACTLY 6 SATA ports and I'd like (for hygiene) to keep the OS drive separate, so 5 pool drives is where I am. My safety net is that I already have a spare drive on hand, so if one fails I can start the resilver immediately instead of waiting days for a replacement. I think I'm going to stay with this setup for some time. Still: thanks for the advice, it's a valid point!
 
Quote:
Thank you! Yeah, my motherboard has exactly 6 SATA ports and I use them all (5 for my pool, one for the host).
I managed to find out that one of my SATA cables was (likely) faulty and was bringing down the rest of the pool (as you said, there were signals about ata ports timing out). I replaced it.

Right now the CKSUM counts stay mostly at 0; occasionally a random error brings one drive to 1, whereas with the old cable they were around 30-120. Is that normal behavior?
A few stray checksum errors right after replacing a bad cable can be normal: ZFS is still verifying and repairing blocks that were affected earlier.
Note that a scrub repairs data but does not reset the counters, so run zpool clear first, then another zpool scrub, and watch whether the CKSUM counts stay at zero.

If new errors keep appearing after multiple scrubs, it could still point to lingering cabling or disk issues.
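That check can be sketched as (pool name tank taken from the log above):

```shell
#!/bin/bash
# Reset the error counters, rescrub, and watch whether CKSUM stays at 0.
recheck_pool() {
    local pool="${1:-tank}"               # pool name from the log above
    zpool clear "$pool"                   # zero the READ/WRITE/CKSUM counters
    zpool scrub "$pool"                   # re-read and verify every allocated block
    watch -n 60 "zpool status -v $pool"   # counters should now stay at 0
}
```

If the counters climb again after a clear and a clean scrub, the new errors are current, not leftovers from the old cable.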
 