FYI: ZFS pool SUSPENDED not recognized / handled - even with HA VMs starving

UdoB

Verbatim copy of https://bugzilla.proxmox.com/show_bug.cgi?id=6773



Short: a SUSPENDED ZFS pool is not recognized although all VMs stall and services are practically down.

Long: after the upgrade to Trixie 13.1 (pve-manager/9.0.6/49c767b70aeb6648 (running kernel: 6.14.11-1-pve)) on 08 Sept. 2025 and the accompanying reboot of the node, everything was fine for some hours. At ~00:15 the first messages become visible in the journal:
Code:
############## Excerpts from journal

Sep 07 00:04:57 pven kernel: amd_iommu_report_page_fault: 564 callbacks suppressed
Sep 07 00:04:57 pven kernel: ahci 0000:02:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0x8faac000 flags=0x0020]
Sep 07 00:04:57 pven kernel: ahci 0000:02:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0x8faac080 flags=0x0020]

...

Sep 07 00:05:49 pven kernel: ata2.00: exception Emask 0x0 SAct 0xb3ff0580 SErr 0x0 action 0x6 frozen
Sep 07 00:05:49 pven kernel: ata2.00: failed command: READ FPDMA QUEUED
Sep 07 00:05:49 pven kernel: ata2.00: cmd 60/70:38:80:e4:e2/00:00:9e:00:00/40 tag 7 ncq dma 57344 in
                                      res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 07 00:05:49 pven kernel: ata2.00: status: { DRDY }
Sep 07 00:05:49 pven kernel: ata2.00: failed command: READ FPDMA QUEUED
Sep 07 00:05:49 pven kernel: ata2.00: cmd 60/10:40:08:e5:e2/00:00:9e:00:00/40 tag 8 ncq dma 8192 in
                                      res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 07 00:05:49 pven kernel: ata2.00: status: { DRDY }
Sep 07 00:05:49 pven kernel: ata2.00: failed command: READ FPDMA QUEUED
Sep 07 00:05:49 pven kernel: ata2.00: cmd 60/80:50:b0:1b:48/00:00:9f:00:00/40 tag 10 ncq dma 65536 in
                                      res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 07 00:05:49 pven kernel: ata2.00: status: { DRDY }
Sep 07 00:05:49 pven kernel: ata2.00: failed command: READ FPDMA QUEUED
Sep 07 00:05:49 pven kernel: ata2.00: cmd 60/f8:80:18:80:de/00:00:9e:00:00/40 tag 16 ncq dma 126976 in
                                      res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 07 00:05:49 pven kernel: ata2.00: status: { DRDY }

...

Sep 07 00:16:00 pven kernel: ata1: link is slow to respond, please be patient (ready=0)
Sep 07 00:16:00 pven kernel: ata2: softreset failed (1st FIS failed)
Sep 07 00:16:00 pven kernel: ata2: hard resetting link

Sep 07 00:16:10 pven kernel: ata1: link is slow to respond, please be patient (ready=0)
Sep 07 00:16:10 pven kernel: ata2: softreset failed (1st FIS failed)
Sep 07 00:16:10 pven kernel: ata2: hard resetting link
Sep 07 00:16:14 pven kernel: ata1: softreset failed (1st FIS failed)
Sep 07 00:16:14 pven kernel: ata1: hard resetting link

...
 
Sep 07 00:16:49 pven kernel: ata1: softreset failed (1st FIS failed)
Sep 07 00:16:49 pven kernel: ata1: limiting SATA link speed to 3.0 Gbps
Sep 07 00:16:49 pven kernel: ata1: hard resetting link
Sep 07 00:16:50 pven kernel: ata2: softreset failed (1st FIS failed)
Sep 07 00:16:50 pven kernel: ata2: softreset failed
Sep 07 00:16:50 pven kernel: ata2: reset failed, giving up
Sep 07 00:16:50 pven kernel: ata2.00: disable device
Sep 07 00:16:50 pven kernel: ata2: EH complete
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#7 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=90s
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#7 CDB: Read(10) 28 00 63 f8 e7 78 00 00 80 00
Sep 07 00:16:50 pven kernel: I/O error, dev sdb, sector 1677256568 op 0x0:(READ) flags 0x0 phys_seg 4 prio class 0
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=858754314240 size=65536 flags=2148533376
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#8 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=90s
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#8 CDB: Read(10) 28 00 63 f8 e6 80 00 00 f8 00
Sep 07 00:16:50 pven kernel: I/O error, dev sdb, sector 1677256320 op 0x0:(READ) flags 0x0 phys_seg 9 prio class 0
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=858754187264 size=126976 flags=2148533376
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#9 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=90s
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#9 CDB: Read(10) 28 00 63 f8 e5 20 00 00 c0 00
Sep 07 00:16:50 pven kernel: I/O error, dev sdb, sector 1677255968 op 0x0:(READ) flags 0x0 phys_seg 6 prio class 0
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=858754007040 size=98304 flags=2148533376
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#29 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#11 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#29 CDB: Read(10) 28 00 62 41 98 08 00 00 f0 00
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#11 CDB: Read(10) 28 00 63 f8 ec 90 00 02 00 00
Sep 07 00:16:50 pven kernel: I/O error, dev sdb, sector 1648465928 op 0x0:(READ) flags 0x0 phys_seg 20 prio class 0
Sep 07 00:16:50 pven kernel: I/O error, dev sdb, sector 1677257872 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 0
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844013506560 size=122880 flags=2148533376
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#13 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=858754981888 size=131072 flags=2148533376
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=858755112960 size=131072 flags=2148533376
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#13 CDB: Read(10) 28 00 62 41 98 f8 00 00 f0 00
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#12 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#16 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Sep 07 00:16:50 pven kernel: I/O error, dev sdb, sector 1648466168 op 0x0:(READ) flags 0x0 phys_seg 20 prio class 0
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#16 CDB: Read(10) 28 00 62 41 99 e8 00 00 f0 00
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#12 CDB: Read(10) 28 00 62 41 7a 50 00 05 a0 00
Sep 07 00:16:50 pven kernel: I/O error, dev sdb, sector 1648466408 op 0x0:(READ) flags 0x0 phys_seg 20 prio class 0
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844013752320 size=122880 flags=2148533376
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844013629440 size=122880 flags=2148533376
Sep 07 00:16:50 pven kernel: I/O error, dev sdb, sector 1648458320 op 0x0:(READ) flags 0x0 phys_seg 120 prio class 0
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844009611264 size=122880 flags=2148533376
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844009734144 size=122880 flags=2148533376
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#17 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Sep 07 00:16:50 pven kernel: I/O error, dev sdb, sector 1648466648 op 0x0:(READ) flags 0x0 phys_seg 20 prio class 0
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#17 CDB: Read(10) 28 00 63 f8 f8 08 00 01 00 00
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844013875200 size=122880 flags=2148533376
Sep 07 00:16:50 pven kernel: I/O error, dev sdb, sector 1677260808 op 0x0:(READ) flags 0x0 phys_seg 8 prio class 0
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=858756485120 size=131072 flags=2148533376
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#19 CDB: Read(10) 28 00 62 41 7f f0 00 00 f0 00
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844010348544 size=122880 flags=2148533376
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844009857024 size=122880 flags=2148533376
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=858756616192 size=118784 flags=2148533376
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844009979904 size=122880 flags=2148533376
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844010102784 size=122880 flags=2148533376
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844013998080 size=122880 flags=2148533376


Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.
Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.
Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.
Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.
Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.
Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.
Sep 07 00:16:54 pven zed[555707]: eid=3554 class=io pool='ds0' vdev=ata-INTEL_SSDSC2KB019T7_PHYS750400LV1P9DGN-part1 size=8192 offset=270336 priority=0 err=5 flags=0x1300c1
Sep 07 00:16:54 pven zed[555710]: eid=3555 class=io pool='ds0' vdev=ata-INTEL_SSDSC2KB019T7_PHYS750400LV1P9DGN-part1 size=8192 offset=1920373104640 priority=0 err=5 flags=0x1300c1
Sep 07 00:16:54 pven zed[555714]: eid=3556 class=io pool='ds0' vdev=ata-INTEL_SSDSC2KB019T7_PHYS750400LV1P9DGN-part1 size=8192 offset=1920373366784 priority=0 err=5 flags=0x1300c1
Sep 07 00:16:54 pven zed[555717]: eid=3557 class=probe_failure pool='ds0' vdev=ata-INTEL_SSDSC2KB019T7_PHYS750400LV1P9DGN-part1
Sep 07 00:16:54 pven zed[555720]: eid=3558 class=io pool='ds0' size=40960 offset=1327999275008 priority=2 err=6 flags=0x200180 bookmark=119946:1:1:231
Sep 07 00:16:54 pven zed[555723]: eid=3559 class=io pool='ds0' size=16384 offset=858752720896 priority=2 err=6 flags=0x208181 bookmark=119946:1:0:228688
Sep 07 00:16:54 pven zed[555725]: eid=3560 class=io pool='ds0' size=16384 offset=858752737280 priority=2 err=6 flags=0x208181 bookmark=119946:1:0:228689
Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.
Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.
Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.
Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.
Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.

############## Excerpts from journal
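
For anyone who wants to watch for this condition themselves, here is a minimal sketch that picks the suspension warning out of journal text. The regular expression is copied from the kernel lines in the excerpt above; the function name `suspended_pools` is purely illustrative and has nothing to do with how PVE works internally.

```python
import re

# Message text as it appears in the kernel lines of the excerpt above.
SUSPEND_RE = re.compile(
    r"WARNING: Pool '(?P<pool>[^']+)' has encountered an "
    r"uncorrectable I/O failure and has been suspended"
)


def suspended_pools(journal_text: str) -> list[str]:
    """Return the names of pools reported as suspended, without duplicates."""
    found: list[str] = []
    for line in journal_text.splitlines():
        match = SUSPEND_RE.search(line)
        if match and match.group("pool") not in found:
            found.append(match.group("pool"))
    return found


# Two lines taken from the excerpt: one unrelated, one suspension warning.
sample = (
    "Sep 07 00:16:50 pven kernel: ata2: EH complete\n"
    "Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an "
    "uncorrectable I/O failure and has been suspended.\n"
)
print(suspended_pools(sample))
```

The text to scan could come from something like `journalctl -k -b --no-pager`; what to do once a pool shows up in the result is a separate question.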

Both sda+sdb are "good" devices. After a cold reboot everything was up and running again; "scrub" resulted in zero errors.

But this issue is _not_ about the triggering hardware/driver problem. I am unhappy that the suspended pool - where all VMs are stored - is not recognized by PVE.

Expected behavior from _my_ specific (and limited) point of view: as soon as an important ZFS pool (one holding HA-relevant resources) gets "suspended", the HA stack should fence this node.
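
To illustrate the kind of check I have in mind, here is a rough sketch of an external watchdog. It is not an existing PVE feature. It assumes that `zpool list -H -o name,health` prints tab-separated name/health pairs and that a suspended pool shows up with health `SUSPENDED`. The helper names and the 10-second timeout are my own choices. The timeout is there because ZFS commands may block while a pool is suspended, so a timeout should itself be treated as a warning sign.

```python
import subprocess


def parse_suspended(zpool_output: str) -> list[str]:
    """Return pool names whose health column reads SUSPENDED.

    Expects the tab-separated output of `zpool list -H -o name,health`.
    """
    pools = []
    for line in zpool_output.splitlines():
        fields = line.split("\t")
        if len(fields) == 2 and fields[1].strip() == "SUSPENDED":
            pools.append(fields[0])
    return pools


def query_pool_health(timeout_s: float = 10.0) -> str:
    """Run zpool and return its raw output.

    Raises subprocess.TimeoutExpired if the command does not return in
    time (assumption: this can happen while a pool is suspended).
    """
    result = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,health"],
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return result.stdout


# Parsing demonstrated on hand-written sample output, not on a live system.
sample = "rpool\tONLINE\nds0\tSUSPENDED\n"
print(parse_suspended(sample))
```

On a real node one would call `parse_suspended(query_pool_health())` periodically. How the result should feed into the HA stack (fencing, alerting, something else) is exactly the open question of this report.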

Thanks for reading and for this great software :-)
 
... after upgrade to Trixie 13.1 ... on 08 Sept. 2025 ...
Wow, you are living in the future and/or we in the past ... but anyway, a node with failed VM resources should not start machines, and I assume fencing would not always help, or perhaps mostly would not (as may be true in your case).
:)
 
Wow, you are living in the future
Maybe UdoB is from Australia
Thanks for the hint; no I am in Germany.

I looked at the large clock on my wall; it shows the right date. I read it wrong, for some unknown reason...

And, of course, if you type something wrong with full conviction, you can look at it as often as you want: it still looks correct.