FYI: ZFS pool SUSPENDED not recognized / handled - even with HA VMs starving

UdoB

Distinguished Member
Nov 1, 2016
Germany
Verbatim copy of https://bugzilla.proxmox.com/show_bug.cgi?id=6773



Short: a SUSPENDED ZFS pool is not recognized by PVE, although all VMs stall and services are practically down.

Long: after upgrade to Trixie 13.1 (pve-manager/9.0.6/49c767b70aeb6648 (running kernel: 6.14.11-1-pve)) on 08 Sept. 2025 and a subsequent reboot of the node, everything was fine for some hours. At ~00:15 the first messages became visible in the journal:
Code:
############## Excerpts from journal

Sep 07 00:04:57 pven kernel: amd_iommu_report_page_fault: 564 callbacks suppressed
Sep 07 00:04:57 pven kernel: ahci 0000:02:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0x8faac000 flags=0x0020]
Sep 07 00:04:57 pven kernel: ahci 0000:02:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0x8faac080 flags=0x0020]

...

Sep 07 00:05:49 pven kernel: ata2.00: exception Emask 0x0 SAct 0xb3ff0580 SErr 0x0 action 0x6 frozen
Sep 07 00:05:49 pven kernel: ata2.00: failed command: READ FPDMA QUEUED
Sep 07 00:05:49 pven kernel: ata2.00: cmd 60/70:38:80:e4:e2/00:00:9e:00:00/40 tag 7 ncq dma 57344 in
                                      res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 07 00:05:49 pven kernel: ata2.00: status: { DRDY }
Sep 07 00:05:49 pven kernel: ata2.00: failed command: READ FPDMA QUEUED
Sep 07 00:05:49 pven kernel: ata2.00: cmd 60/10:40:08:e5:e2/00:00:9e:00:00/40 tag 8 ncq dma 8192 in
                                      res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 07 00:05:49 pven kernel: ata2.00: status: { DRDY }
Sep 07 00:05:49 pven kernel: ata2.00: failed command: READ FPDMA QUEUED
Sep 07 00:05:49 pven kernel: ata2.00: cmd 60/80:50:b0:1b:48/00:00:9f:00:00/40 tag 10 ncq dma 65536 in
                                      res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 07 00:05:49 pven kernel: ata2.00: status: { DRDY }
Sep 07 00:05:49 pven kernel: ata2.00: failed command: READ FPDMA QUEUED
Sep 07 00:05:49 pven kernel: ata2.00: cmd 60/f8:80:18:80:de/00:00:9e:00:00/40 tag 16 ncq dma 126976 in
                                      res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 07 00:05:49 pven kernel: ata2.00: status: { DRDY }

...

Sep 07 00:16:00 pven kernel: ata1: link is slow to respond, please be patient (ready=0)
Sep 07 00:16:00 pven kernel: ata2: softreset failed (1st FIS failed)
Sep 07 00:16:00 pven kernel: ata2: hard resetting link

Sep 07 00:16:10 pven kernel: ata1: link is slow to respond, please be patient (ready=0)
Sep 07 00:16:10 pven kernel: ata2: softreset failed (1st FIS failed)
Sep 07 00:16:10 pven kernel: ata2: hard resetting link
Sep 07 00:16:14 pven kernel: ata1: softreset failed (1st FIS failed)
Sep 07 00:16:14 pven kernel: ata1: hard resetting link

...
 
Sep 07 00:16:49 pven kernel: ata1: softreset failed (1st FIS failed)
Sep 07 00:16:49 pven kernel: ata1: limiting SATA link speed to 3.0 Gbps
Sep 07 00:16:49 pven kernel: ata1: hard resetting link
Sep 07 00:16:50 pven kernel: ata2: softreset failed (1st FIS failed)
Sep 07 00:16:50 pven kernel: ata2: softreset failed
Sep 07 00:16:50 pven kernel: ata2: reset failed, giving up
Sep 07 00:16:50 pven kernel: ata2.00: disable device
Sep 07 00:16:50 pven kernel: ata2: EH complete
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#7 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=90s
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#7 CDB: Read(10) 28 00 63 f8 e7 78 00 00 80 00
Sep 07 00:16:50 pven kernel: I/O error, dev sdb, sector 1677256568 op 0x0:(READ) flags 0x0 phys_seg 4 prio class 0
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=858754314240 size=65536 flags=2148533376
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#8 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=90s
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#8 CDB: Read(10) 28 00 63 f8 e6 80 00 00 f8 00
Sep 07 00:16:50 pven kernel: I/O error, dev sdb, sector 1677256320 op 0x0:(READ) flags 0x0 phys_seg 9 prio class 0
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=858754187264 size=126976 flags=2148533376
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#9 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=90s
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#9 CDB: Read(10) 28 00 63 f8 e5 20 00 00 c0 00
Sep 07 00:16:50 pven kernel: I/O error, dev sdb, sector 1677255968 op 0x0:(READ) flags 0x0 phys_seg 6 prio class 0
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=858754007040 size=98304 flags=2148533376
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#29 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#11 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#29 CDB: Read(10) 28 00 62 41 98 08 00 00 f0 00
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#11 CDB: Read(10) 28 00 63 f8 ec 90 00 02 00 00
Sep 07 00:16:50 pven kernel: I/O error, dev sdb, sector 1648465928 op 0x0:(READ) flags 0x0 phys_seg 20 prio class 0
Sep 07 00:16:50 pven kernel: I/O error, dev sdb, sector 1677257872 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 0
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844013506560 size=122880 flags=2148533376
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#13 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=858754981888 size=131072 flags=2148533376
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=858755112960 size=131072 flags=2148533376
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#13 CDB: Read(10) 28 00 62 41 98 f8 00 00 f0 00
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#12 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#16 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Sep 07 00:16:50 pven kernel: I/O error, dev sdb, sector 1648466168 op 0x0:(READ) flags 0x0 phys_seg 20 prio class 0
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#16 CDB: Read(10) 28 00 62 41 99 e8 00 00 f0 00
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#12 CDB: Read(10) 28 00 62 41 7a 50 00 05 a0 00
Sep 07 00:16:50 pven kernel: I/O error, dev sdb, sector 1648466408 op 0x0:(READ) flags 0x0 phys_seg 20 prio class 0
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844013752320 size=122880 flags=2148533376
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844013629440 size=122880 flags=2148533376
Sep 07 00:16:50 pven kernel: I/O error, dev sdb, sector 1648458320 op 0x0:(READ) flags 0x0 phys_seg 120 prio class 0
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844009611264 size=122880 flags=2148533376
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844009734144 size=122880 flags=2148533376
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#17 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Sep 07 00:16:50 pven kernel: I/O error, dev sdb, sector 1648466648 op 0x0:(READ) flags 0x0 phys_seg 20 prio class 0
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#17 CDB: Read(10) 28 00 63 f8 f8 08 00 01 00 00
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844013875200 size=122880 flags=2148533376
Sep 07 00:16:50 pven kernel: I/O error, dev sdb, sector 1677260808 op 0x0:(READ) flags 0x0 phys_seg 8 prio class 0
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=858756485120 size=131072 flags=2148533376
Sep 07 00:16:50 pven kernel: sd 1:0:0:0: [sdb] tag#19 CDB: Read(10) 28 00 62 41 7f f0 00 00 f0 00
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844010348544 size=122880 flags=2148533376
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844009857024 size=122880 flags=2148533376
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=858756616192 size=118784 flags=2148533376
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844009979904 size=122880 flags=2148533376
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844010102784 size=122880 flags=2148533376
Sep 07 00:16:50 pven kernel: zio pool=ds0 vdev=/dev/disk/by-id/ata-INTEL_SSDSC2KB019T7_PHYS744100LH1P9DGN-part1 error=5 type=1 offset=844013998080 size=122880 flags=2148533376


Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.
Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.
Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.
Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.
Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.
Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.
Sep 07 00:16:54 pven zed[555707]: eid=3554 class=io pool='ds0' vdev=ata-INTEL_SSDSC2KB019T7_PHYS750400LV1P9DGN-part1 size=8192 offset=270336 priority=0 err=5 flags=0x1300c1
Sep 07 00:16:54 pven zed[555710]: eid=3555 class=io pool='ds0' vdev=ata-INTEL_SSDSC2KB019T7_PHYS750400LV1P9DGN-part1 size=8192 offset=1920373104640 priority=0 err=5 flags=0x1300c1
Sep 07 00:16:54 pven zed[555714]: eid=3556 class=io pool='ds0' vdev=ata-INTEL_SSDSC2KB019T7_PHYS750400LV1P9DGN-part1 size=8192 offset=1920373366784 priority=0 err=5 flags=0x1300c1
Sep 07 00:16:54 pven zed[555717]: eid=3557 class=probe_failure pool='ds0' vdev=ata-INTEL_SSDSC2KB019T7_PHYS750400LV1P9DGN-part1
Sep 07 00:16:54 pven zed[555720]: eid=3558 class=io pool='ds0' size=40960 offset=1327999275008 priority=2 err=6 flags=0x200180 bookmark=119946:1:1:231
Sep 07 00:16:54 pven zed[555723]: eid=3559 class=io pool='ds0' size=16384 offset=858752720896 priority=2 err=6 flags=0x208181 bookmark=119946:1:0:228688
Sep 07 00:16:54 pven zed[555725]: eid=3560 class=io pool='ds0' size=16384 offset=858752737280 priority=2 err=6 flags=0x208181 bookmark=119946:1:0:228689
Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.
Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.
Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.
Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.
Sep 07 00:16:54 pven kernel: WARNING: Pool 'ds0' has encountered an uncorrectable I/O failure and has been suspended.

############## Excerpts from journal

Both sda+sdb are "good" devices. After a cold reboot everything was up and running again; "scrub" resulted in zero errors.
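For reference, checking that looks roughly like this (generic zpool usage with the pool name ds0 from the logs above, not an exact transcript of my shell session):
Code:
# state of the pool after the cold reboot
zpool status ds0

# run a full scrub and check the result afterwards
zpool scrub ds0
zpool status ds0     # ended with "scrub repaired 0B ... with 0 errors" here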

But this issue is _not_ about the triggering hardware/driver problem. I am unhappy that the suspended pool - where all VMs are stored - is not recognized by PVE.

Expected behavior from _my_ specific (and limited) point of view: immediately after an important ZFS pool (one holding HA-relevant resources) gets "suspended", the HA stack should have fenced this node.

Thanks for reading and for this great software :-)
 
... after upgrade to Trixie 13.1 ... on 08 Sept. 2025 ...
Wow, you are living in the future and/or we are in the past ... but anyway, a node with failed VM resources should not start machines, and I assume fencing would not always help - in most cases it would not (as perhaps in your case).
:)
 
I don't think a day has recently had up to 26 hours, and even the logs address Sep 7.
:)
 
Wow, you are living in the future
Maybe UdoB is from Australia
Thanks for the hint; no I am in Germany.

I looked at the large clock on my wall; it shows the right date. I simply read it wrong, for some unknown reason...

And - of course - once you have typed the wrong thing with conviction, you can look at it as often as you want; it still looks correct to you.
 
It is indeed disappointing to learn that if VM storage fails then HA doesn't migrate VMs.
 
Fabian Grünbichler 2025-09-08 08:29:28 CEST

HA fencing is solely based on corosync/pmxcfs status, storage (or other) health checks are not incorporated.

if you want to react to other health events, you need to configure your own monitoring that triggers custom actions.
How did you solve this issue? Do you have a custom monitoring script?
 
How did you solve this issue?
I did not. My VMs got stuck.
Do you have a custom monitoring script?
For monitoring? Sure. In this specific case "zed" (see man zed, the ZFS Event Daemon) is probably sufficient. This is by far the simplest approach, as it watches ZFS exclusively. Then there are a zillion full-blown monitoring solutions out there.

But pure monitoring does not solve the problem that those VMs are not moved to another node automatically. This is suboptimal. Fortunately I have no truly mission-critical systems that can't wait until a colleague or I handle these rare situations manually - early in the morning.

Theoretically one could write a (simple?) script to take action automatically. But... don't underestimate the complexity of a generic solution to this problem. The script would need to consider multiple events/conditions to implement a useful reaction in the various situations. I will not develop that one ;-)
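That said, for the narrow "react to a ZFS event" part, zed alone can already run hooks: it calls executable "all-*" zedlets in /etc/zfs/zed.d/ for every event and exports ZEVENT_* variables. A minimal, untested sketch - the zedlet file name, the pool name and the log-only reaction are placeholders:
Code:
#!/bin/sh
# /etc/zfs/zed.d/all-pool-health.sh  -- hypothetical zedlet name, must be executable
# zed calls "all-*" zedlets for every event and sets ZEVENT_* in the environment.

POOL="ds0"                                   # placeholder: the pool to watch

# only react to events concerning our pool
[ "${ZEVENT_POOL}" = "${POOL}" ] || exit 0

# re-check the pool health; ONLINE/DEGRADED is still acceptable
HEALTH="$(zpool list -H -o health "${POOL}" 2>/dev/null)"
case "${HEALTH}" in
    ONLINE|DEGRADED) exit 0 ;;
esac

# placeholder reaction: only log it; a real setup might notify or power the node off
logger -t zfs-watch "pool ${POOL} health is '${HEALTH:-unknown}' after event ${ZEVENT_CLASS}"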
 
Expected behavior from _my_ specific (and limited) point of view: immediately after an important ZFS pool (one holding HA-relevant resources) gets "suspended", the HA stack should have fenced this node.
A couple of reasons this isn't actually so.
1. The store in question was limited to a single node and would not have affected anything in a cluster - i.e., no other cluster node has access to this store. A ZFS store is NOT HA BY DEFINITION.
2. Any node may have more than one datastore. Fencing a node because ONE store is out is a pretty bad reason to kill the node, especially if there are other VMs on other datastores.
3. It would be nice to have a watchdog for a ZFS-replicated pair. It would be relatively easy to implement, I think, but it wouldn't be per node, it would be per VM. From my point of view, a ZFS-replicated pair is a bad design for HA, but since the devs decided to implement it anyway, we may as well go the next step.
 
A couple of reasons this isn't actually so.
1. The store in question was limited to a single node and would not have affected anything in a cluster - i.e., no other cluster node has access to this store. A ZFS store is NOT HA BY DEFINITION.
2. Any node may have more than one datastore. Fencing a node because ONE store is out is a pretty bad reason to kill the node, especially if there are other VMs on other datastores.
3. It would be nice to have a watchdog for a ZFS-replicated pair. It would be relatively easy to implement, I think, but it wouldn't be per node, it would be per VM. From my point of view, a ZFS-replicated pair is a bad design for HA, but since the devs decided to implement it anyway, we may as well go the next step.
I will concede these are reasons.

However, all these arguments completely sidestep the fact that system administrators should not have to manually babysit individual nodes. Nodes should be automatically managed based on the viability of their resources and VMs.

1. While it is technically true that local resources are not HA, the viability of local resources directly affects availability and should thus not be completely decoupled from HA. Each node should be continuously evaluated on its viability, and this information should be fed back into HA decisions that influence migration and fencing.

2. Agreed, any node may have multiple datastores and VMs. That is why each resource should have a defined level of criticality that determines what actions should be taken when a viability evaluation fails. The levels would correspond to “do nothing”, “restart”, “migrate”, “fence off”, etc.

3. I wholeheartedly agree with the notion that a ZFS replication watchdog would be very nice.

I keep hearing the opinion that ZFS replication for HA is a bad design, but what is the alternative when CEPH et al. isn’t an option?

I have yet to hear of any other solution. ZFS replication for HA is a far better alternative than having no alternative at all. And ZFS replication for HA being a non-optimal solution isn’t a reason not to make it better.

Anyway, these are just my opinions. I don’t really expect them to effect change or to have any impact at all. I profess to having no particular insight or original thought.

If you’ve made it this far, thanks for indulging me.
 
but what is the alternative when CEPH et al. isn’t an option?
Any form of shared storage. NFS, iSCSI, etc.

ZFS replication for HA is a far better alternative to having no alternative at all
see above.

However, all these arguments completely sidestep the fact that system administrators should not have to manually babysit individual nodes.
ZFS is host-specific. If it becomes unavailable, what would be the point of fencing the node in the first place?
 
Excuse my ignorance, but does this not create a single point of failure for the whole cluster?
Not necessarily. Dual (or multi) controller NAS is a thing.

To automatically force VM migration to an unaffected node.
Repeat after me. Replication isn't High Availability. High availability necessarily requires REAL TIME availability, which replication cannot provide.

"But in my use case, the data doesnt change that much and it would work perfectly fine!"

The reason replication isn't used for that kind of operation is that, with static content, application-level HA is much more effectively handled with something as simple as round-robin DNS, which has the added benefit of not NEEDING a failover trigger - it just marks a path down if a health check is not passed.

"I still want to failover on storage fail!"

You can accomplish this with a simple entry in your crontab. As I said in my original response, it would have been nice if the PVE devs had included this (or better yet, a per-VM trigger), but it's not a showstopper.
 
To automatically force VM migration to an unaffected node.
Mmh, I would say a live migration of a VM that is hanging because it has lost its image storage doesn't work, so you need a restart of the VM - and, in the case of zvol usage, with older image content, depending on the last successful replication before the local ZFS died.
 
Mmh, I would say a live migration of a VM that is hanging because it has lost its image storage doesn't work, so you need a restart of the VM - and, in the case of zvol usage, with older image content, depending on the last successful replication before the local ZFS died.
Yeah, live migration is obviously out at that point. But that's not the point; any kind of migration would do. The problem is that there is no migration.
 
Not necessarily. Dual (or multi) controller NAS is a thing.
Can you provide more detail? I'd like to look into this.

The obvious follow-up question: are dual-controller NAS setups less resource-intensive than CEPH, and can they be run on the equivalent of a three-node cluster?

Repeat after me. Replication isn't High Availability. High availability necessarily requires REAL TIME availability, which replication cannot provide.
Neither is Proxmox HA with CEPH, so what is your point?

There is nothing real-time about a fenced-off node. Regardless of whether you use CEPH or ZFS sync, the VM will be out of action for a few minutes.

The only difference between CEPH and ZFS sync is that with ZFS sync you lose the data of the last sync delta.
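(For what it's worth, how far behind that delta is can be checked from the CLI with PVE's built-in replication tooling; a quick sketch:)
Code:
# list the configured storage replication jobs on this node
pvesr list

# per-job status, including the time of the last successful sync
pvesr status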

"But in my use case, the data doesnt change that much and it would work perfectly fine!"

The reason replication isn't used for that kind of operation is that, with static content, application-level HA is much more effectively handled with something as simple as round-robin DNS, which has the added benefit of not NEEDING a failover trigger - it just marks a path down if a health check is not passed.
Yeah, sure, but that's not my use case.

"I still want to failover on storage fail!"

You can accomplish this with a simple entry in your crontab. As I said in my original response, it would have been nice if the PVE devs had included this (or better yet, a per-VM trigger), but it's not a showstopper.
Yes, I do want to fail over on storage failure. Please do tell what this simple crontab entry is; it would be very helpful, and I could stop pestering people about it.
 
Can you provide more detail? I'd like to look into this.
google high availability storage. There are many options available.

are dual controller NAS setups less resource intensive than CEPH
Resources aren't relevant in and of themselves. Storage solutions can be very low to very high powered, with different classes of disk, tiering and caching methods, and with varying features. Ceph doesn't have to be hyperconverged, in which case it will not impact compute resource availability any more than other storage solutions.

Neither is Proxmox HA with CEPH, so what is your point?
Yes. It is. You can eliminate a whole node full of disks and the storage will keep on ticking. It's the whole point.

Regardless of whether you use CEPH or ZFS sync, the VM will be out of action for a few minutes.
You are confusing data availability with application availability. If your ZFS pool dies, any and all data recorded to it die with it. Your replica is only valid up to its last synchronization. Ceph data is valid up to the last committed write, regardless of the node hosting the VM.

Please do tell what this simple crontab entry is, it would be very helpful and I can stop pestering people about it.
*/5 * * * * /usr/local/sbin/checkpool.sh

Code:
#!/bin/sh
# checkpool.sh - power the node off hard if the ZFS pool is no longer usable

POOL="tank"   # replace with your pool name

# Accept ONLINE or DEGRADED; anything else (SUSPENDED, FAULTED, UNAVAIL,
# or a pool that no longer answers) triggers the shutdown below.
if ! zpool list -H -o health "$POOL" 2>/dev/null | grep -Eq '^(ONLINE|DEGRADED)$'; then
    echo "$(date) - Pool $POOL is not healthy. Powering off node." >> /var/log/zfs-pool-monitor.log
    # sysrq 'o': immediate power-off, no clean shutdown (which would likely hang)
    echo o > /proc/sysrq-trigger
fi

(edit: checking for just ONLINE would clobber the node even on a merely DEGRADED pool, so DEGRADED is accepted too. Also, a clean shutdown would probably hang since a dataset is gone, so I replaced it with a harder clobber.)
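To actually use it (assuming the path from the crontab line above): make the script executable and put the */5 entry into root's crontab:
Code:
chmod +x /usr/local/sbin/checkpool.sh
crontab -e    # as root, add the */5 line shown above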
 
google high availability storage. There are many options available.
Thank you for the suggestion, but google was somewhat unhelpful. Can you give an example of an open source alternative?

Yes. It is. You can eliminate a whole node full of disks and the storage will keep on ticking. It's the whole point.
I concede the point.


*/5 * * * * /usr/local/sbin/checkpool.sh

Thank you, that was indeed very simple and extremely helpful. I was initially put off by UdoB's comment about complexity.

@UdoB any reason you didn't just use something like this? I don't really see why this would not work as long as cron and zpool list run. Any unexpected corner cases you can identify?
 
@UdoB any reason you didn't just use something like this?

No. That script looks fine - and I had already "Like"d it. :-)

I am just a bit reluctant. While I really like Debian and I know enough to do a lot of optimization/scripting, I usually hesitate to add another script/tool/mechanism/agent/cronjob/service/whatever to the vanilla (PVE) host. It increases the complexity, and even though I document everything, I am not sure I would immediately remember the background and rationale of each tweak when there is a malfunction some years from now. (Have you ever looked into a very old installation and wondered what all those /usr/local/sbin scripts do?)

In this specific case (with the unavailable ZFS pool) the actual reason is that this happened to me for the very first time - and "just" in my non-critical home lab.
 