Help! - Multipathing keeps breaking my root ZFS pool

Morning Proxmoxers!

I have a serious problem and am at my wit's end as to how to resolve it:

I have an 18-node cluster. Each node is set up with 2 x SATA SSDs in ZFS RAID1 for the OS, and various storage appliances are presented over iSCSI with LVM thick provisioning as shared storage for the cluster.

Nodes 1-7 were set up last year using the Proxmox 8.1 ISO; all of the problem nodes (8-18) were installed this year using the Proxmox 8.3 and then 8.4 ISOs.

Nodes 1-7 have no issues, but the newer nodes 8-18 all keep hitting this problem:

After a few reboots (a seemingly random number), multipathing seems to somehow corrupt my ZFS pool by relabelling the disks, resulting in the below:


PRODUCTION [root@prox18-dc01 ~]# zpool status -v
  pool: rpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 00:01:30 with 0 errors on Sun May 11 00:25:31 2025
config:

        NAME                         STATE     READ WRITE CKSUM
        rpool                        DEGRADED     0     0     0
          mirror-0                   DEGRADED     0     0     0
            35111111012345679-part3  ONLINE       0     0     0
            5587448289148697325      UNAVAIL      0     0     0  was /dev/sdb3

errors: No known data errors

I have tried zpool replace on the disk; I have also tried force-removing the failed disk, formatting it, and then zpool attach/add of the refreshed disk, all with no success.

I'm assuming the issue relates to a bug in the current Proxmox/ZFS integration, since it isn't affecting nodes that were set up last year using an older version of Proxmox and, I assume, an older version of ZFS (my understanding is that a pool is created with the ZFS version in use at the time; even if you later upgrade ZFS, an existing pool stays on the version it was created with unless you specifically upgrade the pool).
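To check that part, the pool's on-disk format can be inspected without changing anything; for example, something like:

    # list any pools whose on-disk format is older than what the installed ZFS supports
    zpool upgrade
    # show the legacy version property and the enabled feature flags for rpool
    zpool get version rpool
    zpool get all rpool | grep feature@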

Has anyone seen this issue before, and does anyone have any idea how to resolve it without needing to reinstall the host?

If I can't find a solution I'm going to need to revert to using hardware RAID like it's 2001 :(

Looking forward to hearing from you all.
 
Have the "working" nodes been updated so that all share the same package versions?
What about configuration files in /etc, are there differences?
 
Literally all nodes are otherwise exactly the same - same HW, same setup and config, same multipath config, all patched and updated to the latest versions in the enterprise repo.

The only difference is when they were deployed and which version of the Proxmox ISO was used.
 
I've re-read your post several times: the word multipath appears in the title, in your statement that it breaks things, and in the subsequent statement that the multipath configuration is the same across all nodes.
Perhaps I missed it, but how did you come to the conclusion that multipath is involved in your ZFS issues?


 
The disks' serial numbers on the "good" node are different.
On the "bad" node, both disks have the same serial number. That can certainly confuse multipath.
You could immediately exclude sda and sdb from multipath consideration in the config file. You should then look into any firmware updates for your disks, examine their parameters closely with udev and other tools, and perhaps ship them back to the manufacturer.
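As a rough sketch (assuming the local SSDs really do enumerate as sda/sdb; matching by WWID is safer since sdX names can move between boots), a blacklist stanza in /etc/multipath.conf could look like:

    blacklist {
        # exclude the local boot SSDs by device node ...
        devnode "^sd[ab]$"
        # ... or, more robustly, by WWID (query it with: /lib/udev/scsi_id -g -u /dev/sda)
        # wwid "<wwid-of-local-ssd>"
    }

After editing, restart multipathd (systemctl restart multipathd); if multipath is also active in your initramfs you may need update-initramfs -u as well.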

Are these consumer disks or enterprise?

PS you can also run: sg_inq /dev/sdX
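Another quick way to eyeball the reported identities side by side (standard lsblk columns, nothing multipath-specific):

    lsblk -o NAME,MODEL,SERIAL,WWN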


 
@bbgeek17 - Yes, I think you are onto something.

Nodes 1-7 are all using enterprise disks; nodes 8-18 are using cheap commercial SSDs (we got lots of spares to accommodate failures).

I will revert to using the RAID controller on these hosts - that way the disk presented by the RAID controller will have unique properties.

This issue could probably be avoided by blacklisting the local disks in /etc/multipath.conf - that will prevent the issue, but it won't fix it once it has already happened.
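(For anyone who hits this later, a possible recovery path I haven't verified, sketched only: the GUID below is the one from my zpool status output above, and the map name is a placeholder, so adjust to your node.)

    # check whether multipath is still holding the second SSD
    multipath -ll
    # flush the stale map so /dev/sdb and its partitions are exposed again
    multipath -f <map_name_or_wwid>
    # clear the now-invalid ZFS label on the freed partition
    zpool labelclear -f /dev/sdb3
    # reattach it to the mirror in place of the UNAVAIL member
    zpool replace rpool 5587448289148697325 /dev/sdb3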

So either way it's going to be a long day... host reinstalls, here we come...

Thanks for your input :)
 
Hello,

I think the natural question here is why your host sees different disks with the same serial number. I can imagine this happening if the disks were presented to the host via two different paths, e.g. using iSCSI with multiple portals to the same LUN, but I would personally recommend against using ZFS with non-local disks. ZFS works best with fewer layers between ZFS and the disks; in their documentation they even recommend against hardware RAID controllers [1] when not using IT/HBA mode.

How did you add the disks to the zpool? Multipath creates mapped devices (at `/dev/mapper/`) when it sees two disks with the same WWN, and giving ZFS those mapped devices instead of the disks directly (e.g. by their path under `/dev/disk/by-id/` or `/dev/`) would be the better option, but the caveat mentioned above would still apply.
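As a rough illustration only (these are generic commands, not specific to your setup), you can check what the pool members currently resolve to and whether multipath has claimed the same disks:

    # show the full paths the pool members currently resolve to
    zpool status -P rpool
    # check whether multipath has created mapped devices for the same disks
    ls -l /dev/mapper/
    # list the persistent names that could be used instead of /dev/sdX
    ls -l /dev/disk/by-id/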

[1] https://openzfs.github.io/openzfs-d...uning/Hardware.html#hardware-raid-controllers
 
Hi @Maximiliano ,

Rather than iSCSI, the more likely scenario, given that these are boot disks, is a dual-expander SAS drive.

That said, there have been previous reports on the forum of mass-produced, budget consumer disks sharing identical serial numbers, particularly when sourced from the same batch. So, this issue would fall in line with those lovely manufacturing quirks.

