[SOLVED] NVMe-oF multipath inconsistency

Freddy Lege

Renowned Member
Jan 17, 2017
Hi,

We are currently experiencing an issue mounting NVMe-oF/TCP multipath volumes: after rebooting following the update to version 9.2, some shared volumes are no longer visible on one of our hosts.
In dmesg, we find errors such as "IDs don't match for shared namespace," even though the subsystems are properly connected.
We have tried "nvme disconnect-all," "nvme discover," "nvme ns-rescan," etc., but nothing has worked.
We have run various analyses but haven't been able to find the source of the problem. The strangest thing is that another identical server (in terms of hardware and installation) does not have this issue.
Has anyone else encountered this problem?

Regards.
 
Hi @Freddy Lege ,
We have seen no issues in our testing with PVE 9.2 and Kernel 7. Perhaps you can expand with more details, i.e.:
- storage vendor you are using
- exact PVE versions (pveversion / pveversion -v)
- how the connections were initially established
- exact error messages and boot sequence from the logs
- nvme list (from all hosts)
- nvme list-subsys (from all hosts)
- ip configuration comparison across all hosts
- what exact steps were part of your analyses and what were the results
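
Since one host works and the other does not, a side-by-side diff of the captured command output is often the quickest way to spot the difference. A minimal sketch in Python, assuming you have already saved the text output of each command per host; the function name, host names, and sample strings are placeholders for illustration, not real nvme-cli output:

```python
import difflib


def diff_host_outputs(name_a, text_a, name_b, text_b):
    """Return unified-diff lines between two hosts' captured command output."""
    return list(
        difflib.unified_diff(
            text_a.splitlines(),
            text_b.splitlines(),
            fromfile=name_a,
            tofile=name_b,
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Placeholder text standing in for e.g. saved "nvme list-subsys" output.
    good_host = "subsystem one\npath one live\npath two live"
    bad_host = "subsystem one\npath one live"
    for line in diff_host_outputs("host-ok", good_host, "host-bad", bad_host):
        print(line)
```

An empty result means the two captures are identical; lines prefixed with "-" or "+" appear on only one of the hosts.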

Given that PVE does not use proprietary connection methods to NVMe storage but rather standard Linux interfaces, have you contacted your storage vendor for help?

Cheers


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi bbgeek17,

Indeed, my question lacked details ;)

I'm using the kernel's native NVMe-oF/TCP multipath on my PVE 9.2 servers to mount volumes exported from HPE Alletra arrays.

Since posting my question, however, I've found the source of the problem, which was quite difficult to identify.

I have two active-active synchronized arrays. Some volumes are synchronized between the two arrays, while others are stored separately.

The problem stemmed from the fact that one of the arrays was promoted to master and exposed all paths to all volumes with its own NQN. This is normal for synchronized volumes, because it ensures that clients see only a single volume regardless of which array they access. However, this master array also exposes the unsynchronized volumes of the other array, which in turn exposes them with its own NQN. This caused conflicts between identical NDSIDs and different NQNs, hence the mounting problems.
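
For anyone who wants to check for the same situation, the conflict can be sketched as a simple grouping: collect each namespace identifier together with the subsystem NQN it is exposed under, then flag any identifier that shows up under more than one NQN. This is only an illustration in Python; the function name, the record layout, and the example NQNs and volume names are made up, and how you collect the pairs on your hosts is left out.

```python
from collections import defaultdict


def find_nqn_conflicts(records):
    """Find namespace identifiers exposed under more than one subsystem NQN.

    records: iterable of (subsystem_nqn, namespace_identifier) pairs.
    Returns a dict mapping each conflicting identifier to its sorted NQNs.
    """
    seen = defaultdict(set)
    for nqn, ns_id in records:
        seen[ns_id].add(nqn)
    return {ns_id: sorted(nqns) for ns_id, nqns in seen.items() if len(nqns) > 1}


if __name__ == "__main__":
    # Hypothetical example: array A is the master, array B holds a local volume.
    records = [
        ("nqn.example:array-a", "vol-synced"),   # synced volume, seen via A
        ("nqn.example:array-a", "vol-synced"),   # same volume, second path
        ("nqn.example:array-a", "vol-b-local"),  # B's volume exposed by A
        ("nqn.example:array-b", "vol-b-local"),  # B's volume exposed by B
    ]
    print(find_nqn_conflicts(records))
```

In this made-up example only "vol-b-local" is reported, because it is the one identifier reachable under two different NQNs.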

Thank you for answering my question.
 