Proxmox 9.1.1 FC Storage via LVM

Hi @ertanerbek,


No problem — here is our full setup in detail.


We are running a three-node Proxmox cluster, and each node has two dedicated network interfaces for iSCSI.
These two NICs are configured as separate iSCSI interfaces using:

iscsiadm -m iface

Both interfaces are set with MTU 9000.
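
For reference, a rough sketch of how such per-NIC iSCSI interfaces can be created (the interface and NIC names below are placeholders, not our actual ones):

iscsiadm -m iface -I iface0 --op=new
iscsiadm -m iface -I iface0 --op=update -n iface.net_ifacename -v ens1f0
iscsiadm -m iface -I iface1 --op=new
iscsiadm -m iface -I iface1 --op=update -n iface.net_ifacename -v ens1f1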

iSCSI Discovery & Storage Topology

For each interface, we performed a full discovery to every logical port exposed by our Huawei Dorado arrays.
We are in an HA environment: two Dorado systems located in two datacenters, operating Active/Active via HyperMetro.
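
As a rough illustration (the portal addresses below are placeholders), the per-interface discovery and login look like this:

iscsiadm -m discovery -t sendtargets -p 192.168.10.10:3260 -I iface0
iscsiadm -m discovery -t sendtargets -p 192.168.20.10:3260 -I iface1
iscsiadm -m node --login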


Multipath configuration


Multipath is configured with:

  • path_grouping_policy = multibus
  • path_selector = service-time 0
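
A rough sketch of the relevant device section in multipath.conf (the product string and the extra values are illustrative):

devices {
    device {
        vendor                "HUAWEI"
        product               "XSG1"
        path_grouping_policy  multibus
        path_selector         "service-time 0"
        path_checker          tur
        failback              immediate
    }
}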

All configuration files are attached to this post.

Kernel & Queue-Depth optimizations


We applied a few kernel tunings and adjusted queue_depth, mainly to optimize latency and improve the behaviour under high I/O load.

ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*", ATTR{device/vendor}=="HUAWEI", ATTR{device/queue_depth}="512"
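
That rule goes into a udev rules file; a sketch of how it can be applied and verified (the file name and sdX are placeholders):

# e.g. /etc/udev/rules.d/99-huawei-queue-depth.rules contains the rule above
udevadm control --reload-rules
udevadm trigger --subsystem-match=block
cat /sys/block/sdX/device/queue_depth    # should now report 512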

Once everything was configured, each node’s iSCSI initiator was added on both Dorado arrays.
From there, the PVs and LVs became visible in Proxmox.
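
In case anyone is building this from scratch: the shared VG is created once, from a single node, on top of the multipath device. A minimal sketch (device and VG names are placeholders):

pvcreate /dev/mapper/mpatha
vgcreate vg_dorado01 /dev/mapper/mpatha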

Proxmox configuration

In Proxmox, each LV was simply added under Datacenter → Storage with:

  • Shared = Yes
  • snapshot-as-volume-chain = enabled
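
For reference, a sketch of what the corresponding /etc/pve/storage.cfg entry can look like for an LVM storage (storage and VG names are placeholders):

lvm: dorado-lvm
        vgname vg_dorado01
        content images
        shared 1
        snapshot-as-volume-chain 1
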
With this setup, we end up with 16 paths per LUN:
2 NICs per node × 4 logical ports per Dorado array × 2 arrays.


Performance & Testing


We performed multiple tests (performance, latency, failover) using VDBench, and the results are quite good:
  • around 150–200K IOPS at 8K or 16K block sizes
  • stable latency
  • no locking or migration issues so far
Today, we are running around 50 VDI virtual machines on this infrastructure for testing, and everything is behaving correctly.
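
For anyone who wants a quick, non-destructive sanity check without VDBench, a roughly comparable read-only fio run against the multipath device could look like this (device path is a placeholder):

fio --name=randread --filename=/dev/mapper/mpatha --direct=1 --rw=randread \
    --bs=8k --iodepth=32 --numjobs=4 --runtime=60 --time_based --group_reporting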

Additional HA component

The only custom addition is a storage HA script we developed.
It monitors the number of available paths for each LUN and, if this count ever drops to 0, the script automatically fences the node to avoid corruption.
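
Not the actual script (it is in the attachments), but a minimal sketch of the idea; the fencing action here is a placeholder and the real one may differ:

#!/bin/bash
# Sketch: fence this node if any multipath map has lost all of its paths.
for map in $(multipath -ll -v1); do
    active=$(multipath -ll "$map" | grep -c 'active ready running')
    if [ "$active" -eq 0 ]; then
        logger -t storage-ha "no active paths left for $map, fencing node"
        echo b > /proc/sysrq-trigger   # immediate reboot as a crude fence
    fi
done

Run from cron or a systemd timer every few seconds, something like this takes the node out before it can keep issuing writes without any path to the storage.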

Just to clarify — I’m not using OCFS2 at all in my setup.

From what you’re describing, I really think this is the root of your issue.
OCFS2 has its own distributed lock manager, and Proxmox also applies its own locking layer on top of the storage.

So when both mechanisms coexist (OCFS2 locks + Proxmox locks), they tend to conflict, slow down operations, and trigger timeouts during clone/move/delete operations. Proxmox is not designed to work with cluster file systems like OCFS2 or GFS2, and this often leads to exactly the kind of behaviour you’re seeing.


In my case, since I’m only using LVM on top of iSCSI + multipath, there is no additional filesystem-level locking, which is why I don’t encounter the same issues.

If you remove OCFS2 from the equation and rely on standard LVM instead, you will likely see much more stable behaviour on the Proxmox side.

As mentioned, all the scripts and configuration files are attached for reference.

Feel free to ask if you want extra details — happy to help compare setups.
 

Hello @tiboo86

Thank you very much for your feedback and for sharing your setup in such detail. I hope I and many others will benefit from it. However, yours is an IP SAN system, while the issue I am experiencing is on the FC SAN side.

Let’s clarify the OCFS2 topic. Some of our LUNs are used with LVM, while others are used with OCFS2. My goal is to switch to whichever option is fully supported by Proxmox. I don’t actually have any problems with OCFS2 itself, but in terms of raw performance the Proxmox and OCFS2 locking mechanisms constantly conflict. For this reason, I would prefer to use LVM if possible. I am currently in the testing phase with LVM, but I am facing issues in the FC SAN environment. Therefore, while I use some LUNs with LVM, I connect others as directories with OCFS2.
 
This raises the question: Should Proxmox’s LVM support be configured with CLVM? It feels as though standard LVM is not functioning correctly in this scenario. Regardless of whether I use RAW or QCOW, disk deletion and migration operations consistently cause problems.
The lock is managed by Proxmox code directly (through pmxcfs/corosync). You can't delete two LVM volumes at the same time.
 

Yes, you are right. For this reason, perhaps OCFS2 or GFS2 (which is already integrated with Corosync) would be better options to support in the future than LVM.
 
Hi @ertanerbek, we’re mostly on the same page: Fibre Channel is far from dead in large enterprise environments. That said, investing in legacy entry SANs (for example an HPE MSA or older Dell ME models), or even trying to repurpose them, purely to run FC may not make sense for everyone today. For use cases that require the highest scale and predictable performance, FC remains a valid option.

At the same time, modern IP-based SANs (iSCSI and NVMe/TCP) over 100/200/400 Gbit links can compete very effectively with FC for many workloads.

Regarding clustered filesystems: requests for native clustered-FS support in PVE come up frequently on this forum (monthly, if not more often). I don’t have visibility into Proxmox GmbH’s internal roadmap, but my educated guess is that a natively supported clustered filesystem is unlikely to appear.

Back to your original problem: at a high level, LVM operates on kernel block devices and does not distinguish whether those blocks are provided by FC or iSCSI. Adding more context may help others to spot something interesting. What you posted is the task output at failure, but we don’t know what else was running on the system, nor the timings and resource usage of parallel tasks.

Here is a brief starting list:
  • VM configs and disk sizes
  • timings for the problematic tasks when there is no resource contention
  • timings for the tasks when lock contention occurs
  • relevant journalctl/dmesg excerpts around the failure
  • examples of successful task outputs for comparison
The best way forward is a reproducible, step-by-step scenario that demonstrates the issue. As others have mentioned, some lock contention is expected in this configuration; the goal is to identify whether what you’re seeing is within normal parameters or indicative of something particular in your environment.
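
For example, something along these lines (VM ID and time window are placeholders):

qm config 101
journalctl --since "2025-11-28 10:00" --until "2025-11-28 10:30" > journal-excerpt.txt
dmesg -T | tail -n 200
pvesh get /cluster/tasks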


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
The exact cause of this problem is the discard operation.

Scenario 1: The guest resides on any source disk structure. When I try to clone this guest onto LVM in qcow2 format, regardless of whether the SSD and DISCARD options are enabled on the disk, qcow2 performs discard operations during the cloning process. This overloads the disk to the point of being unusable. When I check with iostat, I see heavy discard activity and the UTIL level is ~99%.

Scenario 2: The guest resides on any disk structure. When I try to clone this guest to an LVM disk as a RAW disk, whether the SSD and DISCARD features are enabled or not, the cloning completes successfully at the disk speeds I specified without any issues. During this process, the disk’s UTIL level remains between ~10% and ~30%, depending on the amount of data being transferred.

Scenario 3: The guest has a RAW disk and runs on LVM. The disk has SSD and DISCARD enabled. When I run fstrim inside the guest, the LVM disk becomes overloaded and unable to respond. The UTIL level rises to ~99%.

Scenario 4: The guest has a RAW disk and runs on LVM. The disk has SSD enabled and DISCARD disabled. When I run fstrim inside the guest, the guest immediately reports that fstrim has finished (I do not see any discard activity on the LVM disk, so this is a fake discard on the guest side).

In all these scenarios, the LVM system has the issue_discards option set to 0, meaning LVM does not accept discards. Even though the guest or qcow2 tries to perform discards during migration, the storage shows very low I/O or bandwidth. LVM does not forward discards to the physical disks, and it should not.
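
For completeness, a few commands that help show where discards are (or are not) being advertised and issued (device names are placeholders):

lsblk -D /dev/mapper/mpatha        # DISC-GRAN / DISC-MAX show whether a layer advertises discard
grep issue_discards /etc/lvm/lvm.conf
iostat -dxm 1 /dev/mapper/mpatha   # d/s and dMB/s columns show live discard load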
 