Proxmox 9.1.1 FC Storage via LVM

ertanerbek

Hello,

Has anyone successfully implemented a professional Proxmox setup with Fibre Channel (FC) SAN storage? I am not referring to IPSAN, but specifically FC-based SAN.

In a clustered environment, I am experiencing significant issues, particularly during cloning and disk-wipe operations. The lock mechanisms appear problematic, and Proxmox seems unable to handle them reliably. In my test environment, when I attempt to delete or move disks simultaneously from different nodes, the system begins to encounter errors.

My current setup is as follows:

  • Proxmox 9.1.1
  • QCOW2 disk format
  • Huawei 5000v3 SAN Storage → HBA → Linux Multipath → LVM → Proxmox (2 node cluster with qdevice)
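For context, the way this stack is wired up on each node looks roughly like the following. The mpatha device name is a placeholder; the VG/storage name matches the one in the logs below, and this is only a sketch of the layout, not my exact configuration:

# 1. The multipath device that represents the FC LUN (name is illustrative)
multipath -ll mpatha

# 2. LVM physical volume and volume group, created once on a single node
pvcreate /dev/mapper/mpatha
vgcreate STR-5TB-HUAWEI-NVME-045 /dev/mapper/mpatha

# 3. Shared (thick) LVM storage definition in /etc/pve/storage.cfg
lvm: STR-5TB-HUAWEI-NVME-045
        vgname STR-5TB-HUAWEI-NVME-045
        content images
        shared 1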
This raises the question: Should Proxmox’s LVM support be configured with CLVM? It feels as though standard LVM is not functioning correctly in this scenario. Regardless of whether I use RAW or QCOW, disk deletion and migration operations consistently cause problems.

If anyone has managed to run this configuration stably, could you share documentation or insights on how you achieved it? The storage lock issues are proving to be a major challenge.

Nov 26 11:59:38 PVE1 pvedaemon[94779]: lvremove 'STR-5TB-HUAWEI-NVME-045/vm-103-disk-0' error: 'storage-STR-5TB-HUAWEI-NVME-045'-locked command timed out - aborting
Nov 26 11:59:38 PVE1 pvedaemon[72268]: <root@pam> end task UPID:PVE1:0001723B:000F6884:6926C13E:imgdel:103@STR-5TB-HUAWEI-NVME-045:root@pam: lvremove 'STR-5TB-HUAWEI-NVME-045/vm-103-disk-0' error: 'storage-STR-5TB-HUAWEI-NVME-045'-locked command timed out - aborting
 
This setup should be stable; here's the documentation for multipath:
https://pve.proxmox.com/wiki/Multipath

Related information can be found here as well:
https://pve.proxmox.com/wiki/Migrate_to_Proxmox_VE#Storage_boxes_(SAN/NAS)


The error you get is due to a hard 60-second timeout on operations that involve volume allocation on shared storage. You need to make sure your storage is fast enough:
https://forum.proxmox.com/threads/u...-command-timed-out-aborting.98274/post-424883
https://forum.proxmox.com/threads/e...mage-got-lock-timeout-aborting-command.65786/
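A quick sanity check is to time the raw LVM metadata operations on one of the nodes. If these already take tens of seconds, the 60s budget will not be enough once PVE adds its own locking on top. The throwaway LV name below is arbitrary, and this should only be run while no PVE storage tasks are active:

# Time a full metadata scan of the shared VG
time vgs STR-5TB-HUAWEI-NVME-045
time lvs STR-5TB-HUAWEI-NVME-045

# Time an allocation/removal cycle of a small throwaway LV
time lvcreate -L 1G -n timing-test STR-5TB-HUAWEI-NVME-045
time lvremove -f STR-5TB-HUAWEI-NVME-045/timing-test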
 
This raises the question: Should Proxmox’s LVM support be configured with CLVM?
CLVM, championed by Red Hat at one point, seems to have fallen out of favor, so taking on support for it might be quite a tall task:
https://salsa.debian.org/lvm-team/l...vmoved LVs.-,Remove clvmd,-Remove lvmlib (api
https://askubuntu.com/questions/1241259/clvm-package-in-repo
https://www.sourceware.org/cluster/clvm/

@bkry is correct: simultaneous operations on shared storage with metadata consistency requirements must be serialized. It is quite easy to overrun the timeout on operations such as "wipe".
https://github.com/proxmox/pve-cluster/blob/master/src/PVE/Cluster.pm#L642
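On a single node you can picture what that cluster-wide lock does with a plain flock analogy. This is only an illustration of the serialization and the 60s timeout, not how PVE actually implements it (PVE's lock is cluster-wide via pmxcfs, flock here is node-local, and the lock file path is made up):

# shell 1: holds the per-storage lock for 90 seconds
flock /run/lock/storage-demo.lock sleep 90 &

# shell 2: tries to take the same lock, gives up after 60 seconds,
# much like the 'locked command timed out - aborting' message above
flock -w 60 /run/lock/storage-demo.lock echo "got the lock" \
    || echo "storage lock timed out - aborting"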


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hello Friends,


First of all, thank you for your answers. However, there is a point we overlooked in the LVM LUN part: this is not a file system, but block storage. Also, LVM thin is not being used in this section. Essentially, the host’s RAM does not hold metadata here (since I use direct sync in the cache part). I also tested with RAW disks instead of qcow.
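For clarity, by "direct sync in the cache part" I mean the per-disk cache option on the VM, set along these lines. The VM ID and disk name below are just taken from this test cluster as an example:

# Set the disk cache mode to directsync on VM 103's first SCSI disk
qm set 103 --scsi0 STR-5TB-HUAWEI-NVME-045:vm-103-disk-0,cache=directsync

# Verify it in the VM configuration
qm config 103 | grep scsi0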


At this point, it seems that Proxmox is not handling this disk as it should. When a LUN is created, isn’t its address range defined on both sides? After all, there is no issue with which block the created LUN maps to, because it creates a tick. It’s just that while it is active on one host, it is inactive on another.


You can also feel this when writing data on the guests. Under heavy and intensive usage, I have serious doubts that unpleasant results may occur. I will try to understand the situation more clearly with different tests soon. Of course, I first need to create a proper test procedure.


My SAN storage device is quite fast. Even though my HBAs are 8 Gbit, they run in dual mode. This means I can reach about 2 GB/s of bandwidth per LUN and achieve 50,000 random read IOPS and 50,000 random write IOPS with a 50% read/write mix.
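For reference, that kind of 50% read/write mix can be reproduced at the OS level with fio along these lines. I'm not claiming these are my exact parameters; the device path is a placeholder, and the run destroys whatever data is on that device:

# WARNING: writes directly to the block device and destroys any data on it
fio --name=randrw-mix --filename=/dev/mapper/mpatha \
    --rw=randrw --rwmixread=50 --bs=4k --direct=1 \
    --ioengine=libaio --iodepth=32 --numjobs=4 --group_reporting \
    --runtime=60 --time_based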


I don’t think the issue lies on the LUN side, because when I access the LUNs at the operating system level (Debian), there is no problem. The issue only occurs when operating through Proxmox guests.
 
However, there is a point we overlooked in the LVM LUN part: this is not a file system, but block storage.
I think most people on this forum are aware that LVM is block storage.
Also, LVM thin is not being used in this section.
You'd be surprised by some of the wild experiments that have been attempted/reported here before. There should always be a healthy dose of skepticism about taking posts at face value.
Essentially, the host’s RAM does not hold metadata here (since I use direct sync in the cache part).
This depends on the overall system state. You may be interested in this article we posted recently: https://kb.blockbridge.com/technote/proxmox-qemu-cache-none-qcow2/

Additionally, when an LV is created within the VG on host1, host2 does not immediately become aware of it. In PVE's case a lock is taken to ensure that two hosts do not attempt to create LVs at the same time. When the create is finished, the other hosts rescan the VG structure to learn about the metadata changes.
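You can see that sequence from the shell on both nodes. The VG name is as in the logs above, and the output lines are abbreviated and only illustrative of the active/inactive state:

# host1 - where PVE created and activated the new disk
lvs -o lv_name,lv_active STR-5TB-HUAWEI-NVME-045
#   vm-103-disk-0   active

# host2 - the LV shows up once the VG metadata is re-read, but stays inactive
lvs -o lv_name,lv_active STR-5TB-HUAWEI-NVME-045
#   vm-103-disk-0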
At this point, it seems that Proxmox is not handling this disk as it should. When a LUN is created,
A LUN is a SCSI concept. In your case there are no LUNs being created, only LVM LVs. There are other storage types where a LUN is created for each virtual disk; those storage systems typically do NOT use LVM.
After all, there is no issue with which block the created LUN maps to, because it creates a tick.
We successfully caused data corruption by timing LVM LV creation on two hosts by bypassing PVE cluster lock for an experiment.
LVM was never meant to be used with shared storage. CLVM was an addon/afterthought. PVE takes on the functionality of CLVM by using its own cluster-wide locks during dangerous operations.
The issue only occurs when operating through Proxmox guests.
Perhaps your storage/client is not optimized. You may find this article interesting:
https://kb.blockbridge.com/technote/proxmox-tuning-low-latency-storage


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
It’s an interesting way of answering, maybe it’s just because we speak different languages.

Anyway, at the end of the day the situation doesn't really change. When I work directly with LVM and LVs via direct storage or OCFS2, I don't encounter the same problems; there's an ownership aspect here, and these are not thin volumes.

If the same issues occurred when working directly with LVM, I could easily say that the problem lies in the LVM layer or that there’s a structural corruption. But unfortunately, there’s no issue at that layer.

The only thing that can really be said here is that, as other folks on the forum have noticed, when using Proxmox it's best not to use any cluster-aware storage other than Ceph without running extensive tests first.