Failed to find logical volume vm-13649-disk-0 after reverting snapshot (StarLVM plugin, FC storage)

Gratified4693

New Member
Jun 15, 2026
Hello,

We've recently migrated from ESXi to Proxmox. We are using the same storage we used in the ESXi environment: an FC SAN, attached the usual way for such a setup.
We have two disks, 15 TB each. We run Proxmox as a cluster with multiple nodes; 4 nodes are active now.

One is LVM - pure, no volume-chains, no other tricks, not using snapshots on it. No issues.
Another one is starlvm - for dev VMs where we need to use snapshot functionality. This is the problematic one.


For development purposes we rely on snapshots to roll back VM changes quickly and reliably.
But sometimes, seemingly at random, a VM loses its disk during the snapshot revert procedure:

Code:
unsupported storage of vg 'star_vg_fc_san_3738'
 activating vm-13472-disk-0...
 deactivating vm-13472-disk-0...
Use of uninitialized value in string ne at /usr/share/perl5/PVE/Storage/Custom/StarLvmPlugin.pm line 481.
TASK ERROR: no such logical volume star_vg_fc_san_3738/vm-13472-disk-0

To keep things consistent, the snapshot revert procedure is as follows:
  • shut down the VM (gracefully)
  • get the list of snapshots
  • revert to the latest snapshot (it has no parent)
  • start the VM
I run this routine sequentially for each VM group (6 VMs in a group), but there are multiple VM groups in the cluster (alpha, beta, each with 6 VMs in it), so the routine can sometimes run for VMs of different groups in parallel. I add some randomization and time shifts so that snapshot and power-related tasks are spread out as much as possible. I'm not sure whether this is the cause, but it may be worth mentioning.
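The four steps above can be sketched as follows. This is a minimal illustration, not the poster's actual automation: it assumes the routine runs on the node that owns the VM and drives the `qm` CLI (`qm shutdown`, `qm listsnapshot`, `qm rollback`, `qm start`). The names `revert_vm`, `run_cli` and the `run` parameter are hypothetical.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], str]


def run_cli(argv: Sequence[str]) -> str:
    """Run a command, return stdout, raise CalledProcessError on failure."""
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout


def revert_vm(vmid: int, snapshot: str, run: Runner = run_cli) -> List[List[str]]:
    """Shut down, verify the snapshot exists, roll back, start.

    Returns the commands that were executed, in order.
    """
    executed: List[List[str]] = []

    def step(*argv: str) -> str:
        executed.append(list(argv))
        return run(argv)

    vm = str(vmid)
    step("qm", "shutdown", vm)                 # graceful shutdown
    listing = step("qm", "listsnapshot", vm)   # get list of snapshots
    # Assumption: the snapshot name appears as its own token in the listing.
    if snapshot not in listing.split():
        raise RuntimeError(f"snapshot {snapshot!r} not found for VM {vmid}")
    step("qm", "rollback", vm, snapshot)       # revert
    step("qm", "start", vm)                    # start VM
    return executed
```

Because the runner is injectable, the sequence can be exercised without touching a real node. A failing step raises before the next one runs, so a failed rollback never leads to a VM start.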

Here is the config of the problematic VM (all VMs are built the same way; only the name, tags and MAC differ per VM version and group):

Code:
#Clone made from SOME VM
agent: enabled=1,fstrim_cloned_disks=1
boot: order=scsi0;net0
cores: 2
cpu: host
machine: q35
memory: 4096
meta: creation-qemu=11.0.0,ctime=1780414247
name: VM-NAME-delta
net0: virtio=some_MAC,bridge=somebridge,firewall=1,queues=1
onboot: 0
ostype: l26
parent: phase_2
scsi0: swsan3738lv:vm-13472-disk-0,aio=native,cache=none,detect_zeroes=1,discard=on,iothread=1,queues=2,size=70G
scsihw: virtio-scsi-single
smbios1: uuid=UUID
tags: SOME-TAGS
vmgenid: 896cef72-1765-41dd-8b93-89f4ea668e09
[phase_2]
#Cloned VM Reverting this snapshot when needed
agent: enabled=1,fstrim_cloned_disks=1
boot: order=scsi0;net0
cores: 2
cpu: host
machine: q35
memory: 4096
meta: creation-qemu=11.0.0,ctime=1780414247
name: VM-NAME-delta
net0: virtio=some_MAC,bridge=somebridge,firewall=1,queues=1
onboot: 0
ostype: l26
scsi0: swsan3738lv:vm-13472-disk-0,aio=native,cache=none,detect_zeroes=1,discard=on,iothread=1,queues=2,size=70G
scsihw: virtio-scsi-single
smbios1: uuid=UUID
snaptime: 1780436041
tags: SOME-TAGS
vmgenid: 1f6ec138-c221-497e-ae8f-23d69c5c92d8

I'm comparing two VMs: 13472, which is already affected by this issue, and 13649, which is OK.
These are the checks I ran after the failure to get the full picture:

Code:
#
# Affected VM  13472
#
pvesm status
unsupported storage of vg 'star_vg_fc_san_3738'
Name                  Type     Status     Total (KiB)      Used (KiB) Available (KiB)        %
local                  dir   disabled               0               0               0      N/A
local-lvm          lvmthin   disabled               0               0               0      N/A
san3739lv              lvm     active     16106123264     10011447296      6094675968   62.16%
swsan3738lv        starlvm     active     16106123264      4272119808     11834003456   26.52%
#
pvesm list swsan3738lv --vmid 13472
unsupported storage of vg 'star_vg_fc_san_3738'
Volid Format  Type      Size VMID
#
sudo /usr/sbin/lvscan | grep /star_vg_fc_san_3738  | grep 13472
  inactive          '/dev/star_vg_fc_san_3738/lvmth-13472' [70.00 GiB] inherit
  inactive          '/dev/star_vg_fc_san_3738/snap_vm-13472-disk-0_phase_2' [70.00 GiB] inherit
#
vgs; lvs -a -o vg_name,lv_name,lv_attr,lv_size,pool_lv,data_percent,metadata_percent | grep 13472
  VG                  #PV #LV #SN Attr   VSize   VFree
  pve                   1   3   0 wz--n- 277.87g  16.00g
  star_vg_fc_san_3738   1 167   0 wz--n- <15.00t <10.96t
  vg_fc_san_3739        1  55   0 wz--n- <15.00t  <5.68t
  star_vg_fc_san_3738 lvmth-13472                    twi---tz-k   70.00g
  star_vg_fc_san_3738 [lvmth-13472_tdata]            Twi-------   70.00g
  star_vg_fc_san_3738 [lvmth-13472_tmeta]            ewi-------   72.00m
  star_vg_fc_san_3738 snap_vm-13472-disk-0_phase_2   Vri---tz-k   70.00g lvmth-13472

#
ls -lah /dev/star_vg_fc_san_3738/ | grep 13472
-- none ---

#
ls -l /dev/mapper/ | grep 13472
-- none ---

#
vgchange -a y star_vg_fc_san_3738
  18 logical volume(s) in volume group "star_vg_fc_san_3738" now active

#
lvs | grep 13472
  lvmth-13472                    star_vg_fc_san_3738 twi---tz-k   70.00g
  snap_vm-13472-disk-0_phase_2   star_vg_fc_san_3738 Vri---tz-k   70.00g lvmth-13472

# It does not help me to check the disk itself, yet
mount /dev/star_vg_fc_san_3738/lvmth-13472 /mnt/pve/
mount /dev/star_vg_fc_san_3738/snap_vm-13472-disk-0_phase_2 /mnt/pve/

#
# Example of healthy VM 13649
#
pvesm list swsan3738lv --vmid 13649
unsupported storage of vg 'star_vg_fc_san_3738'
Volid                       Format  Type             Size VMID
swsan3738lv:vm-13649-disk-0 raw     images    69793218560 13649

#
sudo /usr/sbin/lvscan | grep /star_vg_fc_san_3738  | grep 13649
  ACTIVE            '/dev/star_vg_fc_san_3738/lvmth-13649' [65.00 GiB] inherit
  inactive          '/dev/star_vg_fc_san_3738/snap_vm-13649-disk-0_phase_2' [65.00 GiB] inherit
  ACTIVE            '/dev/star_vg_fc_san_3738/vm-13649-disk-0' [65.00 GiB] inherit

#
vgs; lvs -a -o vg_name,lv_name,lv_attr,lv_size,pool_lv,data_percent,metadata_percent | grep 13649
  VG                  #PV #LV #SN Attr   VSize   VFree
  pve                   1   3   0 wz--n- 277.87g  16.00g
  star_vg_fc_san_3738   1 167   0 wz--n- <15.00t <10.96t
  vg_fc_san_3739        1  55   0 wz--n- <15.00t  <5.68t
  star_vg_fc_san_3738 lvmth-13649                    twi-aotz-k   65.00g             30.48  20.12
  star_vg_fc_san_3738 [lvmth-13649_tdata]            Twi-ao----   65.00g
  star_vg_fc_san_3738 [lvmth-13649_tmeta]            ewi-ao----   68.00m
  star_vg_fc_san_3738 snap_vm-13649-disk-0_phase_2   Vri---tz-k   65.00g lvmth-13649
  star_vg_fc_san_3738 vm-13649-disk-0                Vwi-aotz-k   65.00g lvmth-13649 28.84

#
ls -l /dev/mapper/ | grep 13649
lrwxrwxrwx 1 root root       8 Jun 12 08:34 star_vg_fc_san_3738-lvmth--13649 -> ../dm-45


The interesting part is in the working VM:

Code:
star_vg_fc_san_3738 vm-13649-disk-0                Vwi-aotz-k   65.00g lvmth-13649 28.84

So, if I understand correctly, taking the unaffected VM as an example:

1 | VM main disk    | as volume                         | lvmth-13649 twi-aotz-k 65.00g
2 | VM disk mounted | volume as disk                    | vm-13649-disk-0 Vwi-aotz-k 65.00g lvmth-13649
3 | VM snapshot     | separate volume, mounted as disk? | snap_vm-13649-disk-0_phase_2 Vri---tz-k 65.00g lvmth-13649

The affected VM somehow lost mount #2 during the snapshot revert procedure:

1 | VM main disk    | as volume                         | lvmth-13472 twi---tz-k 70.00g
2 | VM disk mounted | LOST?                             | no such logical volume star_vg_fc_san_3738/vm-13472-disk-0
3 | VM snapshot     | separate volume, mounted as disk? | snap_vm-13472-disk-0_phase_2 Vri---tz-k 70.00g lvmth-13472
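To make the attribute strings in the comparison above easier to read, here is a small decoder for the `lv_attr` column. It is only a sketch covering the positions relevant to this thread, following the field layout in the `lvs` man page (position 1 volume type, 2 permissions, 5 state, 6 device open, 10 skip activation). The function name `decode_lv_attr` is hypothetical.

```python
from typing import Dict, Union

# Only the volume types that occur in this thread; anything else maps to "other".
VOLUME_TYPE = {
    "t": "thin pool",
    "T": "thin pool data",
    "V": "thin volume",
    "e": "metadata (raid or pool)",
}


def decode_lv_attr(attr: str) -> Dict[str, Union[str, bool]]:
    """Decode selected positions of a 10-character lv_attr string."""
    attr = attr.strip()
    if len(attr) < 10:
        raise ValueError(f"expected at least 10 attribute characters, got {attr!r}")
    return {
        "type": VOLUME_TYPE.get(attr[0], "other"),
        "read_only": attr[1] == "r",
        "active": attr[4] == "a",
        "open": attr[5] == "o",
        "skip_activation": attr[9] == "k",
    }
```

Read this way, `twi-aotz-k` on the healthy VM is a thin pool that is active and open, while `twi---tz-k` on the affected VM is the same kind of pool but inactive. The trailing `k` is the "skip activation" flag. That may explain why the plain `vgchange -a y` above left the 13472 volumes inactive, since such LVs are normally activated only with `-K` (`--ignoreactivationskip`). Whether that applies here is a guess; the plugin may manage activation differently.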



Questions:
  • What am I doing wrong?
  • Is there a way to bring the lost volume back, since the snapshot and the original disk of the affected VM are still in place?



UPD: I'm still investigating and trying to find the cause. I suspected network issues, an improper setup, or concurrency, but I cannot find strong evidence or a relevant error in the host logs.

It happens randomly, and I cannot reproduce it even when running the snapshot revert routine in a loop 100 times in parallel with the normal routines. I tried running the routine both threaded and sequentially (one VM after another), and it still happens.

UPD2: Trying to revert the snapshot for an affected VM throws an error.
This is a new VM with a new ID, but the same issue:

Code:
unsupported storage of vg 'star_vg_fc_san_3738'
TASK ERROR: lvremove 'star_vg_fc_san_3738/vm-19582-disk-0' error:   Failed to find logical volume "star_vg_fc_san_3738/vm-19582-disk-0"

This is something I'm also trying to find a workaround for: how can I skip this error, or create a dummy volume to fool Proxmox into thinking it actually "deleted" the old disk-0 and replaced it with the snapshot version of the VM disk?



UPD3: Just for fun, I created a fake volume to trick Proxmox into thinking it can delete it and assign the snapshot disk back. It does not work:

Code:
qm unlock 19582
lvcreate -L 1M -n vm-19582-disk-0 star_vg_fc_san_3738
sudo /usr/sbin/lvscan | grep vm-19582-disk-0
  inactive          '/dev/star_vg_fc_san_3738/snap_vm-19582-disk-0_phase_2' [65.00 GiB] inherit
  ACTIVE            '/dev/star_vg_fc_san_3738/vm-19582-disk-0' [4.00 MiB] inherit

Snapshot revert task returned:

Code:
unsupported storage of vg 'star_vg_fc_san_3738'
  Consider pruning star_vg_fc_san_3738 VG archive with more than 992 MiB in 7774 files (see archiving settings in lvm.conf).
  Logical volume "vm-19582-disk-0" successfully removed.
  Thin pool star_vg_fc_san_3738-lvmth--19582-tpool (252:90) transaction_id is 10, while expected 9.
TASK ERROR: lvm rollback 'star_vg_fc_san_3738/snap_vm-19582-disk-0_phase_2' error:   Aborting. Failed to locally activate thin pool star_vg_fc_san_3738/lvmth-19582.

Trying to repair the thin pool:

Code:
lvconvert --repair star_vg_fc_san_3738/lvmth-19582
  Consider pruning star_vg_fc_san_3738 VG archive with more than 992 MiB in 7779 files (see archiving settings in lvm.conf).
  Consider pruning star_vg_fc_san_3738 VG archive with more than 992 MiB in 7780 files (see archiving settings in lvm.conf).
  WARNING: LV star_vg_fc_san_3738/lvmth-19582_meta0 holds a backup of the unrepaired metadata. Use lvremove when no longer required.




Thanks in advance
 
Great investigation @Gratified4693 and welcome to the forum.

Have you tried to reach out to the provider of this plugin for support?



Since I only started using Proxmox recently, I thought the problem might be with my understanding of how things should work;
I hadn't considered it a general issue with the plugin itself yet.

Maybe I must, if there is no other option.

I'm still hoping there is some small forgotten option or VM config workaround I can use to patch the problem now, and then proceed with a bug report later.
 
Thanks for sharing the detailed investigation. One thing that stood out to me is the mismatch between the expected thin pool transaction ID and the actual transaction ID reported during the rollback operation:

"transaction_id is 10, while expected 9"

That makes me wonder whether the issue is occurring during metadata updates rather than the snapshot itself. Since the affected VMs seem to lose the vm-XXXX-disk-0 logical volume reference while the thin pool and snapshot LV still exist, it could indicate that the revert operation is not completing atomically under certain conditions.

Have you checked whether the problem occurs only when multiple snapshot revert jobs are running across different nodes at the same time? Even if the VMs are different, concurrent operations against the same storage backend might be worth ruling out. It may also be useful to compare the thin pool metadata state between a healthy VM and a failed VM immediately after the issue occurs.

Given that the storage plugin is managing the LV lifecycle, I agree with the suggestion to involve the plugin vendor as well, especially since the underlying volumes appear to remain present while the expected mapping LV disappears.
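If the automation captures task output, the transaction ID mismatch mentioned above can be detected mechanically, so a damaged pool is flagged before another revert is attempted. A minimal sketch, assuming the message wording matches the task log quoted earlier in the thread; the function name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "... transaction_id is 10, while expected 9."
_TXN_MISMATCH = re.compile(r"transaction_id is (\d+), while expected (\d+)")


def parse_transaction_mismatch(line: str) -> Optional[Tuple[int, int]]:
    """Return (actual, expected) if the line reports a mismatch, else None."""
    match = _TXN_MISMATCH.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Applied to the line from the failed rollback task, this returns `(10, 9)`: the kernel-side pool metadata is one transaction ahead of what the LVM metadata expects.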
 

As a workaround, I added a function call that lists all VM-related volumes right after each snapshot operation.
It now helps me catch the specific moment when the disk disappears:

Code:
[2026-06-13 08:25:21,323: INFO/ForkPoolWorker-1] VM_NAME-delta Wait after snapshot operations...
    VM logical volumes: ['/dev/star_vg_fc_san_3738/lvmth-13649', '/dev/star_vg_fc_san_3738/snap_vm-13649-disk-0_phase_2', '/dev/star_vg_fc_san_3738/vm-13649-disk-0']

[2026-06-14 08:37:58,382: INFO/ForkPoolWorker-2] VM_NAME-delta Wait after snapshot operations...
    VM logical volumes: ['/dev/star_vg_fc_san_3738/lvmth-13649', '/dev/star_vg_fc_san_3738/snap_vm-13649-disk-0_phase_2']
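The check behind the log lines above could look roughly like this. It is a sketch only: it assumes the disk LVs follow the `vm-<vmid>-disk-<n>` naming seen in this thread, and the function name and `disk_count` parameter are hypothetical.

```python
from typing import Iterable, List


def missing_disk_lvs(vmid: int, lv_paths: Iterable[str], disk_count: int = 1) -> List[str]:
    """Return the expected vm-<vmid>-disk-<n> LV names absent from lv_paths."""
    present = {path.rsplit("/", 1)[-1] for path in lv_paths}
    expected = [f"vm-{vmid}-disk-{n}" for n in range(disk_count)]
    return [name for name in expected if name not in present]
```

With the two listings logged above, the first returns an empty list and the second returns `['vm-13649-disk-0']`, which gives the automation a point to stop before starting the VM.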

I'm still trying to catch the concurrency issue, but it is not easy. I once reverted a test VM snapshot in a loop for hours while all the usual routines were running, and still could not reproduce the issue.

Moreover, initially, after the migration from vCenter, all snapshot revert operations were threaded and ran concurrently. This setup was fine for a few days. Only after numerous attempts to understand the root cause did I switch to the sequential method, one VM after another within a single group (though two groups of VMs can still run concurrently).

NOTE: What I mean is that initially a heavy, concurrent snapshot revert routine (dozens of VMs reverting snapshots at the same time) did not trigger the issue. The issue only appeared after some time. I went back and forth between multi-threaded and sequential execution, and it still seems to happen unpredictably (network blips, SAN-specific errors, etc.?).

The fix I'll try next is to use a single "thread" for all snapshot revert tasks, for any VM from any group. I did not want to do this earlier, since it is obviously a huge bottleneck for all our operations.
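One way to get a single "thread" for all revert tasks without restructuring the workers is an exclusive file lock around the revert call. A minimal sketch, assuming all workers run on the same Linux host; `flock` on a local file does not coordinate across cluster nodes, so a multi-host setup would need a cluster-wide lock instead. The lock path and the name `revert_lock` are hypothetical.

```python
import fcntl
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def revert_lock(path: str = "/var/lock/snapshot-revert.lock") -> Iterator[None]:
    """Hold an exclusive lock for the duration of the with-block.

    Other processes (or other open handles) block in flock() until release.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

Usage would be `with revert_lock(): ...` around the shutdown/rollback/start sequence. Everything outside the block (polling, reporting) can stay parallel, which limits the bottleneck to the storage operations themselves.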
 