iSCSI trouble on PVE Cluster

Pigi_102

New Member
May 24, 2024
Hello all,
I'm fighting with a strange problem when using iSCSI as shared storage.
Let's say I create a LUN on the storage and present it to the cluster.
I create a storage ("Datacenter -> Storage") of type iSCSI, and it gets created correctly on all nodes.
After that I create a storage of type LVM using this LUN, and this too gets created on all nodes.
The problem starts when I want to expand the LUN (but also when I create a new LUN on the storage and present it to the cluster).
I expand the LUN and I can see that all nodes receive the SCSI change notification. On the node where I'm doing the configuration the iSCSI storage (and everything on top of it) keeps working correctly, but on all the other nodes the iSCSI storage goes to "Status Unknown" and from then on I cannot use it.
The only way to get it working again is to reboot every node except the one where everything still works; after the reboot the storage becomes available again.
I can't find any specific error message, and the problem seems to be tied more to PV/LV than to iSCSI, but I'm not really sure.
Neither systemctl restart iscsid.service nor pvesm scan iscsi changes the situation. Only the reboot does.
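For reference, the rescan attempts look roughly like this (the portal address below is just a placeholder for the TrueNAS IP); rescanning the open iSCSI sessions directly with iscsiadm is another variant of the same idea:
Code:
# restart the iSCSI daemon on the affected node
systemctl restart iscsid.service

# query the portal again through Proxmox (placeholder address)
pvesm scan iscsi 192.168.126.250

# rescan all open iSCSI sessions for size/LUN changes
iscsiadm -m session --rescan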

Moreover (though I'm not sure I'm doing this correctly), on the working node I need to run pvresize manually to make the new space available (this may well be the expected behaviour; I can't find much information on this task).
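Concretely, that manual step looks something like this (sdX stands in for whatever device the LUN shows up as):
Code:
# confirm the kernel already sees the new device size
lsblk /dev/sdX

# grow the PV to the new device size; the VG free space follows
pvresize /dev/sdX
vgs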

The storage, at the moment, is a TrueNAS box exporting iSCSI LUNs (it is also a target for ZFS-over-iSCSI), and Proxmox VE is 8.2.2 from the no-subscription repository.
I've seen quite a few threads about this specific problem, and almost all of them end with a reboot (which IMHO is not an acceptable solution).


Any idea on how to work around or fix this problem?
Thanks in advance.

Pigi_102
 
Hi @Pigi_102 ,
If you can clarify a few things and provide direct command-line output, it will be easier for others to understand the situation.

a) You said that you have an iSCSI LUN with LVM, yet towards the end you mention ZFS-over-iSCSI.
b) After creating the iSCSI storage and the LVM on top of it, are you able to create VM disks on all nodes? Can you migrate a VM across all nodes?
c) After LUN expansion, do lsblk and lsscsi agree across all nodes?
d) In the case of adding a new LUN, are you connecting it to the same target on the storage side? Do lsblk/lsscsi/pvesm scan iscsi agree?
e) The status is the responsibility of pvestatd. It's possible there is a bug or an edge condition. What do "pvesm status" and "pvesm list" show for that particular storage pool/target?
Keep in mind that the LVM bits may not be activated on other nodes unless they are actively in use; in theory they should be rescanned and updated automatically.
f) If you have reproducible steps, you can follow "journalctl" in a second terminal window on each node while you perform them. That way you can catch anything unusual. Of course, you can also review what happened in the past.

PVE will not expand the LVM automatically for you. You do need to properly resize PV and VG.

Providing the outputs of lsblk / lsscsi / iscsiadm -m node [-m session] / journalctl / dmesg, annotated with the overall state of the system, may be helpful.
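For example, something along these lines on each node, captured while you reproduce the issue (adjust the journalctl time window as needed):
Code:
lsblk
lsscsi                    # if missing: apt install lsscsi
iscsiadm -m node
iscsiadm -m session
pvs; vgs; lvs
pvesm status
journalctl -b --since "30 min ago"
dmesg -T | tail -n 100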

Good luck



Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
You said that you have an iSCSI LUN with LVM, yet towards the end you mention ZFS-over-iSCSI.
I have both, but my trouble is on the LUN/LVM side.

I've narrowed it down a little, and yes, I was probably wrong in my assumptions:
I've created a new LUN and then created a new LVM storage using the same iSCSI storage (the new LUN lives inside the same target).
When I ask to create the new LVM storage I get a dialog asking me for:
* ID: the name of the storage
* Base Storage: here I select the iSCSI storage
* Base Volume: a popup asks me on which node to execute the scan; I choose one of my three nodes and select the LUN
* Volume Group: here I choose a new name
* Content: left at the default
and I tick "Shared" just to be sure.

When I click Add I see the storage appear in the storage section of the nodes, but on the other two its status is unknown (regardless of whether I chose one of them for the scan in the previous step).
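For what it's worth, the CLI equivalent of that dialog should be roughly the following; the VG name and the base volume string are placeholders (the real base volume ID is whatever pvesm list shows for the new LUN):
Code:
# list the LUN volumes visible behind the existing iSCSI storage
pvesm list iscsi-lun

# create the shared LVM storage on top of the new LUN
pvesm add lvm iscsi-lvm-2 --vgname <new-vg-name> \
    --base iscsi-lun:<base-volume-id> \
    --shared 1 --content images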

lsblk is consistent on all three nodes.
lsscsi is not present on these machines (don't know why).

pvesm status differs between the nodes:
Code:
root@pve1:~# pvesm status
Name                  Type     Status           Total            Used       Available        %
LocalZfs           zfspool     active        70615040         4320136        66294904    6.12%
NFS_DG                 nfs     active        50290816        47989248         2301568   95.42%
ceph_fs_vm             rbd     active        63861716        25880500        37981216   40.53%
cephfs_1            cephfs     active        43196416         5218304        37978112   12.08%
iscsi-lun            iscsi     active               0               0               0    0.00%
iscsi-lvm              lvm     active        20963328         2097152        18866176   10.00%
iscsi-lvm-2            lvm     active         8380416               0         8380416    0.00%
local                  dir     active         9698612         5427252         3757108   55.96%
local-lvm          lvmthin     active         3534848               0         3534848    0.00%
zfs-over-iscsi         zfs     active       103522816        65732132        37790684   63.50%

root@pve2:~# pvesm status
Name                  Type     Status           Total            Used       Available        %
LocalZfs           zfspool     active        70615040        26943212        43671827   38.16%
NFS_DG                 nfs     active        50290816        47989248         2301568   95.42%
ceph_fs_vm             rbd     active        63861716        25880500        37981216   40.53%
cephfs_1            cephfs     active        43196416         5218304        37978112   12.08%
iscsi-lun            iscsi     active               0               0               0    0.00%
iscsi-lvm              lvm     active        20963328         2097152        18866176   10.00%
iscsi-lvm-2            lvm   inactive               0               0               0    0.00%
local                  dir     active         9698612         5152612         4031748   53.13%
local-lvm          lvmthin     active         3534848               0         3534848    0.00%
zfs-over-iscsi         zfs     active       103522816        65732132        37790684   63.50%

root@pve3:~# pvesm status
Name                  Type     Status           Total            Used       Available        %
LocalZfs           zfspool     active        70615040          307761        70307279    0.44%
NFS_DG                 nfs     active        50290816        47989248         2301568   95.42%
ceph_fs_vm             rbd     active        63861716        25880500        37981216   40.53%
cephfs_1            cephfs     active        43196416         5218304        37978112   12.08%
iscsi-lun            iscsi     active               0               0               0    0.00%
iscsi-lvm              lvm     active        20963328         2097152        18866176   10.00%
iscsi-lvm-2            lvm   inactive               0               0               0    0.00%
local                  dir     active         9698612         5087120         4097240   52.45%
local-lvm          lvmthin     active         3534848               0         3534848    0.00%
zfs-over-iscsi         zfs     active       103522816        65732132        37790684   63.50%
As you can see, the new LVM storage is inactive on two of the nodes.

pvs, lvs, and vgs differ accordingly.

journalctl -xe doesn't show anything really interesting.

If I reboot the two nodes where the new LVM storage shows as inactive, its status goes from inactive to active and everything starts working correctly.

This is reproducible.
 
When I click Add I see the storage appear in the storage section of the nodes, but on the other two its status is unknown (regardless of whether I chose one of them for the scan in the previous step).
Try "pvesm alloc [appropriate options]" on each of the other two nodes and see if that activates the storage.
Or migrate the VM.
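For example, something like this on each of the two affected nodes (the VMID and size are arbitrary; it is only meant to force activation of the storage):
Code:
# allocate a small test volume on the problematic storage
pvesm alloc iscsi-lvm-2 999 vm-999-disk-0 1G

# remove it again afterwards
pvesm free iscsi-lvm-2:vm-999-disk-0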


Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
I think I hit a bug in some way.
If I try to create a volume on the new (but unknown-status) storage, I get a "no such volume group" error, as shown in the attached image.
Also, if I migrate a VM with a disk on that storage I get:
Code:
2024-07-11 07:41:21 starting migration of VM 100 to node 'pve2' (192.168.126.202)
2024-07-11 07:41:21 starting VM 100 on remote node 'pve2'
2024-07-11 07:41:35 [pve2] can't activate LV '/dev/vg-iscsi-3/vm-100-disk-0':   Cannot process volume group vg-iscsi-3
2024-07-11 07:41:35 ERROR: online migrate failure - remote command failed with exit code 255
2024-07-11 07:41:35 aborting phase 2 - cleanup resources
2024-07-11 07:41:35 migrate_cancel
2024-07-11 07:41:36 ERROR: migration finished with problems (duration 00:00:16)
TASK ERROR: migration problems

If I reboot (no other changes), it works.

I will see if I can file a bug.
 

Attachments

  • Screenshot_20240711_070636.png (37.4 KB)
For those who may be interested: I've filed a bug and they have found a workaround.
If you connect to the PVE node where the storage is inactive and issue:
pvscan --cache
the storage becomes active and everything works as it should.
I've asked whether there will be a proper fix, but at least there is a workaround that doesn't require a reboot.
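For completeness, on a node showing the storage as inactive the whole sequence is just:
Code:
# rescan devices and refresh the LVM cache
pvscan --cache

# the previously missing VG should now be listed
vgs

# and the storage should go back from inactive/unknown to active
pvesm status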
 
