[SOLVED] Issue with iSCSI offline on one node in cluster

courtjesterau

New Member
Aug 31, 2023
Hi,

I have a 3-node cluster. All three nodes are connected to the same switch: two are in an LACP bond and one is on a single link. All three connect to an iSCSI NAS that provides storage for the VMs.

One day (after a hard lock-up) my 2nd node would no longer see the iSCSI storage, with the UI showing it as offline. I can ping the NAS's IP address from the node's shell, and the other two nodes in the cluster are working fine.

I have restarted the node several times, updated it, and removed and re-added the storage, but nothing has worked.

In the syslog I am getting a repeating message:

Apr 21 16:16:39 pve2 pvedaemon[1608]: command '/sbin/vgscan --ignorelockingfailure --mknodes' failed: exit code 5
Apr 21 16:16:39 pve2 pvedaemon[1608]: storage 'Qnas-iSCSI-Pool' is not online
Apr 21 16:16:42 pve2 pvedaemon[1607]: command '/sbin/vgscan --ignorelockingfailure --mknodes' failed: exit code 5
Apr 21 16:16:42 pve2 pvedaemon[1607]: storage 'Qnas-iSCSI-Pool' is not online
Apr 21 16:16:44 pve2 pvedaemon[1607]: command '/sbin/vgscan --ignorelockingfailure --mknodes' failed: exit code 5
Apr 21 16:16:44 pve2 pvedaemon[1607]: storage 'Qnas-iSCSI-Pool' is not online
Apr 21 16:16:46 pve2 pvestatd[1577]: command '/sbin/vgscan --ignorelockingfailure --mknodes' failed: exit code 5
Apr 21 16:16:46 pve2 pvestatd[1577]: storage 'Qnas-iSCSI-Pool' is not online


Any ideas on what to try next to get this up and running?
 

Attachment: proxmoxissue.png
Did you review your system log starting from boot point? "journalctl"
What does "pvesm status" say on a bad node and good node?
What do "iscsiadm -m node" and "iscsiadm -m session" say on good and bad node?
What happens when you do "pvesm scan iscsi <portal>"
Are there any errors on storage side? Are you sure the initiator is in the allowed group to access the target?
Is the MTU correct? Are you using Jumbo? If you do, can you ping with non-fragmented high size?
Can you do manual "iscsiadm discovery" against the target?
What is the content of your "/etc/pve/storage.cfg"?
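
For reference, these checks can be collected into one pass on each node (the commands are the ones named above; the portal IP is the one that appears later in this thread, so adjust to your environment):

journalctl -b | grep -iE 'iscsi|pvestatd'    # system log since boot, iSCSI-related lines
pvesm status                                 # storage status as PVE sees it
iscsiadm -m node                             # configured iSCSI node records
iscsiadm -m session                          # active iSCSI sessions
pvesm scan iscsi 172.16.103.5                # what PVE can discover on the portal
cat /etc/pve/storage.cfg                     # storage definitions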


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
I am new-ish to Proxmox, so I'm not sure about all of these questions, but I will try to provide the answers as best I can.


Are there any errors on storage side? Are you sure the initiator is in the allowed group to access the target?


There are no errors on the storage side. This setup ran for well over a year with three nodes exactly as it is now, and two of the nodes are still using the storage successfully.

The MTU is correct. I double-checked: it is standard 1500 everywhere, the MTU settings are the same on all three nodes, and I am not using jumbo frames. This was changed a while back, maybe 7 months ago (and it still worked fine for a period after the change).
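
A quick way to sanity-check the MTU end-to-end is a non-fragmenting ping from each node to the storage portal (172.16.103.5 in this setup); with a standard 1500-byte MTU the largest ICMP payload that fits is 1472 bytes:

ping -M do -s 1472 -c 3 172.16.103.5    # 1472 = 1500 - 20 (IP header) - 8 (ICMP header)
# with jumbo frames (MTU 9000) the equivalent payload would be 8972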


Good node details:


root@pve3:~# pvesm status

Name Type Status Total Used Available %
BKP02 pbs active 15475078144 1924850944 13550227200 12.44%
Qnas-iSCSI-Lun-0 lvm active 2147479552 1709244416 438235136 79.59%
Qnas-iSCSI-Pool iscsi active 0 0 0 0.00%
local dir active 98497780 11540784 81907448 11.72%
local-lvm lvmthin active 855855104 0 855855104 0.00%

root@pve3:~# iscsiadm -m session
tcp: [1] 172.16.103.5:3260,1 iqn.2004-04.com.qnap:ts-431xeu:iscsi.qnas.322a03 (non-flash)

root@pve3:~# iscsiadm -m node
172.16.103.5:3260,1 iqn.2004-04.com.qnap:ts-431xeu:iscsi.qnas.322a03

root@pve3:~# pvesm scan iscsi 172.16.103.5
iqn.2004-04.com.qnap:ts-431xeu:iscsi.qnas.322a03 172.16.103.5:3260

root@pve3:~# iscsiadm -m discovery
172.16.103.5:3260 via sendtargets

Storage cfg:
iscsi: Qnas-iSCSI-Pool
        portal 172.16.103.5
        target iqn.2004-04.com.qnap:ts-431xeu:iscsi.qnas.322a03
        content none

lvm: Qnas-iSCSI-Lun-0
        vgname Qnas-iSCSI-Lun-0
        content images,rootdir
        shared 1


Bad node details:

root@pve2:~# pvesm status

Command failed with status code 5.
command '/sbin/vgscan --ignorelockingfailure --mknodes' failed: exit code 5
storage 'Qnas-iSCSI-Pool' is not online
Name Type Status Total Used Available %
BKP02 pbs active 15475078144 1924850944 13550227200 12.44%
Qnas-iSCSI-Lun-0 lvm inactive 0 0 0 0.00%
Qnas-iSCSI-Pool iscsi inactive 0 0 0 0.00%
local dir active 40516856 18773012 19653452 46.33%
local-lvm lvmthin active 56545280 0 56545280 0.00%


root@pve2:~# iscsiadm -m session
iscsiadm: No active sessions.

root@pve2:~# iscsiadm -m node
[]:3260,4294967295

root@pve2:~# pvesm scan iscsi 172.16.103.5
iscsiadm: Could not stat /etc/iscsi/nodes//,3260,-1/default to delete node: No such file or directory
iscsiadm: Could not add/update [tcp:[hw=,ip=,net_if=,iscsi_if=default] 172.16.103.5,3260,1 iqn.2004-04.com.qnap:ts-431xeu:iscsi.qnas.322a03]
iqn.2004-04.com.qnap:ts-431xeu:iscsi.qnas.322a03 172.16.103.5:3260

root@pve2:~# iscsiadm -m discovery
172.16.103.5:3260 via sendtargets
172.16.100.7:3260 via sendtargets

Storage cfg:

iscsi: Qnas-iSCSI-Pool
        portal 172.16.103.5
        target iqn.2004-04.com.qnap:ts-431xeu:iscsi.qnas.322a03
        content none

lvm: Qnas-iSCSI-Lun-0
        vgname Qnas-iSCSI-Lun-0
        content images,rootdir
        shared 1
 
What is the output of "pveversion" from each of the nodes?
good:
root@pve3:~# iscsiadm -m discovery
172.16.103.5:3260 via sendtargets
bad:
root@pve2:~# iscsiadm -m discovery
172.16.103.5:3260 via sendtargets
172.16.100.7:3260 via sendtargets
Where does 100.7 come from? Is it reachable from the server? You may be suffering from a recent change in handling of multi-IP targets: if a non-reachable IP is reported by the target, the initiator (PVE) gets confused.

You can try to specify a particular portal:
iscsiadm --mode discovery --type sendtargets --portal 172.16.103.5

But first you should delete the strange iscsi node entry on "bad" server:
root@pve2:~# iscsiadm -m node
[]:3260,4294967295
iscsiadm -m node -o delete
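
If only one record is stale, the delete can also be targeted explicitly, using the target IQN and portal reported by "iscsiadm -m node", for example:

iscsiadm -m node -T iqn.2004-04.com.qnap:ts-431xeu:iscsi.qnas.322a03 -p 172.16.103.5 -o delete

Since the record here is malformed (empty target name), iscsiadm may refuse to touch it, in which case the stale entries under /etc/iscsi/nodes/ have to be removed by hand.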


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
172.16.100.7 is the management IP for the NAS; the Proxmox nodes have management IPs in this range as well. The NAS does not normally accept iSCSI connections on that subnet, but I did allow them yesterday while troubleshooting (and it connected / worked fine).

Trying to specify a particular portal, I get an error:

root@pve2:~# iscsiadm --mode discovery --type sendtargets --portal 172.16.103.5
iscsiadm: Could not stat /etc/iscsi/nodes//,3260,-1/default to delete node: No such file or directory
iscsiadm: Could not add/update [tcp:[hw=,ip=,net_if=,iscsi_if=default] 172.16.103.5,3260,1 iqn.2004-04.com.qnap:ts-431xeu:iscsi.qnas.322a03]
172.16.103.5:3260,1 iqn.2004-04.com.qnap:ts-431xeu:iscsi.qnas.322a03

When I try to delete the bad node entry I get this error:

root@pve2:~# iscsiadm -m node -o delete
iscsiadm: Could not stat /etc/iscsi/nodes//,3260,-1/default to delete node: No such file or directory
iscsiadm: Could not execute operation on all records: encountered iSCSI database failure
 
I have resolved it -- thanks for the help.

I removed the iSCSI storage from the node in the Datacenter view,

then on the node went into this directory: /etc/iscsi/nodes/

and removed all the listings manually,

then re-added the iSCSI storage for the node in the Datacenter storage manager,

and this has got it working now.
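
For anyone hitting this later, a rough sketch of the same fix from the shell (storage name, portal, and target are the ones from this thread; note that clearing /etc/iscsi/nodes removes ALL iSCSI node records on that node, and that pvesm remove/add edits the cluster-wide /etc/pve/storage.cfg, not just this node):

# on the affected node: clear the stale/corrupt iSCSI node records
rm -r /etc/iscsi/nodes/*
# remove and re-add the iSCSI storage definition
pvesm remove Qnas-iSCSI-Pool
pvesm add iscsi Qnas-iSCSI-Pool \
    --portal 172.16.103.5 \
    --target iqn.2004-04.com.qnap:ts-431xeu:iscsi.qnas.322a03 \
    --content none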
 
Thanks, this helped me solve the same problem too.
 
