Weirdest thing ever!!!

Nov 26, 2024
Hi all,

I have the weirdest thing ever. I have a two-node cluster with iSCSI multipath. We had a power outage a while ago and the storage on one of the nodes started to have intermittent connection issues. It would connect some of the time and fail with communication failure (0) or connection timed out (596) the rest. The summary page for the storage would show periods of connection with gaps in between.

I narrowed it down to the multipath entries after running iscsiadm -m session -P3 on both nodes and comparing the outputs. The failing node has lost the target info on two of its paths: where each node should have two routes to each of the two storage volumes, this node only has one active route to them.

What I would like to do is reset this connection between this node and the volumes without wiping out the connections with the other node.
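For anyone comparing outputs the same way, here is a small sketch of counting logged-in sessions per target IQN from "iscsiadm -m session" output. The IQNs and IPs below are placeholders standing in for real output, not the poster's actual targets; on a healthy multipath setup each target should appear once per portal/path:

```shell
# Hypothetical sample of "iscsiadm -m session" output (placeholder
# IQNs and TEST-NET IPs); each line is one logged-in session:
sample='tcp: [1] 192.0.2.1:3260,1 iqn.example:vol1 (non-flash)
tcp: [2] 192.0.2.2:3260,1 iqn.example:vol1 (non-flash)
tcp: [3] 192.0.2.1:3260,1 iqn.example:vol2 (non-flash)'

# Count sessions per target IQN (field 4). A target that shows fewer
# sessions than the number of configured paths has lost a path:
printf '%s\n' "$sample" | awk '{print $4}' | sort | uniq -c
```

In this made-up sample, vol1 has both of its paths logged in while vol2 only has one, which is the shape of the symptom described above.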
 
Hi @WacArts ,
> I have the weirdest thing ever. I have a two-node cluster
It is not weird. It is a rather common shortcut that leads to many complaints when both nodes reboot due to quorum loss.
> We had a power outage a while ago and the storage on one of the nodes started to have intermittent connection issues. It would connect some of the time and fail with communication failure (0) or connection timed out (596) the rest.
The easiest solution would be to reboot the node.
> I narrowed it down to the multipath entries after running iscsiadm -m session -P3 on both nodes and comparing the outputs. On the failing node, it has lost the target info on two of the paths. Where there should be two routes between the two storage volumes for each node, there is only one route active between the two storage volumes on the two hosts.
Did you try to figure out why the paths are missing? Are the network paths operational? Can you ping the appropriate IPs? Can you scan with: pvesm scan iscsi
Maybe you have a failed switch/NIC?
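A minimal sketch of those checks, run from the failing node. The portal IPs are placeholders from the reserved TEST-NET range, not real addresses; pvesm scan iscsi takes the portal address to query:

```shell
# Placeholder portal IPs -- substitute the real iSCSI portal addresses:
for ip in 192.0.2.1 192.0.2.2; do
  if ping -c 1 -W 2 "$ip" >/dev/null 2>&1; then
    echo "$ip reachable"
  else
    echo "$ip UNREACHABLE"
  fi
done

# Proxmox-side discovery against one portal (skipped where pvesm is absent):
if command -v pvesm >/dev/null 2>&1; then
  pvesm scan iscsi 192.0.2.1
fi
```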

> What I would like to do is reset this connection between this node and volumes without wiping out the connections with the other node
Any iscsiadm manipulation done on the "bad" node is local to that node. You should examine the logs on that node to understand why the connection was not established.
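A hedged sketch of that per-node approach (the IQN and portal below are placeholders, and the block skips itself on machines without open-iscsi installed). With multipath still holding the surviving path, logging one session out and back in should not disturb the other sessions, and nothing here touches the other node:

```shell
# Skip gracefully where open-iscsi is not installed:
command -v iscsiadm >/dev/null 2>&1 || exit 0

# First, look at why the session failed (assumes systemd journald):
journalctl -u iscsid -u open-iscsi --since "-1h" --no-pager | tail -n 50

# Then re-login ONE target+portal pair; other sessions are untouched:
iscsiadm -m node -T iqn.example:vol1 -p 192.0.2.1:3260 --logout
iscsiadm -m node -T iqn.example:vol1 -p 192.0.2.1:3260 --login
```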


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
> We had a power outage a while ago and the storage on one of the nodes started to have intermittent connection issues. It would connect some of the time and fail with communication failure (0) or connection timed out (596) the rest.
> The easiest solution would be to reboot the node.
I did that when I first noticed the issue; it didn't resolve it then and probably won't resolve it now.

> I narrowed it down to the multipath entries after running iscsiadm -m session -P3 on both nodes and comparing the outputs. On the failing node, it has lost the target info on two of the paths. Where there should be two routes between the two storage volumes for each node, there is only one route active between the two storage volumes on the two hosts.
> Did you try to figure out why the paths are missing? Are the network paths operational? Can you ping the appropriate IPs? Can you scan with: pvesm scan iscsi
> Maybe you have a failed switch/NIC?
I ran ICMP pings on all network ports implicated in this issue on both hosts, and all responded. pvesm scan iscsi also returned all connected. Yes, I did try to figure out why the paths are missing; I just haven't found a logical explanation as to why they aren't there. Everything is reachable, yet two of the connections to the storage from one node haven't established properly.
If someone could explain how to reset these connections without trashing the remaining connections or the storage devices, I would love to hear it.
> What I would like to do is reset this connection between this node and volumes without wiping out the connections with the other node
> Any iscsiadm manipulation done on the "bad" node is local to that node. You should examine the log on the node to understand why the connection was not established.
I have done some more digging: the open-iscsi service fails to start because iscsiadm reports an initiator error. This is the error that has kicked my backup storage over, and I want to restore the initiator connections to the correct iface targets. The iface entries that have lost their connection point at different IP addresses from the actual target, hence the failure: the initiator is looking for something that doesn't exist.
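For that situation, one hedged approach (IQN and portal values below are placeholders, and node records are per-host, so the other cluster node is unaffected) is to compare what this node has recorded against a fresh discovery, then delete only the stale record that points at the wrong IP, so open-iscsi stops chasing a portal that no longer exists:

```shell
# Skip gracefully where open-iscsi is not installed:
command -v iscsiadm >/dev/null 2>&1 || exit 0

iscsiadm -m node    # target/portal pairs this node has recorded
iscsiadm -m iface   # iface definitions and their bindings

# Ask the array what it offers right now (placeholder portal):
iscsiadm -m discovery -t sendtargets -p 192.0.2.1

# Delete only the one stale record (placeholder stale portal/IQN):
iscsiadm -m node -T iqn.example:vol1 -p 192.0.2.99:3260 -o delete
```

After cleaning up the stale records, restarting open-iscsi (or an explicit --login against the corrected records) should let the service start and rebuild the sessions. Note that a sendtargets discovery can itself rewrite node records, so review them again afterwards.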
 
Hi @WacArts,

It would be helpful if you could provide structured data: CLI outputs, command history, and a clear comparison between the “good” and “bad” system states.

It’s always a bit frustrating for the community when troubleshooting becomes a slow back-and-forth of “try this” followed by “I already tried that.” Detailed, upfront information makes it much easier (and faster) to help.

Thanks!


 