Quick HOWTO on setting up iSCSI Multipath

uptonguy75

Nov 15, 2024
Hi Everyone,

I originally included this HOWTO guide as a reply to someone else's post, but I'm posting it in its own thread as it may help others who struggle to get proper iSCSI MPIO working on a Proxmox cluster. Coming from an enterprise VMware ESXi background, I wanted my shared storage set up the same way in Proxmox (albeit without LVM snapshots, but Veeam fills this gap). Note that this configuration is done entirely from the shell, not through the web UI. I found that the GUI iSCSI storage setup does not work for MPIO, as it only creates a single PVE NIC-to-iSCSI IP link, which isn't helpful when the SAN has redundant IPs. The CLI process creates an MPIO device that covers all PVE NIC-to-iSCSI IP links; the MPIO alias is then added to the PVE storage list (like a Datastore in ESXi). I also included a quick & dirty MPIO check script to add to cron, which checks MPIO status every minute and sends an email alert should anything about MPIO change.
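
To give an idea of what that watchdog does, here is a minimal sketch (not the exact script from the attachment): it snapshots multipath -ll once a minute from cron, diffs it against the previous snapshot, and mails the diff if anything changed. The recipient address and file paths are placeholders, and it assumes a working local mail command (e.g. from mailutils).

Bash:
#!/bin/bash
# /usr/local/sbin/mpio-check.sh -- run from cron every minute, e.g.:
# * * * * * root /usr/local/sbin/mpio-check.sh
MAILTO="storage-alerts@example.com"   # placeholder recipient
STATE="/var/tmp/mpio-check.state"     # last known MPIO snapshot

CURRENT="$(multipath -ll 2>/dev/null)"

# First run: record a baseline and exit quietly
if [ ! -f "$STATE" ]; then
    printf '%s\n' "$CURRENT" > "$STATE"
    exit 0
fi

# Any difference (path down, device gone, new LUN) triggers an email
if ! printf '%s\n' "$CURRENT" | diff -u "$STATE" - > /var/tmp/mpio-check.diff; then
    mail -s "MPIO change detected on $(hostname)" "$MAILTO" < /var/tmp/mpio-check.diff
    printf '%s\n' "$CURRENT" > "$STATE"   # new baseline after alerting
fi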

Best,
UG
 


Thank you. Could you please elaborate on what your guide does differently from the official documentation?

https://pve.proxmox.com/wiki/Multipath

Hi LnxBil,

In the "Multipath" guide, this didn't work for me:

"Then, configure your iSCSI storage on the GUI ("Datacenter->Storage->Add->iSCSI"). In the "Add: iSCSI" dialog, enter the IP of an arbitrary portal. Usually, the iSCSI target advertises all available portals back to the iSCSI initiator, and in the default configuration, Proxmox VE will try to connect to all advertised portals."
Adding the iSCSI entry via the GUI only created a single new iSCSI interface called "default", which was tied to just one of my PVE iSCSI NICs. As I have two PVE iSCSI NICs that need to be part of the multipath, it worked out better for me to scan the iSCSI portals individually via each of the two NICs:

iscsiadm -m discovery -I <nic_if1> --op=new --op=del --type sendtargets --portal <portal IP>:3260 <target initiator>
iscsiadm -m discovery -I <nic_if2> --op=new --op=del --type sendtargets --portal <portal IP>:3260 <target initiator>

This way, two separate interface entries were added to the iSCSI configuration, and multipath -ll showed 8 established paths (2 PVE NICs x 4 SAN NICs).
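
For anyone following along, the per-NIC setup is roughly the following (a sketch with placeholder interface names, portal IP and target IQN, not copied verbatim from the guide):

Bash:
# Create one iSCSI iface per physical NIC and bind it to that NIC
iscsiadm -m iface -I iface-ens2f0np0 --op=new
iscsiadm -m iface -I iface-ens2f0np0 --op=update -n iface.net_ifacename -v ens2f0np0
iscsiadm -m iface -I iface-ens2f1np1 --op=new
iscsiadm -m iface -I iface-ens2f1np1 --op=update -n iface.net_ifacename -v ens2f1np1

# Discover the target through each iface, then log in on both
iscsiadm -m discovery -t sendtargets -p 10.10.42.11:3260 -I iface-ens2f0np0
iscsiadm -m discovery -t sendtargets -p 10.10.42.11:3260 -I iface-ens2f1np1
iscsiadm -m node -T iqn.2000-01.com.example:target-1 -I iface-ens2f0np0 --login
iscsiadm -m node -T iqn.2000-01.com.example:target-1 -I iface-ens2f1np1 --login

# All paths should now show up under a single mapper device
multipath -ll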

Once the multipath mapper device is added to PVE (pvesm add lvm <Datastore ID> --vgname <Datastore ID>), the original GUI-added non-multipath iSCSI entry just clutters the Datacenter->Storage list, as it is no longer needed once the multipath LVM device is listed.
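
For reference, the tail end of that process looks something like this (a sketch using the same <Datastore ID> placeholder; --shared 1 is what makes the LVM usable cluster-wide):

Bash:
# One-time, on a single node: put LVM on the multipath mapper device
pvcreate /dev/mapper/<Datastore ID>
vgcreate <Datastore ID> /dev/mapper/<Datastore ID>

# Register the VG as shared LVM storage for the whole cluster
pvesm add lvm <Datastore ID> --vgname <Datastore ID> --shared 1 --content images,rootdir
pvesm status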

Later this week, I'll be setting up a new node so I can better document where the process fell apart for me and will re-post then.
 
As promised, when I added a new node this week, I took a closer look at the "Storage: iSCSI" & "Storage: Multipath" wiki pages. I significantly overhauled my iSCSI MPIO guide and re-uploaded it.

The "Storage: iSCSI" & "Storage: Multipath" pages don't address configuring multiple iSCSI interfaces per node. Without this, iSCSI connections will only use a single NIC (called "default" in the iSCSI configs). Therefore the MPIO config will only use half the available paths in a dual iSCSI NIC setup.

Coming from an enterprise VMware environment, I wrote the guide to simplify the process of getting shared iSCSI LVM storage with MPIO working in a Proxmox cluster. The guide covers:
  • Configuring dual iSCSI NICs
  • Configuring iSCSI with dual NICs
  • Adding a new shared iSCSI LUN with MPIO
  • Creating a new LVM Datastore on shared iSCSI LUN
  • Adding an existing shared iSCSI LVM Datastore
  • Removing an iSCSI LVM Datastore
  • Configuring watchdogs to generate email alerts:
    • Monitoring changes to MPIO status
    • Monitoring changes to Datastore availability status
  • Extra helpful scripts:
    • Show only iSCSI connections for specific target
    • Show WWID mapped to specific block device
I hope that it will be useful to others who are looking to do the same thing.
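
As a taste of the "extra helpful scripts", the last two items are essentially one-liners along these lines (a sketch; the target IQN and device name are placeholders):

Bash:
# Show only the iSCSI sessions for a specific target
iscsiadm -m session | grep 'iqn.2000-01.com.example:target-1'

# Show the WWID for a specific block device (e.g. /dev/sdi)
/lib/udev/scsi_id -g -u -d /dev/sdi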
 


Hi, thank you for sharing your setup steps! I'd like to better understand in which sense the current Multipath wiki article [1] doesn't cover your setup. I see your guide uses the Open-iSCSI ifaces [2] feature, where Open-iSCSI manages the network interfaces that connect to the SAN, whereas the wiki article doesn't -- there, the PVE host's network stack handles connectivity.

Do I understand correctly that the main reason for using the ifaces feature is that your SAN has two IPs in the same subnet (10.10.42.x/24 in your guide), and thus multipathing isn't really possible when using the host's network stack? If yes -- does the SAN also support configuring the two IPs in two disjoint subnets? If so, it should be possible to assign the PVE node two IPs in the same two disjoint subnets and let the PVE host's network stack handle the networking (as in the wiki article).
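
(For illustration only -- and purely my assumption of what such a layout could look like -- the disjoint-subnet variant would give the PVE node something like this in /etc/network/interfaces, with one SAN portal per subnet; interface names and IPs are placeholders:)

Code:
# /etc/network/interfaces (excerpt) -- hypothetical two-subnet layout
auto ens2f0np0
iface ens2f0np0 inet static
        address 10.10.42.2/24    # path A, SAN portal e.g. 10.10.42.10

auto ens2f1np1
iface ens2f1np1 inet static
        address 10.10.43.2/24    # path B, SAN portal e.g. 10.10.43.10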

[1] https://pve.proxmox.com/wiki/Multipath
[2] https://github.com/open-iscsi/open-iscsi/blob/df0f2bf9cba81333b9d171bfd0635eda522fcb5b/README#L586
 
Question: how would you go about using multipath for storage type iscsi (direct map)?
By "direct map", do you mean the QEMU integration that is also used for e.g. ZFS-over-iSCSI? AFAIK, QEMU did not support this in the past, but I don't know the current status or whether it has been implemented since.
 
Hi, thank you for sharing your setup steps! I'd like to better understand in which sense the current Multipath wiki article [1] doesn't cover your setup. I see your guide uses the Open-iSCSI ifaces [2] feature, where Open-iSCSI manages the network interfaces that connect to the SAN, whereas the wiki article doesn't -- there, the PVE host's network stack handles connectivity.

Do I understand correctly that the main reason for using the ifaces feature is that your SAN has two IPs in the same subnet (10.10.42.x/24 in your guide), and thus multipathing isn't really possible when using the host's network stack? If yes -- does the SAN also support configuring the two IPs in two disjoint subnets? If so, it should be possible to assign the PVE node two IPs in the same two disjoint subnets and let the PVE host's network stack handle the networking (as in the wiki article).

[1] https://pve.proxmox.com/wiki/Multipath
[2] https://github.com/open-iscsi/open-iscsi/blob/df0f2bf9cba81333b9d171bfd0635eda522fcb5b/README#L586

Hello. My apologies for the delay in responding to you. Reflecting on my configuration, I realized the wiki article didn't work for me because of how I chose to architect the storage LAN. Coming to Proxmox with ~20 years of VMware vCenter experience, I opted to recreate the storage architecture I was comfortable with under VMware: a single storage VLAN for iSCSI traffic rather than two distinct storage networks. Had I used two different storage networks, the wiki article guidance would have identified both distinct interfaces rather than just one. Since I have just one storage network, I needed to manually define the interfaces (IP & MAC addresses) with iscsiadm in order for it to properly build out MPIO, and then simply add the datastore/LUN to the cluster manually using pvesm. Using this architecture, I've never yet had a storage failure in any of my VMware clusters.

The architecture for my cluster has each Proxmox node with (2) 10-gbit links for storage (in the same VLAN). Each NIC connects to a different Cisco storage switch. The storage switches have trunks between them and trunks connecting them to the network core. Rapid per-VLAN spanning tree keeps the network loop free. If my storage network were built with SOHO network switches that don't offer STP, then yes, building out 2 separate networks would be the better option.

My risk analysis deemed that splitting storage NICs between separate LANs is riskier, as there is slightly less failure tolerance in the event of multiple link failures. This is most evident with my lower-end SAN units, which are used mostly for testing and only have one storage NIC each. With separate networks, I have to pick one of the two storage networks for such a SAN to live on, which allows only one path to it from each node; in the event of a node storage NIC failure, the SAN is unreachable. However, with a single storage VLAN, if one storage NIC on a node goes down, there is still a viable path via the node's second storage NIC.

I think I will revise my guide to note that it uses a single storage VLAN and was designed to mimic VMware ESXi best practices. This may help others who are migrating to Proxmox from VMware in the enterprise.
 
This is how I set up my Proxmox as well, defining the MAC of each NIC within its iface entry. This allows both interfaces of the NIC to be used. It actually tripled the read IO and doubled the write IO.

Bash:
iscsiadm -m discovery -t st -p 10.10.254.50:3260 --interface=ens2f0np0 --discover
iscsiadm -m discovery -t st -p 10.10.254.50:3260 --interface=ens2f1np1 --discover

Bash:
iscsiadm -m iface -I ens2f0np0 --op=update -n iface.hwaddress -v bc:97:e1:78:47:60
iscsiadm -m iface -I ens2f1np1 --op=update -n iface.hwaddress -v bc:97:e1:78:47:61
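
(As a quick sanity check -- just a sketch, not output from my boxes -- each iface record should now carry its bound MAC, and each target should end up with one session per interface:)

Bash:
# Dump the iface records to confirm iface.hwaddress was applied
iscsiadm -m iface -I ens2f0np0
iscsiadm -m iface -I ens2f1np1
# One session per interface per portal
iscsiadm -m session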

Others have given me the same spiel about having each interface on a different VLAN. Our VMware environment has never had that and has worked fine.

Both interfaces showing traffic during a fio test:
 
The architecture for my cluster has each Proxmox node with (2) 10-gbit links for storage (in the same VLAN)
Best practices for MPIO are to separate physical and logical networks. Each of your links should be in a separate VLAN with a separate CIDR. The reason for this is that most operating systems (Linux included) will send traffic only to the interface with the primary gateway.
 
Best practices for MPIO are to separate physical and logical networks. Each of your links should be in a separate VLAN with a separate CIDR. The reason for this is that most operating systems (Linux included) will send traffic only to the interface with the primary gateway.

I respectfully disagree with your stated reason that MPIO will only send traffic to one interface in a VLAN based on the assigned gateway. The gateway address only matters when you need to direct traffic outside the current layer 2 domain (e.g. the storage VLAN). An iSCSI VLAN shouldn't be configured for layer 3 routing; it should be isolated, and all the devices within that VLAN will have IP addresses in the same subnet, negating the need for traffic to route outside of it.

MPIO will send traffic out whichever interfaces are explicitly configured, using the defined multipath policy (e.g. round robin, most recently used, fixed, etc.). In my case, my host node has 2 iSCSI NICs defined in MPIO and my SAN has 4 NICs -- all of them in the same VLAN. MPIO calculates 8 viable paths between each host and my SAN:

vt1-proxmox-lun-1 (36908109571039535892d54e7d4f1fdbaeb45) dm-11 SYNOLOGY,Storage
size=1.9T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=30 status=active
|- 29:0:0:1 sdi 8:128 active ready running
|- 30:0:0:1 sdj 8:144 active ready running
|- 35:0:0:1 sdx 65:112 active ready running
|- 36:0:0:1 sdy 65:128 active ready running
|- 31:0:0:1 sdah 66:16 active ready running
|- 32:0:0:1 sdak 66:64 active ready running
|- 33:0:0:1 sdax 67:16 active ready running
`- 34:0:0:1 sday 67:32 active ready running


Both iSCSI NICs on the host are pushing equal amounts of data, in line with an even distribution between NICs, even though they are both in the same VLAN. The SAN also reports that both host iSCSI NIC IPs are connected to each LUN.

That aside, using two distinct VLANs for iSCSI vs. a single VLAN is mostly six of one, half a dozen of the other, unless a SAN vendor requires a specific approach for hypervisor integration.
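
(For context, the policy I'm referring to lives in /etc/multipath.conf. The excerpt below is only a sketch of the relevant knobs, not my exact file; the wwid/alias pair matches the mapper device shown above:)

Code:
# /etc/multipath.conf (excerpt) -- illustrative only
defaults {
        find_multipaths         "yes"
        path_selector           "round-robin 0"
        path_grouping_policy    multibus
        failback                immediate
}

multipaths {
        multipath {
                wwid    36908109571039535892d54e7d4f1fdbaeb45
                alias   vt1-proxmox-lun-1
        }
}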
 
Thank goodness I found your document.

While I otherwise followed the guide at "https://pve.proxmox.com/wiki/Multipath", I could not get the LVM to show up in Proxmox until I got to the end of your document and found that you need to run the command pvesm add lvm <Datastore ID> --vgname <Datastore ID>
Why the heck is that not in the "official" document? Everything else went smoothly, but I spent an hour or so trying to figure out what final step I was missing to make the VG accessible to Proxmox...

I tried to go back and add this to the Wiki, but it looks like it is closed and I can't create an account. Maybe someone with access can take that on...???

Madness!!
 
Thank goodness I found your document.

While I otherwise followed the guide at "https://pve.proxmox.com/wiki/Multipath", I could not get the LVM to show up in Proxmox until I got to the end of your document and found that you need to run the command pvesm add lvm <Datastore ID> --vgname <Datastore ID>
Why the heck is that not in the "official" document? Everything else went smoothly, but I spent an hour or so trying to figure out what final step I was missing to make the VG accessible to Proxmox...

I tried to go back and add this to the Wiki, but it looks like it is closed and I can't create an account. Maybe someone with access can take that on...???

Madness!!

You're welcome! I'm glad you found it helpful.
 
The architecture for my cluster has each Proxmox node with (2) 10-gbit links for storage (in the same VLAN). Each NIC connects to a different Cisco storage switch. The storage switches have trunks between them and trunks connecting them to the network core. Rapid per-VLAN spanning tree keeps the network loop free. If my storage network were built with SOHO network switches that don't offer STP, then yes, building out 2 separate networks would be the better option.
MPIO will send traffic out whichever interfaces are explicitly configured, using the defined multipath policy (e.g. round robin, most recently used, fixed, etc.). In my case, my host node has 2 iSCSI NICs defined in MPIO and my SAN has 4 NICs -- all of them in the same VLAN. MPIO calculates 8 viable paths between each host and my SAN:

vt1-proxmox-lun-1 (36908109571039535892d54e7d4f1fdbaeb45) dm-11 SYNOLOGY,Storage
size=1.9T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=30 status=active
|- 29:0:0:1 sdi 8:128 active ready running
|- 30:0:0:1 sdj 8:144 active ready running
|- 35:0:0:1 sdx 65:112 active ready running
|- 36:0:0:1 sdy 65:128 active ready running
|- 31:0:0:1 sdah 66:16 active ready running
|- 32:0:0:1 sdak 66:64 active ready running
|- 33:0:0:1 sdax 67:16 active ready running
`- 34:0:0:1 sday 67:32 active ready running
Hi, thank you for the additional information! Just so I understand correctly: The SAN having four NICs also means the SAN has 4 distinct IP addresses in the same subnet assigned (10.10.42.0/24 in your example)?
Would it be possible to post the output of iscsiadm -m session and iscsiadm -m iface on the PVE node? (feel free to censor out the concrete target names, I'm mostly interested in the IPs).

While I otherwise followed the guide at "https://pve.proxmox.com/wiki/Multipath", I could not get the LVM to show up in Proxmox until I got to the end of your document and found that you need to run the command pvesm add lvm <Datastore ID> --vgname <Datastore ID>
Why the heck is that not in the "official" document? Everything else went smoothly, but I spent an hour or so trying to figure out what final step I was missing to make the VG accessible to Proxmox...
The "LVM on top of a LUN" [0] mentions that the LVM storage needs to be created when using either of the two described options, though it uses the GUI to create the storage ('navigate to the Datacenter->Storage panel and click "Add: LVM"' and 'You can then create an LVM storage for that VG in the Proxmox GUI') -- if I understand correctly, this should have done the trick in your case? Happy to clarify the phrasing in the wiki article though, if you have any suggestions.

I tried to go back and add this to the WIki, but it looks like it is closed and I can't create an account. Maybe someone with access can take that on...???
See [1] for how to create a wiki account

[0] https://pve.proxmox.com/wiki/Multipath#LVM_on_top_of_a_LUN
[1] https://forum.proxmox.com/threads/how-can-we-contribute-to-the-wiki.93970/
 
Hi, thank you for the additional information! Just so I understand correctly: The SAN having four NICs also means the SAN has 4 distinct IP addresses in the same subnet assigned (10.10.42.0/24 in your example)?
Would it be possible to post the output of iscsiadm -m session and iscsiadm -m iface on the PVE node? (feel free to censor out the concrete target names, I'm mostly interested in the IPs).

Yes, my SAN has 4 different IP addresses assigned to it (2 controllers x 2 NICs/controller). Any NIC on the SAN can handle requests for all LUNs/targets.

Here is the obfuscated output you requested. This is for just 1 LUN/target:

tcp: [56] 10.205.81.174:3260,10 iqn.2000-01.com.synology:CCORP-SYN-5.Target-06.9018439838 (non-flash)
tcp: [62] 10.205.81.174:3260,10 iqn.2000-01.com.synology:CCORP-SYN-5.Target-06.9018439838 (non-flash)
tcp: [63] 10.205.81.171:3260,7 iqn.2000-01.com.synology:CCORP-SYN-5.Target-06.9018439838 (non-flash)
tcp: [64] 10.205.81.171:3260,7 iqn.2000-01.com.synology:CCORP-SYN-5.Target-06.9018439838 (non-flash)
tcp: [65] 10.205.81.172:3260,9 iqn.2000-01.com.synology:CCORP-SYN-5.Target-06.9018439838 (non-flash)
tcp: [66] 10.205.81.172:3260,9 iqn.2000-01.com.synology:CCORP-SYN-5.Target-06.9018439838 (non-flash)
tcp: [67] 10.205.81.173:3260,8 iqn.2000-01.com.synology:CCORP-SYN-5.Target-06.9018439838 (non-flash)
tcp: [68] 10.205.81.173:3260,8 iqn.2000-01.com.synology:CCORP-SYN-5.Target-06.9018439838 (non-flash)

enp202s0f0np0 tcp,3c:ec:ef:af:2d:7a,10.205.81.81,<empty>,<empty>
ens2f0np0 tcp,3c:ec:ef:af:2d:74,10.205.81.91,<empty>,<empty>


As you can see, there are 2 host connections to each SAN IP.

The "LVM on top of a LUN" [0] mentions that the LVM storage needs to be created when using either of the two described options, though it uses the GUI to create the storage ('navigate to the Datacenter->Storage panel and click "Add: LVM"' and 'You can then create an LVM storage for that VG in the Proxmox GUI') -- if I understand correctly, this should have done the trick in your case? Happy to clarify the phrasing in the wiki article though, if you have any suggestions.


See [1] for how to create a wiki account

[0] https://pve.proxmox.com/wiki/Multipath#LVM_on_top_of_a_LUN
[1] https://forum.proxmox.com/threads/how-can-we-contribute-to-the-wiki.93970/
 
Yes, my SAN has 4 different IP addresses assigned to it (2 controllers x 2 NICs/controller). Any NIC on the SAN can handle requests for all LUNs/targets.

Here is the obfuscated output you requested. This is for just 1 LUN/target:

tcp: [56] 10.205.81.174:3260,10 iqn.2000-01.com.synology:CCORP-SYN-5.Target-06.9018439838 (non-flash)
tcp: [62] 10.205.81.174:3260,10 iqn.2000-01.com.synology:CCORP-SYN-5.Target-06.9018439838 (non-flash)
tcp: [63] 10.205.81.171:3260,7 iqn.2000-01.com.synology:CCORP-SYN-5.Target-06.9018439838 (non-flash)
tcp: [64] 10.205.81.171:3260,7 iqn.2000-01.com.synology:CCORP-SYN-5.Target-06.9018439838 (non-flash)
tcp: [65] 10.205.81.172:3260,9 iqn.2000-01.com.synology:CCORP-SYN-5.Target-06.9018439838 (non-flash)
tcp: [66] 10.205.81.172:3260,9 iqn.2000-01.com.synology:CCORP-SYN-5.Target-06.9018439838 (non-flash)
tcp: [67] 10.205.81.173:3260,8 iqn.2000-01.com.synology:CCORP-SYN-5.Target-06.9018439838 (non-flash)
tcp: [68] 10.205.81.173:3260,8 iqn.2000-01.com.synology:CCORP-SYN-5.Target-06.9018439838 (non-flash)

enp202s0f0np0 tcp,3c:ec:ef:af:2d:7a,10.205.81.81,<empty>,<empty>
ens2f0np0 tcp,3c:ec:ef:af:2d:74,10.205.81.91,<empty>,<empty>


As you can see, there are 2 host connections to each SAN IP.
Thanks a lot! I tried to replicate your setup as closely as possible by following your guide (with a LIO target instead of a SAN, though). One potential problem I noticed is that, at least in my test setup, the PVE node (iSCSI initiator) responds to ARP requests for any of its IPs (10.205.81.81 and 10.205.81.91 in your example) on *both* NICs -- hence, an ARP request from the target for any of the initiator IPs gets two replies with two different MAC addresses. More specifically, on my test system, running `arping` on the target against one of the PVE IPs (172.16.0.201):
Code:
# arping 172.16.0.201 -c1
ARPING 172.16.0.201
42 bytes from bc:24:11:60:f4:b8 (172.16.0.201): index=0 time=159.063 usec
42 bytes from bc:24:11:59:5e:95 (172.16.0.201): index=1 time=199.568 usec
This is potentially bad, as the target may update its ARP table entries for the initiator IPs depending on which ARP reply it sees first.
Do you have the possibility to monitor the ARP table on your SAN? If yes, can you check whether you see any flapping ARP table entries? If possible, you could also run an `arping` from target to initiator and see whether you get two replies too -- but be careful: on my test system this occasionally makes one of the paths fail.

This problem can probably be solved by some combination of ARP filtering tweaks and source-based routing/VRFs on the PVE side, but this makes everything a little more complex.
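
(For anyone who wants to experiment: the ARP-side tweaks I have in mind are the usual arp_ignore/arp_announce sysctls, shown below only as a sketch -- it does not cover the source-based routing part, and I have not validated it against a real SAN:)

Code:
# /etc/sysctl.d/90-iscsi-arp.conf -- illustrative only
# arp_ignore=1: only answer ARP requests if the target IP is configured
#               on the interface the request arrived on
# arp_announce=2: always use the best local address of the outgoing
#                 interface as the ARP source
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2

After applying (sysctl --system), rerunning the arping test should show only a single reply per initiator IP.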
 
I can confirm I'm seeing the same behavior. Both MACs are responding.


I'm reaching out to the SAN vendor to find out whether there's a command that shows what the SAN sees. I'll provide an update after their response.

EDIT:

Are there supposed to be any noticeable issues because of this? So far, I haven't seen any odd behavior. We're getting 6 GB/s to our SAN, which I believe is the limit of the 2x25GbE connections we use.
 