ZFS over iSCSI: Multipath alternatives

joaquinv898

New Member
Sep 19, 2025

I’m using Proxmox “ZFS over iSCSI” with a remote Ubuntu storage server (ZFS pools exported via LIO/targetcli). After a reboot I noticed something important:

- VM disks still work
- But
Code:
iscsiadm -m session
shows no sessions for the ZFS-over-iSCSI storage
- And
Code:
qm showcmd <vmid>
shows QEMU is connecting using its userspace iSCSI driver (
Code:
"driver":"iscsi"
) and a single
Code:
"portal":"<ip>"

So host dm-multipath isn’t applicable (the host isn’t the iSCSI initiator for these VM disks).

Goal
I have 2x10G links on both Proxmox hosts and the Ubuntu storage server, each to a different switch (no MLAG/vPC). I want:
- redundancy (if a switch/link fails, storage remains reachable)
- and also to load-balance at least at the “per-pool / per-storage” level (10G per pool, ~20G aggregate if both pools active)

Current L3 layout
Proxmox node:
  • NIC1: 192.168.103.5/27 (Switch A)
  • NIC2: 192.168.103.35/27 (Switch B)

Ubuntu storage:
  • NIC1: 192.168.103.3/27 (Switch A)
  • NIC2: 192.168.103.33/27 (Switch B)

Proposed alternative to multipath (VIP + forced source routes)
Create two VIP portals on the storage server and make Proxmox prefer different NICs per VIP:

  • VIP1: 192.168.104.3 (intended to prefer Proxmox NIC1 -> Storage NIC1)
  • VIP2: 192.168.104.33 (intended to prefer Proxmox NIC2 -> Storage NIC2)

Then publish:
- ZFS Pool A via iSCSI portal VIP1 (192.168.104.3)
- ZFS Pool B via iSCSI portal VIP2 (192.168.104.33)

So under normal operation each pool is “pinned” to one 10G path, but it can fail over to the other if the primary path dies.
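
For reference, a sketch of what the two storage definitions in /etc/pve/storage.cfg could then look like; the storage names, pool names and the options shown here are placeholders, not my actual config:
Code:
# /etc/pve/storage.cfg (sketch; adjust names, pools, target)
zfs: poolA-iscsi
        iscsiprovider LIO
        portal 192.168.104.3
        target iqn.2003-01.org.linux-iscsi.<host>:sn.<...>
        pool poolA
        lio_tpg tpg1
        content images
        sparse 1

zfs: poolB-iscsi
        iscsiprovider LIO
        portal 192.168.104.33
        target iqn.2003-01.org.linux-iscsi.<host>:sn.<...>
        pool poolB
        lio_tpg tpg1
        content images
        sparse 1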

Routing idea on Proxmox
Host routes with explicit
Code:
src
and metrics:

  • VIP1 prefers NIC1 (primary), fails over to NIC2 (secondary)
Code:
ip route add 192.168.104.3/32  via 192.168.103.3  dev <IFACE_NIC1> src 192.168.103.5  metric 100
ip route add 192.168.104.3/32  via 192.168.103.33 dev <IFACE_NIC2> src 192.168.103.35 metric 200

  • VIP2 prefers NIC2 (primary), fails over to NIC1 (secondary)
Code:
ip route add 192.168.104.33/32 via 192.168.103.33 dev <IFACE_NIC2> src 192.168.103.35 metric 100
ip route add 192.168.104.33/32 via 192.168.103.3  dev <IFACE_NIC1> src 192.168.103.5  metric 200

Verify:
Code:
ip route get 192.168.104.3
ip route get 192.168.104.33
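
Since these routes are added by hand, they disappear after a reboot or when the link flaps; one option is to hang them off the interface definitions. A minimal sketch for /etc/network/interfaces on the Proxmox node, assuming the same placeholder NIC names as above:
Code:
auto <IFACE_NIC1>
iface <IFACE_NIC1> inet static
        address 192.168.103.5/27
        # re-add both VIP routes whenever this NIC comes up
        post-up ip route add 192.168.104.3/32  via 192.168.103.3  dev <IFACE_NIC1> src 192.168.103.5  metric 100 || true
        post-up ip route add 192.168.104.33/32 via 192.168.103.3  dev <IFACE_NIC1> src 192.168.103.5  metric 200 || true

auto <IFACE_NIC2>
iface <IFACE_NIC2> inet static
        address 192.168.103.35/27
        post-up ip route add 192.168.104.33/32 via 192.168.103.33 dev <IFACE_NIC2> src 192.168.103.35 metric 100 || true
        post-up ip route add 192.168.104.3/32  via 192.168.103.33 dev <IFACE_NIC2> src 192.168.103.35 metric 200 || true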

Storage side (Ubuntu) - make VIPs local and bind LIO portals
Idea: add VIPs as /32 so they are always local, then bind LIO portals to them.

Create a dummy interface and add VIPs:
Code:
modprobe dummy
ip link add dummy0 type dummy
ip link set dummy0 up

ip addr add 192.168.104.3/32  dev dummy0
ip addr add 192.168.104.33/32 dev dummy0

Bind LIO portals to the VIPs:
Code:
targetcli
cd /iscsi/<IQN>/tpg1/portals
create 192.168.104.3 3260
create 192.168.104.33 3260
cd /
saveconfig
exit

Confirm listeners:
Code:
ss -lntp | grep :3260
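
To make the dummy interface and the VIPs survive a reboot of the storage server, one option (a sketch, assuming systemd-networkd is the active backend, which is what netplan renders to on Ubuntu Server) is a pair of small config files:
Code:
# /etc/systemd/network/90-dummy0.netdev
[NetDev]
Name=dummy0
Kind=dummy

# /etc/systemd/network/90-dummy0.network
[Match]
Name=dummy0

[Network]
Address=192.168.104.3/32
Address=192.168.104.33/32

# apply
systemctl restart systemd-networkd

The LIO side is already persisted by saveconfig; note that the portals presumably only bind cleanly if the VIPs exist before the target configuration is restored at boot.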

Important: rp_filter
Because this is asymmetric-looking routing (forcing source + preferred egress), I believe rp_filter must be set to loose (2) on both sides:

Code:
cat >/etc/sysctl.d/99-iscsi-vip.conf <<'EOF'
net.ipv4.conf.all.rp_filter=2
net.ipv4.conf.default.rp_filter=2
EOF
sysctl --system
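
The effective value per interface is the maximum of the "all" and the per-interface setting, so it can be double-checked like this (interface names are placeholders):
Code:
sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.<IFACE_NIC1>.rp_filter net.ipv4.conf.<IFACE_NIC2>.rp_filter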

Expected behavior
- Under normal conditions:
  • Pool A storage uses VIP1 and stays on the NIC1 10G path
  • Pool B storage uses VIP2 and stays on the NIC2 10G path
  • Aggregate throughput can reach ~20G if both pools are active (but each VM disk is still one portal / one TCP session)
- On failure:
  • routes should switch to the secondary (see the test sketch below)
  • BUT this is not multipath; the iSCSI TCP session used by the QEMU userspace initiator will drop and must reconnect
  • likely a pause/hiccup; worst case the VM disk could stall, depending on reconnect behavior
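
A quick way to exercise the failover from the Proxmox side (a rough sketch only; pulling the cable or shutting the switch port is the more realistic test, since the route is only withdrawn when the kernel actually sees the link go down):
Code:
# drop the primary path for VIP1 and confirm the backup route takes over
ip link set <IFACE_NIC1> down
ip route get 192.168.104.3    # should now resolve via <IFACE_NIC2>

# bring it back; manually added routes on that NIC are gone after the link
# flap and have to be re-added (or hooked into /etc/network/interfaces)
ip link set <IFACE_NIC1> up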

I do have some general questions, though:
1) Is this “VIP + pinned routes” approach considered sane for ZFS-over-iSCSI (QEMU iscsi driver) when MLAG/LACP isn’t possible?
2) Any known pitfalls with LIO portals bound to /32 VIPs on a dummy interface?
3) Is there a better pattern to get redundancy + load distribution (per-storage) without switching away from ZFS-over-iSCSI?

Relevant evidence: QEMU uses userspace iSCSI
Example from
Code:
qm showcmd
(portal pinned to a single IP):
Code:
"driver":"iscsi","portal":"192.168.103.33","target":"iqn.2003-01.org.linux-iscsi.<host>:sn.<...>","lun":1
This is why
Code:
iscsiadm -m session
does not show the VM disk sessions.
 
As an update, I just tested whether this works.

Seems fine... but it doesn't fail over if the IP is merely unreachable; the link/port actually has to go down.

It doesn't seem to come back to the primary once it is available again (I guess that is to be expected: this is TCP, so an established session would not drop).

rp_filter does not seem to need any changes on the Proxmox hosts, so that is good.

I feel the ZFS-over-iSCSI multipath implementation has been stalled for some time, so this may be a suitable alternative. It doesn't come without drawbacks, though... but what does?

Happy Thursday!
 
I can tell you now that the method seems to be working

Maybe it could be an issue if you have really latency-sensitive applications, like busy databases hammering the drives, but even in an IOPS test the VM didn't crash or report errors; the switchover seems to be fast enough.

Again, I know this solution is very specific..

I would rate it as better than "active-backup"
More reliable for storage than "balance-alb"
Worse than LACP but it doesn't require switch support/configuration
Definitely worse than multipath, but multipath doesn't seem to be an option for ZFS over iSCSI

I got VLAN separation, a performance increase, and redundancy; I like this.

I figure that people who don't have a security team scrutinizing what they install may be able to do this with BGP or similar.

The good thing about this is that it only requires a couple of commands using out-of-the-box OS features.

Regards.

PS: That "It doesn't seem to come back to the primary once it is available again" was my error; I was testing in a way that didn't re-create the routes when the interface came back up. Ideally you should test on the switch side, physically, or with ifup / ifdown on the Proxmox side.

PS 2: I noticed that in some cases the interface is not detected as down, so the route doesn't actually change; in that case you can probably use BGP or a small script that switches automatically when needed, like the sketch below.
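
As an illustration of the scripted option, a minimal watchdog sketch (hypothetical; the IPs, interface name and interval are placeholders): it pings the primary path for VIP1 and withdraws or re-adds the preferred metric-100 route accordingly, so the metric-200 backup wins whenever the primary stops answering.
Code:
#!/bin/bash
# Hypothetical watchdog: keep the preferred route to VIP1 only while the
# primary storage NIC answers pings sent out of the primary Proxmox NIC.
VIP=192.168.104.3
PRIMARY_GW=192.168.103.3
IFACE="<IFACE_NIC1>"   # placeholder: set to the real NIC name
SRC=192.168.103.5

while true; do
    if ping -c 2 -W 1 -I "$IFACE" "$PRIMARY_GW" >/dev/null 2>&1; then
        # path is healthy: make sure the preferred (metric 100) route exists
        ip route replace "$VIP/32" via "$PRIMARY_GW" dev "$IFACE" src "$SRC" metric 100
    else
        # path looks dead: drop the preferred route so the metric-200 backup wins
        ip route del "$VIP/32" via "$PRIMARY_GW" dev "$IFACE" metric 100 2>/dev/null
    fi
    sleep 5
done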
 