ZFS over iSCSI: Multipath alternatives
I’m using Proxmox “ZFS over iSCSI” with a remote Ubuntu storage server (ZFS pools exported via LIO/targetcli). After a reboot I noticed something important:
- VM disks still work
- But the kernel initiator shows no sessions for the ZFS-over-iSCSI storage:
Code:
iscsiadm -m session
- And
Code:
qm showcmd <vmid>
shows that QEMU connects with its userspace iSCSI driver, i.e.
Code:
"driver":"iscsi"
with a single
Code:
"portal":"<ip>"
So host dm-multipath isn’t applicable (the host isn’t the iSCSI initiator for these VM disks).
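Two quick checks make this visible (a sketch; the filter syntax assumes iproute2's ss, and the QEMU processes on Proxmox show up as kvm):
Code:
# kernel initiator: no sessions expected for the ZFS-over-iSCSI disks
iscsiadm -m session

# the TCP connections to port 3260 belong to the VMs' kvm processes instead
ss -ntp 'dport = :3260'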
Goal
I have 2x10G links on both Proxmox hosts and the Ubuntu storage server, each to a different switch (no MLAG/vPC). I want:
- redundancy (if a switch/link fails, storage remains reachable)
- and also to load-balance at least at the “per-pool / per-storage” level (10G per pool, ~20G aggregate if both pools active)
Current L3 layout
Proxmox node:
- NIC1: 192.168.103.5/27 (Switch A)
- NIC2: 192.168.103.35/27 (Switch B)
Ubuntu storage:
- NIC1: 192.168.103.3/27 (Switch A)
- NIC2: 192.168.103.33/27 (Switch B)
Proposed alternative to multipath (VIP + forced source routes)
Create two VIP portals on the storage server and make Proxmox prefer different NICs per VIP:
- VIP1: 192.168.104.3 (intended to prefer Proxmox NIC1 -> Storage NIC1)
- VIP2: 192.168.104.33 (intended to prefer Proxmox NIC2 -> Storage NIC2)
Then publish:
- ZFS Pool A via iSCSI portal VIP1 (192.168.104.3)
- ZFS Pool B via iSCSI portal VIP2 (192.168.104.33)
So under normal operation each pool is “pinned” to one 10G path, but it can fail over to the other if the primary path dies.
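For illustration, the matching /etc/pve/storage.cfg entries could look roughly like this (a sketch: the storage IDs and pool names are placeholders, and the IQN is whatever is already configured on the target):
Code:
zfs: pool-a-vip1
        iscsiprovider LIO
        portal 192.168.104.3
        target iqn.2003-01.org.linux-iscsi.<host>:sn.<...>
        pool poolA
        lio_tpg tpg1
        blocksize 4k
        content images
        sparse 1

zfs: pool-b-vip2
        iscsiprovider LIO
        portal 192.168.104.33
        target iqn.2003-01.org.linux-iscsi.<host>:sn.<...>
        pool poolB
        lio_tpg tpg1
        blocksize 4k
        content images
        sparse 1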
Routing idea on Proxmox
Host routes with an explicit src (source address) and metrics:
- VIP1 prefers NIC1 (primary), fails over to NIC2 (secondary)
Code:
ip route add 192.168.104.3/32 via 192.168.103.3 dev <IFACE_NIC1> src 192.168.103.5 metric 100
ip route add 192.168.104.3/32 via 192.168.103.33 dev <IFACE_NIC2> src 192.168.103.35 metric 200
- VIP2 prefers NIC2 (primary), fails over to NIC1 (secondary)
Code:
ip route add 192.168.104.33/32 via 192.168.103.33 dev <IFACE_NIC2> src 192.168.103.35 metric 100
ip route add 192.168.104.33/32 via 192.168.103.3 dev <IFACE_NIC1> src 192.168.103.5 metric 200
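To keep these routes across reboots on Proxmox (ifupdown2), they can be installed as post-up hooks; a sketch, assuming <IFACE_NIC1>/<IFACE_NIC2> are the storage NICs:
Code:
# /etc/network/interfaces (excerpt, sketch)
auto <IFACE_NIC1>
iface <IFACE_NIC1> inet static
        address 192.168.103.5/27
        post-up ip route add 192.168.104.3/32 via 192.168.103.3 dev <IFACE_NIC1> src 192.168.103.5 metric 100
        post-up ip route add 192.168.104.33/32 via 192.168.103.3 dev <IFACE_NIC1> src 192.168.103.5 metric 200

auto <IFACE_NIC2>
iface <IFACE_NIC2> inet static
        address 192.168.103.35/27
        post-up ip route add 192.168.104.33/32 via 192.168.103.33 dev <IFACE_NIC2> src 192.168.103.35 metric 100
        post-up ip route add 192.168.104.3/32 via 192.168.103.33 dev <IFACE_NIC2> src 192.168.103.35 metric 200
This also re-adds the routes whenever an interface comes back up, which a one-off ip route add does not.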
Verify:
Code:
ip route get 192.168.104.3
ip route get 192.168.104.33
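If the pinning works as intended, the output should look roughly like this (illustrative):
Code:
192.168.104.3 via 192.168.103.3 dev <IFACE_NIC1> src 192.168.103.5 uid 0
    cache
192.168.104.33 via 192.168.103.33 dev <IFACE_NIC2> src 192.168.103.35 uid 0
    cache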
Storage side (Ubuntu) - make VIPs local and bind LIO portals
Idea: add VIPs as /32 so they are always local, then bind LIO portals to them.
Create a dummy interface and add VIPs:
Code:
modprobe dummy
ip link add dummy0 type dummy
ip link set dummy0 up
ip addr add 192.168.104.3/32 dev dummy0
ip addr add 192.168.104.33/32 dev dummy0
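To make the dummy interface and VIPs persistent, one option is plain systemd-networkd drop-in files (a sketch; assumes systemd-networkd is active, as it is behind netplan on a default Ubuntu Server install):
Code:
# /etc/systemd/network/dummy0.netdev
[NetDev]
Name=dummy0
Kind=dummy

# /etc/systemd/network/dummy0.network
[Match]
Name=dummy0

[Network]
Address=192.168.104.3/32
Address=192.168.104.33/32

# then: systemctl restart systemd-networkd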
Bind LIO portals to the VIPs:
Code:
targetcli
cd /iscsi/<IQN>/tpg1/portals
create 192.168.104.3 3260
create 192.168.104.33 3260
cd /
saveconfig
exit
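One detail worth checking first: if the TPG was created with LIO's default catch-all portal, 0.0.0.0:3260 may already exist, in which case the target answers on every address and creating IP-specific portals can fail. Removing it first (a sketch):
Code:
targetcli
cd /iscsi/<IQN>/tpg1/portals
ls
delete 0.0.0.0 3260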
Confirm listeners:
Code:
ss -lntp | grep :3260
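Roughly what I'd expect to see (illustrative; LIO is a kernel-space target, so no process name appears for these listeners):
Code:
LISTEN 0  256  192.168.104.3:3260   0.0.0.0:*
LISTEN 0  256  192.168.104.33:3260  0.0.0.0:*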
Important: rp_filter
Because this is asymmetric-looking routing (forcing source + preferred egress), I believe rp_filter must be set to loose (2) on both sides:
Code:
cat >/etc/sysctl.d/99-iscsi-vip.conf <<'EOF'
net.ipv4.conf.all.rp_filter=2
net.ipv4.conf.default.rp_filter=2
EOF
sysctl --system
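To confirm what is actually in effect afterwards (the kernel applies the higher of the "all" and per-interface values, so all=2 should make every interface loose):
Code:
sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.default.rp_filter
sysctl net.ipv4.conf.<IFACE_NIC1>.rp_filter net.ipv4.conf.<IFACE_NIC2>.rp_filter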
Expected behavior
- Under normal conditions:
- Pool A storage uses VIP1 and stays on the NIC1 10G path
- Pool B storage uses VIP2 and stays on the NIC2 10G path
- Aggregate throughput can reach ~20G if both pools are active (but each VM disk is still one portal / one TCP session)
- On failure:
- routes should switch to the secondary
- BUT: this is not multipath; the iSCSI TCP session used by QEMU userspace initiator will drop and must reconnect
- likely a pause/hiccup; worst case the VM disk could stall depending on reconnect behavior
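Before trusting this with VM traffic, the failover can be rehearsed on the Proxmox node (a sketch; interface names are placeholders):
Code:
# Optional: make routes via a NIC that lost carrier ineligible for lookups;
# without this, a dead switch (port still administratively up) may not trigger
# the metric-based failover.
sysctl -w net.ipv4.conf.all.ignore_routes_with_linkdown=1

# Simulate a path failure (admin-down removes the NIC1 routes) and watch VIP1 move:
ip route get 192.168.104.3
ip link set <IFACE_NIC1> down
ip route get 192.168.104.3            # should now resolve via 192.168.103.33 / <IFACE_NIC2>
ping -c 3 -I 192.168.103.35 192.168.104.3
ip link set <IFACE_NIC1> up           # note: manually added routes do not come back on their own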
I do still have some general questions, though:
1) Is this “VIP + pinned routes” approach considered sane for ZFS-over-iSCSI (QEMU iscsi driver) when MLAG/LACP isn’t possible?
2) Any known pitfalls with LIO portals bound to /32 VIPs on a dummy interface?
3) Is there a better pattern to get redundancy + load distribution (per-storage) without switching away from ZFS-over-iSCSI?
Relevant evidence: QEMU uses userspace iSCSI
Example output from
Code:
qm showcmd
(note the portal pinned to a single IP):
Code:
"driver":"iscsi","portal":"192.168.103.33","target":"iqn.2003-01.org.linux-iscsi.<host>:sn.<...>","lun":1
This is why
Code:
iscsiadm -m session
does not show the VM disk sessions.