Storage network randomly reports 'Host Unreachable'

michael_hess

New Member
Aug 8, 2024
4 nodes

Linux Bond0 1Gb LACP bond - 172.16.0.15/24 - Management and cluster network, default gateway
OVS Bond1 - 10Gb LACP bond - no IP (ports are bond1 and many vlans)
OVS IntPort vlan11_iscsi 172.16.10.3/27 - Storage network (iSCSI) on Bond1

Dell PowerStore with IPs:
172.16.10.9 - Discovery
172.16.10.10 - Node A
172.16.10.29 - Node B

I can't get solid pings to 172.16.10.9 or .10.10, yet pings to .10.29 always work. Occasionally all three respond for a short while.

tcpdump while pinging from the 1st node to each storage IP; only .29 shows traffic out, let alone in:
Code:
root@proxmox-srv0-n1:~# tcpdump -i vlan11_iscsi icmp
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vlan11_iscsi, link-type EN10MB (Ethernet), snapshot length 262144 bytes
08:56:47.545935 IP proxmox-srv0-n1-OVSvmbr0 > dellps500t_iscsib2: ICMP echo request, id 41078, seq 1, length 64
08:56:47.546230 IP dellps500t_iscsib2 > proxmox-srv0-n1-OVSvmbr0: ICMP echo reply, id 41078, seq 1, length 64
08:56:48.546933 IP proxmox-srv0-n1-OVSvmbr0 > dellps500t_iscsib2: ICMP echo request, id 41078, seq 2, length 64
08:56:48.547036 IP dellps500t_iscsib2 > proxmox-srv0-n1-OVSvmbr0: ICMP echo reply, id 41078, seq 2, length 64
08:56:49.597195 IP proxmox-srv0-n1-OVSvmbr0 > dellps500t_iscsib2: ICMP echo request, id 41078, seq 3, length 64
08:56:49.597289 IP dellps500t_iscsib2 > proxmox-srv0-n1-OVSvmbr0: ICMP echo reply, id 41078, seq 3, length 64

Pings sourced from the VLAN IP:
Code:
root@proxmox-srv0-n1:~# ping -I 172.16.10.3 172.16.10.9
PING 172.16.10.9 (172.16.10.9) from 172.16.10.3 : 56(84) bytes of data.
From 172.16.10.3 icmp_seq=1 Destination Host Unreachable
From 172.16.10.3 icmp_seq=2 Destination Host Unreachable
From 172.16.10.3 icmp_seq=3 Destination Host Unreachable
^C
--- 172.16.10.9 ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 3066ms
pipe 4
root@proxmox-srv0-n1:~# ping -I 172.16.10.3 172.16.10.10
PING 172.16.10.10 (172.16.10.10) from 172.16.10.3 : 56(84) bytes of data.
From 172.16.10.3 icmp_seq=1 Destination Host Unreachable
From 172.16.10.3 icmp_seq=2 Destination Host Unreachable
^C
--- 172.16.10.10 ping statistics ---
4 packets transmitted, 0 received, +2 errors, 100% packet loss, time 3094ms
pipe 4
root@proxmox-srv0-n1:~# ping -I 172.16.10.3 172.16.10.29
PING 172.16.10.29 (172.16.10.29) from 172.16.10.3 : 56(84) bytes of data.
64 bytes from 172.16.10.29: icmp_seq=1 ttl=64 time=0.304 ms
64 bytes from 172.16.10.29: icmp_seq=2 ttl=64 time=0.114 ms
64 bytes from 172.16.10.29: icmp_seq=3 ttl=64 time=0.107 ms
^C
--- 172.16.10.29 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2038ms
rtt min/avg/max/mdev = 0.107/0.175/0.304/0.091 ms

Code:
root@proxmox-srv0-n1:~# ip route get 172.16.10.9
172.16.10.9 dev vlan11_iscsi src 172.16.10.3 uid 0
    cache

ChatGPT suggested changing this from 2 to 0; no change:

sysctl -w net.ipv4.conf.all.rp_filter=0
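For what it's worth, the effective rp_filter value on an interface is the maximum of the `all` and per-interface settings, so changing only `all` can leave strict filtering active on the VLAN port. A hedged check (interface name taken from this thread):

```shell
# Effective rp_filter is max(net.ipv4.conf.all, net.ipv4.conf.<iface>),
# so clearing "all" alone may not be enough. Check both values:
sysctl net.ipv4.conf.all.rp_filter
sysctl net.ipv4.conf.vlan11_iscsi.rp_filter
# ...and clear the per-interface one as well if it is non-zero:
sysctl -w net.ipv4.conf.vlan11_iscsi.rp_filter=0
```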

It then suggested this, which I'm hesitant to do since this is a production system:

ovs-vsctl set Open_vSwitch . other_config:disable-lro=true
ovs-vsctl set Open_vSwitch . other_config:disable-gro=true
ovs-vsctl set Open_vSwitch . other_config:disable-tso=true

Anyone have a similar issue or suggestions? My other storage arrays work fine on the same subnet. The PowerStore has no firewall or IP restrictions, and I've disabled the Proxmox firewall as well with no change.
 
Your issue is likely due to MTU mismatches, VLAN misconfiguration, or Open vSwitch handling large packets incorrectly. Try the following:

1. Verify MTU Consistency

Check the MTU for all interfaces (bond1, vlan11_iscsi, and storage devices):
Code:
ip link show | grep mtu

Ensure all interfaces in the path use the same MTU (e.g., 9000 for jumbo frames or 1500 for standard frames).
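Beyond comparing configured MTUs, a do-not-fragment ping can verify that large frames actually pass end to end (addresses from this thread; 8972 = 9000 minus 20 bytes of IP header and 8 bytes of ICMP header):

```shell
# -M do sets Don't Fragment; if any hop drops jumbo frames, this fails
# while a normal ping still succeeds.
ping -M do -s 8972 -c 3 172.16.10.29
# For a standard 1500-byte MTU the equivalent payload is 1472:
ping -M do -s 1472 -c 3 172.16.10.29
```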

2. Manually Set VLAN and OVS Settings

Try explicitly setting the VLAN tag:
Code:
ovs-vsctl set port vlan11_iscsi tag=11

Then restart Open vSwitch:
Code:
systemctl restart openvswitch-switch

3. Check Bonding Mode & LACP Timeout

Ensure Bond1 is correctly configured for LACP (mode 802.3ad) and that LACP timeouts are not causing intermittent failures:
Code:
cat /proc/net/bonding/bond1

If issues persist, try setting LACP fast mode:
Code:
ovs-vsctl set port bond1 lacp=active other_config:lacp-time=fast
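One caveat worth adding here: an OVS-managed bond does not appear under /proc/net/bonding/ (that path is only for Linux kernel bonds), so the bond and LACP state have to be read from OVS itself:

```shell
# Bond member status and hashing for an OVS bond:
ovs-appctl bond/show bond1
# LACP negotiation details (partner system ID, port state flags):
ovs-appctl lacp/show bond1
```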

4. Disable Offloading Temporarily for Testing

If MTU and bonding are fine, try disabling Large Receive Offload (LRO), Generic Receive Offload (GRO), and TCP Segmentation Offload (TSO):
Code:
ovs-vsctl set Open_vSwitch . other_config:disable-lro=true
ovs-vsctl set Open_vSwitch . other_config:disable-gro=true
ovs-vsctl set Open_vSwitch . other_config:disable-tso=true


Then restart the OVS switch and test connectivity.
 
Wow thank you for all the steps!

MTU matches across the board.

Manually setting VLAN had no effect.

I don't have a bond1 under /proc/net/bonding/, just the Linux bond0 for management. Here's what the OVS bond looks like:
[screenshot: OVS bond configuration]
This is configured on Extreme Switches, as LACP L2 address-based (vs port based).
Code:
enable sharing 1:8 grouping 1:8 algorithm address-based L2 lacp

Balance-slb is the only thing I don't believe I need; the wiki says that's for unmanaged switches.

Disabled the three indicated features. Still getting Host Unreachable.
 

Your issue seems to be related to Open vSwitch (OVS) bonding, LACP settings, or VLAN handling on the Extreme switches. Since MTU is consistent and disabling offloading had no effect, try the following:


1. Verify OVS Bonding Mode & LACP Settings

Since bond1 is using LACP (balance-tcp) but does not appear under /proc/net/bonding/, check if OVS is properly handling the bond:
Bash:
ovs-appctl bond/show bond1

If it's misconfigured, try switching to LACP Active Mode explicitly:
Bash:
ovs-vsctl set port bond1 lacp=active other_config:lacp-time=fast

Then restart OVS:
Code:
systemctl restart openvswitch-switch


2. Check VLAN Configuration on Extreme Switch

Since VLAN11 is being used for iSCSI, ensure the Extreme Switch is correctly tagging the traffic:
Bash:
show vlan 11

Try setting it manually on OVS:
Bash:
ovs-vsctl set port vlan11_iscsi vlan_mode=access tag=11

Then check if the VLAN is applied:
Bash:
ovs-vsctl list port vlan11_iscsi


3. Verify iSCSI Storage Connection & Routing

Since .10.9 and .10.10 are unreachable but .10.29 works, test connectivity directly from the storage device:
Bash:
ssh admin@172.16.10.9
ping 172.16.10.3

If the storage devices cannot ping back, check the iSCSI network settings on the Dell PowerStore.


4. Debug OVS Flows and ARP Resolution

Check if OVS is correctly forwarding packets:
Bash:
ovs-ofctl dump-flows OVSvmbr0

If ARP is failing, try:
Code:
arp -a | grep 172.16.10

If no ARP entry exists, force it:
Bash:
arp -s 172.16.10.9 xx:xx:xx:xx:xx:xx

(where xx:xx:xx:xx:xx:xx is the MAC address of 10.9).


Conclusion:

  • Check if bond1 is properly handling LACP with ovs-appctl bond/show bond1.
  • Manually enforce VLAN settings on OVS and confirm switch VLAN config.
  • Verify that Dell PowerStore devices can ping the Proxmox host.
  • Check OVS flows and ARP resolution to debug packet forwarding.
I hope this answer was helpful. :)
 
1)

Bash:
root@proxmox-srv0-n1:~# ovs-appctl bond/show bond1
---- bond1 ----
bond_mode: balance-tcp
bond may use recirculation: yes, Recirc-ID : 1
bond-hash-basis: 0
lb_output action: disabled, bond-id: -1
updelay: 0 ms
downdelay: 0 ms
next rebalance: 7972 ms
lacp_status: negotiated
lacp_fallback_ab: false
active-backup primary: <none>
active member mac: 3c:d9:2b:f6:de:44(eno50)

member eno49: enabled
  may_enable: true
  hash 0: 1 kB load
  hash 1: 1 kB load

...

member eno50: enabled
  active member
  may_enable: true
  hash 18: 1 kB load
  hash 24: 1 kB load
  hash 26: 1 kB load
  hash 42: 1 kB load
  hash 57: 6139 kB load
  hash 63: 1 kB load

...

After running the explicit command:
Bash:
root@proxmox-srv0-n1:~# ovs-vsctl set port bond1 lacp=active other_config:lacp-time=fast
root@proxmox-srv0-n1:~# systemctl restart openvswitch-switch
root@proxmox-srv0-n1:~# ovs-appctl bond/show bond1
---- bond1 ----
bond_mode: balance-tcp
bond may use recirculation: yes, Recirc-ID : 1
bond-hash-basis: 0
lb_output action: disabled, bond-id: -1
updelay: 0 ms
downdelay: 0 ms
next rebalance: 8026 ms
lacp_status: negotiated
lacp_fallback_ab: false
active-backup primary: <none>
active member mac: 3c:d9:2b:f6:de:44(eno50)

member eno49: enabled
  may_enable: true
  hash 107: 10 kB load
  hash 112: 1 kB load
  hash 137: 9 kB load
  hash 141: 1 kB load
  hash 142: 3 kB load
  hash 146: 2 kB load
  hash 207: 32 kB load
  hash 232: 1 kB load
  hash 235: 1 kB load

member eno50: enabled
  active member
  may_enable: true
  hash 57: 5314 kB load
  hash 227: 3 kB load
  hash 252: 2 kB load

2)

Vlan 11 has been in place for years, working with vmware, output follows:
Bash:
* Slot-2 OlympusStack1.1 # show vlan 11

VLAN 11: show vlan iSCSI-vlan11
VLAN Interface with name iSCSI-vlan11 created by user
    Admin State:         Enabled     Tagging:   802.1Q Tag 11

ovs-vsctl list port vlan11_iscsi before and after command to set manually
Bash:
root@proxmox-srv0-n1:~# ovs-vsctl list port vlan11_iscsi
_uuid               : c1ecde56-21de-4716-806f-80fb2d078d06
bond_active_slave   : []
bond_downdelay      : 0
bond_fake_iface     : false
bond_mode           : []
bond_updelay        : 0
cvlans              : []
external_ids        : {}
fake_bridge         : false
interfaces          : [7fb86c94-2b8e-4322-b1ce-5db76fc046d3]
lacp                : []
mac                 : []
name                : vlan11_iscsi
other_config        : {}
protected           : false
qos                 : []
rstp_statistics     : {}
rstp_status         : {}
statistics          : {}
status              : {}
tag                 : 11
trunks              : []
vlan_mode           : []
Bash:
root@proxmox-srv0-n1:~# ovs-vsctl set port vlan11_iscsi vlan_mode=access tag=11
root@proxmox-srv0-n1:~# ovs-vsctl list port vlan11_iscsi
_uuid               : c1ecde56-21de-4716-806f-80fb2d078d06
bond_active_slave   : []
bond_downdelay      : 0
bond_fake_iface     : false
bond_mode           : []
bond_updelay        : 0
cvlans              : []
external_ids        : {}
fake_bridge         : false
interfaces          : [7fb86c94-2b8e-4322-b1ce-5db76fc046d3]
lacp                : []
mac                 : []
name                : vlan11_iscsi
other_config        : {}
protected           : false
qos                 : []
rstp_statistics     : {}
rstp_status         : {}
statistics          : {}
status              : {}
tag                 : 11
trunks              : []
vlan_mode           : access

At this point, still Dest Host Unreachable.

3)

Bond on the PowerStore:
Code:
27: feData0@bond0.11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether ae:31:62:56:0f:a1 brd ff:ff:ff:ff:ff:ff
    inet 172.16.10.10/27 scope global noprefixroute feData0
       valid_lft forever preferred_lft forever
    inet 172.16.10.9/27 scope global secondary noprefixroute feData0
       valid_lft forever preferred_lft forever
    inet6 fe80::ac31:62ff:fe56:fa1/64 scope link
       valid_lft forever preferred_lft forever
and pingability to proxmox node (fw disabled):
Bash:
[SVC:service@6BSYCJ3-A user]$ ping -I feData0 172.16.10.3
PING 172.16.10.3 (172.16.10.3) from 172.16.10.10 feData0: 56(84) bytes of data.
64 bytes from 172.16.10.3: icmp_seq=1 ttl=64 time=0.359 ms
64 bytes from 172.16.10.3: icmp_seq=2 ttl=64 time=0.140 ms
64 bytes from 172.16.10.3: icmp_seq=3 ttl=64 time=0.144 ms
64 bytes from 172.16.10.3: icmp_seq=4 ttl=64 time=0.131 ms
64 bytes from 172.16.10.3: icmp_seq=5 ttl=64 time=0.155 ms
64 bytes from 172.16.10.3: icmp_seq=6 ttl=64 time=0.120 ms

4)

Bash:
root@proxmox-srv0-n1:~# ovs-ofctl dump-flows OVSvmbr0
 cookie=0x0, duration=1260.262s, table=0, n_packets=19426818, n_bytes=385414266577, priority=0 actions=NORMAL
Code:
root@proxmox-srv0-n1:~# arp -a | grep 172.16.10
rackstation. (172.16.10.21) at 00:11:32:56:81:4c [ether] on vlan11_iscsi
dellps500t_iscsia1. (172.16.10.9) at <incomplete> on vlan11_iscsi
tintri-iscsi-1. (172.16.10.22) at 00:e0:ed:9d:58:47 [ether] on vlan11_iscsi
proxmox-srv1-n1-OVSvmbr0. (172.16.10.4) at 42:e6:de:96:3b:8b [ether] on vlan11_iscsi
dellps500t_iscsib1. (172.16.10.10) at ae:31:62:56:0f:a1 [ether] on vlan11_iscsi
proxmox-srv2-n1-OVSvmbr0. (172.16.10.5) at ba:24:c9:a1:7b:00 [ether] on vlan11_iscsi
dellps500t_iscsib2. (172.16.10.29) at a6:dd:2c:f8:92:cc [ether] on vlan11_iscsi
? (172.16.10.30) at b4:0c:25:e0:40:10 [ether] on vlan11_iscsi
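The `<incomplete>` entry for 172.16.10.9 shows that ARP resolution itself is failing for that IP. A way to narrow this down (interface and address from this thread) is to flush the stale entry and watch the ARP exchange on the wire:

```shell
# Drop the stale neighbour entry so the kernel re-ARPs, then watch whether
# requests go out and whether any reply comes back for the failing IP:
ip neigh flush dev vlan11_iscsi
tcpdump -i vlan11_iscsi -e -n arp host 172.16.10.9
```

If requests leave but no reply returns, the loss is on the switch or array side; if replies do arrive yet the entry stays incomplete, the frames may be returning on an unexpected bond member.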

I will remove the 10Gb bond from this node and just assign an IP to a single interface directly, without LACP, and see if that works as intended. I'll update you at that point. Thank you for all the help!
 
I've built a Linux bond/bridge/VLANs instead and the issue no longer exists. This seems to be an OVS-specific problem. Here's what it looks like now vs. before:
[screenshots: new Linux bond/bridge configuration vs. previous OVS configuration]

Not sure if this needs more troubleshooting to help the Proxmox team, or if it will just help others that might run into the same problem, but let me know if I can help in any way. Thank you so much @shbaek for all the help!
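For anyone landing here with the same symptoms, a minimal sketch of the Linux-bond equivalent in /etc/network/interfaces. Member NICs, VLAN tag, and address are taken from this thread; the bridge name vmbr1 and the exact bond options are assumptions, so adjust to your environment:

```
# 10Gb LACP bond (replaces the OVS bond); layer2 hashing matches the
# Extreme switch "address-based L2" sharing algorithm used in this thread
auto bond1
iface bond1 inet manual
    bond-slaves eno49 eno50
    bond-mode 802.3ad
    bond-xmit-hash-policy layer2
    bond-miimon 100
    mtu 9000

# VLAN-aware Linux bridge on top of the bond
auto vmbr1
iface vmbr1 inet manual
    bridge-ports bond1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
    mtu 9000

# Host iSCSI address on VLAN 11 (replaces the OVS IntPort vlan11_iscsi)
auto vmbr1.11
iface vmbr1.11 inet static
    address 172.16.10.3/27
    mtu 9000
```

With this layout the bond also shows up under /proc/net/bonding/bond1, so LACP state can be checked with the standard kernel tooling.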