One of multiple VLANs on bond not working

pwizard

New Member
Jan 9, 2023
Scenario:

Brand new PVE installation for a new compute node - installed from the PVE 7.3 ISO, all updates applied, now pve-manager/7.3-3/c3928077 (running kernel: 5.15.74-1-pve)

Networking is supposed to be provided via a 2-member bond, 2x 10G interfaces on a single Mellanox MT27500 (ConnectX-3).

The bond is supposed (for now) to carry 3 tagged VLANs, plus the vmbr0 for guest traffic.

vlan75@bond0: 192.168.75.0/24 Mgmt traffic, SSH access, corosync cluster traffic
vlan76@bond0: 192.168.76.0/24 Storage traffic, RBD volumes on a 3 node pveceph cluster (the nodes are joined to the same PVE cluster)
vlan302@bond0: 172.19.0.0/16 NFS mounts on NAS, for ISO images

I've copied the configuration basically 1:1, save for the specific IP addresses, from another node in the same PVE cluster:

Code:
auto lo
iface lo inet loopback

iface ens1 inet manual

iface ens1d1 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves ens1 ens1d1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        bond-min-links 1
        bond-lacp-rate 1

auto vlan302
iface vlan302 inet static
        address 172.19.76.11/16
        vlan-raw-device bond0

auto vlan75
iface vlan75 inet static
        address 192.168.75.11/24
        gateway 192.168.75.254
        vlan-raw-device bond0

auto vlan76
iface vlan76 inet static
        address 192.168.76.11/24
        vlan-raw-device bond0

auto vmbr0
iface vmbr0 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0


I ran ifreload -a and everything went fine: I joined the cluster, moved guests there and back... until I decided to reboot the node and could no longer reach the mgmt IP address 192.168.75.11, because ifupdown2 didn't want to install the default route (the "gateway" statement).
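(For anyone hitting the same thing: something along these lines should show whether the default route actually made it in after boot, and whether ifupdown2 thinks the interface matches the config - vlan75 here is just my management interface, adjust to taste.)

Code:
# does the kernel have a default route at all?
ip route show default

# what ifupdown2 thinks the running state of vlan75 is
ifquery --running vlan75

# compare the running state against /etc/network/interfaces
ifquery --check vlan75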

Thankfully, Google landed me right at the solution, posted on the ifupdown2 GitHub by Proxmox's @aderumier .

Funny, the ifupdown2 developers are proud that their code resolves dependencies by itself (you can even have the dependencies displayed with a CLI parameter) - yet apparently this does not translate into practice, and you need *just the right order of interfaces* for this house of cards to work. Reassuring... apparently it's also unimportant enough that nobody posted in that issue after Oct 30, 2019.
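(If I recall correctly, the dependency display I mean is something like this - I'm quoting the flag from memory, so double-check the ifquery man page:)

Code:
# ask ifupdown2 to print the dependency list it computed for all interfaces
ifquery -a --print-dependency=list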

The config I used above worked just fine (and continues to work just fine) on our older Proxmox nodes, but those obviously do not use ifupdown2, as they were upgraded through several PVE releases over the years, so that explains it.

Anyway, now we end up with this configuration:

Code:
auto lo
iface lo inet loopback

iface ens1 inet manual

iface ens1d1 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves ens1 ens1d1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        bond-min-links 1
        bond-lacp-rate 1

auto vmbr0
iface vmbr0 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0

auto vlan302
iface vlan302 inet static
        address 172.19.76.11/16
        vlan-raw-device bond0

auto vlan75
iface vlan75 inet static
        address 192.168.75.11/24
        gateway 192.168.75.254
        vlan-raw-device bond0

auto vlan76
iface vlan76 inet static
        address 192.168.76.11/24
        vlan-raw-device bond0


... and my default route survives a reboot just fine now, so thank you Alexandre for finding the workaround.


I can reach all destinations (that I care about, at least; I did not check every possible IP) in both vlan75 and vlan76. In vlan302, however, ARP resolution works, I can send ICMP echo requests out, they are received just fine on the other side, and I can even see both the requests and the replies in my tcpdump -i vlan302 - yet my ping command tells me I've got 100% packet loss! The packets do not seem to be handled correctly by some component of the network stack.

Code:
root@prox11:~# ip neigh del 172.19.3.9 dev vlan302
root@prox11:~# ping 172.19.3.9
PING 172.19.3.9 (172.19.3.9) 56(84) bytes of data.
^C
--- 172.19.3.9 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1005ms

At the same time in my tcpdump session:

root@prox11:~# tcpdump -nvvv -s 0 -i vlan302 'host 172.19.3.9'
tcpdump: listening on vlan302, link-type EN10MB (Ethernet), snapshot length 262144 bytes
20:33:52.286022 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.19.3.9 tell 172.19.76.11, length 28
20:33:52.290390 ARP, Ethernet (len 6), IPv4 (len 4), Reply 172.19.3.9 is-at 00:11:32:33:38:89, length 46
20:33:52.290406 IP (tos 0x0, ttl 64, id 55622, offset 0, flags [DF], proto ICMP (1), length 84)
    172.19.76.11 > 172.19.3.9: ICMP echo request, id 34101, seq 1, length 64
20:33:52.290507 IP (tos 0x0, ttl 64, id 64551, offset 0, flags [none], proto ICMP (1), length 84)
    172.19.3.9 > 172.19.76.11: ICMP echo reply, id 34101, seq 1, length 64
20:33:53.290596 IP (tos 0x0, ttl 64, id 55820, offset 0, flags [DF], proto ICMP (1), length 84)
    172.19.76.11 > 172.19.3.9: ICMP echo request, id 34101, seq 2, length 64
20:33:53.290711 IP (tos 0x0, ttl 64, id 64790, offset 0, flags [none], proto ICMP (1), length 84)
    172.19.3.9 > 172.19.76.11: ICMP echo reply, id 34101, seq 2, length 64

It's not an ICMP quirk either; I couldn't care less about ICMP requests and their replies, but pvesm can no longer mount the NFS store (server 172.19.3.9).
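(Roughly how the NFS side can be checked directly, independent of ICMP - the export path and mount point below are just placeholders, not our real ones:)

Code:
# is the NFS server reachable at the RPC level over vlan302?
rpcinfo -p 172.19.3.9
showmount -e 172.19.3.9

# what PVE itself thinks of its storages (the NFS one should show up as inactive while the mount fails)
pvesm status

# manual mount attempt with a placeholder export path
mkdir -p /mnt/test
mount -t nfs 172.19.3.9:/export/iso /mnt/test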

Now here's the twist: if I keep the vlan302 part commented out in the config, reboot the server, and only after the reboot add the section back in and run ifreload -a, everything works just fine! Curiously similar to the first issue with the gateway, which would fail at boot but succeed if the same command was executed on a running system...

How can I go about fixing this? Is it ifupdown2 yet again, is it a Mellanox mlx4_* issue, ...? The very same Mellanox cards have been in use for years, up until now, in the same PVE cluster; we're just replacing the aging server hardware itself, and the Mellanox NICs are only 3 years old, so still good for the new servers as well. The older servers have no issue at all with their NFS mounting behavior...

Does it make any difference whether I configure the VLAN interface as vlan302[@bond0] or bond0.302?
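For reference, the bond0.302 variant would look something like this - with the dot notation the raw device is implied by the name, so the vlan-raw-device line shouldn't be needed (whether it changes the boot-order behavior is exactly what I'm asking):

Code:
auto bond0.302
iface bond0.302 inet static
        address 172.19.76.11/16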

EDIT: the behavior has been the same on 2 other nodes I've begun replacing; one of them, after yet another reboot, now DOES reach the NFS storage. So it looks like it doesn't fail reliably in 100% of cases, but maybe in at least 80% of them?
 
I had a somewhat different situation where traffic tagged with a few VLAN IDs coming in on the physical interface was not propagated onto the bridge. I could see this happening by running two tcpdump sessions, one on each interface: I could see DHCP requests coming in on the physical interface but not on the bridge. My router also sits in a VM connected to the same bridge as most of my VMs and containers, and those had no issue obtaining IP addresses in the respective VLANs.

Only after I rebooted the Proxmox server did things start working again, so I think there must have been something wrong with the connection between the bridge and the NIC driver.
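(The two capture sessions were roughly of this form - the physical interface name is just an example from my box, and -e makes tcpdump print the 802.1Q tag so you can tell which VLAN a frame belongs to:)

Code:
# one session on the physical NIC
tcpdump -eni enp3s0

# and one on the bridge - the tagged frames arriving on the NIC should reappear here, but didn't
tcpdump -eni vmbr0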
 
OK, I've found a new clue: it looks like the issue is at the bond level - it happens when the server sends the ICMP echo requests out on one member of the LACP bond (due to its hashing policy), but the switch returns the replies strictly on the other member link (as a result of *its* hashing policy).
Obviously, this should not matter at all and the bond should hand the packets up to the network stack regardless of which member they arrived on, but this does not seem to happen.
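(For anyone following along, this is how the hash policy and the per-slave LACP state can be inspected on the Linux side - the switch side obviously depends on the vendor:)

Code:
# bonding driver view: mode, xmit hash policy, 802.3ad state, per-slave counters
cat /proc/net/bonding/bond0

# the same information as seen by iproute2, plus the list of enslaved interfaces
ip -d link show bond0
ip link show master bond0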

Observe the following tcpdumps, on the bond interface and the 2 member links, respectively, towards / from 172.19.3.9 - where the ping output tells me 100% packet loss:

Code:
root@prox12:~# tcpdump -n -i bond0 'host 172.19.3.9'
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on bond0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
20:30:00.634265 IP 172.19.76.12 > 172.19.3.9: ICMP echo request, id 17769, seq 100, length 64
20:30:00.634375 IP 172.19.3.9 > 172.19.76.12: ICMP echo reply, id 17769, seq 100, length 64
20:30:01.658271 IP 172.19.76.12 > 172.19.3.9: ICMP echo request, id 17769, seq 101, length 64
20:30:01.658386 IP 172.19.3.9 > 172.19.76.12: ICMP echo reply, id 17769, seq 101, length 64
20:30:02.682255 IP 172.19.76.12 > 172.19.3.9: ICMP echo request, id 17769, seq 102, length 64
20:30:02.682385 IP 172.19.3.9 > 172.19.76.12: ICMP echo reply, id 17769, seq 102, length 64
20:30:03.706259 IP 172.19.76.12 > 172.19.3.9: ICMP echo request, id 17769, seq 103, length 64
20:30:03.706365 IP 172.19.3.9 > 172.19.76.12: ICMP echo reply, id 17769, seq 103, length 64
^C
8 packets captured
14 packets received by filter
0 packets dropped by kernel
root@prox12:~# tcpdump -n -i ens1f1np1 'host 172.19.3.9'
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on ens1f1np1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
20:30:13.946264 IP 172.19.76.12 > 172.19.3.9: ICMP echo request, id 17769, seq 113, length 64
20:30:14.970264 IP 172.19.76.12 > 172.19.3.9: ICMP echo request, id 17769, seq 114, length 64
20:30:15.994262 IP 172.19.76.12 > 172.19.3.9: ICMP echo request, id 17769, seq 115, length 64
^C
3 packets captured
3 packets received by filter
0 packets dropped by kernel
root@prox12:~# tcpdump -n -i ens1f0np0 'host 172.19.3.9'
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on ens1f0np0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
20:30:26.234383 IP 172.19.3.9 > 172.19.76.12: ICMP echo reply, id 17769, seq 125, length 64
20:30:27.258382 IP 172.19.3.9 > 172.19.76.12: ICMP echo reply, id 17769, seq 126, length 64
20:30:28.282348 IP 172.19.3.9 > 172.19.76.12: ICMP echo reply, id 17769, seq 127, length 64
20:30:29.306364 IP 172.19.3.9 > 172.19.76.12: ICMP echo reply, id 17769, seq 128, length 64
20:30:30.330354 IP 172.19.3.9 > 172.19.76.12: ICMP echo reply, id 17769, seq 129, length 64
^C
5 packets captured
7 packets received by filter
0 packets dropped by kernel

The exact same behavior can be seen if I ping another host, 172.19.76.13: again the packets go out on one member link and return strictly on the other.

And here's the same situation for a host 172.19.76.14 that happens to work just fine, 0% packet loss according to ping command:

Code:
root@prox12:~# tcpdump -n -i bond0 'host 172.19.76.14'
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on bond0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
20:39:32.698241 IP 172.19.76.12 > 172.19.76.14: ICMP echo request, id 45041, seq 517, length 64
20:39:32.698285 IP 172.19.76.14 > 172.19.76.12: ICMP echo reply, id 45041, seq 517, length 64
20:39:33.722228 IP 172.19.76.12 > 172.19.76.14: ICMP echo request, id 45041, seq 518, length 64
20:39:33.722279 IP 172.19.76.14 > 172.19.76.12: ICMP echo reply, id 45041, seq 518, length 64
20:39:34.746256 IP 172.19.76.12 > 172.19.76.14: ICMP echo request, id 45041, seq 519, length 64
20:39:34.746307 IP 172.19.76.14 > 172.19.76.12: ICMP echo reply, id 45041, seq 519, length 64
^C
6 packets captured
6 packets received by filter
0 packets dropped by kernel
root@prox12:~# tcpdump -n -i ens1f1np1 'host 172.19.76.14'
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on ens1f1np1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
20:39:37.882224 ARP, Request who-has 172.19.76.14 tell 172.19.76.12, length 28
20:39:37.892619 ARP, Reply 172.19.76.14 is-at 42:40:78:5c:b8:d3, length 46
^C
2 packets captured
2 packets received by filter
0 packets dropped by kernel
root@prox12:~# tcpdump -n -i ens1f0np0 'host 172.19.76.14'
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on ens1f0np0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
20:39:46.010269 IP 172.19.76.12 > 172.19.76.14: ICMP echo request, id 45041, seq 530, length 64
20:39:46.010316 IP 172.19.76.14 > 172.19.76.12: ICMP echo reply, id 45041, seq 530, length 64
20:39:47.034261 IP 172.19.76.12 > 172.19.76.14: ICMP echo request, id 45041, seq 531, length 64
20:39:47.034313 IP 172.19.76.14 > 172.19.76.12: ICMP echo reply, id 45041, seq 531, length 64
20:39:48.058260 IP 172.19.76.12 > 172.19.76.14: ICMP echo request, id 45041, seq 532, length 64
20:39:48.058304 IP 172.19.76.14 > 172.19.76.12: ICMP echo reply, id 45041, seq 532, length 64
^C
6 packets captured
6 packets received by filter
0 packets dropped by kernel

By pure luck I managed to capture at the exact moment the host was also doing an ARP refresh. Both the ARP request and reply went over the same member link and were registered correctly by the OS / application, and the same is true for the ICMP echo requests and replies, just on the other member link.

So what's going on with the bond in the current PVE release? Am I doing something wrong with my bond?

Why does it affect only one out of the 3 explicit VLAN subinterfaces, and not the bridge (thankfully! the guests are wholly oblivious to the issue)?

By the way, this new host 172.19.76.12 that I've been using for my latest tests has a 25G Mellanox ConnectX-4 NIC, as opposed to the other new affected nodes with their 10G ConnectX-3. So it's not the specific hardware either, but now that the issue appears to be related to the bond driver, that's not really surprising.
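(For completeness, this is the kind of check I'd use to compare driver and firmware between the ConnectX-3 and ConnectX-4 nodes - interface names differ per host, of course:)

Code:
# which Mellanox devices are present
lspci -nn | grep -i mellanox

# driver (mlx4_en for ConnectX-3, mlx5_core for ConnectX-4) and firmware of a bond member
ethtool -i ens1f0np0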
 
