[SOLVED] Wrong MAC addresses in ARP table

jamarsa · Jun 20, 2021

Hi, I'm having some problems with a cluster of three nodes, with several interfaces connected to the same switch in each node, each one with different services in mind.

In each node, there is:

1 pair of 10GbE NICs with IP range 172.16.0.x/24, named vmbr0v4 (i.e., with VLAN 4), connected to 10GbE ports on the switch, for internet communication and Ceph frontend.
1 pair of 10GbE NICs with IP range 172.16.13.x/24, named bond13, also connected to 10GbE ports on the switch, for Ceph backend.
1 NIC with 1GbE bandwidth, with IP range 172.16.11.x/24, for Proxmox Ring0 cluster communication.
1 NIC with 1GbE bandwidth, with IP range 172.16.12.x/24, for Proxmox Ring1 cluster communication.

All seems to work well, except for one thing: when testing speeds on the 172.16.13.x range, one or more of the nodes receive at 1 Gbps instead of the expected 10 Gbps. Tracing things, it seems that the receiving node uses one of the Ring0/Ring1 interfaces for receiving data, instead of the bond13 one.

To my understanding this points to problems in the ARP table. Indeed, if I check the ARP table of the sending node, I see that the MAC address recorded for the receiving node's 172.16.13.<n> address is wrong, pointing instead to one of its Ring<n> NICs. So, I clear the ARP table with the command:

ip -s -s neigh flush all

But a wrong value shows up in the ARP table again.

I was given to believe that this behavior happens only when you have the two conflicting interfaces in the same subnet (a.k.a. 'the ARP problem'), but this is not the case...
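
To make this check repeatable, a small shell helper along these lines could compare a neighbour entry against the MAC you expect. This is only a sketch: check_neigh is a made-up name, and the MAC and IP in the usage comment are example values (the bond13 MAC of node02 as it is reported later in this thread). It only reads text on stdin, so it changes nothing on the system.

```shell
#!/bin/sh
# check_neigh EXPECTED_MAC IP   (hypothetical helper, not part of iproute2)
# Reads the output of "ip neigh show dev <iface>" on stdin and reports
# whether IP currently resolves to EXPECTED_MAC.
# Exit status: 0 = match, 1 = different MAC, 2 = no entry with a MAC.
check_neigh() {
    expected_mac=$1
    ip_addr=$2
    awk -v ip="$ip_addr" -v mac="$expected_mac" '
        $1 == ip {
            # the MAC is the field that follows the "lladdr" keyword
            for (i = 2; i < NF; i++)
                if ($i == "lladdr") found = $(i + 1)
        }
        END {
            if (found == "") { print "MISSING " ip; exit 2 }
            if (tolower(found) != tolower(mac)) {
                print "MISMATCH " ip " " found
                exit 1
            }
            print "OK " ip " " found
        }'
}

# Usage (example values):
#   ip neigh show dev bond13 | check_neigh 00:22:fa:32:15:e5 172.16.13.12
```

Running it right after a flush and a ping would show whether the wrong MAC has come back, without having to read the table by eye.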

spirit · Jun 21, 2021

Can you send the result of "cat /proc/net/bonding/bond13"?

jamarsa · Jun 21, 2021

Here it comes... Since the previous post, I have managed to patch things up by manually deleting the offending ARP entries (ip neigh del 172.16.13.x dev bond13) and letting the table populate again with a ping... But I'm still in the dark as to why there was a wrong entry in the first place.

Code:

node02 # cat /proc/net/bonding/bond13
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200
Peer Notification Delay (ms): 0

802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 00:22:fa:32:15:e5
Active Aggregator Info:
        Aggregator ID: 1
        Number of ports: 2
        Actor Key: 15
        Partner Key: 3137
        Partner Mac Address: f0:33:e5:04:33:24

Slave Interface: enp193s0f1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 2
Permanent HW addr: 00:22:fa:32:15:e5
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 00:22:fa:32:15:e5
    port key: 15
    port priority: 255
    port number: 1
    port state: 63
details partner lacp pdu:
    system priority: 32768
    system mac address: f0:33:e5:04:33:24
    oper key: 3137
    port priority: 32768
    port number: 4
    port state: 61

Slave Interface: enp129s0f1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 3
Permanent HW addr: 00:22:fa:32:15:e3
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 00:22:fa:32:15:e5
    port key: 15
    port priority: 255
    port number: 2
    port state: 63
details partner lacp pdu:
    system priority: 32768
    system mac address: f0:33:e5:04:33:24
    oper key: 3137
    port priority: 32768
    port number: 12
    port state: 61

Code:

node03# cat /proc/net/bonding/bond13
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200
Peer Notification Delay (ms): 0

802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 00:22:fa:28:09:7b
Active Aggregator Info:
        Aggregator ID: 15
        Number of ports: 2
        Actor Key: 15
        Partner Key: 3393
        Partner Mac Address: f0:33:e5:04:33:24

Slave Interface: enp193s0f1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 2
Permanent HW addr: 00:22:fa:28:09:7b
Slave queue ID: 0
Aggregator ID: 15
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 1
Partner Churned Count: 1
details actor lacp pdu:
    system priority: 65535
    system mac address: 00:22:fa:28:09:7b
    port key: 15
    port priority: 255
    port number: 1
    port state: 63
details partner lacp pdu:
    system priority: 32768
    system mac address: f0:33:e5:04:33:24
    oper key: 3393
    port priority: 32768
    port number: 6
    port state: 61

Slave Interface: enp129s0f1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 2
Permanent HW addr: 00:22:fa:28:09:79
Slave queue ID: 0
Aggregator ID: 15
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 00:22:fa:28:09:7b
    port key: 15
    port priority: 255
    port number: 2
    port state: 63
details partner lacp pdu:
    system priority: 32768
    system mac address: f0:33:e5:04:33:24
    oper key: 3393
    port priority: 32768
    port number: 14
    port state: 61

spirit · Jun 21, 2021

That's really really strange ...

can you send output of "ip route" ?

jamarsa · Jun 21, 2021

Strange indeed!

ip route on node02:

Code:

default via 172.16.0.4 dev vmbr0v4 onlink
172.16.0.0/24 dev vmbr0v4 proto kernel scope link src 172.16.0.12
172.16.11.0/24 dev enp65s0f0 proto kernel scope link src 172.16.11.12
172.16.12.0/24 dev vmbr12 proto kernel scope link src 172.16.12.12
172.16.13.0/24 dev bond13 proto kernel scope link src 172.16.13.12

ip route on node03:

Code:

default via 172.16.0.4 dev vmbr0v4 onlink
172.16.0.0/24 dev vmbr0v4 proto kernel scope link src 172.16.0.13
172.16.11.0/24 dev enp65s0f0 proto kernel scope link src 172.16.11.13
172.16.12.0/24 dev vmbr12 proto kernel scope link src 172.16.12.13
172.16.13.0/24 dev bond13 proto kernel scope link src 172.16.13.13
192.168.1.0/24 dev vmbr0 proto kernel scope link src 192.168.1.214

As you can see, in node03 there are two IPs on vmbr0, one with VLAN 4 (vmbr0v4), 172.16.0.13, and one on vmbr0 itself, 192.168.1.214, in order to communicate with a different network on another switch. Ring1 is defined with a bridge to allow some VMs to use that route instead of the main (172.16.0.x) one. But I think these bandwidth problems existed before I created either the bridge or the secondary address on vmbr0...

spirit · Jun 21, 2021

I don't see anything strange in ip route.

can you send your /etc/network/interfaces just to be sure ?

jamarsa · Jun 21, 2021

I don't think the problem lies there, but here is the interfaces file. The other nodes are similar, only changing the last number in the IP addresses.

The routing problem is at layer2, and appeared/disappeared randomly by simply shutting down the main bond13 interface, and bringing it up again. Tested also with a single interface (one after the other, in fact) by deleting bond13 and giving the address to a single interface, and the problem was the same.

Code:

auto enp65s0f0
iface enp65s0f0 inet static
    address 172.16.11.13/24

auto enp65s0f2
iface enp65s0f2 inet manual

iface enp65s0f1 inet manual

#auto enp65s0f3
iface enp65s0f3 inet static
    address 172.16.14.13/24

auto enp193s0f0
iface enp193s0f0 inet manual

auto enp193s0f1
iface enp193s0f1 inet manual
    mtu 1500

auto enp129s0f0
iface enp129s0f0 inet manual

auto enp129s0f1
iface enp129s0f1 inet manual
    mtu 1500

auto bond0
iface bond0 inet manual
    bond-slaves enp193s0f0 enp129s0f0
    bond-miimon 100
    bond-mode 802.3ad
    pre-up sleep 2
    bond-downdelay 200
    bond-updelay 200
    bond-lacp-rate 1


iface bond0.4 inet manual

auto bond13
iface bond13 inet static
    address 172.16.13.13/24
    bond-slaves enp193s0f1 enp129s0f1
    bond-miimon 100
    bond-mode 802.3ad
    bond-downdelay 200
    bond-updelay 200
    bond-lacp-rate 1
    mtu 1500


auto vmbr0
iface vmbr0 inet static
    bridge-ports bond0
    address 192.168.1.214/24
    bridge-stp off
    bridge-fd 0
    post-up echo 1 > /proc/sys/net/ipv4/ip_forward
    bridge-vlan-aware yes
    bridge-vids 1-1000

auto vmbr0v4
iface vmbr0v4 inet static
    address 172.16.0.13/24
    gateway 172.16.0.4
    bridge-ports bond0.4
    bridge-stp off
    pre-up sleep 2


#Internal Networks

auto vmbr999
iface vmbr999 inet manual
        bridge-ports none
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

auto vmbr12
iface vmbr12 inet static
    address 172.16.12.13/24
    bridge-ports enp65s0f2
    bridge-stp off

jamarsa · Jun 21, 2021

Let me show a more specific test I have just done.

Bash:

node01# ip neigh show dev bond13

172.16.13.12 lladdr 00:22:fa:32:15:e5 REACHABLE  //correct MAC for bond13/enp129s0f1/enp193s0f1 in node02
192.168.1.27 lladdr 00:50:56:c2:4b:32 STALE
172.16.13.13 lladdr 00:22:fa:28:09:7b REACHABLE

node01# ifdown bond13
node01# ifup bond13
node01# ping 172.16.13.13
PING 172.16.13.13 (172.16.13.13) 56(84) bytes of data.
64 bytes from 172.16.13.13: icmp_seq=1 ttl=64 time=0.321 ms
^C
--- 172.16.13.13 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
node01# ping 172.16.13.12
PING 172.16.13.12 (172.16.13.12) 56(84) bytes of data.
64 bytes from 172.16.13.12: icmp_seq=1 ttl=64 time=0.344 ms
^C
--- 172.16.13.12 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
node01# ip neigh show dev bond13
172.16.13.12 lladdr 00:22:fa:3c:24:b4 REACHABLE // Incorrect, this is enp65s0f0 in node02, with IP 172.16.11.12
172.16.13.13 lladdr 00:22:fa:28:09:7b REACHABLE

node01# ip neigh del 172.16.13.12 dev bond13
node01# ping 172.16.13.12
PING 172.16.13.12 (172.16.13.12) 56(84) bytes of data.
64 bytes from 172.16.13.12: icmp_seq=1 ttl=64 time=0.332 ms
^C
--- 172.16.13.12 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
node01# ip neigh show dev bond13
172.16.13.12 lladdr 00:22:fa:32:15:e5 REACHABLE  // correct again
172.16.13.13 lladdr 00:22:fa:28:09:7b REACHABLE

I must add that there is also a firewall in the network, but the 172.16.13.0/24 range in particular is not defined in it (because it would *not* go outside the local network), so it shouldn't interfere (unless it is routing between the two networks, who knows, it's happened before).

Curiously also, there are STALE entries in several other interfaces, but with the correct MAC address. I don't know if this is normal, but I think they should not be there, as these interfaces don't communicate over that IP range. Unless this is the result of previous misrouted packets...

172.16.13.12 dev vmbr12 lladdr 00:22:fa:32:15:e5 STALE
172.16.13.12 dev enp65s0f0 lladdr 00:22:fa:32:15:e5 STALE

jamarsa · Jun 21, 2021

Oh, and there are STALE entries also in several other interfaces for IPs in other ranges (for example, 172.16.11.12 in the 172.16.0.11 interface), which suggests that the same problem is happening in the other interfaces, but it isn't detected because the bandwidth is good enough.
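
For anyone wanting to look for the same symptom, a rough way to list every IP that has neighbour entries on more than one device is sketched below. multi_dev_neigh is a name I made up; it only parses the text printed by "ip neigh show" (without a dev filter) and assumes the usual "IP dev IFACE lladdr MAC STATE" layout.

```shell
#!/bin/sh
# multi_dev_neigh   (hypothetical helper)
# Reads the output of "ip neigh show" on stdin and prints each IP that
# has a neighbour entry on more than one device, followed by the devices
# in the order they were first seen.
multi_dev_neigh() {
    awk '
        {
            dev = ""
            for (i = 2; i < NF; i++)
                if ($i == "dev") dev = $(i + 1)
            # skip lines without a device and repeated IP/device pairs
            if (dev == "" || (($1, dev) in seen)) next
            seen[$1, dev] = 1
            if (!($1 in devs)) order[++n] = $1
            devs[$1] = devs[$1] " " dev
            count[$1]++
        }
        END {
            for (k = 1; k <= n; k++)
                if (count[order[k]] > 1)
                    print order[k] ":" devs[order[k]]
        }'
}

# Usage:
#   ip neigh show | multi_dev_neigh
```

An IP showing up on several devices is not proof of a problem on its own, but on a host where each range is supposed to live on exactly one interface it is a quick hint that replies or requests are crossing interfaces.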

spirit · Jun 22, 2021

Hi, your /etc/network/interfaces seems to be OK.
I really don't know what's going on, but it doesn't seem to be a Proxmox/Linux problem.
Are you sure that you don't have a network loop somewhere in your network? Or something like a proxy ARP?

You should try looking at the ARP traffic with tcpdump ("tcpdump -i <iface> -e arp") and see where the ARP requests/responses in your network are coming from.

jamarsa · Jun 27, 2021

Sorry for the delay, I was busy with other assignments... Here are some tcpdumps.

I have been doing several tests, pursuing some assumptions in what could be wrong.

At first it occurred to me that there was indeed an ARP flux problem: the three interfaces giving their MAC addresses were being treated as part of the same subnet, given that the IP range I'm using is historically used with 16-bit (/16) netmasks. So I tested changing one of the interfaces to a completely different IP range, let's say 192.168.12.12. But no: even with a completely different range, this interface keeps answering the ARP request:

Code:

12:55:25.711786 00:22:fa:32:15:eb > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.13.12 tell 172.16.13.11, length 28
12:55:25.711865 00:22:fa:3c:24:b6 > 00:22:fa:32:15:eb, ethertype ARP (0x0806), length 60: Reply 172.16.13.12 is-at 00:22:fa:3c:24:b6, length 46
12:55:25.711894 00:22:fa:32:15:e5 > 00:22:fa:32:15:eb, ethertype ARP (0x0806), length 60: Reply 172.16.13.12 is-at 00:22:fa:32:15:e5, length 46
12:55:25.711894 00:22:fa:3c:24:b4 > 00:22:fa:32:15:eb, ethertype ARP (0x0806), length 60: Reply 172.16.13.12 is-at 00:22:fa:3c:24:b4, length 46
12:55:27.794176 00:22:fa:32:15:eb > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.13.12 tell 172.16.13.11, length 28
12:55:27.794415 00:22:fa:3c:24:b4 > 00:22:fa:32:15:eb, ethertype ARP (0x0806), length 60: Reply 172.16.13.12 is-at 00:22:fa:3c:24:b4, length 46
12:55:27.794415 00:22:fa:32:15:e5 > 00:22:fa:32:15:eb, ethertype ARP (0x0806), length 60: Reply 172.16.13.12 is-at 00:22:fa:32:15:e5, length 46
12:55:27.794431 00:22:fa:3c:24:b6 > 00:22:fa:32:15:eb, ethertype ARP (0x0806), length 60: Reply 172.16.13.12 is-at 00:22:fa:3c:24:b6, length 46
12:55:28.751449 00:22:fa:32:15:eb > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.13.12 tell 172.16.13.11, length 28
12:55:28.751686 00:22:fa:3c:24:b4 > 00:22:fa:32:15:eb, ethertype ARP (0x0806), length 60: Reply 172.16.13.12 is-at 00:22:fa:3c:24:b4, length 46
12:55:28.751687 00:22:fa:32:15:e5 > 00:22:fa:32:15:eb, ethertype ARP (0x0806), length 60: Reply 172.16.13.12 is-at 00:22:fa:32:15:e5, length 46
12:55:28.751701 00:22:fa:3c:24:b6 > 00:22:fa:32:15:eb, ethertype ARP (0x0806), length 60: Reply 172.16.13.12 is-at 00:22:fa:3c:24:b6, length 46
12:55:32.467046 00:22:fa:32:15:eb > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.13.12 tell 172.16.13.11, length 28
12:55:32.467134 00:22:fa:3c:24:b6 > 00:22:fa:32:15:eb, ethertype ARP (0x0806), length 60: Reply 172.16.13.12 is-at 00:22:fa:3c:24:b6, length 46
12:55:32.467166 00:22:fa:3c:24:b4 > 00:22:fa:32:15:eb, ethertype ARP (0x0806), length 60: Reply 172.16.13.12 is-at 00:22:fa:3c:24:b4, length 46
12:55:32.467166 00:22:fa:32:15:e5 > 00:22:fa:32:15:eb, ethertype ARP (0x0806), length 60: Reply 172.16.13.12 is-at 00:22:fa:32:15:e5, length 46

The thing is, every interface of the machine holding that IP that is connected to the same stack of switches answers the request, not just the interface the address is configured on. But only that machine answers; no other host (Proxy ARP or otherwise) does.
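
To see at a glance how many different MACs claim the same IP in a capture like the one above, the tcpdump text can be summarised with a bit of awk. This is a sketch only: arp_repliers is a made-up name, and it assumes the "Reply <ip> is-at <mac>," wording that tcpdump prints in the lines above. The -n and -l flags in the usage comment are my additions (no name resolution, line-buffered output).

```shell
#!/bin/sh
# arp_repliers   (hypothetical helper)
# Reads "tcpdump -e arp" text on stdin and prints, for each IP seen in an
# ARP reply, the number of distinct MACs that claimed it and the MACs
# themselves, in order of first appearance.
arp_repliers() {
    awk '
        {
            for (i = 1; i <= NF - 3; i++) {
                if ($i == "Reply" && $(i + 2) == "is-at") {
                    ip = $(i + 1)
                    mac = $(i + 3)
                    sub(/,$/, "", mac)      # drop the trailing comma
                    if (!((ip, mac) in seen)) {
                        seen[ip, mac] = 1
                        if (!(ip in macs)) order[++n] = ip
                        macs[ip] = macs[ip] " " mac
                        count[ip]++
                    }
                }
            }
        }
        END {
            for (k = 1; k <= n; k++)
                print order[k] ": " count[order[k]] " MAC(s):" macs[order[k]]
        }'
}

# Usage (stop the capture with Ctrl-C or use -c <count> to get the summary):
#   tcpdump -l -n -e -i bond13 arp | arp_repliers
```

On a healthy segment each IP should come back with a single MAC; more than one means several interfaces (or hosts) are answering for it.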

After that, I decided to test ARP requests on the other interfaces of the source machine. And indeed, this also happens on other interfaces, without bond or VLAN.

Code:

14:41:23.088427 00:22:fa:3c:24:a8 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.16.11.12 tell 172.16.11.11, length 28
14:41:23.088513 00:22:fa:3c:24:b6 > 00:22:fa:3c:24:a8, ethertype ARP (0x0806), length 60: Reply 172.16.11.12 is-at 00:22:fa:3c:24:b6, length 46
14:41:23.088514 00:22:fa:3c:24:b4 > 00:22:fa:3c:24:a8, ethertype ARP (0x0806), length 60: Reply 172.16.11.12 is-at 00:22:fa:3c:24:b4, length 46
14:41:23.088555 00:22:fa:32:15:e5 > 00:22:fa:3c:24:a8, ethertype ARP (0x0806), length 60: Reply 172.16.11.12 is-at 00:22:fa:32:15:e5, length 46

Of course if I ask for the deleted IP (172.16.12.12), nobody answers...

jamarsa · Jun 29, 2021

Well, after further investigations, it seems that this behaviour is *intended* and not accidental. It is configured that way in order to allow for fault tolerance when having several NICs in the same physical structure. But in my case it's undesirable because of degraded performance when the interface that first answers to the ARP request is of reduced bandwidth.

https://serverfault.com/questions/834512/why-does-linux-answer-to-arp-on-incorrect-interfaces
https://netbeez.net/blog/avoiding-arp-flux-in-multi-interface-linux-hosts/

It can be disabled with the following parameters, either set at runtime with sysctl or added to /etc/sysctl.conf:

net.ipv4.conf.<interface>.arp_ignore=1 (or perhaps =2) where <interface> can be the name of the interface, or 'all'

net.ipv4.conf.<interface>.arp_announce=2

Both default to 0. Setting arp_ignore to 1 on the affected interfaces was enough for me.
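
As a sketch of how the settings could be made persistent, the helper below just prints the sysctl lines for whatever interface names it is given. arp_sysctl_conf and the file name 90-arp-flux.conf are arbitrary choices of mine, not anything Proxmox ships. As far as I understand the kernel documentation, the effective value is the maximum of the "all" and per-interface settings, so setting "all" covers every interface. Interface names that contain a dot (such as bond0.4) may need the slash form of the key (net/ipv4/conf/bond0.4/arp_ignore), which this sketch does not handle.

```shell
#!/bin/sh
# arp_sysctl_conf IFACE...   (hypothetical helper)
# Prints sysctl.conf-style lines setting arp_ignore=1 and arp_announce=2
# for each interface name given ("all" is accepted like any other name).
arp_sysctl_conf() {
    for iface in "$@"; do
        printf 'net.ipv4.conf.%s.arp_ignore = 1\n' "$iface"
        printf 'net.ipv4.conf.%s.arp_announce = 2\n' "$iface"
    done
}

# Usage (as root; the file name is an arbitrary example):
#   arp_sysctl_conf all > /etc/sysctl.d/90-arp-flux.conf
#   sysctl --system        # reload all sysctl configuration files
```

After applying it, the neighbour tables on the other nodes still hold the old entries until they expire or are deleted, so a flush or "ip neigh del" there is needed to see the effect immediately.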
