Guest gets IP packets not addressed to it

Hello all,

I have a fairly standard setup with several Linux guests on Proxmox. While monitoring the guest interfaces with SNMP I recently noticed traffic I could not explain. Trying to track down the cause, I found that every guest connected to the same bridge and subnet receives IP packets on its interface that are not addressed to it. I have no idea why this happens. I would have expected each guest to see only packets for its own MAC, but that is not the case. The bridge acts more like a hub than a switch: all packets seem to be broadcast to every guest. Can I change this behaviour? It looks like a big waste of performance to me.
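For reference, this is how the stray frames can be seen from inside a guest (a sketch; eth0 and the MAC are placeholders for the guest's real interface and address):

# show unicast frames arriving at the guest that are NOT addressed
# to its own MAC (52:54:00:aa:bb:cc is a placeholder); a properly
# learning bridge should deliver none of these to the tap at all
tcpdump -e -n -i eth0 'not ether dst 52:54:00:aa:bb:cc and not ether broadcast and not ether multicast'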
--
Regards
 
Hello all,

while digging into this I found something that may be helpful to others, or for tracking down the problem.
It seems that some (all?) of the duplicated packets have a destination MAC that is statically preconfigured in the corresponding guest's network setup. In my case they start with "52:54" or "08". Looking at the bridges, these MACs are apparently never attached to any tap interface. Since this is a setup with several Proxmox hosts and switches in between, the MACs point to the trunk port on all hosts, or do not appear in "brctl showmacs" at all.
This seems to be why the bridges then decide to broadcast this traffic out of every port.
I am currently reconfiguring all questionable MACs to "auto", and the problem appears to go away. The new MACs are visible in every bridge and point to a tap device on the host where the guest actually lives.
I have not looked at the bridge code, but this smells like a bug to me.
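To check where a given guest MAC is (or is not) learned, per bridge (a sketch; vmbr0 and the MAC are placeholders):

# classic tool: list learned MACs with port number and age
brctl showmacs vmbr0

# iproute2 equivalent: dump the forwarding database and filter
# for one specific guest MAC
bridge fdb show br vmbr0 | grep -i 52:54:00:aa:bb:cc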
 
Hello all,

hm, I have to report that the "workaround" above does not work after all. After a few hours of runtime the situation is exactly the same with the new "auto" MACs. The setup consists of 3 Proxmox hosts with trunk ports connected over 2 switches. Host 1 has the guest, host 2 shows the MAC in its bridge table, host 3 does not and consequently broadcasts the traffic to all of its bridge interfaces.
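To compare the hosts in one go, something like this works (a sketch; the host names and the MAC are placeholders, and it assumes root ssh between the nodes):

# show on which hosts the guest MAC is currently learned;
# a host where grep prints nothing will flood frames for that MAC
for h in pve1 pve2 pve3; do
    echo "== $h =="
    ssh root@$h "bridge fdb show br vmbr0 | grep -i 52:54:00:aa:bb:cc"
done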
 
Hi,
the normal behaviour of a switch, when it does not know the destination MAC address, is to flood the frame to all ports (this is called BUM traffic: Broadcast, Unknown-unicast and Multicast).

If a guest does not yet have an ARP entry for a destination IP, it sends an ARP request; the bridge floods it everywhere, and when the destination sends an ARP reply and that packet comes back through, the bridge learns the MAC address.

The same applies when the destination sends an ARP request itself: the bridge learns its MAC from that too.

So it rather looks like your host3 simply never receives any packet from the missing MAC (no unicast ARP reply towards a VM on host3, and no broadcast ARP query from that MAC).

Another possible reason: the ARP timeout on the client could be bigger than the MAC address timeout in the bridge. (The bridge learns the MAC but drops it sooner than the clients do, and client/server never send a new ARP request.)
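You can watch a bridge learn and expire entries live with something like (a sketch; vmbr0 is a placeholder):

# print fdb add/delete events as the bridge learns and ages out MACs
bridge monitor fdb

# or poll the table and watch the per-MAC ageing counter
watch -n 5 'brctl showmacs vmbr0'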

Maybe you can draw a small schema of your host setup + VM config?
 
Hello,
well, you are right with the ARP explanation. In the meantime I found a workaround that does work: a cronjob on a guest on host 2 that sends "arping -fc1" to every guest whose traffic falsely reaches the guest on host 3. This re-adds the corresponding MACs to the bridge tables on host 2 and host 1, which is sufficient to stop packets reaching host 3 falsely. But it seems I also had to set the ageing on all bridges to 300 with "brctl setageing", as the entries time out faster (probably around 180). 300 should be the default according to the docs, but apparently it is not. The switches in between the hosts have an ageing time of 300, too. Can you tell me what default value is set for ageing on bridges configured with ifupdown2? Is there a parameter for the interfaces definition where this can be set?
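For reference, the workaround boils down to something like this (a sketch; the IPs, interval, and bridge name are placeholders):

# crontab entry on the host-2 guest: refresh the bridge tables every
# 2 minutes; -f stops after the first reply, -c 1 sends a single probe
*/2 * * * * for ip in 10.0.0.11 10.0.0.12 10.0.0.13; do arping -f -c 1 $ip; done

# and, by hand on each host, raise the ageing back to 300 seconds
brctl setageing vmbr0 300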
 
Do you use ifupdown2?
Normally the default value is 300 for both ifupdown1 and ifupdown2.
It can be tuned in /etc/network/interfaces, in the bridge definition, with "bridge-ageing 300".
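In /etc/network/interfaces that would look roughly like this (a sketch; the address and port names are placeholders):

auto vmbr0
iface vmbr0 inet static
        address 192.0.2.10/24
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        bridge-ageing 300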


What current MAC age do you see with

#brctl showmacs vmbrX

You should see the time counter for each MAC increase up to 300, then reset to 0, then climb up to 300 again.
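The configured ageing time can also be read from sysfs; note the unit is centiseconds, so 300 s shows up as 30000 (a sketch; vmbr0 is a placeholder):

# 30000 here means 300 seconds (the value is in 1/100 s units)
cat /sys/class/net/vmbr0/bridge/ageing_time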
 
Well, that's hard to tell now. I already had to set all the affected bridges to 300 by hand.
Nevertheless I checked the other bridges in the boxes (there are 5), and they all show the same "30000". So I guess the major issue is the different ARP timeout of the guest producing the stray TCP streams compared to the underlying bridges. The ARP timeout on the guest may well exceed 300 s in a default setup. If it then continues transmitting after the bridges have dropped the MACs from their tables, the problem arises.
It seems the arping I do now is the best solution to keep the bridge tables up to date. If you can think of something simpler or more practical, tell me.
Thanks for naming the ageing parameter.
 
According to my information (from the net) this is not the case: Linux has an ageing timeout of 300 s, but the garbage collection that really removes the entries can be delayed by another 30 s or even more. So all the docs say you may end up between 5 and 10 minutes at worst.
And that is about my experience on this topic.
There is no Windows involved.
 
check
cat /proc/sys/net/ipv4/neigh/default/base_reachable_time_ms


some details:

https://serverfault.com/questions/684380/default-arp-cache-timeout

"
Meanwhile, the value

base_reachable_time_ms

actually controls how long an ARP cache entry is valid, and it defaults to 30000 milliseconds. But each new ARP cache entry will actually receive a time to live value randomly set somewhere between base_reachable_time_ms / 2 and 3*base_reachable_time_ms / 2*.

This means each new cached ARP entry will have a starting timeout between 15 and 45 seconds, unless the value of base_reachable_time_ms is changed.
."

or

https://support.cumulusnetworks.com/hc/en-us/articles/202012933-Changing-ARP-timers-in-Cumulus-Linux
"A successful ARP response places a neighbor in a reachable state and allows the kernel to directly forward packets to it. Neighbors are kept in a reachable state based upon the kernel receiving traffic from them*. If no traffic is received, a neighbor will transition out of the reachable and into a stale state after a random number of interval between [base_reachable_time_ms/2] and [3*base_reachable_time_ms/2]."
 
And this is what I know:

Now the problem is that the neighbor entry will not be deleted if it's being referenced. The main thing that you're going to have problems with is the reference from the ipv4 routing table. There's a lot of complicated garbage collection stuff, but the important thing to note is that the garbage collector for the route cache only expires entries every 5 minutes (/proc/sys/net/ipv4/route/gc_timeout seconds) on a lot of kernels. This means the neighbor entry will have to be marked as stale (maybe 30 seconds, depending on base_reachable_time), then 5 minutes will have to go by before the route cache stops referencing the entry (if you're lucky), followed by some combination of gc_stale_time and gc_interval passing before it actually gets cleaned up (so, overall, somewhere between 5-10 minutes will pass).

(from Stack Overflow)
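For reference, these are the knobs that quote refers to; just a sketch for inspecting them:

# neighbour/route garbage collection parameters mentioned above
sysctl net.ipv4.neigh.default.base_reachable_time_ms
sysctl net.ipv4.neigh.default.gc_stale_time
sysctl net.ipv4.neigh.default.gc_interval
sysctl net.ipv4.route.gc_timeout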
 
If you do an "ip neigh show", check the state of the MAC address:

1) the table is empty
2) ping an unknown IP: an ARP request is sent and an ARP response is received;
the MAC address is added with state "REACHABLE";
stop the ping

3) while the state is REACHABLE, no more ARP requests are sent

4) after 30-45 s the state changes to "STALE" (the entry is not removed yet);
ping again and a new ARP request is sent.

So you don't need to wait for the entry to be removed from the ARP cache.
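To reproduce the sequence above (a sketch; 192.0.2.50 is a placeholder IP):

# 1) ping once so an ARP exchange happens
ping -c 1 192.0.2.50

# 2) watch the entry go REACHABLE -> STALE after 15-45 s of silence
watch -n 2 'ip neigh show 192.0.2.50'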
 
The thing is: we are not talking about pings as payload but about heavy TCP traffic. So it is unlikely that you really get an ARP request in between the ongoing TCP stream. If you think there is another explanation for the observed problem than the guests having longer ARP timeouts than the bridges, then what is it?
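This is easy to check from the guest while the stream runs (a sketch; eth0 is a placeholder). If the kernel keeps confirming the neighbour through the TCP traffic itself, nothing shows up here at all:

# watch for ARP refreshes during the heavy tcp stream;
# silence supports the "no re-ARP while tcp is flowing" theory
tcpdump -n -i eth0 arp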
 
This can also happen with asymmetric traffic. I have already seen it with VMs with multiple interfaces, for example when traffic comes in on one interface and goes out through another.
Another possibility: do you use bonding? If yes, which mode? Do you have multiple physical switches?
 
I doubt there is asymmetry involved. The problem only touches one subnet, and all the involved VMs are connected there. There is no bonding. Each of the three Proxmox hosts has a switch in front of it, and the three switches are connected with 10G links.
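So, roughly (a sketch; the exact inter-switch wiring is my assumption):

 host1             host2             host3
   |                 |                 |
switch1 ---10G--- switch2 ---10G--- switch3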
 
As I have a similar problem with my 2 PVE servers, I wonder if you have found any further information or a solution on this?
 
