Irrelevant packets are present for every VM on Proxmox host

vmswtje

Member
Jun 29, 2020
My network has some issues. When the network traffic increases, the network connections tend to become very slow even though it's a 10 Gb network. I'm not sure whether it's Proxmox-related or not.

Example:
VMx = virtual machine x
VHx = proxmox virtual host x

VM1 = 192.168.0.51 (E2:A9:CC:75:79:AF) - on VH1
VM2 = 192.168.0.8 (8A:1D:1B:44:64:A7) - on VH2

VM3 & VM4 = different IPs and MAC addresses (same subnet/VLAN) - on VH3

- There's HTTP traffic between VM1 (VH1) and VM2 (VH2) - this is normal and always the case
- Some of this traffic is (also?) being delivered to e.g. VM3 and VM4 (VH3) - this seems very strange to me (it doesn't always happen, and more machines are involved)

I've set up the Proxmox firewall on VM3 and VM4 and also enabled the MAC/IP firewall filters. Based on the firewall log, I can see this happening:

Code:
120 4 tap120i0-IN 29/Jun/2020:08:56:48 +0200 policy REJECT: IN=fwbr3i0 OUT=fwbr3i0 PHYSIN=fwln3i0 PHYSOUT=tap3i0 MAC=8a:1d:1b:44:64:a7:e2:a9:cc:75:79:af:08:00 SRC=192.168.0.51 DST=192.168.0.8 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=14385 DF PROTO=TCP SPT=36186 DPT=80 SEQ=980641566 ACK=0 WINDOW=29200 SYN
130 4 tap130i0-IN 29/Jun/2020:08:56:48 +0200 policy REJECT: IN=fwbr4i0 OUT=fwbr4i0 PHYSIN=fwln4i0 PHYSOUT=tap4i0 MAC=8a:1d:1b:44:64:a7:e2:a9:cc:75:79:af:08:00 SRC=192.168.0.51 DST=192.168.0.8 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=14385 DF PROTO=TCP SPT=36186 DPT=80 SEQ=980641566 ACK=0 WINDOW=29200 SYN
(I changed the VM numbers of the network adapters in the logs to be consistent with my example.)
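
For anyone reading these log lines: in a netfilter log entry for Ethernet, the MAC= field is the destination MAC (6 octets), then the source MAC (6 octets), then the EtherType (2 octets). A minimal sketch that splits the field from the lines above; the `KNOWN` lookup table and the function name are only illustrative, built from the example in this post:

```python
def split_log_mac(mac_field: str) -> dict:
    """Split the MAC= field of a netfilter log line into dst, src and EtherType."""
    octets = mac_field.split(":")
    if len(octets) != 14:
        raise ValueError(f"expected 14 octets, got {len(octets)}")
    return {
        "dst": ":".join(octets[0:6]),         # destination MAC comes first
        "src": ":".join(octets[6:12]),        # then the source MAC
        "ethertype": "".join(octets[12:14]),  # 0800 = IPv4
    }


# Hypothetical inventory, taken from the example in this post.
KNOWN = {"e2:a9:cc:75:79:af": "VM1", "8a:1d:1b:44:64:a7": "VM2"}

parsed = split_log_mac("8a:1d:1b:44:64:a7:e2:a9:cc:75:79:af:08:00")
print(KNOWN[parsed["src"]], "->", KNOWN[parsed["dst"]], "ethertype", parsed["ethertype"])
```

So both rejected packets are VM1 -> VM2 frames, which matches SRC=192.168.0.51 and DST=192.168.0.8 in the same lines.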

Based on this, my first question is:

Q1. After this packet arrives at VH3 (even though it shouldn't be there): is it normal that it is delivered to VM3 and VM4 even though the MAC addresses/IPs don't correspond to those VMs?
I guess this could be normal behavior, because the virtual host can't find the MAC address locally: the VM isn't on this node (it resides on another node).

I've already checked:
- the ARP cache on all Proxmox nodes doesn't seem to include any VMs (= normal/default?).
- the tap3i0 interface seems to be in promiscuous mode (= normal/default?); the fwpr3i0, fwln3i0 and fwbr3i0 devices are not (= normal/default?).

--

Q2. I guess the bigger problem is: the network traffic should never have entered VH3. It should be traffic between VH1 and VH2. How could it have gone to VH3?

I've connected all Proxmox nodes using LACP (active+active) to 2 Cisco switches, which are stacked using 2 fiber connections. See the simplified diagram in the attachment. The relevant network is all in the same subnet/VLAN.

My guess is that something must be going wrong on the switch, because the switch should already have seen that the traffic has a destination on VH2 and not VH3.
I think that should work based on ARP, but the ARP table on the switch seems to be correct (the MACs are pointing to the right LACP connections/LAGs). So I don't know why the packets are being delivered to the wrong place.
 

Attachments

  • simplified-network-diagram.png
Hi,
No, this is not normal. You shouldn't see traffic on VM3/VM4 that is destined for VM2.

>>- the arp-cache on all proxmox nodes doesn't seem to include any vm's (= normal/default?).

ARP is layer 3 (relevant if you use the bridge as a gateway, for example). You need to look at the bridge MAC address table instead:
# bridge fdb show

This can happen if the bridge loses a MAC address: it then floods unicast traffic for that address to all ports until it learns the MAC address again.
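
To illustrate the learn/age-out/flood cycle, here is a toy model of a MAC-learning bridge. It is only a sketch, not real bridge code: the class name, the port names (VH1..VH3) and the 300 s ageing value are assumptions chosen to match this thread.

```python
class LearningBridge:
    """Toy model of a MAC-learning bridge with entry ageing."""

    def __init__(self, ports, ageing_s):
        self.ports = set(ports)
        self.ageing_s = ageing_s
        self.fdb = {}  # mac -> (port, time the mac was last seen as a source)

    def receive(self, now, in_port, src, dst):
        """Learn the source MAC, then return the set of ports the frame goes out on."""
        self.fdb[src] = (in_port, now)
        entry = self.fdb.get(dst)
        if entry is not None and now - entry[1] < self.ageing_s:
            return {entry[0]} - {in_port}  # known destination: one port only
        self.fdb.pop(dst, None)            # unknown or aged out: forget and flood
        return self.ports - {in_port}


VM1, VM2 = "e2:a9:cc:75:79:af", "8a:1d:1b:44:64:a7"
bridge = LearningBridge(ports={"VH1", "VH2", "VH3"}, ageing_s=300)

first = bridge.receive(0, "VH1", VM1, VM2)      # VM2 not learned yet: flooded
bridge.receive(1, "VH2", VM2, VM1)              # VM2 replies, its port is learned
known = bridge.receive(2, "VH1", VM1, VM2)      # now delivered to VH2 only
expired = bridge.receive(400, "VH1", VM1, VM2)  # VM2 silent for 399 s: flooded again

print(sorted(first), sorted(known), sorted(expired))
```

In this model the destination entry is only refreshed by frames coming *from* that MAC, which is why a host that mostly receives can drop out of the table.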

Can you send your /etc/network/interfaces?
(This could happen if you have set bridge_ageing 0, for example.)


The same applies to your physical switches. (Just check that your mac-address-table timeout is not lower than your VMs' ARP cache timeout.)
 
I didn't know about that bridge mac address table! That was what I was looking for when I tried 'arp'.

The raw output of that command is attached (bridge.txt).
(I only changed the VM numbers so that they're consistent with my examples, e.g. tap120i0 --> tap3i0.)

I don't understand why the mac-address of the source is present so many times even though the machine isn't present on this machine. I guess that's part of the issue?

It may be relevant that I sometimes live-migrate VMs between cluster nodes. I also set up/changed some firewalls in the last few months.
 

Attachments

  • interfaces.txt
  • bridge.txt
Everything seems fine on your node.

If your VM3/VM4 are on a different Proxmox node than VM2, and you see traffic for VM2 going to VM3/VM4,
it's clearly a problem on your physical network.

If you only see some packets from time to time, chances are that your physical switch is losing the MAC address and then floods the traffic to all Proxmox nodes until the correct VM sends a reply.

Try looking at the "mac address-table aging-time" value on your Cisco switch to see whether it is too low.
You can pretty safely increase it to 4h:

"mac address-table aging-time 14400"

It only needs to be larger than the local ARP cache timeout of any VM OS (so the VM sends an ARP request before the mac-address-table entry times out, which refreshes it).
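
As a rough way to reason about the two timers: if a VM only emits a frame every R seconds and the switch ages entries after A seconds, its entry is missing for about max(0, R - A) seconds per cycle, and traffic to it is flooded during that window. A small sketch; the 1200 s refresh interval is a made-up value for illustration, while 300 and 14400 are the aging times discussed here:

```python
def flood_window_s(mac_aging_s: int, refresh_interval_s: int) -> int:
    """Seconds per refresh cycle during which the switch has no entry for the VM.

    Assumes the VM sends exactly one frame every refresh_interval_s seconds
    and that only frames sent by the VM refresh its entry.
    """
    return max(0, refresh_interval_s - mac_aging_s)


HYPOTHETICAL_REFRESH_S = 1200  # assumed interval between frames from a quiet VM

for aging in (300, 14400):
    window = flood_window_s(aging, HYPOTHETICAL_REFRESH_S)
    print(f"aging-time {aging}: entry missing for {window} s per cycle")
```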

https://community.cisco.com/t5/switching/unicast-flooding-mac-address-aging-time/td-p/2037094
 
Hi spirit,

To ensure you saw what I saw:
Is it correct that the mac-address of VM1 is present so many times in the bridge.txt file even though this vm is not present on this proxmox node?

Also, I'm not sure whether it's "normal" because these 'floods' are quite big, for example: (this server doesn't have any network traffic normally)
You can see flows of approx 25Mbps every few minutes. On some servers even more.
1593440430097.png

I checked the value in my switch config and also found it in the web interface. It's 300 now.
The switch only accepts a value <= 630, so that isn't going to change much.
1593437042700.png
 
>>Is it correct that the mac-address of VM1 is present so many times in the bridge.txt file even though this vm is not present on this proxmox node?

I'm only seeing "E2:A9:CC:75:79:AF" once in your bridge.txt

"e2:a9:cc:75:79:af dev bond0 master vmbr0"

(it's on bond0, as it's external to the host).


I don't think it's a problem with your node; it's really a problem with your physical switch.
If you see traffic coming to Proxmox VH3 for a VM which is not on this host, it's clearly your switch flooding traffic.


>>Also, I'm not sure whether it's "normal" because these 'floods' are quite big, for example: (this server doesn't have any network traffic normally)

What do you mean by "this server doesn't have any network traffic normally"? Is it a silent host (no gateway, never sends traffic)?
Because in that case, if this server never sends traffic to your network (or maybe one packet from time to time, more than 5 min apart), its entry in the mac-address-table of your switch will time out.
If your server sends packets more often than every 5 min (by sending packets, I mean: sending ARP, trying to establish connections to the outside, or replying in a connection established from another VM), it should be OK.


To debug, you should look at your switch mac-address-table while the problem is occurring, and check whether you see the VM's MAC address on the right port.

Also, maybe check whether there is a known bug in your switch OS version, since with stacked switches the MAC address tables of the 2 switches need to be kept in sync.


>>You can see flows of approx 25Mbps every few minutes. On some servers even more.

Hmm, 25 Mbps seems quite big. (How many dropped-packet lines do you see in the Proxmox firewall?)
 
>>Is it correct that the mac-address of VM1 is present so many times in the bridge.txt file even though this vm is not present on this proxmox node?

>>I'm only seeing "E2:A9:CC:75:79:AF" once in your bridge.txt
>>
>>"e2:a9:cc:75:79:af dev bond0 master vmbr0"

That's not the only occurrence in the file, I'm afraid. See below (I copy-pasted it from bridge.txt):

Code:
e2:a9:cc:75:79:af dev fwln132i0 master fwbr132i0
e2:a9:cc:75:79:af dev fwln4i0 master fwbr4i0
e2:a9:cc:75:79:af dev fwln3i0 master fwbr3i0
e2:a9:cc:75:79:af dev fwln111i0 master fwbr111i0
e2:a9:cc:75:79:af dev fwln148i0 master fwbr148i0
e2:a9:cc:75:79:af dev fwln147i1 master fwbr147i1
e2:a9:cc:75:79:af dev fwln116i0 master fwbr116i0
e2:a9:cc:75:79:af dev fwln153i0 master fwbr153i0

>>Also, I'm not sure whether it's "normal" because these 'floods' are quite big, for example: (this server doesn't have any network traffic normally)

>>what do you mean by "this server doesn't have any network traffic normally"? is it a silent host? (no gateway, never do traffic)

I'm sorry about this misunderstanding. No, I just wanted to note that this is a regular Ubuntu MySQL server that normally has (much) less traffic (around 100 kbps, not 25 Mbps or more).
It still has some traffic going on, and the other VMs (and the switch) have its MAC in their ARP tables.

>>To debug, you should try to look at your switch mac-address-table when you have the problem, and check if you see the vm mac address on the right port.

Yes, I've checked many times (to be sure it didn't change) but in all cases the table is stable. Always on the right LAG (LACP-port)

>>also, maybe check if no bug exist on your switch os version, as with stacked switch, mac address tables between 2switchs need to be sync.

I didn't find anything in the changelogs, but to be sure I'll try to update the switch later this month (I have to prepare this because of the risk and downtime involved).

>>You can see flows of approx 25Mbps every few minutes. On some servers even more.

>>mmm,25mbps seem quite big. (how many dropped packets lines do you see in proxmox firewall ?)

Many, very many. For example, I can find 22,532 drop lines from VM1 today, on this virtual host alone. They're all like the two lines I've mentioned before.
It's possible that it's somewhat related to specific situations, because the one I'm using in the example is by far the worst one at the moment.


P.S. thank you very much for your help so far! I'm very happy with your response time and helpful answers!
 
>>That's not the only occurrence in the file I'm afraid. See below (I copy-pasted it from bridge.txt)

Oh, sorry.
That's OK. When the firewall is enabled on a VM, a new bridge is created for that VM. If an ARP packet is broadcast by this source MAC, the MAC gets registered in the MAC address table of each of those bridges (fwlnX is a link which connects the fwbr bridge to the main vmbrX bridge).
So nothing strange.
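
A quick way to see this in a `bridge fdb show` dump is to group the entries per MAC and look at which device each one was learned on. A minimal sketch, assuming the "MAC dev DEVICE ... master BRIDGE" line layout shown in this thread; the function name and the sample text are only illustrative:

```python
from collections import defaultdict


def fdb_locations(fdb_text: str) -> dict:
    """Map each MAC to the list of (dev, master) pairs it appears on."""
    seen = defaultdict(list)
    for line in fdb_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[1] != "dev":
            continue  # skip blank or unexpected lines
        mac, dev = fields[0].lower(), fields[2]
        # "master" is optional; only read it when a bridge name follows.
        master = fields[fields.index("master") + 1] if "master" in fields[:-1] else None
        seen[mac].append((dev, master))
    return dict(seen)


# Lines copied from the bridge.txt excerpts quoted in this thread.
SAMPLE = """\
e2:a9:cc:75:79:af dev bond0 master vmbr0
e2:a9:cc:75:79:af dev fwln4i0 master fwbr4i0
e2:a9:cc:75:79:af dev fwln3i0 master fwbr3i0
"""

for dev, master in fdb_locations(SAMPLE)["e2:a9:cc:75:79:af"]:
    kind = "per-VM firewall bridge" if dev.startswith("fwln") else "main bridge uplink"
    print(dev, master, kind)
```

Every entry points away from the local VMs (bond0 on vmbr0, or the fwln link on a firewall bridge), which is consistent with the MAC living on another node.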

>>Many, very many. For example, I can find 22,532 drop lines from VM1 today,

Does it occur at a specific interval (every 5 min, every 10 min, ...) or at random?
 
Hi, sorry for the late response, I was testing some things today.

Thank you for your explanation, this helped me to understand this and check it off my list.

I found out that the drops occur 5, 10 or 15 minutes apart, so I guessed it could be the ARP timeout on the switch. The max timeout seems to be 600 (I guess this is because I also use the switch as a gateway), so raising it wouldn't do the trick.
I tried using static MAC addresses, and this worked: it stopped the traffic from entering the wrong LAGs.
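
The spacing between drops can be checked against the aging time with a few lines of code. A sketch that parses timestamps in the format used by the firewall log and expresses the gaps as multiples of 300 s; only the first timestamp comes from the log above, the others are hypothetical values for illustration, and `%b` assumes English month names:

```python
from datetime import datetime

LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"  # e.g. 29/Jun/2020:08:56:48 +0200


def gaps_in_aging_units(timestamps, aging_s=300):
    """Return the gaps between consecutive log timestamps, in units of aging_s."""
    times = sorted(datetime.strptime(t, LOG_TIME_FORMAT) for t in timestamps)
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
    return [round(gap / aging_s, 2) for gap in gaps]


SAMPLE = [
    "29/Jun/2020:08:56:48 +0200",  # from the log lines in the first post
    "29/Jun/2020:09:01:48 +0200",  # hypothetical
    "29/Jun/2020:09:11:48 +0200",  # hypothetical
    "29/Jun/2020:09:26:48 +0200",  # hypothetical
]
print(gaps_in_aging_units(SAMPLE))  # [1.0, 2.0, 3.0]
```

Gaps that cluster around whole multiples of the aging time point to table aging rather than random loss.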

I'm not sure why the switch forgets many of the MAC addresses every x minutes even though most of the machines communicate more than enough (even the high-traffic machines have this problem).
I found a firmware update for the switch that is related to packet flooding at twice the aging time when using LAGs. I'm going to update the switch firmware later this month; maybe that's related in some way.

The switch also uses 100% of its CPU at all times. I'm not sure why, because the amount of traffic, the TCAM usage and the simplicity of my configuration all seem to be okay. Maybe it's related to this problem too. I hope so, because when the traffic increases, the problem (delay / packets everywhere) also increases in magnitude.

Thank you very much for your help!!! (My French isn't yet good enough to follow your courses, but I hope to learn enough French to have normal conversations :) )
 
Hello, I have exactly the same problem.

The VMs receive traffic destined to other MACs.

Enabling the firewall solves the problem, but it is strange that this only happens with VMs. The physical servers I have in my network do not have this problem.

@vmswtje, were you able to find out if it was a switch problem?
 
I'm not sure, but I think it could be related. Updating the switch firmware didn't solve it.
 
