Linux Bridge reassemble fragmented packets

Asg.Systems · Sep 17, 2021

Hi to all,
we're experiencing a problem with firewall on a proxmox cluster and after few tests it seems it'a a linux bridge problem
The packet capture show that fragmented packets passing through the bridge are reassembled and sent out.
This is causing us some problems, even if proxmox cluster has global mtu set to 9000, appliances that run over it has 1500 mtu

Below you can find output of pveversion -v

Bash:

root@pmxnet03-b3311:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.98-1-pve)
pve-manager: 6.3-4 (running version: 6.3-4/0a38c56f)
pve-kernel-5.4: 6.3-5
pve-kernel-helper: 6.3-5
pve-kernel-5.4.98-1-pve: 5.4.98-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.8-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-5
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.2.0-2
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-5
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.3-pve1

we've tried to move one instances of the firewalls on an updated node to version
PVE Manager Version
pve-manager/6.4-13

But nothing changed
We've proxomx firewall enabled for some management operations, could this affect ?
is there any flag on linux bridge that prevent packets to be reassembled ?
If i have to post some other output there will no problem

Regards,
ASG System

spirit · Sep 17, 2021

AFAIK, linux netfilter firewall need to do reassembly to work.

The question is why do you have fragmented packets ?

Asg.Systems · Sep 17, 2021

Hi spirit, thanks for your feedback.
There is fragmentation because the mtu is 1500 on the firewall appliance
And even by setting it at 9000 there would be MPLS network with mtu 1500, so i have to change this behavior without any workaround

There is fragmentation because packets of RADIUS request come to bridge with aroud 1860 mtu value since they contains the certificate
fragmented packets are reassembled from the linux bridge but are not re fragmented once they come out of the bridge
this is causng some problems, because radius server that has 1500 of mtu discard the packet
I need the bridge to stop sends out packet reassembled at original mtu
Even because the vNIC of the firewall are at 1500 of mtu

i've found this article
https://bugs.launchpad.net/neutron/+bug/1542032
that shows a remediation

Bash:

root@node-11:~# cat /proc/sys/net/bridge/bridge-nf-call-iptables
1

root@node-11:~# echo "0" > /proc/sys/net/bridge/bridge-nf-call-iptables

root@node-11:~# cat /proc/sys/net/bridge/bridge-nf-call-iptables
0

My question is if after there is the need to restart some network process or it can be done hotplug

spirit · Sep 18, 2021

Asg.Systems said:
Hi spirit, thanks for your feedback.
There is fragmentation because the mtu is 1500 on the firewall appliance
And even by setting it at 9000 there would be MPLS network with mtu 1500, so i have to change this behavior without any workaround

There is fragmentation because packets of RADIUS request come to bridge with aroud 1860 mtu value since they contains the certificate
fragmented packets are reassembled from the linux bridge but are not re fragmented once they come out of the bridge
this is causng some problems, because radius server that has 1500 of mtu discard the packet
I need the bridge to stop sends out packet reassembled at original mtu
Even because the vNIC of the firewall are at 1500 of mtu

i've found this article
https://bugs.launchpad.net/neutron/+bug/1542032
that shows a remediation

Bash:

root@node-11:~# cat /proc/sys/net/bridge/bridge-nf-call-iptables 1 root@node-11:~# echo "0" > /proc/sys/net/bridge/bridge-nf-call-iptables root@node-11:~# cat /proc/sys/net/bridge/bridge-nf-call-iptables 0

My question is if after there is the need to restart some network process or it can be done hotplug

you have 2 options : don't use proxmox firewall (echo "0" > /proc/sys/net/bridge/bridge-nf-call-iptables disable it),
or fix your mtu problem. (it's really really bad to have fragmentation, this will break ssh, https or any protocol with DF bit)

Asg.Systems · Sep 20, 2021

spirit said:
you have 2 options : don't use proxmox firewall (echo "0" > /proc/sys/net/bridge/bridge-nf-call-iptables disable it),
or fix your mtu problem. (it's really really bad to have fragmentation, this will break ssh, https or any protocol with DF bit)

Yesterday we've tried to reboot the system disabling the firewall, but nothing changed, on this specific VLAN (ID 2249) we're experiencing the issue.
Unfortunately we use the Jumbo MTU for some services on the VM.

The strange things is the following:

VM ----> Proxmox Bridge ------> Firewall VM ----> Proxmox Bridge ---> Destination VM
send a pkt 2 fragment 2 fragment 2 fragment 1 reassembled crock
(UDP or ICMP) (lenght 1800)

so the behavior of the packet is correct, because on the Network it travels in two fragment (since it is UDP and not TCP), when arrive in the bridge is correctly fragmented, when it exit the bridge (not the VM NIC, we've done the tcpdump on both TAP, Bridge and Bond) the Linux Bridge reassemble it.

This behavior it's not observed on othe VLANs, in fact we'll try to put the server on another VLAN , but could be great to understand why on this specific scenario we have this problem.

spirit · Sep 20, 2021

what is the mtu of the bridge ? (you can increase mtu of bridge, physycal interfaces,...) without any problem, only interfaces where you have ip address or inside the vm guest need to be reduce

Asg.Systems · Sep 20, 2021

spirit said:
what is the mtu of the bridge ? (you can increase mtu of bridge, physycal interfaces,...) without any problem, only interfaces where you have ip address or inside the vm guest need to be reduce

Hi,

the MTU on the linux bridge and the phisycal interfaces is 9000, the MTU on guest virtual machine is 1500

spirit · Sep 20, 2021

and
/proc/sys/net/bridge/bridge-nf-call-iptables == 0 ?

(pve-firewall service need to be disable at datacenter level, as it's trying to enable it for any rules on any vm)

djzoidberg · Sep 20, 2021

Is there no way to force the radius server to send UDP packets with 1k5 MTU?
Usually UDP packet fragmentation and virtual reassembly works (except for some particular loadbalancing methods).

Anyway if you can force this service to don't use jumbo frames you should resolve this issue and you should be able to use jumbo frames for all other traffic.

Another thing that you can check is if for some reason there is the DF-bit enable. Some devices allow you to "don't honor" the DF-bit or you can just override the value in transit with clear df bit tequinques.

EDIT 1:
an example with iptables:
iptables -t mangle -A POSTROUTING -j DF --clear

Another one on fortigate firewalls:
# config system global
# set honor-df disable
# end

felipe · Oct 14, 2021

Hi we experienced similar problems.
Sending packages < 1500 where never fragmented again (after beeing reassembled) and where droped
this ONLY happens for us if
1) we use vlan aware bridge
2) VM is VLAN tagged

so if the vm (tap device) is not vlan tagged or we use normal bridges (not vlan aware ) it works...
seems like a bug in netfilter.....

or /proc/sys/net/bridge/bridge-nf-call-iptables == 0 and disable firewall service on that proxmox host totally as it will reset to 1 it even if all firewall opions are unchecked as long as the firewall servive is running. but thats not an option for us. i think we will switch for normal bridges at the moment... until this gets fixed....

spirit · Oct 14, 2021

felipe said:
Hi we experienced similar problems.
Sending packages < 1500 where never fragmented again (after beeing reassembled) and where droped
this ONLY happens for us if
1) we use vlan aware bridge
2) VM is VLAN tagged

so if the vm (tap device) is not vlan tagged or we use normal bridges (not vlan aware ) it works...
seems like a bug in netfilter.....

or /proc/sys/net/bridge/bridge-nf-call-iptables == 0 and disable firewall service on that proxmox host totally as it will reset to 1 it even if all firewall opions are unchecked as long as the firewall servive is running. but thats not an option for us. i think we will switch for normal bridges at the moment... until this gets fixed....

About netfilter, I found a interesting thread
https://www.spinics.net/lists/netdev/msg596072.html
"
When the "/proc/sys/net/bridge/bridge-nf-call-iptables" is on, bridge
will do defragment at PREROUTING and re-fragment at POSTROUTING. At
the re-fragment bridge will check if the max frag size is larger than
the bridge's MTU in br_nf_ip_fragment(), if it is true packets will
be dropped."

Currently, the bridge use the smallest mtu of different interfaces plugged in the bridge. (ethX, but also veth|tap vm interfces)

About vlan, I really don't known, maybe vlanaware bridge need unfragmented packet to add the vlan tag .

felipe · Oct 14, 2021

hmmm yes very hard to find out. but seems to be a bug. strange that not many people report this. but it is 100% reproducible. like that vlan aware bridges will allways make problems when guest vms send packages bigger 1500. of course one can set the mtu higher at the bond device after the bridge but thats just an odd workaround.

felipe · Oct 14, 2021

just a stupid question. as vlan ware does not work for me i tried to change one host. as i want the cluster & management in a vlan i want do make
vmbr0v200 with the ip but i am not able to create a bridge with that name in the gui.... (its red not allowed to press the button because name is not allowed)

spirit · Oct 15, 2021

felipe said:
hmmm yes very hard to find out. but seems to be a bug. strange that not many people report this. but it is 100% reproducible. like that vlan aware bridges will allways make problems when guest vms send packages bigger 1500. of course one can set the mtu higher at the bond device after the bridge but thats just an odd workaround.

I'll try to find more info about vlan.
Anyway, fragmentation is bad, you shouldn't have fragmentation on your network, it'll break protocol like ssh, https, where DF bit is set.
you should increate the mtu on your bond && bridge device && your phyysical switch, it's not a workaround, it's a correct setup.

felipe · Oct 15, 2021

on the network in general it is not good to have fragmentation. but if the vm has a MTU of 1500 it has to fragment the packages. also ovre the internet it will never get transportet with jumbo frames. so thats the problem. it is reasambled by netfilter and then never fragmentet to the original size of the 1500 mtu as it came from the interface. also maybe there are some larger packages with mtu bigger then 9000. then even if i put switch and bond & bridge to jumbo frames it will be dropped as it es reassembled by netfilter to a size whatever.

a ping like -s 12000 will never go trough with vlan aware bridges. (and tagged network device) but will on normal bridges.

Asg.Systems · Jan 26, 2022

Just to add a clarification after the proxmox case, the reassemblation of the packets is caused by the Netfilter only if the FIrewall is enabled on the cluster

I found also this, now I try if enabling conntrack (is disabled in our environment for other problem that was caused)
https://serverfault.com/questions/5...d-and-subsequent-fragments-of-an-allowed-pack

Asg.Systems · Jan 26, 2022

Asg.Systems said:
Just to add a clarification after the proxmox case, the reassemblation of the packets is caused by the Netfilter only if the FIrewall is enabled on the cluster

I found also this, now I try if enabling conntrack (is disabled in our environment for other problem that was caused)
https://serverfault.com/questions/5...d-and-subsequent-fragments-of-an-allowed-pack

nothing to do, also with connection tracking enable the ICMP and UDP messages are reassembled by the PVE netfilter

fst · Oct 29, 2022

This might be related:
https://bugzilla.proxmox.com/show_bug.cgi?id=4158

Asg.Systems · Nov 25, 2022

fst said:
This might be related:
https://bugzilla.proxmox.com/show_bug.cgi?id=4158

That's really nice, we've opened another one on netfilter since the issue seems to be netfilter https://bugzilla.netfilter.org/show_bug.cgi?id=1644

spirit · Nov 25, 2022

Asg.Systems said:
That's really nice, we've opened another one on netfilter since the issue seems to be netfilter https://bugzilla.netfilter.org/show_bug.cgi?id=1644

as far I remember, the problem is that netfilter (with iptables) is not able to refragment packet in postrouring in bridge mode. and degrafmentation is mandatory to for conntrack. I'm not sure it's fixable, at least with iptables. maybe with nftables, and true bridge conntrack, with nftables will not be ready until proxmox 8.

https://www.mail-archive.com/netfilter-devel@vger.kernel.org/msg21403.html

Linux Bridge reassemble fragmented packets

Member

Distinguished Member

Member

Distinguished Member

Member

Distinguished Member

Member

Distinguished Member

Member

Well-Known Member

Distinguished Member

Well-Known Member

Well-Known Member

Distinguished Member

Well-Known Member

Member

Member

Member

Member

Distinguished Member

We value your privacy