OpenVZ Containers lose internet connection (VLAN/venet)

Discussion in 'Proxmox VE 1.x: Installation and configuration' started by cle.ram, Feb 8, 2012.

  1. cle.ram

    cle.ram New Member

    Joined:
    Jun 15, 2009
    Messages:
    17
    Likes Received:
    0
    I have rather weird issues with my OpenVZ containers using venet and running in Proxmox 1.9 (Kernel 2.6.32-6-pve, fully updated as of 2012-02-08) hosted at OVH.
    The Setup:
    - a private 172.16.0.0/12 and a public x.x.x.x/28 subnet are assigned to the VLAN interface
    - each container is assigned ONLY a public IP out of the /28 subnet on its venet interface

    Problem:
    At first, the connectivity of the containers seems fine for about four hours (the timeframe is reproducible).
    After that, they are no longer reachable from other machines in the public x.x.x.x/28 subnet or from the internet, and they cannot ping machines with public IPs either.
    The strange part is that the containers can ping machines in the private 172.16.0.0/12 subnet the whole time, but cannot be pinged from those machines.
    The containers' public IPs also respond when accessed directly from the HN.

    These issues disappear (for the next four hours) when the containers are rebooted.

    /etc/network/interfaces:
    Code:
    # The loopback network interface
    auto lo
    iface lo inet loopback
    
    # for Routing
    auto vmbr1
    iface vmbr1 inet manual
        post-up /etc/pve/kvm-networking.sh
        bridge_ports dummy0
        bridge_stp off
        bridge_fd 0
    
    # vmbr0: Bridging. Make sure to use only MAC addresses that were assigned to you.
    auto vmbr0
    iface vmbr0 inet static
        address xxx.xxx.xxx.xxx
        netmask xxx.xxx.xxx.xxx
        network xxx.xxx.xxx.xxx
        broadcast xxx.xxx.xxx.xxx
        gateway xxx.xxx.xxx.xxx
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0
    
    auto eth0.XXXX
    iface eth0.XXXX inet static
            address 172.16.0.2 
            netmask 255.240.0.0 
            post-up ip r a 172.16.0.0/12 via 172.31.255.254 dev eth0.XXXX; true 
    
    auto eth0.XXXX:0
    iface eth0.XXXX:0 inet static 
            address xxx.xxx.xxx.2 
            network xxx.xxx.xxx.0 
            broadcast xxx.xxx.xxx.15 
            gateway xxx.xxx.xxx.14 
            netmask 255.255.255.240 
            post-up /sbin/ip route add default via xxx.xxx.xxx.14  dev eth0.XXXX table 125
            post-up /sbin/ip rule add from xxx.xxx.xxx.0/28 table 125 
            post-down /sbin/ip route del default via xxx.xxx.xxx.14 dev eth0.XXXX table 125
            post-down /sbin/ip rule del from xxx.xxx.xxx.0/28 table 125
    
    XXXX being the VLAN-ID


    /etc/sysctl.conf:
    Code:
    net.ipv4.ip_forward=1
    net.ipv4.conf.default.forwarding=1
    net.ipv4.conf.default.proxy_arp = 1
    net.ipv4.conf.eth0/XXXX.proxy_arp = 1
    


    route -n:
    Code:
     Kernel IP routing table
    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
    xxx.xxx.xxx.7    0.0.0.0         255.255.255.255 UH    0      0        0 venet0
    xxx.xxx.xxx.9    0.0.0.0         255.255.255.255 UH    0      0        0 venet0
    xxx.xxx.xxx.8    0.0.0.0         255.255.255.255 UH    0      0        0 venet0
    xxx.xxx.xxx.11   0.0.0.0         255.255.255.255 UH    0      0        0 venet0
    xxx.xxx.xxx.10  0.0.0.0         255.255.255.255 UH    0      0        0 venet0
    xxx.xxx.xxx.0    0.0.0.0         255.255.255.240 U     0      0        0 eth0.XXXX
    xxx.xxx.xxx.0   0.0.0.0         255.255.255.0   U     0      0        0 vmbr0
    172.16.0.0        0.0.0.0         255.240.0.0     U     0      0        0 eth0.XXXX
    0.0.0.0         xxx.xxx.xxx.14   0.0.0.0         UG    0      0        0 eth0.XXXX
    0.0.0.0         xxx.xxx.xxx.254  0.0.0.0         UG    0      0        0 vmbr0
    
    I hope someone has an idea how I could solve these issues...
    Thanks in advance!



    EDIT:

    tcpdump -i eth0.XXXX shows that ARP requests for an affected IP are being received by the HN but are not answered.

    If I ping from inside a container, "tcpdump -i eth0.XXXX" on the HN shows that echo requests are being transmitted, but again: ARP requests that are not being answered, and no echo replies.

    If I ping a machine in 172.16.0.0/12, the echo replies do reach the container:
    Code:
    listening on eth0.XXXX, link-type EN10MB (Ethernet), capture size 96 bytes
    03:57:41.596299 IP PUBLIC-IP-OF-CONTAINER > 172.16.0.4: ICMP echo request, id 56072, seq 1, length 64
    03:57:41.599311 arp who-has PUBLIC-IP-OF-CONTAINER tell 172.16.0.4
    03:57:41.904644 IP 172.16.0.4 > PUBLIC-IP-OF-CONTAINER: ICMP echo reply, id 56072, seq 1, length 64
    
     
    #1 cle.ram, Feb 8, 2012
    Last edited: Feb 9, 2012
  2. cle.ram

    cle.ram New Member

    I want to document my findings (including a solution) here, in case anyone else runs into similar problems.

    To make it short: the configuration suggested by OVH does not work correctly on any Linux-based system with kernel version 2.6.31 or newer.

    Background:
    Beginning with 2.6.31, the "rp_filter=1" (reverse path filter) setting actually works as documented (in earlier kernels it effectively behaved like rp_filter=0, regardless of the configured value). If rp_filter is enabled (which is the default in most distributions nowadays), the kernel checks whether an incoming packet's source address is reachable via some route on that machine (it is basically checked against the FIB). There are two modes of rp_filter: strict mode (rp_filter=1) and loose mode (rp_filter=2). In strict mode, the kernel additionally checks whether the packet arrived exactly on the interface that would be used to send traffic back to its source; if such a packet arrives on any other interface, it is dropped. Loose mode (rp_filter=2) only matches against the FIB and accepts the packet regardless of the interface. Setting rp_filter=0 disables reverse path filtering completely, resulting in no checks at all, which is not recommended and rather dirty.
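    As a quick illustration (my own sketch, not part of the original OVH setup): the effective mode can be inspected per interface via /proc:

```shell
# Show the effective rp_filter mode for every interface the kernel knows
# about: 0 = no check, 1 = strict mode, 2 = loose mode.
# Note: for rp_filter the kernel applies the maximum of conf/all and
# conf/<interface>, so check both.
for f in /proc/sys/net/ipv4/conf/*/rp_filter; do
    printf '%s = %s\n' "$f" "$(cat "$f")"
done
```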

    To get back to the setup from OVH:
    As soon as the VLAN configuration is in place, the vmbr0 interface (and the associated IP) no longer responds to ARP requests for packets originating from the internet (because rp_filter=1 evaluates vmbr0 as an unsuitable interface for those packets). The VLAN IP of the HN itself remains reachable from the internet. Due to OVH's internal routing, all traffic that originates from the internet and is destined for the virtual nodes using venet arrives on vmbr0 (which does not respond to ARP requests), so the connection times out.
    Now I'll reconstruct why the connectivity to the VNs seems to work fine for four hours at first:
    When an OpenVZ container is restarted, the ARP information associating its IP with the HN seems to be broadcast over all interfaces (I'm not sure that is 100% correct, but it does at least something like that). The router of the OVH infrastructure receives these and puts them into its internal ARP table. Packets arriving for the public VLAN subnet are now processed correctly and the VN is reachable. But the OVH router (apparently a Cisco) has an ARP cache timeout of exactly four hours. After that timeframe the entries are deleted from the router's ARP table. The VNs then become unreachable, because the router no longer has any ARP/IP mappings for them and the HN's vmbr0 (again) does not answer the incoming ARP requests.
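    The four-hour expiry lives in the OVH router's ARP cache, not on the HN, but the same aging mechanism can be watched on any Linux box (a diagnostic sketch, not part of the fix):

```shell
# Base reachable time for neighbor (ARP) entries, in milliseconds;
# entries decay to STALE after a randomized multiple of this value.
cat /proc/sys/net/ipv4/neigh/default/base_reachable_time_ms

# The kernel's current ARP table (IP/MAC mappings per device).
cat /proc/net/arp
```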

    Fix: setting "net.ipv4.conf.vmbr0.rp_filter = 2" in sysctl.conf solves the problem (on a running HN I additionally had to run "echo 2 > /proc/sys/net/ipv4/conf/vmbr0/rp_filter" to make it active immediately).
    If anyone has an idea for a cleaner solution (probably by adding some crazy post-up routing rules to /etc/network/interfaces), I'd be happy to hear it.

    Other notes (possibly no longer relevant after setting "rp_filter=2"):
    Before I found the rp_filter issue, I was trying to get the VNs to ping each other within the "four-hour connectivity timeframe", which at first did not work because of the routing settings proposed by OVH in /etc/network/interfaces. I fixed it by setting the following routes/rules in the VLAN-specific part:
    Code:
    #VLAN Config:
    
    
    auto eth0.XXXX
    iface eth0.XXXX inet static
            address 172.16.0.2 
            netmask 255.240.0.0 
            post-up   ip route add 172.16.0.0/12 via 172.31.255.254 dev eth0.XXXX
            post-down ip route del 172.16.0.0/12 via 172.31.255.254 dev eth0.XXXX
    
    
    
    
    auto eth0.XXXX:0
    iface eth0.XXXX:0 inet static 
            address xxx.xxx.xxx.2 
            network xxx.xxx.xxx.0 
            broadcast xxx.xxx.xxx.15 
            gateway xxx.xxx.xxx.14 
            netmask 255.255.255.240 
            post-up    /sbin/ip rule add to xxx.xxx.xxx.0/28 lookup main pref 125              
            post-up    /sbin/ip rule add from xxx.xxx.xxx.0/28 iif venet0 table 125            
            post-up    /sbin/ip route add default via xxx.xxx.xxx.14 dev eth0.XXXX:0 table 125
            post-down  /sbin/ip rule del to xxx.xxx.xxx.0/28 lookup main pref 125                
            post-down  /sbin/ip rule del from xxx.xxx.xxx.0/28 iif venet0 table 125             
            post-down  /sbin/ip route del default via xxx.xxx.xxx.14 dev eth0.XXXX:0 table 125
    
    It would be nice if a mod could mark this thread as SOLVED.
    Thanks!
     
    #2 cle.ram, Feb 9, 2012
    Last edited: Feb 9, 2012