Something bad is happening with OpenVZ!

SamTzu

Ok. Something bad seems to be happening with OpenVZ.
I noticed error messages on one of our Proxmox servers.

Oct 10 22:28:54 vip1 kernel: Neighbour table overflow.
Oct 10 22:28:54 vip1 kernel: Neighbour table overflow.
Oct 10 22:28:54 vip1 kernel: Neighbour table overflow.
Oct 10 22:28:54 vip1 kernel: Neighbour table overflow.
Oct 10 22:28:54 vip1 kernel: Neighbour table overflow.
Oct 10 22:28:58 vip1 corosync[2580]: [TOTEM ] A processor failed, forming new configuration.
Oct 10 22:28:58 vip1 corosync[2580]: [CLM ] CLM CONFIGURATION CHANGE
Oct 10 22:28:58 vip1 corosync[2580]: [CLM ] New Configuration:
Oct 10 22:28:58 vip1 corosync[2580]: [CLM ] Members Left:
Oct 10 22:28:58 vip1 corosync[2580]: [CLM ] Members Joined:
Oct 10 22:28:58 vip1 corosync[2580]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 10 22:28:59 vip1 corosync[2580]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 10 22:29:17 vip1 kernel: __ratelimit: 391 callbacks suppressed
Oct 10 22:29:17 vip1 kernel: Neighbour table overflow.
Oct 10 22:29:17 vip1 kernel: Neighbour table overflow.

etc...

The problem became clear when I tried to work out how there could be so many connections on the server without me normally seeing them.
(That is what "Neighbour table overflow" usually means: lots of connections, or IPv6 problems.)
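
For reference, the kernel's neighbour tables can be inspected directly with iproute2 (standard on Debian, so this should work on a Proxmox host as-is):

# ip -4 neigh show
# ip -6 neigh show

The first lists the IPv4 ARP entries, incomplete ones included; the second lists the IPv6 neighbour entries, in case the overflow is on the IPv6 side.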

Running
arp -a
revealed many ghost ARP entries from the internal network, for addresses that simply do not exist.

? (10.9.141.139) at <incomplete> on vmbr0
? (10.9.129.77) at <incomplete> on vmbr0
? (10.9.143.131) at <incomplete> on vmbr0
? (10.9.143.73) at <incomplete> on vmbr0
? (10.9.143.209) at <incomplete> on vmbr0
? (10.9.142.213) at <incomplete> on vmbr0
? (10.9.143.111) at <incomplete> on vmbr0
? (10.9.140.48) at <incomplete> on vmbr0
? (10.9.139.207) at <incomplete> on vmbr0
? (10.9.139.224) at <incomplete> on vmbr0
? (10.9.131.219) at <incomplete> on vmbr0
? (10.9.129.33) at <incomplete> on vmbr0
? (10.9.143.89) at <incomplete> on vmbr0

There are no such IPs in use on our network.
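
As a side note, stale entries like these can in principle be flushed with iproute2 without restarting the network (just a sketch, not something I had tried at this point):

# ip neigh flush dev vmbr0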

When I tried to restart the network with
/etc/init.d/networking restart
I got an even weirder error.

root@vip1:/var/log# /etc/init.d/networking restart
Running /etc/init.d/networking restart is deprecated because it may not re-enable some interfaces ... (warning).
Reconfiguring network interfaces...
Waiting for vmbr0 to get ready (MAXWAIT is 2 seconds).
grep: unrecognized option '--all'
Usage: grep [OPTION]... PATTERN [FILE]...
Try 'grep --help' for more information.
done.

I'm at a loss here. Who is doing what here?
 
How many entries do you have in your ARP cache? See the example below.
Have you checked the network settings on your host and VMs? Maybe your broadcast domain is too large.
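
For example, something like this should show the entry count and the thresholds it is compared against (assuming iproute2 and sysctl are available):

# ip -4 neigh | wc -l
# sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3

If the count is near or above gc_thresh3, new entries cannot be added and you get exactly this overflow message.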
 
It's hard to say anything without details about your network environment, but I would run tcpdumps on your switch/host/VM interfaces to find out which direction they are coming from...
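
For example, something like this on the host bridge would show the source MACs behind the ARP traffic for one of the ghost addresses (interface and IP taken from the output above, adjust to taste):

# tcpdump -eni vmbr0 arp and host 10.9.141.139

The -e flag prints the link-level header, so you can see which MAC the requests are coming from.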

I had a similar case where a customer had a layer-2 connection between two sites, plugged into one of the switches on each site. The problem turned out to be caused by the ISP: they had not isolated this layer-2 connection, so the customer had the ISP's backbone connected on both sites...
 
That particular host had one Bind DNS server, two Zimbra VMs and one ZenOss VM running on it. They were all behind an external firewall. I fail to see how this could happen unless something was seriously wrong.
 
A "neighbour table overflow" kernel message is a common problem on large networks.

If you run:
# sysctl net.ipv4.neigh.default.gc_thresh1
# sysctl net.ipv4.neigh.default.gc_thresh2
# sysctl net.ipv4.neigh.default.gc_thresh3

You will see the default values, respectively:
net.ipv4.neigh.default.gc_thresh1 = 128
net.ipv4.neigh.default.gc_thresh2 = 512
net.ipv4.neigh.default.gc_thresh3 = 1024

For a large network with table overflow issues, the following entries in /etc/sysctl.conf can prevent the errors:
# Run the ARP cache garbage collector every hour
net.ipv4.neigh.default.gc_interval = 3600

# Consider ARP cache entries stale after an hour
net.ipv4.neigh.default.gc_stale_time = 3600

# Raise the ARP table size thresholds
net.ipv4.neigh.default.gc_thresh3 = 4096
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh1 = 1024

To apply the new settings without a restart:
# sysctl -p

The issue can also occur on a network with few hosts if the subnet mask is very short: a /16, for example, spans 65,534 addresses, so scans or broadcast chatter across that range can easily exceed the default gc_thresh3 of 1024.

 
