Negative number of orphaned TCP sockets

  • Thread starter: psokolovas (Guest)
Hi

I've started using VE checkpoint/restore (chkpnt/restore), but after several days I noticed that sometimes, in random VEs after a restore, the number of orphaned TCP sockets becomes negative (e.g. -1 or -4). This is system-wide, because dmesg on the HN shows:

TCP: too many of orphaned sockets (-1 in CT1060)
printk: 52 messages suppressed.
TCP: too many of orphaned sockets (-3 in CT1105)
printk: 37 messages suppressed.

vzctl restart 1105 does not help. The count still comes back to a negative value. This causes a lot of trouble for VE users, because TCP connections start to drop, resulting in pictures that do not load, truncated HTML, etc.

I use kernel 2.6.18-4-pve. No beancounters are over their limits.

As far as I understand programming, the kernel code should check whether the number is negative and, if so, count it as 0. But since -4 read as an unsigned value is 65532 (16-bit), or far more if a wider integer is used, I think that is where the problem is.
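To illustrate the arithmetic behind that guess, here is a minimal user-space sketch in C. It is not kernel code: the helper names as_u16, as_u32 and clamp_orphans are made up for this example, and whether the kernel really compares the orphan count as unsigned is an assumption, not something confirmed in this thread.

```c
#include <stdint.h>

/* Hypothetical helper: reinterpret a signed counter as 16-bit unsigned.
 * C defines this conversion modulo 2^16, so -4 becomes 65532. */
uint16_t as_u16(int count)
{
    return (uint16_t)count;
}

/* Hypothetical helper: the same for 32 bits, where -4 becomes 4294967292. */
uint32_t as_u32(int count)
{
    return (uint32_t)count;
}

/* Hypothetical helper: the "count a negative value as 0" check. */
int clamp_orphans(int count)
{
    return count < 0 ? 0 : count;
}
```

With these helpers, as_u16(-4) gives 65532 and clamp_orphans(-4) gives 0, which is the behaviour asked for above.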

All my attempts to solve this problem without rebooting the HN failed. Only a reboot clears these counters, but that is not an acceptable solution.
EDIT: Just found one more solution:

1. vzctl stop 1105
2. wait for dmesg on HN: Ub 1105 helds 31192 in tcpsndbuf on put
3. vzctl start 1105

I waited about 30 seconds. Cool. But this is still not an acceptable solution :)

Questions:

1. Has anyone experienced the same problem, and if yes, what was the solution? E.g. maybe it is possible to reset all open/orphaned sockets and their counters by issuing some kind of cat smth > /proc/somewhere?

2. Maybe it is possible to patch the kernel to behave as I described above when the number becomes negative? If yes, maybe we should patch the PVE kernel?

Thanks!
 
Could you please report this bug to the OpenVZ bug tracker, including detailed instructions on how to reproduce it?
 
I've reported this to the OpenVZ kernel maintainers:
http://bugzilla.openvz.org/show_bug.cgi?id=1735

Also, I believe I have found a workaround. Can you ask your kernel maintainers to include this fix in the current 2.6.18 kernel release? I tried to do it myself, but the kernel is hard to compile: gcc 4.1.2 fails, newer versions also fail, and it is hard to get a working 4.1.3 compiler together with quilt etc. I think it would be better if you recompiled it and shipped it as an update. Thanks!

Workaround:
in net/ipv4/tcp.c we replace:

if (ub_too_many_orphans(sk, orphans)) {
with:
if ((ub_too_many_orphans(sk, orphans)) && (orphans > 0)) {
This eliminates false TCP resets when an inaccuracy in counting pushes the orphan count below zero. I have never seen the count go below -7, so an error of ±7 orphans should not matter for UB accounting, while the extra check prevents the spurious TCP resets.
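The effect of the extra condition can be sketched in user space as follows. This is only a model: too_many_orphans_model is a hypothetical stand-in, not the OpenVZ ub_too_many_orphans() implementation, and its unsigned comparison against a limit is an assumption that mirrors the guess made earlier in this thread. The two wrapper names and the limit parameter are likewise made up for the example.

```c
#include <stdbool.h>

/* HYPOTHETICAL stand-in for ub_too_many_orphans(). Assumption: the count
 * is compared as unsigned, so a small negative value looks huge. */
static bool too_many_orphans_model(int orphans, unsigned int limit)
{
    return (unsigned int)orphans > limit;
}

/* Decision as in the unpatched condition: reset when "too many". */
bool should_reset_original(int orphans, unsigned int limit)
{
    return too_many_orphans_model(orphans, limit);
}

/* Decision with the proposed guard: a count of zero or below never
 * triggers a reset. */
bool should_reset_patched(int orphans, unsigned int limit)
{
    return too_many_orphans_model(orphans, limit) && orphans > 0;
}
```

Under this model, a count of -3 with a limit of 1024 triggers a reset in the original condition but not in the patched one, while a genuinely large positive count still triggers a reset in both.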

If you want, I can help with testing. Please compile 2.6.18-4-pve with my patch and send it to me, or give me its FTP location, and I will test it on my production HNs (because I am 99% sure it will help, and 99.99% sure it won't hurt :)

After that, you can release it as an update.

Thanks again!