Strange Network Dropout

jaybinks · May 6, 2011

So I have a cluster of 4 Proxmox boxes,
running 1.6 with the 2.6.18-2 kernel (Proxmox package).

The kernel choice is because the vendor of the software we run recommends this kernel version. It just so happens that 2.6.18 is also the latest stable kernel as per the OpenVZ project, so that was an easy pick.

Up till now (about 12 months) we have not seen this issue at all.

It started after doing some software upgrades (inside the VMs).
What we do during a software upgrade is copy the VE to a new ID (while it's running) with "cp -a /var/lib/vz/private/100 /var/lib/vz/private/101", then do the same with the conf: "cp /etc/vz/conf/100.conf /etc/vz/conf/101.conf".

We take great care to make sure the new VE ID is not currently in use.
(However, it may be worth noting that we deleted some old VEs recently, and the VE IDs have started to be re-used. Not sure if that's potentially part of the problem.)

After copying VE 100 to 101, we stop 100 and start 101,
then upgrade the server software inside the VE. (This lets us roll back from 101 to 100 if the upgrade causes issues.)
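
The copy step described above could be wrapped in a small guard, so that a re-used ID is caught before anything is copied. This is only a sketch under assumptions: clone_ve is a made-up helper name, the two paths are the ones quoted above, and the VZ_PRIVATE / VZ_CONF variables exist only so the function can be tried against a scratch directory first.

```shell
# Sketch only: copy an OpenVZ VE's private area and config to a new ID.
# clone_ve, VZ_PRIVATE and VZ_CONF are hypothetical names for illustration.
VZ_PRIVATE="${VZ_PRIVATE:-/var/lib/vz/private}"
VZ_CONF="${VZ_CONF:-/etc/vz/conf}"

clone_ve() {
    src="$1"
    dst="$2"
    # The source VE must exist.
    if [ ! -d "$VZ_PRIVATE/$src" ]; then
        echo "source VE $src not found" >&2
        return 1
    fi
    # Refuse to continue if the target ID already has a private area or config.
    if [ -e "$VZ_PRIVATE/$dst" ] || [ -e "$VZ_CONF/$dst.conf" ]; then
        echo "VE ID $dst is already in use" >&2
        return 1
    fi
    # Same two copies as in the post.
    cp -a "$VZ_PRIVATE/$src" "$VZ_PRIVATE/$dst" || return 1
    cp "$VZ_CONF/$src.conf" "$VZ_CONF/$dst.conf" || return 1
}
```

Usage would be something like "clone_ve 100 101 && vzctl stop 100 && vzctl start 101". Note the check only looks at this node's directories, so it says nothing about IDs in use elsewhere in the cluster.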

We have used this method for 12 months or so with no issues whatsoever.

A few weeks ago we used this same method again, but after the upgrade we noticed our monitoring system started to "flap" occasionally (at most 2 or 3 times per day; sometimes it's only 2 or 3 times per week).

Nagios reports that the VE is not responding to ICMP pings (and at the same time reports other services as offline, not just ICMP).

However, a short time later (30-60 sec or so) it reports that it's all good,
and generally we don't get another alert for a day or more.

Now this was mildly annoying on the first VE I upgraded, but I've done 4 more now and it has started happening on all of them (to varying degrees).

I have no idea where to look!
One of the only things I've found so far is that it always seems to happen in the same second that "pvemirror[10725]: starting cluster syncronization" shows up in daemon.log.
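
To check that correlation more systematically, the sync timestamps can be pulled out of the log and compared against the Nagios alert times. A rough sketch, assuming daemon.log uses the usual syslog "Mon DD HH:MM:SS host ..." line format; sync_times is a made-up helper name, and the message text (spelling included) is matched exactly as it appears in the log line quoted above.

```shell
# Sketch only: print the timestamp of every pvemirror sync start in a log file.
# sync_times is a hypothetical helper; assumes standard syslog timestamps.
sync_times() {
    # Fields 1-3 of a syslog line are month, day and HH:MM:SS.
    awk '/pvemirror\[[0-9]+\]: starting cluster syncronization/ { print $1, $2, $3 }' "$1"
}
```

Running "sync_times /var/log/daemon.log" (the path is an assumption) gives one timestamp per sync, which can then be lined up by eye or with grep against the times Nagios flagged the VE as down.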

My cluster is on GigE, and my network monitoring does not show network saturation when we get these alarms.

Can someone point me in the right direction to start tracking this down further?

As a precaution, can I make pvemirror sync the cluster less often?

jaybinks · May 6, 2011

Something I missed in my original post was the pcaps.

So I did pcaps on both ends: "PCap A" from our Nagios box,
and "PCap B" from our virtual server (in Proxmox).

In "PCap A" I saw
an ICMP packet and a SIP OPTIONS message leave the Nagios box.

Then in "PCap B" I saw
the ICMP and SIP OPTIONS come in,
but I never saw a response to either the ICMP or the SIP OPTIONS (no response left the VE).

In "PCap A" I obviously saw no response.
A few sec later the ping / SIP started to flow normally again, and Nagios flagged the box as "OK".

I originally thought this was the SIP server dropping SIP packets (and not responding),
until I caught these pcap files and saw ICMP doing the exact same thing at the exact same time.
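
For anyone wanting to reproduce the two captures, something along these lines should work. It is a sketch, not the exact commands I ran: capture_filter is a made-up helper, port 5060 is the standard SIP port and is assumed here, and the interface name and address in the example are placeholders.

```shell
# Sketch only: build a tcpdump filter for ICMP plus SIP traffic to/from one host.
# capture_filter is a hypothetical helper; 5060 is assumed to be the SIP port.
capture_filter() {
    printf 'host %s and (icmp or port 5060)\n' "$1"
}

# Example (interface and address are placeholders), run on each end:
# tcpdump -n -i eth0 -w pcap_b.pcap "$(capture_filter 192.0.2.10)"
```

Using the same filter on both ends makes it easier to match a request leaving the Nagios box with the same request arriving at the VE.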
