proxy detected vanished client connection

Hi.
I have a PVE cluster with 7 nodes with the same configuration (same network and same switch), but I have some random problems on two of them (node5 and node6): sometimes they lose communication with the cluster, they become very slow, I cannot enter containers ("vzctl enter CTID" gives no response), and I see these errors in /var/log/syslog on them:

Code:
Oct 11 11:42:27 node5 pveproxy[456134]: ipcc_send_rec failed: Connection refused
Oct 11 11:42:27 node5 pveproxy[456134]: ipcc_send_rec failed: Connection refused
Oct 11 11:42:27 node5 pveproxy[456134]: ipcc_send_rec failed: Connection refused
Oct 11 11:42:27 node5 pveproxy[456134]: ipcc_send_rec failed: Connection refused
Oct 11 11:42:27 node5 pveproxy[456134]: ipcc_send_rec failed: Connection refused
Oct 11 17:17:25 node5 pveproxy[560091]: proxy detected vanished client connection
Oct 11 17:18:56 node5 pveproxy[560091]: proxy detected vanished client connection
Oct 11 17:20:27 node5 pveproxy[568073]: proxy detected vanished client connection
Oct 11 17:21:58 node5 pveproxy[568073]: proxy detected vanished client connection

Code:
Oct 11 17:15:27 node6 pveproxy[4088]: proxy detected vanished client connection
Oct 11 17:16:52 node6 pveproxy[4087]: proxy detected vanished client connection

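From what I read, the "ipcc_send_rec failed: Connection refused" messages mean that pveproxy cannot reach the local cluster filesystem (pmxcfs). Unless you suggest otherwise, my next step would be to check whether pmxcfs is still alive on the affected nodes and to restart the pve-cluster service if it is not (just a rough sketch of what I have in mind):

Code:
# check whether the cluster filesystem process is still running on node5
root@node5:~# ps aux | grep [p]mxcfs

# if it is gone or hung, restart the cluster service and then the web proxy
root@node5:~# service pve-cluster restart
root@node5:~# service pveproxy restart
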
"pvecm nodes" returns me this:

Code:
root@node5:~# pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M  323432   2015-10-11 15:39:47  node1
   2   M  323432   2015-10-11 15:39:47  node2
   3   M  323432   2015-10-11 15:39:47  node3
   4   M  323432   2015-10-11 15:39:47  node4
   5   M  323408   2015-10-11 11:46:29  node5
   6   M  323436   2015-10-11 15:42:17  node6
   7   M  323432   2015-10-11 15:39:47  node7

root@node6:~# pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M  323436   2015-10-11 15:42:16  node1
   2   M  323436   2015-10-11 15:42:16  node2
   3   M  323436   2015-10-11 15:42:16  node3
   4   M  323436   2015-10-11 15:42:16  node4
   5   M  323436   2015-10-11 15:42:16  node5
   6   M  323412   2015-10-11 15:42:09  node6
   7   M  323436   2015-10-11 15:42:16  node7

Code:
root@node5:~# pvecm status
Version: 6.2.0
Config Version: 9
Cluster Name: mycluster
Cluster Id: 13573
Cluster Member: Yes
Cluster Generation: 323436
Membership state: Cluster-Member
Nodes: 7
Expected votes: 7
Total votes: 7
Node votes: 1
Quorum: 4  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: node5
Node ID: 5
Multicast addresses: 239.192.53.58 
Node addresses: 192.168.60.5

root@node6:~# pvecm status
Version: 6.2.0
Config Version: 9
Cluster Name: mycluster
Cluster Id: 13573
Cluster Member: Yes
Cluster Generation: 323436
Membership state: Cluster-Member
Nodes: 7
Expected votes: 7
Total votes: 7
Node votes: 1
Quorum: 4  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: node6
Node ID: 6
Multicast addresses: 239.192.53.58 
Node addresses: 192.168.60.6

They can communicate with all the other cluster members, and their date/time are correct.
These are my multicast tests:

Code:
root@node1:~# omping -m 239.192.53.58 node1 node2 node3 node4 node5 node6 node7
[...]
node2 :   unicast, xmt/rcv/%loss = 49/49/0%, min/avg/max/std-dev = 0.084/0.139/0.173/0.026
node2 : multicast, xmt/rcv/%loss = 49/49/0%, min/avg/max/std-dev = 0.094/0.155/0.201/0.028
node3 :   unicast, xmt/rcv/%loss = 49/49/0%, min/avg/max/std-dev = 0.057/0.099/0.150/0.019
node3 : multicast, xmt/rcv/%loss = 49/49/0%, min/avg/max/std-dev = 0.075/0.111/0.163/0.022
node4 :   unicast, xmt/rcv/%loss = 45/45/0%, min/avg/max/std-dev = 0.076/0.120/0.160/0.023
node4 : multicast, xmt/rcv/%loss = 45/45/0%, min/avg/max/std-dev = 0.091/0.135/0.185/0.026
node5 :   unicast, xmt/rcv/%loss = 45/45/0%, min/avg/max/std-dev = 0.080/0.202/0.410/0.053
node5 : multicast, xmt/rcv/%loss = 45/45/0%, min/avg/max/std-dev = 0.090/0.214/0.419/0.051
node6 :   unicast, xmt/rcv/%loss = 44/44/0%, min/avg/max/std-dev = 0.063/0.099/0.214/0.028
node6 : multicast, xmt/rcv/%loss = 44/44/0%, min/avg/max/std-dev = 0.075/0.108/0.219/0.028
node7 :   unicast, xmt/rcv/%loss = 20/20/0%, min/avg/max/std-dev = 0.100/0.141/0.303/0.044
node7 : multicast, xmt/rcv/%loss = 20/20/0%, min/avg/max/std-dev = 0.108/0.158/0.331/0.045

root@node5:~# omping -c 600 -i 1 -q node1 node2 node3 node4 node5 node6 node7
[...]
node1 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.053/0.198/0.471/0.082
node1 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.076/0.201/0.452/0.068
node2 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.056/0.173/1.370/0.088
node2 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.078/0.204/1.396/0.088
node3 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.053/0.167/0.467/0.067
node3 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.075/0.198/0.484/0.068
node4 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.054/0.156/0.582/0.066
node4 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.077/0.170/0.565/0.057
node6 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.061/0.187/3.087/0.141
node6 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.084/0.208/3.113/0.141
node7 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.083/0.205/0.436/0.055
node7 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.102/0.221/0.431/0.050

Could you help me please?

Thank you very much!
Bye
I realized that these two nodes have a high load average even though they apparently don't have any heavy processes running:

Code:
root@node5:/var/lib# w
 18:50:21 up 2 days, 23:26,  2 users,  load average: 8.41, 7.89, 6.41
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/0    192.168.253.6:S. 18:38    8:45   0.00s  0.00s vzctl stop 306
root     pts/2    192.168.253.6    17:21    1.00s  0.35s  0.00s w

"vzctl stop" on one of this node is running since ten minutes without any answer:

Code:
root@node5:~# vzctl stop 306
Stopping container ...

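Since the command hangs without using any CPU, I suppose I should check from another shell whether the vzctl process is stuck in uninterruptible sleep (state D), which as far as I know would point at blocked I/O rather than CPU load (just a sketch of what I mean):

Code:
# show the state of the running vzctl processes and the kernel function they wait in;
# a "D" in the STAT column means uninterruptible sleep (usually stuck I/O)
root@node5:~# ps -o pid,stat,wchan:32,cmd -C vzctl
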
This is the htop output:

Code:
  1  [||||||                                                        7.1%]     5  [|||                                                           2.6%]
  2  [                                                              0.0%]     6  [                                                              0.0%]
  3  [|                                                             0.7%]     7  [                                                              0.0%]
  4  [                                                              0.0%]     8  [|                                                             0.6%]
  Mem[|||||||||||||||||||||||||||||||||||||||||||||||||||  11070/32158MB]     Tasks: 101, 183 thr; 1 running
  Swp[|                                                       38/38143MB]     Load average: 7.03 7.55 6.46 
                                                                              Uptime: 2 days, 23:28:23


    PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
 607065 root       20   0 8725M 8210M  3880 S  5.0 25.5  2:21.78 /usr/bin/kvm -id 405 -chardev socket,id=qmp,path=/var/run/qemu-server/405.qmp,server,now
 614217 root       20   0 24272  2604  1388 R  3.0  0.0  0:00.35 htop
 607086 root       20   0 8725M 8210M  3880 S  2.0 25.5  1:16.56 /usr/bin/kvm -id 405 -chardev socket,id=qmp,path=/var/run/qemu-server/405.qmp,server,now
 607087 root       20   0 8725M 8210M  3880 S  1.0 25.5  0:52.46 /usr/bin/kvm -id 405 -chardev socket,id=qmp,path=/var/run/qemu-server/405.qmp,server,now
 523164 root        0 -20  207M 71792 43864 S  0.0  0.2  2:07.40 corosync -f
 605883 www-data   20   0  286M 64164  5352 S  0.0  0.2  0:12.58 pveproxy worker
 606987 root       20   0 4559M 1630M  3880 S  0.0  5.1  0:51.66 /usr/bin/kvm -id 104 -chardev socket,id=qmp,path=/var/run/qemu-server/104.qmp,server,now
 606967 root       20   0 4559M 1630M  3880 S  0.0  5.1  1:31.62 /usr/bin/kvm -id 104 -chardev socket,id=qmp,path=/var/run/qemu-server/104.qmp,server,now
 606988 root       20   0 4559M 1630M  3880 S  0.0  5.1  0:34.84 /usr/bin/kvm -id 104 -chardev socket,id=qmp,path=/var/run/qemu-server/104.qmp,server,now
      1 root       20   0 10604   772   704 S  0.0  0.0  0:02.55 init [2]
    471 root       20   0 21392  1268   832 S  0.0  0.0  0:00.20 udevd --daemon
   2087 root       20   0 18928   892   708 S  0.0  0.0  0:00.23 /sbin/rpcbind -w
   2104 statd      20   0 27508  1260  1028 S  0.0  0.0  0:00.00 /sbin/rpc.statd
   2127 root       20   0 25252   676   524 S  0.0  0.0  0:00.00 /usr/sbin/rpc.idmapd
   2249 root       20   0  4952   480   432 S  0.0  0.0  0:03.53 /usr/sbin/iscsid
   2250 root       10 -10  5452  3400  2236 S  0.0  0.0  0:14.75 /usr/sbin/iscsid
   2384 root       20   0  122M  2500  1236 S  0.0  0.0  0:02.89 /usr/sbin/rsyslogd -c5
   2385 root       20   0  122M  2500  1236 S  0.0  0.0  0:00.44 /usr/sbin/rsyslogd -c5
   2386 root       20   0  122M  2500  1236 S  0.0  0.0  0:00.47 /usr/sbin/rsyslogd -c5
   2362 root       20   0  122M  2500  1236 S  0.0  0.0  0:03.85 /usr/sbin/rsyslogd -c5
   2376 root       20   0  9368   944   636 S  0.0  0.0  0:06.12 /bin/bash /usr/sbin/ksmtuned
   2383 root       20   0  4052   436   432 S  0.0  0.0  0:00.00 startpar -f -- ksmtuned
   2425 root       20   0 15048   304   252 S  0.0  0.0  0:00.00 /usr/sbin/vzeventd
   2446 daemon     20   0 16628   196   172 S  0.0  0.0  0:00.00 /usr/sbin/atd
   2466 root       20   0 16844   272   268 S  0.0  0.0  0:00.00 /usr/bin/daemon /etc/init.d/megaclisas-statusd check_megaclisas
   2469 root       20   0  4136   728   584 S  0.0  0.0  0:00.91 /bin/sh /etc/init.d/megaclisas-statusd check_megaclisas
   2488 ntp        20   0 39028  2000  1744 S  0.0  0.0  0:08.63 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 106:111
   2525 root       20   0 49888   748   632 S  0.0  0.0  0:00.03 /usr/sbin/sshd
   2567 root       20   0  303M  1584   908 S  0.0  0.0  0:09.56 /usr/bin/rrdcached -l unix:/var/run/rrdcached.sock -j /var/lib/rrdcached/journal/ -F -b
   2568 root       20   0  303M  1584   908 S  0.0  0.0  0:09.57 /usr/bin/rrdcached -l unix:/var/run/rrdcached.sock -j /var/lib/rrdcached/journal/ -F -b
   2569 root       20   0  303M  1584   908 S  0.0  0.0  0:09.52 /usr/bin/rrdcached -l unix:/var/run/rrdcached.sock -j /var/lib/rrdcached/journal/ -F -b

And this is the iotop output:

Code:
Total DISK READ:       0.00 B/s | Total DISK WRITE:      64.69 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                   
  336 be/3 root        0.00 B/s   22.83 K/s  0.00 %  0.03 % [jbd2/dm-0-8]
609302 be/4 root        0.00 B/s    3.81 K/s  0.00 %  0.00 % rrdcached -l unix:/var/run/rrdcached.sock -j ~lib/rrdcached/db/ -B -p /var/run/rrdcached.pid
131072 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % dsm_sa_datamgrd
    1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init [2]
    2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
    3 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/0]
    4 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/0]
    5 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [stopper/0]
    6 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/0]
    7 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/1]
    8 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [stopper/1]
    9 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/1]
   10 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/1]
   11 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/2]
   12 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [stopper/2]
   13 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/2]
   14 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/2]
   15 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/3]
   16 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [stopper/3]
   17 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/3]
   18 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/3]
   19 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/4]
   20 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [stopper/4]
   21 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/4]
   22 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/4]
   23 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/5]
   24 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [stopper/5]
   25 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/5]
   26 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/5]
   27 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/6]
   28 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [stopper/6]
   29 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/6]
   30 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/6]
   31 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/7]
   32 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [stopper/7]
   33 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/7]
   34 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/7]
   35 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/0]
   36 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/1]
   37 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/4]

So I cannot understand what is causing this high load average!

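If I understand correctly, the load average on Linux also counts tasks stuck in uninterruptible sleep (state D), not only tasks using the CPU, so a high load with an almost idle CPU could come from processes blocked on I/O (NFS or iSCSI storage, maybe?). This is roughly what I plan to check next (a rough sketch):

Code:
# list every task currently in uninterruptible sleep (state D)
root@node5:~# ps axo pid,stat,wchan:32,comm | awk '$2 ~ /D/'

# look for hung-task warnings in the kernel log
root@node5:~# dmesg | grep -i "blocked for more than"
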
I see that node5 and node6 come back online randomly.

Do you have any idea?