proxy detected vanished client connection

Hi.
I have a PVE cluster with 7 nodes with the same configuration (same network and same switch), but I have some random problems on two of them (node5 and node6): sometimes they lose communication with the cluster, they become very slow, I cannot enter containers ("vzctl enter CTID" gives no response), and I see these errors in /var/log/syslog on them:

Code:
Oct 11 11:42:27 node5 pveproxy[456134]: ipcc_send_rec failed: Connection refused
Oct 11 11:42:27 node5 pveproxy[456134]: ipcc_send_rec failed: Connection refused
Oct 11 11:42:27 node5 pveproxy[456134]: ipcc_send_rec failed: Connection refused
Oct 11 11:42:27 node5 pveproxy[456134]: ipcc_send_rec failed: Connection refused
Oct 11 11:42:27 node5 pveproxy[456134]: ipcc_send_rec failed: Connection refused
Oct 11 17:17:25 node5 pveproxy[560091]: proxy detected vanished client connection
Oct 11 17:18:56 node5 pveproxy[560091]: proxy detected vanished client connection
Oct 11 17:20:27 node5 pveproxy[568073]: proxy detected vanished client connection
Oct 11 17:21:58 node5 pveproxy[568073]: proxy detected vanished client connection

Code:
Oct 11 17:15:27 node6 pveproxy[4088]: proxy detected vanished client connection
Oct 11 17:16:52 node6 pveproxy[4087]: proxy detected vanished client connection

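From what I read, the "ipcc_send_rec failed: Connection refused" messages mean that pveproxy cannot reach the local cluster filesystem (pmxcfs). Unless you suggest otherwise, my next step would be to check whether pmxcfs is still alive on the affected nodes and to restart the pve-cluster service if it is not (just a rough sketch of what I have in mind):

Code:
# check whether the cluster filesystem process is still running on node5
root@node5:~# ps aux | grep [p]mxcfs

# if it is gone or hung, restart the cluster service and then the web proxy
root@node5:~# service pve-cluster restart
root@node5:~# service pveproxy restart
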
"pvecm nodes" returns me this:

Code:
root@node5:~# pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M  323432   2015-10-11 15:39:47  node1
   2   M  323432   2015-10-11 15:39:47  node2
   3   M  323432   2015-10-11 15:39:47  node3
   4   M  323432   2015-10-11 15:39:47  node4
   5   M  323408   2015-10-11 11:46:29  node5
   6   M  323436   2015-10-11 15:42:17  node6
   7   M  323432   2015-10-11 15:39:47  node7

root@node6:~# pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M  323436   2015-10-11 15:42:16  node1
   2   M  323436   2015-10-11 15:42:16  node2
   3   M  323436   2015-10-11 15:42:16  node3
   4   M  323436   2015-10-11 15:42:16  node4
   5   M  323436   2015-10-11 15:42:16  node5
   6   M  323412   2015-10-11 15:42:09  node6
   7   M  323436   2015-10-11 15:42:16  node7

Code:
root@node5:~# pvecm status
Version: 6.2.0
Config Version: 9
Cluster Name: mycluster
Cluster Id: 13573
Cluster Member: Yes
Cluster Generation: 323436
Membership state: Cluster-Member
Nodes: 7
Expected votes: 7
Total votes: 7
Node votes: 1
Quorum: 4  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: node5
Node ID: 5
Multicast addresses: 239.192.53.58 
Node addresses: 192.168.60.5

root@node6:~# pvecm status
Version: 6.2.0
Config Version: 9
Cluster Name: mycluster
Cluster Id: 13573
Cluster Member: Yes
Cluster Generation: 323436
Membership state: Cluster-Member
Nodes: 7
Expected votes: 7
Total votes: 7
Node votes: 1
Quorum: 4  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: node6
Node ID: 6
Multicast addresses: 239.192.53.58 
Node addresses: 192.168.60.6

They can communicate with all the other cluster members, and their date/time are correct.
These are my multicast tests:

Code:
root@node1:~# omping -m 239.192.53.58 node1 node2 node3 node4 node5 node6 node7
[...]
node2 :   unicast, xmt/rcv/%loss = 49/49/0%, min/avg/max/std-dev = 0.084/0.139/0.173/0.026
node2 : multicast, xmt/rcv/%loss = 49/49/0%, min/avg/max/std-dev = 0.094/0.155/0.201/0.028
node3 :   unicast, xmt/rcv/%loss = 49/49/0%, min/avg/max/std-dev = 0.057/0.099/0.150/0.019
node3 : multicast, xmt/rcv/%loss = 49/49/0%, min/avg/max/std-dev = 0.075/0.111/0.163/0.022
node4 :   unicast, xmt/rcv/%loss = 45/45/0%, min/avg/max/std-dev = 0.076/0.120/0.160/0.023
node4 : multicast, xmt/rcv/%loss = 45/45/0%, min/avg/max/std-dev = 0.091/0.135/0.185/0.026
node5 :   unicast, xmt/rcv/%loss = 45/45/0%, min/avg/max/std-dev = 0.080/0.202/0.410/0.053
node5 : multicast, xmt/rcv/%loss = 45/45/0%, min/avg/max/std-dev = 0.090/0.214/0.419/0.051
node6 :   unicast, xmt/rcv/%loss = 44/44/0%, min/avg/max/std-dev = 0.063/0.099/0.214/0.028
node6 : multicast, xmt/rcv/%loss = 44/44/0%, min/avg/max/std-dev = 0.075/0.108/0.219/0.028
node7 :   unicast, xmt/rcv/%loss = 20/20/0%, min/avg/max/std-dev = 0.100/0.141/0.303/0.044
node7 : multicast, xmt/rcv/%loss = 20/20/0%, min/avg/max/std-dev = 0.108/0.158/0.331/0.045

root@node5:~# omping -c 600 -i 1 -q node1 node2 node3 node4 node5 node6 node7
[...]
node1 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.053/0.198/0.471/0.082
node1 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.076/0.201/0.452/0.068
node2 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.056/0.173/1.370/0.088
node2 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.078/0.204/1.396/0.088
node3 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.053/0.167/0.467/0.067
node3 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.075/0.198/0.484/0.068
node4 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.054/0.156/0.582/0.066
node4 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.077/0.170/0.565/0.057
node6 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.061/0.187/3.087/0.141
node6 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.084/0.208/3.113/0.141
node7 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.083/0.205/0.436/0.055
node7 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.102/0.221/0.431/0.050

Could you help me please?

Thank you very much!
Bye
I realized that these two nodes have a high load average even though they apparently don't have any heavy processes running:

Code:
root@node5:/var/lib# w
 18:50:21 up 2 days, 23:26,  2 users,  load average: 8.41, 7.89, 6.41
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/0    192.168.253.6:S. 18:38    8:45   0.00s  0.00s vzctl stop 306
root     pts/2    192.168.253.6    17:21    1.00s  0.35s  0.00s w

"vzctl stop" on one of this node is running since ten minutes without any answer:

Code:
root@node5:~# vzctl stop 306
Stopping container ...

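Since the command hangs without using any CPU, I suppose I should check from another shell whether the vzctl process is stuck in uninterruptible sleep (state D), which as far as I know would point at blocked I/O rather than CPU load (just a sketch of what I mean):

Code:
# show the state of the running vzctl processes and the kernel function they wait in;
# a "D" in the STAT column means uninterruptible sleep (usually stuck I/O)
root@node5:~# ps -o pid,stat,wchan:32,cmd -C vzctl
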
This is the htop output:

Code:
  1  [||||||                                                        7.1%]     5  [|||                                                           2.6%]
  2  [                                                              0.0%]     6  [                                                              0.0%]
  3  [|                                                             0.7%]     7  [                                                              0.0%]
  4  [                                                              0.0%]     8  [|                                                             0.6%]
  Mem[|||||||||||||||||||||||||||||||||||||||||||||||||||  11070/32158MB]     Tasks: 101, 183 thr; 1 running
  Swp[|                                                       38/38143MB]     Load average: 7.03 7.55 6.46 
                                                                              Uptime: 2 days, 23:28:23


    PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
 607065 root       20   0 8725M 8210M  3880 S  5.0 25.5  2:21.78 /usr/bin/kvm -id 405 -chardev socket,id=qmp,path=/var/run/qemu-server/405.qmp,server,now
 614217 root       20   0 24272  2604  1388 R  3.0  0.0  0:00.35 htop
 607086 root       20   0 8725M 8210M  3880 S  2.0 25.5  1:16.56 /usr/bin/kvm -id 405 -chardev socket,id=qmp,path=/var/run/qemu-server/405.qmp,server,now
 607087 root       20   0 8725M 8210M  3880 S  1.0 25.5  0:52.46 /usr/bin/kvm -id 405 -chardev socket,id=qmp,path=/var/run/qemu-server/405.qmp,server,now
 523164 root        0 -20  207M 71792 43864 S  0.0  0.2  2:07.40 corosync -f
 605883 www-data   20   0  286M 64164  5352 S  0.0  0.2  0:12.58 pveproxy worker
 606987 root       20   0 4559M 1630M  3880 S  0.0  5.1  0:51.66 /usr/bin/kvm -id 104 -chardev socket,id=qmp,path=/var/run/qemu-server/104.qmp,server,now
 606967 root       20   0 4559M 1630M  3880 S  0.0  5.1  1:31.62 /usr/bin/kvm -id 104 -chardev socket,id=qmp,path=/var/run/qemu-server/104.qmp,server,now
 606988 root       20   0 4559M 1630M  3880 S  0.0  5.1  0:34.84 /usr/bin/kvm -id 104 -chardev socket,id=qmp,path=/var/run/qemu-server/104.qmp,server,now
      1 root       20   0 10604   772   704 S  0.0  0.0  0:02.55 init [2]
    471 root       20   0 21392  1268   832 S  0.0  0.0  0:00.20 udevd --daemon
   2087 root       20   0 18928   892   708 S  0.0  0.0  0:00.23 /sbin/rpcbind -w
   2104 statd      20   0 27508  1260  1028 S  0.0  0.0  0:00.00 /sbin/rpc.statd
   2127 root       20   0 25252   676   524 S  0.0  0.0  0:00.00 /usr/sbin/rpc.idmapd
   2249 root       20   0  4952   480   432 S  0.0  0.0  0:03.53 /usr/sbin/iscsid
   2250 root       10 -10  5452  3400  2236 S  0.0  0.0  0:14.75 /usr/sbin/iscsid
   2384 root       20   0  122M  2500  1236 S  0.0  0.0  0:02.89 /usr/sbin/rsyslogd -c5
   2385 root       20   0  122M  2500  1236 S  0.0  0.0  0:00.44 /usr/sbin/rsyslogd -c5
   2386 root       20   0  122M  2500  1236 S  0.0  0.0  0:00.47 /usr/sbin/rsyslogd -c5
   2362 root       20   0  122M  2500  1236 S  0.0  0.0  0:03.85 /usr/sbin/rsyslogd -c5
   2376 root       20   0  9368   944   636 S  0.0  0.0  0:06.12 /bin/bash /usr/sbin/ksmtuned
   2383 root       20   0  4052   436   432 S  0.0  0.0  0:00.00 startpar -f -- ksmtuned
   2425 root       20   0 15048   304   252 S  0.0  0.0  0:00.00 /usr/sbin/vzeventd
   2446 daemon     20   0 16628   196   172 S  0.0  0.0  0:00.00 /usr/sbin/atd
   2466 root       20   0 16844   272   268 S  0.0  0.0  0:00.00 /usr/bin/daemon /etc/init.d/megaclisas-statusd check_megaclisas
   2469 root       20   0  4136   728   584 S  0.0  0.0  0:00.91 /bin/sh /etc/init.d/megaclisas-statusd check_megaclisas
   2488 ntp        20   0 39028  2000  1744 S  0.0  0.0  0:08.63 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 106:111
   2525 root       20   0 49888   748   632 S  0.0  0.0  0:00.03 /usr/sbin/sshd
   2567 root       20   0  303M  1584   908 S  0.0  0.0  0:09.56 /usr/bin/rrdcached -l unix:/var/run/rrdcached.sock -j /var/lib/rrdcached/journal/ -F -b
   2568 root       20   0  303M  1584   908 S  0.0  0.0  0:09.57 /usr/bin/rrdcached -l unix:/var/run/rrdcached.sock -j /var/lib/rrdcached/journal/ -F -b
   2569 root       20   0  303M  1584   908 S  0.0  0.0  0:09.52 /usr/bin/rrdcached -l unix:/var/run/rrdcached.sock -j /var/lib/rrdcached/journal/ -F -b

And this is the iotop output:

Code:
Total DISK READ:       0.00 B/s | Total DISK WRITE:      64.69 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                   
  336 be/3 root        0.00 B/s   22.83 K/s  0.00 %  0.03 % [jbd2/dm-0-8]
609302 be/4 root        0.00 B/s    3.81 K/s  0.00 %  0.00 % rrdcached -l unix:/var/run/rrdcached.sock -j ~lib/rrdcached/db/ -B -p /var/run/rrdcached.pid
131072 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % dsm_sa_datamgrd
    1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init [2]
    2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
    3 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/0]
    4 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/0]
    5 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [stopper/0]
    6 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/0]
    7 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/1]
    8 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [stopper/1]
    9 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/1]
   10 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/1]
   11 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/2]
   12 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [stopper/2]
   13 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/2]
   14 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/2]
   15 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/3]
   16 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [stopper/3]
   17 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/3]
   18 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/3]
   19 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/4]
   20 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [stopper/4]
   21 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/4]
   22 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/4]
   23 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/5]
   24 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [stopper/5]
   25 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/5]
   26 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/5]
   27 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/6]
   28 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [stopper/6]
   29 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/6]
   30 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/6]
   31 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/7]
   32 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [stopper/7]
   33 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/7]
   34 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/7]
   35 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/0]
   36 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/1]
   37 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/4]

So I cannot understand what is causing this high load average!

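If I understand correctly, the load average on Linux also counts tasks stuck in uninterruptible sleep (state D), not only tasks using the CPU, so a high load with an almost idle CPU could come from processes blocked on I/O (NFS or iSCSI storage, maybe?). This is roughly what I plan to check next (a rough sketch):

Code:
# list every task currently in uninterruptible sleep (state D)
root@node5:~# ps axo pid,stat,wchan:32,comm | awk '$2 ~ /D/'

# look for hung-task warnings in the kernel log
root@node5:~# dmesg | grep -i "blocked for more than"
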
I see that node5 and node6 come back online randomly.

Do you have any idea?