Cluster offline / cman works / clustat shows all online / all nodes red

If you have IGMP snooping (filtering) somewhere, you can hit the case where omping works but corosync does not.
(Both use different multicast addresses: omping has its own default address, while corosync derives one from the cluster ID — see the edit at the end of this post for the defaults.)

With IGMP snooping enabled, your Cisco switches build one multicast group per multicast address, and each group has its own list of physical ports (show ip igmp snooping groups).
So you can have a working group for omping and a non-working group for corosync.


So the first thing to check is whether corosync works:
# /etc/init.d/cman restart
# pvecm status

Then check on your Cisco switches that the corosync group is OK (show ip igmp snooping groups).

If that's OK, then you can restart
# /etc/init.d/pve-cluster restart   -> after this you should be able to write in /etc/pve/ again

Then finally restart
# /etc/init.d/pvestatd restart   -> your nodes should come back green
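
Putting the steps together, the recovery sequence on an affected node looks roughly like this (just a recap sketch of the commands above, for PVE with cman; order matters):

Code:
# 1. restart the cluster communication layer (cman/corosync) and check quorum/votes
/etc/init.d/cman restart
pvecm status
# 2. restart the cluster filesystem so /etc/pve becomes writable again
/etc/init.d/pve-cluster restart
# 3. restart the status daemon so the nodes turn green in the GUI
/etc/init.d/pvestatd restart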



But first, try disabling IGMP snooping on your switches to see if it helps. (I think you can enable/disable it per VLAN.)
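
For example, on an IOS-style switch it would look roughly like this (VLAN 94 is just the VLAN from my example output further down; adapt to your cluster VLAN):

Code:
conf t
! disable IGMP snooping only on the cluster VLAN
no ip igmp snooping vlan 94
! or disable it globally:
! no ip igmp snooping
end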


I don't know the exact command for your Cisco switches, but on mine it is: show ip igmp snooping groups


You can see the different multicast groups, each with different ports:

Code:
# show ip igmp snooping groups
Vlan      Group                    Type        Version     Port List
-----------------------------------------------------------------------
94        239.192.3.235            igmp        v2          Gi0/3, Gi0/9, 
                                                           Gi0/24
94        239.192.13.22            igmp        v2          Gi0/5, Gi0/7, 
                                                           Gi0/10, Gi0/12, 
                                                           Gi0/24


sw-dist33# show ip igmp snooping groups vlan 94
Vlan      Group                    Type        Version     Port List
-----------------------------------------------------------------------
94        239.192.3.235            igmp        v2          Gi0/3, Gi0/9, 
                                                           Gi0/24
94        239.192.13.22            igmp        v2          Gi0/5, Gi0/7, 
                                                           Gi0/10, Gi0/12, 
                                                           Gi0/24
sw-dist33#



Edit:
Default multicast addresses:

omping: 232.43.211.234

corosync:
By default corosync generates a multicast address of the form 239.192.x.x, where x.x is the 16-bit cluster ID split into two bytes.
You can see corosync's multicast address with "pvecm status".


So for testing, you need to run omping with your corosync multicast address:
"omping -m 239.192.x.x ...."
 
Can you send me an email to crivera [at] gmail . com?

I'd like to discuss some private network information off the public forum, and keep the troubleshooting and solution here on the forum.


####

I've tried to restart the cman service on the nodes and it will not stop.


root@proxmox11:~# pvecm status
Version: 6.2.0
Config Version: 58
Cluster Name: FL-Cluster
Cluster Id: 6836
Cluster Member: Yes
Cluster Generation: 5217712
Membership state: Cluster-Member
Nodes: 11
Expected votes: 11
Total votes: 11
Node votes: 1
Quorum: 6
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: proxmox11
Node ID: 1


I ran:

# conf t
# no ip igmp snooping

so IGMP snooping on the switch is disabled

cisco3750-1.mia.fortatrust.com#show ip igmp snooping groups vlan 805
cisco3750-1.mia.fortatrust.com#show ip igmp snooping groups vlan 801
cisco3750-1.mia.fortatrust.com#
 
Syslogs from node 3


Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9

Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9
Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9
Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9
Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9
Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9
Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9
Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9
Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9
Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9
Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9
Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9
Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9
Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9
Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9
Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9
Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9
Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9
Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9
Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9
Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [status] crit: cpg_send_message failed: 9
Apr 10 12:49:50 proxmox3a pmxcfs[586014]: [dcdb] notice: cpg_join retry 16230
Apr 10 12:49:51 proxmox3a pmxcfs[586014]: [dcdb] notice: cpg_join retry 16240
Apr 10 12:49:52 proxmox3a pmxcfs[586014]: [dcdb] notice: cpg_join retry 16250
Apr 10 12:49:53 proxmox3a pmxcfs[586014]: [dcdb] notice: cpg_join retry 16260
Apr 10 12:49:54 proxmox3a pmxcfs[586014]: [dcdb] notice: cpg_join retry 16270
Apr 10 12:49:55 proxmox3a pmxcfs[586014]: [dcdb] notice: cpg_join retry 16280
Apr 10 12:49:56 proxmox3a pmxcfs[586014]: [dcdb] notice: cpg_join retry 16290
Apr 10 12:49:57 proxmox3a pmxcfs[586014]: [dcdb] notice: cpg_join retry 16300
 
I tried using the multicast address listed in pvecm status:


omping -m 239.192.**.** 64.***.**.113
omping: Can't find local address in arguments


I was able to get this working by using the local IP of the server I am running omping on:


root@proxmox11:~# omping -m 239.192.26.206 63.217.249.142
63.217.249.142 : waiting for response msg
63.217.249.142 : joined (S,G) = (*, 239.192.26.206), pinging
63.217.249.142 : unicast, seq=1, size=69 bytes, dist=0, time=0.017ms
63.217.249.142 : multicast, seq=1, size=69 bytes, dist=0, time=0.034ms
63.217.249.142 : unicast, seq=2, size=69 bytes, dist=0, time=0.048ms
63.217.249.142 : multicast, seq=2, size=69 bytes, dist=0, time=0.056ms
63.217.249.142 : unicast, seq=3, size=69 bytes, dist=0, time=0.076ms


So multicast seems to be working from the node, using the multicast address shown by pvecm status.


Let me know if there are any other things I can do to troubleshoot this.


spirit, sorry, my email address was wrong: crivera305 [at] gmail . com.
 
Looking at the switch port interface, I can see that the multicast counter only increases by 1 every 10-15 seconds, while if I run omping the multicast counter moves by about 10-15 every 5 seconds. I suspect that whatever Proxmox service is supposed to send and receive multicast is what's having the issue.

What would a normal multicast rate per 5-10 seconds be?

As you can see below, show interface reports the same multicast packet count when viewed 5-10 seconds after the first run.


GigabitEthernet1/0/17 is up, line protocol is up (connected)
Hardware is Gigabit Ethernet, address is 001b.8f95.a811 (bia 001b.8f95.a811)
Description: ProxmoxNode_11
MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 1000Mb/s, media type is 10/100/1000BaseTX
Media-type configured as connector
input flow-control is off, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input never, output 00:00:37, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 3326172142
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 443000 bits/sec, 117 packets/sec
5 minute output rate 372000 bits/sec, 118 packets/sec
36909206416 packets input, 7481399660023 bytes, 0 no buffer
Received 3137803299 broadcasts (3137703684 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 3137703684 multicast, 0 pause input
0 input packets with dribble condition detected
39063500759 packets output, 15663360771550 bytes, 0 underruns
0 output errors, 0 collisions, 12 interface resets
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 PAUSE output
0 output buffer failures, 0 output buffers swapped out


cisco3750-1.mia.fortatrust.com#sho interfaces gigabitEthernet 1/0/17
GigabitEthernet1/0/17 is up, line protocol is up (connected)
Hardware is Gigabit Ethernet, address is 001b.8f95.a811 (bia 001b.8f95.a811)
Description: ProxmoxNode_11
MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 1000Mb/s, media type is 10/100/1000BaseTX
Media-type configured as connector
input flow-control is off, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input never, output 00:00:42, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 3326172142
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 445000 bits/sec, 119 packets/sec
5 minute output rate 377000 bits/sec, 120 packets/sec
36909206416 packets input, 7481399660023 bytes, 0 no buffer
Received 3137803299 broadcasts (3137703684 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 3137703684 multicast, 0 pause input
0 input packets with dribble condition detected
39063500759 packets output, 15663360771550 bytes, 0 underruns
0 output errors, 0 collisions, 12 interface resets
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 PAUSE output
0 output buffer failures, 0 output buffers swapped out
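
(One way to cross-check from the node side would be to watch the corosync multicast traffic directly with tcpdump; the interface and address here are just the ones mentioned in this thread:)

Code:
# show corosync multicast packets leaving/arriving on the bridge
tcpdump -n -i vmbr0 host 239.192.26.206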
 
What would cause cman to not stop?

root@poxmox5:~# service cman restart
Stopping cluster:
Stopping dlm_controld...
[FAILED]
 
I think it's not a problem: corosync keeps trying to rejoin the cluster after a multicast break, so restarting cman is only really needed if the corosync daemon has crashed.

If pvecm status shows:
Expected votes: X
Total votes: X

with expected votes equal to total votes, then all your nodes are synced.

Note that you need quorum, (total nodes / 2) + 1 (with your 11 nodes that is 5 + 1 = 6, which matches the "Quorum: 6" in your pvecm status output), before being able to restart pve-cluster:
# /etc/init.d/pve-cluster restart

Also, if you have disabled snooping on the Cisco switch, don't forget to do it on the Linux bridge too:

echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping
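
To make that persist across reboots / network restarts, one option is to hook it into the bridge stanza in /etc/network/interfaces (just a sketch, assuming the bridge is vmbr0):

Code:
auto vmbr0
iface vmbr0 inet static
    # ... your existing address / bridge_ports / etc. lines ...
    post-up echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping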
 
Expected votes: 11
Total votes: 11

Shows all online.

#####

Clustat also shows the same:


root@proxmox2:~# clustat
Cluster Status for FL-Cluster @ Thu Apr 11 11:41:33 2013
Member Status: Quorate


Member Name ID Status
------ ---- ---- ------
proxmox11 1 Online
proxmox2 2 Online, Local
proxmox3a 3 Online
proxmox4 4 Online
poxmox5 5 Online
proxmox6 6 Online
proxmox7 7 Online
proxmox8 8 Online
proxmox9a 9 Online
Proxmox10 10 Online
proxmox12 11 Online


Still, cman is not restartable on any node.

#####

echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping

was completed on node 2 and networking was restarted. The node still shows offline / red, even when accessing the web interface through node 2's IP address.
 
This was completed... I can verify that all unicast / multicast communication between the nodes is working.
 
This was completed... I can verify that all unicast / multicast communication between the nodes is working.

So, with
Expected votes: 11
Total votes: 11

your multicast and corosync are OK, no need to restart cman.


The next thing to try is

# /etc/init.d/pve-cluster restart

and then check, on each node, whether you can open/write a VM config file in /etc/pve/qemu-server/<vmid>.conf.
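
For example, a quick write test on each node (the file name is arbitrary):

Code:
# /etc/pve is the pmxcfs mount; if the cluster filesystem is healthy this should succeed
touch /etc/pve/writetest-$(hostname) && echo writable || echo read-only
rm -f /etc/pve/writetest-$(hostname)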


If that is OK, you need to restart pvestatd:

# /etc/init.d/pvestatd restart

If the nodes are still red after that, then something is hanging pvestatd, and you need to check the logs:

# cat /var/log/daemon.log | grep pvestatd
 
/etc/init.d/pve-cluster restart......done

nano /etc/pve/qemu-server/105.conf...... permission denied

/etc/init.d/pve-cluster stop on all nodes in the cluster...... done
/etc/init.d/pve-cluster start on all nodes in the cluster ...... done

nano /etc/pve/qemu-server/105.conf...... permission denied


At this point I can see the issue is with the pve-cluster shared mount /etc/pve/.


/etc/init.d/pvestatd restart.... done, still red

#####

cat /var/log/daemon.log|grep pvestatd


Apr 11 11:37:40 proxmox2 pvestatd[575642]: status update time (14.731 seconds)
Apr 11 11:38:30 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Transport endpoint is not connected
Apr 11 11:38:30 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:38:30 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:38:30 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:38:30 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:38:30 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:24 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Transport endpoint is not connected
Apr 11 11:39:24 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:24 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:24 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:24 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:24 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:24 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:24 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:24 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:24 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:24 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:24 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:24 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:24 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:24 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:24 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:24 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:24 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:25 proxmox2 pvestatd[575642]: status update time (44.195 seconds)
Apr 11 11:39:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:35 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:35 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:35 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:35 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:35 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 11:39:35 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:52:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Transport endpoint is not connected
Apr 11 12:52:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:52:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:52:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:52:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:52:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:55:55 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Transport endpoint is not connected
Apr 11 12:55:55 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:55:55 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:55:55 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:55:55 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:55:55 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:56:05 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:56:05 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:56:05 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:56:05 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:56:05 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:56:05 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:56:15 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:56:15 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:56:15 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:56:15 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:56:15 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:56:15 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:56:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:56:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:56:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:56:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:56:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:56:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2425 /bin/cat /proc/net/dev' failed: exit code 10
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2477 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2463 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2444 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2431 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2445 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2457 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2430 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2426 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2499 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2448 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2467 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2464 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2432 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2433 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2478 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2427 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2452 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Transport endpoint is not connected
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2441 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2465 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2449 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2425 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2477 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2463 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2444 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2431 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2445 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2457 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2430 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2426 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2499 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2448 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2467 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2464 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2432 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2433 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2478 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2427 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:16 proxmox2 pvestatd[575642]: command '/usr/sbin/vzctl exec 2452 /bin/cat /proc/net/dev' failed: exit code 14
Apr 11 12:57:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 12:57:25 proxmox2 pvestatd[575642]: WARNING: ipcc_send_rec failed: Connection refused
Apr 11 13:00:10 proxmox2 pvestatd[575642]: server closing
Apr 11 13:00:10 proxmox2 pvestatd[315707]: starting server
 
I found another post that seems to be relevant to what I am experiencing:

http://forum.proxmox.com/threads/7836-pvestatd-WARNING-ipcc_send_rec-failed-Connection-refused


I do run backups twice a week, on Sundays and Wednesdays. I'm not sure this is the cause, because I've noticed the downtime happens when the backups are not running, but I wanted to bring it up in case it could be an underlying issue.

When the backups run, we back up the VMs / CTs from all 11 nodes to one large Gluster NFS mount.



Spirit,

Do you offer consulting services? I am looking for a faster way to reach a solution; it has already spanned 3 days... I estimate another 2 days before the cluster magically comes back online. I would like to figure out exactly what is causing this weekly recurrence.
 
I tried to remove a non-working node from the cluster.

root@proxmox2:/etc/pve/qemu-server# pvecm delnode proxmox12
cluster not ready - no quorum?


So while clustat shows all nodes online,
and pvecm status shows all nodes voting,

removing a node from the cluster says the cluster has no quorum...
 
OK, this seems to be clearing up again. This is another part of the cycle that happens right before things start to work again.


I have nodes 10, 9, and 6 online when I pull up those nodes' IPs. On these nodes cman is working and restarting cman is possible. clustat shows:

root@proxmox6:~# clustat
Cluster Status for FL-Cluster @ Thu Apr 11 15:59:15 2013
Member Status: Inquorate


Member Name ID Status
------ ---- ---- ------
proxmox11 1 Offline
proxmox2 2 Offline
proxmox3a 3 Offline
proxmox4 4 Offline
poxmox5 5 Offline
proxmox6 6 Online, Local
proxmox7 7 Online
proxmox8 8 Offline
proxmox9a 9 Online
Proxmox10 10 Online
proxmox12 11 Offline



Node 7 is the only node with just ~5 VMs, so I rebooted it since cman was not working. After the reboot, the node came back online and joined the cluster with the nodes listed above.


#######################

clustat shows all nodes online on the nodes where cman is not restartable:

Cluster Status for FL-Cluster @ Thu Apr 11 16:00:22 2013
Member Status: Quorate


Member Name ID Status
------ ---- ---- ------
proxmox11 1 Online
proxmox2 2 Online
proxmox3a 3 Online, Local
proxmox4 4 Online
poxmox5 5 Online
proxmox6 6 Online
proxmox7 7 Online
proxmox8 8 Online
proxmox9a 9 Online
Proxmox10 10 Online
proxmox12 11 Online


######################

The issue is cman... cman thinks it's running and not crashed, but it's not working. I should always be able to start or stop any service on my own dedicated server.


clustat / cman is reporting the online status from when it was last running, which is not right. I believe this is a bug in the cman service that needs further investigation.


On the nodes where cman will not restart, if I run:

ps -ef | grep coro

kill -9 {coropid}   # this kills the corosync / cman service...

then clustat shows cman is not running,

and service cman start fails.

######

Restarting the node seems to be the fastest solution, but not the most practical... As mentioned before, this is part of the cycle when things start coming back, so if I wait another day or two I should have all nodes back online and green.

1. There was no solution? We know what problems I had, but what was the cause?
2. How do I stop this from happening?
3. Who can we pay to guarantee they can track down and stop this from happening weekly? (This is personally stressful.)
4. Is it possible that when dlm_controld fails to start or stop, this kernel module / daemon is not functioning properly?
5. Is this normal? If the service is force-killed, can I restart it without restarting the node?
6. I believe that if the outputs of clustat / pvecm nodes / pvecm status included a timestamp of the last time the data was refreshed, this issue would be spotted immediately.

Having a timestamp in clustat would easily let me know that, although all nodes show online, the data hasn't been updated in 3 days and there is a problem with cman... just a suggestion.
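
(In the meantime, a crude workaround could be to log clustat with a timestamp periodically, so it's obvious when the status last changed, e.g.:)

Code:
# append a timestamped clustat snapshot every minute
while true; do date; clustat; echo; sleep 60; done >> /var/log/clustat-history.log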
 
Still no solution to a weekly recurring problem.

Any advice?
 
