My cluster seems broken, please help

Ponytech

Here is my setup:

I have 2 nodes (let's call them node1 and node2) on IP addresses 10.19.82.1 and 10.19.82.2. It used to work nicely.
I now want to add a third node (with IP address 10.19.82.3), but it fails:


Code:
root@node3:~# pvecm add 10.19.82.1
unable to copy ssh ID

root@node3:~# ssh-copy-id 10.19.82.1
cat: write error: Permission denied

root@node1:~# touch /etc/pve/test
touch: cannot touch `/etc/pve/test': Permission denied

It looks like my /etc/pve is read-only, but I can't figure out why.
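
If the cluster filesystem has no quorum it goes read-only, which would also explain the ssh-copy-id failure (root's authorized_keys on a Proxmox node is normally a symlink into /etc/pve/priv). A quick way to check this, sketched for a PVE 2.x/3.x node:

Code:
# /etc/pve is the pmxcfs FUSE mount; it goes read-only without quorum
mount | grep /etc/pve              # should show a fuse mount on /etc/pve
pvecm status | grep -i quorum      # writes are only allowed with quorum
service pve-cluster status         # pmxcfs itself must be running
ls -l /root/.ssh/authorized_keys   # usually a symlink into /etc/pve/priv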

Status looks OK on both nodes:
Code:
root@node1:~# pvecm status
Version: 6.2.0
Config Version: 8
Cluster Name: ponytech
Cluster Id: 28530
Cluster Member: Yes
Cluster Generation: 62192
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: node1
Node ID: 2
Multicast addresses: 239.192.111.225
Node addresses: 10.19.82.1

root@node2:~# pvecm status
Version: 6.2.0
Config Version: 8
Cluster Name: ponytech
Cluster Id: 28530
Cluster Member: Yes
Cluster Generation: 62192
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: node2
Node ID: 1
Multicast addresses: 239.192.111.225
Node addresses: 10.19.82.2

Unicast is working:
Code:
root@node1:~# ssmpingd

root@node2:# asmping 239.192.111.225 10.19.82.1
asmping joined (S,G) = (*,239.192.111.234)
pinging 10.19.82.1 from 10.19.82.2
  unicast from 10.19.82.1, seq=1 dist=0 time=8.970 ms


root@node3:# asmping 239.192.111.225 10.19.82.1
asmping joined (S,G) = (*,239.192.111.234)
pinging 10.19.82.1 from 10.19.82.3
  unicast from 10.19.82.1, seq=1 dist=0 time=9.123 ms
 ...
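
For what it's worth, asmping prints both "unicast from ..." and "multicast from ..." replies when multicast actually gets through, and only unicast replies are visible in the snippets above. If omping is installed, it tests unicast and multicast between all nodes in one go; a sketch (run the same command on every node at roughly the same time):

Code:
# reports unicast and multicast loss between all listed hosts
omping -c 600 -i 1 10.19.82.1 10.19.82.2 10.19.82.3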

I tried restarting pve-cluster, cman, pvedaemon, pvestatd and pve-manager. They all restarted fine except cman:

Code:
root@node1# service cman restart
Stopping cluster:
   Stopping dlm_controld...
[FAILED]
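
For reference, the restart order that is usually suggested for the cman-based stack is to stop pmxcfs first, restart cman underneath it, then bring everything back up. A sketch, not verified on this particular setup:

Code:
service pve-cluster stop      # stop pmxcfs first
service cman restart          # restart the cman/corosync layer
service pve-cluster start     # bring pmxcfs back
service pvedaemon restart
service pvestatd restart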

Can anyone help? I've been stuck on this for hours :(
Thanks a lot.
 
Code:
# ls -l /etc/pve
total 4
-r--r----- 1 root www-data  451 May 24  2013 authkey.pub
-r--r----- 1 root www-data  290 Sep 10  2013 cluster.conf
-r--r----- 1 root www-data  238 Sep 10  2013 cluster.conf.old
-r--r----- 1 root www-data   16 May 24  2013 datacenter.cfg
lr-xr-x--- 1 root www-data    0 Jan  1  1970 local -> nodes/node1
dr-xr-x--- 2 root www-data    0 Jan 20  2012 nodes
lr-xr-x--- 1 root www-data    0 Jan  1  1970 openvz -> nodes/node1/openvz
dr-x------ 2 root www-data    0 Jan 20  2012 priv
-r--r----- 1 root www-data 1533 May 24  2013 pve-root-ca.pem
-r--r----- 1 root www-data 1675 May 24  2013 pve-www.key
lr-xr-x--- 1 root www-data    0 Jan  1  1970 qemu-server -> nodes/node1/qemu-server
-r--r----- 1 root www-data  142 May 24  2013 storage.cfg
-r--r----- 1 root www-data  119 Jan 20  2012 vzdump.cron

syslog is full of this error:
Code:
Apr 20 19:12:54 node1 pmxcfs[496585]: [status] crit: cpg_send_message failed: 9

Any ideas on what's wrong?
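
From what I understand, cpg_send_message failing means pmxcfs can no longer push updates through corosync's CPG interface, which would fit the read-only /etc/pve. A hedged way to check the underlying membership on node1:

Code:
cman_tool status        # quorum and membership as cman sees it
cman_tool nodes         # which nodes have actually joined
corosync-cfgtool -s     # ring status, if the corosync tools are installed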
 
Hi,
I guess Dietmar means the output of "ls -l /etc/pve" on node 3.

Udo
 
Code:
root@node3:~# ls -l /etc/pve
total 3
-rw-r----- 1 root www-data  451 Apr 19 16:59 authkey.pub
-rw-r----- 1 root www-data   16 Apr 19 16:59 datacenter.cfg
lrwxr-x--- 1 root www-data    0 Jan  1  1970 local -> nodes/node3
drwxr-x--- 2 root www-data    0 Jan 20  2012 nodes
lrwxr-x--- 1 root www-data    0 Jan  1  1970 openvz -> nodes/node3/openvz
drwx------ 2 root www-data    0 Jan 20  2012 priv
-rw-r----- 1 root www-data 1350 Apr 19 16:59 pve-root-ca.pem
-rw-r----- 1 root www-data 1679 Apr 19 16:59 pve-www.key
lrwxr-x--- 1 root www-data    0 Jan  1  1970 qemu-server -> nodes/node3/qemu-server
-rw-r----- 1 root www-data  119 Jan 20  2012 vzdump.cron


root@node3:~# pvecm status
cman_tool: Cannot open connection to cman, is it running ?
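
One thing that stands out: node3's /etc/pve is writable (pmxcfs apparently running in local mode) and contains no cluster.conf at all, which fits the failed pvecm add: the join aborted before the cluster configuration was copied over, so cman has nothing to start from there. A quick check, assuming the layout above:

Code:
ls -l /etc/pve/cluster.conf    # not present on node3, unlike node1/node2
pvecm status                   # still complains that cman is not running

Getting node1/node2 quorate (and /etc/pve writable) again looks like the prerequisite for re-running pvecm add on node3.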
 
Quorum = majority.

With two nodes, both need to be online to have quorum.
It is suggested to have a minimum of three nodes in a cluster. With three, only two need to be online to have quorum.
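
In numbers: quorum is a majority of the expected votes, i.e. floor(expected_votes / 2) + 1, so a 2-node cluster needs both votes online while a 3-node cluster still only needs 2 and can lose one node. The relevant lines can be read straight from the status output:

Code:
pvecm status | grep -E 'Expected votes|Total votes|Quorum'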

Have you tried rebooting the nodes?


Thanks for the explanation.

I would strongly prefer not to reboot the nodes, as I have live services running on them.


I have tried pvecm expected 1 to reduce the quorum, but /etc/pve is still read-only after that command.

I suspect a network problem between my nodes. Unicast seems to work, so what else can I look for? Which ports do the nodes communicate over?
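
For reference, with a default setup the cluster traffic is corosync over UDP ports 5404/5405 (multicast, to the address shown by pvecm status), plus SSH on 22 and the web/API proxy on 8006. A sketch of how one might watch for it, assuming eth0 is the interface carrying 10.19.82.0/24:

Code:
# corosync default ports are 5404/5405 UDP unless overridden in cluster.conf
tcpdump -ni eth0 udp and portrange 5404-5405
# check whether a non-default multicast address/port is configured
grep -iE 'multicast|mcast|port' /etc/pve/cluster.conf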