My cluster seems broken, please help

Ponytech

Here is my setup:

I have two nodes (let's call them node1 and node2) on IP addresses 10.19.82.1 and 10.19.82.2. It used to work nicely.
I now want to add a third node (with IP address 10.19.82.3), but it fails:


Code:
root@node3:~# pvecm add 10.19.82.1
unable to copy ssh ID

root@node3:~# ssh-copy-id 10.19.82.1
cat: write error: Permission denied

root@node1:~# touch /etc/pve/test
touch: cannot touch `/etc/pve/test': Permission denied

It looks like my /etc/pve is read-only, but I can't figure out why.
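
For context: /etc/pve is the pmxcfs cluster filesystem (a FUSE mount backed by the cluster stack), and it goes read-only whenever the node has no quorum or pmxcfs loses its connection to cman/corosync, regardless of the file modes shown by ls. A minimal way to confirm it is the mount state rather than ordinary permissions (a sketch, using only the standard paths):

Code:
# /etc/pve should appear as a fuse mount provided by pmxcfs
mount | grep /etc/pve

# writability follows the cluster state reported here
pvecm status | grep -i quorum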

Status looks OK on both nodes:
Code:
root@node1:~# pvecm status
Version: 6.2.0
Config Version: 8
Cluster Name: ponytech
Cluster Id: 28530
Cluster Member: Yes
Cluster Generation: 62192
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: node1
Node ID: 2
Multicast addresses: 239.192.111.225
Node addresses: 10.19.82.1

root@node2:~# pvecm status
Version: 6.2.0
Config Version: 8
Cluster Name: ponytech
Cluster Id: 28530
Cluster Member: Yes
Cluster Generation: 62192
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: node2
Node ID: 1
Multicast addresses: 239.192.111.225
Node addresses: 10.19.82.2

Unicast is working:
Code:
root@node1:~# ssmpingd

root@node2:# asmping 239.192.111.225 10.19.82.1
asmping joined (S,G) = (*,239.192.111.234)
pinging 10.19.82.1 from 10.19.82.2
  unicast from 10.19.82.1, seq=1 dist=0 time=8.970 ms


root@node3:# asmping 239.192.111.225 10.19.82.1
asmping joined (S,G) = (*,239.192.111.234)
pinging 10.19.82.1 from 10.19.82.3
  unicast from 10.19.82.1, seq=1 dist=0 time=9.123 ms
 ...
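
A hedged note on interpreting this: the "unicast from …" lines only prove the unicast path, while the cman/corosync totem traffic itself is multicast, so IGMP snooping on the switch or a missing querier can break the cluster even when these replies look fine. A common way to exercise the multicast path between all three nodes is omping, run on every node at roughly the same time (the host names below are whatever the nodes resolve each other as):

Code:
# run this simultaneously on node1, node2 and node3
omping -c 60 -i 1 -q node1 node2 node3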

I tried restarting pve-cluster, cman, pvedaemon, pvestatd, and pve-manager. They all restarted fine except cman:

Code:
root@node1# service cman restart
Stopping cluster:
   Stopping dlm_controld...
[FAILED]
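
A possible lead on that stop failure, assuming the standard redhat-cluster/cman userland: dlm_controld typically refuses to stop while a lockspace is still registered (for example by rgmanager, clvmd, or a GFS mount). These read-only checks show what is still holding the fence domain and lockspaces:

Code:
# list active DLM lockspaces
dlm_tool ls

# show fence domain membership
fence_tool ls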

Can anyone help? I've been stuck on this for hours :(
Thanks a lot.
 
Code:
# ls -l /etc/pve
total 4
-r--r----- 1 root www-data  451 May 24  2013 authkey.pub
-r--r----- 1 root www-data  290 Sep 10  2013 cluster.conf
-r--r----- 1 root www-data  238 Sep 10  2013 cluster.conf.old
-r--r----- 1 root www-data   16 May 24  2013 datacenter.cfg
lr-xr-x--- 1 root www-data    0 Jan  1  1970 local -> nodes/node1
dr-xr-x--- 2 root www-data    0 Jan 20  2012 nodes
lr-xr-x--- 1 root www-data    0 Jan  1  1970 openvz -> nodes/node1/openvz
dr-x------ 2 root www-data    0 Jan 20  2012 priv
-r--r----- 1 root www-data 1533 May 24  2013 pve-root-ca.pem
-r--r----- 1 root www-data 1675 May 24  2013 pve-www.key
lr-xr-x--- 1 root www-data    0 Jan  1  1970 qemu-server -> nodes/node1/qemu-server
-r--r----- 1 root www-data  142 May 24  2013 storage.cfg
-r--r----- 1 root www-data  119 Jan 20  2012 vzdump.cron

syslog is full of this error:
Code:
Apr 20 19:12:54 node1 pmxcfs[496585]: [status] crit: cpg_send_message failed: 9

Any ideas on what's wrong?
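
For what it's worth, the number after "cpg_send_message failed:" is a corosync CS error code; 9 should correspond to CS_ERR_BAD_HANDLE in corosync's corotypes.h (worth verifying against the installed headers), i.e. pmxcfs has lost its handle to the cluster engine. Once cman is healthy again, restarting pve-cluster usually re-establishes the connection; a quick re-test afterwards:

Code:
# only once cman is running again
service pve-cluster restart
touch /etc/pve/.writetest && rm /etc/pve/.writetest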
 
Hi,
I guess Dietmar meant the output of "ls -l /etc/pve" on node 3.

Udo
 

Code:
root@node3:~# ls -l /etc/pve
total 3
-rw-r----- 1 root www-data  451 Apr 19 16:59 authkey.pub
-rw-r----- 1 root www-data   16 Apr 19 16:59 datacenter.cfg
lrwxr-x--- 1 root www-data    0 Jan  1  1970 local -> nodes/node3
drwxr-x--- 2 root www-data    0 Jan 20  2012 nodes
lrwxr-x--- 1 root www-data    0 Jan  1  1970 openvz -> nodes/node3/openvz
drwx------ 2 root www-data    0 Jan 20  2012 priv
-rw-r----- 1 root www-data 1350 Apr 19 16:59 pve-root-ca.pem
-rw-r----- 1 root www-data 1679 Apr 19 16:59 pve-www.key
lrwxr-x--- 1 root www-data    0 Jan  1  1970 qemu-server -> nodes/node3/qemu-server
-rw-r----- 1 root www-data  119 Jan 20  2012 vzdump.cron


root@node3:~# pvecm status
cman_tool: Cannot open connection to cman, is it running ?
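
What stands out here is that node3 has no cluster.conf in /etc/pve and cman is not running, so the join never got far enough to pull the cluster configuration; the earlier "cat: write error: Permission denied" from ssh-copy-id is consistent with node1's read-only /etc/pve (on Proxmox VE, root's authorized_keys is kept under /etc/pve/priv). A rough sketch of the retry, once node1 is writable again:

Code:
# from node3: confirm key-based root ssh to node1, then re-run the join
root@node3:~# ssh root@10.19.82.1 true
root@node3:~# pvecm add 10.19.82.1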
 
Quorum = majority.

With two nodes, both need to be online to have quorum.
It is suggested to have a minimum of three nodes in a cluster: with three, two need to be online to have quorum.
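
In vote terms, quorum is total_votes / 2 + 1 (integer division): a 2-node cluster needs 2 votes (both nodes), a 3-node cluster needs 2 of 3. The "Expected votes" / "Total votes" / "Quorum" lines in pvecm status above come from the cman layer, so the same arithmetic can be read directly with, for example:

Code:
# quick view of the vote arithmetic on any member node
cman_tool status | grep -E 'Expected votes|Total votes|Quorum'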

Have you tried rebooting the nodes?


Thanks for the explanation.

I would strongly prefer not to reboot the nodes, as I have live services running on them.


I have tried pvecm expected 1 to reduce the quorum, but /etc/pve is still read-only after that command.

I suspect a network problem between my nodes. Unicast seems to work, but what else can I check? Which ports do the nodes communicate over?
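
As a general pointer (defaults, worth checking against cluster.conf): the corosync/cman totem traffic uses UDP 5404 and 5405 on the multicast address shown in pvecm status, and the join additionally needs root ssh on TCP 22 between the nodes. Watching that traffic on the cluster-facing interface shows whether the nodes actually see each other; the interface name below is just an example:

Code:
# replace vmbr0 with the cluster-facing interface
tcpdump -ni vmbr0 udp port 5404 or udp port 5405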
 
