My cluster seems broken, please help

Ponytech

Here is my setup:

I have two nodes (let's call them node1 and node2) on IP addresses 10.19.82.1 and 10.19.82.2. It used to work nicely.
I now want to add a third node (with IP address 10.19.82.3), but it fails:


Code:
root@node3:~# pvecm add 10.19.82.1
unable to copy ssh ID

root@node3:~# ssh-copy-id 10.19.82.1
cat: write error: Permission denied

root@node1:~# touch /etc/pve/test
touch: cannot touch `/etc/pve/test': Permission denied

It looks like my /etc/pve is read-only, but I can't figure out why.
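
For context: /etc/pve is the pmxcfs cluster filesystem (a FUSE mount backed by the cluster stack), and it goes read-only whenever the node has no quorum or pmxcfs loses its connection to cman/corosync, regardless of the file modes shown by ls. A minimal way to confirm it is the mount state rather than ordinary permissions (a sketch, using only the standard paths):

Code:
# /etc/pve should appear as a fuse mount provided by pmxcfs
mount | grep /etc/pve

# writability follows the cluster state reported here
pvecm status | grep -i quorum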

Status looks OK on both nodes:
Code:
root@node1:~# pvecm status
Version: 6.2.0
Config Version: 8
Cluster Name: ponytech
Cluster Id: 28530
Cluster Member: Yes
Cluster Generation: 62192
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: node1
Node ID: 2
Multicast addresses: 239.192.111.225
Node addresses: 10.19.82.1

root@node2:~# pvecm status
Version: 6.2.0
Config Version: 8
Cluster Name: ponytech
Cluster Id: 28530
Cluster Member: Yes
Cluster Generation: 62192
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: node2
Node ID: 1
Multicast addresses: 239.192.111.225
Node addresses: 10.19.82.2

Unicast is working:
Code:
root@node1:~# ssmpingd

root@node2:# asmping 239.192.111.225 10.19.82.1
asmping joined (S,G) = (*,239.192.111.234)
pinging 10.19.82.1 from 10.19.82.2
  unicast from 10.19.82.1, seq=1 dist=0 time=8.970 ms


root@node3:# asmping 239.192.111.225 10.19.82.1
asmping joined (S,G) = (*,239.192.111.234)
pinging 10.19.82.1 from 10.19.82.3
  unicast from 10.19.82.1, seq=1 dist=0 time=9.123 ms
 ...
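
A hedged note on interpreting this: the "unicast from …" lines only prove the unicast path, while the cman/corosync totem traffic itself is multicast, so IGMP snooping on the switch or a missing querier can break the cluster even when these replies look fine. A common way to exercise the multicast path between all three nodes is omping, run on every node at roughly the same time (the host names below are whatever the nodes resolve each other as):

Code:
# run this simultaneously on node1, node2 and node3
omping -c 60 -i 1 -q node1 node2 node3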

I tried restarting pve-cluster, cman, pvedaemon, pvestatd, and pve-manager. They all restarted fine except cman:

Code:
root@node1# service cman restart
Stopping cluster:
   Stopping dlm_controld...
[FAILED]
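
A possible lead on that stop failure, assuming the standard redhat-cluster/cman userland: dlm_controld typically refuses to stop while a lockspace is still registered (for example by rgmanager, clvmd, or a GFS mount). These read-only checks show what is still holding the fence domain and lockspaces:

Code:
# list active DLM lockspaces
dlm_tool ls

# show fence domain membership
fence_tool ls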

Can anyone help? I've been stuck on this for hours :(
Thanks a lot.
 
Code:
# ls -l /etc/pve
total 4
-r--r----- 1 root www-data  451 May 24  2013 authkey.pub
-r--r----- 1 root www-data  290 Sep 10  2013 cluster.conf
-r--r----- 1 root www-data  238 Sep 10  2013 cluster.conf.old
-r--r----- 1 root www-data   16 May 24  2013 datacenter.cfg
lr-xr-x--- 1 root www-data    0 Jan  1  1970 local -> nodes/node1
dr-xr-x--- 2 root www-data    0 Jan 20  2012 nodes
lr-xr-x--- 1 root www-data    0 Jan  1  1970 openvz -> nodes/node1/openvz
dr-x------ 2 root www-data    0 Jan 20  2012 priv
-r--r----- 1 root www-data 1533 May 24  2013 pve-root-ca.pem
-r--r----- 1 root www-data 1675 May 24  2013 pve-www.key
lr-xr-x--- 1 root www-data    0 Jan  1  1970 qemu-server -> nodes/node1/qemu-server
-r--r----- 1 root www-data  142 May 24  2013 storage.cfg
-r--r----- 1 root www-data  119 Jan 20  2012 vzdump.cron

syslog is full of this error:
Code:
Apr 20 19:12:54 node1 pmxcfs[496585]: [status] crit: cpg_send_message failed: 9

Any ideas on what's wrong?
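
For what it's worth, the number after "cpg_send_message failed:" is a corosync CS error code; 9 should correspond to CS_ERR_BAD_HANDLE in corosync's corotypes.h (worth verifying against the installed headers), i.e. pmxcfs has lost its handle to the cluster engine. Once cman is healthy again, restarting pve-cluster usually re-establishes the connection; a quick re-test afterwards:

Code:
# only once cman is running again
service pve-cluster restart
touch /etc/pve/.writetest && rm /etc/pve/.writetest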
 
Hi,
I guess Dietmar meant the output of "ls -l /etc/pve" on node 3.

Udo
 

Code:
root@node3:~# ls -l /etc/pve
total 3
-rw-r----- 1 root www-data  451 Apr 19 16:59 authkey.pub
-rw-r----- 1 root www-data   16 Apr 19 16:59 datacenter.cfg
lrwxr-x--- 1 root www-data    0 Jan  1  1970 local -> nodes/node3
drwxr-x--- 2 root www-data    0 Jan 20  2012 nodes
lrwxr-x--- 1 root www-data    0 Jan  1  1970 openvz -> nodes/node3/openvz
drwx------ 2 root www-data    0 Jan 20  2012 priv
-rw-r----- 1 root www-data 1350 Apr 19 16:59 pve-root-ca.pem
-rw-r----- 1 root www-data 1679 Apr 19 16:59 pve-www.key
lrwxr-x--- 1 root www-data    0 Jan  1  1970 qemu-server -> nodes/node3/qemu-server
-rw-r----- 1 root www-data  119 Jan 20  2012 vzdump.cron


root@node3:~# pvecm status
cman_tool: Cannot open connection to cman, is it running ?
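
What stands out here is that node3 has no cluster.conf in /etc/pve and cman is not running, so the join never got far enough to pull the cluster configuration; the earlier "cat: write error: Permission denied" from ssh-copy-id is consistent with node1's read-only /etc/pve (on Proxmox VE, root's authorized_keys is kept under /etc/pve/priv). A rough sketch of the retry, once node1 is writable again:

Code:
# from node3: confirm key-based root ssh to node1, then re-run the join
root@node3:~# ssh root@10.19.82.1 true
root@node3:~# pvecm add 10.19.82.1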
 
Quorum = majority.

With two nodes, both need to be online to have quorum.
It is suggested to have a minimum of three nodes in a cluster: with three, two need to be online to have quorum.
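
In vote terms, quorum is total_votes / 2 + 1 (integer division): a 2-node cluster needs 2 votes (both nodes), a 3-node cluster needs 2 of 3. The "Expected votes" / "Total votes" / "Quorum" lines in pvecm status above come from the cman layer, so the same arithmetic can be read directly with, for example:

Code:
# quick view of the vote arithmetic on any member node
cman_tool status | grep -E 'Expected votes|Total votes|Quorum'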

Have you tried rebooting the nodes?


Thanks for the explanation.

I would strongly prefer not to reboot the nodes, as I have live services running on them.


I have tried pvecm expected 1 to reduce the quorum, but /etc/pve is still read-only after that command.

I suspect a network problem between my nodes. Unicast seems to work, but what else can I check? Which ports do the nodes communicate over?
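
As a general pointer (defaults, worth checking against cluster.conf): the corosync/cman totem traffic uses UDP 5404 and 5405 on the multicast address shown in pvecm status, and the join additionally needs root ssh on TCP 22 between the nodes. Watching that traffic on the cluster-facing interface shows whether the nodes actually see each other; the interface name below is just an example:

Code:
# replace vmbr0 with the cluster-facing interface
tcpdump -ni vmbr0 udp port 5404 or udp port 5405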
 
