Can't work with Cluster (ve 2.1)

eugene · Jul 30, 2012

Hi.
I just tried to set up a cluster on PVE 2.1 and failed, and I don't know how to proceed.
Here is the situation:
1) 3 servers (vmserv (10.101.0.10), vmserv1 (10.101.0.11), vmserv2 (10.101.0.12) ).
2) "vmserv" has VMs; the other servers do not.
3) I started the setup according to the wiki http://pve.proxmox.com/wiki/Proxmox_VE_2.0_Cluster . But something went wrong and I can't finish the cluster setup. I can't even revert the system to its previous state. I thought that should be possible (e.g. by cleaning up the configs), but I can't see a way to do it.

root@vmserv:~# cman_tool nodes -a
Node Sts Inc Joined Name
1 M 12 2012-07-27 03:20:26 vmserv1
Addresses: 10.101.0.10

------- This is wrong, because the IP of "vmserv1" is 10.101.0.11. I can't say how that happened. I can't remove vmserv1 from the cluster, I think because a node can't remove itself. And I can't do anything else about it, because the cluster is not fully configured....

Tried at "vmserv1":

root@vmserv:~# pvecm e 1
root@vmserv:~# pvecm delnode vmserv1
cluster not ready - no quorum?

------- It seems the quorum problem appeared during cluster setup. I tried "service cman restart" and "service pve-cluster restart" on every server; it didn't help.

Here is my /etc/hosts (all servers have the same):

root@vmserv:~# cat /etc/hosts
127.0.0.1 localhost
10.101.0.10 vmserv.atz.dmza.bogus vmserv
10.101.0.11 vmserv1.atz.dmza.bogus vmserv1
10.101.0.12 vmserv1.atz.dmza.bogus vmserv2
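
Note that in the listing above the name vmserv1.atz.dmza.bogus appears on both the 10.101.0.11 and the 10.101.0.12 line. The thread does not establish whether this contributed to the problem, but a name that resolves to two addresses is worth ruling out. A minimal sketch of such a check, assuming a POSIX shell with awk; the function name dup_host_names is made up for illustration:

```shell
#!/bin/sh
# Print every host name that is listed under more than one IP address
# in a hosts-format file (default: /etc/hosts).
dup_host_names() {
    awk '
        { sub(/#.*/, "") }          # strip comments
        NF < 2 { next }             # skip blank or address-only lines
        {
            for (i = 2; i <= NF; i++) {
                name = $i
                if (name in ip && ip[name] != $1) {
                    dup[name] = 1   # same name, different address
                } else {
                    ip[name] = $1   # remember the address for this name
                }
            }
        }
        END { for (name in dup) print name }
    ' "${1:-/etc/hosts}" | sort
}

dup_host_names "$@"
```

Run against the file shown above, it would report vmserv1.atz.dmza.bogus; an empty result means no name is bound to two different addresses.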

Here is my cluster config from "vmserv":
root@vmserv:~# cat /etc/pve/cluster.conf

<?xml version="1.0"?>
<cluster name="cluster" config_version="3">
<cman keyfile="/var/lib/pve-cluster/corosync.authkey">
</cman>
<clusternodes>
<clusternode name="vmserv1" votes="1" nodeid="1"/></clusternodes>
</cluster>
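
The configuration above declares a single node, vmserv1, even though the cluster was created on vmserv. A quick way to see which node names a cluster.conf declares is to pull out the name attributes of the clusternode elements. This is only a sketch, assuming a grep that supports -o (GNU and BSD grep do); the function name cluster_nodes is hypothetical and the default path is the one used in this thread:

```shell
#!/bin/sh
# List the node names declared in a cman-style cluster.conf.
# The pattern requires a space after "clusternode", so the enclosing
# <clusternodes> element and the <cluster name="..."> line are ignored.
cluster_nodes() {
    grep -o '<clusternode [^>]*name="[^"]*"' "${1:-/etc/pve/cluster.conf}" |
        sed 's/.*name="\([^"]*\)".*/\1/'
}

cluster_nodes "$@"
```

Comparing this list with the hosts that are supposed to be members shows at a glance which nodes never made it into the configuration.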

In /var/log/syslog at "vmserv" we can see that it is trying to do something on the failed cluster (I think so):
Jul 30 10:10:30 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:30 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:30 vmserv pmxcfs[457959]: [dcdb] notice: cpg_join retry 144420
Jul 30 10:10:31 vmserv pmxcfs[457959]: [dcdb] notice: cpg_join retry 144430
Jul 30 10:10:32 vmserv pmxcfs[457959]: [dcdb] notice: cpg_join retry 144440
Jul 30 10:10:33 vmserv dlm_controld[2184]: daemon cpg_leave error retrying
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [dcdb] notice: cpg_join retry 144450
Jul 30 10:10:34 vmserv pmxcfs[457959]: [dcdb] notice: cpg_join retry 144460

In /var/log/syslog at "vmserv1":
Jul 30 10:10:30 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:30 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:30 vmserv pmxcfs[457959]: [dcdb] notice: cpg_join retry 144420
Jul 30 10:10:31 vmserv pmxcfs[457959]: [dcdb] notice: cpg_join retry 144430
Jul 30 10:10:32 vmserv pmxcfs[457959]: [dcdb] notice: cpg_join retry 144440
Jul 30 10:10:33 vmserv dlm_controld[2184]: daemon cpg_leave error retrying
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [dcdb] notice: cpg_join retry 144450
Jul 30 10:10:34 vmserv pmxcfs[457959]: [dcdb] notice: cpg_join retry 144460

The problem is that "vmserv" already has VMs and I worry about them, so I decided to ask the community first. What should I do? How can I stop the cluster and put things back as they were before?

Thanks.

tom · Jul 30, 2012

Your cluster.conf does not show all nodes - why? Did you join the nodes? How? Why does this fail?

Does your network support IP multicast? Check it; see http://pve.proxmox.com/wiki/Multicast_notes

dietmar · Jul 31, 2012

Does it help if you restart cman and pve-cluster:

# /etc/init.d/cman stop
# /etc/init.d/cman start
# /etc/init.d/pve-cluster restart

eugene · Jul 31, 2012

tom said:
Your cluster.conf does not show all nodes - why? Did you join the nodes? How? Why does this fail?

Does your network support IP multicast? Check it; see http://pve.proxmox.com/wiki/Multicast_notes

Thanks for the reply.
1) Not all nodes are listed because I failed to add even one node, so I stopped.
On "vmserv1" I executed "pvecm add vmserv" and got the usual output, which stopped at "Waiting for quorum...", followed by a timeout message. After googling, I tried restarting cman and pve-cluster on both servers; it didn't help.

2) Multicast is supported. Checked with your link, works fine.

eugene · Jul 31, 2012

dietmar said:
Does it help if you restart cman and pve-cluster:

# /etc/init.d/cman stop
# /etc/init.d/cman start
# /etc/init.d/pve-cluster restart

# /etc/init.d/cman stop

Stopping cluster:
Stopping dlm_controld...
[FAILED]

# /etc/init.d/cman start

Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... (it takes about 10-15 sec, then became OK) [ OK ]
Unfencing self... [ OK ]

# /etc/init.d/pve-cluster restart
Restarting pve cluster filesystem: pve-cluster.

It didn't help. Anything else?

P.S. I found that file system "/etc/pve" is read only on vmserv and vmserv1, so I can't edit any files in it.

root@vmserv:~# touch /etc/pve/1
touch: cannot touch `/etc/pve/1': Permission denied

root@vmserv:~# mount
.........
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,default_permissions,allow_other)
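
The mount line reports rw while touch is refused, so the mount flags alone do not tell whether /etc/pve currently accepts writes. As far as I know, the pve-cluster filesystem refuses writes while the node has no quorum, which would fit the "no quorum?" message earlier in this thread, but treat that as an assumption. A minimal sketch of a direct write probe; the function name can_write_dir and the probe file name are invented for this example:

```shell
#!/bin/sh
# Try to create and remove a small probe file in a directory
# (default: /etc/pve) and report whether the write was accepted.
can_write_dir() {
    dir="${1:-/etc/pve}"
    probe="$dir/.write-probe.$$"
    # The subshell keeps a failed redirection from aborting the script
    # and lets its error message be discarded.
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        echo "writable: $dir"
        return 0
    fi
    echo "not writable: $dir"
    return 1
}
```

Calling can_write_dir /etc/pve before and after a service restart gives a quick yes/no answer without reading the mount table.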

Syslog still shows the same messages as in the first post.

dietmar · Jul 31, 2012

What is the output of:

# pvecm status

mir · Jul 31, 2012

eugene said:
P.S. I found that file system "/etc/pve" is read only on vmserv and vmserv1, so I can't edit any files in it.

root@vmserv:~# touch /etc/pve/1
touch: cannot touch `/etc/pve/1': Permission denied

root@vmserv:~# mount
.........
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,default_permissions,allow_other)

Syslog still shows the same messages as in the first post.

Are you sure the server didn't fail the file system check of the root file system during boot and therefore mount the root file system read-only?

eugene · Jul 31, 2012

root@vmserv:~# pvecm status
Version: 6.2.0
Config Version: 3
Cluster Name: cluster
Cluster Id: 13364
Cluster Member: Yes
Cluster Generation: 16
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Node votes: 1
Quorum: 1
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: vmserv1
Node ID: 1
Multicast addresses: 239.192.52.104
Node addresses: 10.101.0.10
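
The figures above are consistent with a simple majority rule: with "Expected votes: 1" a node needs only its own vote, which matches "Quorum: 1". A small sketch of that arithmetic, assuming quorum is floor(expected_votes / 2) + 1 (an assumption about how the value is derived, not something confirmed in this thread); the function name quorum_for is hypothetical:

```shell
#!/bin/sh
# Majority quorum for a given number of expected votes.
# Assumption: quorum = floor(expected_votes / 2) + 1.
quorum_for() {
    echo $(( $1 / 2 + 1 ))
}

quorum_for 1   # single-node state shown above: prints 1
quorum_for 3   # three nodes with one vote each: prints 2
```

Under this rule a three-node cluster with one vote per node keeps quorum with one node down, while a node that sees only itself among three expected votes does not.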

eugene · Jul 31, 2012

dietmar said:
What is the output of:

# pvecm status

root@vmserv:~# pvecm status
Version: 6.2.0
Config Version: 3
Cluster Name: cluster
Cluster Id: 13364
Cluster Member: Yes
Cluster Generation: 16
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Node votes: 1
Quorum: 1
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: vmserv1
Node ID: 1
Multicast addresses: 239.192.52.104
Node addresses: 10.101.0.10

eugene · Jul 31, 2012

mir said:
Are you sure the server didn't fail the file system check of the root file system during boot and therefore mount the root file system read-only?

Yes. The root FS is mounted as usual and I have no problems with it. But /etc/pve is a separate mount point, which is set up by "pve-cluster", I think. And as far as I can see, "/var/lib/pve-cluster/config.db-wal" is the disk that "pve-cluster" mounts on /etc/pve.

"/var/lib/pve-cluster/config.db-wal" is mounted via fuse. I would like to stop pve-cluster and maybe try to edit it manually, but I think it is some kind of internal PVE team thing that has no public documentation.

eugene · Jul 31, 2012

Mmmmmm.... I found that I can't create or modify any VMs; I get "Node vmserv seems to be offline!".
And I get "cluster not ready - no quorum?" on the "HA" tab.

Any thoughts?

eugene · Jul 31, 2012

I am not sure, but I think the problem is that I added a node from the main node (my mistake). But how do I fix this?

The sequence of commands was roughly as follows (I don't remember exactly):
1) I copied the ssh public key from vmserv1 to vmserv.

2) Created the cluster: root@vmserv# pvecm create cluster

3) Made a mistake when adding the node to the cluster: I ran the add command on the wrong server (vmserv); it should have been run on vmserv1: root@vmserv# pvecm add vmserv1

4) Then decided to do things "right" and tried to add vmserv1 again from the right server: root@vmserv1# pvecm add vmserv1 - got a timeout waiting for quorum, and that was it.

So both nodes (vmserv and vmserv1) went offline, meaning I can't do anything on them: neither create a new VM nor edit an existing one, etc.

Some kind of "split brain"?

eugene · Jul 31, 2012

Finally got it working. I was able to drop the current cluster config and recreate everything from scratch. I'm now starting to work with the cluster.

The topic can be closed.

jplorier · Oct 17, 2012

eugene said:
Finally got it working. I was able to drop the current cluster config and recreate everything from scratch. I'm now starting to work with the cluster.

The topic can be closed.

Hi, can you post how you were able to delete /etc/pve/cluster.conf, or any commands you used to delete the cluster?
The man pages are far from helpful; maybe the Proxmox guys can work this out, as I'm new to Proxmox and the only docs that help are on the web.
Regards,

ictdude · Mar 3, 2014

eugene said:
Finally got it working. I was able to drop the current cluster config and recreate everything from scratch. I'm now starting to work with the cluster.

The topic can be closed.

I have a similar problem. Everything is read-only!? How did you solve this?

You can read about my problem here: http://forum.proxmox.com/threads/17...Totem-is-trying-form-a-cluster-How-can-i-stop

jplorier · Mar 5, 2014

Hi,

I'm sorry to tell you that I no longer use Proxmox. Though it's a great product, I had such a hard time getting the right info that I moved to oVirt. It's not as great as Proxmox in some features, but on the other hand it's evolving very fast and the community is very helpful.
Regards,

ictdude said:
I have a similar problem. Everything is read-only!? How did you solve this?

You can read about my problem here: http://forum.proxmox.com/threads/17...Totem-is-trying-form-a-cluster-How-can-i-stop

ictdude · Mar 5, 2014

jplorier said:
Hi,

I'm sorry to tell you that I no longer use Proxmox. Though it's a great product, I had such a hard time getting the right info that I moved to oVirt. It's not as great as Proxmox in some features, but on the other hand it's evolving very fast and the community is very helpful.

Regards,

I still like Proxmox and know how to handle the most important issues I have had.

So far I can administer this system fine :-)

Here is my complete problem + solution: https://forum.proxmox.com/threads/1...rm-a-cluster-How-can-i-stop?p=91446#post91446

Thanks for the reply. I have already solved this problem; you can check the solution here: https://forum.proxmox.com/threads/9567-help-emergency?p=91447#post91447
