Can't work with cluster (PVE 2.1)

eugene

New Member
Feb 15, 2012
Hi.
I just tried to set up a cluster on PVE 2.1 and failed. I don't know what to do now.
Here is the situation:
1) 3 servers: vmserv (10.101.0.10), vmserv1 (10.101.0.11), vmserv2 (10.101.0.12).
2) "vmserv" has VMs; the other servers have none.
3) I started the setup according to the wiki http://pve.proxmox.com/wiki/Proxmox_VE_2.0_Cluster, but something went wrong and I can't finish setting up the cluster. I can't even revert the system to its previous state. I thought that would be possible (e.g. by cleaning up configs), but I can't see a way to do it.

root@vmserv:~# cman_tool nodes -a
Node Sts Inc Joined Name
1 M 12 2012-07-27 03:20:26 vmserv1
Addresses: 10.101.0.10

------- This is wrong, because the IP of "vmserv1" is 10.101.0.11. I can't say how that happened. I can't remove vmserv1 from the cluster, I think because a node can't remove itself. And I can't do anything else about it, because the cluster is not fully configured.

Tried on "vmserv1":

root@vmserv:~# pvecm e 1
root@vmserv:~# pvecm delnode vmserv1
cluster not ready - no quorum?

------- It seems the quorum problem appeared during cluster setup. I tried "service cman restart" and "service pve-cluster restart" on every server; it didn't help.

Here is my /etc/hosts (all servers have the same):

root@vmserv:~# cat /etc/hosts
127.0.0.1 localhost
10.101.0.10 vmserv.atz.dmza.bogus vmserv
10.101.0.11 vmserv1.atz.dmza.bogus vmserv1
10.101.0.12 vmserv1.atz.dmza.bogus vmserv2
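
Since the complaint above is that a node name ended up paired with the wrong address, one possible cross-check for a hosts file like this is to look for names listed under more than one address. The following is only a sketch: `find_dup_hostnames` is a hypothetical helper, not a PVE tool, and it reads hosts-format text on stdin.

```shell
#!/bin/sh
# Hypothetical helper (not part of PVE): read hosts-format text on stdin and
# print every name that is mapped to more than one address.
find_dup_hostnames() {
    awk '
        /^[ \t]*(#|$)/ { next }          # skip comments and blank lines
        {
            sub(/#.*/, "")               # strip trailing comments
            for (i = 2; i <= NF; i++) {
                if (!(($i, $1) in seen)) {
                    seen[$i, $1] = 1
                    addrs[$i] = (addrs[$i] == "" ? $1 : addrs[$i] " " $1)
                    count[$i]++
                }
            }
        }
        END {
            for (name in count)
                if (count[name] > 1) print name ": " addrs[name]
        }
    ' | LC_ALL=C sort
}

# Usage (assumed path): find_dup_hostnames < /etc/hosts
```

Fed the four lines shown above, this prints `vmserv1.atz.dmza.bogus: 10.101.0.11 10.101.0.12`, because that fully qualified name appears on both the .11 and the .12 line. Whether that is a typo in the post or in the actual file cannot be told from the thread.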

Here is my cluster config from "vmserv":
root@vmserv:~# cat /etc/pve/cluster.conf

<?xml version="1.0"?>
<cluster name="cluster" config_version="3">
<cman keyfile="/var/lib/pve-cluster/corosync.authkey">
</cman>
<clusternodes>
<clusternode name="vmserv1" votes="1" nodeid="1"/></clusternodes>
</cluster>




In /var/log/syslog on "vmserv" we can see that it keeps trying to do something with the failed cluster (I think):
Jul 30 10:10:30 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:30 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:30 vmserv pmxcfs[457959]: [dcdb] notice: cpg_join retry 144420
Jul 30 10:10:31 vmserv pmxcfs[457959]: [dcdb] notice: cpg_join retry 144430
Jul 30 10:10:32 vmserv pmxcfs[457959]: [dcdb] notice: cpg_join retry 144440
Jul 30 10:10:33 vmserv dlm_controld[2184]: daemon cpg_leave error retrying
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [dcdb] notice: cpg_join retry 144450
Jul 30 10:10:34 vmserv pmxcfs[457959]: [dcdb] notice: cpg_join retry 144460

In /var/log/syslog at "vmserv1":
Jul 30 10:10:30 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:30 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:30 vmserv pmxcfs[457959]: [dcdb] notice: cpg_join retry 144420
Jul 30 10:10:31 vmserv pmxcfs[457959]: [dcdb] notice: cpg_join retry 144430
Jul 30 10:10:32 vmserv pmxcfs[457959]: [dcdb] notice: cpg_join retry 144440
Jul 30 10:10:33 vmserv dlm_controld[2184]: daemon cpg_leave error retrying
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [status] crit: cpg_send_message failed: 9
Jul 30 10:10:33 vmserv pmxcfs[457959]: [dcdb] notice: cpg_join retry 144450
Jul 30 10:10:34 vmserv pmxcfs[457959]: [dcdb] notice: cpg_join retry 144460
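
Log excerpts like the two above are highly repetitive, so it can help to collapse them into counts per distinct message before comparing nodes. A minimal sketch, assuming the classic "Mon DD HH:MM:SS host ..." syslog layout; `summarize_syslog` is a hypothetical helper name, and the retry counter is replaced by `N` so that successive retries group together.

```shell
#!/bin/sh
# Hypothetical helper: collapse syslog lines read from stdin into
# "count message" pairs, most frequent first.
summarize_syslog() {
    awk '
        {
            msg = $0
            sub(/^[^ ]+ +[^ ]+ +[^ ]+ +[^ ]+ +/, "", msg)   # drop "Mon DD HH:MM:SS host "
            sub(/retry [0-9]+$/, "retry N", msg)            # group successive retries
            n[msg]++
        }
        END { for (m in n) print n[m], m }
    ' | LC_ALL=C sort -k1,1nr -k2
}

# Usage (assumed path): tail -n 200 /var/log/syslog | summarize_syslog
```

Running this on each node makes it easier to see whether both really log the same messages or whether one excerpt was pasted twice.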






The problem is that "vmserv" already has VMs and I'm worried about them, so I decided to ask the community first. What should I do? How can I stop the cluster and put things back as they were before?

Thanks.
 
Does it help if you restart cman and pve-cluster:

# /etc/init.d/cman stop
# /etc/init.d/cman start
# /etc/init.d/pve-cluster restart
 
Your cluster.conf does not show all nodes. Why? Did you join the nodes? How? Why did that fail?

Does your network support IP multicast? Check it; see http://pve.proxmox.com/wiki/Multicast_notes

Thanks for the reply.
1) Not all nodes are listed because I failed to add even one node, so I stopped.
On "vmserv1" I executed "pvecm add vmserv" and got the usual output, which stopped at "Waiting for quorum...", then I got a timeout message. After googling I tried restarting cman and pve-cluster on both servers; it didn't help.

2) Multicast is supported. I checked it using your link and it works fine.
 
# /etc/init.d/cman stop

Stopping cluster:
Stopping dlm_controld...
[FAILED]

# /etc/init.d/cman start

Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... (it takes about 10-15 sec, then became OK) [ OK ]
Unfencing self... [ OK ]


# /etc/init.d/pve-cluster restart
Restarting pve cluster filesystem: pve-cluster.


It didn't help. Anything else?

P.S. I found that the file system "/etc/pve" is read-only on vmserv and vmserv1, so I can't edit any files in it.

root@vmserv:~# touch /etc/pve/1
touch: cannot touch `/etc/pve/1': Permission denied

root@vmserv:~# mount
.........
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,default_permissions,allow_other)
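
The mount line reports `rw` even though writes are refused, so the mount flags alone do not settle the question. One possible explanation, stated here as an assumption rather than a diagnosis, is that the cluster file system rejects writes while it does not consider the node quorate. A small sketch that probes writability by actually creating a file; `dir_is_writable` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: report whether a directory accepts new files by
# creating (and removing) a probe file, instead of trusting the "rw"
# flag in the mount output.
dir_is_writable() {
    probe="$1/write-probe.$$"
    # The subshell keeps a failed redirection from aborting the caller.
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        echo "writable"
    else
        echo "read-only or permission denied"
        return 1
    fi
}

# Usage: dir_is_writable /etc/pve
```

This is the same test as the `touch` above, only wrapped so it can be repeated on every node after each restart attempt.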

Syslog still shows the same messages as in the first post.
 
Are you sure the server didn't fail the file system check of the root file system during boot and therefore mounted the root file system read-only?
 
What is the output of:

# pvecm status

root@vmserv:~# pvecm status
Version: 6.2.0
Config Version: 3
Cluster Name: cluster
Cluster Id: 13364
Cluster Member: Yes
Cluster Generation: 16
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Node votes: 1
Quorum: 1
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: vmserv1
Node ID: 1
Multicast addresses: 239.192.52.104
Node addresses: 10.101.0.10
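
In the status output above, the prompt says `vmserv` while the cluster reports `Node name: vmserv1` with address 10.101.0.10. A sketch for checking that kind of mismatch mechanically; `status_field` and `check_node_name` are hypothetical helper names, and the code only assumes the "Key: value" layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one "Key: value" field from
# `pvecm status` style text read on stdin.
status_field() {
    awk -F': *' -v key="$1" '$1 == key { print $2; exit }'
}

# Hypothetical helper: compare the node name reported by the cluster
# (status text on stdin) with the name the caller expects.
check_node_name() {
    expected="$1"
    actual=$(status_field "Node name")
    if [ "$actual" = "$expected" ]; then
        echo "ok: node name is $actual"
    else
        echo "mismatch: cluster says '$actual', expected '$expected'"
        return 1
    fi
}

# Usage (assumes pvecm is installed): pvecm status | check_node_name "$(hostname)"
```

With the output posted above and an expected name of `vmserv`, this would report a mismatch, which matches the wrong name/address pairing described in the first post.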
 
Yes, I'm sure. The root FS is mounted as usual and I have no problems with it. But /etc/pve is a separate mount point, which I think is set up by "pve-cluster". And as far as I can see, "/var/lib/pve-cluster/config.db-wal" is what "pve-cluster" mounts at /etc/pve.

"/var/lib/pve-cluster/config.db-wal" is mounted via fuse. I would like to stop pve-cluster and maybe try to edit it manually, but I think it is some kind of internal PVE team thing with no public documentation. :)
 
Hmm... I found that I can't create or modify any VMs; I get "Node vmserv seems to be offline!".
And I get "cluster not ready - no quorum?" on the "HA" tab.

Any thoughts?
 
I am not sure, but I think the problem is that I added a node from the main node (my mistake). But how do I fix this?

The sequence of commands was roughly as follows (I don't remember exactly):
1) I copied the ssh public key from vmserv1 to vmserv.

2) Created the cluster: root@vmserv# pvecm create cluster

3) Made a mistake when adding the node to the cluster: I ran the add command on the wrong server (vmserv); it should have been run on vmserv1: root@vmserv# pvecm add vmserv1

4) Then I decided to do things "right" and tried to add vmserv1 again from the right server: root@vmserv1# pvecm add vmserv1 - got a timeout waiting for quorum, and that was it.

So 2 nodes (vmserv and vmserv1) went offline, which means I can't do anything on them: neither create a new VM nor edit an existing one, etc.
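
For comparison with the steps listed above, the intended order can be sketched as a dry run: `pvecm create` once on the first node, then `pvecm add <first node>` on each node that joins (the form already used earlier in this thread on vmserv1). `print_join_plan` is a hypothetical helper that only prints the commands with the host they belong on; it does not execute anything and it is not a recovery procedure for the broken state.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print (do not execute) the join sequence,
# labelled with the host each command is meant to run on.
# Arguments: first node, cluster name, then the nodes that should join.
print_join_plan() {
    first_node="$1"
    cluster_name="$2"
    shift 2
    echo "root@$first_node# pvecm create $cluster_name"
    for node in "$@"; do
        echo "root@$node# pvecm add $first_node"
    done
}

print_join_plan vmserv cluster vmserv1 vmserv2
```

The point of the printout is that the argument to `pvecm add` names an existing cluster member, and the prompt names the joining node, which is the reverse of step 3 above.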


Some kind of "split brain"?
 
Finally got it working. I was able to drop the current cluster config and recreate everything. Now starting to work with the cluster. :)

The topic can be closed.
 
Hi, can you post how you were able to delete /etc/pve/cluster.conf, or any commands you used to delete the cluster?
The man pages are far from helpful. Maybe the Proxmox guys can work on this, as I'm new to Proxmox and the only docs that help are on the web.
Regards,
 
Hi,

I'm sorry to tell you that I no longer use Proxmox. Though it's a great product, I had such a hard time getting the right info that I moved to oVirt. It's not as great as Proxmox in some features, but on the other hand it's evolving fast and the community is very helpful.
Regards,

I have a similar problem. Everything is read-only!? How did you solve this?

You can read about my problem here: http://forum.proxmox.com/threads/17...Totem-is-trying-form-a-cluster-How-can-i-stop
 
I still like Proxmox and know how to handle the most important issues I've had :) So far I can administer this system fine :-)

Here is my complete problem + solution: https://forum.proxmox.com/threads/1...rm-a-cluster-How-can-i-stop?p=91446#post91446

Thanks for the reply. I have already solved this problem; you can check the solution here: https://forum.proxmox.com/threads/9567-help-emergency?p=91447#post91447


 
