RGManager won't start

jimz0r

New Member
Nov 19, 2013
Hi,

I'm trying to set up a test cluster with 3 nodes in HA with a fencing device.
When I try to activate it on the HA tab I get the following error:

When I go to Node > Services > Start RGManager it says:
Starting Cluster Service Manager: [ OK ]
TASK OK

But the status still says "stopped".

When using /etc/init.d/rgmanager start on the CLI it says the same:



root@Proxmox01:~# /etc/init.d/rgmanager start
Starting Cluster Service Manager: [ OK ]
root@Proxmox01:~#



root@Proxmox01:~# /etc/init.d/rgmanager status
rgmanager is stopped



Everything is configured properly:

- 3 Proxmox VE 3.1 nodes
- Nodes are in a cluster
- Shared storage via NFS

Configured for HA (followed the instructions from the Fencing wiki page):

http://pve.proxmox.com/wiki/Fencing#...g_on_all_nodes

Failover configured via the CLI:

http://pve.proxmox.com/wiki/Fencing#...e_cluster.conf

Failover configuration (cluster.conf):

<?xml version="1.0"?>
<cluster config_version="37" name="pilotfase">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>
    <fencedevice agent="fence_apc" ipaddr="192.168.1.60" login="hpapc" name="apc" passwd="12345678" power_wait="10"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="Proxmox01" nodeid="1" votes="1">
      <fence>
        <method name="power">
          <device name="apc" port="1" secure="on"/>
          <device name="apc" port="2" secure="on"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="Proxmox02" nodeid="2" votes="1">
      <fence>
        <method name="power">
          <device name="apc" port="3" secure="on"/>
          <device name="apc" port="4" secure="on"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="Proxmox03" nodeid="3" votes="1">
      <fence>
        <method name="power">
          <device name="apc" port="5" secure="on"/>
          <device name="apc" port="6" secure="on"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
</cluster>

I hope somebody has an idea how to solve this issue.

Best Regards,
Jimmy
 
What is the output of

# dlm_tool dump

root@Proxmox01:~# dlm_tool dump
1385541785 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/dlm_controld.log
1385541785 dlm_controld 1364188437 started
1385541785 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/dlm_controld.log
1385541785 found /dev/misc/dlm-control minor 56
1385541785 found /dev/misc/dlm-monitor minor 55
1385541785 found /dev/misc/dlm_plock minor 54
1385541785 /dev/misc/dlm-monitor fd 12
1385541785 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
1385541785 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
1385541785 cluster node 1 added seq 4
1385541785 set_configfs_node 1 192.168.1.51 local 1
1385541785 totem/rrp_mode = 'none'
1385541785 set protocol 0
1385541785 group_mode 3 compat 0
1385541785 setup_cpg_daemon 14
1385541785 dlm:controld conf 1 1 0 memb 1 join 1 left
1385541785 set_protocol member_count 1 propose daemon 1.1.1 kernel 1.1.1
1385541785 run protocol from nodeid 1
1385541785 daemon run 1.1.1 max 1.1.1 kernel run 1.1.1 max 1.1.1
1385541785 plocks 16
1385541785 plock cpg message size: 104 bytes
1385541793 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/dlm_controld.log
1385541805 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/dlm_controld.log
1385541807 cluster node 2 added seq 8
1385541807 set_configfs_node 2 192.168.1.53 local 0
1385541819 cluster node 3 added seq 12
1385541819 set_configfs_node 3 192.168.1.52 local 0
1385541820 dlm:controld conf 2 1 0 memb 1 2 join 2 left
1385541823 dlm:controld conf 3 1 0 memb 1 2 3 join 3 left
 
Oh, it seems you did not configure rgmanager - there is no 'rm' section in your cluster config?

Try to add an empty section: "<rm></rm>"
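
For reference, in the cluster.conf above that empty section would sit directly under the <cluster> element; as far as I know you also have to bump config_version and activate the change again (e.g. from the HA tab). Something like:

<cluster config_version="38" name="pilotfase">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>...</fencedevices>        <!-- unchanged from above -->
  <clusternodes>...</clusternodes>        <!-- unchanged from above -->
  <rm></rm>
</cluster>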
 
Oh, it seems you did not configure rgmanager - there is no 'rm' section in your cluster config?

Try to add an empty section: "<rm></rm>"

Thank you very much Dietmar, that solved the issue, but...

I'm trying to test HA by starting 2 VMs (they are managed by HA).
But they won't start; after 10 minutes the output is still:

Executing HA start for VM 100
Member Proxmox01 trying to enable pvevm:100...

Our goal is to get a cluster running like in this YouTube video.
http://www.youtube.com/watch?v=fFn0f_pSd9w

Once we have achieved this goal, we are going to extend it to a virtual learning environment for students.
This environment will be available for approximately 130 students.
I hope you understand the importance of our task.

Best Regards,
Jimmy
 
Any hint in the logs (/var/log/syslog, /var/log/cluster/*)?


Thank you for your reply.

We found out that the KVM option was not enabled in the BIOS settings of one of the servers.

However, we tested by disconnecting a UTP cable from Proxmox02 and it worked, for a while...
After the test we rebooted Proxmox02 and now we have the same problem again as before.

RGManager won't start on Proxmox02 and Proxmox03.

When trying to start VM/HA 107 on Proxmox02 we receive the following output:

Executing HA start for VM 107
Member Proxmox02 trying to enable pvevm:107...Could not connect to resource group manager
TASK ERROR: command 'clusvcadm -e pvevm:107 -m Proxmox02' failed: exit code 1

RGManager only runs on Proxmox01.

When trying to start a VM/HA the output is:

Executing HA start for VM 103
Member Proxmox01 trying to enable pvevm:103...Could not connect to resource group manager
TASK ERROR: command 'clusvcadm -e pvevm:103 -m Proxmox01' failed: exit code 1


/var/log/syslog output

Nov 28 13:36:19 Proxmox01 pvesh: <root@pam> starting task UPID:proxmox01:00000AB0:000019CD:529738C3:startall::root@pam:
Nov 28 13:36:19 Proxmox01 spiceproxy[2737]: starting server
Nov 28 13:36:19 Proxmox01 spiceproxy[2737]: starting 1 worker(s)
Nov 28 13:36:19 Proxmox01 spiceproxy[2737]: worker 2738 started
Nov 28 13:36:26 Proxmox01 kernel: venet0: no IPv6 routers present
Nov 28 13:36:29 Proxmox01 task UPID:proxmox01:00000AB0:000019CD:529738C3:startall::root@pam:: cluster not ready - no quorum?
Nov 28 13:36:29 Proxmox01 pvesh: <root@pam> end task UPID:proxmox01:00000AB0:000019CD:529738C3:startall::root@pam: cluster not ready - no quorum?
Nov 28 13:36:54 Proxmox01 pmxcfs[2348]: [status] notice: received log
Nov 28 13:37:15 Proxmox01 pmxcfs[2348]: [dcdb] crit: received write while not quorate - trigger resync
Nov 28 13:37:15 Proxmox01 pmxcfs[2348]: [dcdb] crit: leaving CPG group
Nov 28 13:37:16 Proxmox01 pmxcfs[2348]: [dcdb] notice: start cluster connection
Nov 28 13:37:16 Proxmox01 pmxcfs[2348]: [dcdb] notice: members: 1/2348, 2/2362, 3/2372
Nov 28 13:37:16 Proxmox01 pmxcfs[2348]: [dcdb] notice: starting data syncronisation
Nov 28 13:37:16 Proxmox01 pmxcfs[2348]: [dcdb] notice: received sync request (epoch 1/2348/00000005)
Nov 28 13:37:16 Proxmox01 pmxcfs[2348]: [dcdb] notice: received all states
Nov 28 13:37:16 Proxmox01 pmxcfs[2348]: [dcdb] notice: leader is 2/2362
Nov 28 13:37:16 Proxmox01 pmxcfs[2348]: [dcdb] notice: synced members: 2/2362, 3/2372
Nov 28 13:37:16 Proxmox01 pmxcfs[2348]: [dcdb] notice: waiting for updates from leader
Nov 28 13:37:16 Proxmox01 pmxcfs[2348]: [dcdb] notice: update complete - trying to commit (got 1 inode updates)
Nov 28 13:37:16 Proxmox01 pmxcfs[2348]: [dcdb] notice: all data is up to date
Nov 28 13:37:28 Proxmox01 pmxcfs[2348]: [status] notice: received log
Nov 28 13:37:41 Proxmox01 pvedaemon[2685]: <root@pam> successful auth for user 'root@pam'
Nov 28 13:37:59 Proxmox01 pvedaemon[2685]: <root@pam> starting task UPID:proxmox01:00000B26:00004107:52973927:hastart:102:root@pam:
Nov 28 13:37:59 Proxmox01 pvedaemon[2854]: command 'clusvcadm -e pvevm:102 -m Proxmox01' failed: exit code 1
Nov 28 13:37:59 Proxmox01 pvedaemon[2685]: <root@pam> end task UPID:proxmox01:00000B26:00004107:52973927:hastart:102:root@pam: command 'clusvcadm -e pvevm:102 -m Proxmox01' failed: exit code 1
Nov 28 13:38:22 Proxmox01 pvedaemon[2685]: <root@pam> starting task UPID:proxmox01:00000B31:000049AC:5297393E:hastart:103:root@pam:
Nov 28 13:38:22 Proxmox01 pvedaemon[2865]: command 'clusvcadm -e pvevm:103 -m Proxmox01' failed: exit code 1
Nov 28 13:38:22 Proxmox01 pvedaemon[2685]: <root@pam> end task UPID:proxmox01:00000B31:000049AC:5297393E:hastart:103:root@pam: command 'clusvcadm -e pvevm:103 -m Proxmox01' failed: exit code

*Update:

After rebooting all nodes RGManager won't start on any node anymore.
However, it still says:
Starting Cluster Service Manager: [ OK ]
TASK OK
 
So nobody has any idea about this RGManager issue?

My guess is that you simply rebooted both nodes and did not take care to stay quorate. So rgmanager did not start because the cluster was not quorate at boot time.

Try the following on the node that does not run rgmanager:

# service cman start
# service rgmanager start
 
Last edited by a moderator:
My guess is that you simply rebooted both nodes and did not take care to stay quorate. So rgmanager did not start because the cluster was not quorate at boot time.

Try the following on the node that does not run rgmanager:

# service start cman
# service start rgmanager

Hello Dietmar,

Today we had our first test day with 25 users working on 38 VMs, and even though we are using old hardware, our test passed.

3x HP ProLiant running Proxmox VE 3.1
1x HP ProLiant running FreeNAS
2x MSA 60 storage enclosures, 24x 500 GB 7.2k disks configured as RAID 5

Now I can focus on HA again because it's a must-have.
When using your commands I get the following error:

root@Proxmox01:~# service start cman
start: unrecognized service
root@Proxmox01:~# service start rgmanager
start: unrecognized service

When using /etc/init.d/ it says it starts, but it doesn't:

root@Proxmox01:/# /etc/init.d/cman start
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Tuning DLM kernel config... [ OK ]
Unfencing self... [ OK ]
root@Proxmox01:/# /etc/init.d/rgmanager start
Starting Cluster Service Manager: [ OK ]
root@Proxmox01:/# /etc/init.d/rgmanager status
rgmanager is stopped


My guess is that you simply rebooted both nodes and did not take care to stay quorate. So rgmanager did not start because the cluster was not quorate at boot time.

You are right, I have indeed rebooted both nodes. Does that mean that rgmanager has crashed and needs to be reinstalled?
 
If you run HA, you need to make sure that you NEVER lose quorum, NEVER.

That means everything needs to be redundant; if you do reboots you need to make sure that enough nodes are still available to form a quorum.

Very important: the cluster network has to be redundant, so use bonding and separate switches.
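
Before and after a planned reboot you can, for example, verify the quorum state from any remaining node:

# pvecm status     (look at the "Quorum" / "Expected votes" lines)
# clustat          (shows cluster members and rgmanager service state)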
 
Can't you see that you ran totally wrong commands?

You wrote him the wrong commands; I corrected your post above so that others don't do the same.
 
Hello Tom,

Thank you very much for your replies.

I used the commands, but got the same results as in my post above when using the other command:

root@Proxmox01:~# service cman start
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Tuning DLM kernel config... [ OK ]
Unfencing self... [ OK ]
root@Proxmox01:~# service rgmanager start
Starting Cluster Service Manager: [ OK ]
root@Proxmox01:~# service rgmanager status
rgmanager is stopped

I guess the only option left is to reinstall all nodes.
Tomorrow we have another test phase 3 test with 15 users.
After the test all nodes will get a reinstall and I will post a new status update after testing the HA. This time without rebooting all nodes after an HA test. ^^
 
what do you get here:

> fence_tool ls
 
what do you get here:

> fence_tool ls

Thanks to you we were able to solve this issue.

I had to join all nodes to the fence domain again (fence_tool join).
Now rgmanager is running again.
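
For reference, on each node that had dropped out of the fence domain this amounted to roughly:

# fence_tool ls              (list the fence domain members)
# fence_tool join            (join the fence domain again)
# service rgmanager start    (in our case rgmanager would not start while the node was outside the fence domain)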

One last question:

When 3 nodes are running, all with VMs in HA, does RGManager need to be running on all nodes or only on one node?
 
Hi, this is an old thread, but I have the same problem today on a test cluster.

#clusvcadm -e testservice
"Could not connect to resource group manager"

rgmanager is started.

using strace:

#strace clusvcadm -e testservice
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4672fb1000
write(1, "Local machine trying to enable s"..., 44Local machine trying to enable service:testservice...) = 44
socket(PF_FILE, SOCK_STREAM, 0) = 5
connect(5, {sa_family=AF_FILE, path="/var/run/cluster/rgmanager.sk"}, 110) = -1 ENOENT (No such file or directory)
close(5) = 0
write(1, "Could not connect to resource gr"..., 44Could not connect to resource group manager
) = 44
exit_group(1)

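The ENOENT on /var/run/cluster/rgmanager.sk suggests rgmanager itself is not actually running even though the init script reports OK; a quick way to confirm, for example:

# pidof rgmanager
# ls -l /var/run/cluster/rgmanager.sk
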
I have seen some bug reports about this on the Red Hat Bugzilla, but haven't found the solution yet.


Another note: if fenced is not started, rgmanager does not start and does not log anything.
I think the init.d script should be improved: after the start, check whether rgmanager is actually running,
and also check that fenced is started.
That can be done cleanly by checking /var/run/rgmanager.pid.
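
A minimal sketch of such a post-start check (pid file path as mentioned above, log locations as suggested earlier in the thread):

# run after "Starting Cluster Service Manager: [ OK ]"
if ! pidof fenced >/dev/null; then
    echo "fenced is not running - rgmanager will not start" >&2
    exit 1
fi
sleep 2    # give rgmanager a moment to daemonize and write its pid file
if [ -f /var/run/rgmanager.pid ] && kill -0 "$(cat /var/run/rgmanager.pid)" 2>/dev/null; then
    echo "rgmanager is running (pid $(cat /var/run/rgmanager.pid))"
else
    echo "rgmanager is not running - check /var/log/syslog and /var/log/cluster/*" >&2
    exit 1
fi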
 
