rgmanager: group join failed -1 -1

JonB · Jul 26, 2012

Hi

I have some trouble with my Proxmox VE 2.13 installation. Guests are not started at boot, and they do not start when I start them manually either. The error message when starting manually is:

Code:

Executing HA start for CT 100
Member dkproxmoxve1 trying to enable pvevm:100...Could not connect to resource group manager
TASK ERROR: command 'clusvcadm -e pvevm:100 -m dkproxmoxve1' failed: exit code 1

I just started with Proxmox VE so I only have one machine, planned to migrate more today, but ...

Using the web interface, I found the host name in the left column and then opened the Services tab. There I noticed that the resource group manager daemon was stopped, and starting it from there did not help. So I logged in over SSH and ran /etc/init.d/rgmanager start. It reported OK, but running it again with status shows it is stopped. No rgmanager process shows up in ps aux.

uname -a gives: Linux proxmoxve1 2.6.32-13-pve #1 SMP Mon Jul 9 08:39:20 CEST 2012 x86_64 GNU/Linux

On Wednesday 2012-07-25 I ran apt-get update && apt-get upgrade, and it installed the new kernel pve-kernel-2.6.32-13-pve plus some other packages I don't remember. It also removed version 12, but 11 is still installed? This resulted in some boot problems: I got the grub rescue prompt, but after booting from systemrescuecd I managed to get it booting again.

On Thursday 2012-07-26 I first removed some packages that deborphan said were not in use (they were removed with --purge):

libgssrpc4
libkdb5-4
libxi6
libkadm5srv-mit7
libkadm5clnt-mit7
libxtst6

Later I reinstalled them because rgmanager was not working. I do not think removing them was the problem, though, because grepping for rgmanager in /var/log shows entries from Wednesday 2012-07-25:

Code:

syslog.1: Jul 25 15:29:25 proxmoxve1 kernel: dlm: rgmanager: group join failed -1 -1
syslog:   Jul 26 11:17:26 proxmoxve1 kernel: dlm: rgmanager: group join failed -1 -1
syslog:   Jul 26 10:58:04 proxmoxve1 kernel: dlm: rgmanager: group join failed -1 -1
syslog:   Jul 26 10:56:50 proxmoxve1 pvedaemon[2813]: starting service rgmanager: UPID:proxmoxve1:00000AFD:0000BA0B:50110652:srvstart:rgmanager:root@pam:
syslog:   Jul 26 10:56:50 proxmoxve1 pvedaemon[2585]: <root@pam> starting task UPID:proxmoxve1:00000AFD:0000BA0B:50110652:srvstart:rgmanager:root@pam:
syslog:   Jul 26 10:56:50 proxmoxve1 pvedaemon[2585]: <root@pam> end task UPID:proxmoxve1:00000AFD:0000BA0B:50110652:srvstart:rgmanager:root@pam: OK
syslog:   Jul 26 10:56:50 proxmoxve1 kernel: dlm: rgmanager: group join failed -1 -1
syslog:   Jul 26 10:49:09 proxmoxve1 kernel: dlm: rgmanager: group join failed -1 -1
syslog:   Jul 26 10:44:06 proxmoxve1 kernel: dlm: rgmanager: group join failed -1 -1
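For anyone who wants to check their own logs for the same kernel message, here is a rough sketch that counts these lines per day. The function name count_dlm_join_failures is made up for illustration; it only wraps grep and awk, and it assumes the classic syslog line format shown above (month, day, time, host, ...).

```shell
#!/bin/sh
# Hypothetical helper (not part of Proxmox): count the kernel's
# "dlm: rgmanager: group join failed" messages per day.
# Assumes the traditional syslog format: "Mon DD HH:MM:SS host ...".
count_dlm_join_failures() {
    # "$@": one or more syslog files, e.g. /var/log/syslog /var/log/syslog.1
    # -h drops the "filename:" prefix grep adds when given several files.
    grep -h 'dlm: rgmanager: group join failed' "$@" 2>/dev/null |
        awk '{ n[$1 " " $2]++ } END { for (d in n) print d, n[d] }' |
        sort
}

# Example call (paths as used on this box):
# count_dlm_join_failures /var/log/syslog /var/log/syslog.1
```

Each output line is a date followed by the number of failed joins seen that day, which makes it easier to see when the problem first appeared.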

I did not notice the error on Wednesday 2012-07-25 because I was working on something else, and since the virtual guest machine is not in production, nagios does not monitor it.

On Thursday 2012-07-26 I also upgraded or installed libpve-storage-perl when running apt-get update && apt-get upgrade.

If I manually use vzlist -S and vzctl start <ctid> then the virtual guest machine does start, but I prefer using PVE tools.

If I run pvecm status in the console I get:

Code:

Version: 6.2.0
Config Version: 2
Cluster Name: CPH-PVE
Cluster Id: 8921
Cluster Member: Yes
Cluster Generation: 16
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Node votes: 1
Quorum: 1  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: proxmoxve1
Node ID: 1
Multicast addresses: 239.192.34.251 
Node addresses: 192.168.12.211

Running dpkg -l | grep pve gives me:

Code:

ii  clvm                            2.02.95-1pve2                Cluster LVM Daemon for lvm2
ii  corosync-pve                    1.4.3-1                      Standards-based cluster framework (daemon and modules)
ii  dmsetup                         2:1.02.74-1pve2              Linux Kernel Device Mapper userspace library
ii  fence-agents-pve                3.1.8-1                      fence agents for redhat cluster suite
ii  libcorosync4-pve                1.4.3-1                      Standards-based cluster framework (libraries)
ii  libdevmapper1.02.1              2:1.02.74-1pve2              Linux Kernel Device Mapper userspace library
ii  libopenais3-pve                 1.1.4-2                      Standards-based cluster framework (libraries)
ii  libpve-access-control           1.0-24                       Proxmox VE access control library
ii  libpve-common-perl              1.0-28                       Proxmox VE base library
ii  libpve-storage-perl             2.0-27                       Proxmox VE storage management library
ii  lvm2                            2.02.95-1pve2                Linux Logical Volume Manager
ii  openais-pve                     1.1.4-2                      Standards-based cluster framework (daemon and modules)
ii  pve-cluster                     1.0-27                       Cluster Infrastructure for Proxmox Virtual Environment
ii  pve-firmware                    1.0-17                       Binary firmware code for the pve-kernel
ii  pve-kernel-2.6.32-11-pve        2.6.32-66                    The Proxmox PVE Kernel Image
ii  pve-kernel-2.6.32-13-pve        2.6.32-72                    The Proxmox PVE Kernel Image
ii  pve-manager                     2.1-12                       The Proxmox Virtual Environment
ii  pve-qemu-kvm                    1.1-6                        Full virtualization on x86 hardware
ii  redhat-cluster-pve              3.1.92-2                     Red Hat cluster suite
ii  resource-agents-pve             3.9.2-3                      resource agents for redhat cluster suite
ii  vzctl                           3.0.30-2pve5                 OpenVZ - server virtualization solution - control tools

tom · Jul 26, 2012

Did you follow the guide exactly? See http://forum.proxmox.com/threads/10408-KVM-1-1-and-new-Kernel

JonB · Jul 26, 2012

I did not expect to have to follow any guide beyond what apt-get upgrade itself asks me about, that is, whatever it cannot configure automatically from the existing configuration.

I do not think I have sheepdog installed, but ceph is there:

Code:

ii  ceph-common                     0.48argonaut-1~bpo60+1       common utilities to mount and interact with a ceph storage cluster

I do not use fencing, so I never configured that.

I do not remember being asked about anything by redhat-cluster-pve.

My pveversion -v output is slightly different. There are two differences. The first is that I have this line:

Code:

pve-kernel-2.6.32-11-pve: 2.6.32-66

but I am running

Code:

running kernel: 2.6.32-13-pve

The second difference between my box and the one in that other forum thread is:

Code:

--- forum.txt    2012-07-26 13:00:31.273373834 +0200
+++ local.txt    2012-07-26 13:00:44.977372398 +0200
@@ -15,7 +16,7 @@
 pve-firmware: 1.0-17
 libpve-common-perl: 1.0-28
 libpve-access-control: 1.0-24
-libpve-storage-perl: 2.0-26
+libpve-storage-perl: 2.0-27
 vncterm: 1.0-2
 vzctl: 3.0.30-2pve5
 vzprocps: 2.0.11-2

I tried the pveam update that Chris Rivera suggested: no output, exit code 0, but it did not help.

I have attached a strace output below. Looking through /etc/init.d/rgmanager I checked the following first:

Code:

root@proxmoxve1:/home/jonbendtsen# ls -la /etc/cluster/cluster.conf
-rw-r----- 1 root root 283 Jul  3 13:17 /etc/cluster/cluster.conf

Unchanged since installation, I think.

Code:

root@proxmoxve1:/home/jonbendtsen# ls -la /etc/default/redhat-cluster-pve
-rw-r--r-- 1 root root 140 Jun 13 16:36 /etc/default/redhat-cluster-pve
root@proxmoxve1:/home/jonbendtsen# cat /etc/default/redhat-cluster-pve 
# this file is sourced by the following init scripts:
# /etc/init.d/cpglockd
# /etc/init.d/cman
# /etc/init.d/rgmanager

# FENCE_JOIN="yes"

Code:

root@proxmoxve1:/home/jonbendtsen# ls -la /var/lock/
total 0
drwxrwxrwt  5 root     root 116 Jul 26 13:03 .
drwxr-xr-x 14 root     root 138 Jun 26 16:16 ..
drwxr-xr-x  2 www-data root   6 Jun 21 15:36 apache2
-rw-r-----  1 root     root   0 Jul 26 13:03 aptitude
-rw-r--r--  1 root     root   0 Jul 26 10:49 cman
drw-------  2 root     root  17 Jul 26 10:49 iscsi
drwxr-xr-x  2 root     root   6 Jun 21 15:33 qemu-server
-rw-r--r--  1 root     root   0 Jul 26 13:03 rgmanager
-rw-r--r--  1 root     root   0 Jul 26 10:49 vz
-rw-r--r--  1 root     root   0 Jul 26 10:49 vzeventd

Code:

root@proxmoxve1:/home/jonbendtsen# corosync-objctl 2>/dev/null |grep ringnumber | wc -l
1
root@proxmoxve1:/home/jonbendtsen# corosync-objctl 2>/dev/null |grep ringnumber 
totem.interface.ringnumber=0
root@proxmoxve1:/home/jonbendtsen# ps ax | grep cpglock
   7509 pts/3    S+     0:00 grep cpglock

Code:

root@proxmoxve1:/home/jonbendtsen# ccs_tool query /cluster/rm >/dev/null 2>&1 ; echo $?
0

Code:

root@proxmoxve1:/home/jonbendtsen# ls -la /var/run/cluster/
total 4
drwxr-xr-x 2 root root    6 Jun 26 18:01 .
drwxr-xr-x 9 root root 4096 Jul 26 10:49 ..

Code:

root@proxmoxve1:/etc# grep -irl RGMGR_OPTS *
init.d/rgmanager
rc0.d/K01rgmanager
rc1.d/K01rgmanager
rc2.d/S21rgmanager
rc3.d/S21rgmanager
rc4.d/S21rgmanager
rc5.d/S21rgmanager
rc6.d/K01rgmanager

Code:

root@proxmoxve1:/tmp/rgmanager.strace# strace -ff -o rgmanger.strace /usr/sbin/rgmanager
root@proxmoxve1:/tmp/rgmanager.strace# ls -la
total 472
drwxr-xr-x 2 root root    114 Jul 26 13:25 .
drwxrwxrwt 5 root root    118 Jul 26 13:24 ..
-rw-r--r-- 1 root root 234494 Jul 26 13:25 rgmanger.strace.7754
-rw-r--r-- 1 root root  10369 Jul 26 13:25 rgmanger.strace.7758
-rw-r--r-- 1 root root 226553 Jul 26 13:25 rgmanger.strace.7759
-rw-r--r-- 1 root root     93 Jul 26 13:25 rgmanger.strace.7760

View attachment jons_rgmanger_strace.zip

JonB · Jul 26, 2012

JonB said:
I tried the pveam update that Chris Rivera suggested: no output, exit code 0, but it did not help.

It seems like it did something in the background.

Code:

Jul 26 13:03:36 starting update
[GNUPG:] IMPORT_OK 0 9ABD7E02AD243AD3C2FBBCCCB0C1CC225CAC72FE
[GNUPG:] IMPORT_RES 1 0 0 0 1 0 0 0 0 0 0 0 0 0
[GNUPG:] IMPORT_OK 0 694CFF26795A29BAE07B4EB585C25E95A16EB94D
[GNUPG:] IMPORT_RES 1 0 0 0 1 0 0 0 0 0 0 0 0 0
Jul 26 13:03:36 start download http://download.proxmox.com/appliances/aplinfo.dat.asc
Jul 26 13:03:36 download finished: 200 OK
Jul 26 13:03:36 start download http://download.proxmox.com/appliances/aplinfo.dat.gz
Jul 26 13:03:36 download finished: 200 OK
Jul 26 13:03:36 gpg: Signature made Mon 09 May 2011 12:34:06 PM CEST using DSA key ID 5CAC72FE
Jul 26 13:03:36 [GNUPG:] SIG_ID +8IpFVIbah0J3Y31psFYw7I0HDg 2011-05-09 1304937246
Jul 26 13:03:36 [GNUPG:] GOODSIG B0C1CC225CAC72FE Proxmox Support Team <support@proxmox.com>
Jul 26 13:03:36 gpg: Good signature from "Proxmox Support Team <support@proxmox.com>"
Jul 26 13:03:36 [GNUPG:] VALIDSIG 9ABD7E02AD243AD3C2FBBCCCB0C1CC225CAC72FE 2011-05-09 1304937246 0 4 0 17 2 00 9ABD7E02AD243AD3C2FBBCCCB0C1CC225CAC72FE
Jul 26 13:03:36 signature valid: 9ABD7E02AD243AD3C2FBBCCCB0C1CC225CAC72FE
Jul 26 13:03:36 update sucessful
Jul 26 13:03:36 start download http://releases.turnkeylinux.org/pve/aplinfo.dat.asc
Jul 26 13:03:36 download finished: 200 OK
Jul 26 13:03:36 start download http://releases.turnkeylinux.org/pve/aplinfo.dat.gz
Jul 26 13:03:36 download finished: 200 OK
Jul 26 13:03:36 gpg: Signature made Thu 12 Apr 2012 12:13:41 PM CEST using RSA key ID A16EB94D
Jul 26 13:03:36 [GNUPG:] SIG_ID 1WeiQK/GpiRZYwXgBzUpbp9S7ss 2012-04-12 1334225621
Jul 26 13:03:36 [GNUPG:] GOODSIG 85C25E95A16EB94D Turnkey Linux Release Key <release@turnkeylinux.com>
Jul 26 13:03:36 gpg: Good signature from "Turnkey Linux Release Key <release@turnkeylinux.com>"
Jul 26 13:03:36 [GNUPG:] VALIDSIG 694CFF26795A29BAE07B4EB585C25E95A16EB94D 2012-04-12 1334225621 0 4 0 1 2 00 694CFF26795A29BAE07B4EB585C25E95A16EB94D
Jul 26 13:03:36 signature valid: 694CFF26795A29BAE07B4EB585C25E95A16EB94D
Jul 26 13:03:36 update sucessful

JonB · Jul 26, 2012

JonB said:
I have attached a strace output below. Looking through /etc/init.d/rgmanager I checked the following first:
View attachment 1068

I noticed this line in the file rgmanger.strace.7759.txt (one of the 4 txt files included in the above .zip file):

Code:

open("/dev/dlm_rgmanager", O_RDWR)      = -1 ENOENT (No such file or directory)

So I have attached my lsmod output (View attachment lsmod.txt) and listed the dlm device nodes:

Code:

root@proxmoxve1:/dev# ls -la dlm*
crw-rw-rw- 1 root root 10, 57 Jul 26 10:49 dlm-control
crw-rw-rw- 1 root root 10, 56 Jul 26 10:49 dlm-monitor
crw-rw---- 1 root root 10, 55 Jul 26 10:49 dlm_plock
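The same check can be scripted. This is only a sketch: the function name dlm_lockspace_present is made up, and the /dev/dlm_<name> naming is an assumption taken from nothing more than the strace line above, where rgmanager tried to open /dev/dlm_rgmanager.

```shell
#!/bin/sh
# Hypothetical helper: report whether the device node for a DLM lockspace exists.
# ASSUMPTION: the node is named <devdir>/dlm_<name>, as suggested by the
# open("/dev/dlm_rgmanager", ...) call seen in the strace output.
# $1: lockspace name (e.g. rgmanager); $2: device directory (defaults to /dev).
dlm_lockspace_present() {
    name=$1
    devdir=${2:-/dev}
    if [ -e "$devdir/dlm_$name" ]; then
        echo "present: $devdir/dlm_$name"
    else
        echo "missing: $devdir/dlm_$name"
        return 1
    fi
}

# Example call: dlm_lockspace_present rgmanager
```

On this machine it would report the node as missing, which matches the ENOENT in the strace and the "group join failed" message from the kernel.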

dietmar · Jul 26, 2012

You simply can't use rgmanager without fencing - that is required.

JonB · Jul 26, 2012

dietmar said:
You simply can't use rgmanager without fencing - that is required.

Why? It worked before upgrading. And I only have 1 server (so far).

tom · Jul 26, 2012

rgmanager is for HA. If you have only one server, it is not needed and is useless. Please explain what you mean by "it worked before".

JonB · Jul 26, 2012

tom said:
rgmanager is for HA. If you have only one server, it is not needed and is useless. Please explain what you mean by "it worked before".

Before Wednesday 2012-07-25, when I applied the update http://forum.proxmox.com/threads/10408-KVM-1-1-and-new-Kernel, my single OpenVZ virtual guest machine would start after a reboot, and I could start and stop it manually. Since applying that update I cannot even start it manually. The error message in the web admin interface is:

Code:

Executing HA start for CT 100
Member proxmoxve1 trying to enable pvevm:100...Could not connect to resource group manager
TASK ERROR: command 'clusvcadm -e pvevm:100 -m proxmoxve1' failed: exit code 1

The same happens if I run clusvcadm -e pvevm:100 -m proxmoxve1 by hand in the console:

Code:

root@proxmoxve1:/home/jonbendtsen# clusvcadm -e pvevm:100 -m proxmoxve1
Member proxmoxve1 trying to enable pvevm:100...Could not connect to resource group manager

So I started to look for that resource group manager.

tom · Jul 26, 2012

No idea what you are doing here; rgmanager is not usable on a single host.

JonB · Jul 26, 2012

My reason for setting up HA was that if the guest died, HA would automatically start it again. I have now removed the "managed by HA" setting, and the guest starts when I start it manually. After a reboot it also started normally. I will manage without HA until I get a second server.
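For others who hit the same error on a single node, a quick way to see whether a guest is still marked as HA-managed might look like the sketch below. It rests on an assumption not shown in this thread: that HA-managed guests are listed in /etc/cluster/cluster.conf as a pvevm element carrying a vmid attribute. The function name is_ha_managed is made up.

```shell
#!/bin/sh
# Hypothetical helper: is a given VMID listed as an HA-managed resource?
# ASSUMPTION (not confirmed in this thread): HA-managed guests appear in
# cluster.conf as an element like <pvevm ... vmid="100" .../>.
# $1: VMID; $2: path to cluster.conf (defaults to /etc/cluster/cluster.conf).
is_ha_managed() {
    vmid=$1
    conf=${2:-/etc/cluster/cluster.conf}
    # The closing quote in the pattern keeps vmid 10 from matching vmid="100".
    grep -Eq "<pvevm[^>]*vmid=\"$vmid\"" "$conf"
}

# Example call:
# if is_ha_managed 100; then echo "CT 100 is HA-managed"; fi
```

If the function returns success on a one-node setup, the guest will be started through clusvcadm and rgmanager, which is the path that failed here.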
