rgmanager running per cli but not pve

RobFantini

On all 4 nodes, rgmanager shows as running from the CLI:
Code:
# /etc/init.d/rgmanager status

rgmanager (pid 479867 479865) is running...

But it does not show as running under PVE > Datacenter > Summary in the services column; PVECluster is the only service listed as running on the 4 nodes.

And when an HA-managed KVM guest tries to start, we get this:
Code:
Executing HA start for VM 102
Member fbc240 trying to enable pvevm:102...Could not connect to resource group manager
TASK ERROR: command 'clusvcadm -e pvevm:102 -m fbc240' failed: exit code 1


We probably have something not set up correctly.

I've checked multicast, fence_tool, and a few other things, but cannot find the cause.
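
For reference, these are roughly the checks I mean (a sketch; omping has to be installed separately, and the two node names are just examples from our cluster):
Code:
pvecm status              # quorum and vote counts
cman_tool status          # also shows the multicast address cman is using
fence_tool ls             # every node should appear as a fence domain member
# multicast test between two nodes (run it on both ends):
omping -c 60 -i 1 fbc240 fbc241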

Any suggestions to fix this?

Thanks.
 
cluster.conf:
Code:
<?xml version="1.0"?>
<cluster config_version="95" name="fbcluster">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>
    <fencedevice agent="fence_apc" ipaddr="10.100.100.11" login="fbcadmin" name="apc11" passwd="032scali"/>
    <fencedevice agent="fence_apc" ipaddr="10.100.100.78" login="fbcadmin" name="apc78" passwd="032scali"/>
    <fencedevice agent="fence_apc" ipaddr="10.100.100.88" login="fbcadmin" name="apc88" passwd="032scali"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="fbc241" nodeid="4" votes="1">
      <fence>
        <method name="power">
          <device name="apc11" port="4" secure="on"/>
          <device name="apc78" port="4" secure="on"/>
          <device name="apc88" port="4" secure="on"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="fbc240" nodeid="1" votes="1">
      <fence>
        <method name="power">
          <device name="apc11" port="2" secure="on"/>
          <device name="apc78" port="2" secure="on"/>
          <device name="apc88" port="2" secure="on"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="fbc100" nodeid="3" votes="1">
      <fence>
        <method name="power">
          <device name="apc78" port="3" secure="on"/>
          <device name="apc88" port="3" secure="on"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="fbc1" nodeid="5" votes="1">
      <fence>
        <method name="power">
          <device name="apc88" port="6" secure="on"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <failoverdomains>
      <failoverdomain name="fbc240-fbc241" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="fbc240" priority="1"/>
        <failoverdomainnode name="fbc241" priority="100"/>
      </failoverdomain>
      <failoverdomain name="fbc241-fbc240" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="fbc241" priority="1"/>
        <failoverdomainnode name="fbc240" priority="100"/>
      </failoverdomain>
    </failoverdomains>
    <pvevm autostart="1" vmid="9002"/>
  </rm>
</cluster>
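
For what it's worth, one quick sanity check is to confirm every node is actually on this config version (and, if the tool happens to be available, that the file validates):
Code:
# run on each node; it should report "config 95" to match config_version="95"
cman_tool version
# optional schema check, if packaged:
ccs_config_validate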

pveversion -v
Code:
pve-manager: 2.1-1 (pve-manager/2.1/f9b0f63a)
running kernel: 2.6.32-12-pve
proxmox-ve-2.6.32: 2.1-68
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-12-pve: 2.6.32-68
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-39
pve-firmware: 1.0-16
libpve-common-perl: 1.0-27
libpve-access-control: 1.0-21
libpve-storage-perl: 2.0-18
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1

 
Code:
# clustat
Cluster Status for fbcluster @ Tue Jun 12 05:15:30 2012                                 
Member Status: Quorate                                                                   
                                                                                           
 Member Name                                          ID   Status                           
 ------ ----                                          ---- ------                            
 fbc240                                                   1 Online                             
 fbc100                                                   3 Online                                 
 fbc241                                                   4 Online, Local                          
 fbc1                                                      5 Online
 
I had to get VM 102 running, so I removed it from HA.

We are now using 9002 on fbc241, as 9002 is a test KVM. So:
Code:
fbc241 s012 ~ # clusvcadm -e pvevm:9002 -m fbc241
Member fbc241 trying to enable pvevm:9002...Could not connect to resource group manager
 
Code:
fbc241 s012 ~ # clustat -x 
<?xml version="1.0"?>
<clustat version="4.1.1">
  <cluster name="fbcluster" id="52020" generation="1216"/>
  <quorum quorate="1" groupmember="0"/>
  <nodes>
    <node name="fbc240" state="1" local="0" estranged="0" rgmanager="0" rgmanager_master="0" qdisk="0" nodeid="0x00000001"/>
    <node name="fbc100" state="1" local="0" estranged="0" rgmanager="0" rgmanager_master="0" qdisk="0" nodeid="0x00000003"/>
    <node name="fbc241" state="1" local="1" estranged="0" rgmanager="0" rgmanager_master="0" qdisk="0" nodeid="0x00000004"/>
    <node name="fbc1" state="1" local="0" estranged="0" rgmanager="0" rgmanager_master="0" qdisk="0" nodeid="0x00000005"/>
  </nodes>
</clustat>
 
Please check what process that is (pid 2640 2639). Are those really rgmanager instances?
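
For example, something quick like this (assuming those PIDs are still current):
Code:
ps -fp 2639,2640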

Does it help if you restart rgmanager:

# /etc/init.d/rgmanager restart
 
From ps afx:
Code:
   2639 ?        S<Ls   0:00 rgmanager
   2640 ?        S<l    0:00  \_ rgmanager

As for the restart, it hangs; this is the state after 20 minutes. Note that when we restart any of the HA servers, the rgmanager process has to be killed before the reboot will proceed.
Code:
root@fbc1 /home/rob # /etc/init.d/rgmanager restart
Stopping Cluster Service Manager:
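
To be clear about what "killed" means above, this is roughly what we end up doing before a reboot will go through:
Code:
# a normal stop just hangs at "Stopping Cluster Service Manager:", so force it:
killall -9 rgmanager
reboot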
 
Also:
In April, migration was working perfectly.

Sometime in late May I noticed migrations no longer worked, but I am only now getting around to fixing it.

We have removed and added nodes since then. Maybe that caused an issue?
 
Have you tried rebooting the Proxmox nodes?

I have seen rgmanager deadlock when network communication to all nodes is disrupted (i.e. turning off the network switch).
The clustat output you provided does not quite match that deadlock, so your symptoms are similar but not identical to what I have seen.

The only way I was able to recover was to reboot each node one by one.
As each node started up rgmanager started properly.

Looks like RedHat has released several bug fixes that will likely fix my deadlock problem and possibly RobFantini's problem too.
I have reported the relevant information in my original bug report here: https://bugzilla.proxmox.com/show_bug.cgi?id=105
 
"The only way I was able to recover was to reboot each node one by one."

Thanks for the response.

So we'll probably do this next weekend:
1- all nodes need to be turned off
2- then start the nodes one at a time

Also, we've found that the nodes do not shut down unless the rgmanager process is manually killed. Did you have to do that when you had a similar issue?
 
"The only way I was able to recover was to reboot each node one by one."

thanks for the response.

so we'll do this probably next weekend:
1- all nodes need to be turned off
2- then start nodes one at a time.

Also we've found that the nodes do not shutdown unless the rgmanager process is manually killed. Did you have to do that when you had the similar issue?

With the problem I experienced, I only needed to reboot each node one at a time; I did not need to turn them all off and start them one by one.
But maybe turning them all off would help.
Also, if I remember correctly, rgmanager could not be killed and we had to force the reboot.

Hopefully the recent updates that RedHat released will help and dietmar can get some new packages built soon.
 
I did reboot all the nodes on Sunday, but one at a time. I suppose it will be worth trying to shut them all down and start up one at a time on the weekend.
 
I will try to assemble new packages.

here is more log info:
Code:
root@fbc1 /var/log # ll /var/log/cluster/fence_na.log
-rw-r--r-- 1 root root 338 May 27 16:33 /var/log/cluster/fence_na.log
root@fbc1 /var/log # cat /var/log/cluster/fence_na.log
Node Assassin: . [].
TCP Port: ...... [238].
Node: .......... [00].
Login: ......... [].
Password: ...... [].
Action: ........ [metadata].
Version Request: [no].
Done reading args.
Connection to Node Assassin: [] failed.
Error was: [unknown remote host: ]
Username and/or password invalid. Did you use the command line switches properly?




** Also these:
fbc241 s012 /var/log/cluster # cat rgmanager.log
Jun 05 09:11:05 rgmanager #1: Quorum Dissolved
Jun 05 09:11:05 rgmanager [pvevm] got empty cluster VM list
Jun 05 09:11:05 rgmanager [pvevm] got empty cluster VM list
Jun 05 09:11:05 rgmanager [pvevm] got empty cluster VM list
Jun 05 09:11:06 rgmanager stop on pvevm "102" returned 2 (invalid argument(s))
Jun 05 09:11:06 rgmanager [pvevm] got empty cluster VM list
Jun 05 09:11:06 rgmanager stop on pvevm "101" returned 2 (invalid argument(s))
Jun 05 09:11:06 rgmanager stop on pvevm "104" returned 2 (invalid argument(s))
Jun 05 09:11:06 rgmanager stop on pvevm "1023" returned 2 (invalid argument(s))




fbc240 s009 /var/log/cluster # cat  rgmanager.log
Jun 05 08:39:55 rgmanager #1: Quorum Dissolved
Jun 05 08:39:56 rgmanager [pvevm] VM 105 is already stopped
Jun 05 08:39:56 rgmanager [pvevm] VM 115 is already stopped
 
