rgmanager running per cli but not pve

RobFantini · Jun 12, 2012

On all 4 nodes from cli rgmanager is running

Code:

# /etc/init.d/rgmanager status

rgmanager (pid 479867 479865) is running...

But it does not show as running at PVE > Datacenter > Summary on the services column. PVECluster is the only service running on the 4 nodes.

and when a HA managed kvm tries to start we get this:

Code:

Executing HA start for VM 102
Member fbc240 trying to enable pvevm:102...Could not connect to resource group manager
TASK ERROR: command 'clusvcadm -e pvevm:102 -m fbc240' failed: exit code 1

We probably have something not set up correctly.

I've checked multicast, fence_tool a few other things but can not find out the cause.

Any suggestions to fix this?

Thanks.

RobFantini · Jun 12, 2012

cluster.conf :

Code:

<?xml version="1.0"?>
<cluster config_version="95" name="fbcluster">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>
    <fencedevice agent="fence_apc" ipaddr="10.100.100.11" login="fbcadmin" name="apc11" passwd="032scali"/>
    <fencedevice agent="fence_apc" ipaddr="10.100.100.78" login="fbcadmin" name="apc78" passwd="032scali"/>
    <fencedevice agent="fence_apc" ipaddr="10.100.100.88" login="fbcadmin" name="apc88" passwd="032scali"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="fbc241" nodeid="4" votes="1">
      <fence>
        <method name="power">
          <device name="apc11" port="4" secure="on"/>
          <device name="apc78" port="4" secure="on"/>
          <device name="apc88" port="4" secure="on"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="fbc240" nodeid="1" votes="1">
      <fence>
        <method name="power">
          <device name="apc11" port="2" secure="on"/>
          <device name="apc78" port="2" secure="on"/>
          <device name="apc88" port="2" secure="on"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="fbc100" nodeid="3" votes="1">
      <fence>
        <method name="power">
          <device name="apc78" port="3" secure="on"/>
          <device name="apc88" port="3" secure="on"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="fbc1" nodeid="5" votes="1">
      <fence>
        <method name="power">
          <device name="apc88" port="6" secure="on"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <failoverdomains>
      <failoverdomain name="fbc240-fbc241" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="fbc240" priority="1"/>
        <failoverdomainnode name="fbc241" priority="100"/>
      </failoverdomain>
      <failoverdomain name="fbc241-fbc240" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="fbc241" priority="1"/>
        <failoverdomainnode name="fbc240" priority="100"/>
      </failoverdomain>
    </failoverdomains>
    <pvevm autostart="1" vmid="9002"/>
  </rm>
</cluster>

pveversion -v
[CODE]
pve-manager: 2.1-1 (pve-manager/2.1/f9b0f63a)
running kernel: 2.6.32-12-pve
proxmox-ve-2.6.32: 2.1-68
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-12-pve: 2.6.32-68
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-39
pve-firmware: 1.0-16
libpve-common-perl: 1.0-27
libpve-access-control: 1.0-21
libpve-storage-perl: 2.0-18
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1

[/CODE]

dietmar · Jun 12, 2012

What is the output of

# clustat

RobFantini · Jun 12, 2012

Code:

# clustat
Cluster Status for fbcluster @ Tue Jun 12 05:15:30 2012                                 
Member Status: Quorate                                                                   
                                                                                           
 Member Name                                          ID   Status                           
 ------ ----                                          ---- ------                            
 fbc240                                                   1 Online                             
 fbc100                                                   3 Online                                 
 fbc241                                                   4 Online, Local                          
 fbc1                                                      5 Online

dietmar · Jun 12, 2012

And

# clusvcadm -e pvevm:102 -m fbc240

still fail?

dietmar · Jun 12, 2012

Also post

# clustat -x

RobFantini · Jun 12, 2012

I had to get 102 running so removed it from HA.

we are now using 9002 on fbc241 , as 9002 is a testing kvm. so:

Code:

fbc241 s012 ~ # clusvcadm -e pvevm:9002 -m fbc241
Member fbc241 trying to enable pvevm:9002...Could not connect to resource group manager

RobFantini · Jun 12, 2012

Code:

fbc241 s012 ~ # clustat -x 
<?xml version="1.0"?>
<clustat version="4.1.1">
  <cluster name="fbcluster" id="52020" generation="1216"/>
  <quorum quorate="1" groupmember="0"/>
  <nodes>
    <node name="fbc240" state="1" local="0" estranged="0" rgmanager="0" rgmanager_master="0" qdisk="0" nodeid="0x00000001"/>
    <node name="fbc100" state="1" local="0" estranged="0" rgmanager="0" rgmanager_master="0" qdisk="0" nodeid="0x00000003"/>
    <node name="fbc241" state="1" local="1" estranged="0" rgmanager="0" rgmanager_master="0" qdisk="0" nodeid="0x00000004"/>
    <node name="fbc1" state="1" local="0" estranged="0" rgmanager="0" rgmanager_master="0" qdisk="0" nodeid="0x00000005"/>
  </nodes>
</clustat>

dietmar · Jun 12, 2012

That indicates that rgmanager is not running on any node??

RobFantini · Jun 12, 2012

dietmar said:
That indicates that rgmanager is not running on any node??

hence the title of the thread. on each node rgmanager thinks it is running per:

Code:

fbc241 s012 /etc/init.d # ./rgmanager status
rgmanager (pid 2640 2639) is running...

dietmar · Jun 12, 2012

Please check what process that is (pid 2640 2639). Are those really rgmanager instances?

Does it help if you restart rgmanager:

# /etc/init.d/rgmanager restart

RobFantini · Jun 12, 2012

from ps afx :

Code:

   2639 ?        S<Ls   0:00 rgmanager
   2640 ?        S<l    0:00  \_ rgmanager

as for restart, it hangs. tihs after 20 min. note that when we restart any of the ha server 's the rgmanager process has to be killed for reboot to occur..

Code:

root@fbc1 /home/rob # /etc/init.d/rgmanager restart
Stopping Cluster Service Manager:

RobFantini · Jun 12, 2012

Also
In April migration was working perfectly.

Sometime late May I noticed nodes would not migrate.. But am just now getting to fixing ..

We have removed and added nodes . Maybe that caused an issue?

e100 · Jun 13, 2012

Have you tried rebooting the Proxmox nodes?

I have seen rgmanager deadlock when network communications are disrupted to all nodes (ie turning off the network switch)
The output of clustat you provided does not match the deadlock I have seen so your symptoms are similar but not exactly like the deadlock I have seen.

The only way I was able to recover was to reboot each node one by one.
As each node started up rgmanager started properly.

Looks like RedHat has released several bug fixes that will likely fix my deadlock problem and possibly RobFantini's problem too.
I have reported the relevent information in my original bug report here: https://bugzilla.proxmox.com/show_bug.cgi?id=105

dietmar · Jun 13, 2012

e100 said:
Looks like RedHat has released several bug fixes that will likely fix my deadlock problem and possibly RobFantini's problem too.

I will try to assemble new packages.

RobFantini · Jun 13, 2012

"The only way I was able to recover was to reboot each node one by one."

thanks for the response.

so we'll do this probably next weekend:
1- all nodes need to be turned off
2- then start nodes one at a time.

Also we've found that the nodes do not shutdown unless the rgmanager process is manually killed. Did you have to do that when you had the similar issue?

RobFantini · Jun 13, 2012

dietmar said:
I will try to assemble new packages.

Dietmar
If it helps, we could wait a week or so and test the updated package.

e100 · Jun 13, 2012

RobFantini said:
"The only way I was able to recover was to reboot each node one by one."

thanks for the response.

so we'll do this probably next weekend:
1- all nodes need to be turned off
2- then start nodes one at a time.

Also we've found that the nodes do not shutdown unless the rgmanager process is manually killed. Did you have to do that when you had the similar issue?

With the problem I experienced I only needed to reboot each node one at a time, I did not need to turn them all off and start them one by one.
But maybe turning them all off would help.
Also, if I remember correctly, rgmanager could not be killed and we had to force the reboot.

Hopefully the recent updates that RedHat released will help and dietmar can get some new packages built soon.

RobFantini · Jun 13, 2012

I did reboot all the nodes Sunday. But one at a time. I suppose it'll be worth trying to shut al ldown and start up one at a time on the weekend.

RobFantini · Jun 14, 2012

dietmar said:
I will try to assemble new packages.

here is more log info:

Code:

 root@fbc1 /var/log # ll /var/log/cluster/fence_na.log-rw-r--r-- 1 root root 338 May 27 16:33 /var/log/cluster/fence_na.log
root@fbc1 /var/log # cat /var/log/cluster/fence_na.log
Node Assassin: . [].
TCP Port: ...... [238].
Node: .......... [00].
Login: ......... [].
Password: ...... [].
Action: ........ [metadata].
Version Request: [no].
Done reading args.
Connection to Node Assassin: [] failed.
Error was: [unknown remote host: ]
Username and/or password invalid. Did you use the command line switches properly?




** also  these :
fbc241 s012 /var/log/cluster # cat rgmanager.log
Jun 05 09:11:05 rgmanager #1: Quorum Dissolved
Jun 05 09:11:05 rgmanager [pvevm] got empty cluster VM list
Jun 05 09:11:05 rgmanager [pvevm] got empty cluster VM list
Jun 05 09:11:05 rgmanager [pvevm] got empty cluster VM list
Jun 05 09:11:06 rgmanager stop on pvevm "102" returned 2 (invalid argument(s))
Jun 05 09:11:06 rgmanager [pvevm] got empty cluster VM list
Jun 05 09:11:06 rgmanager stop on pvevm "101" returned 2 (invalid argument(s))
Jun 05 09:11:06 rgmanager stop on pvevm "104" returned 2 (invalid argument(s))
Jun 05 09:11:06 rgmanager stop on pvevm "1023" returned 2 (invalid argument(s))




fbc240 s009 /var/log/cluster # cat  rgmanager.log
Jun 05 08:39:55 rgmanager #1: Quorum Dissolved
Jun 05 08:39:56 rgmanager [pvevm] VM 105 is already stopped
Jun 05 08:39:56 rgmanager [pvevm] VM 115 is already stopped

rgmanager running per cli but not pve

Famous Member

Famous Member

Proxmox Staff Member

Famous Member

Proxmox Staff Member

Proxmox Staff Member

Famous Member

Famous Member

Proxmox Staff Member

Famous Member

Proxmox Staff Member

Famous Member

Famous Member

Renowned Member

Proxmox Staff Member

Famous Member

Famous Member

Renowned Member

Famous Member

Famous Member

We value your privacy