Adding/deleting HA KVM fails on half of the nodes

Sakis

Hello,

I have a working cluster of 7 nodes and have enabled HA for around 20 KVMs. Everything worked well until recently, when I tried to add HA to a KVM and noticed strange logs.

All nodes report:

pveversion
Code:
proxmox-ve-2.6.32: 3.2-136 (running kernel: 2.6.32-32-pve)
pve-manager: 3.3-1 (running version: 3.3-1/a06c9f73)
pve-kernel-2.6.32-32-pve: 2.6.32-136
pve-kernel-2.6.32-29-pve: 2.6.32-126
pve-kernel-2.6.32-31-pve: 2.6.32-132
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-1
pve-cluster: 3.0-15
qemu-server: 3.1-34
pve-firmware: 1.1-3
libpve-common-perl: 3.0-19
libpve-access-control: 3.0-15
libpve-storage-perl: 3.0-23
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-9
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

fence_tool ls
Code:
fence domain
member count  7
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1 2 3 4 5 6 7

pvecm nodes (the Inc numbers vary from node to node)
Code:
Node  Sts   Inc   Joined               Name
   1   M   1560   2015-01-30 16:26:07  node7
   2   M   1520   2015-01-22 08:15:25  node2
   3   M   1536   2015-01-22 08:31:09  node3
   4   M   1568   2015-02-02 22:00:51  node4
   5   M   1544   2015-01-27 16:00:40  node8
   6   M   1400   2015-01-15 14:49:57  node9
   7   M   1400   2015-01-15 14:49:57  node11

When I add or delete a KVM, this is the log of correct behaviour:
Code:
Feb 10 08:57:29 node8 rgmanager[818615]: [pvevm] VM 195 is running
Feb 10 08:57:29 node8 rgmanager[818614]: [pvevm] VM 198 is running
Feb 10 08:57:30 node8 rgmanager[818654]: [pvevm] VM 202 is running
Feb 10 08:57:30 node8 rgmanager[818659]: [pvevm] VM 208 is running
Feb 10 08:57:30 node8 pmxcfs[3958]: [status] notice: received log
Feb 10 08:57:40 node8 rgmanager[818864]: [pvevm] VM 208 is running
Feb 10 08:57:40 node8 rgmanager[818873]: [pvevm] VM 162 is running
Feb 10 08:57:49 node8 rgmanager[819047]: [pvevm] VM 271 is running
Feb 10 08:57:59 node8 rgmanager[819226]: [pvevm] VM 271 is running
Feb 10 08:57:59 node8 rgmanager[819238]: [pvevm] VM 174 is running
Feb 10 08:58:00 node8 rgmanager[819266]: [pvevm] VM 202 is running
Feb 10 08:58:09 node8 rgmanager[819477]: [pvevm] VM 174 is running
Feb 10 08:58:09 node8 rgmanager[819497]: [pvevm] VM 195 is running
Feb 10 08:58:09 node8 rgmanager[819503]: [pvevm] VM 198 is running
Feb 10 08:58:10 node8 rgmanager[819537]: [pvevm] VM 208 is running
Feb 10 08:58:12 node8 pmxcfs[3958]: [dcdb] notice: wrote new cluster config '/etc/cluster/cluster.conf'
Feb 10 08:58:12 node8 corosync[4377]:   [QUORUM] Members[7]: 1 2 3 4 5 6 7
Feb 10 08:58:12 node8 pmxcfs[3958]: [status] notice: update cluster info (cluster name  cluster, version = 541)
Feb 10 08:58:12 node8 rgmanager[948566]: Status Child Max set to 20
Feb 10 08:58:12 node8 rgmanager[948566]: Reconfiguring
Feb 10 08:58:12 node8 rgmanager[948566]: Loading Service Data
Feb 10 08:58:16 node8 rgmanager[948566]: Stopping changed resources.
Feb 10 08:58:16 node8 rgmanager[948566]: Restarting changed resources.
Feb 10 08:58:16 node8 rgmanager[948566]: Starting changed resources.
Feb 10 08:58:22 node8 rgmanager[821632]: [pvevm] VM 271 is running
Feb 10 08:58:22 node8 rgmanager[821654]: [pvevm] VM 195 is running
Feb 10 08:58:23 node8 rgmanager[821665]: [pvevm] VM 198 is running
Feb 10 08:58:23 node8 rgmanager[821675]: [pvevm] VM 174 is running
Feb 10 08:58:23 node8 rgmanager[821709]: [pvevm] VM 202 is running
Feb 10 08:58:23 node8 rgmanager[821710]: [pvevm] VM 162 is running
Feb 10 08:58:23 node8 rgmanager[821717]: [pvevm] VM 208 is running
Feb 10 08:58:32 node8 rgmanager[822122]: [pvevm] VM 271 is running
Feb 10 08:58:32 node8 rgmanager[822129]: [pvevm] VM 174 is running
Feb 10 08:58:33 node8 rgmanager[822162]: [pvevm] VM 195 is running
Feb 10 08:58:33 node8 rgmanager[822166]: [pvevm] VM 198 is running
Feb 10 08:58:33 node8 rgmanager[822195]: [pvevm] VM 202 is running
Feb 10 08:58:33 node8 rgmanager[822208]: [pvevm] VM 162 is running
Feb 10 08:58:33 node8 rgmanager[822219]: [pvevm] VM 208 is running
Feb 10 08:59:13 node8 rgmanager[822923]: [pvevm] VM 271 is running
Feb 10 08:59:13 node8 rgmanager[822928]: [pvevm] VM 174 is running

Since the problem appeared, 3 of the nodes log the following when I alter cluster.conf with a new or removed HA resource:
Code:
Feb 10 08:57:32 node3 rgmanager[473442]: [pvevm] VM 250 is running
Feb 10 08:57:51 node3 rgmanager[473748]: [pvevm] VM 157 is running
Feb 10 08:57:52 node3 rgmanager[473764]: [pvevm] VM 250 is running
Feb 10 08:58:02 node3 rgmanager[473969]: [pvevm] VM 197 is running
Feb 10 08:58:12 node3 rgmanager[474117]: [pvevm] VM 197 is running
Feb 10 08:58:12 node3 rgmanager[474130]: [pvevm] VM 157 is running
Feb 10 08:58:12 node3 pmxcfs[4042]: [dcdb] notice: wrote new cluster config '/etc/cluster/cluster.conf'
Feb 10 08:58:12 node3 corosync[4411]:   [QUORUM] Members[7]: 1 2 3 4 5 6 7
Feb 10 08:58:12 node3 pmxcfs[4042]: [status] notice: update cluster info (cluster name  cluster, version = 541)
Feb 10 08:58:31 node3 rgmanager[475404]: [pvevm] VM 157 is running
Feb 10 08:58:32 node3 rgmanager[475424]: [pvevm] VM 250 is running
Feb 10 08:58:42 node3 rgmanager[475598]: [pvevm] VM 250 is running
Feb 10 08:58:52 node3 rgmanager[475758]: [pvevm] VM 197 is running
Feb 10 08:59:12 node3 rgmanager[476093]: [pvevm] VM 157 is running
Feb 10 08:59:12 node3 rgmanager[476113]: [pvevm] VM 197 is running
Feb 10 08:59:12 node3 rgmanager[476133]: [pvevm] VM 250 is running
Feb 10 08:59:22 node3 rgmanager[476894]: [pvevm] VM 157 is running
Feb 10 08:59:32 node3 rgmanager[477274]: [pvevm] VM 197 is running
Feb 10 08:59:52 node3 rgmanager[477581]: [pvevm] VM 157 is running
Feb 10 08:59:52 node3 rgmanager[477601]: [pvevm] VM 250 is running
Feb 10 09:00:02 node3 rgmanager[477856]: [pvevm] VM 250 is running

It is missing the rgmanager reconfiguration.

The result is that clustat no longer reports the same state on all nodes. The nodes with the problem still have the old configuration, while the good nodes report the change that happened.
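To spot which nodes are stuck on the old configuration, I compare the config version each node is actually running with, roughly like this (just a sketch, assuming passwordless SSH between the nodes):
Code:
for n in node2 node3 node4 node7 node8 node9 node11; do
    echo "== $n =="
    # cman_tool status reports the config version this node is running with
    ssh "$n" 'cman_tool status | grep -i "config version"'
done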

I have altered my cluster.conf according to some threads in this forum regarding rgmanager and totem, and it has been working well for more than 6 months.
I added <totem token="54000" window_size="150"/> and <rm status_child_max="20">.
Detailed cluster.conf
Code:
<?xml version="1.0"?>
<cluster config_version="544" name="cluster">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <totem token="54000" window_size="150"/>
  <fencedevices>
    <fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence008" passwd="xxxxxxx"/>
    <fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence010" passwd="xxxxxxx"/>
    <fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence012" passwd="xxxxxxx"/>
    <fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence014" passwd="xxxxxxx"/>
    <fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence020" passwd="xxxxxxx"/>
    <fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence022" passwd="xxxxxxx"/>
    <fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence024" passwd="xxxxxxx"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="node2" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="fence010"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node3" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="fence012"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node4" nodeid="4" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="fence014"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node7" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="fence008"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node8" nodeid="5" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="fence020"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node9" nodeid="6" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="fence022"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node11" nodeid="7" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="fence024"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm status_child_max="20">
    <pvevm autostart="1" vmid="139"/>
   ...
    <pvevm autostart="1" vmid="112"/>
  </rm>
</cluster>

The rgmanager log doesn't help. Any hints? Would enabling debug help? Which services would need a restart? I would like to avoid stops, starts and migrations in this state.
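For reference, this is roughly what I run on a suspect node to see whether rgmanager logged the reconfiguration after the last cluster.conf bump (a sketch; it assumes rgmanager logs to /var/log/syslog, as in the excerpts above):
Code:
# good nodes log "Reconfiguring" / "Loading Service Data" after a config version change
grep -E "Reconfiguring|Loading Service Data|Starting changed resources" /var/log/syslog | tail -n 5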
 
Yes, I always do that when I make changes by hand in cluster.conf that can't be done with the GUI.
Mostly I use the Proxmox GUI to add/remove HA VMs, which does it automatically.
 
I tried to restart rgmanager on the failed nodes after migrating the HA machines off them, but rgmanager never stops. Now the bad nodes also log these messages:
rgmanager #52: Failed changing RG status
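Before forcing anything I sanity-check that the current config at least parses cleanly (a sketch; as far as I know both tools ship with the redhat-cluster/rgmanager packages):
Code:
# validate the cluster configuration against the schema
ccs_config_validate
# let rgmanager's test tool parse the resource tree from the same file
rg_test test /etc/cluster/cluster.conf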
 
