Adding/deleting HA KVM fails on half of the nodes

Sakis

Hello,

I have a working cluster of 7 nodes and have enabled HA for around 20 KVMs. Everything worked well until recently, when I tried to add HA to a KVM and noticed strange logs.

All nodes report:

pveversion
Code:
proxmox-ve-2.6.32: 3.2-136 (running kernel: 2.6.32-32-pve)
pve-manager: 3.3-1 (running version: 3.3-1/a06c9f73)
pve-kernel-2.6.32-32-pve: 2.6.32-136
pve-kernel-2.6.32-29-pve: 2.6.32-126
pve-kernel-2.6.32-31-pve: 2.6.32-132
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-1
pve-cluster: 3.0-15
qemu-server: 3.1-34
pve-firmware: 1.1-3
libpve-common-perl: 3.0-19
libpve-access-control: 3.0-15
libpve-storage-perl: 3.0-23
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-9
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

fence_tool ls
Code:
fence domain
member count  7
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1 2 3 4 5 6 7

pvecm nodes (the Inc numbers vary from node to node)
Code:
Node  Sts   Inc   Joined               Name
   1   M   1560   2015-01-30 16:26:07  node7
   2   M   1520   2015-01-22 08:15:25  node2
   3   M   1536   2015-01-22 08:31:09  node3
   4   M   1568   2015-02-02 22:00:51  node4
   5   M   1544   2015-01-27 16:00:40  node8
   6   M   1400   2015-01-15 14:49:57  node9
   7   M   1400   2015-01-15 14:49:57  node11

When I add or delete a KVM, this is the log of correct behaviour:
Code:
Feb 10 08:57:29 node8 rgmanager[818615]: [pvevm] VM 195 is running
Feb 10 08:57:29 node8 rgmanager[818614]: [pvevm] VM 198 is running
Feb 10 08:57:30 node8 rgmanager[818654]: [pvevm] VM 202 is running
Feb 10 08:57:30 node8 rgmanager[818659]: [pvevm] VM 208 is running
Feb 10 08:57:30 node8 pmxcfs[3958]: [status] notice: received log
Feb 10 08:57:40 node8 rgmanager[818864]: [pvevm] VM 208 is running
Feb 10 08:57:40 node8 rgmanager[818873]: [pvevm] VM 162 is running
Feb 10 08:57:49 node8 rgmanager[819047]: [pvevm] VM 271 is running
Feb 10 08:57:59 node8 rgmanager[819226]: [pvevm] VM 271 is running
Feb 10 08:57:59 node8 rgmanager[819238]: [pvevm] VM 174 is running
Feb 10 08:58:00 node8 rgmanager[819266]: [pvevm] VM 202 is running
Feb 10 08:58:09 node8 rgmanager[819477]: [pvevm] VM 174 is running
Feb 10 08:58:09 node8 rgmanager[819497]: [pvevm] VM 195 is running
Feb 10 08:58:09 node8 rgmanager[819503]: [pvevm] VM 198 is running
Feb 10 08:58:10 node8 rgmanager[819537]: [pvevm] VM 208 is running
Feb 10 08:58:12 node8 pmxcfs[3958]: [dcdb] notice: wrote new cluster config '/etc/cluster/cluster.conf'
Feb 10 08:58:12 node8 corosync[4377]:   [QUORUM] Members[7]: 1 2 3 4 5 6 7
Feb 10 08:58:12 node8 pmxcfs[3958]: [status] notice: update cluster info (cluster name  cluster, version = 541)
Feb 10 08:58:12 node8 rgmanager[948566]: Status Child Max set to 20
Feb 10 08:58:12 node8 rgmanager[948566]: Reconfiguring
Feb 10 08:58:12 node8 rgmanager[948566]: Loading Service Data
Feb 10 08:58:16 node8 rgmanager[948566]: Stopping changed resources.
Feb 10 08:58:16 node8 rgmanager[948566]: Restarting changed resources.
Feb 10 08:58:16 node8 rgmanager[948566]: Starting changed resources.
Feb 10 08:58:22 node8 rgmanager[821632]: [pvevm] VM 271 is running
Feb 10 08:58:22 node8 rgmanager[821654]: [pvevm] VM 195 is running
Feb 10 08:58:23 node8 rgmanager[821665]: [pvevm] VM 198 is running
Feb 10 08:58:23 node8 rgmanager[821675]: [pvevm] VM 174 is running
Feb 10 08:58:23 node8 rgmanager[821709]: [pvevm] VM 202 is running
Feb 10 08:58:23 node8 rgmanager[821710]: [pvevm] VM 162 is running
Feb 10 08:58:23 node8 rgmanager[821717]: [pvevm] VM 208 is running
Feb 10 08:58:32 node8 rgmanager[822122]: [pvevm] VM 271 is running
Feb 10 08:58:32 node8 rgmanager[822129]: [pvevm] VM 174 is running
Feb 10 08:58:33 node8 rgmanager[822162]: [pvevm] VM 195 is running
Feb 10 08:58:33 node8 rgmanager[822166]: [pvevm] VM 198 is running
Feb 10 08:58:33 node8 rgmanager[822195]: [pvevm] VM 202 is running
Feb 10 08:58:33 node8 rgmanager[822208]: [pvevm] VM 162 is running
Feb 10 08:58:33 node8 rgmanager[822219]: [pvevm] VM 208 is running
Feb 10 08:59:13 node8 rgmanager[822923]: [pvevm] VM 271 is running
Feb 10 08:59:13 node8 rgmanager[822928]: [pvevm] VM 174 is running

Since the problem appeared, 3 of the nodes log the following when I alter cluster.conf with a new or removed HA resource:
Code:
Feb 10 08:57:32 node3 rgmanager[473442]: [pvevm] VM 250 is running
Feb 10 08:57:51 node3 rgmanager[473748]: [pvevm] VM 157 is running
Feb 10 08:57:52 node3 rgmanager[473764]: [pvevm] VM 250 is running
Feb 10 08:58:02 node3 rgmanager[473969]: [pvevm] VM 197 is running
Feb 10 08:58:12 node3 rgmanager[474117]: [pvevm] VM 197 is running
Feb 10 08:58:12 node3 rgmanager[474130]: [pvevm] VM 157 is running
Feb 10 08:58:12 node3 pmxcfs[4042]: [dcdb] notice: wrote new cluster config '/etc/cluster/cluster.conf'
Feb 10 08:58:12 node3 corosync[4411]:   [QUORUM] Members[7]: 1 2 3 4 5 6 7
Feb 10 08:58:12 node3 pmxcfs[4042]: [status] notice: update cluster info (cluster name  cluster, version = 541)
Feb 10 08:58:31 node3 rgmanager[475404]: [pvevm] VM 157 is running
Feb 10 08:58:32 node3 rgmanager[475424]: [pvevm] VM 250 is running
Feb 10 08:58:42 node3 rgmanager[475598]: [pvevm] VM 250 is running
Feb 10 08:58:52 node3 rgmanager[475758]: [pvevm] VM 197 is running
Feb 10 08:59:12 node3 rgmanager[476093]: [pvevm] VM 157 is running
Feb 10 08:59:12 node3 rgmanager[476113]: [pvevm] VM 197 is running
Feb 10 08:59:12 node3 rgmanager[476133]: [pvevm] VM 250 is running
Feb 10 08:59:22 node3 rgmanager[476894]: [pvevm] VM 157 is running
Feb 10 08:59:32 node3 rgmanager[477274]: [pvevm] VM 197 is running
Feb 10 08:59:52 node3 rgmanager[477581]: [pvevm] VM 157 is running
Feb 10 08:59:52 node3 rgmanager[477601]: [pvevm] VM 250 is running
Feb 10 09:00:02 node3 rgmanager[477856]: [pvevm] VM 250 is running

It is missing the rgmanager reconfiguration.

The result is that clustat no longer reports the same state on all nodes. The nodes with the problem still have the old configuration, while the good nodes report the change that happened.
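To spot which nodes are stuck on the old configuration, I compare the config version each node is actually running with, roughly like this (just a sketch, assuming passwordless SSH between the nodes):
Code:
for n in node2 node3 node4 node7 node8 node9 node11; do
    echo "== $n =="
    # cman_tool status reports the config version this node is running with
    ssh "$n" 'cman_tool status | grep -i "config version"'
done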

I have altered my cluster.conf according to some threads in this forum regarding rgmanager and totem, and it has been working well for more than 6 months.
I added <totem token="54000" window_size="150"/> and <rm status_child_max="20">.
Detailed cluster.conf
Code:
<?xml version="1.0"?>
<cluster config_version="544" name="cluster">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <totem token="54000" window_size="150"/>
  <fencedevices>
    <fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence008" passwd="xxxxxxx"/>
    <fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence010" passwd="xxxxxxx"/>
    <fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence012" passwd="xxxxxxx"/>
    <fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence014" passwd="xxxxxxx"/>
    <fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence020" passwd="xxxxxxx"/>
    <fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence022" passwd="xxxxxxx"/>
    <fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence024" passwd="xxxxxxx"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="node2" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="fence010"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node3" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="fence012"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node4" nodeid="4" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="fence014"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node7" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="fence008"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node8" nodeid="5" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="fence020"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node9" nodeid="6" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="fence022"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node11" nodeid="7" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="fence024"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm status_child_max="20">
    <pvevm autostart="1" vmid="139"/>
   ...
    <pvevm autostart="1" vmid="112"/>
  </rm>
</cluster>

The rgmanager log doesn't help. Any hints? Would enabling debug help? Which services would need a restart? I would like to avoid stops, starts and migrations in this state.
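For reference, this is roughly what I run on a suspect node to see whether rgmanager logged the reconfiguration after the last cluster.conf bump (a sketch; it assumes rgmanager logs to /var/log/syslog, as in the excerpts above):
Code:
# good nodes log "Reconfiguring" / "Loading Service Data" after a config version change
grep -E "Reconfiguring|Loading Service Data|Starting changed resources" /var/log/syslog | tail -n 5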
 
Yes, I always do that when I make changes by hand in cluster.conf that can't be done with the GUI.
Mostly I use the Proxmox GUI to add/remove HA VMs, which does it automatically.
 
I tried to restart rgmanager on the failed nodes after migrating the HA machines off them, but rgmanager never stops. Now the bad nodes also log these messages:
rgmanager #52: Failed changing RG status
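Before forcing anything I sanity-check that the current config at least parses cleanly (a sketch; as far as I know both tools ship with the redhat-cluster/rgmanager packages):
Code:
# validate the cluster configuration against the schema
ccs_config_validate
# let rgmanager's test tool parse the resource tree from the same file
rg_test test /etc/cluster/cluster.conf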
 
