Hello,
I have a working cluster of 7 nodes with HA enabled for around 20 KVMs. Everything worked well until recently, when I tried to add HA to a KVM and noticed strange logs.
All nodes report:
pveversion
Code:
proxmox-ve-2.6.32: 3.2-136 (running kernel: 2.6.32-32-pve)
pve-manager: 3.3-1 (running version: 3.3-1/a06c9f73)
pve-kernel-2.6.32-32-pve: 2.6.32-136
pve-kernel-2.6.32-29-pve: 2.6.32-126
pve-kernel-2.6.32-31-pve: 2.6.32-132
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-1
pve-cluster: 3.0-15
qemu-server: 3.1-34
pve-firmware: 1.1-3
libpve-common-perl: 3.0-19
libpve-access-control: 3.0-15
libpve-storage-perl: 3.0-23
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-9
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1
fence_tool ls
Code:
fence domain
member count 7
victim count 0
victim now 0
master nodeid 1
wait state none
members 1 2 3 4 5 6 7
pvecm nodes (the Inc numbers vary from node to node)
Code:
Node Sts Inc Joined Name
1 M 1560 2015-01-30 16:26:07 node7
2 M 1520 2015-01-22 08:15:25 node2
3 M 1536 2015-01-22 08:31:09 node3
4 M 1568 2015-02-02 22:00:51 node4
5 M 1544 2015-01-27 16:00:40 node8
6 M 1400 2015-01-15 14:49:57 node9
7 M 1400 2015-01-15 14:49:57 node11
When I add or delete a KVM, this is the log of correct behaviour:
Code:
Feb 10 08:57:29 node8 rgmanager[818615]: [pvevm] VM 195 is running
Feb 10 08:57:29 node8 rgmanager[818614]: [pvevm] VM 198 is running
Feb 10 08:57:30 node8 rgmanager[818654]: [pvevm] VM 202 is running
Feb 10 08:57:30 node8 rgmanager[818659]: [pvevm] VM 208 is running
Feb 10 08:57:30 node8 pmxcfs[3958]: [status] notice: received log
Feb 10 08:57:40 node8 rgmanager[818864]: [pvevm] VM 208 is running
Feb 10 08:57:40 node8 rgmanager[818873]: [pvevm] VM 162 is running
Feb 10 08:57:49 node8 rgmanager[819047]: [pvevm] VM 271 is running
Feb 10 08:57:59 node8 rgmanager[819226]: [pvevm] VM 271 is running
Feb 10 08:57:59 node8 rgmanager[819238]: [pvevm] VM 174 is running
Feb 10 08:58:00 node8 rgmanager[819266]: [pvevm] VM 202 is running
Feb 10 08:58:09 node8 rgmanager[819477]: [pvevm] VM 174 is running
Feb 10 08:58:09 node8 rgmanager[819497]: [pvevm] VM 195 is running
Feb 10 08:58:09 node8 rgmanager[819503]: [pvevm] VM 198 is running
Feb 10 08:58:10 node8 rgmanager[819537]: [pvevm] VM 208 is running
Feb 10 08:58:12 node8 pmxcfs[3958]: [dcdb] notice: wrote new cluster config '/etc/cluster/cluster.conf'
Feb 10 08:58:12 node8 corosync[4377]: [QUORUM] Members[7]: 1 2 3 4 5 6 7
Feb 10 08:58:12 node8 pmxcfs[3958]: [status] notice: update cluster info (cluster name cluster, version = 541)
Feb 10 08:58:12 node8 rgmanager[948566]: Status Child Max set to 20
Feb 10 08:58:12 node8 rgmanager[948566]: Reconfiguring
Feb 10 08:58:12 node8 rgmanager[948566]: Loading Service Data
Feb 10 08:58:16 node8 rgmanager[948566]: Stopping changed resources.
Feb 10 08:58:16 node8 rgmanager[948566]: Restarting changed resources.
Feb 10 08:58:16 node8 rgmanager[948566]: Starting changed resources.
Feb 10 08:58:22 node8 rgmanager[821632]: [pvevm] VM 271 is running
Feb 10 08:58:22 node8 rgmanager[821654]: [pvevm] VM 195 is running
Feb 10 08:58:23 node8 rgmanager[821665]: [pvevm] VM 198 is running
Feb 10 08:58:23 node8 rgmanager[821675]: [pvevm] VM 174 is running
Feb 10 08:58:23 node8 rgmanager[821709]: [pvevm] VM 202 is running
Feb 10 08:58:23 node8 rgmanager[821710]: [pvevm] VM 162 is running
Feb 10 08:58:23 node8 rgmanager[821717]: [pvevm] VM 208 is running
Feb 10 08:58:32 node8 rgmanager[822122]: [pvevm] VM 271 is running
Feb 10 08:58:32 node8 rgmanager[822129]: [pvevm] VM 174 is running
Feb 10 08:58:33 node8 rgmanager[822162]: [pvevm] VM 195 is running
Feb 10 08:58:33 node8 rgmanager[822166]: [pvevm] VM 198 is running
Feb 10 08:58:33 node8 rgmanager[822195]: [pvevm] VM 202 is running
Feb 10 08:58:33 node8 rgmanager[822208]: [pvevm] VM 162 is running
Feb 10 08:58:33 node8 rgmanager[822219]: [pvevm] VM 208 is running
Feb 10 08:59:13 node8 rgmanager[822923]: [pvevm] VM 271 is running
Feb 10 08:59:13 node8 rgmanager[822928]: [pvevm] VM 174 is running
Since the problem appeared, 3 of the nodes log the following when I alter cluster.conf with a new or removed HA resource:
Code:
Feb 10 08:57:32 node3 rgmanager[473442]: [pvevm] VM 250 is running
Feb 10 08:57:51 node3 rgmanager[473748]: [pvevm] VM 157 is running
Feb 10 08:57:52 node3 rgmanager[473764]: [pvevm] VM 250 is running
Feb 10 08:58:02 node3 rgmanager[473969]: [pvevm] VM 197 is running
Feb 10 08:58:12 node3 rgmanager[474117]: [pvevm] VM 197 is running
Feb 10 08:58:12 node3 rgmanager[474130]: [pvevm] VM 157 is running
Feb 10 08:58:12 node3 pmxcfs[4042]: [dcdb] notice: wrote new cluster config '/etc/cluster/cluster.conf'
Feb 10 08:58:12 node3 corosync[4411]: [QUORUM] Members[7]: 1 2 3 4 5 6 7
Feb 10 08:58:12 node3 pmxcfs[4042]: [status] notice: update cluster info (cluster name cluster, version = 541)
Feb 10 08:58:31 node3 rgmanager[475404]: [pvevm] VM 157 is running
Feb 10 08:58:32 node3 rgmanager[475424]: [pvevm] VM 250 is running
Feb 10 08:58:42 node3 rgmanager[475598]: [pvevm] VM 250 is running
Feb 10 08:58:52 node3 rgmanager[475758]: [pvevm] VM 197 is running
Feb 10 08:59:12 node3 rgmanager[476093]: [pvevm] VM 157 is running
Feb 10 08:59:12 node3 rgmanager[476113]: [pvevm] VM 197 is running
Feb 10 08:59:12 node3 rgmanager[476133]: [pvevm] VM 250 is running
Feb 10 08:59:22 node3 rgmanager[476894]: [pvevm] VM 157 is running
Feb 10 08:59:32 node3 rgmanager[477274]: [pvevm] VM 197 is running
Feb 10 08:59:52 node3 rgmanager[477581]: [pvevm] VM 157 is running
Feb 10 08:59:52 node3 rgmanager[477601]: [pvevm] VM 250 is running
Feb 10 09:00:02 node3 rgmanager[477856]: [pvevm] VM 250 is running
The rgmanager reconfiguration step is missing.
The result is that clustat no longer reports the same states on all nodes. The nodes with the problem still have the old configuration, while the good nodes report the change that happened.
I had altered my cluster.conf according to some threads in this forum about rgmanager and totem, and it worked well for more than 6 months.
I added <totem token="54000" window_size="150"/> and <rm status_child_max="20">.
Detailed cluster.conf:
Code:
<?xml version="1.0"?>
<cluster config_version="544" name="cluster">
<cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
<totem token="54000" window_size="150"/>
<fencedevices>
<fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence008" passwd="xxxxxxx"/>
<fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence010" passwd="xxxxxxx"/>
<fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence012" passwd="xxxxxxx"/>
<fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence014" passwd="xxxxxxx"/>
<fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence020" passwd="xxxxxxx"/>
<fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence022" passwd="xxxxxxx"/>
<fencedevice agent="fence_api" ipaddr="xxxx" login="xxxx" name="fence024" passwd="xxxxxxx"/>
</fencedevices>
<clusternodes>
<clusternode name="node2" nodeid="2" votes="1">
<fence>
<method name="1">
<device action="off" name="fence010"/>
</method>
</fence>
</clusternode>
<clusternode name="node3" nodeid="3" votes="1">
<fence>
<method name="1">
<device action="off" name="fence012"/>
</method>
</fence>
</clusternode>
<clusternode name="node4" nodeid="4" votes="1">
<fence>
<method name="1">
<device action="off" name="fence014"/>
</method>
</fence>
</clusternode>
<clusternode name="node7" nodeid="1" votes="1">
<fence>
<method name="1">
<device action="off" name="fence008"/>
</method>
</fence>
</clusternode>
<clusternode name="node8" nodeid="5" votes="1">
<fence>
<method name="1">
<device action="off" name="fence020"/>
</method>
</fence>
</clusternode>
<clusternode name="node9" nodeid="6" votes="1">
<fence>
<method name="1">
<device action="off" name="fence022"/>
</method>
</fence>
</clusternode>
<clusternode name="node11" nodeid="7" votes="1">
<fence>
<method name="1">
<device action="off" name="fence024"/>
</method>
</fence>
</clusternode>
</clusternodes>
<rm status_child_max="20">
<pvevm autostart="1" vmid="139"/>
...
<pvevm autostart="1" vmid="112"/>
</rm>
</cluster>
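For reference, this is how I understand rgmanager debug logging could be enabled via a <logging> block, based on the cluster.conf(5) man page. I have not tested this on my setup, so please correct me if the syntax is off:

```xml
<!-- inside <cluster>, alongside <totem> and <rm>; untested on my cluster -->
<logging>
    <!-- turn on debug output for the rgmanager daemon only -->
    <logging_daemon name="rgmanager" debug="on"/>
</logging>
```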
The rgmanager log doesn't help. Any hints? Would enabling debug help? Which services will need a restart? I would like to avoid stop/start/migrate operations in this state.
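In the meantime, a quick way to see which nodes picked up the new config is to compare the loaded config version on every node: `cman_tool version` prints a line like `6.2.0 config 544`, where the last field is the config_version currently in memory. A minimal sketch, using my node names (adjust to yours; ssh access between nodes is assumed):

```shell
#!/bin/sh
# Compare the cluster.conf version currently loaded on each node.
# Node names are taken from the pvecm output above.
for n in node2 node3 node4 node7 node8 node9 node11; do
    # cman_tool version prints e.g. "6.2.0 config 544";
    # the last field is the loaded config_version.
    v=$(ssh "$n" cman_tool version | awk '{print $NF}')
    echo "$n: config $v"
done
```

Nodes printing an older number than the config_version in /etc/cluster/cluster.conf should be the ones that missed the reconfiguration.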