Fence config no longer working

proxtest

A month ago I had the "can't reconnect" problem after a networking error on the OVH virtual rack.
On May 5 I got the same problem again, but this time fencing was no longer working either! Node3 and its VMs were still running until May 9 without any connection to the other nodes! Ceph had recovered from the lost connection and was working fine.
Maybe the latest updates changed something, as has happened before? Fencing was working fine when we installed it, and we have changed nothing since!

May 5 23:07:38 node1pv corosync[3427]: [TOTEM ] A processor failed, forming new configuration.
May 5 23:07:50 node1pv corosync[3427]: [CLM ] CLM CONFIGURATION CHANGE
May 5 23:07:50 node1pv corosync[3427]: [CLM ] New Configuration:
May 5 23:07:50 node1pv corosync[3427]: [CLM ] #011r(0) ip(10.10.10.1)
May 5 23:07:50 node1pv corosync[3427]: [CLM ] #011r(0) ip(10.10.10.2)
May 5 23:07:50 node1pv corosync[3427]: [CLM ] Members Left:
May 5 23:07:50 node1pv corosync[3427]: [CLM ] #011r(0) ip(10.10.10.3)
May 5 23:07:50 node1pv corosync[3427]: [CLM ] Members Joined:
May 5 23:07:50 node1pv corosync[3427]: [QUORUM] Members[2]: 1 2
May 5 23:07:50 node1pv corosync[3427]: [CLM ] CLM CONFIGURATION CHANGE
May 5 23:07:50 node1pv corosync[3427]: [CLM ] New Configuration:
May 5 23:07:50 node1pv corosync[3427]: [CLM ] #011r(0) ip(10.10.10.1)
May 5 23:07:50 node1pv corosync[3427]: [CLM ] #011r(0) ip(10.10.10.2)
May 5 23:07:50 node1pv corosync[3427]: [CLM ] Members Left:
May 5 23:07:50 node1pv corosync[3427]: [CLM ] Members Joined:
May 5 23:07:50 node1pv corosync[3427]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 5 23:07:50 node1pv corosync[3427]: [CPG ] chosen downlist: sender r(0) ip(10.10.10.1) ; members(old:3 left:1)
May 5 23:07:50 node1pv corosync[3427]: [MAIN ] Completed service synchronization, ready to provide service.
May 5 23:07:50 node1pv rgmanager[4296]: State change: node3pv DOWN
May 5 23:07:50 node1pv fenced[4000]: fencing node node3pv
May 5 23:07:51 node1pv fence_ovh: Parse error: Ignoring unknown option 'nodename=node3pv
May 5 23:07:51 node1pv fence_ovh: Parse error: Ignoring unknown option 'ipaddr=xxxx.xxxx.eu
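
To narrow this down, my plan is to test the agent by hand, roughly like below. The option names are only guesses based on what fenced seems to pass from my cluster.conf; I don't know which ones this fence_ovh version actually expects, so treat this as a sketch:

# Ask the agent which options it supports (most fence agents answer
# to "-o metadata"; fence_ovh is a third-party script, so maybe not)
fence_ovh -o metadata

# Simulate what fenced does: feed the key=value options on stdin and
# see which of them get the "Ignoring unknown option" warning
# (values are placeholders from my config)
echo -e "action=off\nipaddr=xxx.xxx.eu\nlogin=xxx.xxx-ovh\npasswd=xxx.xxx\nnodename=node3pv" | fence_ovh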

Here the node connection comes back:

May 5 23:09:45 node1pv corosync[3427]: [CLM ] CLM CONFIGURATION CHANGE
May 5 23:09:45 node1pv corosync[3427]: [CLM ] New Configuration:
May 5 23:09:45 node1pv corosync[3427]: [CLM ] #011r(0) ip(10.10.10.1)
May 5 23:09:45 node1pv corosync[3427]: [CLM ] #011r(0) ip(10.10.10.2)
May 5 23:09:45 node1pv corosync[3427]: [CLM ] Members Left:
May 5 23:09:45 node1pv corosync[3427]: [CLM ] Members Joined:
May 5 23:09:45 node1pv corosync[3427]: [CLM ] CLM CONFIGURATION CHANGE
May 5 23:09:45 node1pv corosync[3427]: [CLM ] New Configuration:
May 5 23:09:45 node1pv corosync[3427]: [CLM ] #011r(0) ip(10.10.10.1)
May 5 23:09:45 node1pv corosync[3427]: [CLM ] #011r(0) ip(10.10.10.2)
May 5 23:09:45 node1pv corosync[3427]: [CLM ] #011r(0) ip(10.10.10.3)
May 5 23:09:45 node1pv corosync[3427]: [CLM ] Members Left:
May 5 23:09:45 node1pv corosync[3427]: [CLM ] Members Joined:
May 5 23:09:45 node1pv corosync[3427]: [CLM ] #011r(0) ip(10.10.10.3)
May 5 23:09:45 node1pv corosync[3427]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 5 23:09:45 node1pv corosync[3427]: [QUORUM] Members[3]: 1 2 3
May 5 23:09:45 node1pv corosync[3427]: [QUORUM] Members[3]: 1 2 3
May 5 23:09:45 node1pv corosync[3427]: [CPG ] chosen downlist: sender r(0) ip(10.10.10.1) ; members(old:2 left:0)
May 5 23:09:45 node1pv corosync[3427]: [MAIN ] Completed service synchronization, ready to provide service.
May 5 23:09:45 node1pv rgmanager[4296]: State change: node3pv UP
May 5 23:09:48 node1pv fence_ovh: Parse error: Ignoring unknown option 'nodename=node3pv
May 5 23:09:48 node1pv fence_ovh: Parse error: Ignoring unknown option 'ipaddr=xxxx.xxxx.eu


And then the connection is lost again:

May 5 23:10:07 node1pv corosync[3427]: [CLM ] CLM CONFIGURATION CHANGE
May 5 23:10:07 node1pv corosync[3427]: [CLM ] New Configuration:
May 5 23:10:07 node1pv corosync[3427]: [CLM ] #011r(0) ip(10.10.10.1)
May 5 23:10:07 node1pv corosync[3427]: [CLM ] #011r(0) ip(10.10.10.2)
May 5 23:10:07 node1pv corosync[3427]: [CLM ] Members Left:
May 5 23:10:07 node1pv corosync[3427]: [CLM ] #011r(0) ip(10.10.10.3)
May 5 23:10:07 node1pv corosync[3427]: [CLM ] Members Joined:
May 5 23:10:07 node1pv corosync[3427]: [QUORUM] Members[2]: 1 2
May 5 23:10:07 node1pv corosync[3427]: [CLM ] CLM CONFIGURATION CHANGE
May 5 23:10:07 node1pv corosync[3427]: [CLM ] New Configuration:
May 5 23:10:07 node1pv corosync[3427]: [CLM ] #011r(0) ip(10.10.10.1)
May 5 23:10:07 node1pv corosync[3427]: [CLM ] #011r(0) ip(10.10.10.2)
May 5 23:10:07 node1pv corosync[3427]: [CLM ] Members Left:
May 5 23:10:07 node1pv corosync[3427]: [CLM ] Members Joined:
May 5 23:10:07 node1pv corosync[3427]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 5 23:10:07 node1pv rgmanager[4296]: State change: node3pv DOWN
May 5 23:10:07 node1pv corosync[3427]: [CPG ] chosen downlist: sender r(0) ip(10.10.10.1) ; members(old:3 left:1)
May 5 23:10:07 node1pv corosync[3427]: [MAIN ] Completed service synchronization, ready to provide service.
May 5 23:10:10 node1pv fence_ovh: Parse error: Ignoring unknown option 'nodename=node3pv
May 5 23:10:10 node1pv fence_ovh: Parse error: Ignoring unknown option 'ipaddr=xxx.xxxx.eu


And then the disaster starts with rgmanager: it keeps crashing again and again until I restarted the node on May 9!


May 5 23:10:14 node1pv kernel: rgmanager D ffff880562df6b80 0 652522 4295 0 0x00000000
May 5 23:10:14 node1pv kernel: ffff8803bfcf1ca0 0000000000000086 0000000000000000 0000000000000001
May 5 23:10:14 node1pv kernel: 0000000000000000 ffff8803bfcf1c48 ffffffff81056bb5 000000013d699df8
May 5 23:10:14 node1pv kernel: 000b7cc509da5b00 ffff880879066600 00000001c088e5c2 000000000000037b
May 5 23:10:14 node1pv kernel: Call Trace:
May 5 23:10:14 node1pv kernel: [<ffffffff81056bb5>] ? __wake_up_common+0x55/0x90
May 5 23:10:14 node1pv kernel: [<ffffffff81563fe5>] rwsem_down_failed_common+0x95/0x1e0
May 5 23:10:14 node1pv kernel: [<ffffffff81564186>] rwsem_down_read_failed+0x26/0x30
May 5 23:10:14 node1pv kernel: [<ffffffff812a0494>] call_rwsem_down_read_failed+0x14/0x30
May 5 23:10:14 node1pv kernel: [<ffffffff81563864>] ? down_read+0x24/0x2b
May 5 23:10:14 node1pv kernel: [<ffffffffa07a8033>] dlm_user_request+0x43/0x1d0 [dlm]
May 5 23:10:14 node1pv kernel: [<ffffffff810b1b95>] ? ktime_get+0x65/0xf0
May 5 23:10:14 node1pv kernel: [<ffffffff81198627>] ? kmem_cache_alloc_trace+0x1a7/0x1b0
May 5 23:10:14 node1pv kernel: [<ffffffffa07b2741>] device_write+0x5b1/0x710 [dlm]
May 5 23:10:14 node1pv kernel: [<ffffffff81563731>] ? do_nanosleep+0x91/0xc0
May 5 23:10:14 node1pv kernel: [<ffffffff810abb49>] ? hrtimer_nanosleep+0xb9/0x180
May 5 23:10:14 node1pv kernel: [<ffffffff811adf01>] vfs_write+0xa1/0x190
May 5 23:10:14 node1pv kernel: [<ffffffff81138d3a>] ? fire_user_return_notifiers+0x3a/0x50
May 5 23:10:14 node1pv kernel: [<ffffffff811ae25a>] sys_write+0x4a/0x90
May 5 23:10:14 node1pv kernel: [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b
May 5 23:10:16 node1pv fence_ovh: Parse error: Ignoring unknown option 'nodename=node3pv
May 5 23:10:16 node1pv fence_ovh: Parse error: Ignoring unknown option 'ipaddr=xxx.xxxx.eu



proxmox-ve-2.6.32: 3.4-150 (running kernel: 2.6.32-37-pve)
pve-manager: 3.4-3 (running version: 3.4-3/2fc72fee)
pve-kernel-2.6.32-37-pve: 2.6.32-150
pve-kernel-2.6.32-34-pve: 2.6.32-140
pve-kernel-2.6.32-26-pve: 2.6.32-114
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.4-3
pve-firmware: 1.1-4
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-32
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-8
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

How can I get fencing working again?
Why does rgmanager keep crashing whenever something like this happens?
The Proxmox web interface still accepts this config without errors:

<?xml version="1.0"?>
<cluster config_version="14473" name="RZ">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <clusternodes>
    <clusternode name="node1pv" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="fenceNode1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2pv" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="fenceNode2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node3pv" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="fenceNode3"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_ovh" email="support@xxx.xxx" ipaddr="xxx.xxx.eu" login="xxx.xxx-ovh" name="fenceNode1" passwd="xxx.xxx"/>
    <fencedevice agent="fence_ovh" email="support@xxx.xxx" ipaddr="xxx.xxx.eu" login="xxx.xxx-ovh" name="fenceNode2" passwd="xxx.xxx"/>
    <fencedevice agent="fence_ovh" email="support@xxx.xxx" ipaddr="xxx.xxx.eu" login="xxx.xxx-ovh" name="fenceNode3" passwd="xxx.xxx"/>
  </fencedevices>
  <rm/>
</cluster>
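
The web interface only seems to check the XML itself, so my idea is to also validate the config and trigger a fence manually from the command line. The commands below are my assumption for this redhat-cluster/cman setup, not tested here:

# Validate the active cluster.conf against the cman schema
# (this only checks the XML, not whether fence_ovh understands the attributes)
ccs_config_validate

# From node1pv, try to really fence node3 and watch /var/log/syslog
# for the fence_ovh output. Careful: with action="off" this really
# powers node3 off.
fence_node node3pv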
 