Fence Action Logging

adamb

This issue has bothered me from day one but I never got around to pinning down why. When a fence action takes place it would be great if a reason for the action could be placed in the log.

For example I had a cluster fence its primary node this morning.

May 30 03:34:40 fenced fencing node fpracprox1
May 30 03:34:53 fenced fence fpracprox1 success

Syslog provides no indication of why

May 30 03:33:15 fpracprox1 rgmanager[879803]: [pvevm] VM 100 is running
May 30 03:40:55 fpracprox1 kernel: imklog 5.8.11, log source = /proc/kmsg started.

Is there any specific reason why a reason, or at least some type of clue, can't be placed in the log? Fencing is great, but if you can't figure out why it happened, it's like finding a needle in a haystack.
 
So I wonder if adding the following would provide more logging on fence actions.

<rm log_level="7">
<pvevm autostart="1" vmid="100"/>
 
Added this to my cluster.conf:

<rm log_level="7">
<pvevm autostart="1" vmid="100"/>
</rm>
</cluster>

Definitely no additional logging in fenced.log:

May 30 14:52:04 fenced fencing node testprox1
May 30 14:52:18 fenced fence testprox1 success

I just cringe when a cluster gets fenced and I have nothing real to go off of other than the fact that it was fenced. Obviously the decision to fence has some logic behind it, so that should be loggable imo.
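In case it matters for anyone trying the same thing: edits to cluster.conf only take effect after the config_version is bumped and the new config is validated and distributed. With the stock cman tools that is roughly the following (the Proxmox HA activation step should do the equivalent):

ccs_config_validate      # sanity-check the edited cluster.conf
cman_tool version -r     # tell cman to re-read and distribute the new config_version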
 
Have you configured any failover domains?

Please show the content of /etc/pve/cluster.conf.

Nope, no failover domains, as I didn't see a need for them. Would they provide more input as to why the node is being fenced?

root@fpracprox1:~# cat /etc/pve/cluster.conf
<?xml version="1.0"?>
<cluster config_version="15" name="fprac">
<cman expected_votes="3" keyfile="/var/lib/pve-cluster/corosync.authkey"/>
<quorumd allow_kill="0" interval="3" label="cluster_qdisk" master_wins="1" tko="10"/>
<totem token="54000"/>
<fencedevices>
<fencedevice agent="fence_ipmilan" ipaddr="192.168.1.82" lanplus="1" login="fence" name="ipmi1" passwd="medent168" power_wait="5"/>
<fencedevice agent="fence_ipmilan" ipaddr="192.168.1.83" lanplus="1" login="fence" name="ipmi2" passwd="medent168" power_wait="5"/>
<fencedevice agent="fence_apc" ipaddr="192.168.1.85" login="device" name="apc" passwd="medent168"/>
</fencedevices>
<clusternodes>
<clusternode name="fpracprox1" nodeid="1" votes="1">
<fence>
<method name="acp">
<device name="apc" port="1,2" secure="on"/>
</method>
<method name="1">
<device name="ipmi1"/>
</method>
</fence>
</clusternode>
<clusternode name="fpracprox2" nodeid="2" votes="1">
<fence>
<method name="acp">
<device name="apc" port="3,4" secure="on"/>
</method>
<method name="1">
<device name="ipmi2"/>
</method>
</fence>
</clusternode>
</clusternodes>
<rm>
<pvevm autostart="1" vmid="100"/>
</rm>
</cluster>
 
For fencing to work you will need failover domains. If no failover domains are configured, how should a VM be auto-migrated?
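For reference, a failover domain is just another stanza inside the rm section of cluster.conf. A minimal sketch, using the node names from this thread (the priorities are placeholders, and I'm assuming pvevm takes a domain attribute the same way a normal rgmanager service does):

<rm>
<failoverdomains>
<failoverdomain name="prox_domain" ordered="1" restricted="0" nofailback="0">
<failoverdomainnode name="fpracprox1" priority="1"/>
<failoverdomainnode name="fpracprox2" priority="2"/>
</failoverdomain>
</failoverdomains>
<pvevm autostart="1" vmid="100" domain="prox_domain"/>
</rm>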

You may have misinterpreted my question. Fencing works great, no issues there. I am looking for a way to increase verbosity to see the reason why the node was fenced.

/var/log/cluster/fenced.log simply has the following on a fence action.

May 30 14:52:04 fenced fencing node testprox1
May 30 14:52:18 fenced fence testprox1 success

It is successful, but I have no idea why. It would be great to see the logic and/or reason in the log, at least pointing you in the direction of the issue which caused the fence action to be called.
 
I see. But aren't you missing something to decide whether quorum is established or not?

Mine reads:
<quorumd allow_kill="0" interval="3" label="proxmox1_qdisk" tko="10" votes="1">
<heuristic interval="3" program="ping $GATEWAY -c1 -w1" score="1" tko="4"/>
<heuristic interval="3" program="ip addr | grep eth1 | grep -q UP" score="2" tko="3"/>
</quorumd>
 
You may have misinterpreted my question. Fencing works great, no issues there. I am looking for a way to increase verbosity to see the reason why the node was fenced.

That's right adamb, it would be fantastic if the fence module produced a clear log.

If you or someone else knows how to do it, please post it here.
 
I see. But aren't you missing something to decide whether quorum is established or not?

Mine reads:
<quorumd allow_kill="0" interval="3" label="proxmox1_qdisk" tko="10" votes="1">
<heuristic interval="3" program="ping $GATEWAY -c1 -w1" score="1" tko="4"/>
<heuristic interval="3" program="ip addr | grep eth1 | grep -q UP" score="2" tko="3"/>
</quorumd>

I use master_wins in my cluster; I am pretty sure that disables the use of heuristics. I ran into a number of issues with heuristics in specific failover scenarios which caused odd and unpredictable results.
 
That's right adamb, it would be fantastic if the fence module produced a clear log.

If you or someone else knows how to do it, please post it here.

I agree, at least log the logic behind calling the fence action. It can't be too difficult. I may have to open a ticket with support to see if I can get to the bottom of this.
 
From man qdisk:

log_level="4"
This controls the verbosity of the quorum daemon in the system logs. 0 = emergen‐
cies; 7 = debug. This option is deprecated.


log_facility="daemon"
This controls the syslog facility used by the quorum daemon when logging. For a com‐
plete list of available facilities, see syslog.conf(5). The default value for this
is 'daemon'. This option is deprecated.


status_file="/foo"
Write internal states out to this file periodically ("-" = use stdout). This is pri‐
marily used for debugging. The default value for this attribute is undefined. This
option can be changed while qdiskd is running.

So it seems you are looking for status_file?
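Something along these lines should expose it; status_file is just another attribute on the quorumd element (the path below is only an example):

<quorumd allow_kill="0" interval="3" label="cluster_qdisk" master_wins="1" tko="10" status_file="/var/log/cluster/status"/>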
 
Well I had high hopes for the status file but it doesn't really indicate much.
root@testprox1:/var/log/cluster# cat status
Time Stamp: Mon Jun 2 11:38:04 2014
Node ID: 1
Score: 1/1 (Minimum required = 1)
Current state: Master
Initializing Set: { }
Visible Set: { 1 2 }
Master Node ID: 1
Quorate Set: { 1 2 }

The above was directly after a fence action took place and a node was fenced from the cluster.

I see rgmanager is a bit more verbose, but there's nothing indicating why the fence action took place. In this specific scenario I killed the corosync process on the node which was fenced.

rgmanager.log

Jun 02 11:31:24 rgmanager Membership Change Event
Jun 02 11:31:27 rgmanager Membership Change Event
Jun 02 11:31:27 rgmanager State change: testprox2 DOWN
Jun 02 11:31:53 rgmanager Marking pvevm:100 on down member 2 as stopped
Jun 02 11:31:53 rgmanager Evaluating RG pvevm:100, state stopped, owner none
Jun 02 11:31:53 rgmanager Event (0:2:0) Processed
Jun 02 11:31:53 rgmanager No other nodes have seen pvevm:100
Jun 02 11:31:54 rgmanager Starting stopped service pvevm:100
Jun 02 11:31:54 rgmanager [pvevm] Move config for VM 100 to local node
Jun 02 11:31:55 rgmanager [pvevm] Task still active, waiting
Jun 02 11:31:56 rgmanager Service pvevm:100 started

Fenced.log doesn't offer much.

Jun 02 11:31:27 fenced fencing node testprox2
Jun 02 11:31:53 fenced fence testprox2 success
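About the only thing I can do is grep the same time window across everything the cluster stack writes on the surviving node, roughly:

grep "Jun 02 11:31" /var/log/cluster/*.log /var/log/syslog

Nothing in there states the actual trigger either.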
 
Well, I came across this option:

<logging debug="on"/>

It seems to provide significantly more logging, but I still can't pin down any reason for a fence action. This line looks interesting:

1401803544 add_change cg 2 remove nodeid 1 reason 3

I wonder what reason 3 is.
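Digging around a bit, those reason numbers look like corosync's CPG configuration-change reason codes (cpg_reason_t in corosync/cpg.h), which would make reason 3 a node-down event, consistent with corosync being killed on that node. For reference:

CPG_REASON_JOIN = 1
CPG_REASON_LEAVE = 2
CPG_REASON_NODEDOWN = 3
CPG_REASON_NODEUP = 4
CPG_REASON_PROCDOWN = 5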

I would appreciate it if the devs could provide some input here.
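In case it helps anyone else digging into this, cluster.conf(5) also documents per-daemon logging entries, so the debug noise can be limited to the daemons of interest instead of everything at once. A sketch (attribute support may vary between versions):

<logging to_syslog="yes" to_logfile="yes">
<logging_daemon name="fenced" debug="on"/>
<logging_daemon name="rgmanager" debug="on"/>
<logging_daemon name="qdiskd" debug="on"/>
</logging>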
 
