How to clean up "zombie" fencing requests in Pacemaker after disaster recovery of an 8.3 cluster

Bruno SINOU

Hi,

I have a simple 3-node Proxmox 8.3 cluster hosted at Scaleway (ex-Dedibox).
Due to server failures, I had to replace all the nodes one after the other (one of them after a hard crash), while keeping the cluster alive.

Everything now seems to be back to normal, but the pacemaker service still logs these errors roughly every second:

Bash:
Mar 13 20:27:58 host04 pacemaker-fenced[xxxx524]:  notice: Client stonith-api.1074640 wants to fence (reboot) 4 using any device
Mar 13 20:27:58 host04 pacemaker-fenced[xxxx524]:  notice: Requesting peer fencing (reboot) targeting host06
Mar 13 20:27:58 host04 pacemaker-fenced[xxxx524]:  notice: Couldn't find anyone to fence (reboot) host06 using any device
Mar 13 20:27:58 host04 pacemaker-fenced[xxxx524]:  error: Operation 'reboot' targeting host06 by unknown node for stonith-api.xxxx640@host04: Error occurred (No fence device)
Mar 13 20:27:58 host04 pacemaker-controld[xxxx528]:  notice: Peer host06 was not terminated (reboot) by the cluster on behalf of stonith-api.1074640@host04: No fence device

Fencing is off on all 3 nodes because I have no access to STONITH, but the question is:

How can I tell Pacemaker that everything is back to normal and that I should no longer receive such fencing requests?
 
Hi!

Proxmox VE doesn't use pacemaker, but has its own high availability solution called HA Manager. Is there a reason that you have set it up on the cluster?
 
Hi,

@dakralex thanks for the reply.

> Is there a reason that you have set it up on the cluster?

To be honest, I don't know. The cluster was set up a while ago, and that's the solution we arrived at by following online documentation we found.
If you have a hint with documentation that describes how to rather use the HA manager, it would be greatly appreciated.

That said, I finally solved my issue by stopping dlm and pacemaker **on each node** at the same time and then restarting the services. After that, DLM was no longer stuck, and we could get rid of the old events with:
Code:
pcs stonith history cleanup

In case it might help someone else...
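
For anyone who wants to script the steps above, here is a rough sketch, not a tested procedure. It assumes the systemd units are named `pacemaker` and `dlm`, that root SSH access to every node is available, and that `pcs` is run from one cluster node. `host05` is a made-up name for the third node (only host04 and host06 appear in the logs). By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: stop the cluster stack on ALL nodes before starting it anywhere,
# then clear the recorded fencing history.
# Assumptions: unit names "pacemaker" and "dlm", root SSH access to each node.
# host05 is a hypothetical name; adjust NODES to your cluster.
NODES="${NODES:-host04 host05 host06}"

# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

restart_cluster_stack() {
    # Phase 1: stop both services on every node, so no node keeps stale state alive.
    for node in $NODES; do
        run ssh "root@$node" systemctl stop pacemaker dlm
    done
    # Phase 2: only once everything is stopped, start the services again.
    for node in $NODES; do
        run ssh "root@$node" systemctl start dlm pacemaker
    done
    # Phase 3: drop the old fencing events (the command from the post above).
    run pcs stonith history cleanup
}

restart_cluster_stack
```

Note that the loop goes node by node, so the services are not stopped at literally the same instant; the point is only that nothing is restarted until every node has been stopped.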
 
Hi!

> If you have a hint with documentation that describes how to rather use the HA manager, it would be greatly appreciated.

There's documentation for the PVE HA Manager in our admin guide at [0]. It's designed so that virtual machines and containers can be set up as highly available services, which are automatically recovered on other nodes, for example when a node fails.

[0] https://pve.proxmox.com/pve-docs/chapter-ha-manager.html
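
As a minimal illustration of what that looks like on the command line, the sketch below registers a guest as an HA resource with `ha-manager` and then shows the HA status. The VM ID 100 and the restart/relocate limits are placeholder values, not a recommendation, and the script only prints the commands unless `DRY_RUN=0` is set. See [0] for the full set of options.

```shell
#!/bin/sh
# Sketch only: put one VM under HA Manager control.
# VMID 100 is a hypothetical guest ID; replace it with a real one.
VMID="${VMID:-100}"

# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

enable_ha() {
    # Register the VM as an HA resource that should be kept running.
    run ha-manager add "vm:$VMID" --state started --max_restart 1 --max_relocate 1
    # Show the current state of the HA stack and its resources.
    run ha-manager status
}

enable_ha
```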