[SOLVED] How to disable fencing for debugging?

We recently added a new node to our existing cluster. While moving a VM to the new node, the server failed unexpectedly. Since then, every time the server is started, it gets rebooted after a couple of minutes. This prevents us from investigating and fixing the problem on the node.

Looking at the IPMI log, it looks like the IPMI watchdog is responsible for rebooting the server. That makes me think Proxmox fencing is behind the reboot.

So I'm here to ask for help with fencing. Is it possible to disable it? I don't see any other way to fix the problem on the node if it keeps rebooting every couple of minutes. I barely have time to log in to the server before it gets rebooted.
 
Looking at the IPMI log, it looks like the IPMI watchdog is responsible for rebooting the server. That makes me think Proxmox fencing is behind the reboot.

That can normally only happen if you actively whitelisted the IPMI watchdog module: by default we blacklist all watchdog modules so that an admin can whitelist the desired one. We do this because multiple watchdogs are often present, and some (e.g., HP ones) are really buggy. The fallback is always the Linux kernel "softdog" (software watchdog).

Normally one would set the desired watchdog in /etc/default/pve-ha-manager, stop the pve-ha-lrm and pve-ha-crm services, then restart watchdog-mux and start the HA services again.
But you suggest that you did not change that, in which case the softdog should have been used.
For debugging (and ONLY for that) it can be told not to actually reboot by adding the "soft_noboot" module parameter (see modinfo softdog).

Code:
# cat /etc/modprobe.d/softdog.conf
options softdog soft_noboot=1
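
For reference, the watchdog switch described above (set the module, stop the HA services, restart watchdog-mux, start the HA services again) could be sketched roughly as follows. This is a dry run that only prints the steps, not an official procedure. The function name print_watchdog_switch_plan is made up for illustration, and the WATCHDOG_MODULE variable name is assumed to match what your version of /etc/default/pve-ha-manager uses, so check the comments in that file first.

```shell
#!/bin/sh
# Dry-run sketch: prints the steps instead of executing them.
# print_watchdog_switch_plan is a hypothetical helper name.
print_watchdog_switch_plan() {
    module="$1"   # e.g. softdog or ipmi_watchdog
    # Assumption: the config file uses a WATCHDOG_MODULE variable.
    echo "# in /etc/default/pve-ha-manager set: WATCHDOG_MODULE=${module}"
    # Stop both HA services before touching the watchdog multiplexer.
    echo "systemctl stop pve-ha-lrm pve-ha-crm"
    echo "systemctl restart watchdog-mux"
    # Bring the HA services back up afterwards.
    echo "systemctl start pve-ha-crm pve-ha-lrm"
}

print_watchdog_switch_plan softdog
```

Review the printed commands and run them by hand as root only once you are sure which watchdog you want.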
 
Sorry, I cut corners while explaining the situation. I configured the IPMI watchdog first, then disabled it, thinking that would disable fencing. With your help I have now completely disabled both the IPMI watchdog and softdog.

Now I can get a better understanding of what is happening.

When I start the server with networking enabled, the server hangs completely after 1-2 minutes.
When I start the server with networking disabled, the server keeps running without issue.

I'm guessing it's related to the automatic startup of one VM that is still configured to run on this node. When the node gets quorum, the VM gets started and freezes the server.

Any recommendations on how to debug this?
 
Never mind, it's not a Proxmox issue at all. It's a problem with the lspci command.

Running: lspci -v -s $(lspci | grep VGA | cut -d" " -f 1)
It kills the server! Can you believe that? Even worse, considering this can be run with root permissions.

Anyhow, this command was run by our monitoring system, which is why it killed the server every time I plugged in the network.
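
For anyone trying to narrow down something similar, here is a rough sketch that splits the one-liner into its two stages: first list the VGA slot addresses, then query each slot on its own. The helper name vga_slots and the sample lspci lines are made up for illustration. This does not fix the hang, it only makes it easier to see which device and which stage triggers it.

```shell
#!/bin/sh
# vga_slots is a hypothetical helper: it reads `lspci` output on stdin
# and prints the slot address (first field) of every line mentioning VGA.
vga_slots() {
    grep VGA | cut -d" " -f 1
}

# Demo on made-up sample lines (not real output from the affected node).
printf '%s\n' \
    '00:02.0 VGA compatible controller: Example GPU' \
    '00:1f.3 Audio device: Example Audio' | vga_slots
# prints: 00:02.0

# Stage 2 is the verbose query from the post, one slot at a time.
# Left commented out on purpose; run it only when prepared for a hang:
#   for slot in $(lspci | vga_slots); do lspci -v -s "$slot"; done
```

Querying one slot per call also avoids passing several addresses to a single -s when the machine has more than one VGA device.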
 