Ceph Cluster Fencing

yena

I'm testing VM HA in a 4-node Ceph cluster.
I see "Use watchdog based fencing".
I have 4 Supermicro servers. Is it better if I configure IPMI fencing?

Thanks
 
I am currently in a similar situation, using Dell machines (offering IPMI v2) and older HP machines with iLO 2.
My question is what happens if one node fails and the network switch that its IPMI interface is connected to fails as well.

Furthermore, I am missing documentation on how to activate IPMI as the fencing device. I followed the documentation, set WATCHDOG_MODULE=ipmi_watchdog in /etc/default/pve-ha-manager, and set the GRUB option nmi_watchdog=0.
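For anyone repeating these two steps, here is a minimal sketch of the first one as a small shell helper. The function name `set_watchdog_module` and its file argument are my own illustration, not part of Proxmox; it only makes sure the given defaults file contains exactly one active `WATCHDOG_MODULE=` line.

```shell
#!/bin/sh
# Sketch only: set WATCHDOG_MODULE in a pve-ha-manager style defaults file.
# set_watchdog_module is a hypothetical helper name, not a Proxmox tool.
set_watchdog_module() {
    file=$1      # e.g. /etc/default/pve-ha-manager
    module=$2    # e.g. ipmi_watchdog

    if grep -q '^WATCHDOG_MODULE=' "$file" 2>/dev/null; then
        # An active setting exists: replace its value in place.
        sed -i "s/^WATCHDOG_MODULE=.*/WATCHDOG_MODULE=$module/" "$file"
    else
        # No active setting: append one (commented-out lines stay untouched).
        printf 'WATCHDOG_MODULE=%s\n' "$module" >> "$file"
    fi
}
```

On a node this would be called as `set_watchdog_module /etc/default/pve-ha-manager ipmi_watchdog`. For the second step, on a GRUB-booted Debian-based system the kernel option is usually added to `GRUB_CMDLINE_LINUX_DEFAULT` in /etc/default/grub, followed by `update-grub` and a reboot; I am assuming that layout here.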

Kernel modules are loaded:
Code:
root@node4:~# lsmod |grep ipmi
ipmi_watchdog          28672  1
ipmi_ssif              24576  0
ipmi_si                57344  1
ipmi_msghandler        49152  3 ipmi_ssif,ipmi_watchdog,ipmi_si
What else has to be done? There must be some configuration to tell Proxmox about the credentials, right?
Should this be done in /etc/pve/cluster.conf, as described in the old documentation?
 
I found out that I had misunderstood how the IPMI watchdog is used. Basically, it needs a driver to talk to a piece of hardware within the baseboard management controller (BMC). There is no communication via LAN; instead, the driver accesses the IPMI/BMC hardware directly. That hardware gets polled, and an action can be set to shut down or power-cycle the machine.

There is no need to supply credentials, and it is no problem if the network fails. Nothing has to be done except making sure that the IPMI driver is loaded, Proxmox is told which watchdog to use, and the GRUB setting is made. You can check your IPMI configuration with:

Code:
ipmitool mc watchdog get

You should get something like:

Code:
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      10 sec
Present Countdown:      9 sec
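If you want to check this on several nodes, the output above can be evaluated by a small script. This is only a sketch: `check_watchdog` is a name I made up, it relies on the field labels looking exactly like the output shown above, and I am assuming that a disarmed timer reports an action starting with "No action".

```shell
#!/bin/sh
# Sketch only: read `ipmitool mc watchdog get` output on stdin and report
# whether the BMC watchdog is running with a real timeout action.
# check_watchdog is a hypothetical helper name.
check_watchdog() {
    awk -F': *' '
        /^Watchdog Timer Is:/      { state = $2 }   # e.g. Started/Running
        /^Watchdog Timer Actions:/ { action = $2 }  # e.g. Power Cycle (0x03)
        END {
            # Armed means: timer running and an action other than "No action".
            if (state ~ /Running/ && tolower(action) !~ /^no action/) {
                print "armed: " action
                exit 0
            }
            print "not armed (state: " state ", action: " action ")"
            exit 1
        }'
}
```

Usage on a node would be `ipmitool mc watchdog get | check_watchdog`; the exit status is 0 only when the timer is running and armed, so it can be used in a loop over all cluster members.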

If you want to simulate a proper fencing event, execute the following; the node should reset within a few seconds.

Code:
echo "A" | socat - UNIX-CONNECT:/var/run/watchdog-mux.sock


Nevertheless, I experienced problems with HP iLO 2. Those machines did not fence correctly in my test setup, so I used the software watchdog for them.

Hope that helps.
 