[SOLVED] Watchdog fence for physical nodes

e100

Renowned Member
Nov 6, 2010
1,268
45
88
Columbus, Ohio
ulbuilder.wordpress.com
In Proxmox 3.x I set up fencing using APC PDUs.
I did not have any HA VMs configured, but if one of the Proxmox nodes locked up or crashed, the node would be fenced.

Is it possible to replicate this behavior in 4.x and 5.x?
I'm fine with the watchdog as the fencing method, I just don't see a way to make it work without having an HA VM set up on the node.

I tried setting up a group that contained only one Proxmox node and then assigned one VM on that node as an HA resource in that group. But when the node comes back up, the HA VM is in an error state requiring manual intervention.
 
I changed the restricted flag so my group looks like this:

Code:
group: vm16
        nodes vm16
        nofailback 0
        restricted 0
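If I remember the ha-manager CLI correctly, the same group stanza can be created from the command line instead of editing the config by hand; a sketch, using the group and node names from the config above:

```shell
# Sketch: create the single-node group via the HA manager CLI
# (names taken from the stanza above; verify against your cluster).
ha-manager groupadd vm16 --nodes vm16 --nofailback 0 --restricted 0

# The result should show up in the cluster-wide HA config:
cat /etc/pve/ha/groups.cfg
```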

This *might* work, except I only have local storage.
The VM16 node got fenced, and HA tried to start the VM on the VM17 node, but that failed because VM17 does not have the disks.
Then HA tried to migrate the VM to another node, which also failed because it wanted to clone local disks that don't exist:
Code:
Task started by HA resource agent
May 12 13:40:08 starting migration of VM 107 to node 'vm16' (x.x.x.x)
May 12 13:40:08 found local disk 'local-zfs:vm-107-disk-1' (in current VM config)
May 12 13:40:08 found local disk 'local-zfs:vm-107-disk-2' (in current VM config)
May 12 13:40:08 copying disk images
cannot open 'rpool/data/vm-107-disk-1': dataset does not exist
usage:
snapshot|snap [-r] [-o property=value] ... <filesystem|volume>@<snap> ...

For the property list, run: zfs set|get

For the delegated permission list, run: zfs allow|unallow
May 12 13:40:08 ERROR: Failed to sync data - command 'zfs snapshot rpool/data/vm-107-disk-1@__migration__' failed: exit code 2
May 12 13:40:08 aborting phase 1 - cleanup resources
May 12 13:40:08 ERROR: found stale volume copy 'local-zfs:vm-107-disk-1' on node 'vm16'
May 12 13:40:08 ERROR: migration aborted (duration 00:00:01): Failed to sync data - command 'zfs snapshot rpool/data/vm-107-disk-1@__migration__' failed: exit code 2
TASK ERROR: migration aborted

After repeated migration failures, the VM ends up in an error state.
 
Each node has a dedicated group that looks like this:
Code:
group: NodeName
        nodes Node_Name
        nofailback 0
        restricted 0
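Since every node needs an identical group, this can be scripted; a sketch, where the node names are placeholders that would need to match your cluster:

```shell
# Sketch: one non-restricted, no-failback group per physical node.
# Node names here are examples only.
for node in vm16 vm17 vm18; do
    ha-manager groupadd "$node" --nodes "$node" --nofailback 0 --restricted 0
done
```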

Each node has a diskless VM like this:
Code:
bootdisk: scsi0
cores: 1
freeze: 1
ide2: none,media=cdrom
memory: 1
name: NodeName-HA
numa: 0
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=6c1ab0d6-2ab3-46e4-9677-74f7e60894d8
sockets: 1
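If I have the qm options right, a VM like that can be created in one command; a sketch, where the VMID and name are examples. The `freeze: 1` setting is the trick: it halts the guest CPU at startup, so the placeholder VM never actually executes anything.

```shell
# Sketch: create the diskless placeholder VM from the CLI.
# VMID 916 and the name are examples; freeze=1 keeps the CPU halted.
qm create 916 --name NodeName-HA --memory 1 --cores 1 --sockets 1 \
    --freeze 1 --ide2 none,media=cdrom --ostype l26 \
    --scsihw virtio-scsi-pci --numa 0
```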

Then each diskless VM is set up as an HA resource like this:
Code:
vm: 916
        comment Server HA NodeName
        group NodeName
        state started
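That resource stanza should also be creatable via the CLI; a sketch, reusing the names from the config above:

```shell
# Sketch: register the placeholder VM as an HA resource pinned to
# its node's group (VMID and group name as in the stanza above).
ha-manager add vm:916 --group NodeName --state started \
    --comment "Server HA NodeName"
```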

Now the node gets fenced when it loses quorum, and when it starts back up the diskless VM is moved back to it.

While this works, it is not an ideal solution.
It would be nice if Proxmox had a simple way to set up physical server fencing without needing a diskless fake VM to do so.
 
It would be nice if Proxmox had a simple way to set up physical server fencing without needing a diskless fake VM to do so.

I guess this would be easy to implement, but I am quite unsure if many people want that feature ...
 
