HA failover not restarting guest

p0lar

Member
Mar 1, 2019
I have a cluster of 3 hosts and about 6 guests in an HA cluster. When I try to simulate a failure of one of the hosts by unplugging its Ethernet cable, nothing happens. I was expecting the guests to restart on the other nodes. After a few minutes of nothing I plug the Ethernet cable back in, and THEN the guests restart on another host.

- an HA group has been created with all hosts in it, and the guests are all HA resources
- max restarts is set to 1
- max relocate is set to 1
- all guest storage is on a shared NFS system

What am I missing?
 
Hi,
what is the status of the VMs in `ha-manager status` once the corresponding host is unplugged? And what does `pvecm status` report for the cluster?
 
Before unplug

root@pve30:~# pvecm status
Quorum information
------------------
Date: Mon Mar 11 08:52:08 2019
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000002
Ring ID: 2/60
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 10.1.1.30 (local)
0x00000003 1 10.1.1.31
0x00000001 1 10.1.1.32
root@pve30:~#

AFTER unplug of 10.1.1.32

Quorum information
------------------
Date: Mon Mar 11 08:53:09 2019
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000002
Ring ID: 2/64
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 10.1.1.30 (local)
0x00000003 1 10.1.1.31
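
For reference, the vote counts in the two outputs above are consistent with a simple majority rule: with 3 expected votes the quorum is 2, so the two remaining nodes stay quorate after 10.1.1.32 drops out. The sketch below only illustrates that arithmetic. The function names are made up for this example, and it assumes the default one-vote-per-node setup with no special votequorum options.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for a strict majority of the expected votes."""
    return expected_votes // 2 + 1


def is_quorate(total_votes: int, expected_votes: int) -> bool:
    """True if the votes currently present reach the majority threshold."""
    return total_votes >= quorum_threshold(expected_votes)


# Values taken from the pvecm output: 3 expected votes, 2 remaining after the unplug.
print(quorum_threshold(3))   # 2
print(is_quorate(2, 3))      # True
print(is_quorate(1, 3))      # False
```

So losing quorum is not what delays the restart here: the surviving side of the cluster keeps its majority the whole time.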
 
As soon as I plugged node 32 back in, I got the task output below and VM 105 started on another host.

Task viewer: Start all VMs and Containers

waiting for quorum ...
got quorum
Starting VM 105
TASK OK
 
OK,
so as far as corosync is concerned, the node goes down and the other two remain quorate. What is the status of the VMs in `ha-manager status` after unplugging?
 
They don't move and eventually go offline. Only AFTER I plugged node 32 back in did one of them move.

I was expecting that when I unplugged node 32, any guests on it would restart on the other hosts quickly.

Right?
Maybe I am missing a config option or something?
 
OK, I discovered something: I wasn't waiting long enough. The failover DID occur, after about 5 minutes of downtime. Is there any way to change that and make it a bit faster?

Good to know it works, it's just slow.
 
This may have to do with:
- the time it takes to recognize that the node is down
- the time and number of attempts spent trying to restart the VMs on the given node
- once the restart timeout/attempts are exhausted, the VMs are moved over to the next node

On my system it takes a couple of minutes before the VMs are moved over and fully restarted.
- I am not sure how many minutes exactly, but at least 3 minutes (maybe I should actually time this...)
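
One way to actually time this is to poll `ha-manager status` from a surviving node and stop the clock once the guest is reported as started somewhere else. The following is a rough sketch, not a tested tool. It assumes service lines look like `service vm:105 (pve31, started)`, and the helper names, polling interval and timeout are all invented for illustration.

```python
import re
import subprocess
import time

# Assumed shape of a service line in `ha-manager status` output:
#   service vm:105 (pve31, started)
SERVICE_RE = re.compile(r"^service\s+(\S+)\s+\(([^,\s]+),\s*([^)]+)\)\s*$")


def parse_services(status_text):
    """Return {service_id: (node, state)} for every service line found."""
    services = {}
    for line in status_text.splitlines():
        match = SERVICE_RE.match(line.strip())
        if match:
            services[match.group(1)] = (match.group(2), match.group(3).strip())
    return services


def time_failover(sid, failed_node, poll_seconds=5.0, timeout=900.0):
    """Seconds until `sid` is started on a node other than `failed_node`.

    Returns None if that does not happen within `timeout` seconds.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        result = subprocess.run(
            ["ha-manager", "status"],
            capture_output=True, text=True, check=False,
        )
        node, state = parse_services(result.stdout).get(sid, (None, None))
        if state == "started" and node != failed_node:
            return time.monotonic() - start
        time.sleep(poll_seconds)
    return None
```

Calling something like `time_failover("vm:105", "pve32")` on a remaining node right after pulling the cable would then give a rough figure. Here `pve32` is a guessed hostname for 10.1.1.32, and the result is only accurate to within the polling interval.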
 
