HA is not working with 3-node cluster - resources are NOT failing over

cmonty14

Hi,
I have set up a 3-node cluster that is working like a charm, meaning I can migrate any VM or CT from one node to another.
All nodes use shared storage provided by Ceph.

I followed the instructions and created HA groups and resources:
root@ld4257:~# more /etc/pve/ha/groups.cfg
group: web
    comment PVE HA for Web Applications
    nodes ld4465,ld4464
    nofailback 0
    restricted 0

group: lve
    comment PVE HA for LVE Services
    nodes ld4465,ld4257,ld4464
    nofailback 0
    restricted 0

root@ld4257:~# more /etc/pve/ha/resources.cfg
ct: 206
    group lve
    state started

ct: 204
    group lve
    state started

ct: 200
    group lve
    state started

vm: 113
    group web
    state started

vm: 114
    group web
    state started

vm: 115
    group web
    state started
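
For reference, entries like these can also be created from the CLI with ha-manager instead of editing the files directly; a minimal sketch using the group names and IDs from the configs above (one add per resource, the others follow the same pattern):

root@ld4257:~# ha-manager groupadd web --nodes ld4465,ld4464 --nofailback 0 --restricted 0
root@ld4257:~# ha-manager groupadd lve --nodes ld4465,ld4257,ld4464
root@ld4257:~# ha-manager add vm:113 --group web --state started
root@ld4257:~# ha-manager add ct:204 --group lve --state started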


For maintenance I triggered a reboot of node ld4465 and expected all CTs (204 and 206) running on this node to be migrated to node ld4464.
However, this failover did not work.
Instead, all VMs were stopped, and the CTs kept running but were not accessible.
Finally node ld4465 rebooted, and all VMs and CTs remained there.
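
For the record, the HA state during the event can also be inspected from the CLI; a minimal sketch using the standard ha-manager tool and PVE HA service units:

root@ld4257:~# ha-manager status                        # current CRM/LRM state and per-resource status
root@ld4257:~# journalctl -u pve-ha-crm -u pve-ha-lrm   # logs of the HA cluster and local resource managers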

Please check the attached screenshots documenting this.

Why was HA failover not working?
 

Attachments

  • 2018-10-24_15-25-13.png (168.3 KB)
  • 2018-10-24_15-21-27.png (230.6 KB)

I don't agree with the devs' logic at all, but that is by design. Their logic is: reboot = planned, so no failover. A better test would be to kill the corosync process, or pull the power on one node.
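
A minimal sketch of such a test, run on the node whose failure you want to simulate (standard Linux tools; the second command assumes sysrq is enabled and will crash the node immediately, so only use it on a test node):

root@ld4465:~# killall -9 corosync            # node drops out of the cluster; the HA watchdog should fence it
root@ld4465:~# echo c > /proc/sysrq-trigger   # alternatively: hard-crash the kernel, like pulling the power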

I would still love to have this as an option, but they don't seem to care too much.
 
