HA is not working with 3-node cluster - resources are NOT failing over

cmonty14

Hi,
I have set up a 3-node cluster that is working like a charm, meaning I can migrate any VM or CT from one node to another.
All nodes use shared storage provided by Ceph.

I followed the instructions and created HA groups and resources:
root@ld4257:~# more /etc/pve/ha/groups.cfg
group: web
    comment PVE HA for Web Applications
    nodes ld4465,ld4464
    nofailback 0
    restricted 0

group: lve
    comment PVE HA for LVE Services
    nodes ld4465,ld4257,ld4464
    nofailback 0
    restricted 0

root@ld4257:~# more /etc/pve/ha/resources.cfg
ct: 206
    group lve
    state started

ct: 204
    group lve
    state started

ct: 200
    group lve
    state started

vm: 113
    group web
    state started

vm: 114
    group web
    state started

vm: 115
    group web
    state started
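
For reference, entries like these can also be created from the CLI with ha-manager instead of editing the files directly; a minimal sketch using the group names and IDs from the configs above (one add per resource, the others follow the same pattern):

root@ld4257:~# ha-manager groupadd web --nodes ld4465,ld4464 --nofailback 0 --restricted 0
root@ld4257:~# ha-manager groupadd lve --nodes ld4465,ld4257,ld4464
root@ld4257:~# ha-manager add vm:113 --group web --state started
root@ld4257:~# ha-manager add ct:204 --group lve --state started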


For maintenance I triggered a reboot of node ld4465 and expected all CTs (204 and 206) running on this node to be migrated to node ld4464.
However, this failover did not work.
Instead, all VMs were stopped, and the CTs kept running but were not accessible.
Finally node ld4465 rebooted, and all VMs and CTs remained there.
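
For the record, the HA state during the event can also be inspected from the CLI; a minimal sketch using the standard ha-manager tool and PVE HA service units:

root@ld4257:~# ha-manager status                        # current CRM/LRM state and per-resource status
root@ld4257:~# journalctl -u pve-ha-crm -u pve-ha-lrm   # logs of the HA cluster and local resource managers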

Please check the attached screenshots documenting this.

Why was HA failover not working?
 

Attachments

  • 2018-10-24_15-25-13.png (168.3 KB)
  • 2018-10-24_15-21-27.png (230.6 KB)

I don't agree with the devs' logic at all, but that is by design. Their logic is: reboot = planned, so no failover. A better test would be to kill the corosync process, or pull the power on one node.
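
A minimal sketch of such a test, run on the node whose failure you want to simulate (standard Linux tools; the second command assumes sysrq is enabled and will crash the node immediately, so only use it on a test node):

root@ld4465:~# killall -9 corosync            # node drops out of the cluster; the HA watchdog should fence it
root@ld4465:~# echo c > /proc/sysrq-trigger   # alternatively: hard-crash the kernel, like pulling the power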

I would still love to have this as an option, but they don't seem to care too much.
 
