Node died but VM failed to migrate with HA

Kei

Active Member
May 29, 2016
Hello,
I have a three-node cluster and everything seems to be running fine. All VMs are on iSCSI shared storage, and I can live-migrate from one node to another without problems in a few milliseconds.
However, the other day a node rebooted for no apparent reason, and the only HA-managed VM running on that node did not migrate. All it did was power off hard and then power back on as soon as the node was online again.
At that point I went looking through the logs (/var/log/syslog) but found nothing relevant. I would kindly ask for some help to understand why the node rebooted for no reason and why the VM did not migrate at all.
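In case it helps, this is roughly what I ran afterwards to inspect the cluster and HA state (I'm on PVE 4.x; command and unit names are from memory and may differ slightly on other versions):

    # Quorum and membership as seen by the surviving nodes
    pvecm status
    # HA manager view: master node and the state of each HA service
    ha-manager status
    # HA resource definitions - the VM should be listed here if it is really HA-managed
    ha-manager config
    # HA and cluster logs around the time of the reboot
    journalctl -u pve-ha-lrm -u pve-ha-crm -u corosync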
 

My guess would be the fence action failed. What are you using for fencing?
 
The cluster is supposed to use a software watchdog for fencing, so I assumed it would work out of the box with no configuration. Am I missing something?
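For what it's worth, this is how I tried to confirm that the default software watchdog (softdog) is really active on my nodes. As far as I understand, with no WATCHDOG_MODULE override in /etc/default/pve-ha-manager the softdog module is used, but please correct me if the file or names differ on your version:

    # No WATCHDOG_MODULE= line here should mean the default softdog is in use
    cat /etc/default/pve-ha-manager
    # The softdog kernel module should be loaded and the watchdog device present
    lsmod | grep softdog
    ls -l /dev/watchdog
    # watchdog-mux is the Proxmox service that feeds the watchdog for the HA stack
    systemctl status watchdog-mux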
 

I test fencing on every cluster before going into production. I don't know what your issue is, but I would start by testing fencing if possible.
 
Thank you. Is shutting down all the switch ports connected to a node a good test for this purpose? I assume a fencing test does not mean actually unplugging the server's power cord :)
 

If those ports carry the network the cluster (corosync) runs on, then yes. If they only carry the LAN connection, it won't fail over. Pulling the power should also work. Another option is to kill the corosync process on a node; it should get fenced and the VM started on another node.
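Roughly like this, ideally on a node running only a test VM first; timings depend on the watchdog timeout, so take the one minute below as a ballpark figure rather than something exact:

    # On the node you want to see fenced: kill corosync so the node drops out of the cluster
    killall -9 corosync
    # On another node: watch the HA manager notice the failure and recover the VM
    watch ha-manager status
    # The dead node should reset itself once its watchdog expires (roughly a minute),
    # and the HA-managed VM should then be restarted on one of the remaining nodes.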
 
