Node died but VM failed to migrate with HA

Kei

Active Member
May 29, 2016
Hello,
I have a three-node cluster and everything seems to be running fine. All VMs are on iSCSI shared storage, and I can live-migrate from one node to another without problems in a few milliseconds.
However, the other day a node rebooted for no apparent reason, and the only HA-managed VM running on that node did not migrate. All it did was power off hard and then power back on as soon as the node was online again.
At that point I went looking through the logs (/var/log/syslog) but found nothing relevant. I would kindly ask for some help to understand why the node rebooted for no reason and why the VM did not migrate at all.
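In case it helps, this is roughly what I ran afterwards to inspect the cluster and HA state (I'm on PVE 4.x; command and unit names are from memory and may differ slightly on other versions):

    # Quorum and membership as seen by the surviving nodes
    pvecm status
    # HA manager view: master node and the state of each HA service
    ha-manager status
    # HA resource definitions - the VM should be listed here if it is really HA-managed
    ha-manager config
    # HA and cluster logs around the time of the reboot
    journalctl -u pve-ha-lrm -u pve-ha-crm -u corosync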
 

My guess would be the fence action failed. What are you using for fencing?
 
The cluster is supposed to use a software watchdog for fencing, so I assumed it would work out of the box with no configuration. Am I missing something?
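For what it's worth, this is how I tried to confirm that the default software watchdog (softdog) is really active on my nodes. As far as I understand, with no WATCHDOG_MODULE override in /etc/default/pve-ha-manager the softdog module is used, but please correct me if the file or names differ on your version:

    # No WATCHDOG_MODULE= line here should mean the default softdog is in use
    cat /etc/default/pve-ha-manager
    # The softdog kernel module should be loaded and the watchdog device present
    lsmod | grep softdog
    ls -l /dev/watchdog
    # watchdog-mux is the Proxmox service that feeds the watchdog for the HA stack
    systemctl status watchdog-mux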
 

I test fencing on every cluster before going into production. I don't know what your issue is, but I would start by testing fencing if possible.
 
Thank you. Is shutting down all the switch ports connected to a node a good test for this purpose? I assume a fencing test does not mean actually unplugging the server's power cord :)
 

If those ports carry the network the cluster (corosync) runs on, then yes. If they only carry the LAN connection, it won't fail over. Pulling the power should also work. Another option is to kill the corosync process on a node; it should get fenced and the VM started on another node.
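Roughly like this, ideally on a node running only a test VM first; timings depend on the watchdog timeout, so take the one minute below as a ballpark figure rather than something exact:

    # On the node you want to see fenced: kill corosync so the node drops out of the cluster
    killall -9 corosync
    # On another node: watch the HA manager notice the failure and recover the VM
    watch ha-manager status
    # The dead node should reset itself once its watchdog expires (roughly a minute),
    # and the HA-managed VM should then be restarted on one of the remaining nodes.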
 
