Understanding why a node fenced

Oct 29, 2018
Hi. I'm on a 4-node Proxmox VE 5.2 cluster.

I was testing online VM migrations (with the VMs on iSCSI shared storage).

While migrating one guest online to another node, the origin node got fenced, which broke the migration. I'd like to understand why the node got fenced midway through an online migration.

The setup:

vmhost1, vmhost2 (also vmhost3 and vmhost4, but they aren't relevant here)

vmhost1 has vmguest1
vmhost2 has vmguest2

vmguest1 and vmguest2 both on iSCSI shared storage.

vmbr0 on public (internet) network
vmbr1 on internal management network - 192.168.x.x

corosync on management network
migration takes place on management network

vmguest1, vmguest2 both in HA, with vmguest1 on vmhost1, and vmguest2 on vmhost2 preferred.

I've done this a number of times without failure, but it failed in this test, which is why I'm raising it here to understand why.

I kicked off an online migration of vmguest2 from vmhost2 to vmhost1.

Midway through the migration, vmhost2 got fenced and rebooted.

After 5-6 minutes, when vmhost2 came back up (Dell servers take ages to physically boot; I'm not sure how to speed that up), vmguest2 was left in a migration-locked state and couldn't be started.

I had to run "qm unlock" on vmguest2 to get it back, although I wasn't sure whether the data was intact, since the migration never completed. It did start up with some failures; after another reboot it started properly, but I still wasn't sure the data was intact.

For fencing, I use:

```
tail /etc/modules
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
iTCO-wdt
```

and `lsmod` shows it loaded:

```
iTCO_wdt               16384  1
iTCO_vendor_support    16384  1 iTCO_wdt
```

Time is synced between the cluster nodes by Proxmox's default configuration.

My suspicions are:

1. Because the migration uses the management network on vmbr1, which corosync also uses, the migration traffic may saturate the link and starve corosync?

I think I could fix this by:

a) limiting the vmbr1 NIC bandwidth on vmguest2 to something like 20 Mbps, leaving room for corosync?
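If link saturation is the issue, capping the migration traffic itself (rather than the guest's NIC) may be more direct. As a sketch, here's how a hypothetical 20 Mbit/s cap converts into the KiB/s value that `qm migrate --bwlimit` expects — note the `--bwlimit` flag and the 20 Mbit/s figure are my assumptions; check whether your PVE version's `qm` supports it:

```shell
# Convert a hypothetical 20 Mbit/s cap into KiB/s, the unit qm's
# --bwlimit option takes (availability depends on your PVE version).
MBPS=20
BWLIMIT_KIB=$(( MBPS * 1000 * 1000 / 8 / 1024 ))
echo "$BWLIMIT_KIB"   # prints 2441
```

The resulting value could then be passed along the lines of `qm migrate <vmid> <targetnode> --online --bwlimit 2441` (hypothetical invocation). That said, a dedicated corosync link is the more robust fix than any bandwidth cap.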

These are only my suspicions, but I'd love to hear the community's ideas.

Thanks.

Michael.
 
Sharing the corosync network with your migration and storage network is probably the cause of the fencing.
Check the journal (`journalctl -r`) around the time the fencing happened, and look for messages from corosync.
(Also check the other nodes' logs for entries related to pve-ha-crm and pve-ha-lrm)

If at all possible, put the corosync traffic on its own interface.

Check our documentation for the recommendations: https://pve.proxmox.com/pve-docs/chapter-pvecm.html
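For the dedicated-interface route, a corosync.conf fragment might look like the sketch below. The node names come from this thread, but the 10.10.10.0/24 subnet and addresses are placeholders; follow the documented cluster procedure rather than hand-editing on a live cluster, and remember to bump `config_version` in the `totem` section when changing the file:

```
# /etc/corosync/corosync.conf (fragment) - hypothetical dedicated corosync subnet
totem {
  version: 2
  interface {
    ringnumber: 0
    bindnetaddr: 10.10.10.0
  }
}

nodelist {
  node {
    name: vmhost1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }
  # ...remaining nodes, each with a ring0_addr on the same dedicated subnet
}
```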