[SOLVED] issues after HA test

luphi

Renowned Member
Nov 9, 2015
82
5
73
All,

Today I did some HA testing on a 4-node cluster running version 5.4.
I configured an HA group including all nodes and added two VMs to the group.
The two VMs were running on node 1 and node 2.
I also set shutdown_policy=failover so that simply rebooting a node would trigger the failover (rough setup sketch below).
Shortly after rebooting node 1, I received two e-mails with the following subjects:

FENCE: Try to fence node 'pve1'
SUCCEED: fencing: acknowledged - got agent lock for node 'pve1'

The VM running on node 1 failed over to node 0.
Everything looked fine so far.
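
For completeness, the setup boils down to roughly the following, whether done via GUI or CLI (the group name and VMIDs here are placeholders, not the real ones):

Code:
# HA group spanning all four nodes (group name and node names are placeholders)
ha-manager groupadd ha-all --nodes "pve0,pve1,pve2,pve3"

# add the two VMs as HA resources in that group (VMIDs are placeholders)
ha-manager add vm:100 --group ha-all --state started
ha-manager add vm:101 --group ha-all --state started

# and the shutdown policy, set in /etc/pve/datacenter.cfg:
ha: shutdown_policy=failover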

But once node 1 was back online, it didn't join the cluster anymore.
Do I have to manually remove the fence?

On the network I just see the following communication:
node3.5404 --> node0.5405
node0.5404 --> node2.5405
node1.5404 --> 239.192.204.105.5405
node2.5404 --> 239.192.91.165.5405
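
For reference, these are the checks I have in mind next (standard PVE tooling; the omping test needs to be started on all four nodes at roughly the same time):

Code:
# cluster membership / quorum as seen by this node
pvecm status

# HA manager view (which nodes the CRM considers online or fenced)
ha-manager status

# corosync and cluster filesystem services
systemctl status corosync pve-cluster

# multicast connectivity test between all nodes (node names are placeholders)
omping -c 600 -i 1 -q pve0 pve1 pve2 pve3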

Any help with further troubleshooting is really appreciated.

Cheers,
luphi
 
Before I rebooted node 1, I had updated it, so some package versions differ between node 1 and the rest of the cluster.
Not sure if this is related.

Code:
< proxmox-ve: 5.4-1 (running kernel: 4.15.18-9-pve)
< pve-manager: 5.4-3 (running version: 5.4-3/0a6eaa62)
< pve-kernel-4.15: 5.3-3
---
> proxmox-ve: 5.4-1 (running kernel: 4.15.18-13-pve)
> pve-manager: 5.4-4 (running version: 5.4-4/97a96833)
> pve-kernel-4.15: 5.4-1
5c5,6
< pve-kernel-4.15.18-12-pve: 4.15.18-35
---
> pve-kernel-4.15.18-13-pve: 4.15.18-37
> pve-kernel-4.15.18-12-pve: 4.15.18-36
12d12
< pve-kernel-4.13.13-5-pve: 4.13.13-38
24c24
< libpve-common-perl: 5.0-50
---
> libpve-common-perl: 5.0-51
33c33
< proxmox-widget-toolkit: 1.0-25
---
> proxmox-widget-toolkit: 1.0-26
38c38
< pve-firewall: 3.0-19
---
> pve-firewall: 3.0-20
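
(The diff above compares pveversion -v output: the '<' lines are the not-yet-updated versions on the rest of the cluster, the '>' lines are the updated node 1. It can be reproduced with something like the following, the hostname being a placeholder:)

Code:
# run on node 1; pve0 stands for any not-yet-updated node reachable via SSH
diff <(ssh pve0 pveversion -v) <(pveversion -v)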

Cheers,
luphi
 
I updated my Proxmox cluster today and ended up in a broken cluster state similar to what you describe; coincidentally, it happened after updating to the same pve-manager and pve-kernel versions.

I updated a few nodes (those without VMs) one by one and rebooted them without issue, until I reached the last two, which did contain VMs and are part of an HA group.

While updating one of them (pve02), I moved its VMs to the node that was to be updated last (pve01). I noticed the installation slowing down until my SSH session terminated because the machine rebooted; at that point pve01 started sending fence mails about pve02.

My then-master node (pve01) was the one that still needed updates, but it showed up as 'dead' in Proxmox's HA status. I figured this would be resolved by installing the Proxmox updates or restarting the pve-ha services, but that made this node forcibly reboot as well, possibly because some timeout triggered the watchdog.
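
In hindsight, before restarting anything I would first have looked at the HA and watchdog state with something like the following (read-only checks, nothing here touches the watchdog itself):

Code:
# HA manager view: which node is master, which LRMs are active or dead
ha-manager status

# state of the HA services and the watchdog multiplexer on this node
systemctl status pve-ha-crm pve-ha-lrm watchdog-mux

# raw manager state as stored in the cluster filesystem
cat /etc/pve/ha/manager_status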

While I'm glad I follow the enterprise repository for my production cluster, I hope someone can confirm whether this was bad luck or whether the update caused the issue; if the latter, it should not hit the enterprise repo.
 
Thank you for your reply, Menno.
It turned out that the problem was not caused by the HA test but by the update of node1.
I was not able to get node1 back into the old cluster, but I was able to move all remaining nodes into the new cluster formed by node1 by updating and rebooting them one by one.

Cheers,
luphi
 
Glad you got it sorted!

I still wonder whether something in the update caused it, and what the appropriate way of installing updates on a cluster would be. I always assumed updating nodes one by one would be the safest way, roughly the routine sketched below.
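
For reference, the per-node routine I had in mind (standard apt/PVE commands; the VMID and target node are just examples):

Code:
# before updating a node: move its HA guests elsewhere
ha-manager migrate vm:100 pve01    # example VMID and target node; repeat per guest

# install updates and reboot the node
apt update && apt dist-upgrade
reboot

# once it is back: verify cluster and HA state before touching the next node
pvecm status
ha-manager status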
 
