[SOLVED] issues after HA test

luphi · Apr 26, 2019

All,

today I did some HA testing on a 4-node cluster, version 5.4.
I configured an HA group including all nodes and added two VMs to the group.
The two VMs were running on node 1 and node 2.
I also set shutdown_policy=failover to initiate the failover by simply rebooting a node.
Shortly after rebooting node 1, I received two e-mails with the following subjects:

FENCE: Try to fence node 'pve1'
SUCCEED: fencing: acknowledged - got agent lock for node 'pve1'

The VM running on node 1 failed over to node 0.
Everything looked fine so far.

But once node 1 was back online, it didn't join the cluster anymore.
Do I have to manually remove the fence?

On the network I just see the following communication:
node3.5404 --> node0.5405
node0.5404 --> node2.5405
node1.5404 --> 239.192.204.105.5405
node2.5404 --> 239.192.91.165.5405
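
To make the capture above easier to read, here is a minimal Python sketch that splits each flow into source and destination and marks whether the destination is a multicast address. The function names are made up for illustration, and the only input is the four lines quoted above. It does not explain why the flows look this way; it only shows that node1 and node2 are sending to two different multicast groups, which might be worth checking against the corosync configuration on each node.

```python
import ipaddress

# Flows copied from the capture above ("source.port --> destination.port").
FLOWS = """\
node3.5404 --> node0.5405
node0.5404 --> node2.5405
node1.5404 --> 239.192.204.105.5405
node2.5404 --> 239.192.91.165.5405
"""


def split_endpoint(endpoint):
    """Split 'host.port' on the last dot and return (host, port)."""
    host, port = endpoint.rsplit(".", 1)
    return host, int(port)


def is_multicast(host):
    """True if host is a literal multicast IP address, False for host names."""
    try:
        return ipaddress.ip_address(host).is_multicast
    except ValueError:
        return False


def classify(flows):
    """Return a list of (source, destination, kind) tuples."""
    result = []
    for line in flows.strip().splitlines():
        src, dst = (part.strip() for part in line.split("-->"))
        src_host, _ = split_endpoint(src)
        dst_host, _ = split_endpoint(dst)
        kind = "multicast" if is_multicast(dst_host) else "unicast"
        result.append((src_host, dst_host, kind))
    return result


def multicast_groups(flows):
    """Return the set of distinct multicast destination addresses."""
    return {dst for _, dst, kind in classify(flows) if kind == "multicast"}


if __name__ == "__main__":
    for src, dst, kind in classify(FLOWS):
        print(f"{src:6} -> {dst:16} {kind}")
    print("multicast groups:", sorted(multicast_groups(FLOWS)))
```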

Any help with further troubleshooting is really appreciated.

Cheers,
luphi

luphi · Apr 26, 2019

Before I rebooted node 1, I updated it, so some versions differ between node1 and the rest of the cluster.
Not sure if this is related.

Code:

< proxmox-ve: 5.4-1 (running kernel: 4.15.18-9-pve)
< pve-manager: 5.4-3 (running version: 5.4-3/0a6eaa62)
< pve-kernel-4.15: 5.3-3
---
> proxmox-ve: 5.4-1 (running kernel: 4.15.18-13-pve)
> pve-manager: 5.4-4 (running version: 5.4-4/97a96833)
> pve-kernel-4.15: 5.4-1
5c5,6
< pve-kernel-4.15.18-12-pve: 4.15.18-35
---
> pve-kernel-4.15.18-13-pve: 4.15.18-37
> pve-kernel-4.15.18-12-pve: 4.15.18-36
12d12
< pve-kernel-4.13.13-5-pve: 4.13.13-38
24c24
< libpve-common-perl: 5.0-50
---
> libpve-common-perl: 5.0-51
33c33
< proxmox-widget-toolkit: 1.0-25
---
> proxmox-widget-toolkit: 1.0-26
38c38
< pve-firewall: 3.0-19
---
> pve-firewall: 3.0-20
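
A comparison like the one above can also be produced per package instead of as a raw diff. The following is a rough Python sketch, assuming the two listings are `name: version` lines such as those printed by `pveversion -v`. The function names are hypothetical, and the sample data is just a subset of the lines from the diff above (the `<` side as OLD, the `>` side as NEW).

```python
# Subset of the "<" side of the diff above.
OLD = """\
pve-manager: 5.4-3 (running version: 5.4-3/0a6eaa62)
libpve-common-perl: 5.0-50
pve-firewall: 3.0-19
pve-kernel-4.13.13-5-pve: 4.13.13-38
"""

# Subset of the ">" side of the diff above.
NEW = """\
pve-manager: 5.4-4 (running version: 5.4-4/97a96833)
libpve-common-perl: 5.0-51
pve-firewall: 3.0-20
pve-kernel-4.15.18-13-pve: 4.15.18-37
"""


def parse_versions(text):
    """Map package name -> version from 'name: version ...' lines."""
    versions = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and rest.strip():
            # Keep only the first token, dropping "(running version: ...)".
            versions[name.strip()] = rest.split()[0]
    return versions


def changed_packages(old_text, new_text):
    """Return {package: (old_version, new_version)} for every difference.

    A version of None means the package is missing on that side.
    """
    old, new = parse_versions(old_text), parse_versions(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


if __name__ == "__main__":
    for name, (before, after) in changed_packages(OLD, NEW).items():
        print(f"{name}: {before} -> {after}")
```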

Cheers,
luphi

luphi · Apr 30, 2019

nobody?

Menno · Apr 30, 2019

I updated my Proxmox cluster today and ended up in a broken cluster state that looks similar to what you describe. Coincidentally, it happened after updating to the same pve-manager and pve-kernel versions.

I updated a few nodes without VMs one by one and rebooted them without an issue, until I reached the last two, which did contain VMs and are part of an HA group.

While updating one of them (pve02), I moved its VMs to the node that was last to be updated (pve01). I noticed the installation slowing down until my SSH session terminated because the machine rebooted. At that point pve01 started sending fence mails about pve02.

My then master node (pve01) was the one that still needed updates, but it showed up as 'dead' in Proxmox's HA status. I figured it would be resolved by installing the Proxmox updates or restarting the PVE HA services, but that made this node forcibly reboot as well, possibly because some timeout triggered the watchdog.

While I'm glad I follow the enterprise repository for my production cluster, I hope someone can confirm whether this was bad luck or whether the update caused the issue. If it was the latter, this should not hit the enterprise repo.

luphi · Apr 30, 2019

Thank you for your reply, Menno.
It turned out that the problem was not caused by the HA test but by the update of node1.
I was not able to get node1 back into the cluster, but I was able to move all remaining nodes into the new cluster formed by node1 by updating/rebooting the remaining nodes one by one.

Cheers,
luphi

Menno · May 1, 2019

Glad you got it sorted!

I still wonder whether something in the update caused it, and what the appropriate way of installing updates on a cluster would be. I always assumed updating nodes one by one would be the safest way.
