[SOLVED] HA only auto-migrating one way: is this expected behaviour?

Turrican3 (New Member, Italy) · May 10, 2024
Hey there, Proxmox newbie here, so I apologize if the question feels dumb.

I have set up a 2-node cluster with an external voter (QDevice) as per the documentation, plus two HA-managed virtual machines with local ZFS replication. Corosync traffic runs over a dedicated/separate Ethernet connection.
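For reference, this is roughly the sequence I followed; hostnames and addresses below are placeholders, not my real ones:

Code:
# on node1: create the cluster, binding corosync to the dedicated network
pvecm create mycluster --link0 192.168.0.1

# on node2: join via the dedicated network as well
pvecm add 192.168.0.1 --link0 192.168.0.2

# install corosync-qnetd on the external voter, corosync-qdevice on both nodes,
# then register the external voter from one cluster node
pvecm qdevice setup <QNETD-IP>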

The issue I'm encountering (though, as the thread title says, I'm not sure whether this is expected behaviour) is that automatic HA migration of the aforementioned VMs only happens from node 1 to node 2, never vice versa.

Manual migration and replication work fine.

(I am simulating issues by disabling the corosync Ethernet port; as I said, HA apparently only triggers when I disable the corosync port on node 1, not on node 2.)
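To be precise, "disabling the port" means taking the dedicated corosync interface down on the node, along these lines (the interface name is just an example):

Code:
# take the dedicated corosync NIC down
ip link set enp1s0 down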

I thought that in this scenario any failing node would lead to auto-migration of the HA-protected VMs to the working one, but perhaps I am missing something?
 
Hi,
can you share the output of ha-manager status --verbose before and after inducing the failure? Please monitor the journal by running journalctl -f on both nodes before inducing the failure and share the output here.
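If it helps, something along these lines captures everything in one go (file names are arbitrary):

Code:
# on both nodes, before inducing the failure
ha-manager status --verbose > ha-status-before.txt
# keep this running in a separate shell while inducing the failure
journalctl -f | tee journal.txt
# afterwards
ha-manager status --verbose > ha-status-after.txt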
 
Thanks for the quick reply. Apologies, but I was unable to produce the requested files last week.

I'm providing them now.

This is the scenario I want to fix, if possible: the VMs are running on node2 but do not automatically migrate to node1 when I manually force the corosync link down to simulate an actual failure.

As stated before, though, node1-to-node2 migration does seem to work fine (and so does ZFS replication).
 

Attachments

  • journal_node1.txt (4.5 KB)
  • journal_node2.txt (3.5 KB)
  • node1_before-failure.txt (1.6 KB)
  • node1_post-failure.txt (1.7 KB)
  • node2_before-failure.txt (1.6 KB)
  • node2_post-failure.txt (1.6 KB)
The logs indicate that node2 still has quorum together with the QDevice
Code:
May 13 08:59:26 proxmoxsisinfo2 corosync[378347]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
May 13 08:59:26 proxmoxsisinfo2 corosync[378347]:   [TOTEM ] A new membership (1.65) was formed. Members left: 2
May 13 08:59:26 proxmoxsisinfo2 corosync[378347]:   [TOTEM ] Failed to receive the leave message. failed: 2
May 13 08:59:26 proxmoxsisinfo2 pmxcfs[378341]: [dcdb] notice: members: 1/378341
May 13 08:59:26 proxmoxsisinfo2 pmxcfs[378341]: [status] notice: members: 1/378341
May 13 08:59:27 proxmoxsisinfo2 corosync[378347]:   [QUORUM] Members[1]: 1
May 13 08:59:27 proxmoxsisinfo2 corosync[378347]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 13 08:59:37 proxmoxsisinfo2 pve-ha-crm[19172]: node 'proxmoxsisinfo1': state changed from 'online' => 'unknown'
while node1 doesn't
Code:
May 13 08:59:27 proxmoxsisinfo1 pmxcfs[2764]: [dcdb] crit: can't initialize service
May 13 08:59:27 proxmoxsisinfo1 pve-ha-crm[2931]: status change slave => wait_for_quorum
May 13 08:59:27 proxmoxsisinfo1 pve-ha-lrm[2945]: unable to write lrm status file - unable to open file '/etc/pve/nodes/proxmoxsisinfo1/lrm_status.tmp.2945' - Permission denied
May 13 08:59:33 proxmoxsisinfo1 pmxcfs[2764]: [dcdb] notice: members: 2/2764
May 13 08:59:33 proxmoxsisinfo1 pmxcfs[2764]: [dcdb] notice: all data is up to date
May 13 09:00:11 proxmoxsisinfo1 pvescheduler[1297100]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
so the services continue running on node2.
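This is just the vote arithmetic at work: with two nodes plus the QDevice there are 3 expected votes and quorum requires 2. The side that keeps the QDevice's vote stays quorate, while the other side drops to a single vote, loses quorum, and its HA services cannot act (hence the wait_for_quorum and "no quorum!" messages above). You can check the live vote state on each node with the standard corosync tool:

Code:
corosync-quorumtool -s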
 
OK, so I tried a different thing: I completely shut down node2.

Services then moved automatically to node1 as I expected.

So I took a moment to think about this and, with the help of my bash command history, I now suspect an awful (or at least potentially misleading) error on my part: a possible QDevice misconfiguration.

Thing is, when I created the cluster I set up a dedicated network (192.168.x.x) for corosync. This is the link I manually put down to simulate node issues.

But when I set up the external voter environment via the command line, I entered addresses on the actual "main" network (172.16.x.x), on both the arbiter itself and the cluster nodes.

Am I onto something?
Were the dedicated network addresses (192.168.x.x, i.e. the same network corosync already uses for the cluster) supposed to be used instead?
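In other words, I suspect I effectively did something like this (addresses are placeholders standing in for my real ones):

Code:
# cluster created over the dedicated corosync network...
pvecm create mycluster --link0 192.168.0.1

# ...but the external voter registered via its "main" network address
pvecm qdevice setup 172.16.0.10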
 
Hello,

Could you please share with us the state of the cluster as reported by

Code:
pvecm status

and your Corosync config? The latter is located at

Code:
/etc/pve/corosync.conf
 
Hello,

Could you please share with us the state of the cluster [...]

Of course, please check attached files.

Yes, I do think it's better to use the same network for all Corosync communication. But you can also define multiple networks to be used as fallbacks: https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_redundancy
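For illustration, the relevant excerpt of /etc/pve/corosync.conf with a redundant second link looks roughly like this (addresses and priorities are example values; see the linked chapter for the full procedure):

Code:
nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.0.1   # dedicated corosync network (preferred)
    ring1_addr: 172.16.0.1    # main network (fallback)
  }
  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.0.2
    ring1_addr: 172.16.0.2
  }
}
totem {
  interface {
    linknumber: 0
    knet_link_priority: 20    # higher value = preferred link
  }
  interface {
    linknumber: 1
    knet_link_priority: 10
  }
}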

Got it; redundant links aren't currently planned on my side, but I might implement them later.

Can I simply remove and reinstall the external arbiter daemon on both nodes and the third server, or is something more complex needed?
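For what it's worth, based on the docs I'd expect the commands involved to be roughly these (I'm not sure whether this alone is enough, hence the question):

Code:
# on one cluster node: unregister the QDevice from the cluster
pvecm qdevice remove

# fix the network configuration as needed, then register it again
pvecm qdevice setup <QNETD-IP>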
 

Attachments

  • pvecmstatus.txt (739 bytes)
  • corosync.txt (595 bytes)
Excellent, thanks!

I'll make the required config modifications tomorrow. I'm confident that this way, after putting the cluster (arbiter) link down, the VMs will properly move between nodes automatically regardless of which node suffers the simulated failure, as I expected at first.

Will report back then.
 
Marking the thread as "solved".

I managed to reconfigure the arbiter, but it was WAY harder than expected, as there were apparently issues removing the QDevice from the cluster configuration. After a few (well, to be honest, more than a few!) trial-and-error attempts, I decided to rebuild the cluster from scratch: I removed everything qdevice/qnetd-related from all involved systems, brought both nodes back to a stand-alone configuration first, then rebuilt the cluster and re-added the arbiter. Now everything seems to be working fine.
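For anyone finding this later: the node-separation part followed the "Separate a Node Without Reinstalling" section of the pvecm docs. From memory (double-check against the docs before running this), on each node it was roughly:

Code:
# stop the cluster stack and start pmxcfs in local mode
systemctl stop pve-cluster corosync
pmxcfs -l

# remove the corosync configuration
rm /etc/pve/corosync.conf
rm -r /etc/corosync/*

# restart the cluster filesystem normally
killall pmxcfs
systemctl start pve-cluster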

Thanks!
 
