[SOLVED] HA only auto-migrating one way: is this expected behaviour?

Turrican3 · May 10, 2024

Hey there, Proxmox newbie here so I apologize if the question feels dumb.

I have setup a 2 node cluster with external voter as per documentation, plus two virtual machines HA with ZFS (local) replication. Corosync traffic happens via dedicated/separate ethernet connection.

The issue I am encountering (but as per thread title, I'm not sure if this is the expected behaviour or not) is that HA automatic migration of the aforementioned VMs appears to be happening only from node 1 to node 2, not viceversa.

Manual migrate and replication works fine.

(I am simulating issues by disabling the corosync ethernet port, but as I said HA is apparently only being triggered when I disable corosync port on node 1, not on node 2)

I thought that in this scenario, any failing node would lead to auto migration of the HA-protected VMs to the working one but perhaps I am missing something?

fiona · May 10, 2024

Hi,
can you share the output of ha-manager status --verbose before and after inducing the failure? Please monitor the journal by running journalctl -f on both nodes before inducing the failure and share the output here.

Turrican3 · May 13, 2024

Thanks for the quick reply, I have to apologize but I was unable to produce the requested files last week.

I'm providing them now.

This is the scenario I want to fix, if possible: VM(s) are running on node2 but not automatically migrating to node1 when corosync link is manually forced down by me, thus simulating an actual failure.

Node1 to Node2 migration, as stated before, does seem to work fine (and ZFS replication, too) though.

fiona · May 13, 2024

The logs indicate that node2 still has quorum together with the QDevice

Code:

May 13 08:59:26 proxmoxsisinfo2 corosync[378347]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
May 13 08:59:26 proxmoxsisinfo2 corosync[378347]:   [TOTEM ] A new membership (1.65) was formed. Members left: 2
May 13 08:59:26 proxmoxsisinfo2 corosync[378347]:   [TOTEM ] Failed to receive the leave message. failed: 2
May 13 08:59:26 proxmoxsisinfo2 pmxcfs[378341]: [dcdb] notice: members: 1/378341
May 13 08:59:26 proxmoxsisinfo2 pmxcfs[378341]: [status] notice: members: 1/378341
May 13 08:59:27 proxmoxsisinfo2 corosync[378347]:   [QUORUM] Members[1]: 1
May 13 08:59:27 proxmoxsisinfo2 corosync[378347]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 13 08:59:37 proxmoxsisinfo2 pve-ha-crm[19172]: node 'proxmoxsisinfo1': state changed from 'online' => 'unknown'

while node1 doesn't

Code:

May 13 08:59:27 proxmoxsisinfo1 pmxcfs[2764]: [dcdb] crit: can't initialize service
May 13 08:59:27 proxmoxsisinfo1 pve-ha-crm[2931]: status change slave => wait_for_quorum
May 13 08:59:27 proxmoxsisinfo1 pve-ha-lrm[2945]: unable to write lrm status file - unable to open file '/etc/pve/nodes/proxmoxsisinfo1/lrm_status.tmp.2945' - Permission denied
May 13 08:59:33 proxmoxsisinfo1 pmxcfs[2764]: [dcdb] notice: members: 2/2764
May 13 08:59:33 proxmoxsisinfo1 pmxcfs[2764]: [dcdb] notice: all data is up to date
May 13 09:00:11 proxmoxsisinfo1 pvescheduler[1297100]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!

so the services continue running on node2.

Turrican3 · May 13, 2024

Ok so I tried a different thing: I completely shut down node2.

Services then moved automatically to node1 as I expected.

So I took a moment to think about this and, with the help of bash command history, I'm now suspecting an awful (or at least potentially misleading) error on my part: a possible Qdevice misconfiguration.

Thing is, when I created the cluster I setup a dedicated network (192.168.x.x). This is the link I manually put down to simulate issues with a node.

But when I setup the external voter environment via command line I entered the actual "main" network (172.16.x.x) on both the arbiter itself and the cluster nodes.

Am I onto something?
Were the dedicated network addresses (192.x.x.x i.e. the same main cluster network) supposed to be used instead?

Maximiliano · May 13, 2024

Hello,

Could you please share with us the state of the cluster as reported by

Code:

pvecm status

and your Corosync config? The later is located at

Code:

/etc/pve/corosync.conf

fiona · May 13, 2024

Yes, I do think it's better to use the same network for all Corosync communication. But you can also define multiple networks to be used as fallbacks: https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_redundancy

Turrican3 · May 13, 2024

Maximiliano said:
Hello,

Could you please share with us the state of the cluster [...]

Of course, please check attached files.

fiona said:
Yes, I do think it's better to use the same network for all Corosync communication. But you can also define multiple networks to be used as fallbacks: https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_redundancy

Got it, it's currently not planned but it might be implemented later.

Can I simply remove and reinstall the external arbiter daemon on both nodes and the third server or is something more complex needed?

fiona · May 13, 2024

Turrican3 said:
Can I simply remove and reinstall the external arbiter daemon on both nodes and the third server or is something more complex needed?

Yes, that should be enough as described here (minus the adding/removing nodes in your case): https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_frequently_asked_questions

Turrican3 · May 13, 2024

Excellent, thanks!

I'll make the required config modifications tomorrow, I'm confident this way after putting the cluster (arbiter) link down VMs will properly move between nodes automatically regardless of the simulated failure, as I expected at first.

Will report back then.

Turrican3 · May 15, 2024

Marking the thread as "solved".

I managed to reconfigure the arbiter but it was WAY harder than expected, as apparently there were issues in removing the QDevice from the cluster configuration. After a few (well to be honest more than a few!) trial and error attempts I decided to rebuild the cluster from scratch (i.e. removed everything qdevice/qnetd related from all involved systems and brought back both nodes to stand-alone configuration first, then rebuilt the cluster and re-added the arbiter) and now everything seems to be working fine.

Thanks!

Search

Search

[SOLVED] HA only auto-migrating one way: is this expected behaviour?

Turrican3

New Member

fiona

Proxmox Staff Member

Turrican3

New Member

Attachments

fiona

Proxmox Staff Member

Turrican3

New Member

Maximiliano

Proxmox Staff Member

fiona

Proxmox Staff Member

Turrican3

New Member

Attachments

fiona

Proxmox Staff Member

Turrican3

New Member

Turrican3

New Member

We value your privacy