detect failover of VMs and containers

hr556 · May 27, 2023

What is the best way to detect a cluster failover, i.e. that my replicated VMs get started on another node? In /var/log/syslog I found the following possibly relevant messages, but I don't know which message to look for:

Bash:

May 27 17:10:22 bohr corosync[2084]:   [MAIN  ] Completed service synchronization, ready to provide service

May 27 17:13:22 bohr pve-ha-lrm[2416]: successfully acquired lock 'ha_agent_bohr_lock'
May 27 17:13:22 bohr pve-ha-lrm[2416]: watchdog active
May 27 17:13:22 bohr pve-ha-lrm[2416]: status change wait_for_agent_lock => active

fiona · May 30, 2023

Hi,
the failover is handled by pve-ha-crm. The log on the node that was HA master at the time (I can't tell you which one it is) will contain log messages about the failover/recovery operation. Why do you want to detect the single failover, i.e. what is the actual issue you're trying to solve?

Note also that there is a mail notification when a node gets fenced (but not for every single guest failover).

hr556 · May 30, 2023

Thanks for the info! The aim is to get notified when VMs/containers get started on other nodes due to failover, for monitoring reasons. No issues so far, just gaining insight into the PVE cluster.

What I see on the former active node "einstein":

Bash:

May 27 17:59:25 einstein pve-ha-crm[3437]: starting server
May 27 17:59:25 einstein pve-ha-crm[3437]: status change startup => wait_for_quorum
May 27 18:00:00 einstein pve-ha-crm[3437]: status change wait_for_quorum => slave
May 27 18:04:15 einstein pve-ha-crm[3437]: received signal TERM
May 27 18:04:15 einstein pve-ha-crm[3437]: server received shutdown request
May 27 18:04:16 einstein pve-ha-crm[3437]: server stopped
May 27 18:04:17 einstein systemd[1]: pve-ha-crm.service: Succeeded.
May 27 18:04:51 einstein pve-ha-crm[2907]: starting server
May 27 18:04:51 einstein pve-ha-crm[2907]: status change startup => wait_for_quorum
May 27 18:05:36 einstein pve-ha-crm[2907]: status change wait_for_quorum => slave

At the same time, on the node "bohr", which became active:

Bash:

May 27 18:36:57 bohr pve-ha-crm[3694]: starting server
May 27 18:36:57 bohr pve-ha-crm[3694]: status change startup => wait_for_quorum
May 27 18:38:37 bohr pve-ha-crm[3694]: status change wait_for_quorum => slave

So it seems that "pve-ha-crm" only logs when cluster nodes start or stop, but not when VMs become active due to replication or node failure. Otherwise there would be some event around "May 27 17:10..."

fiona · May 31, 2023

hr556 said:
Thanks for the info! The aim is to get notified when VMs/containers get started on other nodes due to failover, for monitoring reasons. No issues so far, just gaining insight into the PVE cluster.

Maybe you can just monitor the HA status API endpoint /cluster/ha/status/current? There you see the LRM status and service status and node, so you can e.g. correlate if a service changes nodes and the LRM of the previous node is not active anymore.
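A minimal sketch of that correlation in Python, in case it helps. It assumes the endpoint is read with `pvesh get /cluster/ha/status/current --output-format json` and that the result is a list of entries with `type`, `node`, `sid` and `status` fields, where an LRM's `status` text contains "active" while it is healthy. Those field names and the "active" check are assumptions to verify against your own output; the function names are made up for illustration.

```python
import json
import subprocess


def fetch_ha_status():
    """Read the HA status endpoint through pvesh and parse the JSON list."""
    out = subprocess.run(
        ["pvesh", "get", "/cluster/ha/status/current", "--output-format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def service_nodes(status):
    """Map each HA service ID to the node it is currently assigned to.

    Assumed entry shape: {"type": "service", "sid": "ct:100002", "node": "einstein", ...}
    """
    return {
        e["sid"]: e["node"]
        for e in status
        if e.get("type") == "service" and "sid" in e and "node" in e
    }


def inactive_lrm_nodes(status):
    """Return nodes whose LRM entry does not report 'active' (assumed status text)."""
    return {
        e["node"]
        for e in status
        if e.get("type") == "lrm" and "active" not in e.get("status", "")
    }


def detect_failovers(previous, current):
    """Compare two snapshots and list services that left a node whose LRM is no longer active."""
    before = service_nodes(previous)
    after = service_nodes(current)
    down = inactive_lrm_nodes(current)
    moved = []
    for sid, old_node in before.items():
        new_node = after.get(sid)
        if new_node is not None and new_node != old_node and old_node in down:
            moved.append((sid, old_node, new_node))
    return moved
```

A monitoring job could call `fetch_ha_status()` periodically, keep the previous snapshot, and send an alert whenever `detect_failovers(previous, current)` returns a non-empty list. Manual migrations are filtered out by the LRM check, since the source node's LRM stays active in that case.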

hr556 said:

So it seems that "pve-ha-crm" only logs when cluster nodes start or stop, but not when VMs become active due to replication or node failure. Otherwise there would be some event around "May 27 17:10..."

Both of these nodes were slaves at the time, not the master. Please also check your other nodes.

hr556 · May 31, 2023

Got it, many thanks! The third cluster node "planck" became master:

Bash:

May 27 17:12:07 planck pve-ha-crm[1424]: successfully acquired lock 'ha_manager_lock'
May 27 17:12:07 planck pve-ha-crm[1424]: watchdog active
May 27 17:12:07 planck pve-ha-crm[1424]: status change slave => master
May 27 17:12:07 planck pve-ha-crm[1424]: node 'einstein': state changed from 'online' => 'unknown'
May 27 17:13:07 planck pve-ha-crm[1424]: service 'ct:100002': state changed from 'started' to 'fence'
May 27 17:13:07 planck pve-ha-crm[1424]: node 'einstein': state changed from 'unknown' => 'fence'
May 27 17:13:17 planck pve-ha-crm[1424]: successfully acquired lock 'ha_agent_einstein_lock'
May 27 17:13:17 planck pve-ha-crm[1424]: fencing: acknowledged - got agent lock for node 'einstein'
May 27 17:13:17 planck pve-ha-crm[1424]: node 'einstein': state changed from 'fence' => 'unknown'
May 27 17:13:17 planck pve-ha-crm[1424]: service 'ct:100002': state changed from 'fence' to 'recovery'
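For log-based monitoring, the service lines above can be picked out with a regular expression. This is only a sketch based on the message format in this excerpt; the function name is made up, and treating a change to 'fence' or 'recovery' as the failover signal is my assumption from these few lines, not a documented list of states.

```python
import re

# Matches the pve-ha-crm service lines as they appear in the excerpt above, e.g.
# "... pve-ha-crm[1424]: service 'ct:100002': state changed from 'started' to 'fence'"
STATE_CHANGE = re.compile(
    r"pve-ha-crm\[\d+\]: service '(?P<sid>[^']+)': "
    r"state changed from '(?P<old>[^']+)' to '(?P<new>[^']+)'"
)

# Assumption: these two target states indicate that a service is being failed over.
WATCHED_STATES = ("fence", "recovery")


def recovery_events(lines):
    """Return (service ID, old state, new state) for each watched state change."""
    events = []
    for line in lines:
        match = STATE_CHANGE.search(line)
        if match and match.group("new") in WATCHED_STATES:
            events.append((match.group("sid"), match.group("old"), match.group("new")))
    return events
```

The lines could come from /var/log/syslog on every node, since only the node that is HA master at the time writes these messages.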
