Get notified on a node failure

ricardo.cogolludo

New Member
Jan 5, 2024
First, this is my HA cluster architecture:

- Node 'pve01'
- Node 'pve02'
- RaspberryPi as Qdevice

By default, the VMs run on pve01; when it goes offline, the VMs automatically migrate to pve02, and I get emails about the successful fencing. But when pve02 goes offline, I don't receive anything. I would like to be notified if any of the nodes has a problem, including the Qdevice. I'm considering 3 different scenarios:

- pve01 working / pve02 working / Qdevice working >> pve01 fails: VMs are migrated from pve01 to pve02 and I get notifications. OK.
- pve01 working / pve02 offline / Qdevice working >> pve01 fails: HA loses quorum and no working nodes are available, because I was not notified of the pve02 failure.
- pve01 working / pve02 working / Qdevice offline >> pve01 fails: HA loses quorum and the cluster breaks, because I was not notified of the Qdevice failure.
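
The vote arithmetic behind these three scenarios can be sketched in a few lines of Python. This is only an illustration, assuming each of pve01, pve02 and the Qdevice contributes one vote and that quorum means a strict majority of all votes; the names `VOTES`, `has_quorum` and `quorate_after` are made up for this example and are not part of Proxmox or corosync.

```python
# Assumption: one vote per member, three votes in total.
VOTES = {"pve01": 1, "pve02": 1, "qdevice": 1}


def has_quorum(online_votes, total_votes):
    """A partition is quorate when it holds a strict majority of all votes."""
    return online_votes >= total_votes // 2 + 1


def quorate_after(failed):
    """Return True if the members that are still online keep quorum."""
    online = sum(v for name, v in VOTES.items() if name not in failed)
    return has_quorum(online, sum(VOTES.values()))


if __name__ == "__main__":
    scenarios = [
        {"pve01"},             # scenario 1: only pve01 fails
        {"pve02", "pve01"},    # scenario 2: pve02 already down, then pve01 fails
        {"qdevice", "pve01"},  # scenario 3: Qdevice already down, then pve01 fails
    ]
    for failed in scenarios:
        print(sorted(failed), "quorate:", quorate_after(failed))
```

With three votes, any single failure leaves two votes and quorum holds, while any second failure leaves one vote and quorum is lost. That is why an unnoticed first failure matters in scenarios 2 and 3.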

I think I don't completely understand the HA cluster setup. I've tried replacing the Qdevice with a third full PVE node, but I have the same situation when a node other than the main one fails; when a second machine fails, everything goes down.

Is it possible to get this architecture working?

Regards.
 
Normally, you'd use a monitoring solution like Icinga, Nagios, CheckMK, Zabbix, <your-favorite-tool-here> and do the alerting from there. If you don't want to run another computer (maybe more, since monitoring should be HA too), you can run it inside a VM on pve01. It will monitor everything; only if pve01 itself fails will you get the message a little later, but you will still get it. In all other cases, you'll get the alarm almost instantly.
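
To make the idea concrete, here is a minimal Python sketch of the kind of check such a tool performs. It is not a replacement for a real monitoring system. The hostnames are placeholders, and the choice of ports is an assumption (8006 for the Proxmox web interface on the nodes, 22 for SSH on the Raspberry Pi); `TARGETS`, `is_reachable` and `find_down` are hypothetical names used only here.

```python
import socket

# Hypothetical targets: hostnames and ports are assumptions for illustration.
TARGETS = {
    "pve01": ("pve01.example.invalid", 8006),
    "pve02": ("pve02.example.invalid", 8006),
    "qdevice": ("qdevice.example.invalid", 22),
}


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def find_down(targets, probe=is_reachable):
    """Return the sorted names of all targets whose probe fails."""
    return sorted(
        name for name, (host, port) in targets.items() if not probe(host, port)
    )


if __name__ == "__main__":
    down = find_down(TARGETS)
    if down:
        print("ALERT: unreachable: " + ", ".join(down))
    else:
        print("all targets reachable")
```

Run periodically (for example from cron inside the monitoring VM), a non-empty result is the point at which you would send yourself a mail. A TCP check only shows that a port answers, not that the cluster is healthy, which is one reason the dedicated tools above are the better choice.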
 
@LnxBil thanks for your help!

I was assuming that Proxmox HA offered a complete monitoring solution, but now I'm considering deploying an external one.