notification when qdevice missing

ballybob · Sep 1, 2024

What is the best way to handle qdevice going offline?
I have a two-node cluster with one qdevice to maintain quorum, and I just realized the qdevice has been offline all day, yet I never got a notification or any warning that a vote was missing. Aside from monitoring server uptime via ping and the like, is there another recommended way to know when votes are missing?

sw-omit · Sep 1, 2024

Monitoring the qdevice host itself would probably be good anyway, also to get ahead of problems that can be seen coming, for example a process writing logs and filling up the disk, SMART errors on the disk, or a process stuck at 100% CPU.
That said, another route would be to monitor pvecm status and compare "Expected votes" with "Total votes". That would alert on most (important) quorum-related issues, and also when something disconnects or reboots unexpectedly.
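
A minimal sketch of that comparison, assuming pvecm status prints lines of the form "Expected votes: N" and "Total votes: N" in its votequorum section. The function names are made up for illustration, and the exact output layout should be checked on your own nodes before relying on it.

```python
import re
import subprocess

# Matches "Expected votes:   3" and "Total votes:      2", but not "Highest expected: 3".
VOTE_LINE = re.compile(r"\s*(Expected votes|Total votes):\s*(\d+)\s*$", re.IGNORECASE)


def parse_votes(status_text):
    """Return (expected, total) vote counts found in `pvecm status` text."""
    votes = {}
    for line in status_text.splitlines():
        match = VOTE_LINE.match(line)
        if match:
            votes[match.group(1).lower()] = int(match.group(2))
    if len(votes) != 2:
        raise ValueError("vote counts not found in pvecm status output")
    return votes["expected votes"], votes["total votes"]


def votes_missing(status_text):
    """True when fewer votes are present than the cluster expects."""
    expected, total = parse_votes(status_text)
    return total < expected


def read_pvecm_status():
    """Run `pvecm status` on a cluster node and return its text output."""
    result = subprocess.run(
        ["pvecm", "status"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On a node, votes_missing(read_pvecm_status()) could be called from a cron job or systemd timer, with the result fed into whatever alerting is already in place.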

ballybob · Sep 1, 2024

There is nothing built in to do that last part, so I'd have to do it myself, right?

esi_y · Sep 1, 2024

Proxmox's support for and testing of the Q device is quite limited, so adding this for you would be quite low priority.

If you want to add it yourself, it would be much better to rely on the output of corosync-qdevice-tool(8), which is not subject to arbitrary change the way the pvecm scripts are.

One major problem, however, is that you might get zero benefit from such a "notification". Consider a scenario where the Q device is on the same network as the nodes, while the services are provided to the outside world, i.e. where you are. Should a network fault occur, you would never get any notification from those walled-in systems that they are cut off.

A much better implementation would be to have the Q device external to the nodes' network, with detection on both sides: if the Q device is down, the nodes will tell on it; if the nodes are unreachable, you use your Q device to tell on them.

ballybob · Sep 1, 2024

This can be solved with a healthcheck instead of polling.

esi_y · Sep 1, 2024

I am not sure what you mean; a healthcheck is polling. And you wanted a notification, i.e. something pushed?

ballybob · Sep 1, 2024

A healthcheck script running on the PVE box would send an update to a remote monitoring box (healthchecks.io or Uptime Kuma, for example), and if that box doesn't get the heartbeat within the grace period, it reports the node as dead.
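
A rough sketch of such a heartbeat sender, assuming a healthchecks.io-style ping URL where a plain request means "alive" and the same URL with "/fail" appended signals a failure. The URL in the usage note is a placeholder and the function names are hypothetical; other services such as Uptime Kuma use a different push URL format.

```python
import urllib.request


def build_ping_url(base_url, healthy):
    """Return the URL to request: the base URL on success, '<base>/fail' on failure."""
    base = base_url.rstrip("/")
    return base if healthy else base + "/fail"


def send_heartbeat(base_url, healthy, timeout=10):
    """Send one heartbeat; return True if the monitoring service answered with HTTP 200."""
    url = build_ping_url(base_url, healthy)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        # A heartbeat that could not be sent simply counts as missed.
        return False
```

Run from a timer, something like send_heartbeat("https://hc-ping.com/your-uuid-here", healthy=qdevice_ok) would then let the remote side raise the alarm both when the check fails and when the node stops reporting at all. Here qdevice_ok stands for whatever local check you run first.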

esi_y · Sep 1, 2024

Correct, with a third system receiving the heartbeat and reporting on it all, you get the desired outcome, which is better than notifications originating from the system(s) themselves. (EDIT: For me that's "monitoring". A health check in this scenario would be my monitoring system polling the monitored system instead of receiving its heartbeat.)

ballybob · Sep 1, 2024

For corosync-qdevice-tool, am I running

corosync-qdevice-tool -s

then looking for

State: Connected
?

esi_y · Sep 1, 2024

Precisely. That is the view from the nodes. From the Q device side, you can check corosync-qnetd-tool(8), which is what runs on the Q device itself. Do note that a Q device can be casting votes for multiple clusters, so you need to account for that and check the list of members it sees instead. There's already a heartbeat going on between your QD and the cluster.
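
The node-side check discussed above could be sketched roughly like this, assuming corosync-qdevice-tool -s prints a "State:" line whose value is "Connected" when the link to the Q device is up. The helper names are hypothetical, and the values printed for a broken connection are not assumed here: anything other than "Connected" is simply treated as a problem.

```python
import subprocess


def qdevice_state(tool_output):
    """Return the value of the 'State:' line, or None if no such line is present."""
    for line in tool_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "State":
            return value.strip()
    return None


def qdevice_connected(tool_output):
    """True only when the tool reports exactly 'Connected'."""
    return qdevice_state(tool_output) == "Connected"


def read_qdevice_status():
    """Run `corosync-qdevice-tool -s` on a cluster node and return its text output."""
    result = subprocess.run(
        ["corosync-qdevice-tool", "-s"], capture_output=True, text=True
    )
    # No check=True here: if the tool errors out, its output will lack a
    # "State: Connected" line and therefore count as not connected.
    return result.stdout
```

Treating a missing "State:" line as not connected keeps the check conservative, for example when the qdevice daemon is not running at all.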

esi_y · Sep 1, 2024

Just want to say, if you are considering a third system, you are best off with simply a syslog aggregator (collecting from all nodes AND the QD) and rules on those logs.
