Fenced node emails, hundreds of them

hepo · Nov 14, 2021

We had an issue on one of our nodes today: it rebooted during a VM migration (local-zfs volume). This is now the second time it has happened.

Since then I have been getting hundreds of emails like the following

I see no HA issues, rebooted the host once again, not sure where to start... HELP

hepo · Nov 14, 2021

This appears to happen approximately every hour.

pve12 appears to think it's the quorum master, but this is not correct.

Found 3000 messages in the postfix queue of pve12.
It also looks like the email server is throttling us, which may explain the "every hour" phenomenon.
Purged the queue and will continue monitoring.
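
For anyone following along, the queue check described above can be sketched as a small helper. It assumes the classic `postqueue -p` summary line ("-- 12 Kbytes in 3 Requests."); the function name `queue_count` is made up for this example.

```shell
# Print the number of queued messages, given `postqueue -p` output on stdin.
# Assumes the classic summary line, e.g. "-- 12 Kbytes in 3 Requests."
# An empty queue ("Mail queue is empty") is reported as 0.
queue_count() {
    awk '
        /^-- .* Requests?\.$/  { n = $(NF-1) }   # second-to-last field is the count
        /^Mail queue is empty/ { n = 0 }
        END                    { print n + 0 }
    '
}
```

Typical use on the affected node would be `postqueue -p | queue_count`. Once the backlog is confirmed to be unwanted, `postsuper -d ALL` (as root) deletes every queued message, so check the count first.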

fabian · Nov 15, 2021

please get the following from each node

pveversion -v
pvecm status
ha-manager status
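
A rough sketch for gathering these from several nodes in one go, assuming SSH access between nodes. The helper name `collect_diag` and the idea of passing the runner as the first argument are assumptions of this example, not anything Proxmox ships.

```shell
# The three commands requested above, one per line.
DIAG_CMDS='pveversion -v
pvecm status
ha-manager status'

# collect_diag RUNNER NODE... : run each command on each node through RUNNER
# (for example "ssh") and print the output under a labelled header.
collect_diag() {
    runner=$1; shift
    for node in "$@"; do
        printf '%s\n' "$DIAG_CMDS" | while IFS= read -r cmd; do
            printf '### %s: %s\n' "$node" "$cmd"
            # </dev/null keeps the runner (ssh) from consuming the command list on stdin
            $runner "$node" "$cmd" </dev/null 2>&1 || printf '(command failed)\n'
        done
    done
}
```

For example, `collect_diag ssh root@pve11 root@pve12 > diag.txt` would produce a single file to attach; list whichever nodes are in the cluster.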

hepo · Nov 15, 2021

See attached file.
Since I have purged the postfix queue I don't receive new emails.
Yesterday when I reported this, there was nothing weird on the cluster, no VM migrations, just the emails, and a lot of them.

I am using this script to configure email - https://forum.proxmox.com/threads/script-for-proxmox-email-notifications-configuration.69003/

Thanks for looking into this!

fabian · Nov 16, 2021

that looks okay. what about the logs around the time mentioned in the mails (from pve11 and the node that was CRM master at the time?)
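
One way to cut the journal down to the relevant window is sketched below. It assumes the logs live in the systemd journal under the `pve-ha-crm`, `pve-ha-lrm` and `corosync` units; the helper name `ha_logs` and the `JOURNALCTL` override are inventions of this example.

```shell
# ha_logs SINCE UNTIL : dump HA manager and corosync logs for a time window.
# Timestamps use journalctl's "YYYY-MM-DD HH:MM:SS" form.
# JOURNALCTL can be overridden (e.g. with "echo") to preview the command.
ha_logs() {
    "${JOURNALCTL:-journalctl}" --no-pager \
        -u pve-ha-crm -u pve-ha-lrm -u corosync \
        --since "$1" --until "$2"
}
```

Run it on each node of interest with the time from the mail, for example `ha_logs '2021-11-14 09:00:00' '2021-11-14 11:00:00' > ha-logs.txt`. The timestamps here are placeholders, not the actual times from the thread.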

hepo · Nov 16, 2021

logs are too big to attach in the forum, hence uploaded here - link
As far as I can see, pve12 is the one generating the events.

Thanks for looking into this.

fabian · Nov 17, 2021

could you also post your HA config (/etc/pve/ha/resources.cfg and /etc/pve/datacenter.cfg)? the logs do look like there is a bug or edge case not handled well..

hepo · Nov 17, 2021

Nothing extraordinary there. Same link from my previous reply.

fabian · Nov 17, 2021

thanks - sorry to ask once more, but I forgot the /etc/pve/ha/groups.cfg file
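
A small sketch for dumping all three requested files at once, so none gets forgotten. The helper name `show_ha_config` is hypothetical; the paths are the ones named in this thread.

```shell
# show_ha_config [PVE_DIR] : print the HA-related config files with headers.
# PVE_DIR defaults to /etc/pve; missing files are reported, not fatal.
show_ha_config() {
    dir=${1:-/etc/pve}
    for f in ha/resources.cfg ha/groups.cfg datacenter.cfg; do
        printf '===== %s/%s =====\n' "$dir" "$f"
        if [ -r "$dir/$f" ]; then
            cat "$dir/$f"
        else
            printf '(not present)\n'
        fi
    done
}
```

Running `show_ha_config > ha-config.txt` on any cluster node should be enough, since /etc/pve is shared across the cluster.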

hepo · Nov 17, 2021

all except pve31 which is in 3rd DC and is only used for maintaining quorum.
cfg file uploaded in the same location.

I am happy to stop this here and would like to thank you for the investigation.

We are currently having performance problems with ceph (details here) and running most VMs on local (zfs mirror) disk.
We have experienced node reboots twice already when migrating VMs from local disk.
In the event the reboot was in the middle of the migration, the VM went into failed HA state in which I was not able to start it.
Had to remove the VM from HA which allowed me to start it.
The node reboot is the weirdest phenomenon in this story; again, it has happened twice already.
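
The workaround described above (remove the VM from HA, then start it by hand) might look roughly like this on the command line. The helper name `recover_failed_vm`, the `DRY_RUN` switch and the VMID used in the usage note are assumptions of this sketch.

```shell
# recover_failed_vm VMID : take a VM out of HA and start it manually,
# mirroring the steps described in the post above.
# Set DRY_RUN=1 to print the commands instead of running them.
recover_failed_vm() {
    vmid=$1
    case $vmid in
        ''|*[!0-9]*) echo "usage: recover_failed_vm VMID" >&2; return 2 ;;
    esac
    run=${DRY_RUN:+echo}                 # expands to "echo" only in dry-run mode
    $run ha-manager remove "vm:$vmid" && $run qm start "$vmid"
}
```

`DRY_RUN=1 recover_failed_vm 100` prints the two commands for a hypothetical VM 100 without touching anything. Note that the VM is no longer HA-managed afterwards and has to be added back with `ha-manager add` if HA is still wanted.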

Thanks!

fabian · Nov 18, 2021

thanks! I'll double check to see whether there is some way to trigger this behaviour in our HA state machine, the configs look sane.
