Fenced node emails, hundreds of them

hepo

We had an issue on one of our nodes today: it rebooted during a VM migration (local-zfs volume). This is the second time this has happened.

Since then I have been getting hundreds of emails like the following.

I see no HA issues and have rebooted the host once more, but I'm not sure where to start... HELP
 
This appears to happen approximately every hour.

1636906428117.png

pve12 appears to think it's the quorum master, but this is not correct.

1636906547587.png

I found 3000 messages in the postfix queue of pve12.
It also looks like the email server is throttling us, which may explain the "every hour" phenomenon.
I have purged the queue and will continue monitoring.
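
For anyone hitting the same backlog, this is a minimal sketch of how the queue can be counted and purged with the standard Postfix tools. The helper name `queue_count` is made up for illustration; it only parses the summary line that `postqueue -p` prints at the end of its listing.

```shell
#!/bin/sh
# Extract the request count from the summary line of `postqueue -p` output.
# Reads the queue listing on stdin; prints 0 when no summary line is found
# (for example when the queue is empty).
queue_count() {
    awk '/^-- .* in [0-9]+ Requests?\.$/ { n = $(NF-1) } END { print n + 0 }'
}

# Typical use on the affected node (run as root):
#   postqueue -p | queue_count      # how many messages are queued
#   postsuper -d ALL deferred       # delete only deferred mail
#   postsuper -d ALL                # delete everything in the queue
```

Note that `postsuper -d ALL` discards every queued message, not just the fencing notifications, so check the queue listing first if other mail may be waiting.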
 
please get the following from each node

  • pveversion -v
  • pvecm status
  • ha-manager status
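
A small sketch for gathering the three outputs above into one file per node, which is easier to attach. The function name `collect_diag` and the output path are arbitrary choices, not anything Proxmox provides.

```shell
#!/bin/sh
# Run each requested command and append its output under a header line.
# Commands that are not installed are noted instead of aborting the run.
collect_diag() {
    out=$1
    : > "$out"
    for cmd in "pveversion -v" "pvecm status" "ha-manager status"; do
        printf '===== %s =====\n' "$cmd" >> "$out"
        bin=${cmd%% *}
        if command -v "$bin" >/dev/null 2>&1; then
            # $cmd is intentionally unquoted so it splits into command + args
            $cmd >> "$out" 2>&1 || printf '(exited with status %s)\n' "$?" >> "$out"
        else
            printf '(%s not found on this host)\n' "$bin" >> "$out"
        fi
    done
}

# One file per node, named after the host (path is an arbitrary example)
collect_diag "/tmp/$(uname -n)-ha-diag.txt"
```

Run it as root on each node and attach the resulting files.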
 


that looks okay. what about the logs around the time mentioned in the mails (from pve11 and the node that was CRM master at the time?)
 
logs are too big to attach in the forum, hence uploaded here - link
as far as I can see, pve12 is the one generating the events.

Thanks for looking into this.
 
could you also post your HA config (/etc/pve/ha/resources.cfg and /etc/pve/datacenter.cfg)? the logs do look like there is a bug or edge case not handled well..
 
thanks - sorry to ask once more, but I forgot the /etc/pve/ha/groups.cfg file
 
all except pve31, which is in a third DC and is only used for maintaining quorum.
cfg file uploaded in the same location.

I am happy to stop this here and would like to thank you for the investigation.

We are currently having performance problems with Ceph (details here) and are running most VMs on local (ZFS mirror) disks.
We have already experienced node reboots twice when migrating VMs from local disk.
When the reboot hit in the middle of a migration, the VM went into a failed HA state in which I was not able to start it.
I had to remove the VM from HA, which then allowed me to start it.
The node reboot is the strangest phenomenon in this story; again, it has already happened twice.
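
For reference, a hedged sketch of the alternative to removing the VM from HA, assuming the state shown was `error`: the Proxmox HA documentation describes clearing that flag by setting the resource to `disabled` and then requesting `started` again. The helper name `ha_reset` and the service ID `vm:100` are placeholders. By default the sketch only prints the commands.

```shell
#!/bin/sh
# Print (or, with RUN=1, execute) the commands that clear the error flag
# on an HA resource and then request it to start again.
# The service ID (e.g. vm:100) is a placeholder supplied by the caller.
# When executing for real, fix the underlying problem and confirm with
# `ha-manager status` that the resource is disabled before re-enabling it.
ha_reset() {
    sid=$1
    for state in disabled started; do
        if [ "${RUN:-0}" = "1" ]; then
            ha-manager set "$sid" --state "$state"
        else
            echo "ha-manager set $sid --state $state"
        fi
    done
}

ha_reset vm:100
```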

Thanks!
 
thanks! I'll double check to see whether there is some way to trigger this behaviour in our HA state machine, the configs look sane.
 