HA error state on some VMs (due to pve-cluster service crash?)

gradinaruvasile · Aug 30, 2019

After upgrading to Proxmox 6 I noticed that VMs randomly end up in an HA error state, shown as a red circle on the VM icon.
Their HA state becomes "error". The cluster seems fine otherwise.
The logs show "Main process exited, code=killed, status=6/ABRT" for the pve-cluster service, which is then restarted automatically.

This is a problem because these VMs will no longer be restarted elsewhere (I assume?) if the node goes down.
The clunky fix, without rebooting the VM, is to go to the cluster-wide HA config, remove the VM from HA and re-add it. That is unpleasant when several VMs are affected (this once happened to me with around 10 VMs from all over the cluster).
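
When several VMs are affected, the remove and re-add cycle described above could be scripted instead of clicked through. The following is only a sketch, assuming the `ha-manager remove` and `ha-manager add` subcommands are run on a cluster node; the helper names `readd_commands` and `run` and the example ID are hypothetical. It prints the commands by default and only executes them when `dry_run` is set to False.

```python
import subprocess


def readd_commands(service_ids):
    """Return the ha-manager calls that remove and then re-add each service."""
    commands = []
    for sid in service_ids:
        commands.append(["ha-manager", "remove", sid])
        commands.append(["ha-manager", "add", sid])
    return commands


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Example: show what would be done for the VM from this post.
run(readd_commands(["vm:134"]), dry_run=True)
```

Re-adding a service without options presumably drops any per-service HA settings it had before (such as a group), so those would have to be supplied again. A short pause between the remove and the add may also be needed so the HA manager can process the removal first; that is an assumption, not something tested here.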

Right now only one VM is affected, ID 134.
Error logs:

pve-ha-lrm service logs (reverse):

Code:

Aug 30 10:40:55 server pve-ha-lrm[18799]: temporary inconsistent cluster state (cfs restart?), skip round
Aug 30 10:40:54 server pve-ha-lrm[18799]: updating service status from manager failed: Connection refused
Aug 30 10:40:14 server pve-ha-lrm[18799]: temporary inconsistent cluster state (cfs restart?), skip round
Aug 30 10:39:25 server pve-ha-lrm[21140]: service vm:134 is in an error state and needs manual intervention. Look up 'ERROR RECOVERY' in the documentation.
Aug 30 10:39:14 server pve-ha-lrm[21090]: Connection refused
Aug 30 10:39:14 server pve-ha-lrm[21090]: ipcc_send_rec[3] failed: Connection refused
Aug 30 10:39:14 server pve-ha-lrm[21090]: ipcc_send_rec[2] failed: Connection refused
Aug 30 10:39:14 server pve-ha-lrm[21090]: ipcc_send_rec[1] failed: Connection refused

and pve-ha-crm logs:

Code:

Aug 30 10:39:20 server pve-ha-crm[18796]: service 'vm:134' got unrecoverable error (exit code 255))
Aug 30 10:39:20 server pve-ha-crm[18796]: service 'vm:134': state changed from 'started' to 'error'

pve-cluster log (this one is in reverse):

Code:

Aug 30 10:39:14 server systemd[1]: Starting The Proxmox VE cluster filesystem...
Aug 30 10:39:14 server systemd[1]: Stopped The Proxmox VE cluster filesystem.
Aug 30 10:39:14 server systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 11.
Aug 30 10:39:14 server systemd[1]: pve-cluster.service: Service RestartSec=100ms expired, scheduling restart.
Aug 30 10:39:14 server systemd[1]: pve-cluster.service: Failed with result 'signal'.
Aug 30 10:39:14 server systemd[1]: pve-cluster.service: Main process exited, code=killed, status=6/ABRT
Aug 30 10:26:49 server pmxcfs[23824]: [dcdb] notice: data verification successful

Chris · Aug 30, 2019

Hi,
if your VMs go into error state they won't be touched by the HA stack anymore. In order to recover, please follow these steps https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#ha_manager_error_recovery

gradinaruvasile · Aug 30, 2019

That means killing the VM as far as I know, but the VM works well. It has no issues other than the cluster thinking it has issues: it can be managed, migrated, etc. I can remove it from HA and re-add it and it works, so that is not the issue here.

The issue is that VMs randomly drop out of HA without having any actual problems, and if nobody is there to micromanage them, HA could stay disabled for those VMs for no reason other than the cluster's internal sync being temporarily off (I'd suggest a bug).
In the logs above you can see that at the moment the pve-cluster service crashes (or is killed for some reason?), the VM in question is marked as being in the error state, together with "temporary inconsistent cluster state" messages while the service is restarted.
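
The correlation described here can be checked mechanically over a longer journal export. Below is a rough sketch that pairs each pve-cluster abort with HA "error" transitions logged shortly afterwards. The function names, the 30-second window and the hard-coded year are assumptions for illustration; the line format is taken from the excerpts in the first post.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as "Aug 30 10:39:14 server systemd[1]: message".
LINE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ ([\w.-]+)\[\d+\]: (.*)$")


def parse(line, year=2019):
    """Split a journal line into (timestamp, unit, message), or return None."""
    m = LINE.match(line.strip())
    if not m:
        return None
    stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
    return stamp, m.group(2), m.group(3)


def errors_after_crash(lines, window_seconds=30):
    """Return (service_id, seconds_after_crash) for HA errors following a crash."""
    crashes, errors = [], []
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        stamp, unit, msg = parsed
        if unit == "systemd" and "pve-cluster.service: Main process exited" in msg:
            crashes.append(stamp)
        elif unit == "pve-ha-crm" and "to 'error'" in msg:
            sid = re.search(r"service '([^']+)'", msg)
            if sid:
                errors.append((stamp, sid.group(1)))
    window = timedelta(seconds=window_seconds)
    return [
        (sid, (err - crash).total_seconds())
        for err, sid in errors
        for crash in crashes
        if timedelta(0) <= err - crash <= window
    ]


sample = [
    "Aug 30 10:39:14 server systemd[1]: pve-cluster.service: Main process exited, code=killed, status=6/ABRT",
    "Aug 30 10:39:20 server pve-ha-crm[18796]: service 'vm:134': state changed from 'started' to 'error'",
]
print(errors_after_crash(sample))
```

For the two lines from this thread, the error transition of vm:134 falls 6 seconds after the abort, which is consistent with the crash being the trigger, though it does not prove it.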

Chris · Aug 30, 2019

gradinaruvasile said:
That means killing the VM as far as I know

This is correct.

gradinaruvasile said:
In the logs above you can see that at the moment the pve-cluster service crashes (or is killed for some reason?), the VM in question is marked as being in the error state, together with "temporary inconsistent cluster state" messages while the service is restarted.

There was a bug in pmxcfs that could cause the restart you are seeing, and therefore the inconsistent cluster state. We will push a new version soon.

gradinaruvasile · Aug 30, 2019

So we wait for the pve-cluster package to be updated?

davidand · Oct 10, 2021

I'm getting the same 'Connection refused' error on:

Linux proxmox-2 5.11.22-4-pve #1 SMP PVE 5.11.22-8 (Fri, 27 Aug 2021 11:51:34 +0200) x86_64 GNU/Linux

PVE does not start and I cannot do anything with it.

fiona · Oct 11, 2021

Hi,

davidand said:
I'm getting the same 'Connection refused' error on:

Linux proxmox-2 5.11.22-4-pve #1 SMP PVE 5.11.22-8 (Fri, 27 Aug 2021 11:51:34 +0200) x86_64 GNU/Linux

PVE does not start and I cannot do anything with it.

next time, please don't answer in such old threads but create a new one.

What is the output of systemctl status pve-cluster.service? If there is an error, the full log can be seen with journalctl -b0 -u pve-cluster.service.

Is there anything interesting in /var/log/syslog? Please also share the output of pveversion -v.
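
The requested outputs can be gathered into a single text to paste into a new thread. This is a small sketch that runs the three commands named above and concatenates their output; the names `COMMANDS`, `run_command` and `collect`, and the 60-second timeout, are hypothetical choices. Checking /var/log/syslog is left as a manual step.

```python
import subprocess

# The commands requested above, exactly as given in the post.
COMMANDS = [
    ["systemctl", "status", "pve-cluster.service"],
    ["journalctl", "-b0", "-u", "pve-cluster.service"],
    ["pveversion", "-v"],
]


def run_command(cmd):
    """Run one command and return its combined output, or a short note on failure."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    except FileNotFoundError:
        return f"(command not found: {cmd[0]})"
    except subprocess.TimeoutExpired:
        return "(timed out)"
    return done.stdout + done.stderr


def collect(commands=COMMANDS, runner=run_command):
    """Return one report with each command line followed by its output."""
    sections = []
    for cmd in commands:
        sections.append(f"$ {' '.join(cmd)}\n{runner(cmd)}")
    return "\n".join(sections)
```

Calling `print(collect())` as root on the affected node would produce the report. A non-zero exit status is not treated as an error here, since `systemctl status` returns one for a failed unit and that output is exactly what is wanted.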
