HA error state on some VMs (due to pve-cluster service crash?)

gradinaruvasile

Active Member
Oct 22, 2015
66
9
28
After upgrading to Proxmox 6 I have been seeing VMs randomly enter an HA error state, shown as a red circle above the VM icon.
Their HA State becomes "error". The cluster seems fine otherwise.
The logs show "Main process exited, code=killed, status=6/ABRT" for the pve-cluster service, which is then restarted automatically.

This is a problem because these VMs will no longer be restarted elsewhere (I assume?) if the node goes down.
The clunky workaround, without rebooting the VM, is to go to the cluster-wide HA config, remove the VM from HA and re-add it. That is unpleasant when several VMs are affected (I had this happen once with around 10 VMs from all over the cluster).
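
When several VMs are affected, the remove and re-add step could be scripted instead of clicked through in the GUI. Below is a minimal sketch, assuming the `ha-manager remove <sid>` and `ha-manager add <sid>` subcommands do the same thing as the GUI steps; the helper name `readd_commands` is made up for illustration. It only builds and prints the command lines so they can be reviewed before being run on a node.

```python
import shlex


def readd_commands(vmids):
    """Build ha-manager command lines that remove and re-add each VM as an HA resource.

    Hypothetical helper: nothing is executed here, the commands are only returned.
    """
    commands = []
    for vmid in vmids:
        sid = f"vm:{int(vmid)}"  # HA service ID as it appears in the logs, e.g. vm:134
        commands.append(["ha-manager", "remove", sid])
        commands.append(["ha-manager", "add", sid])
    return commands


if __name__ == "__main__":
    # VM 134 is the one currently in the error state in this thread.
    for cmd in readd_commands([134]):
        print(shlex.join(cmd))
```

Re-adding this way would presumably not carry over any non-default HA settings (group, restart limits) the resource had before, so those would need to be passed again.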

Right now only one VM is affected, ID 134.
Error logs:

pve-ha-lrm service logs (reverse):
Code:
Aug 30 10:40:55 server pve-ha-lrm[18799]: temporary inconsistent cluster state (cfs restart?), skip round
Aug 30 10:40:54 server pve-ha-lrm[18799]: updating service status from manager failed: Connection refused
Aug 30 10:40:14 server pve-ha-lrm[18799]: temporary inconsistent cluster state (cfs restart?), skip round
Aug 30 10:39:25 server pve-ha-lrm[21140]: service vm:134 is in an error state and needs manual intervention. Look up 'ERROR RECOVERY' in the documentation.
Aug 30 10:39:14 server pve-ha-lrm[21090]: Connection refused
Aug 30 10:39:14 server pve-ha-lrm[21090]: ipcc_send_rec[3] failed: Connection refused
Aug 30 10:39:14 server pve-ha-lrm[21090]: ipcc_send_rec[2] failed: Connection refused
Aug 30 10:39:14 server pve-ha-lrm[21090]: ipcc_send_rec[1] failed: Connection refused
and pve-ha-crm logs:
Code:
Aug 30 10:39:20 server pve-ha-crm[18796]: service 'vm:134' got unrecoverable error (exit code 255))
Aug 30 10:39:20 server pve-ha-crm[18796]: service 'vm:134': state changed from 'started' to 'error'
pve-cluster log (this one is in reverse):
Code:
Aug 30 10:39:14 server systemd[1]: Starting The Proxmox VE cluster filesystem...
Aug 30 10:39:14 server systemd[1]: Stopped The Proxmox VE cluster filesystem.
Aug 30 10:39:14 server systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 11.
Aug 30 10:39:14 server systemd[1]: pve-cluster.service: Service RestartSec=100ms expired, scheduling restart.
Aug 30 10:39:14 server systemd[1]: pve-cluster.service: Failed with result 'signal'.
Aug 30 10:39:14 server systemd[1]: pve-cluster.service: Main process exited, code=killed, status=6/ABRT
Aug 30 10:26:49 server pmxcfs[23824]: [dcdb] notice: data verification successful
 

gradinaruvasile
That means killing the VM as far as I know, but the VM works well: it has no issues other than the cluster thinking it has issues. It can be managed, migrated, etc. I can remove it from HA and re-add it and it works, so that is not the issue here.

The issue is that VMs randomly drop out of HA without having any actual problem, and if nobody is there to micromanage them, HA could stay disabled for them for good, for no reason other than the cluster's internal sync being temporarily off (I'd suggest a bug).
In the logs above you can see that at the moment the pve-cluster service crashed (or was killed for some reason?), the VM in question was marked as being in the error state, along with "temporary inconsistent cluster state" messages as the service restarted.
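
To check this correlation on other nodes without reading the journal by eye, something like the following could be used. It is a rough sketch that assumes the journal line format shown in the logs above; the function names, the 30-second window and the placeholder year are my own choices (the log lines carry no year, so only time differences are meaningful).

```python
import re
from datetime import datetime, timedelta

# "Aug 30 10:39:14 server systemd[1]: message" -> timestamp, process name, message
LINE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ ([\w.-]+)\[\d+\]: (.*)$")
# pve-ha-crm message recording a service moving into the error state
TO_ERROR = re.compile(r"service '([^']+)': state changed from '\w+' to 'error'")

SAMPLE = [
    "Aug 30 10:39:14 server systemd[1]: pve-cluster.service: Main process exited, code=killed, status=6/ABRT",
    "Aug 30 10:39:20 server pve-ha-crm[18796]: service 'vm:134': state changed from 'started' to 'error'",
]


def parse(lines, year=2000):
    """Yield (timestamp, process, message) for every line in the expected format."""
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            yield ts, m.group(2), m.group(3)


def errors_near_crashes(lines, window=timedelta(seconds=30)):
    """Return (service, time) pairs for HA error transitions close to a pve-cluster exit."""
    events = list(parse(lines))
    crashes = [ts for ts, proc, msg in events
               if proc == "systemd"
               and msg.startswith("pve-cluster.service: Main process exited")]
    hits = []
    for ts, _proc, msg in events:
        m = TO_ERROR.search(msg)
        if m and any(abs(ts - crash) <= window for crash in crashes):
            hits.append((m.group(1), ts.strftime("%H:%M:%S")))
    return hits


if __name__ == "__main__":
    print(errors_near_crashes(SAMPLE))
```

Fed with the two lines from my logs, it reports vm:134 going to error six seconds after the pve-cluster process was killed.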
 

Chris

Proxmox Staff Member
Jan 2, 2019
598
60
28
gradinaruvasile said:
That means killing the VM as far as I know

This is correct.

gradinaruvasile said:
In the logs above you can see that at the moment the pve-cluster service crashed (or was killed for some reason?), the VM in question was marked as being in the error state, along with "temporary inconsistent cluster state" messages as the service restarted.

There was a bug in pmxcfs which could cause the restart you are seeing, and therefore your inconsistent cluster state. We will push a new version soon.
 
