HA error state on some VMs (due to pve-cluster service crash?)

gradinaruvasile

After upgrading to Proxmox 6 I observed a random event: VMs randomly get into HA error states, shown as a red circle above the VM icon.
Their HA State becomes "error". The cluster seems fine otherwise.
The logs indicate a "Main process exited, code=killed, status=6/ABRT" for the pve-cluster service which is subsequently restarted automatically.

This is a problem since these VMs will not respawn anymore (I assume?) if the node goes down.
The clunky solution, without rebooting the VM, is to go to the cluster-wide HA config and remove and re-add the VM to HA. This is unpleasant if there are several of them (I had this happen once with around 10 VMs from all over the cluster).
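
For several affected VMs, the remove/re-add workaround can be scripted instead of clicked through. The following is only a rough sketch: it assumes `ha-manager status` prints one line per resource in the shape `service vm:134 (server, error)`, and the helper names `services_in_error` and `readd_commands` are made up for illustration. It only builds the `ha-manager remove` / `ha-manager add` command lists, it does not run them.

```python
import re

# Assumed line shape of `ha-manager status` output (not confirmed in this thread):
#   service vm:134 (server, error)
SERVICE_LINE = re.compile(
    r"^service\s+(?P<sid>(?:vm|ct):\d+)\s+\((?P<node>[^,]+),\s*(?P<state>[^)]+)\)$"
)


def services_in_error(status_text):
    """Return the service IDs (e.g. 'vm:134') whose HA state is 'error'."""
    sids = []
    for line in status_text.splitlines():
        match = SERVICE_LINE.match(line.strip())
        if match and match.group("state").strip() == "error":
            sids.append(match.group("sid"))
    return sids


def readd_commands(sids):
    """Build the remove/re-add command pairs for each service ID.

    The commands are returned as argument lists so they can be reviewed
    first and then passed to subprocess.run() on a cluster node.
    """
    commands = []
    for sid in sids:
        commands.append(["ha-manager", "remove", sid])
        commands.append(["ha-manager", "add", sid])
    return commands
```

Note that a resource re-added this way comes back with default HA settings, so any group or restart options it had before would presumably need to be passed to `ha-manager add` again.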

Right now I have only one affected VM, ID 134.
Error logs:

pve-ha-lrm service logs (newest first):
Code:
Aug 30 10:40:55 server pve-ha-lrm[18799]: temporary inconsistent cluster state (cfs restart?), skip round
Aug 30 10:40:54 server pve-ha-lrm[18799]: updating service status from manager failed: Connection refused
Aug 30 10:40:14 server pve-ha-lrm[18799]: temporary inconsistent cluster state (cfs restart?), skip round
Aug 30 10:39:25 server pve-ha-lrm[21140]: service vm:134 is in an error state and needs manual intervention. Look up 'ERROR RECOVERY' in the documentation.
Aug 30 10:39:14 server pve-ha-lrm[21090]: Connection refused
Aug 30 10:39:14 server pve-ha-lrm[21090]: ipcc_send_rec[3] failed: Connection refused
Aug 30 10:39:14 server pve-ha-lrm[21090]: ipcc_send_rec[2] failed: Connection refused
Aug 30 10:39:14 server pve-ha-lrm[21090]: ipcc_send_rec[1] failed: Connection refused
and pve-ha-crm logs:
Code:
Aug 30 10:39:20 server pve-ha-crm[18796]: service 'vm:134' got unrecoverable error (exit code 255))
Aug 30 10:39:20 server pve-ha-crm[18796]: service 'vm:134': state changed from 'started' to 'error'
pve-cluster log (also newest first):
Code:
Aug 30 10:39:14 server systemd[1]: Starting The Proxmox VE cluster filesystem...
Aug 30 10:39:14 server systemd[1]: Stopped The Proxmox VE cluster filesystem.
Aug 30 10:39:14 server systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 11.
Aug 30 10:39:14 server systemd[1]: pve-cluster.service: Service RestartSec=100ms expired, scheduling restart.
Aug 30 10:39:14 server systemd[1]: pve-cluster.service: Failed with result 'signal'.
Aug 30 10:39:14 server systemd[1]: pve-cluster.service: Main process exited, code=killed, status=6/ABRT
Aug 30 10:26:49 server pmxcfs[23824]: [dcdb] notice: data verification successful
 
That means killing the VM as far as I know, but the VM works well: it has absolutely no issues other than the cluster thinking it has issues. It can be managed, migrated, etc. I can remove it from HA and re-add it and it works; that is not the issue here.

The issue is that VMs randomly drop out of HA without having any actual issues, and if nobody is there to micromanage them, HA could stay disabled for those VMs for good, for no reason other than the cluster's internal sync being temporarily off (I'd suggest a bug).
In the logs above you can see that at the moment the pve-cluster service crashed (or was killed for some reason?), the VM in question was marked as being in an error state, coupled with "temporary inconsistent cluster state" messages as the service restarted.
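
To check whether this correlation holds across more than one incident, the journal lines can be matched up by timestamp. This is a minimal sketch, assuming syslog-style lines exactly like the ones quoted above and an English month abbreviation; the function names, the 30-second window and the placeholder year are assumptions of mine (the log lines carry no year).

```python
import re
from datetime import datetime, timedelta

# Syslog-style prefix as seen in the logs above: "Aug 30 10:39:14 server unit[pid]: msg"
LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ (?P<unit>[\w-]+)\[\d+\]: (?P<msg>.*)$"
)
ERROR_CHANGE = re.compile(
    r"service '(?P<sid>[^']+)': state changed from '[^']+' to 'error'"
)
CRASH_PREFIX = "pve-cluster.service: Main process exited, code=killed"


def parse(line, year=2000):
    """Split one log line into (timestamp, unit, message), or None if it does not match.

    The year is a placeholder because the log lines do not include one.
    """
    match = LINE.match(line.strip())
    if not match:
        return None
    when = datetime.strptime(f"{year} {match.group('stamp')}", "%Y %b %d %H:%M:%S")
    return when, match.group("unit"), match.group("msg")


def errors_after_crash(lines, window=timedelta(seconds=30), year=2000):
    """Return (service ID, delay) for HA error transitions shortly after a pve-cluster crash."""
    crashes, errors = [], []
    for line in lines:
        parsed = parse(line, year)
        if parsed is None:
            continue
        when, unit, msg = parsed
        if unit == "systemd" and msg.startswith(CRASH_PREFIX):
            crashes.append(when)
        elif unit == "pve-ha-crm":
            change = ERROR_CHANGE.search(msg)
            if change:
                errors.append((when, change.group("sid")))
    return [
        (sid, when - crash)
        for when, sid in errors
        for crash in crashes
        if timedelta(0) <= when - crash <= window
    ]
```

Fed with the lines from this post, it should pair vm:134 with the pve-cluster crash a few seconds earlier. Logs that span a year boundary are not handled.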
 
That means killing the VM as far as i know
This is correct.

In the logs above you can see that in the moment the pve cluster service is crashed (or killed for some reason?), the VM in question is marked as in error state, coupled with "temporary inconsistent cluster state" messages as the service is restarted.
There was a bug in pmxcfs which could cause the restart you are seeing and therefore your inconsistent cluster state. We will push a new version soon.
 
I'm getting the same 'Connection refused' error on:

Linux proxmox-2 5.11.22-4-pve #1 SMP PVE 5.11.22-8 (Fri, 27 Aug 2021 11:51:34 +0200) x86_64 GNU/Linux

PVE does not start, cannot do anything with it :(
 
Hi,
next time, please don't answer in such old threads but create a new one.

What is the output of systemctl status pve-cluster.service? If there is an error, the full log can be seen with journalctl -b0 -u pve-cluster.service.

Is there anything interesting in /var/log/syslog? Please also share the output of pveversion -v.
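
If it helps, the requested outputs can be gathered into a single text blob for pasting into a new thread. This is just a sketch: the three commands are the ones named above, while `run_command` and `build_report` are hypothetical helper names. `/var/log/syslog` is not included and would still need to be checked by hand.

```python
import subprocess

# The commands requested in the reply above.
COMMANDS = [
    ["systemctl", "status", "pve-cluster.service"],
    ["journalctl", "-b0", "-u", "pve-cluster.service"],
    ["pveversion", "-v"],
]


def run_command(argv):
    """Run one command and return its combined stdout/stderr as text."""
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            check=False,  # systemctl status exits non-zero for failed units
        )
    except FileNotFoundError:
        return f"(command not found: {argv[0]})\n"
    return result.stdout


def build_report(commands=COMMANDS, runner=run_command):
    """Concatenate each command line and its output into one report string."""
    sections = []
    for argv in commands:
        sections.append(f"$ {' '.join(argv)}\n{runner(argv)}")
    return "\n".join(sections)
```

Running `print(build_report())` as root on the affected node should produce the report; the `runner` parameter exists only so the function can be tried without a Proxmox host.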
 
