HA state "error" but I cannot see why

deniswitt

Nov 17, 2017
Hi,

The HA state of all VMs in our Proxmox test environment has been "error" since this morning, and the backup failed. The odd part: all VMs are running perfectly. The state of our containers is fine.

How can I clear the error state without restarting the VMs? And more importantly, how can I find out why the error state was raised? So far I have not found any clue in the logs.

Thanks in advance.
 
I had the same "problem" this morning with a single VM. It was in "error" state but still up and running fine.

In my case I had added a backup schedule that conflicted with the one already in place (same time, different location), and one backup task failed to lock the VM. I guess that is where the error state came from. There is nothing useful in the logs except this entry (journalctl -u pve-ha-lrm -l):

Dec 06 00:30:34 pve2 pve-ha-lrm[24176]: service vm:110 is in an error state and needs manual intervention. Look up 'ERROR REC

But I was not able to clear the error state without changing the state to "disabled" in HA Resources (which stopped the VM) and then back to "started". From the documentation:

state: <disabled | enabled | ignored | started | stopped> (default = started)
[...]
disabled

The CRM tries to put the resource in stopped state, but does not try to relocate the resources on node failures. The main purpose of this state is error recovery, because it is the only way to move a resource out of the error state.
Maybe you could just remove the VM from the HA Resources list and add it again, but I did not try that; I used the disabled/started approach instead.
 
Hi Belokan,

The funny thing is that the backup error was for the group "testing", while the HA errors are reported for the group "hosting".

Anyway, your proposed solution works fine, but is a pain in the ass when you have more than 10 VMs. Is there a command line solution? pvecm only seems to have the option to remove a node.
 
Hi,

We have the same problem in a 4-node cluster with one node already running kernel 4.13.
Virtual machines sporadically switch to the HA error state, but keep running without any problems.

@deniswitt Yes, you can set HA states from the command line with:
# ha-manager set <service> --state <started|stopped|disabled>
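
With more than a handful of VMs, the affected services could be picked out of the ha-manager status output instead of clicking through the GUI. This is only a minimal sketch: it assumes status lines of the form "service vm:110 (pve2, error)", the function names are made up for illustration, and the second function prints the commands rather than running them.

```shell
#!/bin/sh
# Print the IDs of all services that the status output lists as "error".
# Reads "ha-manager status" output on stdin.
# Assumed line format: service vm:110 (pve2, error)
list_error_services() {
    awk '$1 == "service" && $NF == "error)" { print $2 }'
}

# Dry run: for each service ID on stdin, print (do not execute) the
# disabled/started pair. Note that "disabled" stops the guest.
print_reset_commands() {
    while read -r sid; do
        echo "ha-manager set $sid --state disabled"
        echo "ha-manager set $sid --state started"
    done
}
```

On a cluster node this could be chained as: ha-manager status | list_error_services | print_reset_commands. The state change is requested rather than applied immediately, so it is probably safer to wait until the status shows a service as disabled before sending the started request.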
 
Hi baerm,

We are using 4.13 as well.

ha-manager doesn't seem to help, as the only option that works is "disabled", and that stops the machines, which is exactly what we are trying to avoid. Thanks anyway.
 
Hi deniswitt,

What seems to work apart from disabled -> started is setting the state to "ignored" (not sure if this is possible via the command line, but I would guess so), then migrating the VM, and then setting the HA state back to "started". I'm not sure the migration is necessary at all, but this was our workflow at least.

We have another issue with Debian stretch VMs and migrations, so this didn't work for all of our VMs in error state, but the jessie VMs worked well ;)
 
Hi baerm,

Unfortunately, "ignored" doesn't work: service 'vm:102' in error state, must be disabled and fixed first (500)

Same for migration:

Requesting HA migration for VM 111 to node hosting2
service 'vm:111' in error state, must be disabled and fixed first
TASK ERROR: command 'ha-manager migrate vm:111 hosting2' failed: exit code 255
 
I got another VM in "error" this morning, and this time it was not due to a backup lock.
Is there a place where I can find a clear log of what led to the error state? Same as last time, the VM was up and running without any issue.

Thanks.
 
Up ...

Same "problem" this morning: a VM in error state. I removed it from HA resources and then added it again to avoid a useless restart (changing the state from error to stopped to started).

Where can I find any clue as to why it was put into the "error" state?

Thanks in advance.
 
It appears to be happening here as well, but only when a VM is set to some kind of HA state and we attempt to use the Backup feature.

edit: A kind of hack-y "fix" is to remove the affected VMs from HA, then add them back in. Not really an optimal option.
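
For reference, the remove/re-add workaround can also be done with ha-manager remove and ha-manager add. The following is a rough sketch only: the function name is hypothetical, it prints the commands instead of running them, and it assumes the group has to be passed again when re-adding because the old HA settings go away with the removal.

```shell
#!/bin/sh
# Dry run: print the commands that take one service out of HA and add it
# back. $1 is the service ID (e.g. vm:110), $2 an optional HA group name.
print_readd_commands() {
    sid=$1
    group=$2
    echo "ha-manager remove $sid"
    if [ -n "$group" ]; then
        echo "ha-manager add $sid --state started --group $group"
    else
        echo "ha-manager add $sid --state started"
    fi
}
```

Example: print_readd_commands vm:110 hosting. Running ha-manager config beforehand shows the current settings of each resource, which helps when other options such as max_restart also need to be restored.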
 
Wondering if there are any others that are seeing this issue:

The situation for us is that we have a cluster that is set up with HA, and VMs on the nodes are in an HA group (nothing fancy, default settings). Backups are set up for various VMs (no compression, snapshot) to a storage location on a host outside of the cluster via NFS.

The backups end up running, but afterwards the VMs that were backed up have an HA state that reads "error". Backups that run for VMs that are not in an HA group do not have any issues (no HA state to be broken).

The "fix" for affected VMs is to remove them from their respective HA group and re-add them, which is not a fix but a bad band-aid.

Any kind of assistance or further information would be wonderful. Thanks!
 
Is there any log file where we can see what HA is doing and why VMs go into an error state?
Some sort of error log?
 
Got the same issue..

journalctl -u pve-ha-lrm -l was the closest I could find, but its logging is shallow: nothing but exit code 255.
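
It may also be worth reading the CRM log (unit pve-ha-crm) next to the LRM log, since journalctl accepts several -u options. A small sketch for narrowing the output down to one service; the helper name is made up, and it simply filters whatever log text it gets on stdin:

```shell
#!/bin/sh
# Keep only the log lines that mention one HA service ID, e.g. vm:110.
# -F treats the ID as a fixed string, -w avoids matching vm:1100.
filter_service_log() {
    grep -F -w -- "$1"
}
```

Possible usage: journalctl -u pve-ha-lrm -u pve-ha-crm --since yesterday | filter_service_log vm:110. As far as I understand, the CRM is only active on the current master node, so that part of the log would have to be read there.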
 
I ran into this thread when searching Google. I've been trying my best to break our test-bed. The error I was getting was
service 'vm:100' in error state, must be disabled and fixed first.

The fix for me was to remove VM 100 from HA under Datacenter > HA. After this it was stuck in a migrate state, which I assume is what killed it in the first place. Obviously the migrate state was no longer applicable at this point, so I unlocked the VM from the PVE command line with qm unlock 100.

Once this was done I was able to add the VM back into HA and it booted successfully.
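
To check for a leftover lock before unlocking, the VM config can be inspected. A minimal sketch, assuming a locked VM has a line such as "lock: migrate" in its qm config output; the helper name is hypothetical:

```shell
#!/bin/sh
# Print the lock type found in "qm config <vmid>" output read on stdin.
# Prints nothing if the config has no "lock:" line.
vm_lock() {
    awk -F': ' '$1 == "lock" { print $2 }'
}
```

Possible usage: qm config 100 | vm_lock. If a stale lock shows up and no task is still working on the VM, qm unlock 100 clears it.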
 
