Clear HA error status from the GUI

grin

It would be neat to have a HA function on the gui to actually clear the error status of the given CT/VM.

HA has a bad tendency to put stuff into error when it hits a timeout, or when it is generally holding some grudge against me. It'd do no harm: if it's still bad, it'll go into error again anyway.

ha-manager remove <ctid> ; wait_until_ha_forgets <ctid> ; ha-manager add <ctid>
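
The `wait_until_ha_forgets` step in the line above is not a real command. A minimal sketch of the whole sequence could look like this, assuming that `ha-manager status` prints one line per managed resource in the form `service ct:102 (node1, error)` and that this line disappears once the removal has been processed. The function names and the 60-second default timeout are made up for illustration.

```shell
# Hypothetical helper: succeed once <sid> is no longer listed by
# `ha-manager status`, fail after <timeout> seconds (default 60).
# Assumes status lines of the form "service ct:102 (node1, error)".
wait_until_ha_forgets() {
    local sid=$1 timeout=${2:-60} waited=0
    while ha-manager status | grep -q "^service $sid "; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
}

# Remove the resource, wait until the manager has dropped it, then add it again.
ha_readd() {
    local sid=$1
    ha-manager remove "$sid" &&
        wait_until_ha_forgets "$sid" &&
        ha-manager add "$sid"
}

# Usage (on a cluster node): ha_readd ct:102
```

Re-adding this way presumably falls back to default options, so any group or restart/relocate settings would have to be passed to `ha-manager add` again.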

As a side note, it would also be very kind to have a means to add a CT/VM to HA from within the CT/VM page, rather than from the HA summary page.
 
My understanding is that you move the "requested state" to "disabled" for the VM in HA resources, then back to "start". That should clear the error.
 
1) Almost. First, you have to go to and from the host tab. Second, you have to wait until the error gets cleared by the ha-manager; only then can you add it again. Do it too fast and it won't clear. This may require you to go there and back again, again. :)

2) This requires at least four, most probably close to a dozen clicks, including the scrolling and the clicking back and forth. The point of my request was to require one.

3) …right now it's so cumbersome that I'd actually rather do it on the console.
 
My understanding is that you move the "requested state" to "disabled" for the VM in HA resources, then back to "start". That should clear the error.

Exactly.


Some things are not ideal from user interaction perspective.
The reasons why we (or at least I, for that matter) do not see a major problem with how resetting an erroneous service is realized are:
* The VM/CT summary view is not the place where we should manage its HA state - i.e., if we add one-click recovery there, then one could argue for also adding 'add to/remove from HA', among other things - making an already crowded view even more crowded and less intuitive. I'm not against a convenience helper, but here I'm a bit wary, as:
* Clearing HA errors should really not happen that often in a production environment, and once they happen we do not want to encourage a user to just clear the error and think everything is good again.
As the docs state, you should always check what error happened and why, and resolve the problem before clearing the error. And here the HA overview panel is a good place, as you *have* an overview there of the state of the whole cluster and all its services, with information that is lacking in the VM overview tab.
* Chances are that when one service fails in a production setup as part of an outage (storage, network, ...), more will fail - so there the HA view is the better place too.
* If you are evaluating or testing (i.e. a non-production setup), then yeah, I can see why it's a bit cumbersome: there you may get into errors more often, and the clustered HA manager needs a few seconds until all requested states are propagated and in effect. But for that there is the CLI tool `ha-manager`, or also the API itself.

For 3), that's a moot point: the CLI is always faster and better - just not everyone knows this fact :p But there, too, you need to wait until the state is cleared, as mostly the same thing happens as with a WebUI request, so 1) does not get solved by 3).
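
On the CLI, the disable, wait, re-enable cycle discussed here might look like the following sketch. It assumes the `ha-manager set <sid> --state ...` syntax and a `service <sid> (<node>, <state>)` line format in the `ha-manager status` output; the helper names and the timeout are hypothetical.

```shell
# Print the current HA state of <sid>, parsed from `ha-manager status`
# (assumed line format: "service ct:102 (node1, error)").
ha_state() {
    ha-manager status | awk -v sid="$1" '
        $1 == "service" && $2 == sid { sub(/\)$/, "", $NF); print $NF }'
}

# Request "disabled", wait until the resource has left the error state,
# then request "started" again. Fails after <timeout> seconds (default 60).
ha_clear_error() {
    local sid=$1 timeout=${2:-60} waited=0
    ha-manager set "$sid" --state disabled || return 1
    while [ "$(ha_state "$sid")" = "error" ]; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    ha-manager set "$sid" --state started
}

# Usage (on a cluster node): ha_clear_error ct:102
```

The polling loop is the part that a single click or a single command cannot skip: the second request is only sent once the manager no longer reports `error`.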

It'd do no harm: if it's still bad, it'll go into error again anyway.

This is generalizing. There may be situations (probably the one you're experiencing yourself) where it could in fact be OK to do so, but that may not always be the case, else we would have no error state at all.
That does not mean that I'm totally content with how it is currently handled, but just a one-click clear button is not the solution here, IMO.
May I ask what happens that you get into the error state so often?
We could change the meaning of max_restart/max_relocate so that if those are set to zero it means infinite tries - thus circumventing most cases where the error state could even be reached.

I've a patch lying around which would allow adding a VM/CT to HA through the create wizard. Would that be a help for you, or do you always add/remove existing VM/CTs to/from HA?
And if so, can you explain your rationale, so that I understand why frequently adding/removing the same VM is needed?
 
Some things are not ideal from user interaction perspective.
The reasons why we (or at least I, for that matter) do not see a major problem with how resetting an erroneous service is realized are:
* The VM/CT summary view is not the place where we should manage its HA state - i.e., if we add one-click recovery there, then one could argue for also adding 'add to/remove from HA', among other things - making an already crowded view even more crowded and less intuitive.
Right now there is no convenient way to add a CT/VM to HA: you go to the HA view and have to type the ID number, which you should either memorize or find by scrolling back and forth through the list. The point is: I hardly ever care WHAT the ctid is, so I don't [intend to] memorize it. The HA view only shows the numbers, which are, at least for me, meaningless. So this part of the GUI almost never comes up on my screen.

And thus the same goes for clearing the error state: there is no button to "remove, wait, add again", so
  1. I write down the number with a pencil on an A0 paper made from dead trees
  2. I remove it, it disappears
  3. I go and check the CT, and wait until the error clears
  4. go back
  5. find my A0 paper, and read the number
  6. copy it to the add field.
For me this is not ideal, but I can accept that I'm eccentric and way too lazy. ;-)


May I ask what happens that you get into the error state so often?
Indeed you could: "migrate all stuff from this node to another one" always ends up having some CTs in error state, since they have to be shut down (well I miss this great opportunity to complain about criu still not available, or maybe I haven't), transferred and restarted, and mass shutdowns seem to get on proxmox's nerves (and timeout defaults I guess).

The other dead-sure case is when I try to migrate a VM with a local disk attached. Obviously it cannot be moved, but I see no reason to put it in error, since in the majority of cases a failed migration results in a non-error, non-migrated CT. Just not in this case. It's justified, don't get me wrong, but it's cumbersome to clear.

No, it doesn't really happen that often, you're right, but when it does it's mildly annoying.

* Clearing HA errors should really not happen that often in a production environment, and once they happen we do not want to encourage a user to just clear the error and think everything is good again.
As the docs state, you should always check what error happened and why, and resolve the problem before clearing the error. And here the HA overview panel is a good place, as you *have* an overview there of the state of the whole cluster and all its services, with information that is lacking in the VM overview tab.
So you see most of the time I'm very aware of what happened. But I'm just one kind of use case, I accept that.

* Chances are that when one service fails in a production setup as part of an outage (storage, network, ...), more will fail - so there the HA view is the better place too.
That panel is not really my friend. Just the opposite: for me it looks very unfriendly. I just checked, and realised that if, for example, I needed to mass edit "max reloc" or "max restart", it would take 8 clicks per entry. There are IDs only, no info on the CTs, their states, etc., only the HA state.

If I ever wanted to see the GUI while/after an outage (which I, personally, probably wouldn't since console is 100 times faster) I would go to the datacenter or host summaries, but possibly not even there, just look on the server view pane.

I don't want to convince you, I just share how I use it.

* If you are evaluating or testing (i.e. a non-production setup), then yeah, I can see why it's a bit cumbersome: there you may get into errors more often, and the clustered HA manager needs a few seconds until all requested states are propagated and in effect. But for that there is the CLI tool `ha-manager`, or also the API itself.
Not this time. For evaluation I do lots of console stuff; my colleagues are from the click generation. ;-)


We could change the meaning of max_restart/max_relocate so that if those are set to zero it means infinite tries - thus circumventing most cases where the error state could even be reached.
This one sounds like an exceptionally bad idea. :) Infinite restarts would technically make recovery harder in every way.

I've a patch lying around which would allow adding a VM/CT to HA through the create wizard. Would that be a help for you, or do you always add/remove existing VM/CTs to/from HA?
Possibly, but as you see I'm not generally satisfied with the HA screen. I don't have a clear mental view how I would like to have it, haven't thought about it so far.
 
Right now there is no convenient way to add a CT/VM to HA: you go to the HA view and have to type the ID number, which you should either memorize or find by scrolling back and forth through the list. The point is: I hardly ever care WHAT the ctid is, so I don't [intend to] memorize it. The HA view only shows the numbers, which are, at least for me, meaningless. So this part of the GUI almost never comes up on my screen.

Fair point. This is far from ideal and I see it as an issue too; other things got in the way (and I often use the CLI).
This will be tracked under: https://bugzilla.proxmox.com/show_bug.cgi?id=1517

And thus the same goes for clearing the error state: there is no button to "remove, wait, add again", so
  1. I write down the number with a pencil on an A0 paper made from dead trees
  2. I remove it, it disappears
  3. I go and check the CT, and wait until the error clears
  4. go back
  5. find my A0 paper, and read the number
  6. copy it to the add field.

Hmm, would you not be faster if you just go to "Datacenter -> HA", sort by the state column (this gets remembered so you have to do it just once) and edit those which show "error" and set them to disabled?
Not perfect but at least no trees must die for this one... ;)
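
The CLI counterpart of sorting by the state column could be sketched as below. This assumes `ha-manager status` prints lines such as `service ct:102 (node1, error)` and that `ha-manager set <sid> --state disabled` is the way to request the disabled state; the function names are hypothetical.

```shell
# Print the IDs of all HA resources currently in the error state.
# Reads `ha-manager status` output on stdin (assumed line format:
# "service ct:102 (node1, error)").
ha_error_sids() {
    awk '$1 == "service" && $NF == "error)" { print $2 }'
}

# Request "disabled" for every resource that is in error.
ha_disable_errored() {
    ha-manager status | ha_error_sids | while read -r sid; do
        # Keep the command from consuming the loop's stdin.
        ha-manager set "$sid" --state disabled < /dev/null
    done
}

# Usage (on a cluster node):
#   ha-manager status | ha_error_sids   # look first
#   ha_disable_errored                  # then act
```

Listing the affected IDs before disabling them leaves room to check why each one failed, and no ID has to be written down.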


well I miss this great opportunity to complain about criu still not available, or maybe I haven't

If it could migrate more than a container acting as a potato, it would probably be integrated in PVE, but as it cannot do this, and each kernel patch may also leave even the potato-migrate capability in shambles, it is not.
But that has been discussed often enough and is indeed too off-topic for this thread.

Indeed you could: "migrate all stuff from this node to another one" always ends up having some CTs in error state, since they have to be shut down, transferred and restarted, and mass shutdowns seem to get on proxmox's nerves (and timeout defaults I guess).

Hmm, OK that should simply not be the case - I'll try to reproduce this here.

That panel is not really my friend. Just the opposite: for me it looks very unfriendly. I just checked, and realised that if, for example, I needed to mass edit "max reloc" or "max restart", it would take 8 clicks per entry. There are IDs only, no info on the CTs, their states, etc., only the HA state.

Mass editing in GUIs is often a PITA (no idea how they got that popular), and that's why we try to just use simple text configuration files, so that this can be easily scripted with sed/awk, with our CLI tools, or with configuration management systems. Those who need mass editing often run bigger deployments (in terms of service counts), where admins often even prefer to do it this way - but you know all that already.

To really make this sane in the webUI a Filter/Action component would be needed, i.e. take all VM/CT with this criteria (vmid range, host, ...) and do something with them (add to HA, set max_relocate, ...).
But that's a bit much for now. I'll keep it in the back of my head, as in general I agree with you - when editing more than a few services without having a CLI available, it gets a little annoying fast...
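
Until such a component exists, the same filter/action idea can be approximated on the CLI. The sketch below selects resources by VMID range and applies one setting to all of them. It assumes the `service <sid> (<node>, <state>)` line format of `ha-manager status` and the `--max_relocate` option of `ha-manager set`; the function names are hypothetical.

```shell
# Filter: print resource IDs whose numeric VMID lies in [lo, hi].
# Reads `ha-manager status` output on stdin.
ha_sids_in_range() {
    awk -v lo="$1" -v hi="$2" '
        $1 == "service" {
            split($2, p, ":")
            if (p[2] + 0 >= lo + 0 && p[2] + 0 <= hi + 0) print $2
        }'
}

# Action: apply the same max_relocate value to every resource in the range.
ha_set_max_relocate() {
    local lo=$1 hi=$2 value=$3
    ha-manager status | ha_sids_in_range "$lo" "$hi" | while read -r sid; do
        ha-manager set "$sid" --max_relocate "$value" < /dev/null
    done
}

# Usage (on a cluster node): ha_set_max_relocate 100 199 2
```

Keeping the filter separate from the action means other criteria (node, state) or other actions could be swapped in without touching the loop.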

I don't want to convince you, I just share how I use it.

That's totally fine for me, I asked for it and am glad for you sharing it.

This one sounds an exceptionally bad idea. :) Infinite restarts would technically make recovery every way harder.

Yeah, that quick idea wasn't the best - now that I think about it I remember that I even introduced the max_ settings myself, as I disliked that it previously was handled like this.

Possibly, but as you see I'm not generally satisfied with the HA screen. I don't have a clear mental view how I would like to have it, haven't thought about it so far.

I see and acknowledge the problems mentioned, I try to tackle a few smaller sub problems in the upcoming weeks. I also opened https://bugzilla.proxmox.com/show_bug.cgi?id=1518 to track the "I do not want to leave my VM/CT summary for HA" issue.
 
