Some things are not ideal from a user-interaction perspective.
The rationale why we (or at least I, for that matter) do not see a major problem with how resetting an erroneous service is realized is:
* The VM/CT summary view is not the place where we should manage its HA state, i.e., if we add one-click recovery there, then one could argue for also adding an 'add to/remove from HA' button, among other things, making an already crowded view even more crowded and less intuitive.
Right now there is no convenient way to add a CT/VM to HA: you go to the HA view and have to type the ID number, which you either have to memorize or find by scrolling back and forth through the list. The point is: I hardly ever care WHAT the ctid is, so I don't [intend to] memorize it. The HA view only shows the numbers, which are, at least for me, meaningless. So this part of the GUI almost never comes up on my screen.
And thus the same goes for clearing the error state: there is no button to "remove, wait, add again", so
- I write down the number with a pencil on an A0 paper made from dead trees
- I remove it, it disappears
- I go and check the CT, and wait until the error clears
- go back
- find my A0 paper, and read the number
- copy it to the add field.
For me this is not ideal, but I can accept that I'm eccentric and way too lazy. ;-)
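
The steps above can be sketched as a small script. This is only a minimal sketch under stated assumptions: `reset_service`, `reset_commands` and the `is_healthy` callback are hypothetical names, the service ID `ct:101` is just an example, and the only real calls assumed are `ha-manager remove <sid>` and `ha-manager add <sid>`. Note that removing a service presumably also drops its HA settings (group, max_restart, ...), so those would have to be set again afterwards.

```python
import subprocess
import time
from typing import Callable, List


def reset_commands(sid: str) -> List[List[str]]:
    """Return the two ha-manager calls that bracket the manual wait."""
    return [["ha-manager", "remove", sid], ["ha-manager", "add", sid]]


def reset_service(
    sid: str,
    is_healthy: Callable[[], bool],  # hypothetical check, supplied by the caller
    run: Callable[..., object] = subprocess.run,
    poll_seconds: float = 5.0,
    timeout_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Remove a service from HA, wait until it looks fine, then add it back.

    Returns True if the service was re-added, False if the wait timed out
    (in which case the service stays removed from HA).
    """
    remove_cmd, add_cmd = reset_commands(sid)
    run(remove_cmd, check=True)

    # Poll the caller-supplied check instead of watching the GUI by hand.
    deadline = time.monotonic() + timeout_seconds
    while not is_healthy():
        if time.monotonic() >= deadline:
            return False
        sleep(poll_seconds)

    run(add_cmd, check=True)
    return True
```

Something like `reset_service("ct:101", my_check)` would then replace the pencil and the A0 paper; what `my_check` looks at is left open here on purpose, since checking why the error happened is still a manual job.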
May I ask what happens that you get into the error state so often?
Indeed you could: "migrate all stuff from this node to another one" always ends up with some CTs in error state, since they have to be shut down (well, I miss this great opportunity to complain about criu still not being available, or maybe I haven't), transferred and restarted, and mass shutdowns seem to get on proxmox's nerves (and timeout defaults, I guess).
The other dead-sure case is when I try to migrate a VM with a local disk attached. Obviously it cannot be moved, but I see no reason to put it in error state, since in the majority of cases a failed migration results in a non-error, non-migrated CT. Just not in this case. It's justified, don't get me wrong, but it's cumbersome to clear.
No, it doesn't really happen that often, you're right, but when it does it's mildly annoying.
* Clearing HA errors should really not happen that often in a production environment, and once they happen we do not want to encourage a user to just clear the error and think everything is good again.
As the docs state, you should always check what error happened and why, and resolve the problem before clearing the error. And here the HA Overview panel is a good place, as you *have* an overview there of the state of the whole cluster and all its services, with information that is lacking in the VM overview tab.
So you see most of the time I'm very aware of what happened. But I'm just one kind of use case, I accept that.
* Chances are that when one service fails in a production setup as part of an outage (storage, network, ...), more will fail, so there the HA view is the better place too.
That panel is not really my friend. Just the opposite: for me it looks very unfriendly. I just checked and realised that if, for example, I needed to mass-edit "max reloc" or "max restart", it would take 8 clicks per entry. There are IDs only, no info on the CTs, their states, etc., only the HA state.
If I ever wanted to see the GUI during or after an outage (which I, personally, probably wouldn't, since the console is 100 times faster) I would go to the datacenter or host summaries, but possibly not even there, and just look at the server view pane.
I don't want to convince you, I just share how I use it.
* If you are evaluating or testing (i.e. a non-production setup) then yeah, I can see why it's a bit cumbersome: here you may get into errors more often, and the clustered HA manager needs a few seconds until all requested states have been propagated and are in effect. But for that there is the CLI tool `ha-manager`, or also the API itself.
Not this time. For evaluation I do lots of console stuff; my colleagues are from the click generation. ;-)
We could change the meaning of max_restart/max_relocate so that setting them to zero means infinite tries, thus circumventing most cases where the error state could even be reached.
This one sounds like an exceptionally bad idea.
Infinite restarts would technically make recovery harder in every way.
I've a patch lying around which would allow adding a VM/CT to HA through the create wizard. Would that help you, or do you always add/remove existing VMs/CTs to/from HA?
Possibly, but as you see I'm not generally satisfied with the HA screen. I don't have a clear mental picture of how I would like it to be; I haven't thought about it so far.