Cannot delete HA resources since PVE 5 to 6 update

ApisD · Sep 18, 2019

Hello,

I have run into a problem with HA management since we upgraded our 3-node PVE cluster from version 5 to version 6. We of course followed the "Upgrade from 5.x to 6.0" wiki page, so we stopped both the pve-ha-lrm and pve-ha-crm services on all nodes before the upgrade. The pve5to6 script showed all checks green.
Since the upgrade everything works properly and the HA services restarted successfully, but the HA status of all VMs/CTs did not come back to "started"; it stays on "ignored". In the VM summary, the status is "none". So we tried to delete all HA entries (HA group + HA resources) to redo the settings. The HA group is now deleted, but it is impossible to delete the resources:

When we try to edit:

Do you know what's happening?

Thank you in advance, and apologies for my bad English.

t.lamprecht · Sep 27, 2019

The first error looks like an old issue that was already fixed in PVE 5.x: https://git.proxmox.com/?p=pve-ha-manager.git;a=commit;h=60d1b1cacf2173f3cea84db41dd740e35c7a10d9
But the second one seems weird, I have never seen that one.

ApisD said:
Do you know what's happening ?

Can you please post the output of following commands:

Code:

cat /etc/pve/ha/resources.cfg
cat /etc/pve/ha/manager_status
ha-manager status

ApisD · Sep 27, 2019

t.lamprecht said:
Can you please post the output of following commands:

Code:

cat /etc/pve/ha/resources.cfg
cat /etc/pve/ha/manager_status
ha-manager status

Here is the output of the three commands:

Bash:

root@Jupiter:~# cat /etc/pve/ha/resources.cfg

(no output, empty file)

Bash:

root@Jupiter:~# cat /etc/pve/ha/manager_status
{"master_node":"Uranus","timestamp":1568096440,"node_status":{"Saturne":"online","Uranus":"online","Jupiter":"online"},"service_status":{"ct:102":{"state":"freeze","uid":"7P/qDG1O/Yp+NH3LpyR6hA","node":"Jupiter"},"vm:201":{"state":"freeze","node":"Jupiter","uid":"ZAShvDPOMU+Ax01w4FAsrg"},"vm:202":{"state":"freeze","node":"Jupiter","uid":"TN/+JOVbI5RR3h0SssFBMQ"},"ct:150":{"uid":"yoZPTOO+nWoXbtgrJYMdjA","node":"Jupiter","state":"freeze"},"vm:203":{"uid":"JbQ6Pxfw3ipkNjsnkNJvTA","node":"Jupiter","state":"freeze"},"ct:101":{"uid":"32xug6v3hOU2IMlRyVb4kg","node":"Jupiter","state":"freeze"},"vm:254":{"uid":"N6WN376x1WNOyGzDEkkZ2g","node":"Jupiter","state":"freeze"},"ct:100":{"state":"freeze","node":"Jupiter","uid":"/kAVFB+uMVPlOXVwK0Wflw"},"ct:103":{"node":"Jupiter","uid":"ADOGg0+5EnbcwKjp6TEW0w","state":"freeze"},"vm:204":{"node":"Jupiter","uid":"p7n8w0PEGUYrfGm9NkK3ww","state":"freeze"},"vm:200":{"node":"Jupiter","uid":"UZavf7tDHB9VulJ2LnrtwQ","state":"freeze"}}}

Bash:

root@Jupiter:~# ha-manager status
quorum OK
master Uranus (idle, Tue Sep 10 08:20:40 2019)
lrm Jupiter (idle, Fri Sep 27 15:26:07 2019)
lrm Saturne (idle, Fri Sep 27 15:26:07 2019)
lrm Uranus (idle, Fri Sep 27 15:26:07 2019)
service ct:100 (Jupiter, ignored)
service ct:101 (Jupiter, ignored)
service ct:102 (Jupiter, ignored)
service ct:103 (Jupiter, ignored)
service ct:150 (Jupiter, ignored)
service vm:200 (Jupiter, ignored)
service vm:201 (Jupiter, ignored)
service vm:202 (Jupiter, ignored)
service vm:203 (Jupiter, ignored)
service vm:204 (Jupiter, ignored)
service vm:254 (Jupiter, ignored)

t.lamprecht · Sep 28, 2019

Huh, strange: all your services are in the frozen state, which normally only happens during a package update or a quick reboot. After that they should be picked up again by their respective LRM (all services are on Jupiter in your case).
But the resource config, which is the basis for which services are HA-managed and which state they should have, is empty, so no LRM/CRM leaves the idle state, as they think there is nothing to do anyway.
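
To spot this kind of mismatch, the two files can be compared. The following is only a sketch: the function names are made up for illustration, and it assumes that SIDs always look like vm:<id> or ct:<id> and that section headers in resources.cfg look like "ct: 100".

```shell
#!/bin/sh
# Sketch: find services that manager_status still tracks but that are
# missing from resources.cfg. All function names here are hypothetical.

# Print the SIDs (e.g. "ct:100") found in a manager_status JSON file.
status_sids() {
    grep -oE '"(vm|ct):[0-9]+"' "$1" | tr -d '"' | sort -u
}

# Print the SIDs defined in a resources.cfg file ("ct: 100" -> "ct:100").
configured_sids() {
    sed -nE 's/^(vm|ct):[[:space:]]*([0-9]+)[[:space:]]*$/\1:\2/p' "$1" | sort -u
}

# Print SIDs present in the status file ($1) but absent from the config ($2).
orphaned_sids() {
    cfg_list=$(configured_sids "$2")
    status_sids "$1" | while IFS= read -r sid; do
        printf '%s\n' "$cfg_list" | grep -qxF "$sid" || printf '%s\n' "$sid"
    done
}
```

Calling orphaned_sids /etc/pve/ha/manager_status /etc/pve/ha/resources.cfg on a node would then list every service that is stuck in the manager status without a matching config entry; with an empty resources.cfg that is every service in the status file.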

That may be an edge case that is rarely hit, so it was never noticed here, and it may be a real bug.

To work around this issue you can do either of two things:

Variant A: Reset state completely

Execute systemctl stop pve-ha-crm on all nodes; this stops all cluster resource managers.
On a single node, run rm /etc/pve/ha/manager_status; this resets the manager status. It can only be done while no manager is active: the file is normally just informational and is only read anew when a CRM becomes the manager, which is why all CRMs have to be stopped first.
Execute systemctl start pve-ha-crm on all nodes; this starts all cluster resource managers again.

You then have a clean state, and no service will show up as "ignored" (HA-managed without being able to delete it) any more.
You can then re-add the desired services; the HA manager will pick them up and work normally.
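
The three steps of Variant A could be scripted roughly as below. This is a sketch, not a tested procedure: the helper names and the DRY_RUN switch are invented for illustration, and it assumes root SSH access from the current node to every node by hostname. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of Variant A. Prints the commands unless DRY_RUN=0 is set.
# Assumption: root SSH access to each node by hostname.

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: reset_ha_manager_status NODE...
reset_ha_manager_status() {
    # Step 1: stop the cluster resource manager on all nodes.
    for node in "$@"; do
        run ssh "root@$node" systemctl stop pve-ha-crm || return 1
    done
    # Step 2: remove the stale manager status, once, on this node only.
    run rm /etc/pve/ha/manager_status || return 1
    # Step 3: start the cluster resource managers again.
    for node in "$@"; do
        run ssh "root@$node" systemctl start pve-ha-crm || return 1
    done
}
```

For the cluster in this thread the call would be reset_ha_manager_status Jupiter Saturne Uranus. Reviewing the dry-run output first and only then setting DRY_RUN=0 keeps the rm from running while a CRM is still active by accident.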

Variant B: Fixup resource configuration

As mentioned earlier, this glitch most probably comes from the fact that resources.cfg is empty while all services in manager_status are frozen. So we could also just add the services to the resource config again.

You could try to start with one: ha-manager add ct:100. As this does not check the manager_status but only resources.cfg (which is empty anyway), it should work just fine. Services are added with the requested state "started" (the default): if they are already running they will keep running, otherwise they will be started (you can change this by also specifying "--state stopped", for example).

If this works and unfreezes the respective service, you can re-add the rest, which can also be done more quickly:

Code:

# Re-add each previously HA-managed service to resources.cfg
for sid in ct:100 ct:101 ct:102 ct:103 ct:150 vm:200 vm:201 vm:202 vm:203 vm:204 vm:254; do
    ha-manager add "$sid"
done

In the end both variants are equivalent; it is just that if you want all of those services HA-managed again, Variant B saves you some operations. Also, if you use B, it would be nice to get some feedback, to see whether my theory about this issue can be confirmed.
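
A quick way to check the result after either variant is to filter the ha-manager status output for services that are still "ignored". This is a small sketch: ignored_services is a made-up helper name, and it assumes the service lines keep the "service <sid> (<node>, <state>)" shape seen earlier in this thread.

```shell
#!/bin/sh
# Sketch: read `ha-manager status` text on stdin and print the SID of
# every service still reported as "ignored". Helper name is hypothetical.
ignored_services() {
    awk '$1 == "service" && $NF == "ignored)" { print $2 }'
}
```

Used as ha-manager status | ignored_services, an empty result means no service is left in the ignored state.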

ApisD · Sep 30, 2019

I went with Variant B, and it unfroze the services successfully!

ha-manager add ct:100 cleared all ignored-state resources in the HA manager:

So I added the rest of the VMs/CTs, and the LRM services on all three nodes went to active status.

I re-checked resources.cfg, manager_status and ha-manager status:

Bash:

cat /etc/pve/ha/resources.cfg
ct: 100
        group HA1
        state started
ct: 101
        group HA1
        state started
ct: 102
        group HA1
        state started
ct: 103
        group HA1
        state started
ct: 150
        group HA1
        state started
vm: 200
        group HA1
        state started
vm: 201
        group HA1
        state started
vm: 202
        group HA1
        state started
vm: 203
        group HA1
        state started
vm: 204
        group HA1
        state started
vm: 254
        group HA1
        state started

Bash:

cat /etc/pve/ha/manager_status
{"timestamp":1569836633,"master_node":"Uranus","node_status":{"Jupiter":"online","Saturne":"online","Uranus":"online"},"service_status":{"vm:204":{"node":"Jupiter","state":"started","uid":"jrQNiU4nIL7M8MI/eOHKAA","running":1},"ct:102":{"uid":"Xrr118Y4EPs6Abee01COqw","running":1,"node":"Saturne","state":"started"},"vm:200":{"state":"started","node":"Uranus","running":1,"uid":"xm0nVgIGE/sfioBxwb0UZg"},"vm:201":{"uid":"j18hadzKy1b29g3MgGZT6Q","running":1,"node":"Saturne","state":"started"},"ct:103":{"node":"Saturne","state":"started","uid":"qwDQNCXifvBoyVzKo+RMdQ","running":1},"vm:254":{"uid":"A6b8n1VUgeTwS5+ZOLVPzA","running":1,"node":"Saturne","state":"started"},"vm:202":{"state":"started","node":"Uranus","running":1,"uid":"koOwsrN0yP28WYBDZlwwOg"},"ct:150":{"node":"Uranus","state":"started","uid":"4t1SEm8s25yzcIKezOX0jA","running":1},"vm:203":{"node":"Jupiter","state":"started","uid":"oLeR1b3fy4HurEaHZWQQ2w","running":1},"ct:101":{"state":"started","node":"Jupiter","running":1,"uid":"gm9M7NQqB/rSG5snwru+2g"},"ct:100":{"uid":"0jAi02yMmQo+CUWYP96buA","running":1,"node":"Jupiter","state":"started"}}}

Bash:

ha-manager status
quorum OK
master Uranus (active, Mon Sep 30 11:47:43 2019)
lrm Jupiter (active, Mon Sep 30 11:47:38 2019)
lrm Saturne (active, Mon Sep 30 11:47:39 2019)
lrm Uranus (active, Mon Sep 30 11:47:38 2019)
service ct:100 (Jupiter, started)
service ct:101 (Jupiter, started)
service ct:102 (Saturne, started)
service ct:103 (Saturne, started)
service ct:150 (Uranus, started)
service vm:200 (Uranus, started)
service vm:201 (Saturne, started)
service vm:202 (Uranus, started)
service vm:203 (Jupiter, started)
service vm:204 (Jupiter, started)
service vm:254 (Saturne, started)

It seems good
Thank you for your help !

t.lamprecht · Sep 30, 2019

ApisD said:
It seems good
Thank you for your help !

Great, thanks for reporting back! Seems like we need to handle this edge case so that the manager_status is cleared, or at least log/show more clearly what the cause could be.

t.lamprecht · Sep 30, 2019

FYI: opened https://bugzilla.proxmox.com/show_bug.cgi?id=2393 to track this.

ApisD · Oct 15, 2019

Hello,
I am coming back to you because we hit a possibly related bug on our cluster. Yesterday at about 20:05 (local time), all nodes suddenly restarted simultaneously.

You can find the syslog of each node attached.

The first error seems to occur on the third node (Uranus) at 20:05:29, with the loss of the corosync link. Then several warnings concerning the HA services show up in the logs of all three nodes. They crash at nearly the same time, around 20:06:32-37. Shortly before that, there are some errors concerning the watchdog service.

Of course, if this is not related to our prior bug, I can open a new dedicated thread.

Thank you in advance for your help.
Regards.

t.lamprecht · Oct 16, 2019

Hi,

ApisD said:
Of course, if this is not related with our prior bug, I can make a new dedicated thread.

No, I don't think it's related. Next time just open a new thread; for now let's keep it here...

From the logs:

Oct 14 20:05:29 Uranus corosync[1723]: [KNET ] link: host: 2 link: 0 is down
Oct 14 20:05:29 Uranus corosync[1723]: [KNET ] link: host: 1 link: 0 is down
Oct 14 20:05:29 Uranus corosync[1723]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 14 20:05:29 Uranus corosync[1723]: [KNET ] host: host: 1 has no active links
Oct 14 20:05:30 Uranus corosync[1723]: [TOTEM ] Token has not been received in 61 ms

So knet/corosync (the cluster communication) seems to have broken down: the links to both other hosts are down, so nothing could be sent anymore.
Then all nodes lose quorum and the HA watchdog self-fences them, which is OK in itself.
The interesting question is: why did corosync have issues communicating and thus lose quorum?
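
To line up when the links dropped on each node, the knet "link down" messages can be pulled out of the syslogs. This is a rough sketch: knet_link_down_events is a made-up name, the pattern is modelled only on the log lines quoted above, and the log path in the usage note is an assumption.

```shell
#!/bin/sh
# Sketch: print every corosync/knet "link ... is down" line from a log file.
# Usage: knet_link_down_events LOGFILE
knet_link_down_events() {
    grep -E 'corosync\[[0-9]+\]: +\[KNET *\] link: host: [0-9]+ link: [0-9]+ is down' "$1"
}
```

Running knet_link_down_events /var/log/syslog on each node and comparing the timestamps with the times of backups or other heavy jobs could hint at whether network load played a role.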

It could be the result of a known bug ( https://bugzilla.proxmox.com/show_bug.cgi?id=2326#c78 ) for which we are currently deploying a fix. It showed up especially when the network links were saturated, e.g., if a backup or another high-traffic job ran over this network at that time. You could participate in testing the fix; see the Bugzilla link for details on how to install it.
Or you can wait until we move it up the repository chain, once it has proven stable and free of regressions.
