Node stuck in wait_for_agent_lock

GodZone

Well-Known Member
I had an update on node2 fail and now HA is pretty much unusable. I have read as much in the forums as I can find, but nothing I have tried seems to work. In my attempts to fix things, all of the nodes have been rebooted, which is very disruptive with no HA. Currently I have:

Code:
ha-manager status
quorum OK
master agree-90 (idle, Tue Feb 23 14:17:38 2021)
lrm agree-90 (active, Tue Feb 23 16:15:35 2021)
lrm agree-91 (active, Tue Feb 23 16:15:35 2021)
lrm agree-92 (wait_for_agent_lock, Tue Feb 23 16:15:39 2021)
service vm:100 (agree-90, deleting)
service vm:101 (agree-90, deleting)
service vm:102 (agree-91, deleting)
...
...

I have attempted to remove all of the VMs from HA and they all say 'deleting' but remain on the screen.
node2 currently has no running VMs but I need to get HA back up again.


Code:
Cluster information
-------------------
Name: godzone
Config Version: 5
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Tue Feb 23 16:14:14 2021
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000003
Ring ID: 1.100
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Assistance would be very gratefully received.
 
Hi,
I had an update on node2 fail
What failed exactly? Do you have a task log (or else the logs from /var/log/apt/history.log and/or /var/log/apt/term.log)?
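If you still have them, the tail of those logs usually shows where the upgrade stopped, for example:

Bash:
tail -n 50 /var/log/apt/history.log
tail -n 100 /var/log/apt/term.log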

I have attempted to remove all of the VMs from HA and they all say 'deleting' but remain on the screen.
What was the status of the LRM and the services before that?

Can you please post the output of the following commands in [CODE]<output here>[/CODE] tags:

Bash:
systemctl status pve-ha-lrm
cat /etc/pve/ha/manager_status
cat /etc/pve/nodes/agree-92/lrm_status
 
apt dist-upgrade got stuck at 97% on pve-ha-manager, so I killed the upgrade. I subsequently tried to update one of the other nodes and it did exactly the same thing; I managed to kill that upgrade by killing the dpkg processes, so it did carry on and update the remainder of the packages that it could.

Code:
Setting up pve-firewall (4.1-3) ...
Setting up pve-container (3.3-3) ...
Setting up pve-ha-manager (3.1-1) ...
dpkg: error processing package pve-ha-manager (--configure):
installed pve-ha-manager package post-installation script subprocess was killed by signal (Terminated)
dpkg: dependency problems prevent configuration of pve-manager:
pve-manager depends on pve-ha-manager; however:
  Package pve-ha-manager is not configured yet.
dpkg: error processing package pve-manager (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of qemu-server:
qemu-server depends on pve-ha-manager (>= 3.0-9); however:
  Package pve-ha-manager is not configured yet.
dpkg: error processing package qemu-server (--configure):
dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of proxmox-ve:
proxmox-ve depends on pve-manager; however:
  Package pve-manager is not configured yet.
proxmox-ve depends on qemu-server; however:
  Package qemu-server is not configured yet.
dpkg: error processing package proxmox-ve (--configure):
dependency problems - leaving unconfigured
Processing triggers for dbus (1.12.20-0+deb10u1) ...
Processing triggers for mime-support (3.62) ...
Processing triggers for initramfs-tools (0.133+deb10u1) ...

Code:
root@agree-92:/var/log/apt# systemctl status pve-ha-lrm
● pve-ha-lrm.service - PVE Local HA Resource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2021-02-23 15:30:32 NZDT; 4h 9min ago
  Process: 2503 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
Main PID: 2507 (pve-ha-lrm)
    Tasks: 1 (limit: 4915)
   Memory: 92.6M
   CGroup: /system.slice/pve-ha-lrm.service
           └─2507 pve-ha-lrm
Feb 23 15:30:31 agree-92 systemd[1]: Starting PVE Local HA Resource Manager Daemon...
Feb 23 15:30:32 agree-92 pve-ha-lrm[2507]: starting server
Feb 23 15:30:32 agree-92 pve-ha-lrm[2507]: status change startup => wait_for_agent_lock
Feb 23 15:30:32 agree-92 systemd[1]: Started PVE Local HA Resource Manager Daemon.

And /etc/pve/ha/manager_status:

JSON:
{
  "node_status": {
    "agree-91": "online",
    "agree-92": "fence",
    "agree-90": "online"
  },
  "master_node": "agree-90",
  "timestamp": 1614043058,
  "service_status": {
    "vm:130": {
      "state": "started",
      "running": 1,
      "uid": "aKGfAAviaxC7Aqgpe4Tucw",
      "node": "agree-91"
    },
    "vm:102": {
      "node": "agree-91",
      "uid": "0e6klasbh4jDK5p6TiVP9g",
      "running": 1,
      "state": "started"
    },
    "vm:118": {
      "uid": "mfqwIjX4LQBeQhk0oDMFRw",
      "node": "agree-92",
      "state": "fence"
    },
    "vm:129": {
      "node": "agree-90",
      "uid": "e4ptDYI5mUE7lSgZMw1/bw",
      "state": "started",
      "running": 1
    },
    "vm:109": {
      "node": "agree-92",
      "uid": "SimuKxxu6CYnq4iWrwCWGA",
      "state": "request_stop"
    },
    "vm:116": {
      "node": "agree-91",
      "uid": "g9LK6Wv7y8B4tDSLSee3SQ",
      "running": 1,
      "state": "started"
    },
    "vm:113": {
      "uid": "2qMr4UGkZYlOuF65K8af+A",
      "node": "agree-91",
      "state": "started",
      "running": 1
    },
    "vm:117": {
      "running": 1,
      "state": "started",
      "node": "agree-91",
      "uid": "iu6/2M8mVqW+VPxOdPQOJQ"
    },
    "vm:145": {
      "state": "fence",
      "uid": "oSHeFqgXPc1PAkzLnZTrnQ",
      "node": "agree-92"
    },
    "vm:115": {
      "uid": "USf/0gxfpOmk1S43GuxqrQ",
      "node": "agree-91",
      "running": 1,
      "state": "started"
    },
    "vm:139": {
      "state": "started",
      "running": 1,
      "uid": "UWR3OQvCoNcneSe0Z3uDHA",
      "node": "agree-90"
    },
    "vm:100": {
      "uid": "V/MvQ39eyRu4p7Ta+o2v/w",
      "node": "agree-90",
      "running": 1,
      "state": "started"
    },
    "vm:120": {
      "node": "agree-90",
      "uid": "ZMlJRQ9x5StxNn0EGfwFcQ",
      "running": 1,
      "state": "started"
    },
    "vm:110": {
      "state": "started",
      "running": 1,
      "node": "agree-91",
      "uid": "R9CWhsPzmx2i/O8nJkKFoQ"
    },
    "vm:105": {
      "state": "started",
      "running": 1,
      "uid": "3pVpE5ZC7vWWzl/+JqtDBA",
      "node": "agree-91"
    },
    "vm:106": {
      "uid": "h484P4BZ+lbnJzyenzng9Q",
      "node": "agree-92",
      "state": "fence"
    },
    "vm:123": {
      "node": "agree-90",
      "uid": "TBZJDXc/pPBXmG6ZdkkVAg",
      "state": "started",
      "running": 1
    },
    "vm:103": {
      "running": 1,
      "state": "started",
      "node": "agree-91",
      "uid": "76w6334km3jxq2/QrIskRQ"
    },
    "vm:126": {
      "running": 1,
      "state": "started",
      "uid": "s6+CP05FT10bFrzF9L8G4Q",
      "node": "agree-91"
    },
    "vm:127": {
      "running": 1,
      "state": "started",
      "node": "agree-90",
      "uid": "NRvrNckUCvIYrGHgV7lgHQ"
    },
    "vm:107": {
      "state": "started",
      "running": 1,
      "node": "agree-90",
      "uid": "IjxIILxq2ZwmqgPYFbP5PA"
    },
    "vm:112": {
      "running": 1,
      "state": "started",
      "node": "agree-91",
      "uid": "LDCKEbt6V4kLYAFsBjVbeQ"
    },
    "vm:108": {
      "running": 1,
      "state": "started",
      "uid": "LeLH7vegWtLmN2XYXhaSGQ",
      "node": "agree-91"
    },
    "vm:128": {
      "running": 1,
      "state": "started",
      "uid": "40vj7PXkDxNNirOVzlOOVA",
      "node": "agree-91"
    },
    "vm:119": {
      "state": "started",
      "running": 1,
      "uid": "1+915hs7LOs+XC1NBDjxgQ",
      "node": "agree-91"
    },
    "vm:142": {
      "node": "agree-92",
      "uid": "7YjFoGDkIYCgvX9VYS6cgw",
      "state": "fence"
    },
    "vm:131": {
      "node": "agree-90",
      "uid": "+xtZMobwq3U2Qm7PZZ5Chw",
      "running": 1,
      "state": "started"
    },
    "vm:114": {
      "running": 1,
      "state": "started",
      "uid": "HU4PynaH70gZXTO0j2qgfg",
      "node": "agree-90"
    },
    "vm:101": {
      "node": "agree-90",
      "uid": "CZXFD/Qm9vC97DEvhWxpdg",
      "state": "started",
      "running": 1
    },
    "vm:121": {
      "state": "started",
      "running": 1,
      "uid": "B0Xs4LzApfQEGaEOxdWorg",
      "node": "agree-90"
    },
    "vm:111": {
      "node": "agree-90",
      "uid": "AgPqBjgipfD9CtVtyElqCA",
      "running": 1,
      "state": "started"
    },
    "vm:134": {
      "state": "started",
      "running": 1,
      "uid": "j0VaUTNn6IHu10UYt9l1rg",
      "node": "agree-90"
    },
    "vm:124": {
      "state": "started",
      "running": 1,
      "uid": "fw1MyR7Lhovuei/RQtmp7g",
      "node": "agree-90"
    }
  }
}

Code:
root@agree-92:/var/log/apt# cat /etc/pve/nodes/agree-92/lrm_status
{"results":{},"timestamp":1614062486,"mode":"active","state":"wait_for_agent_lock"}
 
installed pve-ha-manager package post-installation script subprocess was killed by signal (Terminated)
This does not seem to be an error from the update itself, but an externally triggered one.
Did you hit CTRL+C, abort the update task in any other way, or was this an out-of-memory situation?
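If you are not sure, the kernel log from around that time should show whether the OOM killer was involved, for example:

Bash:
journalctl -k | grep -iE 'out of memory|oom'
(add -b -1 if the node has been rebooted since)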

First, try whether simply finishing the configuration of the update does the trick:
Bash:
dpkg --configure -a

Due to that, your node was going to be fenced, but it seems your manual deletion of the resources may have changed things a bit. I can give directions to reset the HA state, but I'd first like to know that the broken update gets finished.
 
I did my best to complete the update using the command you indicated.
Code:
root@agree-92:/var/log/apt# dpkg --configure -a
root@agree-92:/var/log/apt# apt update
Hit:1 http://security.debian.org/debian-security buster/updates InRelease
Hit:2 http://ftp.nz.debian.org/debian buster InRelease
Get:3 http://ftp.nz.debian.org/debian buster-updates InRelease [51.9 kB]
Hit:4 http://download.proxmox.com/debian/pve buster InRelease
Fetched 51.9 kB in 2s (25.8 kB/s)
Reading package lists... Done
Building dependency tree
Reading state information... Done
All packages are up to date.

I think all my efforts to fix things just made them worse.
 
The external termination was me killing the dpkg process that had hung. I spent quite a long time trying to determine why the process was hung. It also hung in exactly the same way on agree-90. I haven't yet tried agree-91 as I can't easily migrate everything off it.
 
The external termination was me killing the dpkg process that had hung. I spent quite a long time trying to determine why the process was hung.
Hmm, did you check the syslog from around that time for any error messages?
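For example, something like this, with the time window adjusted to roughly when the upgrade hung:

Bash:
journalctl --since "2021-02-23 13:00" --until "2021-02-23 15:00" -p err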

I did my best to complete the update using the command you indicated.
The above seems OK; you can do a final:
Bash:
apt install -f

For clearing that bogus HA state, stop the CRM service on all nodes, either via the web interface (Node -> System) or by executing:
Bash:
systemctl stop pve-ha-crm.service
(generally it's good to stop the current master last)
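If unsure which node currently holds the master role, it shows up in the HA status, e.g.:

Bash:
ha-manager status | grep master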

Once it has been stopped on all nodes, run on a single node:
Bash:
rm -f /etc/pve/ha/manager_status

Then start all CRMs again, via the web interface or again using:
Bash:
systemctl start pve-ha-crm.service
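Afterwards one of the CRMs should take over as the new master and rebuild the state; you can check with, for example:

Bash:
systemctl status pve-ha-crm.service
ha-manager status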
 
