new node pve-ha-lrm stuck on wait_for_agent_lock

alexskysilk · Sep 26, 2017

I added a new node to a cluster the way I normally would: made sure that all IPs are pingable, that the cluster hosts file is present, etc. The node joined without errors and shows up normally in pvecm status.

HOWEVER, it refuses to join the HA CRM/LRM, and I can't figure out why.

Code:

# systemctl status pve-ha-lrm
● pve-ha-lrm.service - PVE Local HA Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled)
   Active: active (running) since Mon 2017-09-25 17:10:08 PDT; 5min ago
  Process: 21362 ExecStop=/usr/sbin/pve-ha-lrm stop (code=exited, status=0/SUCCESS)
  Process: 21366 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
 Main PID: 21369 (pve-ha-lrm)
   CGroup: /system.slice/pve-ha-lrm.service
           └─21369 pve-ha-lrm

Sep 25 17:10:08 sky25 pve-ha-lrm[21369]: starting server
Sep 25 17:10:08 sky25 pve-ha-lrm[21369]: status change startup => wait_for_agent_lock
Sep 25 17:10:08 sky25 systemd[1]: Started PVE Local HA Ressource Manager Daemon.

Code:

# systemctl status pve-ha-crm
● pve-ha-crm.service - PVE Cluster Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; enabled)
   Active: active (running) since Mon 2017-09-25 17:10:15 PDT; 5min ago
  Process: 21378 ExecStop=/usr/sbin/pve-ha-crm stop (code=exited, status=0/SUCCESS)
  Process: 21439 ExecStart=/usr/sbin/pve-ha-crm start (code=exited, status=0/SUCCESS)
 Main PID: 21441 (pve-ha-crm)
   CGroup: /system.slice/pve-ha-crm.service
           └─21441 pve-ha-crm

Sep 25 17:10:15 sky25 pve-ha-crm[21441]: starting server
Sep 25 17:10:15 sky25 pve-ha-crm[21441]: status change startup => wait_for_quorum
Sep 25 17:10:15 sky25 systemd[1]: Started PVE Cluster Ressource Manager Daemon.
Sep 25 17:10:20 sky25 pve-ha-crm[21441]: status change wait_for_quorum => slave

Code:

# pvecm status
Quorum information
------------------
Date:             Mon Sep 25 17:16:23 2017
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000005
Ring ID:          1/12912
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.18.20.20
0x00000002          1 10.18.20.21
0x00000003          1 10.18.20.22
0x00000004          1 10.18.20.23
0x00000005          1 10.18.20.25 (local)

Code:

# ha-manager status
quorum OK
master sky21 (active, Mon Sep 25 17:16:57 2017)
lrm sky20 (active, Mon Sep 25 17:17:04 2017)
lrm sky21 (active, Mon Sep 25 17:16:55 2017)
lrm sky22 (active, Mon Sep 25 17:16:56 2017)
lrm sky23 (active, Mon Sep 25 17:16:56 2017)

What is going on?
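
For what it's worth, the gap can be spotted mechanically. A minimal shell sketch, assuming the `lrm <node> (...)` line format shown in the ha-manager status output above; `lrm_nodes` and `node_has_lrm` are hypothetical helper names, not part of ha-manager:

```shell
# Print the node name from every "lrm <node> (...)" line read on stdin.
lrm_nodes() {
    awk '$1 == "lrm" { print $2 }'
}

# Succeed if node $1 has an lrm line in the status text read on stdin.
# -F matches the name literally, -x requires a whole-line match.
node_has_lrm() {
    lrm_nodes | grep -qxF -- "$1"
}
```

Usage would be something like `ha-manager status | node_has_lrm sky25 || echo "sky25 has no lrm entry"`.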

dietmar · Sep 26, 2017

I guess there are simply no resources scheduled to run on sky25?
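
A quick way to check this, as far as I know: the ha-manager documentation describes wait_for_agent_lock as also being the LRM's idle state when no service is configured for it. The sketch below lists the HA services placed on a given node. It assumes service lines in ha-manager status look like `service vm:100 (sky20, started)`, a format that is not shown in this thread, and `services_on_node` is a hypothetical helper name:

```shell
# Print the HA service IDs placed on node $1, reading `ha-manager status`
# text on stdin. ASSUMPTION: service lines have the form
#   service vm:100 (sky20, started)
# which does not appear in the output posted in this thread.
services_on_node() {
    awk -v node="$1" '$1 == "service" && $3 == ("(" node ",") { print $2 }'
}
```

Running `ha-manager status | services_on_node sky25` should print nothing if no resources are placed on sky25.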

alexskysilk · Sep 26, 2017

dietmar said:
I guess there are simply no resources scheduled to run on sky25?

Correct. When I try to add resources to the node, it throws an error 500, resource does not exist.

alexskysilk · Sep 26, 2017

After deciding this dead horse didn't need any further beating, I evicted the node. Except now the node refuses to leave the GUI: it's gone from the cluster (verified via pvecm status), it's no longer in /etc/pve/nodes or /etc/pve/.members, but it's still stubbornly showing as a red node in the GUI.

It's mocking me.

How do I make it go away?!

