new node pve-ha-lrm stuck on wait_for_agent_lock

alexskysilk

I added a new node to a cluster the way I normally do: I made sure that all IPs are pingable, that the cluster hosts file is present, and so on. The node joined without errors and shows up as expected in pvecm status.

However, it refuses to join the HA CRM/LRM, and I can't figure out why.

Code:
# systemctl status pve-ha-lrm
● pve-ha-lrm.service - PVE Local HA Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled)
   Active: active (running) since Mon 2017-09-25 17:10:08 PDT; 5min ago
  Process: 21362 ExecStop=/usr/sbin/pve-ha-lrm stop (code=exited, status=0/SUCCESS)
  Process: 21366 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
 Main PID: 21369 (pve-ha-lrm)
   CGroup: /system.slice/pve-ha-lrm.service
           └─21369 pve-ha-lrm

Sep 25 17:10:08 sky25 pve-ha-lrm[21369]: starting server
Sep 25 17:10:08 sky25 pve-ha-lrm[21369]: status change startup => wait_for_agent_lock
Sep 25 17:10:08 sky25 systemd[1]: Started PVE Local HA Ressource Manager Daemon.

Code:
# systemctl status pve-ha-crm
● pve-ha-crm.service - PVE Cluster Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; enabled)
   Active: active (running) since Mon 2017-09-25 17:10:15 PDT; 5min ago
  Process: 21378 ExecStop=/usr/sbin/pve-ha-crm stop (code=exited, status=0/SUCCESS)
  Process: 21439 ExecStart=/usr/sbin/pve-ha-crm start (code=exited, status=0/SUCCESS)
 Main PID: 21441 (pve-ha-crm)
   CGroup: /system.slice/pve-ha-crm.service
           └─21441 pve-ha-crm

Sep 25 17:10:15 sky25 pve-ha-crm[21441]: starting server
Sep 25 17:10:15 sky25 pve-ha-crm[21441]: status change startup => wait_for_quorum
Sep 25 17:10:15 sky25 systemd[1]: Started PVE Cluster Ressource Manager Daemon.
Sep 25 17:10:20 sky25 pve-ha-crm[21441]: status change wait_for_quorum => slave

Code:
# pvecm status
Quorum information
------------------
Date:             Mon Sep 25 17:16:23 2017
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000005
Ring ID:          1/12912
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.18.20.20
0x00000002          1 10.18.20.21
0x00000003          1 10.18.20.22
0x00000004          1 10.18.20.23
0x00000005          1 10.18.20.25 (local)

Code:
# ha-manager status
quorum OK
master sky21 (active, Mon Sep 25 17:16:57 2017)
lrm sky20 (active, Mon Sep 25 17:17:04 2017)
lrm sky21 (active, Mon Sep 25 17:16:55 2017)
lrm sky22 (active, Mon Sep 25 17:16:56 2017)
lrm sky23 (active, Mon Sep 25 17:16:56 2017)
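
To make the mismatch explicit, here is a rough Python sketch that parses the `ha-manager status` text above and reports which expected nodes have no `lrm` line. The function names (`lrm_nodes`, `missing_from_ha`) and the `expected` list are assumptions made for illustration; in particular, mapping the five cluster members to the hostnames sky20 through sky23 plus sky25 is inferred from the output in this post, not read from the cluster.

```python
import re

# Output of `ha-manager status` as pasted above.
HA_STATUS = """\
quorum OK
master sky21 (active, Mon Sep 25 17:16:57 2017)
lrm sky20 (active, Mon Sep 25 17:17:04 2017)
lrm sky21 (active, Mon Sep 25 17:16:55 2017)
lrm sky22 (active, Mon Sep 25 17:16:56 2017)
lrm sky23 (active, Mon Sep 25 17:16:56 2017)
"""


def lrm_nodes(status_text):
    """Return {node: state} for every 'lrm' line in the status text."""
    pattern = re.compile(r"^lrm\s+(\S+)\s+\((\w+),", re.MULTILINE)
    return dict(pattern.findall(status_text))


def missing_from_ha(expected_nodes, status_text):
    """Return the expected nodes that have no 'lrm' entry, sorted by name."""
    seen = lrm_nodes(status_text)
    return sorted(node for node in expected_nodes if node not in seen)


# Assumed hostnames for the five cluster members (hypothetical mapping).
expected = ["sky20", "sky21", "sky22", "sky23", "sky25"]
print(missing_from_ha(expected, HA_STATUS))  # prints ['sky25']
```

With the output above, the only node without an LRM entry is the newly added sky25, which matches what the service log on that node shows.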

What is going on?
 
After deciding this dead horse didn't need any further beating, I evicted the node. Except now the node refuses to leave the GUI: it's gone from the cluster (verified via pvecm status), it's no longer in /etc/pve/nodes or /etc/pve/.members, but it's still stubbornly showing as a red node in the GUI.

It's mocking me.

How do I make it go away?!
 
