Node stuck in "idle" HA-state - lost lock ha_agent_proxmox-01_lock

omgzwhitepeople

I have a 4-host cluster running PVE 9.1.1.
I wanted to test HA, so I disconnected the network port of one of the nodes with VMs on it. They were all restarted on other hosts. I waited 10 minutes, then plugged the host back in, but the VMs never failed back to the node. In the datacenter HA status, the host showed "idle".
Looking at the logs, I see this:
Code:
# sudo journalctl -u pve-cluster -n 30 --no-pager
May 07 16:31:22 proxmox-01.home.internal pmxcfs[1037]: [status] notice: members: 1/1037
May 07 16:31:22 proxmox-01.home.internal pmxcfs[1037]: [status] notice: all data is up to date
May 07 16:35:39 proxmox-01.home.internal pmxcfs[1037]: [dcdb] notice: members: 1/1037, 2/1015, 3/1015, 4/1015
May 07 16:35:39 proxmox-01.home.internal pmxcfs[1037]: [dcdb] notice: starting data syncronisation
May 07 16:35:39 proxmox-01.home.internal pmxcfs[1037]: [dcdb] notice: cpg_send_message retried 1 times
May 07 16:35:39 proxmox-01.home.internal pmxcfs[1037]: [status] notice: node has quorum
May 07 16:35:39 proxmox-01.home.internal pmxcfs[1037]: [status] notice: members: 1/1037, 2/1015, 3/1015, 4/1015
May 07 16:35:39 proxmox-01.home.internal pmxcfs[1037]: [status] notice: starting data syncronisation
May 07 16:35:39 proxmox-01.home.internal pmxcfs[1037]: [dcdb] notice: received sync request (epoch 1/1037/00000002)
May 07 16:35:39 proxmox-01.home.internal pmxcfs[1037]: [status] notice: received sync request (epoch 1/1037/00000002)
May 07 16:35:39 proxmox-01.home.internal pmxcfs[1037]: [dcdb] notice: received all states
May 07 16:35:39 proxmox-01.home.internal pmxcfs[1037]: [dcdb] notice: leader is 2/1015
May 07 16:35:39 proxmox-01.home.internal pmxcfs[1037]: [dcdb] notice: synced members: 2/1015, 3/1015, 4/1015
May 07 16:35:39 proxmox-01.home.internal pmxcfs[1037]: [dcdb] notice: waiting for updates from leader
May 07 16:35:39 proxmox-01.home.internal pmxcfs[1037]: [dcdb] notice: dfsm_deliver_queue: queue length 4
May 07 16:35:39 proxmox-01.home.internal pmxcfs[1037]: [status] notice: received all states
May 07 16:35:39 proxmox-01.home.internal pmxcfs[1037]: [status] notice: all data is up to date
May 07 16:35:39 proxmox-01.home.internal pmxcfs[1037]: [status] notice: dfsm_deliver_queue: queue length 16
May 07 16:35:39 proxmox-01.home.internal pmxcfs[1037]: [dcdb] notice: update complete - trying to commit (got 14 inode updates)
May 07 16:35:39 proxmox-01.home.internal pmxcfs[1037]: [dcdb] notice: all data is up to date
May 07 16:35:39 proxmox-01.home.internal pmxcfs[1037]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 4
May 07 16:45:51 proxmox-01.home.internal pmxcfs[1037]: [ipcs] crit: connection from bad user 1000! - rejected
May 07 16:45:51 proxmox-01.home.internal pmxcfs[1037]: [libqb] error: Error in connection setup (/dev/shm/qb-1037-4149-33-qad5On/qb): Unknown error -1 (-1)
May 07 16:45:51 proxmox-01.home.internal pmxcfs[1037]: [ipcs] crit: connection from bad user 1000! - rejected
May 07 16:45:51 proxmox-01.home.internal pmxcfs[1037]: [libqb] error: Error in connection setup (/dev/shm/qb-1037-4149-33-6oCuYD/qb): Unknown error -1 (-1)
May 07 16:45:51 proxmox-01.home.internal pmxcfs[1037]: [ipcs] crit: connection from bad user 1000! - rejected
May 07 16:45:51 proxmox-01.home.internal pmxcfs[1037]: [libqb] error: Error in connection setup (/dev/shm/qb-1037-4149-33-4brPiC/qb): Unknown error -1 (-1)
May 07 16:45:51 proxmox-01.home.internal pmxcfs[1037]: [ipcs] crit: connection from bad user 1000! - rejected
May 07 16:45:51 proxmox-01.home.internal pmxcfs[1037]: [libqb] error: Error in connection setup (/dev/shm/qb-1037-4149-33-DcC7Fn/qb): Unknown error -1 (-1)
May 07 16:47:43 proxmox-01.home.internal pmxcfs[1037]: [status] notice: received log
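
Side note: I'm fairly sure the "connection from bad user 1000!" lines are unrelated to HA. pmxcfs only accepts IPC connections from privileged users, so these just mean a PVE CLI tool was run as my regular uid-1000 account. Something like this should reproduce it (assumption: "me" is the uid-1000 user on this host):
Code:
# pmxcfs rejects IPC from unprivileged users, so any PVE tool run
# without root privileges triggers the "bad user 1000" log entries
sudo -u me pvecm status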

Code:
# systemctl status pve-ha-lrm
● pve-ha-lrm.service - PVE Local HA Resource Manager Daemon
     Loaded: loaded (/usr/lib/systemd/system/pve-ha-lrm.service; enabled; preset: enabled)
     Active: active (running) since Thu 2026-05-07 16:32:41 EDT; 21min ago
 Invocation: dc0b44bfaf094f719e6707a66cbd8de5
   Main PID: 1509 (pve-ha-lrm)
      Tasks: 1 (limit: 17735)
     Memory: 114.6M (peak: 133.5M)
        CPU: 1.059s
     CGroup: /system.slice/pve-ha-lrm.service
             └─1509 pve-ha-lrm

May 07 16:32:40 proxmox-01.home.internal systemd[1]: Starting pve-ha-lrm.service - PVE Local HA Resource Manager Daemon...
May 07 16:32:41 proxmox-01.home.internal pve-ha-lrm[1509]: starting server
May 07 16:32:41 proxmox-01.home.internal pve-ha-lrm[1509]: status change startup => wait_for_agent_lock
May 07 16:32:41 proxmox-01.home.internal systemd[1]: Started pve-ha-lrm.service - PVE Local HA Resource Manager Daemon.
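
If I understand it right, the LRM sits in "wait_for_agent_lock" until it owns at least one HA service and acquires its agent lock (the ha_agent_proxmox-01_lock from the title) inside pmxcfs. The lock domains show up as directories in the clustered config filesystem:
Code:
# HA locks live in the pmxcfs-backed config tree; contents will vary
ls /etc/pve/priv/lock/
# e.g. ha_agent_proxmox-01_lock  ha_agent_proxmox-02_lock  ha_manager_lock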



Quorum and HA status look fine:


Code:
root@proxmox-01:~# pvecm status
ha-manager status
Cluster information
-------------------
Name:             home-01
Config Version:   5
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed May  6 19:47:54 2026
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1.15f
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.50.51 (local)
0x00000002          1 192.168.50.52
0x00000003          1 192.168.50.53
0x00000004          1 192.168.50.54
quorum OK
master proxmox-02 (active, Wed May  6 19:47:47 2026)
lrm proxmox-01 (active, Wed May  6 19:47:50 2026)
lrm proxmox-02 (idle, Wed May  6 19:47:53 2026)
lrm proxmox-03 (idle, Wed May  6 19:47:53 2026)
lrm proxmox-04 (idle, Wed May  6 19:47:51 2026)
service vm:100 (proxmox-01, stopped)
root@proxmox-01:~# qm list
      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID
       100 template-ubuntu-24-04 stopped    8192              50.00 0
       101 kube-01              running    8192              50.00 1678
       200 ucs-01               running    4096               0.00 1903
root@proxmox-01:~# ha-manager status
quorum OK
master proxmox-02 (active, Wed May  6 19:50:57 2026)
lrm proxmox-01 (active, Wed May  6 19:51:00 2026)
lrm proxmox-02 (active, Wed May  6 19:50:58 2026)
lrm proxmox-03 (active, Wed May  6 19:50:58 2026)
lrm proxmox-04 (active, Wed May  6 19:51:02 2026)
service vm:100 (proxmox-01, stopped)
service vm:101 (proxmox-01, started)
service vm:102 (proxmox-02, started)
service vm:103 (proxmox-03, started)
service vm:104 (proxmox-04, started)
service vm:200 (proxmox-01, started)
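
For a deeper look, ha-manager can also dump the raw manager state:
Code:
# --verbose prints the full manager status, including each node's LRM
# mode and the per-service state machine, not just the summary above
ha-manager status --verbose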

1. Why is the host stuck in idle?
2. How can I get the host back to active?


UPDATE: So this is by design. I thought that VMs with failback turned on would migrate back to the recovered host when it came back online. Apparently this is not the case; you have to manually migrate the VMs back. I think I can create HA groups (node affinity rules in PVE 9) and set node priorities, which is where failback comes into play, but without them it is never used. Once at least one VM was migrated back, the host became active again.
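
For reference, here is roughly what I think that would look like on PVE 9, where node affinity rules replaced HA groups (a sketch only; the rule name and priorities are made up, and I haven't verified the exact syntax against my cluster):
Code:
# /etc/pve/ha/rules.cfg (sketch) -- prefer proxmox-01 for these VMs,
# with proxmox-02 as a lower-priority fallback
node-affinity: keep-on-proxmox-01
        resources vm:101,vm:200
        nodes proxmox-01:2,proxmox-02:1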
 
Right, the failback flag only takes effect if an HA resource is covered by a node affinity rule (formerly, HA groups) and is currently running on a lower-priority node. As soon as the higher-priority node comes back online, the HA resource will automatically move back to that node. Otherwise, it stays on the recovery node, since the affinity rules don't express any preference.
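
As a sketch (using vm:101 from this thread), the failback flag sits on the resource itself, so combined with a rule like the one above it would look something like this:
Code:
# /etc/pve/ha/resources.cfg (sketch) -- with failback enabled, vm:101
# returns to the preferred node from its affinity rule once that node
# is back online
vm: 101
        state started
        failback 1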