[SOLVED] ha-crm and ha-lrm fail to load after a reboot

Oliver1 · Apr 7, 2024

Hi there,

I've searched this forum extensively and tried several suggestions, but I haven't been able to solve my issue. I'm hoping someone has an idea where to look or how to debug this.

Setup: I'm running a 2-node cluster (Nuc7 and Nuc8) with a QDevice (Pi) for a home setup with HA.

Issue: Whenever I reboot node#2 (Nuc7), it fails to load ha-crm and ha-lrm. In the dashboard the node is reported as 'old timestamp - dead?'.
I'm wondering whether it's watchdog related or something else. When I start the services manually, everything works correctly again and keeps working.
The weird thing is that a reboot on node#1 works as expected and the issue does not occur.

Running the following command on node#2:

Code:

systemctl status -n 50 pve-ha-lrm.service pve-ha-crm.service pve-cluster.service

Results in:

Code:

○ pve-ha-lrm.service - PVE Local HA Resource Manager Daemon
     Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; disabled; preset: enabled)
     Active: inactive (dead)

○ pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon
     Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; disabled; preset: enabled)
     Active: inactive (dead)

● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
     Active: active (running) since Sun 2024-04-07 12:56:06 CEST; 21min ago
    Process: 925 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 929 (pmxcfs)
      Tasks: 7 (limit: 38290)
     Memory: 44.9M
        CPU: 1.531s
     CGroup: /system.slice/pve-cluster.service
             └─929 /usr/bin/pmxcfs

Apr 07 12:56:05 pve2 systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Apr 07 12:56:05 pve2 pmxcfs[925]: [main] notice: resolved node name 'pve2' to '10.0.10.102' for default node IP address
Apr 07 12:56:05 pve2 pmxcfs[925]: [main] notice: resolved node name 'pve2' to '10.0.10.102' for default node IP address
Apr 07 12:56:05 pve2 pmxcfs[929]: [quorum] crit: quorum_initialize failed: 2
Apr 07 12:56:05 pve2 pmxcfs[929]: [quorum] crit: can't initialize service
Apr 07 12:56:05 pve2 pmxcfs[929]: [confdb] crit: cmap_initialize failed: 2
Apr 07 12:56:05 pve2 pmxcfs[929]: [confdb] crit: can't initialize service
Apr 07 12:56:05 pve2 pmxcfs[929]: [dcdb] crit: cpg_initialize failed: 2

And using

Code:

journalctl -u corosync.service

Shows the following log:

Code:

Mar 08 10:13:59 pve2 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Mar 08 10:13:59 pve2 corosync[2829]:   [MAIN  ] Corosync Cluster Engine  starting up
Mar 08 10:13:59 pve2 corosync[2829]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim >
Mar 08 10:13:59 pve2 corosync[2829]:   [TOTEM ] Initializing transport (Kronosnet).
Mar 08 10:13:59 pve2 corosync[2829]:   [TOTEM ] totemknet initialized
Mar 08 10:13:59 pve2 corosync[2829]:   [KNET  ] pmtud: MTU manually set to: 0
Mar 08 10:13:59 pve2 corosync[2829]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronos>
Mar 08 10:13:59 pve2 corosync[2829]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Mar 08 10:13:59 pve2 corosync[2829]:   [QB    ] server name: cmap
Mar 08 10:13:59 pve2 corosync[2829]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Mar 08 10:13:59 pve2 corosync[2829]:   [QB    ] server name: cfg
Mar 08 10:13:59 pve2 corosync[2829]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 >
Mar 08 10:13:59 pve2 corosync[2829]:   [QB    ] server name: cpg
Mar 08 10:13:59 pve2 corosync[2829]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Mar 08 10:13:59 pve2 corosync[2829]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Mar 08 10:13:59 pve2 corosync[2829]:   [WD    ] Watchdog not enabled by configuration
Mar 08 10:13:59 pve2 corosync[2829]:   [WD    ] resource load_15min missing a recovery key.
Mar 08 10:13:59 pve2 corosync[2829]:   [WD    ] resource memory_used missing a recovery key.

Does anyone have an idea where I could find the root cause of my problem, or where I can look further to debug?

Oliver1 · Apr 8, 2024

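It might have been as easy as re-enabling the services. The status output in my first post lists both pve-ha-lrm and pve-ha-crm as "disabled", which would explain why they did not start at boot. I re-enabled them with: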

Code:

systemctl enable pve-ha-lrm
systemctl enable pve-ha-crm

Rebooting node#2 worked correctly this time. I will keep monitoring to see whether that did the trick.
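
For anyone landing here with the same symptom: the enablement state is the field right after the unit file path on the "Loaded:" line of the status output ("disabled" in the first post). Below is a minimal sketch of a shell helper that pulls that field out for each unit. The name list_enablement is made up for illustration and is not a Proxmox or systemd tool, and the parsing assumes the status layout shown in the first post.

```shell
#!/bin/sh
# Hypothetical helper: reads `systemctl status` output on stdin and prints
# "<unit> <enablement state>" for every unit whose unit file was loaded.
# Assumes the layout from the first post:
#   ○ pve-ha-lrm.service - PVE Local HA Resource Manager Daemon
#        Loaded: loaded (/lib/.../pve-ha-lrm.service; disabled; preset: enabled)
list_enablement() {
    awk '
        # Header line: "<symbol> <unit>.service - <description>"
        $2 ~ /\.service$/ && $3 == "-" { unit = $2 }

        # "Loaded: loaded (<path>; <state>; preset: <preset>)"
        $1 == "Loaded:" && $2 == "loaded" {
            state = $4
            sub(/;$/, "", state)   # drop the trailing semicolon
            print unit, state
        }
    '
}

# Example use on a live node (not executed here):
#   systemctl status pve-ha-lrm.service pve-ha-crm.service | list_enablement
```

On a live system, `systemctl is-enabled pve-ha-lrm pve-ha-crm` reports the same state directly. The helper is mainly useful when all you have is pasted status output.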
