[SOLVED] ha-crm and ha-lrm fail to load after a reboot

Oliver1

Apr 7, 2024
Hi there,

I've searched this forum extensively and tried several suggestions, but I haven't been able to solve my issue. I'm hoping someone has a good idea of where to look or how to debug it.

Setup: I'm running a two-node cluster (Nuc7 and Nuc8) with a QDevice (Pi) as a home setup with HA.

Issue: Whenever I reboot node#2 (Nuc7), ha-crm and ha-lrm fail to start. In the dashboard the node is reported as 'old timestamp - dead?'.
I'm wondering whether it's watchdog related or something else. When I start the services manually, everything works correctly again and stays that way.
The weird thing is that a reboot of node#1 works as expected and the issue does not occur there.

Running the following command on node#2:
Code:
systemctl status -n 50 pve-ha-lrm.service pve-ha-crm.service pve-cluster.service
Results in:
Code:
○ pve-ha-lrm.service - PVE Local HA Resource Manager Daemon
     Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; disabled; preset: enabled)
     Active: inactive (dead)

○ pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon
     Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; disabled; preset: enabled)
     Active: inactive (dead)

● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
     Active: active (running) since Sun 2024-04-07 12:56:06 CEST; 21min ago
    Process: 925 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 929 (pmxcfs)
      Tasks: 7 (limit: 38290)
     Memory: 44.9M
        CPU: 1.531s
     CGroup: /system.slice/pve-cluster.service
             └─929 /usr/bin/pmxcfs

Apr 07 12:56:05 pve2 systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Apr 07 12:56:05 pve2 pmxcfs[925]: [main] notice: resolved node name 'pve2' to '10.0.10.102' for default node IP address
Apr 07 12:56:05 pve2 pmxcfs[925]: [main] notice: resolved node name 'pve2' to '10.0.10.102' for default node IP address
Apr 07 12:56:05 pve2 pmxcfs[929]: [quorum] crit: quorum_initialize failed: 2
Apr 07 12:56:05 pve2 pmxcfs[929]: [quorum] crit: can't initialize service
Apr 07 12:56:05 pve2 pmxcfs[929]: [confdb] crit: cmap_initialize failed: 2
Apr 07 12:56:05 pve2 pmxcfs[929]: [confdb] crit: can't initialize service
Apr 07 12:56:05 pve2 pmxcfs[929]: [dcdb] crit: cpg_initialize failed: 2

Running the following on the same node:
Code:
journalctl -u corosync.service
shows this log:
Code:
Mar 08 10:13:59 pve2 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Mar 08 10:13:59 pve2 corosync[2829]:   [MAIN  ] Corosync Cluster Engine  starting up
Mar 08 10:13:59 pve2 corosync[2829]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim >
Mar 08 10:13:59 pve2 corosync[2829]:   [TOTEM ] Initializing transport (Kronosnet).
Mar 08 10:13:59 pve2 corosync[2829]:   [TOTEM ] totemknet initialized
Mar 08 10:13:59 pve2 corosync[2829]:   [KNET  ] pmtud: MTU manually set to: 0
Mar 08 10:13:59 pve2 corosync[2829]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronos>
Mar 08 10:13:59 pve2 corosync[2829]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Mar 08 10:13:59 pve2 corosync[2829]:   [QB    ] server name: cmap
Mar 08 10:13:59 pve2 corosync[2829]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Mar 08 10:13:59 pve2 corosync[2829]:   [QB    ] server name: cfg
Mar 08 10:13:59 pve2 corosync[2829]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 >
Mar 08 10:13:59 pve2 corosync[2829]:   [QB    ] server name: cpg
Mar 08 10:13:59 pve2 corosync[2829]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Mar 08 10:13:59 pve2 corosync[2829]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Mar 08 10:13:59 pve2 corosync[2829]:   [WD    ] Watchdog not enabled by configuration
Mar 08 10:13:59 pve2 corosync[2829]:   [WD    ] resource load_15min missing a recovery key.
Mar 08 10:13:59 pve2 corosync[2829]:   [WD    ] resource memory_used missing a recovery key.

Does anyone have an idea where I could find the root cause of my problem, or where I can look further to debug it?
 
It may have been as simple as re-enabling the services, which had ended up disabled:
Code:
systemctl enable pve-ha-lrm
systemctl enable pve-ha-crm

Rebooting node#2 worked correctly this time. I will keep monitoring to see whether that did the trick.
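
For anyone landing here with the same symptom: the clue sits on the "Loaded:" line of the status output in the first post, where the second field for both HA units reads "disabled". Below is a minimal sketch for pulling that field out of saved `systemctl status` output. The helper name `enablement_state` is made up for illustration and is not a Proxmox or systemd tool.

```shell
#!/bin/sh
# Hypothetical helper: read `systemctl status` output on stdin and print
# the enablement state found on each "Loaded:" line, one per line.
enablement_state() {
    sed -n 's/^ *Loaded: loaded ([^;]*; \([^;)]*\).*/\1/p'
}

# Example using the pve-ha-lrm line from the status output in this thread.
printf '%s\n' \
    '     Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; disabled; preset: enabled)' \
    | enablement_state
# prints: disabled
```

On a live node, `systemctl is-enabled pve-ha-lrm.service pve-ha-crm.service` should report the same state directly, without any parsing.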
 
