Hi there,
I've searched this forum extensively and tried several suggestions, but I haven't been able to solve my issue. I'm hoping someone has an idea where to look or how to debug this.
Setup: I'm running a 2-node cluster (Nuc7 and Nuc8) with a QDevice (Pi) as a home setup with HA.
Issue: Whenever I reboot node#2 (Nuc7), the pve-ha-crm and pve-ha-lrm services do not start. In the dashboard the node is reported as 'old timestamp - dead?'.
I'm wondering if it's watchdog related or something else. When I start the services manually, everything works correctly again and stays that way.
The weird thing is that rebooting node#1 works as expected; the issue does not occur there.
Running the following command on node#2:
Code:
systemctl status -n 50 pve-ha-lrm.service pve-ha-crm.service pve-cluster.service
Results in:
Code:
○ pve-ha-lrm.service - PVE Local HA Resource Manager Daemon
Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; disabled; preset: enabled)
Active: inactive (dead)
○ pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon
Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; disabled; preset: enabled)
Active: inactive (dead)
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
Active: active (running) since Sun 2024-04-07 12:56:06 CEST; 21min ago
Process: 925 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 929 (pmxcfs)
Tasks: 7 (limit: 38290)
Memory: 44.9M
CPU: 1.531s
CGroup: /system.slice/pve-cluster.service
└─929 /usr/bin/pmxcfs
Apr 07 12:56:05 pve2 systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Apr 07 12:56:05 pve2 pmxcfs[925]: [main] notice: resolved node name 'pve2' to '10.0.10.102' for default node IP address
Apr 07 12:56:05 pve2 pmxcfs[925]: [main] notice: resolved node name 'pve2' to '10.0.10.102' for default node IP address
Apr 07 12:56:05 pve2 pmxcfs[929]: [quorum] crit: quorum_initialize failed: 2
Apr 07 12:56:05 pve2 pmxcfs[929]: [quorum] crit: can't initialize service
Apr 07 12:56:05 pve2 pmxcfs[929]: [confdb] crit: cmap_initialize failed: 2
Apr 07 12:56:05 pve2 pmxcfs[929]: [confdb] crit: can't initialize service
Apr 07 12:56:05 pve2 pmxcfs[929]: [dcdb] crit: cpg_initialize failed: 2
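One detail in the output above: the Loaded: lines for pve-ha-lrm and pve-ha-crm say "disabled", while pve-cluster says "enabled". I don't know yet whether that explains the behaviour after a reboot. Below is a minimal sketch for pulling that state out of saved `systemctl status` text. The function name `unit_enablement` is my own and not part of any Proxmox tooling, and the sample input is copied from the output above. On a live node, `systemctl is-enabled pve-ha-lrm.service pve-ha-crm.service` should report the same state directly.

```shell
#!/bin/sh
# Print "<unit> <enablement>" for each unit found in `systemctl status` text on stdin.
# unit_enablement is a hypothetical helper name, used only for illustration.
unit_enablement() {
    awk '
        # Header line: status marker, unit name, dash, description.
        $2 ~ /\.service$/ && $3 == "-" { unit = $2 }
        # Loaded line: "Loaded: loaded (/path/unit.service; disabled; preset: enabled)"
        $1 == "Loaded:" && unit != "" {
            n = split($0, parts, ";")
            if (n >= 2) {
                state = parts[2]
                gsub(/^[ \t]+|[ \t]+$/, "", state)   # trim surrounding whitespace
                print unit, state
            }
            unit = ""
        }
    '
}

# Sample input copied from the status output in this post.
sample='○ pve-ha-lrm.service - PVE Local HA Resource Manager Daemon
Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; disabled; preset: enabled)
● pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)'

printf '%s\n' "$sample" | unit_enablement
```

Running the same function over the status output of both nodes would make it easy to compare node#1 and node#2 side by side.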
And using
Code:
journalctl -u corosync.service
Shows the following log:
Code:
Mar 08 10:13:59 pve2 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Mar 08 10:13:59 pve2 corosync[2829]: [MAIN ] Corosync Cluster Engine starting up
Mar 08 10:13:59 pve2 corosync[2829]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim >
Mar 08 10:13:59 pve2 corosync[2829]: [TOTEM ] Initializing transport (Kronosnet).
Mar 08 10:13:59 pve2 corosync[2829]: [TOTEM ] totemknet initialized
Mar 08 10:13:59 pve2 corosync[2829]: [KNET ] pmtud: MTU manually set to: 0
Mar 08 10:13:59 pve2 corosync[2829]: [KNET ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronos>
Mar 08 10:13:59 pve2 corosync[2829]: [SERV ] Service engine loaded: corosync configuration map access [0]
Mar 08 10:13:59 pve2 corosync[2829]: [QB ] server name: cmap
Mar 08 10:13:59 pve2 corosync[2829]: [SERV ] Service engine loaded: corosync configuration service [1]
Mar 08 10:13:59 pve2 corosync[2829]: [QB ] server name: cfg
Mar 08 10:13:59 pve2 corosync[2829]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 >
Mar 08 10:13:59 pve2 corosync[2829]: [QB ] server name: cpg
Mar 08 10:13:59 pve2 corosync[2829]: [SERV ] Service engine loaded: corosync profile loading service [4]
Mar 08 10:13:59 pve2 corosync[2829]: [SERV ] Service engine loaded: corosync resource monitoring service [6]
Mar 08 10:13:59 pve2 corosync[2829]: [WD ] Watchdog not enabled by configuration
Mar 08 10:13:59 pve2 corosync[2829]: [WD ] resource load_15min missing a recovery key.
Mar 08 10:13:59 pve2 corosync[2829]: [WD ] resource memory_used missing a recovery key.
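Note that the corosync lines above are dated Mar 08, while the status output is from Apr 07, so they may come from an older boot rather than the reboot in question. `journalctl -b -u corosync.service` limits the output to the current boot. As a sketch for working on a saved copy of the log instead, the helper below keeps only the lines from one day. The name `lines_for_day` is my own, purely for illustration, and the two sample lines are copied from the logs in this post.

```shell
#!/bin/sh
# Keep only syslog-style lines whose timestamp starts with the given "Mon DD".
# lines_for_day is a hypothetical helper name, used only for illustration.
lines_for_day() {
    day=$1
    # index(...) == 1 means the line begins with "<day> ".
    awk -v day="$day" 'index($0, day " ") == 1'
}

# Sample lines copied from the logs in this post.
sample='Mar 08 10:13:59 pve2 corosync[2829]: [WD ] Watchdog not enabled by configuration
Apr 07 12:56:05 pve2 pmxcfs[929]: [quorum] crit: quorum_initialize failed: 2'

printf '%s\n' "$sample" | lines_for_day "Apr 07"
```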
Does anyone have an idea where I could find the root cause of my problem, or where I can look further to debug?