I'm having trouble with HA and noticed that three services are in a failed state:
Code:
root@pve01:/etc# systemctl list-units --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● nut-monitor.service loaded failed failed Network UPS Tools - power device monitor and shutdown controller
● pve-ha-crm.service loaded failed failed PVE Cluster HA Resource Manager Daemon
● watchdog-mux.service loaded failed failed Proxmox VE watchdog multiplexer
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
3 loaded units listed.
Code:
root@pve01:/etc# systemctl status pve-ha-crm
● pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon
Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2022-06-10 02:59:46 EDT; 11h ago
Process: 732969 ExecStart=/usr/sbin/pve-ha-crm start (code=exited, status=0/SUCCESS)
Main PID: 732972 (code=exited, status=255/EXCEPTION)
CPU: 2.979s
Jun 10 02:56:36 pve01 pve-ha-crm[732972]: status change startup => wait_for_quorum
Jun 10 02:56:36 pve01 systemd[1]: Started PVE Cluster HA Resource Manager Daemon.
Jun 10 02:56:41 pve01 pve-ha-crm[732972]: status change wait_for_quorum => slave
Jun 10 02:59:46 pve01 pve-ha-crm[732972]: successfully acquired lock 'ha_manager_lock'
Jun 10 02:59:46 pve01 pve-ha-crm[732972]: ERROR: unable to open watchdog socket - No such file or directory
Jun 10 02:59:46 pve01 pve-ha-crm[732972]: server received shutdown request
Jun 10 02:59:46 pve01 pve-ha-crm[732972]: server stopped
Jun 10 02:59:46 pve01 systemd[1]: pve-ha-crm.service: Main process exited, code=exited, status=255/EXCEPTION
Jun 10 02:59:46 pve01 systemd[1]: pve-ha-crm.service: Failed with result 'exit-code'.
Jun 10 02:59:46 pve01 systemd[1]: pve-ha-crm.service: Consumed 2.979s CPU time.
Code:
root@pve01:/etc# systemctl status watchdog-mux
● watchdog-mux.service - Proxmox VE watchdog multiplexer
Loaded: loaded (/lib/systemd/system/watchdog-mux.service; static)
Active: failed (Result: exit-code) since Fri 2022-06-10 13:59:04 EDT; 10min ago
Process: 2360328 ExecStart=/usr/sbin/watchdog-mux (code=exited, status=1/FAILURE)
Main PID: 2360328 (code=exited, status=1/FAILURE)
CPU: 4ms
Jun 10 13:59:04 pve01 systemd[1]: Started Proxmox VE watchdog multiplexer.
Jun 10 13:59:04 pve01 systemd[1]: watchdog-mux.service: Main process exited, code=exited, status=1/FAILURE
Jun 10 13:59:04 pve01 watchdog-mux[2360328]: watchdog active - unable to restart watchdog-mux
Jun 10 13:59:04 pve01 systemd[1]: watchdog-mux.service: Failed with result 'exit-code'.
This is on HP DL360p Gen8 servers, and I am using the hpwdt watchdog module.
Code:
root@pve01:/etc# cat /etc/default/pve-ha-manager
# select watchdog module (default is softdog)
WATCHDOG_MODULE=hpwdt
Journal entries for watchdog-mux:
Code:
-- Boot 609e561812e6490885c7a95d92b1e6c5 --
Jun 04 21:04:34 pve01 systemd[1]: Started Proxmox VE watchdog multiplexer.
Jun 04 21:04:34 pve01 watchdog-mux[1819]: Loading watchdog module 'hpwdt'
Jun 04 21:04:34 pve01 watchdog-mux[1819]: Watchdog driver 'HPE iLO2+ HW Watchdog Timer', version 0
Jun 09 22:31:29 pve01 watchdog-mux[1819]: client watchdog expired - disable watchdog updates
Jun 09 22:32:08 pve01 watchdog-mux[1819]: exit watchdog-mux with active connections
Jun 09 22:32:08 pve01 systemd[1]: watchdog-mux.service: Succeeded.
Jun 09 22:32:08 pve01 systemd[1]: watchdog-mux.service: Consumed 25.671s CPU time.
Jun 10 02:43:15 pve01 systemd[1]: Started Proxmox VE watchdog multiplexer.
Jun 10 02:43:15 pve01 watchdog-mux[699575]: watchdog active - unable to restart watchdog-mux
Jun 10 02:43:15 pve01 systemd[1]: watchdog-mux.service: Main process exited, code=exited, status=1/FAILURE
Jun 10 02:43:15 pve01 systemd[1]: watchdog-mux.service: Failed with result 'exit-code'.
Jun 10 02:56:33 pve01 systemd[1]: Started Proxmox VE watchdog multiplexer.
Jun 10 02:56:33 pve01 watchdog-mux[732968]: watchdog active - unable to restart watchdog-mux
Jun 10 02:56:33 pve01 systemd[1]: watchdog-mux.service: Main process exited, code=exited, status=1/FAILURE
Jun 10 02:56:33 pve01 systemd[1]: watchdog-mux.service: Failed with result 'exit-code'.
Jun 10 13:59:04 pve01 systemd[1]: Started Proxmox VE watchdog multiplexer.
Jun 10 13:59:04 pve01 systemd[1]: watchdog-mux.service: Main process exited, code=exited, status=1/FAILURE
Jun 10 13:59:04 pve01 watchdog-mux[2360328]: watchdog active - unable to restart watchdog-mux
Jun 10 13:59:04 pve01 systemd[1]: watchdog-mux.service: Failed with result 'exit-code'.
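In case it helps, here is a minimal sketch of read-only checks I can run and post the output from. It only reports whether files exist and whether a watchdog module is loaded; it does not start, stop, or reload anything. The helper name report_path is just something I made up for this, and the two /run/watchdog-mux.* paths are my assumption about where watchdog-mux keeps its socket and its "active" marker, so they may differ by version.

```shell
#!/bin/sh
# Read-only checks: nothing here starts, stops, or reloads a service or module.

# report_path PATH: print whether PATH exists, without opening it.
# (Hypothetical helper name, used only in this sketch.)
report_path() {
    if [ -e "$1" ]; then
        echo "present: $1"
    else
        echo "missing: $1"
    fi
}

# Watchdog device nodes created by the kernel driver.
report_path /dev/watchdog
report_path /dev/watchdog0

# ASSUMPTION: socket and marker paths used by watchdog-mux; verify on your version.
report_path /run/watchdog-mux.sock
report_path /run/watchdog-mux.active

# Is hpwdt or softdog loaded? /proc/modules is the same data lsmod prints.
grep -E '^(hpwdt|softdog) ' /proc/modules 2>/dev/null \
    || echo "no hpwdt/softdog module listed"

# Logs for the HA services and the multiplexer from the current boot, if journalctl exists.
if command -v journalctl >/dev/null 2>&1; then
    journalctl --no-pager -b -u watchdog-mux -u pve-ha-crm -u pve-ha-lrm || true
fi
```

I deliberately left out anything that opens /dev/watchdog, since opening the device can arm the hardware timer.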
What can I do to further troubleshoot this?