Hi,
I am trying out the new Proxmox 4.1 in a test environment without hardware watchdogs.
Unfortunately, I cannot make the failed node restart.
I tried a few commands I found on this forum, like
ifconfig vmbr1 down
kill -9 corosync
Whichever node I try this on gets disconnected, but it stays powered on. Sometimes I get these errors:
watchdog update failed - Broken pipe
pve-ha-lrm lost lock 'ha_agent_sun_lock - can't get cfs lock
unable to write lrm status file - unable to open file '/etc/pve/nodes/sun/lrm_status.tmp.2610' - Device or resource busy
or these
<code>
Jan 5 12:35:04 sun pve-ha-lrm[3028]: successfully acquired lock 'ha_agent_sun_lock'
Jan 5 12:35:04 sun pve-ha-lrm[3028]: watchdog active
Jan 5 12:35:04 sun pve-ha-lrm[3028]: status change wait_for_agent_lock => active
Jan 5 12:35:04 sun watchdog-mux[5077]: watchdog set timeout: Invalid argument
Jan 5 12:35:04 sun systemd[1]: watchdog-mux.service: main process exited, code=exited, status=1/FAILURE
Jan 5 12:35:04 sun systemd[1]: Unit watchdog-mux.service entered failed state.
Jan 5 12:35:04 sun watchdog-mux[5080]: watchdog set timeout: Invalid argument
Jan 5 12:35:04 sun systemd[1]: watchdog-mux.service: main process exited, code=exited, status=1/FAILURE
Jan 5 12:35:04 sun systemd[1]: Unit watchdog-mux.service entered failed state.
Jan 5 12:35:04 sun pve-ha-lrm[5078]: starting service ct:161
Jan 5 12:35:04 sun watchdog-mux[5082]: watchdog set timeout: Invalid argument
Jan 5 12:35:04 sun systemd[1]: watchdog-mux.service: main process exited, code=exited, status=1/FAILURE
Jan 5 12:35:04 sun systemd[1]: Unit watchdog-mux.service entered failed state.
Jan 5 12:35:04 sun watchdog-mux[5085]: watchdog set timeout: Invalid argument
Jan 5 12:35:04 sun systemd[1]: watchdog-mux.service: main process exited, code=exited, status=1/FAILURE
Jan 5 12:35:04 sun systemd[1]: Unit watchdog-mux.service entered failed state.
</code>
When I tried
echo "A" | socat - UNIX-CONNECT:/var/run/watchdog-mux.sock
I got this error: socat[9367] E connect(5, AF=1 "/var/run/watchdog-mux.sock", 28): Connection refused
Where can I find more information about why a node is not restarting?
Thanks a lot for your help.