Hi Proxmox community,
I encountered a critical issue today with one of my Proxmox nodes, where the host rebooted by itself without any clear reason. I’m reaching out to seek your advice on identifying the root cause and possible solutions to prevent this from happening again.
Incident Details:
• Host: [Insert Host Details, like model or configuration]
• Proxmox Version: 7.4-16
• Kernel Version: 6.2.16-11-bpo11-pve
• Symptoms: No prior warnings or errors in the Proxmox GUI, but increased I/O delay and server load in the hours leading up to the incident (a few CLI checks I use for this are listed just below). The system became unresponsive and then rebooted on its own. We suspect something similar to this thread: https://forum.proxmox.com/threads/node-instability.61157/, but we are not sure.
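For context, this is roughly how I look at I/O pressure and load on the CLI (iostat comes from the sysstat package, and /proc/pressure/io needs PSI support, which the stock PVE kernels have as far as I know):
iostat -x 5 3                  # per-device utilisation and await times, three 5-second samples
cat /proc/pressure/io          # PSI: share of time tasks were stalled on I/O
uptime                         # quick look at the load averages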
Logs & Observations:
From the logs, I noticed repeated "loop take too long" warnings and watchdog/status messages from pve-ha-crm and pve-ha-lrm:
Jun 05 07:14:35 uz5-srv03 pve-ha-crm[3541]: loop take too long (65 seconds)
Aug 01 14:26:00 uz5-srv03 pve-ha-crm[3541]: loop take too long (65 seconds)
Sep 13 11:08:14 uz5-srv03 pve-ha-crm[3541]: status change wait_for_quorum => slave
Sep 23 09:15:18 uz5-srv03 pve-ha-crm[3541]: loop take too long (62 seconds)
Sep 13 11:08:09 uz5-srv03 pve-ha-lrm[15193]: watchdog active
Sep 13 11:08:09 uz5-srv03 pve-ha-lrm[15193]: status change wait_for_agent_lock => active
Sep 13 11:58:10 uz5-srv03 pve-ha-lrm[686375]: starting service vm:641
Sep 13 11:58:10 uz5-srv03 pve-ha-lrm[686380]: start VM 641: UPID:uz5-srv03:000A792C:3875641F:66E3E282:qmstart:641:root@pam:
Sep 13 11:58:10 uz5-srv03 pve-ha-lrm[686375]: <root@pam> starting task UPID:uz5-srv03:000A792C:3875641F:66E3E282:qmstart:641:root@pam:
Sep 13 11:58:12 uz5-srv03 pve-ha-lrm[686375]: <root@pam> end task UPID:uz5-srv03:000A792C:3875641F:66E3E282:qmstart:641:root@pam: OK
Sep 13 11:58:12 uz5-srv03 pve-ha-lrm[686375]: service status vm:641 started
Sep 23 09:15:18 uz5-srv03 pve-ha-lrm[15193]: loop take too long (70 seconds)
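For completeness, this is roughly how I pulled those entries out of the journal (adjust the time window as needed; the -u filters are the standard pve-ha-crm and pve-ha-lrm units):
journalctl -u pve-ha-crm -u pve-ha-lrm --since "-14 days" | grep -iE "loop take too long|watchdog|status change"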
Steps I’ve Taken:
• Checked /var/log/syslog and journalctl for any kernel panic or hardware-related errors (roughly with the commands listed after this list); nothing conclusive.
• Verified that no scheduled tasks or updates were configured during the incident window.
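In case it helps, these are approximately the checks I ran (note that -b -1 only reaches the boot before the reset if persistent journaling is enabled):
journalctl -b -1 -p err                                      # errors and worse from the previous boot
journalctl -b -1 | tail -n 300                               # the very last messages written before the node went down
grep -iE "panic|mce|oom|thermal|watchdog" /var/log/syslog    # plus the rotated syslog files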
Questions:
1. What additional logs or diagnostics should I check to understand the cause of this spontaneous reboot?
2. Could the watchdog or HA configuration be responsible for the reboot? How can I investigate this further? (The watchdog/HA checks I plan to run are listed after the questions.)
3. Has anyone else experienced similar issues, and what steps helped in troubleshooting or resolving them?
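Regarding question 2, these are the standard PVE commands and files I plan to go through on the watchdog/HA side; corrections welcome if there is a better place to look:
ha-manager status                      # HA resources and the CRM/LRM state of each node
pvecm status                           # corosync/quorum health, since losing quorum with active HA services leads to self-fencing
cat /etc/default/pve-ha-manager        # WATCHDOG_MODULE setting (softdog by default)
journalctl -u watchdog-mux -u pve-ha-lrm -u pve-ha-crm -b -1 | tail -n 100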