Hi Proxmox community,
I encountered a critical issue today with one of my Proxmox nodes, where the host rebooted by itself without any clear reason. I’m reaching out to seek your advice on identifying the root cause and possible solutions to prevent this from happening again.
Incident Details:
• Host: [Insert Host Details, like model or configuration]
• Proxmox Version: 7.4-16
• Kernel Version: 6.2.16-11-bpo11-pve
• Symptoms: No prior warnings or errors in the Proxmox GUI, but increased I/O delay and server load in the hours leading up to the incident (a few CLI checks I use for this are listed just below). The system became unresponsive and then rebooted on its own. We suspect something similar to this thread: https://forum.proxmox.com/threads/node-instability.61157/, but we are not sure.
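For context, this is roughly how I look at I/O pressure and load on the CLI (iostat comes from the sysstat package, and /proc/pressure/io needs PSI support, which the stock PVE kernels have as far as I know):
iostat -x 5 3                  # per-device utilisation and await times, three 5-second samples
cat /proc/pressure/io          # PSI: share of time tasks were stalled on I/O
uptime                         # quick look at the load averages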
Logs & Observations:
From the logs, I noticed repeated "loop take too long" warnings and watchdog/status messages from pve-ha-crm and pve-ha-lrm:
Jun 05 07:14:35 uz5-srv03 pve-ha-crm[3541]: loop take too long (65 seconds)
Aug 01 14:26:00 uz5-srv03 pve-ha-crm[3541]: loop take too long (65 seconds)
Sep 13 11:08:14 uz5-srv03 pve-ha-crm[3541]: status change wait_for_quorum => slave
Sep 23 09:15:18 uz5-srv03 pve-ha-crm[3541]: loop take too long (62 seconds)
Sep 13 11:08:09 uz5-srv03 pve-ha-lrm[15193]: watchdog active
Sep 13 11:08:09 uz5-srv03 pve-ha-lrm[15193]: status change wait_for_agent_lock => active
Sep 13 11:58:10 uz5-srv03 pve-ha-lrm[686375]: starting service vm:641
Sep 13 11:58:10 uz5-srv03 pve-ha-lrm[686380]: start VM 641: UPID:uz5-srv03:000A792C:3875641F:66E3E282:qmstart:641:root@pam:
Sep 13 11:58:10 uz5-srv03 pve-ha-lrm[686375]: <root@pam> starting task UPID:uz5-srv03:000A792C:3875641F:66E3E282:qmstart:641:root@pam:
Sep 13 11:58:12 uz5-srv03 pve-ha-lrm[686375]: <root@pam> end task UPID:uz5-srv03:000A792C:3875641F:66E3E282:qmstart:641:root@pam: OK
Sep 13 11:58:12 uz5-srv03 pve-ha-lrm[686375]: service status vm:641 started
Sep 23 09:15:18 uz5-srv03 pve-ha-lrm[15193]: loop take too long (70 seconds)
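For completeness, this is roughly how I pulled those entries out of the journal (adjust the time window as needed; the -u filters are the standard pve-ha-crm and pve-ha-lrm units):
journalctl -u pve-ha-crm -u pve-ha-lrm --since "-14 days" | grep -iE "loop take too long|watchdog|status change"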
Steps I’ve Taken:
• Checked /var/log/syslog and journalctl for any kernel panic or hardware-related errors (roughly with the commands listed after this list); nothing conclusive.
• Verified that no scheduled tasks or updates were configured during the incident window.
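In case it helps, these are approximately the checks I ran (note that -b -1 only reaches the boot before the reset if persistent journaling is enabled):
journalctl -b -1 -p err                                      # errors and worse from the previous boot
journalctl -b -1 | tail -n 300                               # the very last messages written before the node went down
grep -iE "panic|mce|oom|thermal|watchdog" /var/log/syslog    # plus the rotated syslog files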
Questions:
1. What additional logs or diagnostics should I check to understand the cause of this spontaneous reboot?
2. Could the watchdog or HA configuration be responsible for the reboot? How can I investigate this further? (The watchdog/HA checks I plan to run are listed after the questions.)
3. Has anyone else experienced similar issues, and what steps helped in troubleshooting or resolving them?
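Regarding question 2, these are the standard PVE commands and files I plan to go through on the watchdog/HA side; corrections welcome if there is a better place to look:
ha-manager status                      # HA resources and the CRM/LRM state of each node
pvecm status                           # corosync/quorum health, since losing quorum with active HA services leads to self-fencing
cat /etc/default/pve-ha-manager        # WATCHDOG_MODULE setting (softdog by default)
journalctl -u watchdog-mux -u pve-ha-lrm -u pve-ha-crm -b -1 | tail -n 100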