Proxmox cluster member server rebooted itself

oyusupov

Sep 23, 2024
Hi Proxmox community,

I encountered a critical issue today with one of my Proxmox nodes, where the host rebooted by itself without any clear reason. I’m reaching out to seek your advice on identifying the root cause and possible solutions to prevent this from happening again.

Incident Details:

Host: [Insert Host Details, like model or configuration]
Proxmox Version: 7.4-16
Kernel Version: 6.2.16-11-bpo11-pve
Symptoms: No prior warnings or errors in the Proxmox GUI, but I/O delay and server load increased in the hours leading up to the incident. The system became unresponsive and then rebooted on its own. We suspect the cause is the one described in https://forum.proxmox.com/threads/node-instability.61157/, but we are not sure.

Logs & Observations:
From the logs, I noticed repeated timeout messages from various services, including pve-ha-lrm and pve-ha-crm:
Jun 05 07:14:35 uz5-srv03 pve-ha-crm[3541]: loop take too long (65 seconds)
Aug 01 14:26:00 uz5-srv03 pve-ha-crm[3541]: loop take too long (65 seconds)
Sep 13 11:08:14 uz5-srv03 pve-ha-crm[3541]: status change wait_for_quorum => slave
Sep 23 09:15:18 uz5-srv03 pve-ha-crm[3541]: loop take too long (62 seconds)

Sep 13 11:08:09 uz5-srv03 pve-ha-lrm[15193]: watchdog active
Sep 13 11:08:09 uz5-srv03 pve-ha-lrm[15193]: status change wait_for_agent_lock => active
Sep 13 11:58:10 uz5-srv03 pve-ha-lrm[686375]: starting service vm:641
Sep 13 11:58:10 uz5-srv03 pve-ha-lrm[686380]: start VM 641: UPID:uz5-srv03:000A792C:3875641F:66E3E282:qmstart:641:root@pam:
Sep 13 11:58:10 uz5-srv03 pve-ha-lrm[686375]: <root@pam> starting task UPID:uz5-srv03:000A792C:3875641F:66E3E282:qmstart:641:root@pam:
Sep 13 11:58:12 uz5-srv03 pve-ha-lrm[686375]: <root@pam> end task UPID:uz5-srv03:000A792C:3875641F:66E3E282:qmstart:641:root@pam: OK
Sep 13 11:58:12 uz5-srv03 pve-ha-lrm[686375]: service status vm:641 started
Sep 23 09:15:18 uz5-srv03 pve-ha-lrm[15193]: loop take too long (70 seconds)
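
To see how often the HA loops stall and for how long, lines like the ones above can be reduced to timestamp, service, and duration. This is only a minimal sketch: slow_ha_loops is a hypothetical helper name, and it assumes the classic syslog-style line format shown in the excerpt.

```shell
# Hypothetical helper: reads journal/syslog lines on stdin and prints
# "<timestamp> <service> <seconds>" for every "loop take too long" entry
# from pve-ha-crm or pve-ha-lrm. All other lines are dropped.
slow_ha_loops() {
    sed -E -n 's/^([A-Z][a-z]{2} +[0-9]+ [0-9:]{8}) [^ ]+ (pve-ha-(crm|lrm))\[[0-9]+\]: loop take too long \(([0-9]+) seconds\).*$/\1 \2 \4/p'
}

# Possible usage on the node (assumes the journal still holds these entries):
#   journalctl -u pve-ha-crm -u pve-ha-lrm --no-pager | slow_ha_loops
```

Stalls of 60 seconds or more on both services at the same timestamp, as on Sep 23 09:15:18, are worth lining up against the time of the reboot.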

Steps I’ve Taken:

• Checked /var/log/syslog and journalctl for any kernel panic or hardware-related errors—nothing conclusive.
• Verified that no scheduled tasks or updates were configured during the incident window.


Questions:


1. What additional logs or diagnostics should I check to understand the cause of this spontaneous reboot?
2. Could the watchdog or HA configuration be responsible for the reboot? How can I investigate this further?
3. Has anyone else experienced similar issues, and what steps helped in troubleshooting or resolving them?
 
Questions:


1. What additional logs or diagnostics should I check to understand the cause of this spontaneous reboot?

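Check the full log of the boot that ended in the reboot, e.g. run journalctl --list-boots to find it, then journalctl -b <bootid> -n 200 to read the last 200 lines logged before the reboot.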
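
As a rough sketch of that step, the second command can be wrapped so it always targets the boot before the current one. tail_previous_boot is a hypothetical helper name, and the sketch assumes the journal is stored persistently; if it is not, the previous boot will not be available.

```shell
# Hypothetical helper: show the last N lines (default 200) of the boot
# before the current one, i.e. what was logged right before the reboot.
# Assumes journald keeps logs across reboots (persistent storage).
tail_previous_boot() {
    tail_lines="${1:-200}"
    # Offset -1 means "previous boot"; use an ID from --list-boots for older ones.
    journalctl -b -1 -n "$tail_lines" --no-pager
}

# Possible usage on the affected node:
#   journalctl --list-boots     # find the boot that ended in the reboot
#   tail_previous_boot 200      # read its final 200 lines
```

If several reboots have happened since the incident, pick the matching entry from --list-boots and pass it to journalctl -b directly instead of relying on the -1 offset.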

2. Could the watchdog or HA configuration be responsible for the reboot? How can I investigate this further?

They can. It is so common that I put it into a "tutorial" (which I may need to update with information on lost quorum):
https://forum.proxmox.com/threads/high-availability-watchdog-reboots.154580/
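
For a first pass before reading the whole log, the previous boot's journal can be filtered for watchdog and quorum related lines. This is a sketch only: ha_reboot_hints is a hypothetical helper name, and the keyword list is an assumption, not an official set of HA messages.

```shell
# Hypothetical helper: keep only lines on stdin that mention the watchdog,
# quorum, fencing, or stalled HA loops (case-insensitive).
# The keyword list is an assumption; extend it as needed.
ha_reboot_hints() {
    grep -E -i 'watchdog|quorum|fence|loop take too long'
}

# Possible usage on the affected node:
#   journalctl -b -1 --no-pager | ha_reboot_hints
#   ha-manager status           # shows whether HA resources are configured
#   pvecm status                # shows current quorum state of the cluster
```

An empty result does not rule out a watchdog reset, since a hard reset may leave nothing in the log after the last flushed line.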

3. Has anyone else experienced similar issues, and what steps helped in troubleshooting or resolving them?

You never mention whether you actually use HA (though I suppose you do). If the reboot is not watchdog related, you are back to square one, but for "mysterious" reboots the full boot logs need to be examined first to at least make an attempt at a diagnosis.
 
