Sudden pve restart

kurdam

Member
Sep 29, 2020
Hi,
We recently had a node restart suddenly, without any warning.
The machine was running close to its RAM limit.
I downloaded the logs from the node in question, but I don't see any trace of the cause in them.
All I see is the startup phase, with nothing interesting before it.
Can you point me to a log file that would show what happened?

Thank you in advance,
 
Hi,

Is your node running in a cluster? Do you have HA enabled? My first guess would be that the node was fenced and therefore restarted. If you are using HA, you can check journalctl -u pve-ha-lrm to find out more.
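A minimal sketch of that check, assuming a standard Proxmox VE install. The helper name ha_journal is made up for illustration, and the two extra units (pve-ha-crm and watchdog-mux) are my addition to the single unit mentioned above, since fencing goes through the watchdog.

```shell
# ha_journal [OFFSET]: show HA-related journal entries for one boot.
# OFFSET is a boot offset as printed by `journalctl --list-boots`
# (0 = current boot, -1 = the boot before it). Defaults to 0.
# Assumption: pve-ha-crm and watchdog-mux are included in addition to
# pve-ha-lrm; drop them if they do not exist on your system.
ha_journal() {
    offset="${1:-0}"
    journalctl -b "$offset" -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux --no-pager
}
```

Calling ha_journal -1 after an unexpected reboot shows what these services logged during the boot that ended with the restart.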
 
Thank you for your answer.
No, we are not running HA yet. But we are running in a cluster.

Here is the result of the command:

root@pve5:~# journalctl -u pve-ha-lrm
-- Journal begins at Tue 2022-11-15 19:35:35 CET, ends at Wed 2023-01-04 14:59:49 CET. --
Dec 07 07:14:50 pve5 pve-ha-lrm[1857]: loop take too long (277 seconds)
Dec 07 07:18:42 pve5 pve-ha-lrm[1857]: loop take too long (232 seconds)
Dec 07 07:20:43 pve5 pve-ha-lrm[1857]: loop take too long (121 seconds)
Dec 07 07:25:26 pve5 pve-ha-lrm[1857]: loop take too long (283 seconds)
Dec 07 07:27:17 pve5 pve-ha-lrm[1857]: loop take too long (111 seconds)
Dec 07 07:31:48 pve5 pve-ha-lrm[1857]: unable to write lrm status file - closing file '/etc/pve/nodes/pve5/lrm_status.tmp.1857' failed - Device or resource busy
Dec 07 07:31:53 pve5 pve-ha-lrm[1857]: loop take too long (271 seconds)
Dec 07 07:54:11 pve5 pve-ha-lrm[1857]: unable to write lrm status file - unable to delete old temp file: Device or resource busy
Dec 07 07:54:16 pve5 pve-ha-lrm[1857]: unable to write lrm status file - unable to delete old temp file: Device or resource busy
Dec 22 13:09:41 pve5 pve-ha-lrm[1857]: loop take too long (34 seconds)
-- Boot 18ac3c84d9cd48d2b755271332328ed6 --
Jan 02 11:52:01 pve5 systemd[1]: Starting PVE Local HA Resource Manager Daemon...
Jan 02 11:52:01 pve5 pve-ha-lrm[1929]: starting server
Jan 02 11:52:01 pve5 pve-ha-lrm[1929]: status change startup => wait_for_agent_lock
Jan 02 11:52:02 pve5 systemd[1]: Started PVE Local HA Resource Manager Daemon.

The problem happened on Jan 02.
 
"all i see is the start phase, nothing interesting before"
Does that mean you're only looking at the boot *after* the crash? You can read the previous boot logs via journalctl --list-boots and journalctl -b <boot>.
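As a rough sketch of how the two commands above fit together: this assumes the journal is persistent (otherwise earlier boots are not kept) and that the crashed boot is the one right before the current boot, i.e. offset -1. The helper names are hypothetical; check the output of journalctl --list-boots first and adjust the offset.

```shell
# prev_boot_tail [OFFSET] [LINES]: print the last LINES journal lines
# (default 200) of the boot at OFFSET (default -1, the boot before the
# current one). These are the last things logged before the restart.
prev_boot_tail() {
    offset="${1:--1}"
    lines="${2:-200}"
    journalctl -b "$offset" -n "$lines" --no-pager
}

# prev_boot_kernel_warnings [OFFSET]: same boot, but only kernel
# messages with priority warning or worse.
prev_boot_kernel_warnings() {
    offset="${1:--1}"
    journalctl -b "$offset" -k -p warning --no-pager
}
```

If the output of prev_boot_tail ends in ordinary activity with no shutdown messages at all, that usually suggests the node did not go down cleanly (for example a hard reset or power loss) rather than an orderly reboot.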
 
I checked, and there is nothing unusual in the 12 hours before the crash, just cron jobs and backups running and workers stopping and starting.
At the time of the crash, no warning of any kind in journalctl.
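One way to double-check that nothing was overlooked is to filter the previous boot's journal for a few common crash-related kernel messages. This is only a sketch: the function name crash_hints and the pattern list are my own choice, not an exhaustive or official set, and an empty result does not rule out a hardware or power problem.

```shell
# crash_hints: read journal text on stdin and keep only lines that
# match some common crash-related messages (case-insensitive).
# The pattern list is illustrative, not complete.
crash_hints() {
    grep -i -E 'oom-killer|out of memory|kernel panic|soft lockup|hard lockup|hardware error|watchdog'
}
```

For example, journalctl -b -1 --no-pager | crash_hints scans the boot before the current one.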
 
Capture.PNG
Here is a screenshot of the system load of this node on the day it happened.
 