Sudden pve restart

kurdam · Jan 4, 2023

Hi,
We recently had a node restart suddenly, without any warning.
The machine was running close to its RAM limit.
I downloaded the logs from the node in question, but I don't see any trace of the problem in them.
All I see is the start phase, nothing interesting before.
Can you point me to a log file that would show what happened?

Thank you in advance,

nunner · Jan 4, 2023

Hi,

Is your node running in a cluster? Do you have HA enabled? My first guess would be that the node was fenced and thus restarted. If you are using HA, you can check journalctl -u pve-ha-lrm to try to find out more.

kurdam · Jan 4, 2023

Thank you for your answer.
No, we are not running HA yet. But we are running in a cluster.

Here is the result of the command:

root@pve5:~# journalctl -u pve-ha-lrm
-- Journal begins at Tue 2022-11-15 19:35:35 CET, ends at Wed 2023-01-04 14:59:49 CET. --
Dec 07 07:14:50 pve5 pve-ha-lrm[1857]: loop take too long (277 seconds)
Dec 07 07:18:42 pve5 pve-ha-lrm[1857]: loop take too long (232 seconds)
Dec 07 07:20:43 pve5 pve-ha-lrm[1857]: loop take too long (121 seconds)
Dec 07 07:25:26 pve5 pve-ha-lrm[1857]: loop take too long (283 seconds)
Dec 07 07:27:17 pve5 pve-ha-lrm[1857]: loop take too long (111 seconds)
Dec 07 07:31:48 pve5 pve-ha-lrm[1857]: unable to write lrm status file - closing file '/etc/pve/nodes/pve5/lrm_status.tmp.1857' failed - Device or resource busy
Dec 07 07:31:53 pve5 pve-ha-lrm[1857]: loop take too long (271 seconds)
Dec 07 07:54:11 pve5 pve-ha-lrm[1857]: unable to write lrm status file - unable to delete old temp file: Device or resource busy
Dec 07 07:54:16 pve5 pve-ha-lrm[1857]: unable to write lrm status file - unable to delete old temp file: Device or resource busy
Dec 22 13:09:41 pve5 pve-ha-lrm[1857]: loop take too long (34 seconds)
-- Boot 18ac3c84d9cd48d2b755271332328ed6 --
Jan 02 11:52:01 pve5 systemd[1]: Starting PVE Local HA Resource Manager Daemon...
Jan 02 11:52:01 pve5 pve-ha-lrm[1929]: starting server
Jan 02 11:52:01 pve5 pve-ha-lrm[1929]: status change startup => wait_for_agent_lock
Jan 02 11:52:02 pve5 systemd[1]: Started PVE Local HA Resource Manager Daemon.

The problem happened on Jan 02.
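A note on reading the output above: the "-- Boot <id> --" line marks where a new boot begins, so the line just before it is the last thing this unit logged before the restart. Here that is Dec 22, which means the pve-ha-lrm unit on its own says nothing about Jan 02. A minimal sketch for pulling out that line from saved journal text; the function name last_line_before_boot is made up for illustration, and it assumes the marker format shown above:

```shell
#!/bin/sh
# Hypothetical helper: read journalctl output on stdin and print the last
# line seen before each "-- Boot <id> --" marker.
last_line_before_boot() {
    awk '
        /^-- Boot / { if (prev != "") print prev; next }  # marker: emit previous line
                    { prev = $0 }                          # remember every other line
    '
}
```

It could be used as: journalctl -u pve-ha-lrm --no-pager | last_line_before_boot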

nunner · Jan 4, 2023

kurdam said:
All I see is the start phase, nothing interesting before.

Does that mean you're only looking at the boot *after* the crash? You can read the previous boot logs via journalctl --list-boots and journalctl -b <boot>.
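As a rough sketch of what that looks like in practice: the functions below wrap journalctl calls for the boot before the current one (offset -1). The function names and the JOURNALCTL override variable are made up for illustration. It assumes the journal is persistent and that the node has only rebooted once since the crash; otherwise pick the right offset or ID from journalctl --list-boots first.

```shell
#!/bin/sh
# JOURNALCTL is a hypothetical override, handy for dry runs (JOURNALCTL=echo).
JOURNALCTL="${JOURNALCTL:-journalctl}"

# List all boots the journal knows about, with their offsets and IDs.
list_boots() {
    "$JOURNALCTL" --list-boots --no-pager
}

# Show the last N lines (default 200) of the previous boot, i.e. whatever
# was logged right before the restart.
last_lines_of_previous_boot() {
    lines="${1:-200}"
    "$JOURNALCTL" -b -1 -n "$lines" --no-pager
}

# Show only messages of priority "warning" or worse from the previous boot.
previous_boot_warnings() {
    "$JOURNALCTL" -b -1 -p warning --no-pager
}
```

If the last lines of the previous boot simply stop mid-activity, with no shutdown sequence, the restart was most likely not a clean one.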

kurdam · Jan 4, 2023

I checked, and there is nothing unusual in the 12 hours before the crash, just crons and backups running and workers stopping and starting.
At the time of the crash, there is no warning of any kind in journalctl.

kurdam · Jan 4, 2023

Here is a screenshot of the system load of this node on the day it happened.

bbgeek17 · Jan 4, 2023

If the reboot was caused by a hardware issue, then it's likely that the kernel had no time to log an error. Are you running an enterprise-type server with a good BMC (iDRAC, iLO, etc.)? If you are, then you should query those logs.
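One possible way to query those logs from the node itself is ipmitool, sketched below. This assumes the ipmitool package is installed, the BMC speaks IPMI, and the kernel IPMI drivers are loaded; none of that is confirmed in this thread. The function names and the IPMITOOL override variable are made up for illustration. The vendor's web interface usually exposes the same event log.

```shell
#!/bin/sh
# IPMITOOL is a hypothetical override, handy for dry runs (IPMITOOL=echo).
IPMITOOL="${IPMITOOL:-ipmitool}"

# Summary of the BMC System Event Log (SEL): entry count, last add time, etc.
bmc_sel_info() {
    "$IPMITOOL" sel info
}

# Full SEL in extended format, with decoded timestamps and sensor names.
# Look for entries around the time of the restart (power, watchdog,
# memory/ECC, or temperature events).
bmc_sel_list() {
    "$IPMITOOL" sel elist
}
```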

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox

kurdam · Jan 4, 2023

Yes, that is the next move, to check all these logs.
