after server crash, all logs were gone

niggolas · Mar 20, 2023

Hey there, last week I had an incident where one of my Proxmox servers crashed and rebuilt itself automatically.
The cause has not been determined yet.
It was not a planned restart.

It happened on 03-15 at 6:47 AM (visible from another server in the cluster), and all of the server's logs start at 6:48. All logs before that point in time are gone.
Logs checked: journalctl, /var/log/messages

I checked the HDDs; all fine according to smartctl.
CPU usage was never above 5%.
Sadly, RAM had only around 7 of 128 GiB free.

What could cause a server to lose all of its logs?
Could the lack of memory be the reason?

Thanks in advance!

Chris · Mar 20, 2023

Hi,
what command did you use exactly to show the logs? /var/log/messages might have been rotated, check if there are /var/log/messages.1 and/or zipped messages files. Also check the content of /var/log/journal, where the systemd-journald logs are stored.
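
These checks can be sketched as two small shell helpers. This is only a minimal sketch, not Proxmox tooling: the function names `list_messages_files` and `journal_storage_hint` are made up here, and the persistent/volatile hint assumes journald runs with its default `Storage=auto`, where the journal is kept on disk only if /var/log/journal exists.

```shell
#!/bin/sh
# Hypothetical helpers (names invented for this sketch).

# Print the current and rotated messages files in a log directory,
# oldest first by modification time. Defaults to /var/log.
list_messages_files() {
    ls -1tr "${1:-/var/log}"/messages* 2>/dev/null
}

# Assumption: journald uses its default Storage=auto. In that mode an
# existing /var/log/journal directory means the journal is written to
# disk; without it the journal lives in /run/log/journal and does not
# survive a reboot.
journal_storage_hint() {
    if [ -d "${1:-/var/log/journal}" ]; then
        echo persistent
    else
        echo volatile
    fi
}
```

Each file printed by `list_messages_files` can then be opened with `zless`, which reads both plain and gzipped files.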

niggolas · Mar 20, 2023

Hey,
I used vi /var/log/messages
I have also checked the older and zipped logs now.
I could find logs from a few weeks ago, but the complete time span from March 12 to 15, up to that server "crash", is gone.

I still don't know why, or what happened there.

Chris · Mar 20, 2023

Try using journalctl --since <date> --until <date>. I don't see any reason why the logs should go missing other than disk failure or manual intervention.

pille99 · Mar 20, 2023

Incidents like this, especially ones with a financial impact, help to improve the environment.
Think about a log server. Then this cannot happen anymore, and in case of a liability claim you have proof of what happened.
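
A minimal sketch of the log-server idea, assuming rsyslog is installed on the node (it may not be on newer Proxmox VE releases) and that a collector is listening on TCP port 514. The host name `logserver.example`, the config file name, and the function name are placeholders, not something from this thread.

```shell
#!/bin/sh
# Write a single rsyslog forwarding rule to a config file.
# "*.*" selects all facilities and priorities; "@@" forwards over TCP
# (a single "@" would use UDP instead).
write_forward_rule() {
    # $1 = log server host (hypothetical), $2 = config file to create
    printf '*.* @@%s:514\n' "$1" > "$2"
}
```

On a real node this could be called as `write_forward_rule logserver.example /etc/rsyslog.d/90-forward.conf`, followed by `systemctl restart rsyslog`. With the entries also stored on a second machine, a crash of the node no longer takes the only copy with it.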

niggolas · Mar 20, 2023

I used that command: sudo journalctl --since "2023-03-12 00:00:00" --until "2023-03-17 00:00:00"


-- Logs begin at Wed 2023-03-15 06:48:43 CET, end at Mon 2023-03-20 10:08:38 CET. --

It is still not showing anything before last week's incident on 03-15 at 6:48.
Was the command correct? (It worked on a different server.)
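
For what it's worth, the `--since`/`--until` syntax in that command is valid; the `-- Logs begin at ...` header shows the oldest entry journald still holds on this host. A hedged sketch for checking which boots the journal retains, using the standard `--list-boots`, `-b`, and `-n` options of journalctl (the wrapper names are invented here):

```shell
#!/bin/sh
# List every boot the journal still has entries for, one per line.
retained_boots() {
    journalctl --list-boots --no-pager
}

# Show the last N lines (default 50) of the boot before the current one.
# "-b -1" selects the previous boot.
previous_boot_tail() {
    journalctl -b -1 -n "${1:-50}" --no-pager
}
```

If `retained_boots` prints only the current boot (offset 0), there is no earlier journal data left on disk to query, whatever date range is given.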

pille99 · Mar 20, 2023

What are you searching for? You checked the files themselves and confirmed that the time span is missing. Then they are gone. Look in the logs after the incident for something like a "clear" job or a delete.

niggolas · Mar 20, 2023

There was only one line to be found containing "clear", the one about the BMC:

Code:

XXX@pmoc2:~$ sudo journalctl | grep clear
Mar 15 06:48:44 pmoc2 kernel: ipmi_si IPI0001:00: The BMC does not support clearing the recv irq bit, compensating, but the BMC needs to be fixed.
Mar 20 11:16:18 pmoc2 sudo[5250]: XXX : TTY=pts/0 ; PWD=/home/XXX ; USER=root ; COMMAND=/bin/journalctl -t clear*
XXX@pmoc2:~$ sudo journalctl | grep delete
XXX@pmoc2:~$

niggolas · Mar 20, 2023

What I am searching for is at least a hint as to why that happened, i.e. why that Proxmox server crashed and rebuilt all the production VMs.
I have no hint from monitoring, except that RAM was nearly full with 121 GiB of 128 GiB in use.
And I have no hint from the logs, because there are none for the three days leading up to the incident.

The HDDs are still fine and running. Nobody touched the server. What could it be?
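
One thing that can be checked after the fact: when journald finds a journal file that was not closed cleanly, it sets the file aside with a trailing `~` in its name, and `journalctl --verify` checks the journal files for consistency. A sketch, assuming the journal lives under /var/log/journal (the function name is invented for this example):

```shell
#!/bin/sh
# Print journal files that journald set aside after an unclean shutdown
# or detected corruption. They carry a "~" suffix, e.g. "*.journal~".
list_unclean_journals() {
    find "${1:-/var/log/journal}" -type f -name '*.journal~' 2>/dev/null
}
```

A file found this way can be read directly with `journalctl --file <path>`. If nothing is found and `journalctl --verify` reports no problems, the missing days were probably never written to these files in the first place.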

robb01 · Jun 6, 2023

I had a very similar situation, which may have been the result of a short power failure, but I have no proof of that. All logs from before the restart time were gone, including messages* and syslog*.

The last line of journalctl before the crash was:

Code:

Jun 06 11:42:56 pve smartd[623]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 69 to 70

which tells me nothing about the cause.
The Proxmox device was re-powered manually when it was found to be down.
Can I do something to improve my config?
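
One hedged option, not confirmed by anyone in this thread: make the journal explicitly persistent and have it synced to disk more often. The `Storage=` and `SyncIntervalSec=` settings are documented in journald.conf(5), where the sync interval defaults to 5m. The 30s value, the file name `persistent.conf`, and the helper name are arbitrary choices for this sketch.

```shell
#!/bin/sh
# Write a journald drop-in that forces on-disk storage and shortens
# the interval at which journal files are synced to disk.
write_journald_dropin() {
    # $1 = drop-in directory (defaults to the standard journald location)
    dir="${1:-/etc/systemd/journald.conf.d}"
    mkdir -p "$dir" || return 1
    cat > "$dir/persistent.conf" <<'EOF'
[Journal]
Storage=persistent
SyncIntervalSec=30s
EOF
}
```

Run as root, `write_journald_dropin` followed by `systemctl restart systemd-journald` would apply it. This only narrows the window of entries that can be lost in a hard power cut; it would not by itself explain days of missing logs.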

Herkz · Jun 15, 2023

I don't like to hijack posts, but I had the same problem just yesterday with an unresponsive host and just this line in journalctl.
After rebooting manually everything seems fine but with no logs I'm worried.

Polyphemus · Oct 6, 2023

Same problem here on my 8.0.4 node. Read-only file system error, no /var/log/messages or kern.log. Nothing in journalctl, only the logging from my hard power cycle:

Code:

Oct 06 06:25:19 frigate-nuc systemd[1]: apt-daily-upgrade.service: Deactivated successfully.
Oct 06 06:25:19 frigate-nuc systemd[1]: Finished apt-daily-upgrade.service - Daily apt upgrade and c>
-- Boot 2f43176818254bbda8457d81da9f135a --
Oct 06 08:04:15 frigate-nuc kernel: microcode: microcode updated early to revision 0x26, date = 2019>
Oct 06 08:04:15 frigate-nuc kernel: Linux version 6.2.16-15-pve (build@proxmox) (gcc (Debian 12.2.0->
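
The read-only file system error is relevant to the missing logs: once the file system holding /var/log is remounted read-only, nothing written afterwards can reach the disk. While a host is still in that state, the mount options show it. A sketch that parses /proc/mounts (the function name is made up, and the second argument exists only so the parsing can be tried on a sample file):

```shell
#!/bin/sh
# Return success (0) if the given mount point is mounted read-only.
# $1 = mount point (default "/"), $2 = mounts table (default /proc/mounts)
is_mounted_readonly() {
    awk -v mp="${1:-/}" '
        # Field 2 is the mount point, field 4 the comma-separated options.
        $2 == mp {
            n = split($4, opts, ",")
            ro = 0
            for (i = 1; i <= n; i++) if (opts[i] == "ro") ro = 1
        }
        END { exit (ro ? 0 : 1) }
    ' "${2:-/proc/mounts}"
}
```

For example, `is_mounted_readonly / && echo "root is read-only"` run on a hanging host (if a shell is still reachable) would confirm this state before the power cycle wipes the evidence.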

kardigan42 · Nov 23, 2023

Hello! Just wanted to chime in. Same issue here on my 7.4-16 node. The last clue was SMART reporting a temperature change, then regular logs, and then all of a sudden nothing up until I rebooted the node manually:


Nov 03 04:53:33 host2 audit[3620725]: SYSCALL arch=c000003e syscall=54 succes>
Nov 03 04:53:33 host2 audit: PROCTITLE proctitle="ebtables-restore"
-- Boot a4c7306ad8e64c8b9360e4c82d18cd40 --
Nov 23 13:12:50 host2 kernel: Linux version 5.15.116-1-pve (build@proxmox) (g>
Nov 23 13:12:50 host2 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.116>

Nothing of use in any of the logs that I could see such as /var/log/messages and messages.1, /var/log/kern.log etc.

After rebooting, everything works fine, but like others mentioned, the lack of log messages makes me curious too.
