Random Reboots - what to try next?

koyaan134 · Sep 18, 2024

Hoping someone could point me in the right direction with some issues I've been having with Proxmox (I'm a beginner).

I recently moved my server and updated Proxmox to the latest kernel. The next day, I noticed that Proxmox was randomly rebooting at sporadic times (between every 10 minutes and 1 hour).

At the same time, the Thread co-processor I attached via USB was failing to be recognized, and I noticed an error about USB power. I removed the USB device but the reboots persisted.

I also tried:

Running memtest
Trying new PSU / different outlet on power strip
Downgrading to use an older kernel
Updating BIOS
Checking system resources / temperatures

All were normal and didn't change the reboot problem.

I tried digging through syslog but I'm not really sure what to look for. What I can say is that I'm not seeing any critical errors before the reboot occurs. I do see this at setup:

Code:

ACPI BIOS Error (bug): AE_AML_BUFFER_LIMIT, Field [CAP1] at bit offset/length 64/32 exceeds size of target Buffer (64 bits)
ACPI Error: Aborting method \_SB._OSC due to previous error (AE_AML_BUFFER_LIMIT)

But I'm not sure if that has any bearing on this problem. Any idea where I should start? Happy to do another syslog dump if needed

Wasca · Sep 18, 2024

I'm also experiencing the random reboots but not as often as you. I have 3 identical servers; so far one has been up for 28 days and the other two are lasting between 5 days and 2 weeks before they reboot (-- Reboot --). I'm following along with your thread because it's driving us crazy trying to figure it out. I've done all the tests you've done, even swapped RAM between the servers, and am still no closer to finding out what is triggering it.

esi_y · Sep 18, 2024

koyaan134 said:
I recently moved my server and updated Proxmox to the latest kernel. The next day, I noticed that Proxmox was randomly rebooting at sporadic times (between every 10 minutes and 1 hour).

So, previous kernel works fine?

esi_y · Sep 18, 2024

Wasca said:
I'm also experiencing the random reboots but not as often as you. I have 3 identical servers; so far one has been up for 28 days and the other two are lasting between 5 days and 2 weeks before they reboot (-- Reboot --). I'm following along with your thread because it's driving us crazy trying to figure it out. I've done all the tests you've done, even swapped RAM between the servers, and am still no closer to finding out what is triggering it.

How about sharing the last log entries before the new boot log starts?

Wasca · Sep 19, 2024

esi_y said:
How about sharing the last log entries before the new boot log starts?

Nothing really stands out before the --Reboot--

Code:

Sep 18 00:00:46 pmx01 systemd[1]: Finished logrotate.service - Rotate log files.
Sep 18 00:17:01 pmx01 CRON[837996]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 18 00:17:01 pmx01 CRON[837997]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 18 00:17:01 pmx01 CRON[837996]: pam_unix(cron:session): session closed for user root
Sep 18 00:43:10 pmx01 pmxcfs[1188]: [dcdb] notice: data verification successful
Sep 18 01:17:01 pmx01 CRON[858759]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 18 01:17:01 pmx01 CRON[858760]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 18 01:17:01 pmx01 CRON[858759]: pam_unix(cron:session): session closed for user root
Sep 18 01:26:46 pmx01 systemd[1]: Starting pve-daily-update.service - Daily PVE download activities...
Sep 18 01:26:48 pmx01 pveupdate[862120]: <root@pam> starting task UPID:pmx01:000D27AD:00DFBC7D:66E99FB8:aptupdate::root@pam:
Sep 18 01:26:49 pmx01 pveupdate[862125]: update new package list: /var/lib/pve-manager/pkgupdates
Sep 18 01:26:50 pmx01 pveupdate[862120]: <root@pam> end task UPID:pmx01:000D27AD:00DFBC7D:66E99FB8:aptupdate::root@pam: OK
Sep 18 01:26:50 pmx01 pveupdate[862120]: Custom certificate does not expire soon, skipping ACME renewal.
Sep 18 01:26:50 pmx01 systemd[1]: pve-daily-update.service: Deactivated successfully.
Sep 18 01:26:50 pmx01 systemd[1]: Finished pve-daily-update.service - Daily PVE download activities.
Sep 18 01:26:50 pmx01 systemd[1]: pve-daily-update.service: Consumed 2.074s CPU time.
Sep 18 01:28:57 pmx01 pmxcfs[1188]: [status] notice: received log
Sep 18 01:28:59 pmx01 pmxcfs[1188]: [status] notice: received log
Sep 18 01:43:10 pmx01 pmxcfs[1188]: [dcdb] notice: data verification successful
Sep 18 02:00:04 pmx01 pvescheduler[873958]: <root@pam> starting task UPID:pmx01:000D55E7:00E2C813:66E9A784:vzdump:1200:root@pam:
Sep 18 02:00:04 pmx01 pvescheduler[873959]: INFO: starting new backup job: vzdump 1200 --quiet 1 --notes-template '{{guestname}}' --fleecing 0 --mailnotification failure --node pmx01 --mode snapshot --prune-backups 'keep-daily=7,keep-monthly=3,keep-weekly=4' --storage PBS-BACKUPS --mailto xxx@xxx
Sep 18 02:00:04 pmx01 pvescheduler[873959]: INFO: Starting Backup of VM 1200 (lxc)
Sep 18 02:00:04 pmx01 dmeventd[626]: No longer monitoring thin pool fast--vm-fast--vm-tpool.
Sep 18 02:00:04 pmx01 dmeventd[626]: Monitoring thin pool fast--vm-fast--vm-tpool.
Sep 18 02:00:05 pmx01 kernel: EXT4-fs (dm-17): mounted filesystem 43d50acd-a6b1-4b33-ae48-ddaeafff1782 ro without journal. Quota mode: none.
Sep 18 02:00:12 pmx01 kernel: EXT4-fs (dm-17): unmounting filesystem 43d50acd-a6b1-4b33-ae48-ddaeafff1782.
Sep 18 02:00:13 pmx01 pvescheduler[873959]: INFO: Finished Backup of VM 1200 (00:00:09)
Sep 18 02:00:13 pmx01 pvescheduler[873959]: INFO: Backup job finished successfully
Sep 18 02:17:01 pmx01 CRON[880038]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 18 02:17:01 pmx01 CRON[880039]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 18 02:17:01 pmx01 CRON[880038]: pam_unix(cron:session): session closed for user root
Sep 18 02:43:10 pmx01 pmxcfs[1188]: [dcdb] notice: data verification successful
Sep 18 02:55:27 pmx01 pmxcfs[1188]: [status] notice: received log
Sep 18 02:55:29 pmx01 pmxcfs[1188]: [status] notice: received log
Sep 18 03:00:00 pmx01 pmxcfs[1188]: [status] notice: received log
Sep 18 03:10:01 pmx01 CRON[898363]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 18 03:10:01 pmx01 CRON[898364]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /sbin/e2scrub_all -A -r)
Sep 18 03:10:01 pmx01 CRON[898363]: pam_unix(cron:session): session closed for user root
Sep 18 03:17:01 pmx01 CRON[900778]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 18 03:17:01 pmx01 CRON[900779]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 18 03:17:01 pmx01 CRON[900778]: pam_unix(cron:session): session closed for user root
Sep 18 03:34:29 pmx01 kernel: vmbr0: left promiscuous mode
Sep 18 03:34:29 pmx01 wol_hack.sh[1086]: Captured magic packet for address: "00:00:00:00:03:70"
Sep 18 03:34:29 pmx01 wol_hack.sh[1086]: Looking for existing VM: 0 found
Sep 18 03:34:29 pmx01 wol_hack.sh[1086]: Looking for existing LXC: 0 found
Sep 18 03:34:34 pmx01 kernel: vmbr0: entered promiscuous mode
Sep 18 03:43:10 pmx01 pmxcfs[1188]: [dcdb] notice: data verification successful
Sep 18 04:00:03 pmx01 pmxcfs[1188]: [status] notice: received log
Sep 18 04:17:01 pmx01 CRON[921520]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 18 04:17:01 pmx01 CRON[921521]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 18 04:17:01 pmx01 CRON[921520]: pam_unix(cron:session): session closed for user root
Sep 18 04:43:10 pmx01 pmxcfs[1188]: [dcdb] notice: data verification successful
Sep 18 04:48:46 pmx01 systemd[1]: Starting apt-daily.service - Daily apt download activities...
Sep 18 04:48:46 pmx01 systemd[1]: apt-daily.service: Deactivated successfully.
Sep 18 04:48:46 pmx01 systemd[1]: Finished apt-daily.service - Daily apt download activities.
Sep 18 05:17:01 pmx01 CRON[942292]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 18 05:17:01 pmx01 CRON[942293]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 18 05:17:01 pmx01 CRON[942292]: pam_unix(cron:session): session closed for user root
Sep 18 05:43:10 pmx01 pmxcfs[1188]: [dcdb] notice: data verification successful
Sep 18 06:17:01 pmx01 CRON[963005]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 18 06:17:01 pmx01 CRON[963006]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 18 06:17:01 pmx01 CRON[963005]: pam_unix(cron:session): session closed for user root
Sep 18 06:25:01 pmx01 CRON[965769]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 18 06:25:01 pmx01 CRON[965770]: (root) CMD (test -x /usr/sbin/anacron || { cd / && run-parts --report /etc/cron.daily; })
Sep 18 06:25:01 pmx01 CRON[965769]: pam_unix(cron:session): session closed for user root
Sep 18 06:43:10 pmx01 pmxcfs[1188]: [dcdb] notice: data verification successful
Sep 18 06:51:46 pmx01 systemd[1]: Starting apt-daily-upgrade.service - Daily apt upgrade and clean activities...
Sep 18 06:51:46 pmx01 systemd[1]: apt-daily-upgrade.service: Deactivated successfully.
Sep 18 06:51:46 pmx01 systemd[1]: Finished apt-daily-upgrade.service - Daily apt upgrade and clean activities.
Sep 18 07:17:01 pmx01 CRON[983794]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 18 07:17:01 pmx01 CRON[983795]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 18 07:17:01 pmx01 CRON[983794]: pam_unix(cron:session): session closed for user root
Sep 18 07:43:10 pmx01 pmxcfs[1188]: [dcdb] notice: data verification successful
Sep 18 08:17:01 pmx01 CRON[1004520]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 18 08:17:01 pmx01 CRON[1004521]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 18 08:17:01 pmx01 CRON[1004520]: pam_unix(cron:session): session closed for user root
Sep 18 08:43:10 pmx01 pmxcfs[1188]: [dcdb] notice: data verification successful
Sep 18 09:00:47 pmx01 systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...
Sep 18 09:00:47 pmx01 systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
Sep 18 09:00:47 pmx01 systemd[1]: Finished systemd-tmpfiles-clean.service - Cleanup of Temporary Directories.
Sep 18 09:00:47 pmx01 systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount: Deactivated successfully.
Sep 18 09:17:01 pmx01 CRON[1026320]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 18 09:17:01 pmx01 CRON[1026321]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 18 09:17:01 pmx01 CRON[1026320]: pam_unix(cron:session): session closed for user root
Sep 18 09:43:10 pmx01 pmxcfs[1188]: [dcdb] notice: data verification successful
Sep 18 09:43:46 pmx01 systemd[1]: Starting man-db.service - Daily man-db regeneration...
Sep 18 09:43:46 pmx01 systemd[1]: man-db.service: Deactivated successfully.
Sep 18 09:43:46 pmx01 systemd[1]: Finished man-db.service - Daily man-db regeneration.
Sep 18 10:07:46 pmx01 systemd[1]: Starting apt-daily.service - Daily apt download activities...
Sep 18 10:07:46 pmx01 systemd[1]: apt-daily.service: Deactivated successfully.
Sep 18 10:07:46 pmx01 systemd[1]: Finished apt-daily.service - Daily apt download activities.
-- Reboot --
Sep 18 10:09:45 pmx01 kernel: Linux version 6.8.12-1-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z) ()
Sep 18 10:09:45 pmx01 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-1-pve root=/dev/mapper/pve-root ro quiet iommu=pt pci=assign-busses apicmaintimer idle=poll reboot=cold,hard
Sep 18 10:09:45 pmx01 kernel: KERNEL supported cpus:
Sep 18 10:09:45 pmx01 kernel:   Intel GenuineIntel
Sep 18 10:09:45 pmx01 kernel:   AMD AuthenticAMD
Sep 18 10:09:45 pmx01 kernel:   Hygon HygonGenuine
Sep 18 10:09:45 pmx01 kernel:   Centaur CentaurHauls
Sep 18 10:09:45 pmx01 kernel:   zhaoxin   Shanghai
Sep 18 10:09:45 pmx01 kernel: BIOS-provided physical RAM map:

esi_y · Sep 19, 2024

Wasca said:

Nothing really stands out before the --Reboot--

Code:

Sep 18 09:43:10 pmx01 pmxcfs[1188]: [dcdb] notice: data verification successful
Sep 18 09:43:46 pmx01 systemd[1]: Starting man-db.service - Daily man-db regeneration...
Sep 18 09:43:46 pmx01 systemd[1]: man-db.service: Deactivated successfully.
Sep 18 09:43:46 pmx01 systemd[1]: Finished man-db.service - Daily man-db regeneration.
Sep 18 10:07:46 pmx01 systemd[1]: Starting apt-daily.service - Daily apt download activities...
Sep 18 10:07:46 pmx01 systemd[1]: apt-daily.service: Deactivated successfully.
Sep 18 10:07:46 pmx01 systemd[1]: Finished apt-daily.service - Daily apt download activities.
-- Reboot --
Sep 18 10:09:45 pmx01 kernel: Linux version 6.8.12-1-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z) ()
Sep 18 10:09:45 pmx01 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-1-pve root=/dev/mapper/pve-root ro quiet iommu=pt pci=assign-busses apicmaintimer idle=poll reboot=cold,hard
Sep 18 10:09:45 pmx01 kernel: KERNEL supported cpus:
Sep 18 10:09:45 pmx01 kernel:   Intel GenuineIntel
Sep 18 10:09:45 pmx01 kernel:   AMD AuthenticAMD
Sep 18 10:09:45 pmx01 kernel:   Hygon HygonGenuine
Sep 18 10:09:45 pmx01 kernel:   Centaur CentaurHauls
Sep 18 10:09:45 pmx01 kernel:   zhaoxin   Shanghai
Sep 18 10:09:45 pmx01 kernel: BIOS-provided physical RAM map:

May I ask where this log is from? I am a bit surprised by the "-- Reboot --"; I think it has not been used by systemd-journald for a while (it now appears as "-- Boot $ID" as the log delimiter). Are these recently updated nodes? If so, how were they updated?

Wasca · Sep 20, 2024

esi_y said:
May I ask where this log is from? I am a bit surprised by the "-- Reboot --"; I think it has not been used by systemd-journald for a while (it now appears as "-- Boot $ID" as the log delimiter). Are these recently updated nodes? If so, how were they updated?

The Log is from the System => System Log menu. I update the nodes from the Updates menu. I'm currently running ProxmoxVE 8.2.5 after updating today.

koyaan134 · Sep 20, 2024

Hi all - sorry for delay getting back - was sick for a few days. @Wasca - experiencing same thing where not seeing anything of note before -- Reboot -- looking through the System => System Log.

@esi_y - I was having rebooting problems with the previous kernels too. I also updated these from the updates menu recently but am fairly on top of updates (within a few weeks). Am on PVE 8.2.4.

I'll say that something kinda weird happened - after all of these problems and trying this that and the other thing from my original post, I left the server alone but with the VMs powered on and using a few services (but not logging in to Proxmox and whatnot) for the past two days while I was sick and ... no reboots. No idea why that would be!

Here's my syslog output before and after the last reboot

Code:

Sep 18 09:30:48 koyaan systemd[1]: systemd-fsckd.service: Deactivated successfully.
Sep 18 09:45:22 koyaan systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...
Sep 18 09:45:22 koyaan systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
Sep 18 09:45:22 koyaan systemd[1]: Finished systemd-tmpfiles-clean.service - Cleanup of Temporary Directories.
Sep 18 09:45:22 koyaan systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount: Deactivated successfully.
Sep 18 10:17:01 koyaan CRON[8667]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 18 10:17:01 koyaan CRON[8668]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 18 10:17:01 koyaan CRON[8667]: pam_unix(cron:session): session closed for user root
Sep 18 11:17:01 koyaan CRON[18162]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 18 11:17:01 koyaan CRON[18163]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 18 11:17:01 koyaan CRON[18162]: pam_unix(cron:session): session closed for user root
Sep 18 12:17:01 koyaan CRON[27610]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 18 12:17:01 koyaan CRON[27611]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 18 12:17:01 koyaan CRON[27610]: pam_unix(cron:session): session closed for user root
-- Reboot --
Sep 18 12:56:03 koyaan kernel: Linux version 6.5.13-6-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.5.13-6 (2024-07-26T12:34Z) ()
Sep 18 12:56:03 koyaan kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.5.13-6-pve root=/dev/mapper/pve-root ro quiet
Sep 18 12:56:03 koyaan kernel: KERNEL supported cpus:
Sep 18 12:56:03 koyaan kernel:   Intel GenuineIntel
Sep 18 12:56:03 koyaan kernel:   AMD AuthenticAMD
Sep 18 12:56:03 koyaan kernel:   Hygon HygonGenuine
Sep 18 12:56:03 koyaan kernel:   Centaur CentaurHauls
Sep 18 12:56:03 koyaan kernel:   zhaoxin   Shanghai 
Sep 18 12:56:03 koyaan kernel: BIOS-provided physical RAM map:

koyaan134 · Sep 21, 2024

Well, for what it's worth - crashing has resumed after about 4 days total

esi_y · Sep 23, 2024

My bad, the "-- Reboot --" is apparently something that shows up in the GUI accessible log. I really do not know if it's not otherwise filtered, but in any case a full boot log "start to crash" might be something to start with (in both cases).

Usually, I would use journalctl --list-boots, then pick a boot sequence which I know ended in crash and get full log (on a machine this idle) with e.g. journalctl -b -1 > attach.log - i.e. that's for the last boot sequence before current, -2 would be one before, etc.
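As a rough sketch of that workflow (the helper name save_boot_log and the file name attach.log are only illustrative, and it assumes the journal is kept across reboots so that older boots are still listed):

```shell
#!/bin/sh
# Save the full journal of one earlier boot to a file for attaching to a post.
# $1 is the boot offset relative to the current boot:
#   -1 = the boot before the current one, -2 = the one before that, etc.
# $2 is the output file.
save_boot_log() {
    offset="$1"
    out="$2"
    journalctl -b "$offset" --no-pager > "$out"
}

# Typical use on the affected host:
#   journalctl --list-boots        # find the boot that ended in a crash
#   save_boot_log -1 attach.log    # dump that boot from start to crash
```

The resulting file covers the whole boot sequence rather than only the last few lines before the reboot.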

In both cases, @koyaan134 and @Wasca, it will be like reading tea leaves unless you eliminate some of the usual causes (you do not need to if you are lucky with the log, but you do if there is no giveaway). I would typically boot the machine on some related system with a different kernel (e.g. Ubuntu) and see how it performs; others would run memtests, disable C-states, etc. The most important thing, though, is to know when this started happening, because a new machine where nothing is known about compatibility or hardware stability is a very different case from a machine that just suddenly started doing this.

Also, mentioning the exact hardware might help someone recognise common issues with it and chip in. But the first thing of all, for me, is to try a different kernel; there's a huge thread dedicated to the latest one, with issues like this, at the current time.

Wasca · Sep 23, 2024

esi_y said:
My bad, the "-- Reboot --" is apparently something that shows up in the GUI accessible log. I really do not know if it's not otherwise filtered, but in any case a full boot log "start to crash" might be something to start with (in both cases).

Usually, I would use journalctl --list-boots, then pick a boot sequence which I know ended in crash and get full log (on a machine this idle) with e.g. journalctl -b -1 > attach.log - i.e. that's for the last boot sequence before current, -2 would be one before, etc.

In both cases, @koyaan134 and @Wasca, it will be like reading tea leaves unless you eliminate some of the usual causes (you do not need to if you are lucky with the log, but you do if there is no giveaway). I would typically boot the machine on some related system with a different kernel (e.g. Ubuntu) and see how it performs; others would run memtests, disable C-states, etc. The most important thing, though, is to know when this started happening, because a new machine where nothing is known about compatibility or hardware stability is a very different case from a machine that just suddenly started doing this.

Also, mentioning the exact hardware might help someone recognise common issues with it and chip in. But the first thing of all, for me, is to try a different kernel; there's a huge thread dedicated to the latest one, with issues like this, at the current time.

@esi_y Memtests completed and passed, C-states disabled in the BIOS. All new hardware...

CPU - AMD Ryzen 9 7900 12-Core Processor
MB - Gigabyte Technology Co., Ltd. | Product Name: B650M DS3H AM5 (On board Realtek 2.5GB NIC **Disabled**)
RAM - Corsair Vengeance 96GB (2x48GB) C40 5600MHz DDR5 RAM
OS Drive - Samsung 980 Pro 1TB PCIe Gen4 M.2 2280 NVMe
VM Drive - Samsung 990 Pro 4TB PCIe 4.0 M.2 2280 NVMe
Network - TP-Link 10Gbps PCIe Ethernet Network Card (TX401)

I'm going to wait for another random reboot event to occur before posting the journal logs. I can't recall the exact date it last happened, and I've been manually rebooting the servers recently.

@esi_y can you link us to the forum thread about the different kernels?

esi_y · Sep 23, 2024

Wasca said:
I'm going to wait for another random reboot event to occur before posting the journal logs. I can't recall the exact date it last happened, and I've been manually rebooting the servers recently.

Sometimes there are giveaways even during normal boot / operation. Anyhow, you could try something like:

Code:

journalctl -b -3 -n 100 | grep -e "Shutting down" -e "reboot.target"

This looks at the boot sequence "3 boots ago", takes the last 100 lines of its log, and keeps only lines containing "Shutting down" or "reboot.target". If you do not see "Shutting down", the machine did not go down in an orderly way, so that boot sequence could be of interest. If you have many boots, you can search the entire log (without the -b and -n options), grep only for those words, and compare the timestamps with those in the --list-boots list: whenever a boot ends and there was no "Shutting down" around that time, the shutdown was abrupt.
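The same check can be wrapped in a small loop over several boots. This is only a sketch: the function names ended_cleanly and classify_boot are made up for illustration, and it relies on the same two marker strings as the command above, so a boot flagged here is a candidate for a closer look rather than proof of a crash.

```shell
#!/bin/sh
# Return 0 if the log text on stdin contains an orderly-shutdown marker.
ended_cleanly() {
    grep -q -e "Shutting down" -e "reboot.target"
}

# Classify one earlier boot by the last 100 lines of its journal.
# $1 is the boot offset, e.g. -1, -2, -3.
classify_boot() {
    if journalctl -b "$1" -n 100 --no-pager | ended_cleanly; then
        echo "boot $1: orderly shutdown"
    else
        echo "boot $1: no shutdown marker (possibly abrupt)"
    fi
}

# Typical use on the affected host:
#   for n in -1 -2 -3; do classify_boot "$n"; done
```

Any boot reported without a shutdown marker is the one worth dumping in full and sharing.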

Wasca said:
@esi_y can you link us to the forum thread about the different kernels?

I have seen several threads, not that I keep track of them individually, but literally in the current one (pinned):
https://forum.proxmox.com/threads/o...st-no-subscription.144557/page-11#post-691134

And you can go backwards from most recent.

Wasca · Sep 23, 2024

esi_y said:
Sometimes there are giveaways even during normal boot / operation. Anyhow, you could try something like:

Code:

journalctl -b -3 -n 100 | grep -e "Shutting down" -e "reboot.target"

This looks at the boot sequence "3 boots ago", takes the last 100 lines of its log, and keeps only lines containing "Shutting down" or "reboot.target". If you do not see "Shutting down", the machine did not go down in an orderly way, so that boot sequence could be of interest. If you have many boots, you can search the entire log (without the -b and -n options), grep only for those words, and compare the timestamps with those in the --list-boots list: whenever a boot ends and there was no "Shutting down" around that time, the shutdown was abrupt.

I have seen several threads, not that I keep track of them individually, but literally in the current one (pinned):
https://forum.proxmox.com/threads/o...st-no-subscription.144557/page-11#post-691134

And you can go backwards from most recent.

Thanks I'll do some more hunting.

In the meantime, to remove another possible culprit, I've disabled the WoL service I was running on my hosts. Let's see if that makes any difference.

Wasca · Oct 2, 2024

@esi_y I got 12 days uptime on one of my Proxmox hosts after I turned off the WoL service I was running, before that host did a sudden and unplanned reboot.

Here is the journal log just before the reboot.

Code:

Oct 02 06:55:39 pmx03 pmxcfs[1172]: [dcdb] notice: data verification successful
Oct 02 06:56:14 pmx03 systemd[1]: Starting apt-daily-upgrade.service - Daily apt upgrade and clean activities...
Oct 02 06:56:14 pmx03 systemd[1]: apt-daily-upgrade.service: Deactivated successfully.
Oct 02 06:56:14 pmx03 systemd[1]: Finished apt-daily-upgrade.service - Daily apt upgrade and clean activities.
Oct 02 07:17:01 pmx03 CRON[2975883]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 02 07:17:01 pmx03 CRON[2975884]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Oct 02 07:17:01 pmx03 CRON[2975883]: pam_unix(cron:session): session closed for user root
Oct 02 07:55:39 pmx03 pmxcfs[1172]: [dcdb] notice: data verification successful
Oct 02 08:01:49 pmx03 systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...
Oct 02 08:01:49 pmx03 systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
Oct 02 08:01:49 pmx03 systemd[1]: Finished systemd-tmpfiles-clean.service - Cleanup of Temporary Directories.
Oct 02 08:01:49 pmx03 systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount: Deactivated successfully.
Oct 02 08:17:01 pmx03 CRON[3001558]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 02 08:17:01 pmx03 CRON[3001559]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Oct 02 08:17:01 pmx03 CRON[3001558]: pam_unix(cron:session): session closed for user root
Oct 02 08:55:39 pmx03 pmxcfs[1172]: [dcdb] notice: data verification successful

And the next journal log as the server rebooted.

Code:

Oct 02 09:06:24 pmx03 kernel: Linux version 6.8.12-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-2 (2024-09-05T10:03Z) ()
Oct 02 09:06:24 pmx03 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-2-pve root=/dev/mapper/pve-root ro quiet iommu=pt pci=assign-busses apicmaintimer idle=poll reboot=cold,hard
Oct 02 09:06:24 pmx03 kernel: KERNEL supported cpus:
Oct 02 09:06:24 pmx03 kernel:   Intel GenuineIntel
Oct 02 09:06:24 pmx03 kernel:   AMD AuthenticAMD
Oct 02 09:06:24 pmx03 kernel:   Hygon HygonGenuine
Oct 02 09:06:24 pmx03 kernel:   Centaur CentaurHauls
Oct 02 09:06:24 pmx03 kernel:   zhaoxin   Shanghai
Oct 02 09:06:24 pmx03 kernel: BIOS-provided physical RAM map:
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x0000000000100000-0x0000000009afefff] usable
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x0000000009aff000-0x0000000009ffffff] reserved
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x000000000a200000-0x000000000a20ffff] ACPI NVS
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x000000000a210000-0x000000000affffff] usable
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x000000000b000000-0x000000000b020fff] reserved
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x000000000b021000-0x000000007e278fff] usable
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x000000007e279000-0x0000000084278fff] reserved
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x0000000084279000-0x000000008447efff] ACPI data
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x000000008447f000-0x000000008647efff] ACPI NVS
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x000000008647f000-0x000000008e47efff] reserved
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x000000008e47f000-0x000000008e5fefff] type 20
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x000000008e5ff000-0x000000008fff8fff] usable
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x000000008fff9000-0x000000008fffdfff] reserved
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x000000008fffe000-0x000000008fffffff] usable
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x0000000090000000-0x0000000097ffffff] reserved
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x000000009d7f3000-0x000000009fffffff] reserved
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x00000000e0000000-0x00000000efffffff] reserved
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x00000000f7000000-0x00000000ffffffff] reserved
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x0000000100000000-0x000000183de7ffff] usable
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x000000183eec0000-0x00000018801fffff] reserved
Oct 02 09:06:24 pmx03 kernel: BIOS-e820: [mem 0x000000fd00000000-0x000000ffffffffff] reserved
Oct 02 09:06:24 pmx03 kernel: process: using polling idle threads

There is nothing that stands out about these two journal logs.

This is what is shown in the System Log menu.

Code:

Oct 02 08:17:01 pmx03 CRON[3001558]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 02 08:17:01 pmx03 CRON[3001559]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Oct 02 08:17:01 pmx03 CRON[3001558]: pam_unix(cron:session): session closed for user root
Oct 02 08:55:39 pmx03 pmxcfs[1172]: [dcdb] notice: data verification successful
-- Reboot --
Oct 02 09:06:24 pmx03 kernel: Linux version 6.8.12-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-2 (2024-09-05T10:03Z) ()
Oct 02 09:06:24 pmx03 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-2-pve root=/dev/mapper/pve-root ro quiet iommu=pt pci=assign-busses apicmaintimer idle=poll reboot=cold,hard
Oct 02 09:06:24 pmx03 kernel: KERNEL supported cpus:
Oct 02 09:06:24 pmx03 kernel:   Intel GenuineIntel
Oct 02 09:06:24 pmx03 kernel:   AMD AuthenticAMD
Oct 02 09:06:24 pmx03 kernel:   Hygon HygonGenuine
Oct 02 09:06:24 pmx03 kernel:   Centaur CentaurHauls
Oct 02 09:06:24 pmx03 kernel:   zhaoxin   Shanghai

esi_y · Oct 2, 2024

Wasca said:
@esi_y I got 12 days uptime on one of my Proxmox hosts after I turned off the WoL service I was running, before that host did a sudden and unplanned reboot.

Here is the journal log just before the reboot.

Hey! Well, I can't really tell much about what might have happened from the last few lines of a boot log, but it looks like the host was doing more or less nothing for hours, if those are the last items in your log. Sometimes it is possible to guess better if you share the entire log. The first second of the startup sequence is also of little help.

I would start with (definitely) a different kernel and the C-states (kernel parameter, set to 0 so that the driver is not in place). Kernel-related issues with the latest one are so frequent that it is almost a waste of time to troubleshoot anything on the current one, at least in my opinion.
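When experimenting with kernels and kernel parameters, it helps to confirm what the running system actually booted with. A minimal sketch follows; the helper name cmdline_value is hypothetical, and the parameters in the usage lines are simply the ones visible in the boot logs posted above, not a recommendation.

```shell
#!/bin/sh
# Print the value of a key=value kernel parameter from a command line on stdin.
# A bare flag such as "quiet" is printed as-is when asked for by name.
# Returns 1 and prints nothing if the parameter is absent.
cmdline_value() {
    tr -s ' ' '\n' | awk -F= -v key="$1" '
        $1 == key { sub(/^[^=]*=/, ""); print; found = 1 }
        END { exit !found }
    '
}

# Typical use on the affected host:
#   uname -r                          # kernel version currently running
#   cmdline_value idle < /proc/cmdline
#   cmdline_value reboot < /proc/cmdline
```

Comparing this output before and after a change makes it clear whether a new parameter or kernel actually took effect on the next boot.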
