mystery reboots

Have you tested the RAM? You can use memtest86+ from the Proxmox ISO. I just mention this as a few years back I had a similar issue with a Linux box (not running proxmox) and finally decided to test the RAM. Turns out one module was bad even though it was bought brand new a few months earlier.
Yes sir, ran it all day and overnight, no issues. "You can use memtest86+ from the Proxmox ISO": yes, did that. I do that first thing because I have been using computers since the BBC Micro lol, so testing RAM in Pentium 1s and Celerons was a must.
 
Have you tried intel_idle.max_cstate=1 (or eventually even 0)?
Well, never tried that. This processor is from when Z790 had just come out, so there was no BIOS fix back then. Now I do have the latest 129 microcode, but if the CPU was damaged before that, there is no way to know. I was thinking of getting a new processor, but then 15th gen is around the corner, so I'd rather wait. Well, let me put it at C-state 1 and try it.
 
I had some 12th gen machines inexplicably freezing on some kernels; limiting the C-states miraculously stopped it. True, they were not rebooting, but that was not PVE; here there are the watchdogs ... and the fact that something does not leave any logs may simply mean the system was frozen before the watchdog expired, so nothing got flushed.
 
Yeah right, there is no BSOD in Linux to create a crash report :)

Edit /etc/default/grub:

Put intel_idle.max_cstate=1 at the end of GRUB_CMDLINE_LINUX_DEFAULT

update-grub
shutdown -r now

Done.
Let's see now.
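
After the reboot it is worth confirming that the parameter really made it onto the kernel command line. A minimal sketch, assuming a GRUB-booted install as in the steps above; `has_kernel_param` is a made-up helper name, and the `quiet` in the comment only stands in for whatever options are already on that line:

```shell
# /etc/default/grub should end up with something like (keep existing options):
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_idle.max_cstate=1"

# Hypothetical helper: succeeds if the exact parameter is present in the
# given command-line file (defaults to the running kernel's /proc/cmdline).
has_kernel_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

if has_kernel_param "intel_idle.max_cstate=1"; then
    echo "parameter is active"
else
    echo "parameter missing, check /etc/default/grub and rerun update-grub"
fi

# The driver also reports the limit it is using, when intel_idle is present.
param_file=/sys/module/intel_idle/parameters/max_cstate
if [ -r "$param_file" ]; then
    cat "$param_file"
fi
```

The helper matches whole words only, so a leftover `intel_idle.max_cstate=0` from an earlier attempt would not be mistaken for the new value.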
 
There is kdump [1], there's quite a few guides out there [2].

But my issue with PVE is that kdump is useless to me there, because I am not going to be building the kernel myself every time [3].

But if your CPU falls asleep like Snow White, I am afraid nothing on the screen would help you, not even capturing the video output (normally you get a stack trace on the screen even when nothing has been flushed to the drives).

[1] https://www.kernel.org/doc/Documentation/kdump/kdump.txt
[2] https://www.cyberciti.biz/faq/how-to-on-enable-kernel-crash-dump-on-debian-linux/
[3] https://forum.proxmox.com/threads/where-to-get-dbg-kernel.141686/#post-634966
 
Also, if you want to be sure it's not the softdog rebooting your machine and rather see it frozen, you may put:

options softdog soft_noboot=1

into:

/etc/modprobe.d/softdog.conf
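
To confirm the option was picked up, the softdog init line in the kernel log reports the value the module loaded with. A rough sketch; `softdog_noboot_from_log` is a name made up for illustration, and it only parses the "softdog: initialized." line format that PVE logs at boot:

```shell
# Write the option (same content as above); needs root.
#   echo "options softdog soft_noboot=1" > /etc/modprobe.d/softdog.conf

# Hypothetical helper: reads kernel log lines on stdin and prints the
# soft_noboot value from the most recent "softdog: initialized." line.
softdog_noboot_from_log() {
    sed -n 's/.*softdog: initialized\. soft_noboot=\([0-9][0-9]*\).*/\1/p' | tail -n 1
}

# After the next reboot, check the current boot's kernel messages.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k -b 0 --no-pager | softdog_noboot_from_log
fi
```

If this still prints 0 after a reboot, the option file was not applied when the module loaded.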
 
Sep 07 15:41:27 Prox1 watchdog-mux[1231]: Watchdog driver 'Software Watchdog', version 0
Sep 07 15:41:27 Prox1 kernel: softdog: initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
Sep 07 15:41:27 Prox1 kernel: softdog: soft_reboot_cmd=<not set> soft_active_on_boot=0

Yeah, I got that. I saw this in many of my logs over the course of months, but I did not look into it because I was focused on hardware and wiring.

Sometimes a loose neutral can cause spikes on the voltage line, or a relay can, like in my APC UPS, which cuts the neutral too in battery mode. If that relay chatters, even just on the neutral, I have seen one system out of 10 randomly reboot. But since that is ruled out now, I am exploring this.

So, for me, right now I have done
# intel_idle.max_cstate=1
If this does not work, I will try
# intel_idle.max_cstate=0
If this also does not work, then I will remove the above and put
# options softdog soft_noboot=1

If it was softdoggie doing it, then it means the system was not frozen and it could have made some log.
And thanks to everyone for the forum help and suggestions.
 
Sep 07 15:41:27 Prox1 watchdog-mux[1231]: Watchdog driver 'Software Watchdog', version 0
Sep 07 15:41:27 Prox1 kernel: softdog: initialized. soft_noboot=0 soft_margin=60 sec soft_panic=0 (nowayout=0)
Sep 07 15:41:27 Prox1 kernel: softdog: soft_reboot_cmd=<not set> soft_active_on_boot=0

This is totally fine, this is just loading the module, it is essentially active on every PVE install.

Yeah, I got that. I saw this in many of my logs over the course of months, but I did not look into it because I was focused on hardware and wiring.

Sometimes a loose neutral can cause spikes on the voltage line, or a relay can, like in my APC UPS, which cuts the neutral too in battery mode. If that relay chatters, even just on the neutral, I have seen one system out of 10 randomly reboot. But since that is ruled out now, I am exploring this.

I see.

So, for me, right now I have done
# intel_idle.max_cstate=1
If this does not work, I will try
# intel_idle.max_cstate=0

The first one limits the C state, the second basically prevents using the driver.
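
Which of the two is in effect can be read back from sysfs. A small sketch, assuming the standard cpuidle sysfs layout; `list_cstates` is a hypothetical helper name:

```shell
# Idle driver currently in use (for example intel_idle or acpi_idle).
driver_file=/sys/devices/system/cpu/cpuidle/current_driver
if [ -r "$driver_file" ]; then
    echo "cpuidle driver: $(cat "$driver_file")"
fi

# Hypothetical helper: prints the idle states the kernel exposes for one CPU.
# Takes an optional directory so it can be pointed at another CPU.
list_cstates() {
    cstate_dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for state in "$cstate_dir"/state*; do
        [ -r "$state/name" ] || continue
        printf '%s: %s\n' "${state##*/}" "$(cat "$state/name")"
    done
}

list_cstates
```

Comparing the output before and after changing the parameter shows whether the deeper states are actually gone.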

If this also does not work, then I will remove the above and put
# options softdog soft_noboot=1

So this one is completely independent from my point of view, i.e. you can put it there even now.

If it was softdoggie doing it, then it means the system was not frozen and it could have made some log.

That's not entirely true: the system can be frozen to the point that it cannot flush logs to disk, yet the softdog still manages to reboot it. If you instead found the system frozen (hours later), you would at least see a trace on the screen showing where it went belly up.

And thanks to everyone for the forum help and suggestions.

Cheers!
 
Besides the intel C-state setting, I found that one of my Windows VMs (the Agent DVR NVR) had been changed from host CPU to default after reloading Proxmox.
Long before that I had added an nvfix entry with the CPU set to host, and that line was still there, so the VM was still using the host CPU. Changed that too.
5 days now and still running, let's see.
 
Sep 02 06:03:57 Prox1 systemd[1]: Starting apt-daily-upgrade.service - Daily apt upgrade and clean activities...
Sep 02 06:03:57 Prox1 systemd[1]: apt-daily-upgrade.service: Deactivated successfully.
Sep 02 06:03:57 Prox1 systemd[1]: Finished apt-daily-upgrade.service - Daily apt upgrade and clean activities.
Sep 02 06:17:01 Prox1 CRON[410326]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 02 06:17:01 Prox1 CRON[410327]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 02 06:17:01 Prox1 CRON[410326]: pam_unix(cron:session): session closed for user root
Sep 02 06:25:01 Prox1 CRON[411831]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 02 06:25:01 Prox1 CRON[411832]: (root) CMD (test -x /usr/sbin/anacron || { cd / && run-parts --report /etc/cron.daily; })
Sep 02 06:25:01 Prox1 CRON[411831]: pam_unix(cron:session): session closed for user root
Sep 02 07:17:01 Prox1 CRON[421577]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 02 07:17:01 Prox1 CRON[421578]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 02 07:17:01 Prox1 CRON[421577]: pam_unix(cron:session): session closed for user root
Sep 02 08:17:01 Prox1 CRON[432831]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 02 08:17:01 Prox1 CRON[432832]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 02 08:17:01 Prox1 CRON[432831]: pam_unix(cron:session): session closed for user root
-- Reboot --
Sep 02 08:21:55 Prox1 kernel: Linux version 6.8.4-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-2 (2024-04-10T17:36Z) ()




Sep 03 15:26:46 Prox1 smartd[1213]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 107 to 108
Sep 03 16:12:45 Prox1 systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...
Sep 03 16:12:45 Prox1 systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
Sep 03 16:12:45 Prox1 systemd[1]: Finished systemd-tmpfiles-clean.service - Cleanup of Temporary Directories.
Sep 03 16:12:45 Prox1 systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount: Deactivated successfully.
Sep 03 16:17:01 Prox1 CRON[306019]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 03 16:17:01 Prox1 CRON[306020]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 03 16:17:01 Prox1 CRON[306019]: pam_unix(cron:session): session closed for user root
-- Reboot --
Sep 03 16:22:19 Prox1 kernel: Linux version 6.8.4-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-2 (2024-04-10T17:36Z) ()
Sep 03 16:22:19 Prox1 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.4-2-pve root=/dev/mapper/pve-root ro quiet intel_iommu=on initcall_blacklist=sysfb_init pcie_aspm=off
Sep 03 16:22:19 Prox1 kernel: KERNEL supported cpus:
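
In both excerpts the last entry before the reboot marker is a routine cron job about five minutes earlier, so nothing was logged at the moment of the reset. A quick sketch for pulling just those last lines out of a long journal dump; `last_line_before_reboot` is a made-up name, and the marker pattern is an assumption based on the "-- Reboot --" line above (some journalctl versions may print "-- Boot <id> --" instead):

```shell
# Hypothetical helper: reads journal text on stdin and prints the last
# non-empty entry logged before each boot marker.
last_line_before_reboot() {
    awk '
        /^-- (Reboot|Boot [0-9a-f]+) --$/ { if (prev != "") print prev; prev = ""; next }
        NF { prev = $0 }
    '
}

# Usage on a live system:
#   journalctl --no-pager | last_line_before_reboot
# Related: list the boots the journal knows about, then show the tail of the previous one.
#   journalctl --list-boots
#   journalctl -b -1 -n 20
```

A long gap between that last line and the first kernel line of the next boot points to a hard freeze rather than an orderly shutdown.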
I'm having the same issue. One of my nodes suddenly started rebooting once or twice a day (since the day before yesterday). I have created a ticket with Proxmox support, and I will be checking the hardware soon (it's a pretty new Dell server, but of course you never know). I'll report back when I have hopefully found a cause.
 
1. Start with power: put a Sonoff device on the same line; it will log any on/off events caused by a relay in the UPS.
2. Test the RAM (memtest86 is your friend).
3. If it's a newer generation CPU like 12th, 13th or 14th gen, open the GRUB config and add the entry limiting the CPU to C-state C0, so that it never cuts power to any core; frequency scaling will still work.
4. Give your logs to ChatGPT and see if it notices anything :)
I have a 13600K with an ASUS Z790. When I set this up it was 6.2, and it worked with C0, no host CPU assigned in the VMs and no device passthrough for 30 to 40 days, and then it suddenly rebooted; the frequency increased as time passed. Now it has come to about once in 2 weeks on average, so I moved to 6.3, updated the ASUS BIOS to the latest, and manually installed drivers for the Realtek LAN cards I have. Let's see now. It does not bother me much if it reboots once a month, because the system is so fast; I have small and big VMs, and most of the small ones are CLI-only Ubuntu Server, so everything boots back up very quickly :). Before I put the C0 entry in GRUB, it was an unexpected reboot once or twice a day.

Maybe this helps. I have had my share of head-meets-desk moments before, so I can understand.
 