[SOLVED] What is causing my restarts randomly?

looserealityinc

New Member
Dec 4, 2024
I'm not entirely new to this, but I'm still trying to track down my random restarts. Every few hours the host restarts, and if that happens while I'm doing crucial client work I lose progress and data. Please help? Here's my log, taken from around the timestamp of the latest crash.

May 16 21:17:01 PVE CRON[386312]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
May 16 21:17:01 PVE CRON[386313]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
May 16 21:17:01 PVE CRON[386312]: pam_unix(cron:session): session closed for user root
May 16 21:20:16 PVE smartd[1712]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 117 to 116
May 16 21:20:17 PVE smartd[1712]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 59 to 58
May 16 21:20:17 PVE smartd[1712]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 41 to 42
May 16 21:20:17 PVE smartd[1712]: Device: /dev/sdc [SAT], 30 Offline uncorrectable sectors
May 16 21:20:17 PVE smartd[1712]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 153 to 150
May 16 21:20:22 PVE smartd[1712]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 60
May 16 21:20:22 PVE smartd[1712]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 39 to 40
May 16 21:20:22 PVE smartd[1712]: Device: /dev/sde [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 57 to 56
May 16 21:20:22 PVE smartd[1712]: Device: /dev/sde [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 43 to 44
May 16 21:20:22 PVE smartd[1712]: Device: /dev/sdf [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 59 to 58
May 16 21:20:22 PVE smartd[1712]: Device: /dev/sdf [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 41 to 42
May 16 21:23:01 PVE pvedaemon[2071]: <root@pam> successful auth for user 'root@pam'
May 16 21:23:06 PVE pvedaemon[2072]: worker exit
May 16 21:23:06 PVE pvedaemon[2069]: worker 2072 finished
May 16 21:23:06 PVE pvedaemon[2069]: starting 1 worker(s)
May 16 21:23:06 PVE pvedaemon[2069]: worker 389957 started
May 16 21:25:01 PVE CRON[391072]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
May 16 21:25:01 PVE CRON[391073]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 16 21:25:01 PVE CRON[391072]: pam_unix(cron:session): session closed for user root
May 16 21:28:46 PVE pveproxy[357012]: worker exit
May 16 21:28:46 PVE pveproxy[2078]: worker 357012 finished
May 16 21:28:46 PVE pveproxy[2078]: starting 1 worker(s)
May 16 21:28:46 PVE pveproxy[2078]: worker 393273 started
May 16 21:30:26 PVE pvedaemon[2070]: worker exit
May 16 21:30:26 PVE pvedaemon[2069]: worker 2070 finished
May 16 21:30:26 PVE pvedaemon[2069]: starting 1 worker(s)
May 16 21:30:26 PVE pvedaemon[2069]: worker 394276 started
May 16 21:32:34 PVE pvedaemon[2071]: worker exit
May 16 21:32:34 PVE pvedaemon[2069]: worker 2071 finished
May 16 21:32:34 PVE pvedaemon[2069]: starting 1 worker(s)
May 16 21:32:34 PVE pvedaemon[2069]: worker 395539 started
May 16 21:35:01 PVE CRON[397016]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
May 16 21:35:01 PVE CRON[397017]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 16 21:35:01 PVE CRON[397016]: pam_unix(cron:session): session closed for user root
May 16 21:38:02 PVE pvedaemon[395539]: <root@pam> successful auth for user 'root@pam'
May 16 21:41:34 PVE pveproxy[2078]: worker 371880 finished
May 16 21:41:34 PVE pveproxy[2078]: starting 1 worker(s)
May 16 21:41:34 PVE pveproxy[2078]: worker 401017 started
May 16 21:41:37 PVE pveproxy[401016]: got inotify poll request in wrong process - disabling inotify
May 16 21:45:01 PVE CRON[403031]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
May 16 21:45:01 PVE CRON[403032]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 16 21:45:01 PVE CRON[403031]: pam_unix(cron:session): session closed for user root
May 16 21:46:03 PVE pveproxy[374643]: worker exit
May 16 21:46:03 PVE pveproxy[2078]: worker 374643 finished
May 16 21:46:03 PVE pveproxy[2078]: starting 1 worker(s)
May 16 21:46:03 PVE pveproxy[2078]: worker 403698 started
May 16 21:50:00 PVE pmxcfs[1919]: [dcdb] notice: data verification successful
May 16 21:50:17 PVE smartd[1712]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 58 to 57
May 16 21:50:17 PVE smartd[1712]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 42 to 43
May 16 21:50:17 PVE smartd[1712]: Device: /dev/sdc [SAT], 30 Offline uncorrectable sectors
May 16 21:50:22 PVE smartd[1712]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 60 to 59
May 16 21:50:22 PVE smartd[1712]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 40 to 41
May 16 21:50:22 PVE smartd[1712]: Device: /dev/sde [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 56 to 55
May 16 21:50:22 PVE smartd[1712]: Device: /dev/sde [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 44 to 45
May 16 21:50:22 PVE smartd[1712]: Device: /dev/sdf [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 58 to 57
May 16 21:50:22 PVE smartd[1712]: Device: /dev/sdf [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 42 to 43
May 16 21:50:27 PVE smartd[1712]: Device: /dev/sdg [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 60 to 59
May 16 21:50:27 PVE smartd[1712]: Device: /dev/sdg [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 40 to 41
May 16 21:50:28 PVE smartd[1712]: Device: /dev/sdh [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 150 to 146
May 16 21:50:33 PVE smartd[1712]: Device: /dev/sdi [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 60
May 16 21:50:33 PVE smartd[1712]: Device: /dev/sdi [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 39 to 40
May 16 21:53:03 PVE pvedaemon[395539]: <root@pam> successful auth for user 'root@pam'
May 16 21:55:01 PVE CRON[409067]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
May 16 21:55:01 PVE CRON[409068]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 16 21:55:01 PVE CRON[409067]: pam_unix(cron:session): session closed for user root
May 16 22:05:01 PVE CRON[416203]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
May 16 22:05:01 PVE CRON[416204]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 16 22:05:01 PVE CRON[416203]: pam_unix(cron:session): session closed for user root
-- Reboot --
May 16 22:08:27 PVE kernel: Linux version 6.8.12-10-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-10 (2025-04-18T07:39Z) ()
May 16 22:08:27 PVE kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-10-pve root=/dev/mapper/pve-root ro quiet amd_iommu=on iommu=pt video=vesafb:off video=efifb:off initcall_blacklist=sysfb_init
May 16 22:08:27 PVE kernel: KERNEL supported cpus:
May 16 22:08:27 PVE kernel: Intel GenuineIntel
May 16 22:08:27 PVE kernel: AMD AuthenticAMD
May 16 22:08:27 PVE kernel: Hygon HygonGenuine
May 16 22:08:27 PVE kernel: Centaur CentaurHauls
May 16 22:08:27 PVE kernel: zhaoxin Shanghai
May 16 22:08:27 PVE kernel: BIOS-provided physical RAM map:
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x0000000000100000-0x0000000009d81fff] usable
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x0000000009d82000-0x0000000009ffffff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x000000000a200000-0x000000000a210fff] ACPI NVS
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x000000000a211000-0x000000000affffff] usable
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x000000000b000000-0x000000000b01ffff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x000000000b020000-0x00000000ab46bfff] usable
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000ab46c000-0x00000000ab46cfff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000ab46d000-0x00000000ab495fff] usable
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000ab496000-0x00000000ab496fff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000ab497000-0x00000000bb117fff] usable
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000bb118000-0x00000000bb497fff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000bb498000-0x00000000bb4fbfff] ACPI data
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000bb4fc000-0x00000000bcbfafff] ACPI NVS
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000bcbfb000-0x00000000bdb4afff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000bdb4b000-0x00000000bdbfefff] type 20
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000bdbff000-0x00000000beffffff] usable
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000bf000000-0x00000000bfffffff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000f0000000-0x00000000f7ffffff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000fd200000-0x00000000fd2fffff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000fd600000-0x00000000fd7fffff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000fea00000-0x00000000fea0ffff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000feb80000-0x00000000fec01fff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000fec10000-0x00000000fec10fff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000fec30000-0x00000000fec30fff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000fed00000-0x00000000fed00fff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000fed40000-0x00000000fed44fff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000fed80000-0x00000000fed8ffff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000fedc2000-0x00000000fedcffff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000fedd4000-0x00000000fedd5fff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x0000000100000000-0x000000203f37ffff] usable
May 16 22:08:27 PVE kernel: BIOS-e820: [mem 0x000000203f380000-0x000000203fffffff] reserved
May 16 22:08:27 PVE kernel: NX (Execute Disable) protection: active
May 16 22:08:27 PVE kernel: APIC: Static calls initialized
May 16 22:08:27 PVE kernel: efi: EFI v2.7 by American Megatrends
May 16 22:08:27 PVE kernel: efi: ACPI=0xbcbe4000 ACPI 2.0=0xbcbe4014 TPMFinalLog=0xbcbae000 SMBIOS=0xbd9fe000 MEMATTR=0xb87c7118 MOKvar=0xbda2a000
May 16 22:08:27 PVE kernel: efi: Remove mem327: MMIO range=[0xf0000000-0xf7ffffff] (128MB) from e820 map
May 16 22:08:27 PVE kernel: e820: remove [mem 0xf0000000-0xf7ffffff] reserved
May 16 22:08:27 PVE kernel: efi: Remove mem328: MMIO range=[0xfd200000-0xfd2fffff] (1MB) from e820 map
May 16 22:08:27 PVE kernel: e820: remove [mem 0xfd200000-0xfd2fffff] reserved
May 16 22:08:27 PVE kernel: efi: Remove mem329: MMIO range=[0xfd600000-0xfd7fffff] (2MB) from e820 map
May 16 22:08:27 PVE kernel: e820: remove [mem 0xfd600000-0xfd7fffff] reserved
May 16 22:08:27 PVE kernel: efi: Not removing mem330: MMIO range=[0xfea00000-0xfea0ffff] (64KB) from e820 map
May 16 22:08:27 PVE kernel: efi: Remove mem331: MMIO range=[0xfeb80000-0xfec01fff] (0MB) from e820 map
May 16 22:08:27 PVE kernel: e820: remove [mem 0xfeb80000-0xfec01fff] reserved
May 16 22:08:27 PVE kernel: efi: Not removing mem332: MMIO range=[0xfec10000-0xfec10fff] (4KB) from e820 map
May 16 22:08:27 PVE kernel: efi: Not removing mem333: MMIO range=[0xfec30000-0xfec30fff] (4KB) from e820 map
May 16 22:08:27 PVE kernel: efi: Not removing mem334: MMIO range=[0xfed00000-0xfed00fff] (4KB) from e820 map
May 16 22:08:27 PVE kernel: efi: Not removing mem335: MMIO range=[0xfed40000-0xfed44fff] (20KB) from e820 map
May 16 22:08:27 PVE kernel: efi: Not removing mem336: MMIO range=[0xfed80000-0xfed8ffff] (64KB) from e820 map
May 16 22:08:27 PVE kernel: efi: Not removing mem337: MMIO range=[0xfedc2000-0xfedcffff] (56KB) from e820 map
May 16 22:08:27 PVE kernel: efi: Not removing mem338: MMIO range=[0xfedd4000-0xfedd5fff] (8KB) from e820 map
May 16 22:08:27 PVE kernel: efi: Remove mem339: MMIO range=[0xff000000-0xffffffff] (16MB) from e820 map
May 16 22:08:27 PVE kernel: e820: remove [mem 0xff000000-0xffffffff] reserved
May 16 22:08:27 PVE kernel: secureboot: Secure boot disabled
May 16 22:08:27 PVE kernel: SMBIOS 2.8 present.
May 16 22:08:27 PVE kernel: DMI: Micro-Star International Co., Ltd MS-7C02/B450 TOMAHAWK MAX (MS-7C02), BIOS 3.I0 10/14/2023
May 16 22:08:27 PVE kernel: tsc: Fast TSC calibration using PIT
May 16 22:08:27 PVE kernel: tsc: Detected 4199.728 MHz processor
May 16 22:08:27 PVE kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
May 16 22:08:27 PVE kernel: e820: remove [mem 0x000a0000-0x000fffff] usable
May 16 22:08:27 PVE kernel: last_pfn = 0x203f380 max_arch_pfn = 0x400000000
May 16 22:08:27 PVE kernel: total RAM covered: 3071M
May 16 22:08:27 PVE kernel: Found optimal setting for mtrr clean up
May 16 22:08:27 PVE kernel: gran_size: 64K chunk_size: 64M num_reg: 3 lose cover RAM: 0G
May 16 22:08:27 PVE kernel: MTRR map: 7 entries (3 fixed + 4 variable; max 20), built from 9 variable MTRRs
May 16 22:08:27 PVE kernel: x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT
May 16 22:08:27 PVE kernel: e820: update [mem 0xbc680000-0xbc68ffff] usable ==> reserved
 
The log is indistinguishable from a hard reset or momentary power loss: nothing is logged between the last cron entry and the start of the next boot. Is there any additional information on a physical display connected to the Proxmox host (which might not be written to the logs because of a drive issue)? It's typically a hardware issue, or maybe you need to update the firmware. Maybe test and/or replace parts like the PSU and memory, or add a UPS. Maybe set up logging to a different system over the network (to get a clue about what happens just before the reset)?
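A rough way to check the "hard reset" reading against a saved journal dump is sketched below in Python. It pulls out the last few lines before each `-- Reboot --` marker and looks for signs of an orderly shutdown. The function names and the list of shutdown hints are my own assumptions for illustration, not anything Proxmox ships, and the check is only a heuristic.

```python
from typing import List

REBOOT_MARKER = "-- Reboot --"  # separator line between boots, as seen in the log above

# Assumption: substrings that tend to show up when systemd shuts down in an orderly way.
SHUTDOWN_HINTS = ("Journal stopped", "Shutting down", "Stopping ")


def lines_before_reboot(journal_text: str, count: int = 5) -> List[List[str]]:
    """For each reboot marker, return the last `count` non-empty lines logged before it."""
    results: List[List[str]] = []
    seen: List[str] = []
    for line in journal_text.splitlines():
        stripped = line.strip()
        if stripped == REBOOT_MARKER:
            results.append(seen[-count:] if count > 0 else [])
            seen = []  # start collecting lines for the next boot
        elif stripped:
            seen.append(stripped)
    return results


def looks_like_clean_shutdown(lines: List[str]) -> bool:
    """Heuristic only: True if any line contains one of the shutdown hints."""
    return any(hint in line for line in lines for hint in SHUTDOWN_HINTS)
```

If the lines before a marker are only routine entries such as cron sessions, as in the log posted here, that is consistent with the host being reset without any warning, though it does not say what caused it.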
ugh this forum seems dead sometimes...
if it does I will lose progress and data if it happens while I'm doing crucial client work.
If you need urgent support and make money by running Proxmox, consider a support subscription with support tickets (to have experts help you instead of random internet volunteers).
 
Oddly enough, the monitor I have hooked up to the Proxmox host right now just shows that it is loading Linux PVE and loading the initial ramdisk. I will keep watching the screen and have a camera on it to try to catch anything that appears just before a crash. I wouldn't expect it to sit on this screen, though, since everything has been fully running for an hour so far.
 
If you need urgent support and make money by running Proxmox, consider a support subscription with support tickets (to have experts help you instead of random internet volunteers).
Sorry, Proxmox doesn't make me money; it's just something I chose to do through one of my Proxmox VM servers, haha. But no, totally understandable.
 
Actually, I looked into that one already, but it wasn't the issue. I have since fixed it, or at least I have had no random restarts for the last 72 hours, which is far longer than I could get before. I had assigned exactly as many cores to all of my LXCs and VMs combined as my host machine has, which it apparently did not like. I have now given one of my VMs one less core, and that seems to have made the whole system stable.
 
I had assigned exactly as many cores to all of my LXCs and VMs combined as my host machine has, which it apparently did not like. I have now given one of my VMs one less core, and that seems to have made the whole system stable.
Assigning more cores than the host physically has (even counting SMT) should work fine. Things would slow down if every VM/CT tried to use its cores at 100%, but Proxmox should not restart because of this. I've run stress tests like that before (on a different AM4 system than yours) without problems. This sounds more like a PSU or heat problem, or some other hardware weakness in your system. Maybe try replacing parts until it is actually fixed?
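To put a number on how much is assigned, here is a minimal Python sketch that adds up the vCPUs requested by a set of guest configs and compares the total with the host's thread count. It assumes the `cores:` and `sockets:` lines used in Proxmox guest config files. The function names are hypothetical, and treating a missing `cores:` line as 1 is a simplification (a container without a core limit is not really restricted to one core). Snapshot sections inside a config are not handled.

```python
import re
from typing import Iterable

# Matches lines such as "cores: 4" or "sockets: 1" in a guest config.
_KEY_RE = re.compile(r"^(cores|sockets):\s*(\d+)\s*$", re.MULTILINE)


def vcpus_in_config(config_text: str) -> int:
    """vCPUs requested by one guest: cores * sockets (each assumed to be 1 if absent)."""
    values = {"cores": 1, "sockets": 1}
    for key, number in _KEY_RE.findall(config_text):
        values[key] = int(number)
    return values["cores"] * values["sockets"]


def overcommit_ratio(configs: Iterable[str], host_threads: int) -> float:
    """Total assigned vCPUs divided by the number of host threads."""
    if host_threads <= 0:
        raise ValueError("host_threads must be positive")
    return sum(vcpus_in_config(text) for text in configs) / host_threads
```

A ratio of exactly 1.0 corresponds to the setup described above, where the guests together were assigned as many cores as the host has. Ratios above 1.0 are the overcommitted case that, per this reply, should slow things down under load rather than cause restarts.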
 
Hmmm... I'll try updating the BIOS. I just realized that, for some reason, I was thinking of my editing rig PC, which is the one whose motherboard BIOS I updated. Haha.
 
So far it still hasn't restarted or done anything other than run smoothly for the last 72 hours, which, like I said, is 65+ hours more than last time. Haha
 
May 16 22:08:27 PVE kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-10-pve root=/dev/mapper/pve-root ro quiet amd_iommu=on iommu=pt video=vesafb:off video=efifb:off initcall_blacklist=sysfb_init
While I'm at it: amd_iommu=on and video=vesafb:off and video=efifb:off all do nothing on Proxmox anymore.
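As a small illustration, the Python sketch below splits the quoted command line into the parameters that would stay and the three that are said above to do nothing. The function name `split_cmdline` is made up for this example, and the script only reports; it does not touch any bootloader configuration.

```python
from typing import List, Tuple

# The parameters described above as having no effect on Proxmox anymore.
NO_EFFECT = ("amd_iommu=on", "video=vesafb:off", "video=efifb:off")


def split_cmdline(cmdline: str) -> Tuple[List[str], List[str]]:
    """Split a kernel command line into (kept, dropped) parameter lists."""
    kept: List[str] = []
    dropped: List[str] = []
    for param in cmdline.split():
        (dropped if param in NO_EFFECT else kept).append(param)
    return kept, dropped


# Command line copied from the boot log earlier in this thread.
CMDLINE = (
    "BOOT_IMAGE=/boot/vmlinuz-6.8.12-10-pve root=/dev/mapper/pve-root ro quiet "
    "amd_iommu=on iommu=pt video=vesafb:off video=efifb:off "
    "initcall_blacklist=sysfb_init"
)

kept, dropped = split_cmdline(CMDLINE)
print("would keep:", " ".join(kept))
print("would drop:", " ".join(dropped))
```

Note that `iommu=pt` and `initcall_blacklist=sysfb_init` are left alone here, since only the three listed parameters are called out as having no effect.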

Disabling SMT might also fix your strange restarts, make your CPU less vulnerable (see the output of lscpu, or add mitigations=auto,nosmt to the kernel parameters), and possibly make it faster in certain tasks.

Are any overclock settings enabled (by default) in the motherboard BIOS? That can also push your CPU or RAM over the edge when stressed, causing restarts.
 
No, I've never had it overclocked, but I have now officially figured it out. There was something wrong with my macOS VM, so I deleted the old one and installed a new one. I tested this: once I had the old VM open, the host restarted after a few minutes, which is how I figured it out. Thanks, everyone.