Random reboots

nadin

New Member
Jul 3, 2024
My Proxmox box keeps rebooting randomly. No one was awake when it happened.
I have attached the logs.

It's running on an MS-01 with 64 GB of RAM and a couple of VMs.
 


Welcome to the forum, nadin!

Have you already found a solution to your problem? If not, it would be helpful to have the log from just before the system rebooted. Your attached reboot log shows that it happened on Sep 17 04:44:38, so you can either gather the log before the reboot with something like sudo journalctl --since="2024-09-17 04:00:00" --until="2024-09-17 04:44:38", or you could use sudo journalctl -b <boot-number>, where <boot-number> is 0 for the current boot, -1 for the previous boot, -2 for the one before that, etc.
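
In case it helps, here is a minimal sketch of the boot-number approach wrapped in two small shell helpers. The function names are made up for this example; only journalctl and its flags (-b, -n, -p, --no-pager) are real. It assumes the journal is stored persistently, otherwise earlier boots won't be available, and that you run it as root or prefix the calls with sudo.

```shell
#!/bin/sh
# Hypothetical helper names; only journalctl and its flags are real.

# Show the last N lines (default 200) of a boot (default -1, the previous boot).
tail_of_boot() {
    boot="${1:--1}"
    lines="${2:-200}"
    journalctl -b "$boot" -n "$lines" --no-pager
}

# Show only messages of priority "warning" or more severe from a boot.
warnings_of_boot() {
    boot="${1:--1}"
    journalctl -b "$boot" -p warning --no-pager
}
```

journalctl --list-boots shows which boot numbers exist. After that, something like tail_of_boot -1 100 prints the last 100 lines before the most recent reboot.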
 
It has happened again. This is a fresh install; I rebuilt the Proxmox box last week.

I noticed it when my streaming app died (I run OPNsense in a VM on Proxmox). I am still not sure what the problem is.



Dec 03 23:10:06 home kernel: pci 0000:00:06.2: PCI bridge to [bus 02]
Dec 03 23:10:06 home kernel: pci 0000:00:06.2: bridge window [mem 0x6c700000-0x6c7fffff]
Dec 03 23:10:06 home kernel: pci 0000:00:06.2: bridge window [mem 0x611c800000-0x611e2fffff 64bit pref]
Dec 03 23:10:06 home kernel: pci 0000:00:06.2: PME# supported from D0 D3hot D3cold
Dec 03 22:56:43 home postfix/smtp[170275]: connect to gmail-smtp-in.l.google.com[142.250.107.26]:25: Connection timed out
Dec 03 22:56:43 home postfix/smtp[170274]: connect to gmail-smtp-in.l.google.com[2607:f8b0:400e:c0d::1b]:25: Network is unr>
Dec 03 22:56:43 home postfix/smtp[170275]: connect to gmail-smtp-in.l.google.com[2607:f8b0:400e:c0d::1a]:25: Network is unr>
Dec 03 22:56:43 home postfix/smtp[170275]: connect to alt1.gmail-smtp-in.l.google.com[2607:f8b0:4023:100b::1a]:25: Network >
Dec 03 22:57:13 home postfix/smtp[170272]: connect to alt1.gmail-smtp-in.l.google.com[142.251.186.27]:25: Connection timed >
Dec 03 22:57:13 home postfix/smtp[170272]: connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:4003:c04::1b]:25: Network i>
Dec 03 22:57:13 home postfix/smtp[170273]: connect to alt1.gmail-smtp-in.l.google.com[142.251.186.26]:25: Connection timed >
Dec 03 22:57:13 home postfix/smtp[170273]: connect to alt1.gmail-smtp-in.l.google.com[2607:f8b0:4023:100b::1a]:25: Network >
Dec 03 22:57:13 home postfix/smtp[170273]: connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:4003:c04::1b]:25: Network i>
Dec 03 22:57:13 home postfix/smtp[170274]: connect to alt1.gmail-smtp-in.l.google.com[142.251.186.27]:25: Connection timed >
Dec 03 22:57:13 home postfix/smtp[170274]: connect to alt1.gmail-smtp-in.l.google.com[2607:f8b0:4023:100b::1a]:25: Network >
Dec 03 22:57:13 home postfix/smtp[170275]: connect to alt1.gmail-smtp-in.l.google.com[142.251.186.27]:25: Connection timed >
Dec 03 22:57:13 home postfix/smtp[170273]: 7A3A81A0ED9: to=<xxxxxxx@gmail.com>, relay=none, delay=292369, delays=292309/0.>
Dec 03 22:57:13 home postfix/smtp[170272]: A10271A0EE6: to=<xxxxxxx@gmail.com>, relay=none, delay=250911, delays=250851/0.>
Dec 03 22:57:43 home postfix/smtp[170274]: connect to alt2.gmail-smtp-in.l.google.com[108.177.104.26]:25: Connection timed >
Dec 03 22:57:43 home postfix/smtp[170275]: connect to alt2.gmail-smtp-in.l.google.com[108.177.104.26]:25: Connection timed >
Dec 03 22:57:43 home postfix/smtp[170275]: B83091A0EC7: to=<xxxxxxx@GMAIL.COM>, relay=none, delay=382517, delays=382427/0.>
Dec 03 22:57:43 home postfix/smtp[170274]: B50991A0ECE: to=<xxxxxxx@GMAIL.COM>, relay=none, delay=382517, delays=382427/0.>
-- Boot 7bf2bca17004438c9aa778c4976ea8f8 --
Dec 03 23:10:06 home kernel: Linux version 6.8.12-4-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutil>
Dec 03 23:10:06 home kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-4-pve root=/dev/mapper/pve-root ro quiet intel_i>
Dec 03 23:10:06 home kernel: KERNEL supported cpus:
Dec 03 23:10:06 home kernel: Intel GenuineIntel
Dec 03 23:10:06 home kernel: AMD AuthenticAMD
Dec 03 23:10:06 home kernel: Hygon HygonGenuine
......
 
Hi!

There are a couple of things that stand out in your previous boot log:

Code:
Sep 17 04:44:38 prox kernel: resource: resource sanity check: requesting [mem 0x00000000fedc0000-0x00000000fedcffff], which spans more than pnp 00:04 [mem 0xfedc0000-0xfedc7fff]
Sep 17 04:44:38 prox kernel: caller igen6_probe+0x186/0x8b0 [igen6_edac] mapping multiple BARs
Sep 17 04:44:38 prox kernel: EDAC MC0: Giving out device to module igen6_edac controller Intel_client_SoC MC#0: DEV 0000:00:00.0 (INTERRUPT)
Sep 17 04:44:38 prox kernel: EDAC MC1: Giving out device to module igen6_edac controller Intel_client_SoC MC#1: DEV 0000:00:00.0 (INTERRUPT)
Sep 17 04:44:38 prox kernel: EDAC igen6 MC1: HANDLING IBECC MEMORY ERROR
Sep 17 04:44:38 prox kernel: EDAC igen6 MC1: ADDR 0x7fffffffe0
Sep 17 04:44:38 prox kernel: EDAC igen6 MC0: HANDLING IBECC MEMORY ERROR
Sep 17 04:44:38 prox kernel: EDAC igen6 MC0: ADDR 0x7fffffffe0
Sep 17 04:44:38 prox kernel: EDAC igen6: v2.5.1

As far as I know, these should be correctable ECC memory errors, so there shouldn't be any trouble here. But it's always worth checking by running a couple of full passes of memtest86+ to rule out a memory issue (if you can afford the downtime, e.g. during the night).
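
Besides memtest86+, the EDAC error counters can be read while the system is running. The following is only a sketch: it assumes the igen6_edac driver exposes the usual EDAC sysfs files (ce_count for corrected and ue_count for uncorrected errors) under /sys/devices/system/edac/mc, and the function name edac_counts is made up for this example.

```shell
#!/bin/sh
# Print corrected (ce) and uncorrected (ue) error counters for every
# EDAC memory controller under the given directory.
# edac_counts is a hypothetical helper name.
edac_counts() {
    base="${1:-/sys/devices/system/edac/mc}"
    for mc in "$base"/mc*; do
        # Skip entries that do not provide both counters.
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        printf '%s ce=%s ue=%s\n' \
            "$(basename "$mc")" \
            "$(cat "$mc/ce_count")" \
            "$(cat "$mc/ue_count")"
    done
}
```

Calling edac_counts without arguments reads the live counters. Counters that keep rising between calls, or any non-zero ue value, would be a reason to look at the memory more closely.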

Code:
Sep 17 04:44:38 prox kernel: platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
Sep 17 04:44:38 prox kernel: cfg80211: failed to load regulatory.db
This should not cause any trouble either, at least not random crashes. If you bought the NIC in your regulatory domain (i.e. the same region where you use it), this is not a problem, especially if you don't use WLAN or the machine sits in a server room. The database only makes sure that you're using the WLAN channels permitted in your regulatory domain.

Code:
Sep 17 04:49:21 prox kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
[ ... snip ... ]
Sep 17 04:44:47 prox kernel: x86/split lock detection: #AC: CPU 3/KVM/1527 took a split_lock trap at address: 0x7ef1d050
Sep 17 04:44:47 prox kernel: x86/split lock detection: #AC: CPU 2/KVM/1526 took a split_lock trap at address: 0x7ef1d050
Sep 17 04:44:47 prox kernel: x86/split lock detection: #AC: CPU 1/KVM/1525 took a split_lock trap at address: 0x7ef1d050

This is a possible cause, although an unlikely one, since the split lock would have to happen in kernel space, i.e. an atomic memory access in the kernel itself that is misaligned across two cache lines. You could try to turn off split lock detection or, better, find out why this happens, as described at [0].
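
To get a feeling for how often this happens and what is currently configured, something like the following sketch could be used. The helper names are made up for this example. The split_lock_detect= kernel parameter is real, but treat the parsing as an assumption about how your command line looks.

```shell
#!/bin/sh
# Hypothetical helper names for illustration.

# Count split-lock trap messages in kernel log text read from stdin.
count_split_locks() {
    grep -c 'took a split_lock trap'
}

# Print the value of split_lock_detect= from a kernel command line,
# or "default" if the parameter is not set.
# Without an argument, the running kernel's /proc/cmdline is used.
split_lock_setting() {
    cmdline="${1:-$(cat /proc/cmdline)}"
    for arg in $cmdline; do
        case "$arg" in
            split_lock_detect=*)
                echo "${arg#split_lock_detect=}"
                return 0
                ;;
        esac
    done
    echo "default"
}
```

For example, journalctl -k -b -1 | count_split_locks counts the traps logged during the previous boot, and split_lock_setting shows whether the parameter has been set on the running kernel.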

Since this seems hard to reproduce and there is no log from before the reboot, I would thoroughly check your hardware setup and see whether any of the following solves the problem:

  1. Check if the BIOS firmware is up-to-date
  2. Check if the CPU has the newest microcode applied
  3. Check the temperature (if it's too hot the hardware will halt)
  4. Check the memory with memtest
  5. Check the stability of the system with a CPU stress test
  6. Check if there's a reproducible error that's causing this
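
For steps 2 and 3, a rough sketch of how the values could be read from the running system. It assumes the standard Linux interfaces (/proc/cpuinfo for the microcode revision, /sys/class/thermal for temperatures in millidegrees Celsius); the function names are made up for this example, and which thermal zones exist depends on the hardware.

```shell
#!/bin/sh
# Hypothetical helper names for illustration.

# Print the microcode revision of the first CPU listed in cpuinfo.
microcode_revision() {
    awk -F': ' '/^microcode/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Print name, type and temperature (whole degrees Celsius) of each thermal zone.
thermal_zones() {
    base="${1:-/sys/class/thermal}"
    for zone in "$base"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        milli=$(cat "$zone/temp")
        kind=$(cat "$zone/type" 2>/dev/null || echo unknown)
        # sysfs reports millidegrees; integer division drops the fraction.
        echo "$(basename "$zone") $kind $((milli / 1000))C"
    done
}
```

Calling thermal_zones every few seconds while the stress test from step 5 is running should show whether the temperature climbs towards the limit before a reboot.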

[0] https://pve.proxmox.com/wiki/Split_lock_detection