Frequent crashes - [Hardware Error] CPU RIP TSC PROCESSOR - Kernel panic

sender

For a while now I have been having intermittent crashes.
Like these:
1621406422526.png
(although it says it reboots, it does NOT)

It happened a lot:
1621406725857.png

My current setup:
Intel NUC8i5BEH with 16GB RAM 500GB Samsung 980 PRO SSD
pve-manager/6.4-6/be2fa32c (running kernel: 5.4.114-1-pve)
All bios for NUC & SSD up-to-date

What I experience(d):
Before this I had VERY frequent crashes with pve 6.4.5 and the version before it (I forgot the number).
With the previous SSD firmware I had: https://forum.proxmox.com/threads/d..._find_block-failed-error-5.87344/#post-389506 (this no longer happens).

What I tried:
- updating firmwares
- updating pve
- this: https://forum.proxmox.com/threads/r...mox-ve-6-1-auf-ex62-nvme-hetzner.63597/page-3

But it keeps crashing now and then. I must admit that every time I tried something it got better, but it still crashes every 2-4 days (!)...

What can I do about this?

Here is the error written out (might be typos):
Code:
mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 10: be2000c00002010b
mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff870f62ec> {mutex_spin_on_owner+0x6c/0xa0}
mce: [Hardware Error]: TSC 154e1c68e3c2e ADDR dedb80001d0f36 MISC 66119c0
mce: [Hardware Error]: PROCESSOR 0:806ea TIME 1621405756 SOCKET 0 APIC 0 microcode e0
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal machine check
Shutting down cpus with NMI
Kernel offset: 0x6000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Rebooting in 30 seconds..
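
For reference, the 64-bit status word in the "Bank 10" line can be decoded by hand. The sketch below is a minimal POSIX shell example that reads the architectural flag bits of a machine-check status value, using the bit positions documented in the Intel SDM machine-check chapter (VAL 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57). The function names `mce_bit` and `mce_decode` are made up for this illustration and are not part of any tool.

```shell
#!/bin/sh
# Hypothetical helpers: decode flag bits of an MCi_STATUS value given as
# 16 hex digits. The value is split into two 32-bit halves so the result
# does not depend on how the shell handles 64-bit signed overflow.

# Print bit N (0-63) of the 16-digit hex string in $1.
mce_bit() {
    hi=$(printf '%s' "$1" | cut -c1-8)
    lo=$(printf '%s' "$1" | cut -c9-16)
    if [ "$2" -ge 32 ]; then
        echo $(( (0x$hi >> ($2 - 32)) & 1 ))
    else
        echo $(( (0x$lo >> $2) & 1 ))
    fi
}

# Print the architectural flags and the low 16-bit MCA error code.
mce_decode() {
    echo "VAL=$(mce_bit "$1" 63) OVER=$(mce_bit "$1" 62) UC=$(mce_bit "$1" 61) EN=$(mce_bit "$1" 60)"
    echo "MISCV=$(mce_bit "$1" 59) ADDRV=$(mce_bit "$1" 58) PCC=$(mce_bit "$1" 57)"
    code=$(printf '%s' "$1" | cut -c13-16)
    printf 'MCACOD=0x%04x\n' $((0x$code))
}

# Status value copied from the panic output above.
mce_decode be2000c00002010b
```

For the value above, the UC (uncorrected) and PCC (processor context corrupt) bits are both set, which matches the "Processor context corrupt" line in the panic output. Which hardware unit Bank 10 maps to is model-specific and is not decoded here.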
 

Attachments: 1621406497658.png, 1621406560309.png
Could you try installing the new 5.11 kernel and test with it [0]?

If you want to switch back to the 5.4 kernel, you can select the old kernel in the boot loader during a reboot and uninstall the pve-kernel-5.11 package.

[0] https://forum.proxmox.com/threads/kernel-5-11.86225/
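
The steps above can be sketched roughly as follows. The package name `pve-kernel-5.11` comes from this post; installing it with `apt install` is an assumption based on the linked thread, and the helper `kernel_in_series` is a made-up name that only checks which kernel series is currently booted.

```shell
#!/bin/sh
# Return 0 if a kernel release string (as printed by `uname -r`) belongs to
# the given major.minor series, e.g. "5.11.17-1-pve" and "5.11".
kernel_in_series() {
    case "$1" in
        "$2".*|"$2"-*) return 0 ;;
        *) return 1 ;;
    esac
}

# Opt in to the newer kernel (run as root, then reboot):
#   apt update && apt install pve-kernel-5.11
# Roll back: pick the 5.4 entry in the boot loader menu, then
#   apt remove pve-kernel-5.11

# Confirm which kernel is actually running after the reboot.
if kernel_in_series "$(uname -r)" 5.11; then
    echo "running a 5.11 kernel: $(uname -r)"
else
    echo "not running a 5.11 kernel: $(uname -r)"
fi
```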
 
It happened again and it is starting to get pretty annoying... I think I am on the verge of leaving Proxmox for what it is...

1621447588567.png

I do not want to, but I will upgrade to the newer kernel...
 
FFS!

I upgraded to the newest kernel:
pve-manager/6.4-6/be2fa32c (running kernel: 5.11.17-1-pve)

And it crashed again! It is now getting very frustrating since it really totally hangs till Christmas and beyond.

I believe I have the most standard hardware setup imaginable on the market currently...

1621523820003.png
 
Looks like an issue with your hardware.

Intel NUCs are known to work well.
 
Easy to say... the error message is different after each "action/update"...
Maybe a cooling issue? As I said, as it works for others, there must be something different/broken on your box.
 
It crashed again, after 4 days this time. Where is the information written in the logs and how can I find it? I really want this to stay stable... Swapping hardware is an option, but what to swap? The NUC, the SSD, the memory? Everything is brand new and works for a while (now 4 days).

This is not hardware, is it!?
1621871590512.png
 
And again it crashed, with a new NUC! This must be software. Please assist...

Where can I find those "Machine check events logged"?
 
Can I please receive some help on this error?

Where can I find those "Machine check events logged"?
 
Can I please get some useful assistance? It keeps crashing every now and then. I am fully up to date with all firmware and updates etc. It lasts from a few days up to a week and then crashes... The most annoying part now is that after a power cycle (the only way to get it going) all logging is gone... How can I overcome that?
 
Is this really Proxmox? I have asked many times for help... "how" to see the logs after crash. I cannot find them... I appreciate some help with this ever ongoing issue...
 
This is most likely a hardware issue or a system that is overcommitted in terms of VM usage.

Can you specify which VMs are running?

Have you tried a different SSD?
 
It isn’t clear what you have done to narrow down and identify your issues. I see you keep testing Proxmox, but that won’t eliminate whether this is a hardware or software issue.

Have you removed all but one DIMM and tested each module individually with memtest?

Have you tested any other hypervisor or OS? Results? Can you run Hyper-V, ESXi, or Debian stably?

Are you running the newest microcode? (apt install intel-microcode)

What happens when you benchmark or stress the CPU with prime95, geekbench, and passmark? Have you run a live ISO such as Ultimate Boot CD to test all hardware components?
 
Hi, thanks for the replies.

No this is a fresh system installed with Proxmox.

Why am I not receiving a reply to the simple question... where can I see the log to see what happened?

Of course I can start stressing the hardware with all kinds of 3rd party software... but that takes time, effort, etc. and without even knowing "what" happens (log) I do not want to start that... I don't even know how to do that on the current hardware and get everything back...

I will try to install:
apt install intel-microcode

EDIT, can't:
Code:
root@proxmox01:~# apt install intel-microcode
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Package intel-microcode is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'intel-microcode' has no installation candidate
root@proxmox01:~#
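
A likely cause of the "no installation candidate" message, assuming a stock PVE 6.x install on Debian 10 "buster": the `intel-microcode` package lives in Debian's `non-free` component, which is not enabled by default. The sketch below only checks whether a sources file enables `non-free` and prints the microcode revision currently loaded. The helper name `has_nonfree` and the example repository line are illustrative assumptions, so compare them against your own `/etc/apt/sources.list` before changing anything.

```shell
#!/bin/sh
# Return 0 if any active "deb" line in the given sources file enables non-free.
has_nonfree() {
    grep -Eq '^[[:space:]]*deb[[:space:]].*[[:space:]]non-free([[:space:]]|$)' "$1" 2>/dev/null
}

SOURCES=${SOURCES:-/etc/apt/sources.list}
if has_nonfree "$SOURCES"; then
    echo "non-free is enabled in $SOURCES"
else
    echo "non-free is not enabled in $SOURCES"
    # Assumed fix: append 'non-free' to the Debian 'deb' lines, for example
    #   deb http://ftp.debian.org/debian buster main contrib non-free
    # then run: apt update && apt install intel-microcode
fi

# Microcode revision currently loaded (the panic output shows "microcode e0").
grep -m1 microcode /proc/cpuinfo 2>/dev/null || true
```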

I have 2 VMs and 5 LXC containers...

Actually all systems do "nothing" except one: an LXC container with Docker on it, running object detection with a Google Coral...

CPU load is around 25%
Hour average:
1625336034449.png

Month Average:

1625336094154.png

See above, it can run well for days but then it is suddenly "gone"...
 
Why am I not receiving a reply to the simple question... where can I see the log to see what happened?

It is a kernel panic. The system's lock up typically means there is no time or ability to write out to any persistent log file. You can enable debugging to dump more data on screen, but you would still need to catch the data when it is crashing.
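
As a hedged sketch of what can still be tried: making the systemd journal persistent lets you read the kernel log of the previous boot after a power cycle, although a fatal machine check may well halt the system before anything reaches the disk. In that case, sending kernel messages to another machine (for example with the kernel's netconsole module) or photographing the screen remain the fallback. The helper name `mce_lines` is made up for this example.

```shell
#!/bin/sh
# Filter machine-check related lines out of a kernel log read from stdin.
mce_lines() {
    grep -Ei 'mce:|machine check|hardware error'
}

# 1. Make the journal persistent (run as root, once):
#      mkdir -p /var/log/journal && systemctl restart systemd-journald
# 2. After the next crash and power cycle, list the recorded boots:
#      journalctl --list-boots
# 3. Read the kernel messages of the previous boot and keep the MCE lines:
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k -b -1 --no-pager 2>/dev/null | mce_lines \
        || echo "no machine-check lines found for the previous boot"
fi
```

If step 3 prints nothing even after a crash, that is consistent with the explanation above: the panic happened before the log could be written.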

Of course I can start stressing the hardware with all kinds of 3rd party software... but that takes time, effort, etc. and without even knowing "what" happens (log) I do not want to start that...

Yes, it takes time and effort. There is no other way to properly eliminate variables. You don't know what is happening, thus the proper method to diagnose the cause is to eliminate variables. It is substantially more likely there is a problem with your hardware than it is you have found a problem in a mature kernel deployed to thousands of production servers.
 
OK, fair enough... So I have another NUC8i5 with different memory (Samsung) and a different SSD (NVMe).

What is the best/easiest way to fully migrate all containers/VMs from one system to another?

I think that sounds like the best way to rule out hardware...

That other NUC8i5 with Samsung memory and a 1TB NVMe ran fine with ESXi for over a year...
 
