Frequent crashes - [Hardware Error] CPU RIP TSC PROCESSOR - Kernel panic

sender · May 19, 2021

For a while now I have had intermittent crashes, like these:

(Although it says it is rebooting, it does NOT.)

It happened a lot:

My current setup:
Intel NUC8i5BEH with 16GB RAM and a 500GB Samsung 980 PRO SSD
pve-manager/6.4-6/be2fa32c (running kernel: 5.4.114-1-pve)
BIOS and firmware for the NUC and SSD are up to date

What I experience(d):
Before this I had VERY frequent crashes with pve 6.4.5 and the version before it (I forgot the number).
With the previous SSD firmware I had: https://forum.proxmox.com/threads/d..._find_block-failed-error-5.87344/#post-389506 (this no longer happens).

What I tried:
- updating firmwares
- updating pve
- this: https://forum.proxmox.com/threads/r...mox-ve-6-1-auf-ex62-nvme-hetzner.63597/page-3

But it keeps crashing now and then. I must admit that every time I tried something it got better, but it still crashes every 2-4 days (!)...

What can I do about this?

Here is the error written out (might be typos):

Code:

mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 10: be2000c00002010b
mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff870f62ec> {mutex_spin_on_owner+0x6c/0xa0}
mce: [Hardware Error]: TSC 154e1c68e3c2e ADDR dedb80001d0f36 MISC 66119c0
mce: [Hardware Error]: PROCESSOR 0:806ea TIME 1621405756 SOCKET 0 APIC 0 microcode e0
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal machine check
Shutting down cpus with NMI
Kernel offset: 0x6000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Rebooting in 30 seconds..
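
As a rough aid for reading the dump above, the 64-bit status word on the first line can be split into its flag bits, and the TIME field can be read as a Unix timestamp. This is only a minimal sketch, assuming the Intel IA32_MCi_STATUS layout (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57, model-specific code in bits 31:16, MCA error code in bits 15:0). The helper name `decode_mci_status` is made up for illustration, and the input values are copied from the hand-typed dump, so they may contain typos.

```python
from datetime import datetime, timezone

# Flag bits of IA32_MCi_STATUS (assumed layout, taken from the Intel SDM).
FLAG_BITS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting for this error was enabled
    "MISCV": 59,  # MISC register content is valid
    "ADDRV": 58,  # ADDR register content is valid
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status):
    """Hypothetical helper: split a raw status word into flags and codes."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    codes = {
        "mca_error_code": status & 0xFFFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }
    return flags, codes


# Values copied from the "Machine Check Exception" and "TIME" fields above.
flags, codes = decode_mci_status(0xBE2000C00002010B)
crash_time = datetime.fromtimestamp(1621405756, tz=timezone.utc)

print("flags set:", [name for name, is_set in flags.items() if is_set])
print("MCA error code:", hex(codes["mca_error_code"]))
print("crash time (UTC):", crash_time.isoformat())
```

For this status word, UC and PCC come out set, which agrees with the "Processor context corrupt" line, and the TIME field decodes to 2021-05-19 06:29:16 UTC.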

Moayad · May 19, 2021

Could you try installing the new 5.11 kernel and test it [0]?

If you want to switch back to the 5.4 kernel, you can select the old kernel in the boot loader during a reboot and uninstall the pve-kernel-5.11 package.

[0] https://forum.proxmox.com/threads/kernel-5-11.86225/

sender · May 19, 2021

Thanks for your reply. I'm willing to try, but is this in any way linked to my issues?

Moayad · May 19, 2021

Not 100% sure, but the new kernel has more support for the HA, see the link below [0]

EDIT: I mean HW (HardWare) sorry

[0] https://forum.proxmox.com/threads/kernel-5-11.86225/post-380246

sender · May 19, 2021

It happened again and it is starting to get pretty annoying... I think I am on the verge of leaving Proxmox for what it is...

I do not want to, but I will upgrade to the newer kernel...

sender · May 20, 2021

FFS!

I upgraded to the newest kernel:
pve-manager/6.4-6/be2fa32c (running kernel: 5.11.17-1-pve)

And it crashed again! It is getting very frustrating now, since it really hangs completely, till Christmas and beyond.

I believe I have the most standard hardware setup imaginable on the market right now...

tom · May 20, 2021

Looks like an issue with your hardware.

Intel NUCs are known to work well.

sender · May 20, 2021

Easy to say... the error message is different after each "action/update"...

tom · May 20, 2021

sender said:
Easy to say... the error message is different after each "action/update"...

Maybe a cooling issue? As I said, as it works for others, there must be something different/broken on your box.

sender · May 23, 2021

It crashed again, this time after 4 days. Where is this information written in the logs, and how can I find it? I really want this to stay stable... Swapping hardware is an option, but what to swap? The NUC, the SSD, the memory? Everything is brand new and works for a good while (4 days this time).

This is not hardware, is it!?

sender · May 26, 2021

And it crashed again, with a new NUC! This must be software. Please assist...

Where can I find those "Machine check events logged"?

sender · Jun 1, 2021

Can I please receive some help on this error?

Where can I find those "Machine check events logged"?

sender · Jun 15, 2021

Can I please get some useful assistance? It keeps crashing every now and then. I am fully up to date with all firmware and updates. It lasts from a few days up to a week and then crashes... The most annoying part now is that after a power cycle (the only way to get it going again) all logging is gone... How can I overcome that?

sender · Jul 3, 2021

Is this really Proxmox? I have asked many times for help on "how" to see the logs after a crash. I cannot find them... I would appreciate some help with this ongoing issue...

bobmc · Jul 3, 2021

This is most likely a hardware issue, or a system that is overcommitted in terms of VM usage.

Can you specify which VMs are running?

Have you tried a different SSD?

jasonsansone · Jul 3, 2021

It isn’t clear what you have done to narrow down and identify your issues. I see you keep testing Proxmox, but that won’t eliminate whether this is a hardware or software issue.

Have you removed all but one DIMM and tested each module individually with memtest?

Have you tested any other hypervisor or OS? Results? Can you run Hyper-V, ESXi, or Debian stably?

Are you running the newest microcode? (apt install intel-microcode)

What happens when you benchmark or stress the CPU with prime95, geekbench, and passmark? Have you run a live ISO such as Ultimate Boot CD to test all hardware components?
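
The PROCESSOR line of the panic output earlier in the thread already carries the CPU signature (806ea) and the loaded microcode revision (e0), which helps when comparing against whatever a microcode package would install. The following is a minimal sketch of how that signature breaks down, assuming the standard x86 CPUID leaf 1 EAX layout. The function name `decode_cpuid_signature` is hypothetical, and both input values come from a hand-typed dump that may contain typos.

```python
def decode_cpuid_signature(signature):
    """Hypothetical helper: split a CPUID leaf 1 EAX value into its fields."""
    stepping = signature & 0xF
    base_model = (signature >> 4) & 0xF
    base_family = (signature >> 8) & 0xF
    ext_model = (signature >> 16) & 0xF
    ext_family = (signature >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only used for base family 0x6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return {"family": family, "model": model, "stepping": stepping}


# Values copied from "PROCESSOR 0:806ea ... microcode e0" in the panic output.
cpu = decode_cpuid_signature(0x806EA)
microcode_revision = int("e0", 16)

print("family %d, model 0x%x, stepping 0x%x" % (cpu["family"], cpu["model"], cpu["stepping"]))
print("microcode revision: 0x%x" % microcode_revision)
```

On a running system the same revision should appear in the `microcode` field of `/proc/cpuinfo`, so a change after installing new microcode would show up there.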

sender · Jul 3, 2021

Hi, thanks for the replies.

No, this is a fresh system installed with Proxmox.

Why am I not receiving a reply to the simple question... where can I see the log to see what happened?

Of course I can start stressing the hardware with all kinds of 3rd party software... but that takes time, effort, etc. and without even knowing "what" happens (log) I do not want to start that.... I don't even know how to do that on the current hardware and get everything back...

I will try to install:
apt install intel-microcode

EDIT, can't:

Code:

root@proxmox01:~# apt install intel-microcode
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Package intel-microcode is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'intel-microcode' has no installation candidate
root@proxmox01:~#

I have 2 VMs and 5 LXC containers...

Actually all systems do "nothing" except one, an LXC container with Docker on it running object detection with a Google Coral...

CPU load is around 25%
Hour average:

Month Average:

See above, it can run well for days but then suddenly it is "gone"...

jasonsansone · Jul 3, 2021

sender said:
Why am I not receiving a reply to the simple question... where can I see the log to see what happened?

It is a kernel panic. When the system locks up like this, there is typically no time or ability to write anything out to a persistent log file. You can enable debugging to dump more data on screen, but you would still need to catch the data while it is crashing.

Of course I can start stressing the hardware with all kinds of 3rd party software... but that takes time, effort, etc. and without even knowing "what" happens (log) I do not want to start that....

Yes, it takes time and effort. There is no other way to properly eliminate variables. You don't know what is happening, so the proper way to diagnose the cause is to eliminate variables one at a time. It is substantially more likely that there is a problem with your hardware than that you have found a problem in a mature kernel deployed to thousands of production servers.

sender · Jul 3, 2021

OK, fair enough... So I have another NUC8i5 with different memory (Samsung) and a different SSD (NVMe).

What is the best/easiest way to fully migrate all containers/VMs from one system to another?

I think that sounds like the best way to rule things out...

That other NUC8i5 with Samsung memory and a 1TB NVMe ran fine with ESXi for over a year...

qinqiang · May 17, 2022

I also have this problem, and the system restarts frequently. Did you solve it?
