Proxmox Server crashed, why?

Mrt12

Good day dear colleagues,
I have a freshly installed Proxmox Server with AMD EPYC CPUs.
On Friday I left it running through the weekend, like nothing special, just a bunch of normal VMs, nothing critical. Everything was working.
Today I was looking at it via remote IPMI and found that it had crashed; it must have happened sometime during the weekend.
Because it is IPMI, I cannot copy/paste the full text, but I attach a screenshot. I cannot work out why it crashed. Can someone interpret what is going on here?
 

Attachments

  • Screenshot_2025-10-27_08-20-29.png
Hello,

Can you dump the entire dmesg output and check whether it shows the same alerts as mine?

Because it seems pretty close to what I see :p
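
For anyone who wants to compare logs this way, here is a minimal sketch that filters a saved kernel log for crash-related lines. The keyword list and the function name `find_alerts` are assumptions made for illustration, not an official set of patterns; adjust them to whatever your own log shows.

```python
import re
import sys

# Hypothetical keyword list (an assumption): extend it to match your own alerts.
ALERT_PATTERN = re.compile(
    r"rcu.*stall|kernel panic|BUG:|Oops|mce:|Hardware Error|soft lockup",
    re.IGNORECASE,
)


def find_alerts(log_text):
    """Return the lines of a saved kernel log that match ALERT_PATTERN."""
    return [line for line in log_text.splitlines() if ALERT_PATTERN.search(line)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Expects the path of a text file holding the saved log.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for alert in find_alerts(handle.read()):
            print(alert)
```

Save the log first with `dmesg > dmesg.txt`, or use `journalctl -k -b -1 > dmesg.txt` for the previous boot if the journal is persistent, then pass the file name to the script.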
 
I absolutely cannot, because the machine refuses to boot when I do a reset. I tried everything: switch off the server, unplug all power cords, wait a couple of minutes and try again. Then it boots, and after the GRUB screen it seems to get stuck at "Loading initial ramdisk...".
I then rebooted, edited the kernel command line and removed the "quiet" option. Now I can see this (see attachment). I have no idea what is going on here. On Friday it was still working, and in my email inbox I have notifications from cron jobs that ran on Friday night. But for some reason it probably reset itself and is now stuck.

Never have I ever seen something like this.
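
Removing "quiet" at the GRUB prompt only lasts for one boot. To make it permanent on a Debian-based install, the option is usually dropped from `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`. Below is a minimal sketch of that text edit, assuming a GRUB-booted system; the helper name `drop_quiet` is hypothetical, and the function only transforms text that you pass in.

```python
import re


def drop_quiet(grub_defaults):
    """Return the given /etc/default/grub text without 'quiet' in
    GRUB_CMDLINE_LINUX_DEFAULT. All other lines are left untouched."""

    def strip_quiet(match):
        # Keep every kernel option except the literal word 'quiet'.
        options = [opt for opt in match.group(2).split() if opt != "quiet"]
        joined = " ".join(options)
        return match.group(1) + '"' + joined + '"'

    return re.sub(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=)"([^"]*)"',
        strip_quiet,
        grub_defaults,
        flags=re.MULTILINE,
    )
```

After writing the result back to the file, running `update-grub` regenerates the boot configuration. Keep a backup of the original file before changing it.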
 

Attachments

  • DSC_0278.jpg
Maybe you can try to use the console and boot into safe mode.
This may help you read the logs or revert to a previous kernel.

Do you have custom sysctl settings or tuned? If so, you could also try disabling them.
 
I have nothing configured, it is a super plain Proxmox. I just installed Debian (netinst) and then on Debian installed Proxmox. I configured absolutely nothing, except the SMTP server for sending the notifications. For this reason I am so puzzled, I never saw that.
 
Hmm, maybe it could be a hardware problem... I have brand new hardware (only 4 days old). It ran fine for 3 days and yesterday it crashed. I have now tried to recover it but, interestingly, not even the BIOS screen appears anymore. Completely dead.

However, what is maybe interesting:

I uploaded the screenshot of the kernel panic to ChatGPT, and it thinks this is a kernel bug that mainly affects AMD EPYC. It recommends the following fixes:

[Attached screenshots: 1761570985688.png, 1761571043175.png, 1761571063969.png]


So I don't know if this is of any use to someone. Honestly, I would have tried it, but currently my system fails to boot at all, so I cannot try even one of these fixes.
 
Try the official Proxmox VE installer ISO. If the issue is related to the 6.14 kernel, you could try installing the opt-in 6.17 kernel.
If you can't boot at all anymore (not even from USB live media), then it's most likely a hardware issue.

Also check whether firmware updates are available.
 
OK, this thread is resolved now, and you guys won't believe the solution.
I cannot reproduce the above problem with the RCU / TLS bug.
Also, the machine refused to boot at all, with no POST and no BIOS screen. After calling the manufacturer, we decided it was worth exchanging the CPU. Yesterday I got a fresh CPU in the mail and swapped it in, and now everything works perfectly again, no problems at all. So the CPU was bad.
Why? Nobody knows. I have used computers for almost 30 years and have never seen a CPU work and then suddenly stop.
So it was not a Proxmox problem and not a kernel bug, but a CPU that was starting to do weird things. Funnily enough, after the kernel panic happened and I rebooted the machine, it first worked fine; I could enter the BIOS and everything, no problem. The CPU failure occurred randomly after some time.
I have now run the prime95 benchmark several times and observed no crashes, so I will leave the system running for a few more days before I fully commission it again.