Proxmox Server crashed, why?

Mrt12

Good day dear colleagues,
I have a freshly installed Proxmox Server with AMD EPYC CPUs.
On Friday I left it running over the weekend with nothing special on it, just a handful of normal VMs, nothing critical. Everything was working.
Today I looked at it via remote IPMI and found that it had crashed; it must have happened sometime during the weekend.
Because it is IPMI, I cannot copy/paste the text, but I have attached a screenshot. I cannot work out why it crashed. Can someone interpret what is going on here?
 

Attachments

  • Screenshot_2025-10-27_08-20-29.png
Hello,

Can you dump the entire dmesg output and check whether it shows the same alerts as mine?

Because it seems pretty close to what I see :p
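
On a machine that still boots, something along these lines could collect the kernel log for comparison. This is only a rough sketch: `save_kernel_logs` and `kernel_alerts` are hypothetical helper names, the pattern list is a guess at what typically accompanies a crash, and the previous-boot log is only there if journald keeps a persistent journal.

```shell
#!/bin/sh
# Sketch only: save_kernel_logs and kernel_alerts are hypothetical helpers.

# Save the kernel log of the current boot and, if available, of the previous
# boot. The previous boot is only retrievable when journald stores a
# persistent journal (for example when /var/log/journal exists).
save_kernel_logs() {
    outdir="${1:-.}"
    dmesg -T > "$outdir/dmesg-current.txt"
    journalctl -k -b -1 --no-pager > "$outdir/dmesg-previous.txt" 2>/dev/null \
        || echo "no previous-boot kernel log available" >&2
}

# Print the lines of a saved log that often accompany a crash.
# The pattern list is an assumption, not an exhaustive set.
kernel_alerts() {
    grep -iE 'panic|oops|BUG:|mce:|machine check|hardware error|call trace|lockup' "$1"
}

# Possible usage (as root):
#   save_kernel_logs /root
#   kernel_alerts /root/dmesg-previous.txt
```

If the previous-boot file is empty, the crash most likely happened before anything could be written to disk, and a screenshot of the console remains the only record.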
 
I absolutely cannot, because the machine refuses to boot after a reset. I tried everything: switched off the server, unplugged all power cords, waited a couple of minutes, and powered it back on. Then it boots, but after the GRUB screen it seems to get stuck at "Loading initial ramdisk...".
I then rebooted, edited the kernel command line, and removed the "quiet" option. Now I can see this (see attachment). I have no idea what is going on here. On Friday it was still working, and in my email inbox I have notifications from cron jobs that ran on Friday night. But for some reason it probably reset itself and is now stuck.

Never have I ever seen something like this.
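
For anyone who wants the verbose boot output permanently once the system boots again, removing "quiet" from the GRUB defaults might look roughly like this. It is a sketch that assumes the usual Debian layout with `/etc/default/grub`; `strip_quiet` is a hypothetical helper that only prints the edited file and changes nothing in place.

```shell
#!/bin/sh
# Sketch only: strip_quiet is a hypothetical helper. It prints the edited
# file to stdout and does not modify anything in place.
# Note: \b is a GNU sed extension, which is what Debian ships.
strip_quiet() {
    sed -E '/^GRUB_CMDLINE_LINUX_DEFAULT=/ s/\bquiet\b ?//' "$1"
}

# Possible usage (as root), assuming GRUB with the Debian default layout:
#   cp /etc/default/grub /etc/default/grub.bak
#   strip_quiet /etc/default/grub.bak > /etc/default/grub
#   update-grub
```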
 

Attachments

  • DSC_0278.jpg
Maybe you can use the console and boot into safe (recovery) mode.
This may help you check the logs or revert to a previous kernel.

Do you have custom sysctl settings or tuned? If so, you could also try disabling them.
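
From a recovery or rescue shell, checks like these could be sketched as follows. `list_kernels` and `custom_sysctls` are hypothetical helper names, and the default paths assume a standard Debian layout.

```shell
#!/bin/sh
# Sketch only: list_kernels and custom_sysctls are hypothetical helpers.

# List the installed kernel versions, oldest first, by looking at the
# vmlinuz-* images in the boot directory (default: /boot).
list_kernels() {
    bootdir="${1:-/boot}"
    for f in "$bootdir"/vmlinuz-*; do
        [ -e "$f" ] || continue
        basename "$f" | sed 's/^vmlinuz-//'
    done | sort -V
}

# Print every active (non-comment, non-empty) line from the sysctl drop-in
# files (default: /etc/sysctl.d), to spot custom tuning.
custom_sysctls() {
    dir="${1:-/etc/sysctl.d}"
    grep -hvE '^[[:space:]]*([#;]|$)' "$dir"/*.conf 2>/dev/null
}

# Possible usage:
#   list_kernels              # versions that can be picked in the GRUB menu
#   custom_sysctls            # any non-default sysctl settings
#   systemctl is-enabled tuned  # reports whether tuned starts at boot
```

An older version from the first list can then be chosen under "Advanced options" in the GRUB menu.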
 
I have nothing configured; it is a completely plain Proxmox. I just installed Debian (netinst) and then installed Proxmox on top of it. I configured absolutely nothing except the SMTP server for sending notifications. That is why I am so puzzled; I have never seen this before.
 
Hmm, maybe it could be a hardware problem... The hardware is brand new (only 4 days old). It ran fine for 3 days, and yesterday it crashed. I have now tried to recover it but, interestingly, not even the BIOS screen appears anymore. Completely dead.

However, what is maybe interesting:

I uploaded the screenshot of the kernel panic to ChatGPT, and it thinks this is a kernel bug that mainly affects AMD EPYC. It recommends the following fixes:

1761570985688.png

1761571043175.png

1761571063969.png


I don't know if this is of any use to anyone. Honestly, I would have tried it, but currently my system fails to boot at all, so I cannot try any of these fixes.