Random kernel panic/crashes/reboot

The new kernel did not fix the issue for me. The server crashed again the very night after I installed it. I'm only running VMs, no containers, and turning them off is not an option for me.

I've now followed collider18's idea and switched to intel_cpufreq. I have done that by setting p-states to passive:

Code:
echo "passive" > /sys/devices/system/cpu/intel_pstate/status

This of course will revert back to the default after a reboot, but I'll deal with a permanent solution later if this works.

cpufreq-info will confirm you're using intel_cpufreq as the driver.
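If cpufreq-info isn't installed, the same information can be read straight from sysfs; a small sketch:

Code:
# should print "intel_cpufreq" once the driver is in passive mode
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
# and the status set above should now read "passive"
cat /sys/devices/system/cpu/intel_pstate/status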
Exactly what I tried. It is still running, but I am not overly optimistic because it has only been 6 days.
 
Crashed tonight with echo "passive" > /sys/devices/system/cpu/amd_pstate/status after 1.6 days.
I will now also add idle=poll to the kernel boot parameters...
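For anyone following along, this is roughly how I plan to add it (assuming a GRUB-booted host; installs that boot via proxmox-boot-tool would edit /etc/kernel/cmdline instead):

Code:
# append idle=poll to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet idle=poll"
update-grub
reboot
# confirm after the reboot:
grep -o idle=poll /proc/cmdline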
 
Unfortunately, mine also crashed. I requested a different server (with a different motherboard) and will check if this bug persists.
 
Has anyone bought a subscription and accessed the prod repo, and still experienced this bug? I wonder if, by paying for a subscription and creating a ticket, someone from the Proxmox team will help. It seems that in this forum, we will not receive any support from them.
 
The standard subscription includes 10 tickets per year and customer support via SSH. Perhaps I will pay for it. I am curious to hear what the Proxmox team has to say about this issue, because I have encountered the same bug on several servers since October (the same as you), despite changing hardware with no success. It seems there might be something wrong with the kernel itself.
 
I have the same issue as you; it appeared when I upgraded my host from Proxmox 7.14 to 8.1.3, kernel 6.2 to 6.5.11-7.

It was working fine on the old version.

Is this a kernel bug?

B660M, i3-12100 CPU, 64 GB non-ECC RAM
2x P44 Pro NVMe in ZFS for PVE 8

VM: Windows Server 2022
Passthrough of the onboard PCH SATA controller with 4x 16 TB HDDs to Windows Server 2022 for storage, plus a VirtIO block disk (100 GB) for the system
Passthrough of an ASM1166 6x SATA adapter in an NVMe (M.2) slot
Passthrough of a vGPU to Windows Server

The other VMs are shut down.
 
Hi guys! I think I have an idea for how we can debug this, but I'm going to need your help. Why don't we have any information in the kernel log? Maybe something is happening to the PCI Express bus and the system is losing access to the network and the NVMe drives; perhaps there's nothing in the logs because the system no longer has any storage to write the crash information to. What if we connect a USB flash drive to a USB 2.0 port (not a USB 3.0 port) and mount it at /var/log via /etc/fstab? Then, the next time the system crashes, it will still have access to the flash drive, and we may be able to dig up some information about this bug. Unfortunately, I don't have physical access to my servers, but you can do this in five minutes. What do you think? Please try it out.
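If it helps, a minimal sketch of what the fstab entry could look like (the UUID is a placeholder, take the real one from blkid, and ext4 is just my assumption for the filesystem):

Code:
# /etc/fstab - mount the USB 2.0 stick at /var/log
UUID=<uuid-from-blkid>  /var/log  ext4  defaults,noatime,nofail  0  2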
 
No luck with my server either; it has crashed multiple times since switching to intel_cpufreq.

I don't have physical access at the moment either, so I'm trying something else instead: I've opened an SSH session from another server and ran journalctl -f. The SSH session stays open (I'm using tmux for that), and I will receive all system log output until a crash occurs. Hopefully I will see output that otherwise gets lost because it can't be flushed to the SSD in time during the crash. This step was suggested in another thread.
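For anyone who wants to replicate this, it is roughly the following (the hostname is a placeholder, and tee just keeps a copy on the machine running the SSH session):

Code:
# on a second machine that stays up:
tmux new -s pve-log
ssh root@proxmox1 journalctl -f | tee -a proxmox1-journal.log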
 
I am not sure about that, because if you lose the PCIe bus you will have no network and no storage at all, and therefore no SSH. That's why I suggested USB 2.0 only: it does not go over PCIe, whereas USB 3.0 does. I need someone with physical access to their server to mount a flash drive at the /var/log folder.
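If you're not sure which physical ports are actually USB 2.0, something like this should show it (just a suggestion from my side):

Code:
lsusb -t             # tree view; the speed column (480M vs. 5000M/10000M) marks USB 2.0 vs. 3.x links
lspci | grep -i usb  # lists the USB host controllers the chipset exposes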
 
But keep it running anyway; maybe you will be able to intercept something. It's worth trying, but I think we have a better chance with the USB 2.0 flash drive.
 
My experiment didn't work. This morning it happened again. The remote log didn't show anything unusual and just stopped with
Code:
client_loop: send disconnect: Broken pipe

But I now have physical access to the machine and will try collider18's suggestion and also update firmware if updates are available.
 
Great, please try a USB 2.0 flash drive. Just rsync /var/log somewhere else, preserving ownership, mount the flash drive at /var/log, and rsync your logs back. Mount it permanently via /etc/fstab and just wait for the crash. Once again, do not use USB 3.0, because it sits on a PCIe x1 link. Maybe your effort will save all of us. I would do it myself, but as I said, I don't have any physical access to my servers. Thank you.
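Roughly what I have in mind, as a sketch (the device node is a placeholder, and ext4 is just my assumption; take the UUID from blkid for the fstab entry sketched earlier):

Code:
rsync -aX /var/log/ /root/log-backup/        # copy the logs aside, preserving ownership and attributes
mkfs.ext4 /dev/sdX1                          # format the stick (this erases it)
blkid /dev/sdX1                              # note the UUID and add the /etc/fstab entry for /var/log
mount /var/log
rsync -aX /root/log-backup/ /var/log/        # put the logs back, onto the stick this time
systemctl restart systemd-journald rsyslog   # make the loggers reopen their files on the new mount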
 
Hello,

I am a new user of Proxmox and of the Linux world in general. I'm not sure whether my case can help find a solution to this issue, but I've also noticed that my Proxmox server randomly restarts almost every day.

According to the system logs, I don't see anything that mentions an error; I simply have a "-- Reboot --" entry.
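(For context, this is roughly how I look at the logs; -b -1 selects the previous boot, so the last lines before the reboot are at the end:)

Code:
journalctl --list-boots   # list the boots the journal knows about
journalctl -b -1 -e       # open the previous boot's log, jumping to the end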

On my server I only have a PiHole server on an Ubuntu VM, plus another Ubuntu VM that I use for general tests and to build up my Linux knowledge.

I've been using a PiHole server for 2 years now, but previously, instead of a virtual machine, I had Ubuntu Server installed directly on the physical machine (the same PC I now use for Proxmox), and I never had these restart problems.

Unlike some, I am not using a Ryzen processor; I have an old i7-5500U that meets my needs.

I've attached the logs covering the period between the machine's last two reboots, in case that helps.
 

Attachments

  • Proxmox Logs.txt (258.8 KB)
(editing because I missed that there already was a second page.)

I currently don't have physical access, as I'm moving to a different location, but I can get to the server every few days. I also tried tailing the logs over SSH, which captured nothing, and from what I've read that confirms it won't help, as @collider18 stated. I thought the kdump crash-dump procedure would help with that.
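(For reference, the kdump setup I have in mind is roughly the following on Debian/Proxmox; kdump-tools is my assumption of the right package:)

Code:
apt install kdump-tools      # installs the crash-capture tooling
# make sure the kernel reserves memory for the capture kernel, e.g. crashkernel=384M-:512M
# on the kernel command line, then reboot and check:
kdump-config show            # should report "ready to kdump"
# after a panic, the dump should land under /var/crash/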

I can clearly see a correlation with running VMs versus only containers. Without VMs it didn't crash for more than 2 weeks; today I started using a VM and it crashed twice within a few hours, even with the P-state driver set to passive.

Regarding the subscription question: I only have the community subscription without tickets, but I have been using the production repo for about a year.
 
Could you do what I suggested and mount the /var/log folder on a USB 2.0 flash drive when you have access to the server? I don't have access to mine, and I hope you can test my theory. Thank you.
 
I am cautiously optimistic: I may have found a solution to the issue. By coincidence I came across this thread, which describes similar symptoms. What really caught my attention is that they mention the exact same hardware I am using (HP EliteDesk 800 G4).

Basically, the issue seems to be caused by the graphics driver and its GPU power management; with that power management disabled, the issue goes away. I followed this advice and haven't had a crash since, two weeks and counting. More background info here.

What you need to do is add i915.enable_dc=0 to your kernel parameters in /etc/kernel/cmdline and then run proxmox-boot-tool refresh. After a reboot you can verify your changes took effect by checking whether the new parameter is present in /proc/cmdline.
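For completeness, the steps roughly look like this (a sketch for hosts booted via proxmox-boot-tool; GRUB-booted hosts edit GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and run update-grub instead):

Code:
nano /etc/kernel/cmdline                    # append i915.enable_dc=0 to the single line of options
proxmox-boot-tool refresh
reboot
grep -o i915.enable_dc=0 /proc/cmdline      # verify after the reboot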
 
Thank you for the information. One more piece of evidence that this is related to power management; from here we have three candidates: SATA controller power management, GPU power management, and CPU power management. In my case I already ruled out the CPU with cstate=1, and it didn't help. Depending on the hardware, it could be any one of those. I suggest we try disabling all three together (rough sketch below), and if the problem is fixed, re-enable them one by one by removing the kernel parameters.
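To make that concrete, here is my guess at one knob per domain; these are assumptions about which parameters matter, not a verified fix:

Code:
# CPU : add intel_idle.max_cstate=1 processor.max_cstate=1 to the kernel command line
# GPU : add i915.enable_dc=0 (Intel iGPU); the AMD hosts in this thread already use amdgpu.dpm=0 etc.
# SATA: disable link power management at runtime:
for h in /sys/class/scsi_host/host*/link_power_management_policy; do
    echo max_performance > "$h"
done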
 
Chiming in to say that I've been experiencing the same issues on an i3-12100 system, where I've changed everything but the CPU and the boot drive without any improvement (running DDR4-3600 sticks at default clocks to be sure, first on a Gigabyte B660M-DS3H DDR4 and now on an Asus Pro B660M-C D4-CSM; the only add-in PCIe devices are two NVMe SSDs, a Crucial P1 and an Intel 660p). The instability occurs both on kernel 5.15.83-1 and on any kernel 6 (though at this point any kernel 6 version simply refuses to boot, even though they booted fine before).

I've experienced data loss through filesystem corruption in some VMs and am very distraught by all this. I had been running Proxmox for 5 years at this point, on Xeons and a Threadripper before, and never saw anything of the sort. I'm getting tired of changing settings and parts, watching it "seem" stable for over a week, only for it to finally "-- Reboot --" without any logs or dumps.

I'm seriously considering Windows Server, since Windows 11 has shown great stability on this hardware, running for over a month without crashes (and apparently even some Ryzen owners are affected by this issue). This whole thing leaves me dumbfounded.
 
I finally mounted the USB 2.0 flash drive at /var/log. I think the motherboard only has USB 3.0 ports, but the stick itself is USB 2.0. I will report back as soon as I get the next crash.
 
There are still no logs of the crash:

Code:
2024-03-15T14:43:00.091048+01:00 proxmox1 pvedaemon[598259]: worker exit
2024-03-15T14:43:00.120650+01:00 proxmox1 pvedaemon[5826]: worker 598259 finished
2024-03-15T14:43:00.120773+01:00 proxmox1 pvedaemon[5826]: starting 1 worker(s)
2024-03-15T14:43:00.120855+01:00 proxmox1 pvedaemon[5826]: worker 608400 started
2024-03-15T14:43:07.309734+01:00 proxmox1 pveproxy[589138]: worker exit
2024-03-15T14:43:07.334612+01:00 proxmox1 pveproxy[5836]: worker 589138 finished
2024-03-15T14:43:07.334696+01:00 proxmox1 pveproxy[5836]: starting 1 worker(s)
2024-03-15T14:43:07.337588+01:00 proxmox1 pveproxy[5836]: worker 608884 started
2024-03-15T15:30:20.546479+01:00 proxmox1 systemd-modules-load[422]: Inserted module 'vhost_net'
2024-03-15T15:30:20.546601+01:00 proxmox1 kernel: [    0.000000] Linux version 6.5.13-1-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.5.13-1 (2024-02-05T13:50Z) ()
2024-03-15T15:30:20.546633+01:00 proxmox1 dmeventd[434]: dmeventd ready for processing.
2024-03-15T15:30:20.546641+01:00 proxmox1 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.5.13-1-pve root=/dev/mapper/pve-root ro quiet amdgpu.dpm=0 amdgpu.aspm=0 amdgpu.bapm=0 amdgpu.runpm=0 crashkernel=384M-:512M
2024-03-15T15:30:20.546642+01:00 proxmox1 kernel: [    0.000000] KERNEL supported cpus:
2024-03-15T15:30:20.546643+01:00 proxmox1 kernel: [    0.000000]   Intel GenuineIntel
2024-03-15T15:30:20.546643+01:00 proxmox1 kernel: [    0.000000]   AMD AuthenticAMD
 
