Random kernel panics

znt

I have a small home lab and run Proxmox on a handful of nodes, two of which are ThinkCentre M90q Gen 2 Tiny (Intel) desktops with an i7-11700T. For the last year, I've been experiencing random crashes. Nothing gets printed to the screen. Nothing obvious shows up in the logs. The machine just completely freezes.

From reading through the forum, it seemed like the issue was with the e1000-based NICs or C-states being enabled in the BIOS. I've reviewed suggestions on the board and tried various troubleshooting steps (e.g. disabling C-states), but nothing has worked. The crashes seem to be less frequent when the machines are under less load.

I've tried to enable kdump, following directions on the forum, but nothing ever shows up in the /var/crash directory.

Code:
 kdump-config show
DUMP_MODE:        kdump
USE_KDUMP:        1
KDUMP_COREDIR:        /var/crash
crashkernel addr: 0x67000000
   /var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinuz-5.15.107-1-pve
kdump initrd:
   /var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-5.15.107-1-pve
current state:    ready to kdump

kexec command:
  /sbin/kexec -p --command-line="BOOT_IMAGE=/boot/vmlinuz-5.15.107-1-pve root=/dev/mapper/pve-root ro quiet reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll nousb ata_piix.prefer_ms_hyperv=0" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz

Code:
tail /sys/kernel/kexec_crash_loaded
1

Then I simulated a kernel panic.

Code:
echo c >/proc/sysrq-trigger
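
For completeness, a quick way to double-check that the crash kernel memory is actually reserved (just a sanity check of the standard Debian kdump-tools setup):

Code:
# confirm the running kernel was booted with a crashkernel= reservation
grep -o 'crashkernel=[^ ]*' /proc/cmdline
# confirm the kernel actually reserved that memory at boot
dmesg | grep -i crashkernel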

If anyone has any suggestions, it would be greatly appreciated.
 
If your BIOS is not up to date, try updating it. I had the same problem with an ASRock Rack W480D4U motherboard + Intel i7-10700 CPU; after a BIOS update the problem was gone.
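
If it helps, you can read the currently installed BIOS version straight from the OS before deciding whether an update is needed (dmidecode should already be installed on Proxmox, or is an apt install away):

Code:
# BIOS version and release date as reported by SMBIOS
dmidecode -s bios-version
dmidecode -s bios-release-date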
 
Exactly the same problem with an m90q gen2 with i5-11500. Completely “silent” restarts (no messages, no kernel dump, no evidence of a panic). It seems to happen more frequently when there is heavy write activity to NVMe (using Gen4 M.2 drives). Updated BIOS does not seem to help. Tried all the available kernels under 7.x as they have been made available over the last several months.

They seem more like HW resets than a software issue, like a VRM shutdown or something. I’d suspect faulty hardware except that the exact same issue is present on 3 separate and identical m90q devices - I move the write-heavy workload and the random resets move with it.

Very frustrating.
 
Have you tested the memory with e.g. memtest86?
I did a long while back. I actually have two identical units and both exhibit similar behavior, one slightly more than the other.

Same problem here. Updated the BIOS. Even tried disabling the internal NIC and using a USB NIC, since it seems like other users have trouble with the e1000 drivers.

Mine both have Samsung MZVL2512HCJQ 512GB NVMe M.2 drives. A write-heavy workload does seem like a match for me as well. I'll try to move some workloads around and see if that helps alleviate the crashing.
 
Quick update. I purchased and installed a pair of Samsung 970 EVO Plus 1TB NVMe M.2 drives. These are Gen3 drives, which is a bit of a downer, but I've read it's a pretty insignificant difference. I had been running into storage issues anyway, so getting a pair of 1TB drives in was a big upgrade over the 512GB that shipped with the unit. I was happy to find these Lenovo machines support two NVMe M.2 drives.

I reinstalled Proxmox and continued to see crashes. I then added this back into my `/etc/network/interfaces` file, as others have reported this was necessary to address problems with the e1000-based NICs.

Code:
iface eno1 inet manual
        post-up /usr/bin/logger -p debug -t ifup "Disabling segmentation offload for eno1" && /sbin/ethtool -K $IFACE tso off gso off && /usr/bin/logger -p debug -t ifup "Disabled offload for eno1"
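
After bringing the interface up again, you can verify that the offloads actually got turned off (a quick check only, assuming the interface is still named eno1):

Code:
# both lines should now report "off"
ethtool -k eno1 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload'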

I have a couple of Dells that also support Gen4 NVMe M.2 drives. I might try to move these 512GB drives into the Dells, since they shipped with 256GB Gen4 NVMe M.2 drives. I'll report back on whether the crashing follows the drives into the Dell units and/or the Gen3 drives mitigate the crashing on the Lenovo units.
 
I think replacing the drives with gen3 drives did the trick. The worst offending node is currently at 2 days of uptime (previously it was crashing every handful of hours). I've also doubled the number of VMs on it.
 
I'll be curious if your system stays up. Mine also crashes unless I disable C-states. Haven't figured out why. Haven't been able to get a crash dump working.

The e1000 NIC is an emulated NIC, from what I understand.
 

Both machines have been up for almost 5 days. I was able to set the BIOS back to defaults, although I later disabled Bluetooth and WiFi (since I didn't need either).

Code:
uptime
 00:25:26 up 4 days, 23:38,  1 user,  load average: 1.97, 2.07, 2.52

uptime
 00:22:46 up 4 days, 22:47,  1 user,  load average: 1.98, 1.95, 2.13


Seems like this may have resolved my issue:

1. Replacing the Gen4 NVMe drives with Gen3 drives.
2. Updating the network settings as described above.
3. Reinstalling Proxmox.

I left C-states alone, after reverting bios to default settings.

Appreciate the suggestions about the SSD. I had previously assumed the crashes happened when network traffic spiked.
 
Hmmm... replacing Gen4 drives with Gen3 drives? It would be good to know what Proxmox officially supports. My NVMe is Gen4 and I don't see changing it out. I would think the NVMe generation should be transparent to Proxmox, as the speed difference is handled in hardware, not software.

Where do I find more info about the Ethernet changes you made?
 
I should note that I have Gen4 drives in other Proxmox nodes (Dell 1L units) that are problem-free. Seems like it was just the combination of hardware in the Lenovo machines that was having issues. I don't know the speed difference and needed the extra space (2x 1TB vs a single 512GB drive).

For the e1000 issues, there are several threads on it. This one goes back quite a long way and has some recent posts.

https://forum.proxmox.com/threads/e1000-driver-hang.58284/page-9#post-558410

Good luck!
 
I'll try it, but e1000 is the emulated NIC, not the physical NIC. My system crashes with nothing but Proxmox running, so no emulated NICs. I have a 2.5Gbps NIC; anything is worth trying.
 
Update... unfortunately the NIC changes did not work for me. Still crashes once per day. :( Might need to look for a different hypervisor. I just can't get Proxmox to run stable, and I can't seem to enable crash logs, so it's shooting in the dark for debugging.

Thanks for the suggestion though!
 
I think I've gotten my system stable, but I am really not happy with the side effects of the workaround. It's been running stable for a few days, but I'd really like to get at least 30 days of uptime before declaring victory (a pyrrhic victory, perhaps).

For background, so people don't have to re-read the thread: the system is a Lenovo m90q gen2, i5-11500, 64GB, 2x NVMe. Proxmox runs great up to the point that I start a write-intensive VM (BlueIris NVR with 9 cameras, continuous recording to disk with alerts/clips triggered by motion detection and an AI filter). If that one VM is running, then the whole system - not just that VM - will just hang after some time. Could be an hour or could be a day, but the behavior is very consistent. After some time, never more than 36 hours, the whole system just stops. No messages, no kernel panic, no core dump. Just a silent stall. I have more than one identical m90q and the same problem presents if you run the write-intensive VM on any of them.

Note: yes, I know it's silly to do continuous-write NVR to an M.2 NVMe drive. That is a temporary solution and is not the point of the problem.

As discussed above, I tried all of the things suggested here: the e1000 "fix", changing NVMe drives to see if it is drive-related or PCIe 3.0/4.0-related, disabling C-states in the BIOS, etc.

Did a lot of Google searches and talked to a few old friends. C-state issues kept coming up as the likely cause that best fit the symptoms. If your system just silently stops like this, it is almost always C-state related. After some more digging I realized that disabling C-states in the BIOS doesn't really disable them on modern Linux systems. The kernel power management driver (intel_idle) appears to turn them back on.
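
An easy way to see what the kernel is actually doing, regardless of the BIOS setting (sysfs layout may vary slightly between kernels):

Code:
# which cpuidle driver is in charge (intel_idle vs acpi_idle vs none)
cat /sys/devices/system/cpu/cpuidle/current_driver
# which idle states the kernel exposes on cpu0
cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name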

So I disabled C-states by adding the following kernel arguments to the boot command line:

processor.max_cstate=1 intel_idle.max_cstate=1
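
In case it saves someone a search: on a GRUB-booted Proxmox install these go into /etc/default/grub (systemd-boot/ZFS installs use /etc/kernel/cmdline and proxmox-boot-tool refresh instead). A sketch for the GRUB case:

Code:
# /etc/default/grub -- append the options to the default command line, e.g.
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet processor.max_cstate=1 intel_idle.max_cstate=1"
# then regenerate the config and reboot
update-grub
reboot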

The docs on intel_idle are not really clear about how it interacts with the ACPI idle driver. Just to be safe I set both of them to allow a max C-state of "1". And voilà - the system has been running with the NVR VM for several days. I want to see it run stable for 30 days before I really declare this the fix. I did try with different levels of C-states enabled: setting intel_idle.max_cstate=3 still crashed, but =1 seems to stabilize it.

Now for the pyrrhic part - running with these kernel options has caused a ~35% increase in power consumption. Before (with crashes) it averaged about 60W with the NVR VM running. With the "fix" in place it is averaging 82W. Also, temps are a bit high, with Tcase hitting 77-80°C and individual cores running from 65-75°C (the chassis is in a room at rather normal room temperature). I didn't notice temp issues without the "fix", but the whole reason I checked was hearing the fan revving up, and I don't recall it doing that before. I'm guessing this is because the NVR VM is set to run on 6 vCPUs and the i5-11500 is 6-core/12-thread, which leaves room for roughly half of the cores to be idle most of the time, but they are never allowed to actually hit a low-power state.
 
It's definitely a C-state issue. Disabling C-states, the problem goes away. Idle power goes from 40W -> 120W with C-states disabled, which is also unacceptable. I'm unsure why some folks running the same processor are not also experiencing the issue.

I'm thinking it's a C-state issue tied in with something that is trying to be serviced and failing. What that something is, I have not figured out.
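
One middle ground that might be worth trying (untested here) is leaving C1/C1E alone and disabling only the deeper states at runtime with cpupower, which would also let you bisect which state actually triggers the hang:

Code:
apt install linux-cpupower   # if not already installed
cpupower idle-info           # list the available idle states and their indexes
cpupower idle-set -d 3       # disable the state with index 3 (check idle-info for which C-state that is)
cpupower idle-set -e 3       # re-enable it later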
 
Update: still seeing kernel panics, but the frequency has gone down... from multiple times per day to once every few days.

Next steps for me:

- Try to figure out how to enable kdump.
- Adjust C-states in the Lenovo BIOS.

Sigh.
 
Just wanted to share an update:

- I've tried adjusting BIOS CPU settings (C1E support, C-state support), but that doesn't seem to help. I think I tried several different permutations, with the most recent being disabling C1E support and setting C-state support to C1 only. That didn't help.

- I've tried moving to a USB NIC, but saw the same crashing.

- I checked the memory. For reference, I have 2x16GB of memory in each node... but the memory type is different. Both nodes experience similar crashing behavior.

- I've enabled kdump and have confirmed it's enabled, but don't get any crash logs.

- Nothing regarding the crash is displayed to STDOUT.

I've been trying to resolve this for a year and think I'm ready to surrender (i.e. may repurpose these nodes for something else in the homelab). The other 10 nodes in my cluster work great. If anyone has any last minute tips before I go that route, feel free to let me know.
 
Sorry... I'm kinda in the same boat, looking for other options.

You can try the following command to see if anything stands out:

Code:
dmesg --level=err,warn
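
dmesg only covers the current boot, so if the journal is persistent it's also worth checking what was logged right before the previous crash:

Code:
# errors/warnings from the boot before this one (requires a persistent journal)
journalctl -b -1 -p warning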
 
Thanks. Just an FYI. I'm currently trying with the VM hard disks configured with aio=native. Seems like that may have helped others in the past. I'll report back with the results.
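
For anyone else wanting to test the same thing, the async IO mode is set per virtual disk; a sketch with placeholder VM ID and volume names (it can also be changed in the GUI under the disk's Advanced options):

Code:
# example only: set aio=native on an existing disk, keeping any other disk options you already use
qm set 100 --scsi0 local-lvm:vm-100-disk-0,aio=native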
 
Hi lads,

Just thought I'd drop by as I've been having similar issues and started my own thread about a month ago. Tried everything here (and perhaps a bit more!) except the C-states tweak, I believe. I'll give that a go tomorrow I suppose.

It just caught my eye, as the way you've been describing your issues sounds astonishingly familiar. I would say "don't give up!" but at some point I guess you have to cut your losses. Unfortunately that's not really an option for me, so I'm hoping someone manages to figure things out!

If I find anything of note I'll drop something here.

Best of luck,

Chris
 
