Proxmox random reboots on HP Elitedesk 800g4 - fixed with proxmox install on top of Debian 12 - now issues with hardware transcoding in plex

limit · Sep 7, 2023

Just an FYI: I updated Proxmox tonight, and one of the updates was pve-firmware. This put the kbl_dmc_ver1_04.bin file back, and after a reboot the system crashes returned. Renaming the file once again fixed the issue for now.

hanzoh · Sep 7, 2023

Yes, each upgrade of pve-firmware will reinstate that file.
In the bug report upstream (https://gitlab.freedesktop.org/drm/intel/-/issues/9244), I was given the workaround of adding i915.enable_dc=0 to the kernel command line (in /etc/default/grub). This also works, as it disables the GPU's display power-saving (DC) states.

In the meantime, I did so much testing with my three identical machines, that I am certain this is a mainboard issue on the first node.
I ordered a 4th used one and that also does NOT have the issue. Now I have my working 3 node cluster and will use the faulty one as a Windows machine. I don't know what the Windows GPU driver does differently such that it does not trigger the issue.

mcdy · Sep 7, 2023

Just for documentation, I have the same problem on a HP ProDesk 400 G4 Mini and so far it looks like connecting a monitor works.
The only difference is that my system dies after "some" reboots and gets completely stuck while it gets hotter and hotter.

Now I will try the workaround with renaming the file.

Cheers!

mcdy · Sep 7, 2023

hanzoh said:
Yes, each upgrade of pve-firmware will reinstate that file.
In the bug report upstream (https://gitlab.freedesktop.org/drm/intel/-/issues/9244), I was given the workaround of adding i915.enable_dc=0 to the kernel command line (in /etc/default/grub). This also works, as it disables the GPU's display power-saving (DC) states.

In the meantime, I did so much testing with my three identical machines, that I am certain this is a mainboard issue on the first node.
I ordered a 4th used one and that also does NOT have the issue. Now I have my working 3 node cluster and will use the faulty one as a Windows machine. I don't know what the Windows GPU driver does differently such that it does not trigger the issue.

Hello,

does that mean you won't do anything more about this problem? That would be a terrible loss, because even if it's a problem with the mainboard and only occurs sporadically, it should still affect several devices and a final solution would of course be better than a workaround.

I also have a question about this:
Would it be better to disable it in the kernel or run a script automatically after each update?

Thank you!
mcdy

hanzoh · Sep 7, 2023

If the Intel Linux team comes back with further testing, I will gladly continue, right now I only tested one kernel patch which unfortunately did not help.
While using the "broken" machine, I monitored any pveupgrade/apt upgrade that I ran via SSH for upgrade to pve-firmware and then renamed the kbl_dmc_ver1_04.bin again. But now that I have learned about the kernel parameter, I would probably go with that one.
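
For anyone who prefers the rename approach and wants to repeat it after each pve-firmware upgrade, here is a minimal sketch. The function name `disable_dmc_firmware` and the `.disabled` suffix are illustrative assumptions, not something from the bug report; the default directory is the firmware path mentioned later in this thread.

```shell
# Rename the DMC firmware blob so the i915 driver can no longer load it.
# Usage: disable_dmc_firmware [firmware_dir]
# The default directory is the path quoted in this thread; pass another
# directory if your distribution stores the file elsewhere.
disable_dmc_firmware() {
    fw_dir="${1:-/lib/firmware/i915}"
    fw_file="$fw_dir/kbl_dmc_ver1_04.bin"

    if [ -e "$fw_file" ]; then
        # -f overwrites a stale .disabled copy left over from an earlier run
        mv -f "$fw_file" "$fw_file.disabled" && echo "renamed $fw_file"
    else
        echo "nothing to do: $fw_file not present"
    fi
}

# Run as root after an upgrade that touched pve-firmware, then reboot:
#   disable_dmc_firmware
```

One way to automate this might be an APT `DPkg::Post-Invoke` hook that calls a script containing the function, but that is untested here. The kernel parameter avoids the need for any of this.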

limit · Sep 7, 2023

mcdy said:
Just for documentation, I have the same problem on a HP ProDesk 400 G4 Mini and so far it looks like connecting a monitor works.
The only difference is that my system dies after "some" reboots and gets completely stuck while it gets hotter and hotter.

Now I will try the workaround with renaming the file.

Cheers!

I noticed while troubleshooting that sometimes the node would be in a frozen/offline state with a red LED on the front. If I connected a monitor and rebooted it, it would show that it had shut down due to thermal overheat. Most of the time, the system hangs for a while (about 5 minutes unresponsive), then eventually crashes and reboots; sometimes it simply crashes and reboots right away.

limit · Sep 7, 2023

hanzoh said:
If the Intel Linux team comes back with further testing, I will gladly continue, right now I only tested one kernel patch which unfortunately did not help.
While using the "broken" machine, I monitored any pveupgrade/apt upgrade that I ran via SSH for upgrade to pve-firmware and then renamed the kbl_dmc_ver1_04.bin again. But now that I have learned about the kernel parameter, I would probably go with that one.

This is a better workaround, I feel. I'm going to implement it tonight and see how it goes. Now that I'm confident in a fix (workaround) for these devices, I'll probably add a 3rd node to my cluster as well.

hanzoh · Sep 8, 2023

limit said:
I noticed while troubleshooting that sometimes the node would be in a frozen/offline state with a red LED on the front. If I connected a monitor and rebooted it, it would show that it had shut down due to thermal overheat. Most of the time, the system hangs for a while (about 5 minutes unresponsive), then eventually crashes and reboots; sometimes it simply crashes and reboots right away.

I have never seen a red LED, so maybe you really have a temperature issue.
Try installing lm-sensors and check the temperatures with `sensors`

limit · Sep 8, 2023

hanzoh said:
I have never seen a red LED, so maybe you really have a temperature issue.
Try installing lm-sensors and check the temperatures with `sensors`

Yeah, when it happened I did install lm-sensors and ran it with `watch -n 2 sensors`. I never saw high temps under normal circumstances, but sometimes the fans would ramp up during a crash (presumably because of the bug we're facing). I did decide to go ahead and reapply thermal compound and clean out the heatsink and fans on both units. I don't really think that made much of a difference, but I haven't had the issue since (I've been running with the workaround for weeks now, so there's that too).
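
Since `watch` output is lost when the machine hangs, it can help to write readings to disk so the last values before a crash survive. A minimal sketch, assuming the kernel exposes thermal zones under `/sys/class/thermal` (values there are in millidegrees Celsius); the function name and the log path in the usage comment are illustrative.

```shell
# Print one line per thermal zone: "<zone> <type> <whole degrees C>".
# sysfs reports temperatures in millidegrees Celsius.
# Usage: read_thermal_zones [thermal_root]
read_thermal_zones() {
    root="${1:-/sys/class/thermal}"
    for zone in "$root"/thermal_zone*; do
        # Skip when the glob matched nothing or the zone is unreadable
        [ -r "$zone/temp" ] || continue
        millideg=$(cat "$zone/temp")
        zone_type=$(cat "$zone/type" 2>/dev/null || echo unknown)
        echo "$(basename "$zone") $zone_type $((millideg / 1000))"
    done
}

# Example logging loop (hypothetical log path), flushing after each write:
#   while true; do
#       echo "$(date '+%F %T') $(read_thermal_zones | tr '\n' ';')" >> /root/thermal.log
#       sync
#       sleep 10
#   done
```

If the last entries before a hang show normal temperatures, overheating is more likely a consequence of the hang than its cause.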

mcdy · Sep 8, 2023

limit said:
I noticed while troubleshooting that sometimes the node would be in a frozen/offline state with a red LED on the front. If I connected a monitor and rebooted it, it would show that it had shut down due to thermal overheat. Most of the time, the system hangs for a while (about 5 minutes unresponsive), then eventually crashes and reboots; sometimes it simply crashes and reboots right away.

Hello, that sounds like my problem.
But my device stays powered on with the fan blowing, and it still kept getting warmer.
The case then reached quite a high temperature.

Yesterday, I set the kernel parameter and restarted the device without a monitor, and it was still running this morning.

In case someone else hasn't done something like this yet:

Code:

nano /etc/default/grub

Search for the line with

Code:

GRUB_CMDLINE_LINUX_DEFAULT

and add it here in the inverted commas.
e.g.:

Code:

GRUB_CMDLINE_LINUX_DEFAULT="quiet i915.enable_dc=0"

Code:

update-grub

after restart check config with

Code:

cat /proc/cmdline

Source: https://askubuntu.com/questions/19486/how-do-i-add-a-kernel-boot-parameter
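
The manual edit above can also be scripted so that it is safe to run more than once. This is only a sketch: the function name `add_grub_param` is made up for illustration, it assumes the value is wrapped in double quotes as in the example above, and it assumes the parameter contains no `/` or `&` characters (which would confuse `sed`).

```shell
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT unless it is
# already on that line. Usage: add_grub_param <param> [grub_file]
add_grub_param() {
    param="$1"
    grub_file="${2:-/etc/default/grub}"

    # Fixed-string match so the dot in "i915.enable_dc" is taken literally
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$grub_file" | grep -qF -- "$param"; then
        echo "already present: $param"
        return 0
    fi

    # Insert the parameter just before the closing double quote
    sed -i "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 $param\"/" "$grub_file"
    echo "added: $param"
}

# As root, followed by update-grub and a reboot:
#   add_grub_param i915.enable_dc=0
```

After the reboot, `cat /proc/cmdline` should list the parameter, as described above.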

Cheers!

limit · Sep 9, 2023

mcdy said:
Hello, that sounds like my problem.
But my device stays powered on with the fan blowing, and it still kept getting warmer.
The case then reached quite a high temperature.

Yesterday, I set the kernel parameter and restarted the device without a monitor, and it was still running this morning.

In case someone else hasn't done something like this yet:

Code:

nano /etc/default/grub

Search for the line with

Code:

GRUB_CMDLINE_LINUX_DEFAULT

and add it here in the inverted commas.
e.g.:

Code:

GRUB_CMDLINE_LINUX_DEFAULT="quiet i915.enable_dc=0"

Code:

update-grub

after restart check config with

Code:

cat /proc/cmdline

Source: https://askubuntu.com/questions/19486/how-do-i-add-a-kernel-boot-parameter

Cheers!

Still running OK without crashing or overheating?

Aussi · Sep 22, 2023

limit said:
I don't really know what it's meant for. Perhaps some kind of power saving feature that kicks in because a monitor is detached? It's really strange because 2 out of 3 identical devices had this issue. hanzoh seems to know more about it than I do; he's opened a bug with the developer team.

It's still going strong on my device! Really nice!
It started happening when I upgraded from a NUC i3 5th gen to a NUC i3 8th gen.

rophan · Oct 29, 2023

Just wanted to thank the contributors here, this information has really helped me! I am running Ubuntu 22.04 on a HP EliteDesk 800 (G4?) with i7-8700T as a headless home server and was having random reboots, around 6 per 24h on average.

I've tried various things, including disabling "Extended idle states" in the BIOS and adding the following kernel command line options:

Code:

GRUB_CMDLINE_LINUX_DEFAULT="-- i915.enable_psr2_sel_fetch=0 i915.enable_psr=0 intel_idle.max_cstate=1 i915.enable_dc=0 ahci.mobile_lpm_policy=1 processor.max_cstate=1 i915.disable_power_well=1"

I found none of the max_cstate options seemed to work (`/sys/module/intel_idle/parameters/max_cstate` always shows 9).
I thought the BIOS setting had done it (before removing the firmware), but still had a crash over night.
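
When an option appears to have no effect, it is worth confirming that it reached the running kernel at all. A minimal sketch that reads the live value from `/sys/module/<module>/parameters/<name>`; the function name `module_param` is illustrative, and some i915 parameters are only readable by root.

```shell
# Print the live value of a kernel module parameter from sysfs.
# Usage: module_param <module> <parameter> [sys_module_root]
module_param() {
    path="${3:-/sys/module}/$1/parameters/$2"
    if [ -r "$path" ]; then
        echo "$1.$2=$(cat "$path")"
    else
        echo "$1.$2 not readable (module not loaded, or root needed)"
        return 1
    fi
}

# Compare against what the boot loader actually passed:
#   cat /proc/cmdline
#   module_param intel_idle max_cstate
#   module_param i915 enable_dc
```

If an option is missing from `/proc/cmdline`, the boot loader configuration was not applied; if it is there but the sysfs value differs, the driver did not accept it.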

Finally removing the `/lib/firmware/i915/kbl_dmc_ver1_04.bin` seems to have fixed it (up 24h without a crash).

For anyone else finding this on Ubuntu, after removing the file on disk I needed to regenerate an initramfs (which will warn that this file is missing), then reboot.

Code:

sudo update-initramfs -c -k $(uname -r)
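
To confirm after the reboot whether the firmware was still picked up, the kernel log can be checked. A minimal sketch: the two message strings matched here are the wording i915 is believed to use ("Finished loading DMC firmware" and "Failed to load DMC firmware") and may differ between kernel versions, so treat them as an assumption. The function name `dmc_status` is illustrative.

```shell
# Classify i915 DMC firmware messages found in kernel log text on stdin.
# Prints "loaded", "failed" or "no DMC message".
# Usage: dmesg | dmc_status
dmc_status() {
    awk '
        /Finished loading DMC firmware/ { s = "loaded" }
        /Failed to load DMC firmware/   { s = "failed" }
        END { print (s == "" ? "no DMC message" : s) }
    '
}

# The initramfs can be checked for a leftover copy as well:
#   lsinitramfs /boot/initrd.img-$(uname -r) | grep kbl_dmc
```

With the file removed and the initramfs regenerated, "failed" is the expected result; "loaded" would suggest a copy of the firmware is still being found somewhere.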

I also upgraded to the OEM kernel (6.5).

I'm going to go step by step putting things back now (first the BIOS CPU power management states). But I would never have got it working without the pointer here to kbl_dmc_ver1_04.bin.

EDIT: powertop shows that the GPU is spending 100% of its time in RC6. From a quick bit of reading, this seems to be a deep sleep state for the GPU, so I'm hoping power consumption will not be too high. Others listed in powertop are RC6p and RC6pp, but I read that these are not used in more recent versions. So what is disabled by the fix here is the DC3-DC6 power saving states, but I can't find much information about how those differ from the RC6 state.

EDIT2: After more testing, having only the kernel option i915.enable_dc=0, with the firmware in place, does seem to fix the random reboots for me. I am not sure this is actually a problem with the GPU driver itself. I think it might be a problem with the CPU in deep sleep states, and those states can only be reached when the GPU is in a deep sleep state. So preventing GPU sleep through the enable_dc option prevents the CPU from going into the states that make the system panic and reboot.

opcodeoeprator · Oct 30, 2023

I have 3 HP Pro Elitedesk G4s and have been following this issue.
Plugging in a monitor did prevent the random freezing.
It also seems there are two different issues being troubleshot here, which is important to point out.
That module seems to freeze and halt the system due to a bug, and as a side effect the internal system temperature slowly increases until the BIOS powers the device down due to a thermal event. Those who say this fix is working for them but claim to have no thermal issues need to check their BIOS to be sure, as this would indicate the motherboards are handling the issue differently (one instantly resetting, the other halting and increasing in temperature).
Out of the three, only one seemed to be affected. Renaming kbl_dmc_ver1_04.bin did the trick.
However, the issue is now back on another system, with the same 2-5 hour time frame between reboots.
I'm not sure if an update has embedded the module elsewhere, but I'm updating here for history.
I will now try the "i915.enable_dc=0" fix.

A note for those updating the boot parameters:
Make sure you edit the right place: the GRUB configuration if you boot with GRUB, or the kernel command line file if you boot without it.
Almost all guides still only talk about /etc/default/grub. But if you are on EFI/ZFS you need to edit /etc/kernel/cmdline.
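
For the `/etc/kernel/cmdline` case, the file is a single line of parameters, so the edit can be sketched as below. The function name `add_cmdline_param` is illustrative; on Proxmox the change is applied with `proxmox-boot-tool refresh`, and `proxmox-boot-tool status` reports which boot loader is in use.

```shell
# Append a parameter to a single-line kernel command line file
# (such as /etc/kernel/cmdline) unless it is already present.
# Usage: add_cmdline_param <param> [cmdline_file]
add_cmdline_param() {
    param="$1"
    cmdline_file="${2:-/etc/kernel/cmdline}"
    current=$(head -n 1 "$cmdline_file")

    # Pad with spaces so only whole parameters match
    case " $current " in
        *" $param "*)
            echo "already present: $param"
            ;;
        *)
            printf '%s %s\n' "$current" "$param" > "$cmdline_file"
            echo "added: $param"
            ;;
    esac
}

# As root, then apply and reboot:
#   add_cmdline_param i915.enable_dc=0
#   proxmox-boot-tool refresh
```

Either way, `cat /proc/cmdline` after the reboot shows whether the parameter was applied.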

Hopefully we can get a solution to this soon.

opcodeoeprator · Oct 30, 2023

hanzoh said:
I have never seen a red LED, so maybe you really have a temperature issue.
Try installing lm-sensors and check the temperatures with `sensors`

A lot of people here, including myself, are seeing the system halt with rising temperatures until a shutdown is forced by the BIOS, when the kernel module is present or no monitor is plugged in.
What is actually resetting yours when the issue happens? Do you see any temperature change? Or is it just halting until you power it down manually?

rophan · Oct 30, 2023

The issue I see is random reboots - the system is always up but when I check `last reboot` I see it is rebooting every few hours. There is nothing in any logs or dmesg about kernel panic or shutdown, just the start of the new boot sequence. It is exactly as if the power was disconnected (apart from that the machine is not set to boot up on reconnection of power). It is like pressing an old fashioned hard reset power button. I don't see any issue with temperature.
From what I have read this issue is probably from the system entering too deep a sleep state, so everything turns off. I think it might not be the GPU at all, but might be because the GPU going into deep sleep state is a necessary precursor to the CPU going to one of the deeper sleep states (like 9 or 10). So as long as the GPU doesn't sleep, the CPU won't enter the problem state. I am not sure about any of this though.

I am running now with the firmware in place but `enable_dc=0`.

The strange thing for me was that the `intel_idle.max_cstate` option didn't seem to have any effect (it should be able to limit the CPU state directly, so it could maybe allow keeping the graphics sleep states).
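
The `last reboot` check mentioned above can be turned into a quick count for comparing before and after a workaround. A minimal sketch; the function name `count_boots` is illustrative, and the journal commands in the comments only show earlier boots if persistent journaling is enabled.

```shell
# Count boot records in `last reboot` style output read from stdin.
# Usage: last reboot | count_boots
count_boots() {
    awk '$1 == "reboot" { n++ } END { print n + 0 }'
}

# Related checks for what happened right before an unexpected reboot:
#   journalctl --list-boots     # one line per recorded boot
#   journalctl -b -1 -n 50      # last 50 log lines of the previous boot
```

A previous-boot log that simply stops without any shutdown messages matches the "as if the power was disconnected" behaviour described here.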

Apicedda · Oct 30, 2023

Hello, I'm having the same issues, or so I think. I got a ProDesk 400 G4 Mini and installed Debian headless (multiple times by now), with no monitor plugged in. On the latest clean install it stayed up for 6 days without crashing, then today it crashed and rebooted about 10 times in 2 hours (I did plug in the monitor once today for a couple of minutes, then unplugged it, and the crashes started soon after). I tried swapping CPU, PSU and RAM.
Most of the time it reboots and I don't even notice until I check `last reboot` (and see all the sessions still running). Other times, when it crashes while I'm connected via SSH, it hangs there, doesn't respond to pings, plugging in the monitor gives no signal, and I have to shut it down manually (I don't expect this to reach any thermal limit, not with the 2 core Pentium currently installed). As you mentioned already, there are no logs at all about the issue.
I'm going to try the fixes proposed here, but are you sure this is not a hardware-related issue? My unit is under warranty, but I don't know if I should try to get it replaced (if they accept it, because they might test it on Windows and it works fine for them), or if I would have the same issues on another unit.
Another thing I noticed: it never goes deeper than C3. powertop shows about 98% C3, 2% C2; C6 and up are all 0%. I didn't see any misconfiguration in the BIOS. Is this expected, or do I have something wrong in my Debian install? (This unit would stay idle most of the time, so I would like it to enter deeper sleep states.)
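
To cross-check the powertop figures, the idle states the kernel knows about can be read from sysfs. A minimal sketch, assuming the standard cpuidle layout under `/sys/devices/system/cpu/cpu0/cpuidle` (`time` is the total residency in microseconds); the function name `list_cstates` is illustrative. These are the per-core states requested by the kernel, whereas powertop's package states come from hardware counters, so the two views can differ.

```shell
# List the idle states of one CPU with entry count and total residency.
# Usage: list_cstates [cpuidle_dir]
list_cstates() {
    dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for state in "$dir"/state*; do
        # Skip when the glob matched nothing
        [ -r "$state/name" ] || continue
        echo "$(cat "$state/name") usage=$(cat "$state/usage") time_us=$(cat "$state/time")"
    done
}

# On the affected machine:
#   list_cstates
```

If the deeper states are listed but their usage stays at 0, the kernel is not choosing them; if they are not listed at all, they are not being offered to it.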

rophan · Oct 30, 2023

Apicedda said:
I'm going to try the fixes proposed here, but are you sure this is not a hardware-related issue?

Obviously your issue may be different, but I'm pretty sure it's software. There is quite a lot written online.
An easy way to test is to run something on the machine that is not a recent Linux kernel. For example, run memtest86, install Windows and see if it is stable, or try to find an old Linux live CD. I have just read that the problems may have started with the 5.x kernels around 5.3, so if you can find a live CD based on 4.15 you could see if that is stable.

https://forums.linuxmint.com/viewtopic.php?t=326017
https://linuxreviews.org/Linux_Kern...he_Frequent_Intel_GPU_Hangs_In_Recent_Kernels

Try the `i915.enable_dc=0` option first?

dustojnikhummer · Nov 4, 2023

Did you fix this? I also have an issue with PVE crashing on a HP 400 G4 Mini. When I loaded Windows Server, it hasn't crashed in a week. Is there some sort of firmware issue with Coffee Lake T CPUs?

rophan · Nov 4, 2023

dustojnikhummer said:
Did you fix this? I also have an issue with PVE crashing on a HP 400 G4 Mini. When I loaded Windows Server, it hasn't crashed in a week. Is there some sort of firmware issue with Coffee Lake T CPUs?

Have you read the thread? The suggestions here fixed my issue.
