Hello, yes, each upgrade of pve-firmware will reinstate that file.
In the bug report upstream (https://gitlab.freedesktop.org/drm/intel/-/issues/9244), I was given the workaround of adding i915.enable_dc=0 to the kernel command line (in /etc/default/grub). This also works, as it disables the GPU's display power-saving (DC) states.
In the meantime, I did so much testing with my three identical machines, that I am certain this is a mainboard issue on the first node.
I ordered a 4th used one and that also does NOT have the issue. Now I have my working 3 node cluster and will use the faulty one as a Windows machine. I don't know what the Windows GPU driver does differently that it does not trigger the issue.
I noticed while trying to troubleshoot the issue that sometimes the node would be in a frozen/offline state with a red LED on the front. If I connected a monitor and rebooted it, it would show that it had shut down due to thermal overheat. Most of the time, the system hangs for a while (around 5 minutes unresponsive), then eventually crashes and reboots; sometimes it simply crashes and reboots right away.
Just for documentation, I have the same problem on an HP ProDesk 400 G4 Mini and so far it looks like connecting a monitor works.
The only difference is that my system dies after "some" reboots and gets completely stuck while it gets hotter and hotter.
Now I will try the workaround with renaming the file.
Cheers!
This is a better workaround, I feel. I'm going to implement that tonight and see how it goes. Now that I'm confident in a fix (workaround) for these devices, I'll probably add a 3rd node to my cluster as well.
If the Intel Linux team comes back with further testing, I will gladly continue; right now I have only tested one kernel patch, which unfortunately did not help.
While using the "broken" machine, I watched every pveupgrade/apt upgrade that I ran via SSH for upgrades to pve-firmware and then renamed kbl_dmc_ver1_04.bin again. But now that I have learned about the kernel parameter, I would probably go with that one.
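The renaming after each pve-firmware upgrade could probably be scripted along these lines. This is only a sketch: the directory /lib/firmware/i915 and the ".disabled" suffix are assumptions (the thread does not say where the file lives or what it was renamed to), and disable_dmc is a made-up helper name.

```shell
#!/bin/sh
# Sketch: move the DMC firmware file aside if an upgrade has put it back.
# ASSUMPTIONS: the default directory and the ".disabled" suffix are
# illustrative and not taken from the thread.

disable_dmc() {
    dir="${1:-/lib/firmware/i915}"   # assumed firmware location
    file="$dir/kbl_dmc_ver1_04.bin"
    if [ -e "$file" ]; then
        # File is back (e.g. after a pve-firmware upgrade): rename it.
        mv -f "$file" "$file.disabled" && echo "renamed"
    else
        echo "already absent"
    fi
}

# Usage (as root, after an upgrade): disable_dmc
```

Passing the directory as an argument makes it possible to try the function on a scratch directory first.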
I have never seen a red LED, so maybe you really have a temperature issue.
Try installing lm-sensors and check the temperatures with `sensors`.
Yeah, when it happened I did install lm-sensors and ran it with `watch -n 2 sensors` to see. I never saw high temps under normal circumstances, but sometimes the fans would ramp up during a crash (presumably because of the bug we're facing). However, I did decide to go ahead and reapply new thermal compound and clean out the heatsink and fans on both units. I don't really think that made much of a difference, but I haven't had the issue since (I've been running with the workaround for weeks now, so there's that too).
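To keep an eye on just the hottest reading instead of the full `sensors` output, something like the following might help. It is a rough sketch: max_temp is a hypothetical helper (not part of lm-sensors), and it assumes the usual output layout where the current reading is the first "+NN.N" value on a line containing "°C".

```shell
#!/bin/sh
# Sketch: print the highest current temperature found in `sensors` output.
# Reads the text on stdin, so it also works on saved output.
# ASSUMPTION: the current reading is the first +NN.N value on a "°C" line.

max_temp() {
    awk '
        index($0, "°C") && match($0, /\+[0-9]+\.[0-9]+/) {
            # First +NN.N on the line is the current reading;
            # later ones are the high/crit thresholds.
            t = substr($0, RSTART + 1, RLENGTH - 1) + 0
            if (!seen || t > max) { max = t; seen = 1 }
        }
        END { if (seen) printf "%.1f\n", max }
    '
}

# Usage: sensors | max_temp
# Or repeatedly: while true; do sensors | max_temp; sleep 2; done
```

Logging this to a file before a hang would show whether the temperature really climbs ahead of the crash.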
Still running OK without crashing or overheating?
Hello, that sounds like my problem.
But the device stays active and the fan blows; nevertheless, it was getting warmer.
The case then had quite a high temperature.
Yesterday, I set the kernel parameter and restarted the device without a monitor, and it was still running this morning.
In case someone else hasn't done something like this yet:
Code:
nano /etc/default/grub
Search for the line with
Code:
GRUB_CMDLINE_LINUX_DEFAULT
and add it here in the inverted commas, e.g.:
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet i915.enable_dc=0"
Code:
update-grub
After restart, check the config with
Code:
cat /proc/cmdline
Source: https://askubuntu.com/questions/19486/how-do-i-add-a-kernel-boot-parameter
Cheers!
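For anyone who prefers to script the steps above rather than edit by hand, here is a rough sketch. The function names has_param and add_param are made up for illustration, and add_param only prints the modified file contents; it does not touch /etc/default/grub itself.

```shell
#!/bin/sh
# Sketch: helpers around the i915.enable_dc=0 kernel parameter.
# Function names are hypothetical, not from any existing tool.

# Succeeds if the parameter appears as a whole word in the command line
# string (default: the running kernel's /proc/cmdline).
has_param() {
    param="$1"
    cmdline="${2:-$(cat /proc/cmdline)}"
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Reads grub defaults on stdin and appends the parameter inside the
# quotes of GRUB_CMDLINE_LINUX_DEFAULT unless it is already there.
add_param() {
    param="$1"
    sed "/^GRUB_CMDLINE_LINUX_DEFAULT=/{/[\" ]$param[\" ]/!s/\"\$/ $param\"/;}"
}

# Usage:
#   add_param i915.enable_dc=0 < /etc/default/grub > /tmp/grub.new
#   diff /etc/default/grub /tmp/grub.new   # review before copying over
#   has_param i915.enable_dc=0 && echo "active"   # after update-grub + reboot
```

Writing to a temporary file first leaves room to review the change before running update-grub.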
It's still going strong on my device! Really nice!
I don't really know what it's meant for. Perhaps some kind of power saving feature that kicks in because a monitor is detached? It's really strange because 2 out of 3 identical devices had this issue. hanzoh seems to know more about it than I do; he's opened a bug with the developer team.
GRUB_CMDLINE_LINUX_DEFAULT="-- i915.enable_psr2_sel_fetch=0 i915.enable_psr=0 intel_idle.max_cstate=1 i915.enable_dc=0 ahci.mobile_lpm_policy=1 processor.max_cstate=1 i915.disable_power_well=1"
sudo update-initramfs -c -k $(uname -r)
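To see whether i915 parameters like the ones in the line above actually took effect after a reboot, the values the loaded module reports can be read back. A small sketch, assuming the parameters are exposed under /sys/module/i915/parameters and are readable (typically only as root); i915_param is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: show the value the loaded i915 module reports for a parameter.
# ASSUMPTION: parameters live under /sys/module/i915/parameters and are
# readable by the current user (usually requires root).

i915_param() {
    name="$1"
    base="${2:-/sys/module/i915/parameters}"
    if [ -r "$base/$name" ]; then
        printf '%s=%s\n' "$name" "$(cat "$base/$name")"
    else
        printf '%s=unavailable\n' "$name"
    fi
}

# Usage:
#   for p in enable_dc enable_psr disable_power_well; do i915_param "$p"; done
```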
A lot of people here, including myself, are seeing the system halt with rising temperatures until a shutdown is forced by the BIOS, when the kernel module is present and/or no monitor is plugged in.
Obviously your issue may be different, but I'm pretty sure it's software. There is quite a lot written online.
I'm going to try the fixes proposed here, but are you sure this is not a hardware-related issue?
Have you read the thread? The suggestions here fixed my issue.
Did you fix this? I also have an issue with PVE crashing on an HP 400 G4 Mini. Since I loaded Windows Server, it hasn't crashed in a week. Is there some sort of firmware issue with Coffee Lake T CPUs?