Hello Proxmox, love the work you guys have going, it's awesome.
Proxmox upgrade & new install on two NUCs: a NUC7i7DNK and a NUC8BEH.
- - The NUC7i7DNK took the upgrade like a champ, no issues whatsoever!
- - Has Integrated GPU - 00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 620 (rev 07)
- - The NUC8BEH did not take the new kernel as well.
- - Has Integrated GPU - 00:02.0 VGA compatible controller: Intel Corporation CoffeeLake-U GT3e [Iris Plus Graphics 655] (rev 01)
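(For reference, the controller lines above come from lspci; something along these lines should reproduce them and also show which kernel driver is bound to the iGPU:)
Code:
# show the integrated GPU with vendor/device IDs and the driver in use
lspci -nnk -s 00:02.0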
Backstory
Both NUCs managed to upgrade, but for some odd reason the NUC8BEH stopped responding after a minute or two (or five). So an investigation ensued: as it was running headless, I took it to my bench and plugged in a monitor. It booted up, showed no issues, and worked as intended, so I thought it was a fluke and took it right back.
Plugged it back in headless... it stayed up for a few minutes and then acted up again. No network, no game,
aka not reachable.
Back to the bench it went, and again no issue... until I unplugged the monitor. A minute or two later, dead fish.
Investigation, part 2
It seems the new kernel has some issues on the
NUC8BEH when running headless, apparently a kernel graphics driver issue, even though no display is plugged in.
So I started to scour the forum here and work through a lot of the threads matching the errors I could see in dmesg, a particular one being
about power delivery: devices failing to change power state from
D3hot or
D3cold to
D0 because of
(config space inaccessible) when unplugging the monitor.
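One thing still on my to-try list (just a sketch, nothing I have verified actually helps here) is to keep the iGPU and the affected PCIe ports out of runtime suspend via sysfs, so they never attempt the D3cold-to-D0 transition in the first place. The device addresses are the ones from my dmesg, so adjust them to your own box; this also does not survive a reboot:
Code:
# disable runtime power management, keeping the devices in D0
echo on > /sys/bus/pci/devices/0000:00:02.0/power/control   # the iGPU
echo on > /sys/bus/pci/devices/0000:00:1c.4/power/control   # PCIe root port from the log
echo on > /sys/bus/pci/devices/0000:02:00.0/power/control   # downstream port from the log
# verify the runtime PM state afterwards
cat /sys/bus/pci/devices/0000:00:02.0/power/runtime_status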
So I started to look into GRUB settings and tried the kernel command line options below, in the order I found them:
nosgx initcall_blacklist=sysfb_init video=efifb:off video=simplefb:off video=vesafb:off iommu=pt pcie_aspm=off
But as of now I have not found a solution, other than ordering a small dummy HDMI plug
(aka the backup plan).
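For anyone wanting to try the same options, this is roughly how I applied them (a sketch assuming a GRUB-booted install; on a ZFS/systemd-boot install the line goes into /etc/kernel/cmdline and you run proxmox-boot-tool refresh instead):
Code:
# add the options to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet nosgx initcall_blacklist=sysfb_init video=efifb:off video=simplefb:off video=vesafb:off iommu=pt pcie_aspm=off"
# then regenerate the boot config and reboot:
update-grub
reboot
After each reboot, `cat /proc/cmdline` confirms which options actually made it onto the running kernel.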
Anyway, whilst doing the reboots I reverted to
5.15.108-1-pve #1 SMP PVE 5.15.108-1 (2023-06-17T09:41Z) x86_64 GNU/Linux
and the issue is gone again.
Going back up to
6.2.16-3-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-3 (2023-06-17T05:58Z) x86_64 GNU/Linux
and the issue is back.
On
5.15.108-1-pve
I do still see the D3cold to D0 error when unplugging the monitor, but the box never freezes or loses network.
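For completeness, switching between the two kernels for these tests is easy enough (a sketch assuming a recent proxmox-boot-tool that supports kernel pin/unpin; picking the kernel manually in the boot menu works just as well):
Code:
# list installed kernels
proxmox-boot-tool kernel list
# pin the known-good 5.15 kernel as the default boot entry
proxmox-boot-tool kernel pin 5.15.108-1-pve
# later, drop the pin to boot the newest kernel again
proxmox-boot-tool kernel unpin
reboot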
This is the dmesg dump on
5.15.108-1-pve
(the full dmesg is attached).
Code:
[ 515.297967] pcieport 0000:00:1c.4: pciehp: Slot(8): Link Down
[ 515.297968] pcieport 0000:02:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 515.297970] pcieport 0000:00:1c.4: pciehp: Slot(8): Card not present
[ 515.298044] pcieport 0000:03:02.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 515.298061] xhci_hcd 0000:3a:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 515.298064] xhci_hcd 0000:3a:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 515.298084] xhci_hcd 0000:3a:00.0: Controller not ready at resume -19
[ 515.298086] xhci_hcd 0000:3a:00.0: PCI post-resume error -19!
[ 515.298087] xhci_hcd 0000:3a:00.0: HC died; cleaning up
[ 515.298097] xhci_hcd 0000:3a:00.0: remove, state 4
[ 515.298099] usb usb4: USB disconnect, device number 1
[ 515.298316] xhci_hcd 0000:3a:00.0: USB bus 4 deregistered
[ 515.298322] xhci_hcd 0000:3a:00.0: remove, state 4
[ 515.298324] usb usb3: USB disconnect, device number 1
[ 515.298506] xhci_hcd 0000:3a:00.0: Host halt failed, -19
[ 515.298511] xhci_hcd 0000:3a:00.0: Host not accessible, reset failed.
[ 515.298590] xhci_hcd 0000:3a:00.0: USB bus 3 deregistered
[ 515.298743] pcieport 0000:03:01.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 515.298898] pcieport 0000:03:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 515.298988] pci_bus 0000:04: busn_res: [bus 04] is released
[ 515.299056] pci 0000:03:00.0: Removing from iommu group 14
[ 515.299070] pci_bus 0000:05: busn_res: [bus 05-39] is released
[ 515.299113] pci 0000:03:01.0: Removing from iommu group 15
[ 515.464563] pci 0000:3a:00.0: Removing from iommu group 16
[ 515.464577] pci_bus 0000:3a: busn_res: [bus 3a] is released
[ 515.464666] pci 0000:03:02.0: Removing from iommu group 16
[ 515.464805] pci_bus 0000:03: busn_res: [bus 03-3a] is released
[ 515.464919] pci 0000:02:00.0: Removing from iommu group 13
To see the dmesg errors on
6.2.16-3-pve I had to resort to some shenanigans:
dmesg -w > somefile &
since the box dropped off the network before I could read the full dmesg, and I had to unplug the monitor to trigger the issue on
6.2.16-3-pve.
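(In hindsight, a possibly more robust way to keep the log when the box hangs would have been to stream it to a second machine over SSH, so nothing is lost if the NUC never gets to flush its disk; the hostname below is just a placeholder:)
Code:
# run from another machine before unplugging the monitor on the NUC
ssh root@nuc8beh 'dmesg -w' > nuc8beh-dmesg.log
Anyway, this is what I caught on 6.2.16-3-pve: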
Code:
[ 174.736314] i915 0000:00:02.0: [drm] *ERROR* AUX B/DDI B/PHY B: not done (status 0x00000000)
[ 174.736323] i915 0000:00:02.0: [drm] *ERROR* AUX B/DDI B/PHY B: not done (status 0x00000000)
[ 174.736333] i915 0000:00:02.0: [drm] *ERROR* AUX B/DDI B/PHY B: not done (status 0x00000000)
[ 174.736343] i915 0000:00:02.0: [drm] *ERROR* AUX B/DDI B/PHY B: not done (status 0x00000000)
[ 174.736346] i915 0000:00:02.0: [drm] *ERROR* Error reading LSPCON mode
[ 174.736347] i915 0000:00:02.0: [drm] *ERROR* LSPCON resume failed
[ 174.736356] i915 0000:00:02.0: [drm] *ERROR* AUX B/DDI B/PHY B: not done (status 0x00000000)
[ 174.737340] i915 0000:00:02.0: [drm] *ERROR* AUX B/DDI B/PHY B: not done (status 0x00000000)
... 10000 rows of same message ...
[ 174.738278] i915 0000:00:02.0: [drm] *ERROR* AUX B/DDI B/PHY B: not done (status 0x00000000)
...
...
...
[ 174.766220] ------------[ cut here ]------------
[ 174.766220] RPM raw-wakeref not held
[ 174.766245] WARNING: CPU: 1 PID: 24 at drivers/gpu/drm/i915/intel_runtime_pm.h:127 release_async_put_domains+0x115/0x120 [i915]
[ 174.766386] Modules linked in: cmac nls_utf8 cifs cifs_arc4 rdma_cm iw_cm ib_cm ib_core cifs_md4 fscache netfs ebtable_filter ebtables ip6table_raw ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables iptable_raw ipt_REJECT nf_reject_ipv4 xt_set xt_physdev xt_addrtype xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment xt_mark iptable_filter bpfilter ip_set_hash_net ip_set sctp ip6_udp_tunnel udp_tunnel nf_tables bonding tls softdog sunrpc nfnetlink_log binfmt_misc nfnetlink snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_bus snd_soc_core snd_compress intel_rapl_msr ac97_bus intel_rapl_common intel_tcc_cooling x86_pkg_temp_thermal snd_pcm_dmaengine
[ 174.766433] intel_powerclamp coretemp i915 iwlmvm snd_hda_intel btusb drm_buddy snd_intel_dspcfg kvm_intel btrtl btbcm ttm mac80211 btintel drm_display_helper snd_intel_sdw_acpi libarc4 btmtk mei_pxp mei_hdcp cec snd_hda_codec rc_core kvm irqbypass crct10dif_pclmul iwlwifi polyval_clmulni polyval_generic ghash_clmulni_intel snd_hda_core sha512_ssse3 snd_hwdep aesni_intel crypto_simd snd_pcm cryptd drm_kms_helper rapl i2c_algo_bit bluetooth syscopyarea snd_timer intel_wmi_thunderbolt sysfillrect intel_cstate pcspkr snd mei_me ee1004 soundcore cfg80211 ecdh_generic mei sysimgblt ecc wmi_bmof intel_pch_thermal joydev input_leds acpi_pad acpi_tad mac_hid zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq simplefb hid_logitech_hidpp hid_logitech_dj dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c hid_generic usbmouse usbkbd
[ 174.766498] usbhid hid rtsx_pci_sdmmc nvme i2c_i801 crc32_pclmul xhci_pci e1000e xhci_pci_renesas i2c_smbus nvme_core rtsx_pci nvme_common ahci libahci xhci_hcd video wmi pinctrl_cannonlake
[ 174.766511] CPU: 1 PID: 24 Comm: kworker/1:0 Tainted: P W O 6.2.16-4-pve #1
[ 174.766513] Hardware name: Intel(R) Client Systems NUC8i3BEH/NUC8BEB, BIOS BECFL357.86A.0094.2023.0612.1527 06/12/2023
[ 174.766515] Workqueue: events output_poll_execute [drm_kms_helper]
[ 174.766535] RIP: 0010:release_async_put_domains+0x115/0x120 [i915]
[ 174.766648] Code: 1d a3 f0 1d 00 80 fb 01 0f 87 2c 58 0e 00 83 e3 01 0f 85 50 ff ff ff 48 c7 c7 f4 a7 fc c1 c6 05 83 f0 1d 00 01 e8 db 17 88 c0 <0f> 0b e9 36 ff ff ff 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90
[ 174.766650] RSP: 0018:ffffaf9c00143c98 EFLAGS: 00010246
[ 174.766652] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 174.766653] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 174.766654] RBP: ffffaf9c00143cd0 R08: 0000000000000000 R09: 0000000000000000
[ 174.766655] R10: 0000000000000000 R11: 0000000000000000 R12: ffffaf9c00143ce0
[ 174.766657] R13: ffff988b899f8978 R14: ffff988b899f8000 R15: 0000000000000002
[ 174.766658] FS: 0000000000000000(0000) GS:ffff9892e0e80000(0000) knlGS:0000000000000000
[ 174.766659] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 174.766661] CR2: 0000561ff52a2000 CR3: 00000006a3e10005 CR4: 00000000003706e0
[ 174.766662] Call Trace:
[ 174.766663] <TASK>
[ 174.766667] intel_display_power_flush_work+0xc1/0xf0 [i915]
[ 174.766777] intel_dp_detect+0x3a8/0x730 [i915]
[ 174.766888] ? ww_mutex_lock+0x19/0xa0
[ 174.766893] drm_helper_probe_detect_ctx+0x57/0x120 [drm_kms_helper]
[ 174.766911] output_poll_execute+0x192/0x250 [drm_kms_helper]
[ 174.766926] process_one_work+0x222/0x430
[ 174.766930] worker_thread+0x50/0x3e0
[ 174.766932] ? __pfx_worker_thread+0x10/0x10
[ 174.766934] kthread+0xe6/0x110
[ 174.766937] ? __pfx_kthread+0x10/0x10
[ 174.766940] ret_from_fork+0x29/0x50
[ 174.766944] </TASK>
[ 174.766945] ---[ end trace 0000000000000000 ]---
[ 174.766947] ------------[ cut here ]------------
...
...
...
[ 197.802354] i915 0000:00:02.0: Use count on power well PW_2 is already zero
[ 197.802378] WARNING: CPU: 1 PID: 171 at drivers/gpu/drm/i915/display/intel_display_power_well.c:127 intel_power_well_put+0xa1/0xb0 [i915]
[ 197.802520] Modules linked in: cmac nls_utf8 cifs cifs_arc4 rdma_cm iw_cm ib_cm ib_core cifs_md4 fscache netfs ebtable_filter ebtables ip6table_raw ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables iptable_raw ipt_REJECT nf_reject_ipv4 xt_set xt_physdev xt_addrtype xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment xt_mark iptable_filter bpfilter ip_set_hash_net ip_set sctp ip6_udp_tunnel udp_tunnel nf_tables bonding tls softdog sunrpc nfnetlink_log binfmt_misc nfnetlink snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_bus snd_soc_core snd_compress intel_rapl_msr ac97_bus intel_rapl_common intel_tcc_cooling x86_pkg_temp_thermal snd_pcm_dmaengine
[ 197.802567] intel_powerclamp coretemp i915 iwlmvm snd_hda_intel btusb drm_buddy snd_intel_dspcfg kvm_intel btrtl btbcm ttm mac80211 btintel drm_display_helper snd_intel_sdw_acpi libarc4 btmtk mei_pxp mei_hdcp cec snd_hda_codec rc_core kvm irqbypass crct10dif_pclmul iwlwifi polyval_clmulni polyval_generic ghash_clmulni_intel snd_hda_core sha512_ssse3 snd_hwdep aesni_intel crypto_simd snd_pcm cryptd drm_kms_helper rapl i2c_algo_bit bluetooth syscopyarea snd_timer intel_wmi_thunderbolt sysfillrect intel_cstate pcspkr snd mei_me ee1004 soundcore cfg80211 ecdh_generic mei sysimgblt ecc wmi_bmof intel_pch_thermal joydev input_leds acpi_pad acpi_tad mac_hid zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq simplefb hid_logitech_hidpp hid_logitech_dj dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c hid_generic usbmouse usbkbd
[ 197.802631] usbhid hid rtsx_pci_sdmmc nvme i2c_i801 crc32_pclmul xhci_pci e1000e xhci_pci_renesas i2c_smbus nvme_core rtsx_pci nvme_common ahci libahci xhci_hcd video wmi pinctrl_cannonlake
[ 197.802645] CPU: 1 PID: 171 Comm: kworker/1:4 Tainted: P W O 6.2.16-4-pve #1
[ 197.802647] Hardware name: Intel(R) Client Systems NUC8i3BEH/NUC8BEB, BIOS BECFL357.86A.0094.2023.0612.1527 06/12/2023
[ 197.802649] Workqueue: events output_poll_execute [drm_kms_helper]
[ 197.802668] RIP: 0010:intel_power_well_put+0xa1/0xb0 [i915]
[ 197.802782] Code: 40 48 8d 04 c1 4c 8b 30 4d 85 ed 75 03 4c 8b 2f e8 54 0f 24 c1 4c 89 f1 4c 89 ea 48 c7 c7 f8 47 fa c1 48 89 c6 e8 df b0 87 c0 <0f> 0b 8b 43 18 e9 72 ff ff ff 0f 1f 44 00 00 90 90 90 90 90 90 90
[ 197.802784] RSP: 0018:ffffaf9c006efc10 EFLAGS: 00010246
[ 197.802786] RAX: 0000000000000000 RBX: ffff988b8139e080 RCX: 0000000000000000
[ 197.802788] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 197.802789] RBP: ffffaf9c006efc30 R08: 0000000000000000 R09: 0000000000000000
[ 197.802790] R10: 0000000000000000 R11: 0000000000000000 R12: ffff988b899f8000
[ 197.802791] R13: ffff988b85dfb770 R14: ffffffffc1fcacf5 R15: 0000000000000002
[ 197.802793] FS: 0000000000000000(0000) GS:ffff9892e0e80000(0000) knlGS:0000000000000000
[ 197.802794] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 197.802796] CR2: 00007f413cd0f298 CR3: 00000006a3e10004 CR4: 00000000003706e0
[ 197.802797] Call Trace:
[ 197.802799] <TASK>
[ 197.802802] __intel_display_power_put_domain+0xed/0x1e0 [i915]
[ 197.802912] ? __intel_runtime_pm_get+0x32/0xa0 [i915]
[ 197.802988] release_async_put_domains+0x88/0x120 [i915]
[ 197.803097] intel_display_power_flush_work+0xc1/0xf0 [i915]
[ 197.803206] intel_dp_detect+0x3a8/0x730 [i915]
[ 197.803318] ? ww_mutex_lock+0x19/0xa0
[ 197.803324] drm_helper_probe_detect_ctx+0x57/0x120 [drm_kms_helper]
[ 197.803342] output_poll_execute+0x192/0x250 [drm_kms_helper]
[ 197.803358] process_one_work+0x222/0x430
[ 197.803362] worker_thread+0x50/0x3e0
[ 197.803365] ? __pfx_worker_thread+0x10/0x10
[ 197.803367] kthread+0xe6/0x110
[ 197.803370] ? __pfx_kthread+0x10/0x10
[ 197.803373] ret_from_fork+0x29/0x50
[ 197.803377] </TASK>
[ 197.803378] ---[ end trace 0000000000000000 ]---
[ 197.803380] ------------[ cut here ]------------
So I'm writing up this lengthy post to see if any of you know what I could try to resolve the issue with the kernel drivers, or a secret `GRUB` flag that fixes this not so pleasant freeze/hang glitch, or whether I should just go with the backup plan, as this might be one of those odd ones out.
Or simply to bring the issue to light in case anyone else runs into the same occurrence.
I did a full reinstall as well, and the issue was present on an empty Proxmox too on the new kernel.
Best Regards!
LK
Last but not least, you guys are making an awesome product! Keep it up and keep on kicking!
PS. For now I have reinstalled Proxmox 7 on it and it's rock solid again, but I do have version 8 on another SSD in case I want to venture into some kernel debugging again.