Intel iGPU and HDMI video out stops working every few days

MacGyver · Jan 13, 2024

This is likely a Linux or even a hardware/BIOS-related issue. But I'm posting here anyways in case anyone can shed some light on what might be happening or how to fix it. Thanks!

I have an odd hardware problem where every few days my system's onboard HDMI will go to sleep and not be able to be woken back up. But more importantly, the Intel GPU video acceleration device (/dev/dri/*) will also disappear at the same time. Commands like this below will start reporting errors afterwards. This means I lose video acceleration in my containers like Plex for transcoding. But otherwise everything else is fine. This problem has occurred from Proxmox 7.x thru to the latest 8.1.3.

Hardware: Lenovo ThinkCentre M920q running an i7-8700T (UHD Graphics 630) and Q370 chipset.

# intel_gpu_top
No device filter specified and no discrete/integrated i915 devices found
# intel_gpu_frequency
Test requirement not met in function drm_open_driver, file ../lib/drmtest.c:572:
Test requirement: !(fd<0)
No known gpu found for chipset flags 0x1 (intel)
Last errno: 2, No such file or directory
SKIP (-1.000s)

This usually occurs after I've unplugged the HDMI cable or switched my my video switcher away from it. As far as I can tell, it will get in this state after about 5 to 6 days of being "video-disconnected". I have vPro enabled and its hardware "remote desktop" functionality will also remain blank too after going into this state. Also, the video output will remain off through a restart of the Proxmox host. It only will "come back" at the Lenovo logo POST screen. I've also gone thru several Lenovo BIOS upgrades and I haven't not spotted any new/changed BIOS settings that look like could be related to this. Lenovo's BIOS is very lacking in advanced options.

Here is my GRUB/kernel command line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt consoleblank=0"

Anyone know how to prevent this, or at least wake back up the HDMI output or reset the kernel's video driver?

If there's any command output I can follow-up with, kindly let me know how to post it (i.e. code vs. inline code, etc).
Thanks!
-Mac

MacGyver · Jan 13, 2024

UPDATE/PROGRESS:

I found this interesting change between when my system is first booted (working) to when HDMI output and iGPU acceleration stops working. My VGA controller device "Kernel driver in use" will change from i915 to vfio-pci!

When HDMI output & iGPU acceleration is working:
# lspci -nnk
<snip>
00:02.0 VGA compatible controller [0300]: Intel Corporation CoffeeLake-S GT2 [UHD Graphics 630] [8086:3e92]
DeviceName: Onboard - Video
Subsystem: Lenovo CoffeeLake-S GT2 [UHD Graphics 630] [17aa:3136]
Kernel driver in use: i915
Kernel modules: i915

When it's not working:
# lspci -nnk
<snip>
00:02.0 VGA compatible controller [0300]: Intel Corporation CoffeeLake-S GT2 [UHD Graphics 630] [8086:3e92]
DeviceName: Onboard - Video
Subsystem: Lenovo CoffeeLake-S GT2 [UHD Graphics 630] [17aa:3136]
Kernel driver in use: vfio-pci
Kernel modules: i915

At one time I was experimenting with hardware passthru of various devices like network cards. I'd eventually like to get back to working on that... such as toying around with OPNsense, tomato-x64, etc. My current /etc/modules file has these lines currently:
vfio
vfio_iommu_type1
vfio_pci

Could this a vfio driver bug? Should I bring this up at /r/VFIO or Level1Techs forums? Thanks for any ideas you might have!

-Mac

leesteken · Jan 13, 2024

MacGyver said:
I found this interesting change between when my system is first booted (working) to when HDMI output and iGPU acceleration stops working. My VGA controller device "Kernel driver in use" will change from i915 to vfio-pci!

This usually happens when you start a VM that has that device (or another device from the same IOMMU group) for passthrough.
Check your IOMMU groups and which devices share with the iGPU. Also check which VMs were started around the time you lost the iGPU output.

MacGyver · Jan 13, 2024

leesteken said:
This usually happens when you start a VM that has that device (or another device from the same IOMMU group) for passthrough.
Check your IOMMU groups and which devices share with the iGPU. Also check which VMs were started around the time you lost the iGPU output.

Thanks for the info. The only VM I have is for Home Assistant and I'm not using any host hardware passthrus on it. I have lots of LXC containers but the only one that has been "on" during my history of this system losing HDMI & video acceleration and that has hardware passthru configured is one for Plex. It was sourced from TTECK and has these LXC hardware related entries:

lxc.cgroup2.devices.allow: c 226:0 rwm
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.cgroup2.devices.allow: c 29:0 rwm
lxc.mount.entry: /dev/fb0 dev/fb0 none bind,optional,create=file
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
lxc.mount.entry: /dev/dri/renderD128 dev/renderD128 none bind,optional,create=file

BTW, here's my IOMMU groups right now:
├── 0
│ └── 00:02.0 VGA compatible controller [0300]: Intel Corporation CoffeeLake-S GT2 [UHD Graphics 630] [8086:3e92]
├── 1
│ └── 00:00.0 Host bridge [0600]: Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers [8086:3ec2] (rev 07)
├── 2
│ └── 00:01.0 PCI bridge [0604]: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) [8086:1901] (rev 07)
├── 3
│ └── 00:08.0 System peripheral [0880]: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model [8086:1911]
├── 4
│ ├── 00:14.0 USB controller [0c03]: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller [8086:a36d] (rev 10)
│ └── 00:14.2 RAM memory [0500]: Intel Corporation Cannon Lake PCH Shared SRAM [8086:a36f] (rev 10)
├── 5
│ ├── 00:16.0 Communication controller [0780]: Intel Corporation Cannon Lake PCH HECI Controller [8086:a360] (rev 10)
│ └── 00:16.3 Serial controller [0700]: Intel Corporation Cannon Lake PCH Active Management Technology - SOL [8086:a363] (rev 10)
├── 6
│ └── 00:17.0 SATA controller [0106]: Intel Corporation Cannon Lake PCH SATA AHCI Controller [8086:a352] (rev 10)
├── 7
│ └── 00:1b.0 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #21 [8086:a32c] (rev f0)
├── 8
│ └── 00:1c.0 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #6 [8086:a33d] (rev f0)
├── 9
│ └── 00:1d.0 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 [8086:a330] (rev f0)
├── 10
│ ├── 00:1f.0 ISA bridge [0601]: Intel Corporation Q370 Chipset LPC/eSPI Controller [8086:a306] (rev 10)
│ ├── 00:1f.3 Audio device [0403]: Intel Corporation Cannon Lake PCH cAVS [8086:a348] (rev 10)
│ ├── 00:1f.4 SMBus [0c05]: Intel Corporation Cannon Lake PCH SMBus Controller [8086:a323] (rev 10)
│ ├── 00:1f.5 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH SPI Controller [8086:a324] (rev 10)
│ └── 00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (7) I219-LM [8086:15bb] (rev 10)
├── 11
│ └── 01:00.0 Ethernet controller [0200]: Intel Corporation 82576 Gigabit Network Connection [8086:10c9] (rev 01)
├── 12
│ └── 01:00.1 Ethernet controller [0200]: Intel Corporation 82576 Gigabit Network Connection [8086:10c9] (rev 01)
├── 13
│ └── 02:00.0 Non-Volatile memory controller [0108]: KIOXIA Corporation NVMe SSD Controller XG7 [1e0f:000d]
└── 14
└── 03:00.0 Network controller [0280]: Qualcomm Atheros QCA9377 802.11ac Wireless Network Adapter [168c:0042] (rev 31)

Does the iGPU appear properly isolated on group 0? Notice any other issues here?

I'd also like to figure out how to further isolate those devices in group 10, but I haven't had any luck with that. I may need other kernel parameters, but it may be impossible as they're in the same chipset? I'm a bit outta my league on this topic.

cheers,
Mac

MacGyver · Jan 29, 2024

So it's been about two weeks that I've disabled my Intel iGPU passthru to my Plex LXC by commenting out those "lxc.*" entries I listed in my last post. That was the only thing using my iGPU before.

And so far the iGPU is still working... and I'm no longer getting the error below... i.e. "intel_gpu_top" is still working!
# intel_gpu_top
No device filter specified and no discrete/integrated i915 devices found

So would anyone know why/how the iGPU video acceleration device (/dev/dri/*) would disappear from the host? Or what other logs to check, dmesg, etc., or how to enable more verbose kernel or other Proxmox logging?

Any advice would be appreciated! I hope to get to the bottom of this.

leesteken · Jan 29, 2024

MacGyver said:
So would anyone know why/how the iGPU video acceleration device (/dev/dri/*) would disappear from the host? Or what other logs to check, dmesg, etc., or how to enable more verbose kernel or other Proxmox logging?

Did you check journalctl (scroll with the arrow keys) around the time that you lost the device?

MacGyver said:
I found this interesting change between when my system is first booted (working) to when HDMI output and iGPU acceleration stops working. My VGA controller device "Kernel driver in use" will change from i915 to vfio-pci!

Proxmox really only does this when you start a VM with passthrough of the device (unless you manually changed the driver). Maybe you were experimenting with PCIe passthrough? Can you show the output of journalctl around the time that the driver changed?

MacGyver · Jan 29, 2024

Thanks @leesteken. I just now re-enabled iGPU passthru again on my Plex LXC via adding back those lxc.mount.entry lines, and will examine the output of journalctl after the issue returns.

BTW, do I check journalctl on the host or within the LXC?

leesteken · Jan 29, 2024

MacGyver said:
Thanks @leesteken. I just now re-enabled iGPU passthru again on my Plex LXC via adding back those lxc.mount.entry lines, and will examine the output of journalctl after the issue returns.

That container cannot be the cause of this (as it should not have the permissions to change drivers on the host).

MacGyver said:
BTW, do I check journalctl on the host or within the LXC?

The Proxmox host, since the driver is switched there.

MacGyver · Jan 29, 2024

That container cannot be the cause of this (as it should not have the permissions to change drivers on the host).

Yes, I had previously experimented with VM PCIe passthrough on some VMs. But none of those VMs have been powered on in a long time. I still have some related kernel modules enabled though. I don't mind to remove them if it'll help with troubleshooting.

/etc/modules:

vfio
vfio_iommu_type1
vfio_pci

I've experienced that by disabling the device mounts for /dev/dri* (/dev/fb0 /dev/dri /dev/dri/renderD128) in my Plex LXC, then those devices (/dev/dri*) will remain working after two weeks. But when they're enabled, these devices will typically disappear within a week's time and tools like intel_gpu_top will no longer work. Now that I've re-enabled these device mounts back into this LXC, I'll report back later to see if this change alone will cause the issue to reappear.

leesteken · Jan 29, 2024

MacGyver said:
I don't mind to remove them if it'll help with troubleshooting.

/etc/modules:
vfio vfio_iommu_type1 vfio_pci

I'm not sure if it would make a difference. Maybe it would stop you starting VMs with PCIe passthrough accidentally.

MacGyver said:
I've experienced that by disabling the device mounts for /dev/dri* (/dev/fb0 /dev/dri /dev/dri/renderD128) in my Plex LXC, then those devices (/dev/dri*) will remain working after two weeks. But when they're enabled, these devices will typically disappear within a week's time and tools like intel_gpu_top will no longer work. Now that I've re-enabled these device mounts back into this LXC, I'll report back later to see if this change alone will cause the issue to reappear.

Again, the container and it's configuration will not have any influence on switching the driver.

Can you show the journalctl part I asked, please?

omgitsbanzai · Apr 4, 2024

leesteken said:
I'm not sure if it would make a difference. Maybe it would stop you starting VMs with PCIe passthrough accidentally.

Again, the container and it's configuration will not have any influence on switching the driver.

Can you show the journalctl part I asked, please?

Hi,
I'm not OP, but having the same issue where my frigate LXC stops working during the night. After I reboot the proxmox host, the LXC works fine all day. This is what I'm seeing on the proxmox host with journalctl

Code:

Feb 09 18:30:16 proxmox kernel: Linux version 6.5.11-8-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for De>
Feb 09 18:30:16 proxmox kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.5.11-8-pve root=/dev/mapper/pve-root ro quiet
Feb 09 18:30:16 proxmox kernel: KERNEL supported cpus:
Feb 09 18:30:16 proxmox kernel:   Intel GenuineIntel
Feb 09 18:30:16 proxmox kernel:   AMD AuthenticAMD
Feb 09 18:30:16 proxmox kernel:   Hygon HygonGenuine
Feb 09 18:30:16 proxmox kernel:   Centaur CentaurHauls
Feb 09 18:30:16 proxmox kernel:   zhaoxin   Shanghai
Feb 09 18:30:16 proxmox kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space spl>
Feb 09 18:30:16 proxmox kernel: BIOS-provided physical RAM map:
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x000000000009e000-0x000000000009efff] reserved
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x000000000009f000-0x000000000009ffff] usable
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x0000000000100000-0x0000000039c93fff] usable
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x0000000039c94000-0x000000004077efff] reserved
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x000000004077f000-0x00000000408edfff] ACPI data
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x00000000408ee000-0x0000000040a1afff] ACPI NVS
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x0000000040a1b000-0x0000000041735fff] reserved
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x0000000041736000-0x00000000417fefff] type 20
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x00000000417ff000-0x00000000417fffff] usable
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x0000000041800000-0x0000000047ffffff] reserved
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x0000000048c00000-0x0000000048ffffff] reserved
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x0000000049e00000-0x00000000503fffff] reserved
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x00000000c0000000-0x00000000cfffffff] reserved
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x00000000fe000000-0x00000000fe010fff] reserved
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x00000000fed00000-0x00000000fed00fff] reserved
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x00000000fed20000-0x00000000fed7ffff] reserved
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
Feb 09 18:30:16 proxmox kernel: BIOS-e820: [mem 0x0000000100000000-0x00000004afbfffff] usable
Feb 09 18:30:16 proxmox kernel: NX (Execute Disable) protection: active
Feb 09 18:30:16 proxmox kernel: efi: EFI v2.8 by American Megatrends
Feb 09 18:30:16 proxmox kernel: efi: ACPI=0x408ed000 ACPI 2.0=0x408ed014 TPMFinalLog=0x40964000 SMBIOS=0x41390000 SMBIOS 3.0=0x4138f00>
Feb 09 18:30:16 proxmox kernel: efi: Remove mem80: MMIO range=[0xc0000000-0xcfffffff] (256MB) from e820 map
Feb 09 18:30:16 proxmox kernel: e820: remove [mem 0xc0000000-0xcfffffff] reserved
Feb 09 18:30:16 proxmox kernel: efi: Not removing mem81: MMIO range=[0xfe000000-0xfe010fff] (68KB) from e820 map
Feb 09 18:30:16 proxmox kernel: efi: Not removing mem82: MMIO range=[0xfec00000-0xfec00fff] (4KB) from e820 map
Feb 09 18:30:16 proxmox kernel: efi: Not removing mem83: MMIO range=[0xfed00000-0xfed00fff] (4KB) from e820 map
Feb 09 18:30:16 proxmox kernel: efi: Not removing mem85: MMIO range=[0xfee00000-0xfee00fff] (4KB) from e820 map
Feb 09 18:30:16 proxmox kernel: efi: Remove mem86: MMIO range=[0xff000000-0xffffffff] (16MB) from e820 map
Feb 09 18:30:16 proxmox kernel: e820: remove [mem 0xff000000-0xffffffff] reserved
Feb 09 18:30:16 proxmox kernel: secureboot: Secure boot disabled
Feb 09 18:30:16 proxmox kernel: SMBIOS 3.5.0 present.
Feb 09 18:30:16 proxmox kernel: DMI: LENOVO 90W2000HUT/333E, BIOS M4YKT14A 11/16/2023
Feb 09 18:30:16 proxmox kernel: tsc: Detected 3200.000 MHz processor
Feb 09 18:30:16 proxmox kernel: tsc: Detected 3187.200 MHz TSC
Feb 09 18:30:16 proxmox kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
Feb 09 18:30:16 proxmox kernel: e820: remove [mem 0x000a0000-0x000fffff] usable
Feb 09 18:30:16 proxmox kernel: last_pfn = 0x4afc00 max_arch_pfn = 0x400000000
Feb 09 18:30:16 proxmox kernel: MTRR map: 5 entries (3 fixed + 2 variable; max 23), built from 10 variable MTRRs
Feb 09 18:30:16 proxmox kernel: x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
Feb 09 18:30:16 proxmox kernel: last_pfn = 0x41800 max_arch_pfn = 0x400000000
Feb 09 18:30:16 proxmox kernel: esrt: Reserving ESRT space from 0x00000000355ddb18 to 0x00000000355ddc18.
Feb 09 18:30:16 proxmox kernel: e820: update [mem 0x355dd000-0x355ddfff] usable ==> reserved
Feb 09 18:30:16 proxmox kernel: Using GB pages for direct mapping
Feb 09 18:30:16 proxmox kernel: Incomplete global flushes, disabling PCID
Feb 09 18:30:16 proxmox kernel: secureboot: Secure boot disabled
Feb 09 18:30:16 proxmox kernel: RAMDISK: [mem 0x1e72a000-0x220ddfff]
Feb 09 18:30:16 proxmox kernel: ACPI: Early table checksum verification disabled
Feb 09 18:30:16 proxmox kernel: ACPI: RSDP 0x00000000408ED014 000024 (v02 LENOVO)
Feb 09 18:30:16 proxmox kernel: ACPI: XSDT 0x00000000408EC728 000124 (v01 LENOVO TC-M4Y   00001140 AMI  01000013)

and here's what i see on the LXC for journalctl

Code:

Mar 29 19:55:42 docker-frigate systemd-journald[68]: Journal started
Mar 29 19:55:42 docker-frigate systemd-journald[68]: Runtime Journal (/run/log/journal/c350c9fe444f4a5a93d2a4189888a7d9) is 8.0M, max 311.4M, 303.4M free.
Mar 29 19:55:42 docker-frigate systemd[1]: Starting systemd-journal-flush.service - Flush Journal to Persistent Storage...
Mar 29 19:55:42 docker-frigate systemd[1]: Finished ifupdown-pre.service - Helper to synchronize boot up for ifupdown.
Mar 29 19:55:42 docker-frigate systemd[1]: Finished systemd-sysusers.service - Create System Users.
Mar 29 19:55:42 docker-frigate systemd[1]: Starting systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev...
Mar 29 19:55:42 docker-frigate systemd-journald[68]: Time spent on flushing to /var/log/journal/c350c9fe444f4a5a93d2a4189888a7d9 is 792us for 6 entries.
Mar 29 19:55:42 docker-frigate systemd-journald[68]: System Journal (/var/log/journal/c350c9fe444f4a5a93d2a4189888a7d9) is 8.0M, max 1.5G, 1.5G free.
Mar 29 19:55:42 docker-frigate systemd[1]: Finished systemd-journal-flush.service - Flush Journal to Persistent Storage.
Mar 29 19:55:42 docker-frigate systemd[1]: Finished systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev.
Mar 29 19:55:42 docker-frigate systemd[1]: Reached target local-fs-pre.target - Preparation for Local File Systems.
Mar 29 19:55:42 docker-frigate systemd[1]: Reached target local-fs.target - Local File Systems.
Mar 29 19:55:42 docker-frigate systemd[1]: Starting networking.service - Raise network interfaces...
Mar 29 19:55:42 docker-frigate systemd[1]: systemd-binfmt.service - Set Up Additional Binary Formats was skipped because of an unmet condition check (ConditionPathIsReadWrite=/proc/sys).
Mar 29 19:55:42 docker-frigate systemd[1]: Starting systemd-machine-id-commit.service - Commit a transient machine-id on disk...
Mar 29 19:55:42 docker-frigate systemd[1]: systemd-sysext.service - Merge System Extension Images into /usr/ and /opt/ was skipped because no trigger condition checks were met.
Mar 29 19:55:42 docker-frigate systemd[1]: Starting systemd-tmpfiles-setup.service - Create Volatile Files and Directories...
Mar 29 19:55:42 docker-frigate systemd[1]: systemd-udevd.service - Rule-based Manager for Device Events and Files was skipped because of an unmet condition check (ConditionPathIsReadWrite=/sys).
Mar 29 19:55:42 docker-frigate systemd[1]: Finished systemd-tmpfiles-setup.service - Create Volatile Files and Directories.
Mar 29 19:55:42 docker-frigate systemd[1]: systemd-timesyncd.service - Network Time Synchronization was skipped because of an unmet condition check (ConditionVirtualization=!container).
Mar 29 19:55:42 docker-frigate systemd[1]: Reached target time-set.target - System Time Set.
Mar 29 19:55:42 docker-frigate systemd[1]: Starting systemd-update-utmp.service - Record System Boot/Shutdown in UTMP...
Mar 29 19:55:42 docker-frigate systemd[1]: Finished systemd-machine-id-commit.service - Commit a transient machine-id on disk.
Mar 29 19:55:42 docker-frigate systemd[1]: Finished systemd-update-utmp.service - Record System Boot/Shutdown in UTMP.
Mar 29 19:55:42 docker-frigate systemd[1]: Reached target sysinit.target - System Initialization.
Mar 29 19:55:42 docker-frigate systemd[1]: Started postfix-resolvconf.path - Watch for resolv.conf updates and restart postfix.
Mar 29 19:55:42 docker-frigate systemd[1]: Started apt-daily.timer - Daily apt download activities.
Mar 29 19:55:42 docker-frigate systemd[1]: Started apt-daily-upgrade.timer - Daily apt upgrade and clean activities.
Mar 29 19:55:42 docker-frigate systemd[1]: Started dpkg-db-backup.timer - Daily dpkg database backup timer.
Mar 29 19:55:42 docker-frigate systemd[1]: Started e2scrub_all.timer - Periodic ext4 Online Metadata Check for All Filesystems.
Mar 29 19:55:42 docker-frigate systemd[1]: fstrim.timer - Discard unused blocks once a week was skipped because of an unmet condition check (ConditionVirtualization=!container).
Mar 29 19:55:42 docker-frigate systemd[1]: Started logrotate.timer - Daily rotation of log files.
Mar 29 19:55:42 docker-frigate systemd[1]: Started man-db.timer - Daily man-db regeneration.
Mar 29 19:55:42 docker-frigate systemd[1]: Started systemd-tmpfiles-clean.timer - Daily Cleanup of Temporary Directories.
Mar 29 19:55:42 docker-frigate systemd[1]: Reached target paths.target - Path Units.
Mar 29 19:55:42 docker-frigate systemd[1]: Reached target timers.target - Timer Units.
Mar 29 19:55:42 docker-frigate systemd[1]: Listening on dbus.socket - D-Bus System Message Bus Socket.
Mar 29 19:55:42 docker-frigate systemd[1]: Listening on ssh.socket - OpenBSD Secure Shell server socket.
Mar 29 19:55:42 docker-frigate systemd[1]: Reached target sockets.target - Socket Units.
Mar 29 19:55:42 docker-frigate systemd[1]: systemd-pcrphase-sysinit.service - TPM2 PCR Barrier (Initialization) was skipped because of an unmet condition check (ConditionPathExists=/sys/firmware/efi/efivar>
Mar 29 19:55:42 docker-frigate systemd[1]: Reached target basic.target - Basic System.
Mar 29 19:55:42 docker-frigate systemd[1]: Started cron.service - Regular background program processing daemon.
Mar 29 19:55:42 docker-frigate systemd[1]: Starting dbus.service - D-Bus System Message Bus...
Mar 29 19:55:42 docker-frigate systemd[1]: Starting e2scrub_reap.service - Remove Stale Online ext4 Metadata Check Snapshots...
Mar 29 19:55:42 docker-frigate systemd[1]: getty-static.service - getty on tty2-tty6 if dbus and logind are not available was skipped because of an unmet condition check (ConditionPathExists=!/usr/bin/dbus>
Mar 29 19:55:42 docker-frigate systemd[1]: Started postfix-resolvconf.service - Copies updated resolv.conf to postfix chroot and restarts postfix..
Mar 29 19:55:42 docker-frigate systemd[1]: Starting systemd-logind.service - User Login Management...
Mar 29 19:55:42 docker-frigate systemd[1]: systemd-pcrphase.service - TPM2 PCR Barrier (User) was skipped because of an unmet condition check (ConditionPathExists=/sys/firmware/efi/efivars/StubPcrKernelIma>
Mar 29 19:55:42 docker-frigate cron[104]: (CRON) INFO (pidfile fd = 3)
Mar 29 19:55:42 docker-frigate cron[104]: (CRON) INFO (Running @reboot jobs)
Mar 29 19:55:42 docker-frigate systemd[1]: Started dbus.service - D-Bus System Message Bus.
Mar 29 19:55:42 docker-frigate systemd[1]: postfix-resolvconf.service: Deactivated successfully.
Mar 29 19:55:42 docker-frigate systemd-logind[108]: New seat seat0.
Mar 29 19:55:42 docker-frigate systemd[1]: Started systemd-logind.service - User Login Management.
Mar 29 19:55:42 docker-frigate systemd[1]: e2scrub_reap.service: Deactivated successfully.
Mar 29 19:55:42 docker-frigate systemd[1]: Finished e2scrub_reap.service - Remove Stale Online ext4 Metadata Check Snapshots.
Mar 29 19:55:42 docker-frigate systemd[1]: Starting systemd-networkd.service - Network Configuration...
Mar 29 19:55:42 docker-frigate systemd-networkd[128]: eth0: Link UP
Mar 29 19:55:42 docker-frigate systemd-networkd[128]: eth0: Gained carrier
Mar 29 19:55:42 docker-frigate systemd-networkd[128]: lo: Link UP
Mar 29 19:55:42 docker-frigate systemd-networkd[128]: lo: Gained carrier
Mar 29 19:55:42 docker-frigate systemd-networkd[128]: Enumeration completed
Mar 29 19:55:42 docker-frigate systemd[1]: Started systemd-networkd.service - Network Configuration.

leesteken · Apr 4, 2024

omgitsbanzai said:
I'm not OP, but having the same issue where my frigate LXC stops working during the night. After I reboot the proxmox host, the LXC works fine all day. This is what I'm seeing on the proxmox host with journalctl

You are only showing the first parts of the logs (and at different times?) but not the parts around the time that it stops working. There is nothing for me to see here (and I don't know what to look for).

omgitsbanzai · Apr 4, 2024

leesteken said:
You are only showing the first parts of the logs (and at different times?) but not the parts around the time that it stops working. There is nothing for me to see here (and I don't know what to look for).

Thank you for your reply and apologies for my insufficient data as I learn to troubleshoot here. It is strange to me that by default, the journalctl command my proxmox host by default returned a previous date. I'll test the journalctl commands here and try to identify the error. https://www.digitalocean.com/commun...ournalctl-to-view-and-manipulate-systemd-logs

I may also run some tteck scripts to try and cleanup since I had originally created a VM with iGPU passthrough prior to the LXC setup.
Backing up the LXCs and VMs and re-installing the host might also be an option

MacGyver · Apr 5, 2024

leesteken said:
I'm not sure if it would make a difference. Maybe it would stop you starting VMs with PCIe passthrough accidentally.

Again, the container and it's configuration will not have any influence on switching the driver.

Can you show the journalctl part I asked, please?

Update:
Good news... my Proxmox host has been up for 56 days and the issue of losing my Intel GPU video acceleration hasn't come back. I just now manually restarted anyway and will continue to watch if the issue returns.

What's changed?
About that many days ago I also switched to using an HDMI dummy plug (EDID emulator). This one to be precise... https://www.amazon.com/gp/product/B0B7R5NYPV.

If I happen to get a chance to re-test without the dummy plug in the future and can get the journalctl output around the time that I lose my iGPU video acceleration (i.e., my dev/dri/* devices disappearing), I'll keep everyone posted. I can only assume the hardware (Lenovo ThinkCentre M920q running an i7-8700T [UHD Graphics 630] and Q370 chipset) may be partly to blame, or that something undocumented hardware-wise is going on that the kernel/VFIO can't handle. And this system does have vPro and I've got that functionality (including "remote desktop") enabled in the BIOS. If anyone has any insight into this, I'd be interested in knowing the true culprit.

cheers,
Mac

omgitsbanzai · Apr 5, 2024

MacGyver said:
Update:
Good news... my Proxmox host has been up for 56 days and the issue of losing my Intel GPU video acceleration hasn't come back. I just now manually restarted anyway and will continue to watch if the issue returns.

What's changed?
About that many days ago I also switched to using an HDMI dummy plug (EDID emulator). This one to be precise... https://www.amazon.com/gp/product/B0B7R5NYPV.

If I happen to get a chance to re-test without the dummy plug in the future and can get the journalctl output around the time that I lose my iGPU video acceleration (i.e., my dev/dri/* devices disappearing), I'll keep everyone posted. I can only assume the hardware (Lenovo ThinkCentre M920q running an i7-8700T [UHD Graphics 630] and Q370 chipset) may be partly to blame, or that something undocumented hardware-wise is going on that the kernel/VFIO can't handle. And this system does have vPro and I've got that functionality (including "remote desktop") enabled in the BIOS. If anyone has any insight into this, I'd be interested in knowing the true culprit.

cheers,
Mac

Thanks for the update! Maybe I'll test out a dummy plug as well.

Do you happen to use proxmox backup server? I'm looking through my logs and I think my issue is related to PBS backups at 1:30am.
VM 102 and 103 previously contained ubuntu server, but are no longer in use.
LXC 104 is my new frigate LXC which requires iGPU and PCIe coral.

This morning I removed the two VMs from my backup job (datacenter > backup > backup job > exclude selected VMs).
I'll see if the issue occurs again tonight.

Here was my journalctl -b

Code:

Apr 05 01:31:17 proxmox pvescheduler[532995]: INFO: Finished Backup of VM 102 (00:00:18)
Apr 05 01:31:17 proxmox pvescheduler[532995]: INFO: Starting Backup of VM 103 (qemu)
Apr 05 01:31:17 proxmox qmeventd[534411]: Starting cleanup for 102
Apr 05 01:31:17 proxmox qmeventd[534411]: Finished cleanup for 102
Apr 05 01:31:19 proxmox kernel: pcieport 0000:00:06.2: broken device, retraining non-functional downstream link at 2.5GT/s
Apr 05 01:31:20 proxmox kernel: pcieport 0000:00:06.2: retraining failed
Apr 05 01:31:21 proxmox kernel: pcieport 0000:00:06.2: broken device, retraining non-functional downstream link at 2.5GT/s
Apr 05 01:31:22 proxmox kernel: pcieport 0000:00:06.2: retraining failed
Apr 05 01:31:22 proxmox kernel: vfio-pci 0000:02:00.0: not ready 1023ms after resume; waiting
Apr 05 01:31:23 proxmox kernel: vfio-pci 0000:02:00.0: not ready 2047ms after resume; waiting
Apr 05 01:31:25 proxmox kernel: vfio-pci 0000:02:00.0: not ready 4095ms after resume; waiting
Apr 05 01:31:30 proxmox kernel: vfio-pci 0000:02:00.0: not ready 8191ms after resume; waiting
Apr 05 01:31:38 proxmox kernel: vfio-pci 0000:02:00.0: not ready 16383ms after resume; waiting
Apr 05 01:31:56 proxmox kernel: vfio-pci 0000:02:00.0: not ready 32767ms after resume; waiting
Apr 05 01:32:31 proxmox kernel: vfio-pci 0000:02:00.0: not ready 65535ms after resume; giving up
Apr 05 01:32:31 proxmox kernel: vfio-pci 0000:02:00.0: Unable to change power state from D3cold to D0, device inaccessible
Apr 05 01:32:31 proxmox kernel: vfio-pci 0000:02:00.0: Unable to change power state from D3cold to D0, device inaccessible
Apr 05 01:32:32 proxmox kernel: pcieport 0000:00:06.2: broken device, retraining non-functional downstream link at 2.5GT/s
Apr 05 01:32:33 proxmox kernel: pcieport 0000:00:06.2: retraining failed
Apr 05 01:32:34 proxmox kernel: pcieport 0000:00:06.2: broken device, retraining non-functional downstream link at 2.5GT/s
Apr 05 01:32:35 proxmox kernel: pcieport 0000:00:06.2: retraining failed
Apr 05 01:32:35 proxmox kernel: vfio-pci 0000:02:00.0: not ready 1023ms after bus reset; waiting
Apr 05 01:32:36 proxmox kernel: vfio-pci 0000:02:00.0: not ready 2047ms after bus reset; waiting
Apr 05 01:32:39 proxmox kernel: vfio-pci 0000:02:00.0: not ready 4095ms after bus reset; waiting
Apr 05 01:32:43 proxmox kernel: vfio-pci 0000:02:00.0: not ready 8191ms after bus reset; waiting
Apr 05 01:32:52 proxmox kernel: vfio-pci 0000:02:00.0: not ready 16383ms after bus reset; waiting
Apr 05 01:33:10 proxmox kernel: vfio-pci 0000:02:00.0: not ready 32767ms after bus reset; waiting
Apr 05 01:33:45 proxmox kernel: vfio-pci 0000:02:00.0: not ready 65535ms after bus reset; giving up
Apr 05 01:33:45 proxmox kernel: vfio-pci 0000:02:00.0: Unable to change power state from D3cold to D0, device inaccessible

MacGyver · Apr 5, 2024

omgitsbanzai said:
Thanks for the update! Maybe I'll test out a dummy plug as well.

Do you happen to use proxmox backup server? I'm looking through my logs and I think my issue is related to PBS backups at 1:30am.

No I don't use PBS.

omgitsbanzai · Apr 9, 2024

MacGyver said:
No I don't use PBS.

In case someone else finds this thread useful, just wanted to confirm that PBS backup was causing the issue for me.

Search

Search

Intel iGPU and HDMI video out stops working every few days

MacGyver

New Member

MacGyver

New Member

leesteken

Distinguished Member

MacGyver

New Member

MacGyver

New Member

leesteken

Distinguished Member

MacGyver

New Member

leesteken

Distinguished Member

MacGyver

New Member

leesteken

Distinguished Member

omgitsbanzai

New Member

leesteken

Distinguished Member

omgitsbanzai

New Member

MacGyver

New Member

omgitsbanzai

New Member

MacGyver

New Member

omgitsbanzai

New Member