[BUG] [HELP] - Kernel 6.8.12-13-pve Boot Failure - PCIe AER Errors & Storage Timeouts

GCX-1991

Aug 11, 2025

1. System Overview

  • Kernel Version: Linux version 6.8.12-13-pve (build@proxmox) (gcc (Debian 12.2.0-14+deb12u1) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-13 (2025-07-22T10:00Z) ()
  • CPU: AMD Ryzen 7 1800X Eight-Core Processor
  • Motherboard: ASUS TUF GAMING B550-PRO
  • BIOS: Version 3621, Date 01/13/2025
  • IOMMU: AMD-Vi is enabled with interrupt remapping and Virtual APIC.

2. Problem Description

After upgrading to kernel 6.8.12-13-pve, the system fails to boot and drops into emergency mode. The boot log is flooded with Uncorrectable (Non-Fatal) PCIe Advanced Error Reporting (AER) errors and AMD-Vi IO_PAGE_FAULT events. These hardware-level errors appear to destabilize the storage and network controllers, and systemd eventually times out waiting for the storage devices to appear. The same failure has occurred on two different ASUS systems with AMD processors.

3. Key Log Data: PCIe & IOMMU Errors

The boot log shows a constant stream of AER errors, primarily involving the PCIe root port 0000:00:01.2 and its downstream devices.

AMD-Vi IOMMU Faults:

IO page faults are repeatedly logged for the atlantic network controller before other errors cascade:

  • atlantic 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xc5739000 flags=0x0000]
  • pci 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xc5733800 flags=0x0020]
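
For anyone who wants to sift a longer boot log for these events, here is a minimal Python sketch that pulls the device address, domain, fault address and flags out of lines shaped like the two quoted above. The function names are my own (hypothetical), and the regular expression only assumes the exact line format shown in this post; other kernel versions may print the event differently.

```python
import re
from collections import Counter

# Matches lines of the form quoted above, e.g.
# "atlantic 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d ...]"
FAULT_RE = re.compile(
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): AMD-Vi: Event logged "
    r"\[IO_PAGE_FAULT domain=0x(?P<domain>[0-9a-f]+) "
    r"address=0x(?P<address>[0-9a-f]+) flags=0x(?P<flags>[0-9a-f]+)\]"
)


def parse_io_page_fault(line):
    """Return the fields of one IO_PAGE_FAULT line, or None if it does not match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    return {
        "device": m.group("bdf"),
        "domain": int(m.group("domain"), 16),
        "address": int(m.group("address"), 16),
        "flags": int(m.group("flags"), 16),
    }


def count_faults_per_device(lines):
    """Count how many IO_PAGE_FAULT events each PCI device produced."""
    counts = Counter()
    for line in lines:
        fault = parse_io_page_fault(line)
        if fault is not None:
            counts[fault["device"]] += 1
    return counts
```

Feeding it the output of `dmesg` or `journalctl -k -b` line by line should show whether 0000:0a:00.0 is the only device faulting or just the first one.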

Uncorrectable PCIe AER Errors:

Multiple devices report Uncorrectable (Non-Fatal) errors, all at the Transaction Layer.

Example 1: Atlantic Network Controller (Requester & First Agent to report error)

  • Device: atlantic 0000:0a:00.0 (Vendor: 1d6a, Device: 07b1)
  • Error: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Requester ID)
  • Status/Mask: 00104000/00000000
  • Details: [14] CmpltTO (Completion Timeout),[20] UnsupReq (Unsupported Request) (First)
  • TLP Header: 4a002004 00000010 0a000d30 001873c5
  • Note: The log states AER: Error of this Agent is reported first, indicating this network card may be the origin of the bus flood.
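
The Status/Mask value is a bitmask, and the bracketed numbers in the Details line are the bit positions that are set. As a sanity check, the sketch below decodes a Status/Mask pair in Python. The bit-to-name table is an assumption on my part: bits 14, 15 and 20 come straight from the log lines in this post, while the remaining entries follow the short names I believe the kernel uses for the PCIe Uncorrectable Error Status register. The function name is hypothetical.

```python
# Short names for bits of the Uncorrectable Error Status register.
# Bits 14, 15 and 20 match the log output quoted in this post; the
# other entries are assumed from the PCIe register layout.
UNCORRECTABLE_BITS = {
    4: "DLP",
    5: "SDES",
    12: "TLP",
    13: "FCP",
    14: "CmpltTO",
    15: "CmpltAbrt",
    16: "UnxCmplt",
    17: "RxOF",
    18: "MalfTLP",
    19: "ECRC",
    20: "UnsupReq",
    21: "ACSViol",
}


def decode_status_mask(field):
    """Decode a 'status/mask' hex pair into (bit, name, is_masked) tuples."""
    status_hex, mask_hex = field.split("/")
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    decoded = []
    for bit in range(32):
        if status & (1 << bit):
            name = UNCORRECTABLE_BITS.get(bit, "unknown")
            decoded.append((bit, name, bool(mask & (1 << bit))))
    return decoded


print(decode_status_mask("00104000/00000000"))
print(decode_status_mask("00108000/00000000"))
```

For 00104000 this yields bits 14 and 20 (CmpltTO, UnsupReq), and for 00108000, the value reported by the two completer devices, bits 15 and 20 (CmpltAbrt, UnsupReq). A mask of 00000000 means none of these errors is masked.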

Example 2: AHCI SATA Controller (Completer)

  • Device: ahci 0000:01:00.1 (Vendor: 1022, Device: 43eb)
  • Error: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Completer ID)
  • Status/Mask: 00108000/00000000
  • Details: [15] CmpltAbrt (Completer Abort),[20] UnsupReq (First)
  • TLP Header: 04000001 0000210f 01030000 00000000

Example 3: xHCI USB Controller (Completer)

  • Device: xhci_hcd 0000:01:00.0 (Vendor: 1022, Device: 43ee)
  • Error: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Completer ID)
  • Status/Mask: 00108000/00000000
  • Details: [15] CmpltAbrt,[20] UnsupReq (First)
  • TLP Header: 04000001 0000210f 01030000 00000000
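
The TLP headers above can also be picked apart to see which transaction triggered each report. This is a rough Python sketch, not a full decoder. It assumes the kernel prints the header as four 32-bit words with the first header byte in the top byte of the first word, and it assumes the standard layouts for completions and Type 0 configuration requests. The table only covers the packet types relevant here, and all names in the code are hypothetical.

```python
# (fmt, type) pairs for the few packet kinds relevant here; anything
# else is reported as "other". Assumed from the standard TLP encoding.
TLP_TYPES = {
    (0b000, 0b00100): "CfgRd0",
    (0b010, 0b00100): "CfgWr0",
    (0b000, 0b01010): "Cpl",
    (0b010, 0b01010): "CplD",
}


def format_bdf(rid):
    """Format a 16-bit requester/completer ID as bus:device.function."""
    return f"{(rid >> 8) & 0xFF:02x}:{(rid >> 3) & 0x1F:02x}.{rid & 0x7}"


def decode_tlp_header(text):
    """Decode the type, length and IDs from a logged 'TLP Header' string."""
    dw = [int(word, 16) for word in text.split()]
    fmt = (dw[0] >> 29) & 0x7     # header format, bits 31:29 of DW0
    typ = (dw[0] >> 24) & 0x1F    # packet type, bits 28:24 of DW0
    info = {
        "type": TLP_TYPES.get((fmt, typ), "other"),
        "length_dw": dw[0] & 0x3FF,
    }
    if typ == 0b01010:
        # Completion: completer ID in DW1, requester ID in DW2.
        info["completer"] = format_bdf(dw[1] >> 16)
        info["requester"] = format_bdf(dw[2] >> 16)
    elif typ == 0b00100:
        # Type 0 config request: requester ID in DW1, target in DW2.
        info["requester"] = format_bdf(dw[1] >> 16)
        info["target"] = format_bdf(dw[2] >> 16)
    return info


print(decode_tlp_header("4a002004 00000010 0a000d30 001873c5"))
print(decode_tlp_header("04000001 0000210f 01030000 00000000"))
```

If the layout assumptions hold, the first header is a completion with data addressed to requester 0a:00.0 (the atlantic NIC), and the second is a configuration read aimed at 01:00.3. I have not verified this against the spec text, so treat it as a hint rather than a diagnosis.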

4. Symptom: Storage Mount Failure

The direct consequence of the hardware instability is the failure of the OS to mount required filesystems.

  • Systemd Timeout: The system waits for the full timeout period before giving up on the block devices.
    • Timed out waiting for device dev-disk-by\x2duuid-0676278C76277B95.device - /dev/disk/by-uuid/0676278C76277B95
    • Timed out waiting for device dev-disk-by\x2duuid-665E24CA5E249539.device - /dev/disk/by-uuid/665E24CA5E249539
  • Result: This leads to dependency failures for mount units and the local-fs.target, forcing the system into emergency mode.
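
The .device unit names in the timeout messages are just escaped paths, which makes it easy to map them back to /etc/fstab entries. Below is a small Python sketch of the reverse mapping. It only handles the escaping visible in the messages above ("-" standing for "/", and "\x2d" for a literal "-"); the function name is hypothetical. On the system itself, `systemd-escape --unescape --path` should do the same job.

```python
import re


def unit_to_device_path(unit):
    """Turn an escaped systemd .device unit name back into a device path."""
    suffix = ".device"
    name = unit[: -len(suffix)] if unit.endswith(suffix) else unit
    # In escaped unit names every "/" is written as "-" ...
    path = name.replace("-", "/")
    # ... and literal characters such as "-" are written as \xNN escapes.
    path = re.sub(
        r"\\x([0-9a-fA-F]{2})",
        lambda m: chr(int(m.group(1), 16)),
        path,
    )
    return "/" + path


print(unit_to_device_path(r"dev-disk-by\x2duuid-0676278C76277B95.device"))
print(unit_to_device_path(r"dev-disk-by\x2duuid-665E24CA5E249539.device"))
```

Both names resolve to the /dev/disk/by-uuid/ paths shown in the messages, so the units that time out are the two UUID-based mounts.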

This appears to be either a kernel regression or a hardware/firmware incompatibility exposed by the new kernel's handling of PCIe AER or the AMD IOMMU.