[BUG] - Kernel 6.8.12-13-pve Boot Failure - PCIe AER Errors & Storage Timeouts
1. System Overview
- Kernel Version: Linux version 6.8.12-13-pve (build@proxmox) (gcc (Debian 12.2.0-14+deb12u1) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-13 (2025-07-22T10:00Z) ()
- CPU: AMD Ryzen 7 1800X Eight-Core Processor
- Motherboard: ASUS TUF GAMING B550-PRO
- BIOS: Version 3621, Date 01/13/2025
- IOMMU: AMD-Vi is enabled with interrupt remapping and Virtual APIC.
2. Problem Description
After upgrading to kernel 6.8.12-13-pve, the system fails to boot and enters emergency mode. The boot process is flooded with Uncorrectable (Non-Fatal) PCIe Advanced Error Reporting (AER) errors and AMD-Vi: IO_PAGE_FAULT events. These hardware-level errors appear to cause instability in storage and network controllers, ultimately leading to systemd timing out while waiting for storage devices to appear. This has occurred on two different ASUS systems with AMD processors.
3. Key Log Data: PCIe & IOMMU Errors
The boot log shows a constant stream of AER errors, primarily involving the PCIe root port 0000:00:01.2 and its downstream devices.
AMD-Vi IOMMU Faults:
IO page faults are repeatedly logged for the atlantic network controller before other errors cascade:
- atlantic 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xc5739000 flags=0x0000]
- pci 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xc5733800 flags=0x0020]
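For anyone triaging a longer boot log, the fault lines can be pulled out mechanically. The following is only a rough sketch: the regular expression is written against the two lines quoted above and may need adjusting for other kernel versions, and the names `FAULT_RE`, `parse_io_page_faults`, and `SAMPLE` are made up for this illustration.

```python
import re

# Assumption: matches the AMD-Vi event format exactly as quoted in this report.
FAULT_RE = re.compile(
    r"(?P<driver>\S+) (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT "
    r"domain=0x(?P<domain>[0-9a-f]+) "
    r"address=0x(?P<address>[0-9a-f]+) "
    r"flags=0x(?P<flags>[0-9a-f]+)\]"
)

def parse_io_page_faults(log_text):
    """Return one dict per IO_PAGE_FAULT line found in log_text."""
    faults = []
    for m in FAULT_RE.finditer(log_text):
        faults.append({
            "driver": m.group("driver"),   # name printed before the PCI address
            "bdf": m.group("bdf"),         # domain:bus:device.function
            "domain": int(m.group("domain"), 16),
            "address": int(m.group("address"), 16),
            "flags": int(m.group("flags"), 16),
        })
    return faults

# The two lines quoted in this report.
SAMPLE = """\
atlantic 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xc5739000 flags=0x0000]
pci 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xc5733800 flags=0x0020]
"""

for fault in parse_io_page_faults(SAMPLE):
    print(fault["bdf"], hex(fault["address"]), hex(fault["flags"]))
```

Grouping the results by `bdf` should make it easy to confirm whether every fault comes from 0000:0a:00.0 or whether other devices are affected as well.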
Uncorrectable PCIe AER Errors:
Multiple devices report uncorrectable (non-fatal) errors.
Example 1: Atlantic Network Controller (Requester & First Agent to report error)
- Device: atlantic 0000:0a:00.0 (Vendor: 1d6a, Device: 07b1)
- Error: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Requester ID)
- Status/Mask: 00104000/00000000
- Details: [14] CmpltTO (Completion Timeout), [20] UnsupReq (Unsupported Request) (First)
- TLP Header: 4a002004 00000010 0a000d30 001873c5
- Note: The log states "AER: Error of this Agent is reported first", which suggests this network card may be the origin of the error flood.
Example 2: AHCI Controller (Completer)
- Device: ahci 0000:01:00.1 (Vendor: 1022, Device: 43eb)
- Error: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Completer ID)
- Status/Mask: 00108000/00000000
- Details: [15] CmpltAbrt (Completer Abort), [20] UnsupReq (First)
- TLP Header: 04000001 0000210f 01030000 00000000
Example 3: xHCI USB Controller (Completer)
- Device: xhci_hcd 0000:01:00.0 (Vendor: 1022, Device: 43ee)
- Error: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Completer ID)
- Status/Mask: 00108000/00000000
- Details: [15] CmpltAbrt, [20] UnsupReq (First)
- TLP Header: 04000001 0000210f 01030000 00000000
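The Status values in the entries above are bitmasks of the AER Uncorrectable Error Status register, and the bracketed numbers in the Details lines are the bit positions that are set. As a sanity check, a small sketch that decodes them; the table covers only part of the register and uses the short names the kernel prints, and `UNCORRECTABLE_BITS` and `decode_uncorrectable_status` are names invented here.

```python
# Partial table of AER Uncorrectable Error Status bits (bit position -> name).
# Bits not listed here are reported as "unknown" by the helper below.
UNCORRECTABLE_BITS = {
    4: "DLP",         # Data Link Protocol Error
    5: "SDES",        # Surprise Down Error
    12: "TLP",        # Poisoned TLP
    13: "FCP",        # Flow Control Protocol Error
    14: "CmpltTO",    # Completion Timeout
    15: "CmpltAbrt",  # Completer Abort
    16: "UnxCmplt",   # Unexpected Completion
    17: "RxOF",       # Receiver Overflow
    18: "MalfTLP",    # Malformed TLP
    19: "ECRC",       # ECRC Error
    20: "UnsupReq",   # Unsupported Request
    21: "ACSViol",    # ACS Violation
}

def decode_uncorrectable_status(status_hex):
    """Return (bit, name) pairs for every bit set in a 32-bit status value."""
    status = int(status_hex, 16)
    result = []
    for bit in range(32):
        if status & (1 << bit):
            result.append((bit, UNCORRECTABLE_BITS.get(bit, "unknown")))
    return result

# Status values taken from the log excerpts in this report.
for value in ("00104000", "00108000"):
    print(value, decode_uncorrectable_status(value))
```

Both values decode to the same bits the kernel lists: 14 and 20 for the network controller, 15 and 20 for the AHCI and xHCI controllers. The Mask of 00000000 means none of these errors are masked.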
4. Symptom: Storage Mount Failure
The direct consequence of the hardware instability is the failure of the OS to mount required filesystems.
- Systemd Timeout: The system waits for the full timeout period before giving up on the block devices.
- Timed out waiting for device dev-disk-by\x2duuid-0676278C76277B95.device - /dev/disk/by-uuid/0676278C76277B95
- Timed out waiting for device dev-disk-by\x2duuid-665E24CA5E249539.device - /dev/disk/by-uuid/665E24CA5E249539
- Result: This leads to dependency failures for mount units and the local-fs.target, forcing the system into emergency mode.
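The unit names in the timeout messages are escaped forms of the device paths: "/" becomes "-" and a literal "-" becomes "\x2d". To match a timed-out unit against an fstab entry, the escaping can be reversed. This is a simplified sketch of that reversal written for this report, not systemd's own code, and `device_unit_to_path` is a hypothetical helper name.

```python
import re

def device_unit_to_path(unit_name):
    """Convert an escaped systemd .device unit name back to a device path."""
    suffix = ".device"
    stem = unit_name[:-len(suffix)] if unit_name.endswith(suffix) else unit_name
    # An unescaped "-" stands for a path separator.
    with_slashes = stem.replace("-", "/")
    # "\xNN" stands for the byte with hex value NN (for example \x2d is "-").
    raw = re.sub(
        rb"\\x([0-9a-fA-F]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        with_slashes.encode("utf-8"),
    )
    return "/" + raw.decode("utf-8", errors="replace")

# Unit names taken from the timeout messages in this report.
for unit in (
    r"dev-disk-by\x2duuid-0676278C76277B95.device",
    r"dev-disk-by\x2duuid-665E24CA5E249539.device",
):
    print(device_unit_to_path(unit))
```

The resulting paths are the same /dev/disk/by-uuid/ entries that systemd prints after each unit name, which confirms that both timeouts refer to filesystems addressed by UUID.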