Kernel errors from passthrough

PeterMarcusH.

Member
Apr 5, 2019
Trying to pass through my GPU. I've had success passing through a LAN NIC, but can't seem to get the GPU working.
When I boot up a test VM, I get the following from dmesg:

Bash:
[ 1000.894860] vmbr0: port 6(tap104i0) entered disabled state
[ 1022.699301] vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[ 1022.735578] vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[ 1022.735762] vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[ 1023.536711] device tap104i0 entered promiscuous mode
[ 1023.545266] vmbr0: port 6(tap104i0) entered blocking state
[ 1023.545268] vmbr0: port 6(tap104i0) entered disabled state
[ 1023.545397] vmbr0: port 6(tap104i0) entered blocking state
[ 1023.545398] vmbr0: port 6(tap104i0) entered forwarding state
[ 1026.233927] vfio-pci 0000:01:00.0: enabling device (0000 -> 0003)
[ 1026.234225] vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[ 1027.723365] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[ 1027.723368] {3}[Hardware Error]: event severity: info
[ 1027.723371] {3}[Hardware Error]:  Error 0, type: fatal
[ 1027.723372] {3}[Hardware Error]:  fru_text: PcieError
[ 1027.723373] {3}[Hardware Error]:   section_type: PCIe error
[ 1027.723374] {3}[Hardware Error]:   port_type: 4, root port
[ 1027.723375] {3}[Hardware Error]:   version: 0.2
[ 1027.723377] {3}[Hardware Error]:   command: 0x0407, status: 0x0010
[ 1027.723378] {3}[Hardware Error]:   device_id: 0000:00:03.1
[ 1027.723379] {3}[Hardware Error]:   slot: 2
[ 1027.723380] {3}[Hardware Error]:   secondary_bus: 0x01
[ 1027.723381] {3}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1453
[ 1027.723381] {3}[Hardware Error]:   class_code: 060400
[ 1027.723382] {3}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0012
[ 1027.723383] {3}[Hardware Error]:   aer_uncor_status: 0x00000000, aer_uncor_mask: 0x04500000
[ 1027.723384] {3}[Hardware Error]:   aer_uncor_severity: 0x00462030
[ 1027.723385] {3}[Hardware Error]:   TLP Header: 00000000 00000000 00000000 00000000
[ 1027.723414] pcieport 0000:00:03.1: AER: aer_status: 0x00000000, aer_mask: 0x04500000
[ 1027.723535] pcieport 0000:00:03.1: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
[ 1027.723652] pcieport 0000:00:03.1: AER: aer_uncor_severity: 0x00462030
[ 1028.743320] pcieport 0000:00:03.1: AER: Root Port link has been reset
[ 1028.743363] pcieport 0000:00:03.1: AER: Device recovery successful
[ 1044.794813] vmbr0: port 6(tap104i0) entered disabled state
[17839.116581] perf: interrupt took too long (2503 > 2500), lowering kernel.perf_event_max_sample_rate to 79750

When looking at the IOMMU grouping the GPU seems to be in order:

Bash:
IOMMU group 14
[RESET] 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] [10de:1c82] (rev a1)
        01:00.1 Audio device [0403]: NVIDIA Corporation GP107GL High Definition Audio Controller [10de:0fb9] (rev a1)

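For anyone who wants to reproduce a grouping listing like the one above, here is a minimal sketch that walks the sysfs tree. It assumes the kernel exposes groups under /sys/kernel/iommu_groups; the function name `list_iommu_groups` is made up for illustration, and it prints only PCI addresses, not the device names or the [RESET] marker shown above.

```shell
# Print every IOMMU group and the PCI addresses it contains.
# The optional first argument overrides the sysfs root (handy for testing).
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for group in "$root"/*; do
        # Skip anything that is not a group directory.
        [ -d "$group/devices" ] || continue
        echo "IOMMU group ${group##*/}"
        for dev in "$group"/devices/*; do
            [ -e "$dev" ] || continue
            echo "    ${dev##*/}"
        done
    done
}

# Usage (on the host): list_iommu_groups
```

The addresses it prints can be passed to `lspci -nn -s <address>` to get the device description.
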
Where am I going wrong?
 

Attachments

  • IOMMU_2.PNG (6.5 KB)
  • IOMMU_1.PNG (11.3 KB)
  • dmesg.PNG (142.2 KB)
AER (Advanced Error Reporting) is a hardware error reporting mechanism for PCIe devices. It's interesting that it triggers for a PCI bridge (device_id: 0000:00:03.1 in your log corresponds to the device from IOMMU_2.PNG if I'm not mistaken). You could try using a different physical port for your GPU, or a different device in the same port to narrow down the cause.
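
To double-check which device is reporting the error, something like the following could help. It is only a sketch: the helper name `aer_reporters` is hypothetical, and it assumes the kernel prints AER lines in the `pcieport <address>: AER:` form seen in the log above.

```shell
# Read kernel log text on stdin and print the unique PCI addresses
# of ports that logged AER messages.
aer_reporters() {
    sed -n 's/.*pcieport \([0-9a-f:.]*\): AER:.*/\1/p' | sort -u
}

# Usage (on the host):
#   dmesg | aer_reporters
#   lspci -nn -s 0000:00:03.1   # describe the address it printed
```

For the log in this thread it should print only the bridge address, 0000:00:03.1, which can then be compared against the IOMMU listing.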

Alternatively, you could also disable AER. Add the following to your kernel command line: pci=noaer. Note that this completely disables AER, so PCIe errors will no longer be reported or recovered from, which might lead to unexpected software or hardware problems later on.
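
As a rough sketch of how the parameter could be added, assuming the host boots via GRUB and keeps its options in the GRUB_CMDLINE_LINUX_DEFAULT line of /etc/default/grub: the helper name `add_kernel_param` is made up for illustration, it relies on GNU sed's `-i`, and it assumes the parameter contains no regex or sed special characters.

```shell
# Append PARAM to the GRUB_CMDLINE_LINUX_DEFAULT line in FILE,
# unless it is already present on that line.
add_kernel_param() {
    file="$1"
    param="$2"
    # Nothing to do if the parameter is already there.
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*[\" ]${param}[\" ]" "$file"; then
        return 0
    fi
    # Insert the parameter just before the closing quote.
    sed -i "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 ${param}\"/" "$file"
}

# Usage (as root, on a GRUB-booted host):
#   add_kernel_param /etc/default/grub pci=noaer
#   update-grub
#   reboot
#   cat /proc/cmdline   # afterwards, confirm pci=noaer is listed
```

If this is a Proxmox VE host booted through systemd-boot instead, the command line lives in /etc/kernel/cmdline and is applied with `proxmox-boot-tool refresh`, so the helper above would not apply as written.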
 
Thank you for the quick reply! I have tried different ports, but with the same results. I'll try disabling AER and report back.
 
In that case I'd assume your mainboard or BIOS is probably broken. Try a different PCIe slot, a BIOS update, or, if nothing else works, a different board.