Kernel errors from passthrough

PeterMarcusH.

Member
Apr 5, 2019
I'm trying to pass through my GPU. I've had success passing through a LAN NIC, but can't seem to get the GPU working.
When I boot up a test VM, I get the following from dmesg:

Bash:
[ 1000.894860] vmbr0: port 6(tap104i0) entered disabled state
[ 1022.699301] vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[ 1022.735578] vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[ 1022.735762] vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[ 1023.536711] device tap104i0 entered promiscuous mode
[ 1023.545266] vmbr0: port 6(tap104i0) entered blocking state
[ 1023.545268] vmbr0: port 6(tap104i0) entered disabled state
[ 1023.545397] vmbr0: port 6(tap104i0) entered blocking state
[ 1023.545398] vmbr0: port 6(tap104i0) entered forwarding state
[ 1026.233927] vfio-pci 0000:01:00.0: enabling device (0000 -> 0003)
[ 1026.234225] vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[ 1027.723365] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[ 1027.723368] {3}[Hardware Error]: event severity: info
[ 1027.723371] {3}[Hardware Error]:  Error 0, type: fatal
[ 1027.723372] {3}[Hardware Error]:  fru_text: PcieError
[ 1027.723373] {3}[Hardware Error]:   section_type: PCIe error
[ 1027.723374] {3}[Hardware Error]:   port_type: 4, root port
[ 1027.723375] {3}[Hardware Error]:   version: 0.2
[ 1027.723377] {3}[Hardware Error]:   command: 0x0407, status: 0x0010
[ 1027.723378] {3}[Hardware Error]:   device_id: 0000:00:03.1
[ 1027.723379] {3}[Hardware Error]:   slot: 2
[ 1027.723380] {3}[Hardware Error]:   secondary_bus: 0x01
[ 1027.723381] {3}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1453
[ 1027.723381] {3}[Hardware Error]:   class_code: 060400
[ 1027.723382] {3}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0012
[ 1027.723383] {3}[Hardware Error]:   aer_uncor_status: 0x00000000, aer_uncor_mask: 0x04500000
[ 1027.723384] {3}[Hardware Error]:   aer_uncor_severity: 0x00462030
[ 1027.723385] {3}[Hardware Error]:   TLP Header: 00000000 00000000 00000000 00000000
[ 1027.723414] pcieport 0000:00:03.1: AER: aer_status: 0x00000000, aer_mask: 0x04500000
[ 1027.723535] pcieport 0000:00:03.1: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
[ 1027.723652] pcieport 0000:00:03.1: AER: aer_uncor_severity: 0x00462030
[ 1028.743320] pcieport 0000:00:03.1: AER: Root Port link has been reset
[ 1028.743363] pcieport 0000:00:03.1: AER: Device recovery successful
[ 1044.794813] vmbr0: port 6(tap104i0) entered disabled state
[17839.116581] perf: interrupt took too long (2503 > 2500), lowering kernel.perf_event_max_sample_rate to 79750

When looking at the IOMMU grouping the GPU seems to be in order:

Bash:
IOMMU group 14
[RESET] 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] [10de:1c82] (rev a1)
        01:00.1 Audio device [0403]: NVIDIA Corporation GP107GL High Definition Audio Controller [10de:0fb9] (rev a1)
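
For reference, a grouping like the one above can be read straight from sysfs. The following is only a minimal sketch, assuming the standard Linux layout `/sys/kernel/iommu_groups/<group>/devices/<address>` and treating the presence of a `reset` file in the device directory as the meaning of the `[RESET]` marker. The function name `list_iommu_groups` is made up for illustration, and the sketch does not print the `lspci` descriptions shown in the listing.

```python
import os


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to a list of (pci_address, has_reset)."""
    groups = {}
    if not os.path.isdir(root):
        # IOMMU disabled or not a Linux host: nothing to report.
        return groups
    group_names = [name for name in os.listdir(root) if name.isdigit()]
    for name in sorted(group_names, key=int):
        dev_dir = os.path.join(root, name, "devices")
        if not os.path.isdir(dev_dir):
            continue
        devices = []
        for addr in sorted(os.listdir(dev_dir)):
            # A "reset" attribute means the kernel knows a way to reset
            # this function on its own.
            has_reset = os.path.exists(os.path.join(dev_dir, addr, "reset"))
            devices.append((addr, has_reset))
        groups[int(name)] = devices
    return groups


if __name__ == "__main__":
    for group, devices in list_iommu_groups().items():
        print(f"IOMMU group {group}")
        for addr, has_reset in devices:
            marker = "[RESET]" if has_reset else "       "
            print(f"{marker} {addr}")
```

A GPU that shares its group only with its own audio function, as here, is usually what you want for passthrough.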

Where am I going wrong?
 

Attachments

  • IOMMU_2.PNG (6.5 KB)
  • IOMMU_1.PNG (11.3 KB)
  • dmesg.PNG (142.2 KB)
AER (Advanced Error Reporting) is a hardware error reporting mechanism for PCIe devices. It's interesting that it triggers for a PCI bridge (device_id: 0000:00:03.1 in your log corresponds to the device from IOMMU_2.PNG, if I'm not mistaken). You could try using a different physical port for your GPU, or a different device in the same port, to narrow down the cause.
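
To read the hex fields in the log, it can help to decode them bit by bit. This is a rough sketch, assuming the bit layout of the PCIe AER Uncorrectable Error Status, Mask and Severity registers. The labels are descriptive names chosen here, and `decode_aer_uncor` is a hypothetical helper, not part of any tool.

```python
# Bit positions shared by the AER uncorrectable status, mask and severity
# registers. Labels are descriptive and chosen for this sketch.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked",
    26: "Poisoned TLP Egress Blocked",
}


def decode_aer_uncor(value):
    """Return the labels of all bits set in a 32-bit AER register value."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            names.append(AER_UNCOR_BITS.get(bit, f"reserved/unknown bit {bit}"))
    return names


if __name__ == "__main__":
    # Values copied from the dmesg output in this thread.
    fields = [
        ("aer_uncor_status", 0x00000000),
        ("aer_uncor_mask", 0x04500000),
        ("aer_uncor_severity", 0x00462030),
    ]
    for label, value in fields:
        print(f"{label} = {value:#010x}: {decode_aer_uncor(value)}")
```

For the values in this thread, the status field has no bits set, so the record itself does not name a specific uncorrectable error type. The mask and severity fields only describe which error classes are masked and which are classified as fatal.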

Alternatively, you could disable AER by adding pci=noaer to your kernel command line. Note that this completely disables AER error reporting, so real software or hardware errors might go unnoticed and surface unexpectedly later on.
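
On a GRUB-booted host the option typically goes into `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`, followed by `update-grub` and a reboot. Your setup may differ. After rebooting, you can check that the running kernel actually received it. A small sketch, assuming the command line is readable from `/proc/cmdline`; both function names are invented for this example:

```python
def has_pci_noaer(cmdline):
    """Return True if a kernel command line string contains pci=...noaer...

    The pci= parameter takes a comma-separated option list, so both
    "pci=noaer" and "pci=nomsi,noaer" count.
    """
    for token in cmdline.split():
        if token.startswith("pci="):
            options = token[len("pci="):].split(",")
            if "noaer" in options:
                return True
    return False


def running_kernel_has_noaer(path="/proc/cmdline"):
    """Check the command line of the currently running kernel."""
    with open(path) as f:
        return has_pci_noaer(f.read())


if __name__ == "__main__":
    try:
        print("pci=noaer active:", running_kernel_has_noaer())
    except OSError:
        print("could not read /proc/cmdline on this system")
```

If this reports False after a reboot, the bootloader configuration was most likely not regenerated, or the host boots through a different bootloader than the one you edited.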
 
Thank you for the quick reply! I have tried different ports, but with the same results. I'll try disabling AER and report back.
 
In that case I'd assume your mainboard or BIOS is probably broken. Try a different PCIe slot, a BIOS update, or, if nothing else works, a different board.
 
