Linux AER errors on NVMe (seemingly related to overheating)

Shlee

Member
Apr 3, 2023
Code:
{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
{1}[Hardware Error]: event severity: corrected
{1}[Hardware Error]:  Error 0, type: corrected
{1}[Hardware Error]:  fru_text: PcieError
{1}[Hardware Error]:   section_type: PCIe error
{1}[Hardware Error]:   port_type: 0, PCIe end point
{1}[Hardware Error]:   version: 0.2
{1}[Hardware Error]:   command: 0x0406, status: 0x0010
{1}[Hardware Error]:   device_id: 0000:a1:00.0
{1}[Hardware Error]:   slot: 0
{1}[Hardware Error]:   secondary_bus: 0x00
{1}[Hardware Error]:   vendor_id: 0x2646, device_id: 0x5013
{1}[Hardware Error]:   class_code: 010802
{1}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0000
nvme 0000:a1:00.0: AER: aer_status: 0x00002001, aer_mask: 0x00000000
nvme 0000:a1:00.0:    [ 0] RxErr                  (First)
nvme 0000:a1:00.0:    [13] NonFatalErr        
nvme 0000:a1:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID

I've recently experienced a number of errors on my Proxmox server related to the NVMe drives: thousands and thousands of these warnings.


After looking into this, I've found a series of articles and posts about this issue, and sadly none of them seem to have a solid answer. Even worse, most replies just tell people to silence the errors, or even to disable the error recovery features entirely (absolutely bonkers).

My guess...

For my server, I've found this issue is almost entirely related to the NVMe drives overheating. I've installed some thin heatsinks, and the issue has almost completely resolved itself (so far, multiple hours with only one or two of these warnings).

I cannot confirm whether this is true for the others, but I've got a feeling it's the root cause for my server, considering how hot these NVMe storage devices get on Gen 4 / Gen 5.
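One way to test the heat theory rather than guess is to log drive temperature next to the correctable-error count and see whether they rise together. Below is a minimal Python sketch of that idea, under several assumptions. The kernel exposes per-device AER counters in a sysfs file named `aer_dev_correctable` (ending in a `TOTAL_ERR_COR` line), and the NVMe hwmon driver exposes temperature in millidegrees Celsius in a `temp1_input` file. The exact hwmon path differs per system, and whether the counters increment when firmware handles the error first (as the APEI lines in the log suggest) is something to verify on the machine. The function names and the sample text are made up for illustration.

```python
from pathlib import Path


def parse_aer_counters(text):
    """Parse the 'name count' lines of an aer_dev_correctable sysfs file."""
    counters = {}
    for line in text.splitlines():
        # Names may contain spaces, so split only on the last whitespace run.
        parts = line.rsplit(None, 1)
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def millidegrees_to_celsius(text):
    """Convert the contents of a hwmon temp*_input file to degrees Celsius."""
    return int(text.strip()) / 1000.0


def sample(pci_dev_dir, temp_file):
    """Return (temperature in C, total correctable errors) for one reading.

    pci_dev_dir: e.g. /sys/bus/pci/devices/0000:a1:00.0
    temp_file:   path to the drive's hwmon temp1_input (assumed; system-specific)
    """
    aer_text = (Path(pci_dev_dir) / "aer_dev_correctable").read_text()
    counters = parse_aer_counters(aer_text)
    temp_c = millidegrees_to_celsius(Path(temp_file).read_text())
    return temp_c, counters.get("TOTAL_ERR_COR", 0)


# Illustrative input only, not taken from the server in this post.
example = "Receiver Error 2\nBad TLP 0\nTOTAL_ERR_COR 2\n"
print(parse_aer_counters(example)["TOTAL_ERR_COR"])
```

Calling `sample(...)` once a minute and writing the pairs to a file would show whether the error count climbs mainly while the drive is hot, which is stronger evidence than the before/after heatsink comparison alone.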

Does anybody know how to read these errors and tell whether they report anything related to overheating?
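On reading the log itself: `aer_status` is a bitmask of the PCIe AER Correctable Error Status register, and the kernel already names the set bits on the lines that follow (`[ 0] RxErr`, `[13] NonFatalErr`). As far as I can tell, none of those bits encode temperature. A receiver error at the physical layer only says the link saw a signalling problem, which heat could plausibly contribute to, but the log cannot confirm it. Here is a minimal Python sketch of the decoding, assuming the bit positions of that register and the short labels the Linux kernel prints. The table is my own transcription and `decode_correctable` is a hypothetical helper, not a kernel or library API.

```python
# Bit positions in the PCIe AER Correctable Error Status register,
# with the short labels the Linux kernel uses in its log output.
CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}


def decode_correctable(status):
    """Return the names of all bits set in a correctable aer_status value."""
    names = []
    for bit in range(32):
        if status & (1 << bit):
            names.append(CORRECTABLE_BITS.get(bit, f"unknown bit {bit}"))
    return names


# The value from the log above: bits 0 and 13 are set.
print(decode_correctable(0x00002001))  # ['RxErr', 'NonFatalErr']
```

For the temperature itself, the drive's SMART/health log is the place to look rather than AER.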
 

Attachments: 1700183539077.png
I am also running into this issue, but with an RTX 4060 GPU that I am trying to pass through. The moment I turn on the VM, I get pretty much the same error. I have made a post about it, but so far there is no solution.