Linux AER errors on NVME (seemingly related to overheating)

Shlee

New Member
Apr 3, 2023
Code:
{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
{1}[Hardware Error]: event severity: corrected
{1}[Hardware Error]:  Error 0, type: corrected
{1}[Hardware Error]:  fru_text: PcieError
{1}[Hardware Error]:   section_type: PCIe error
{1}[Hardware Error]:   port_type: 0, PCIe end point
{1}[Hardware Error]:   version: 0.2
{1}[Hardware Error]:   command: 0x0406, status: 0x0010
{1}[Hardware Error]:   device_id: 0000:a1:00.0
{1}[Hardware Error]:   slot: 0
{1}[Hardware Error]:   secondary_bus: 0x00
{1}[Hardware Error]:   vendor_id: 0x2646, device_id: 0x5013
{1}[Hardware Error]:   class_code: 010802
{1}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0000
nvme 0000:a1:00.0: AER: aer_status: 0x00002001, aer_mask: 0x00000000
nvme 0000:a1:00.0:    [ 0] RxErr                  (First)
nvme 0000:a1:00.0:    [13] NonFatalErr        
nvme 0000:a1:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID

I've recently been seeing a large number of errors on my Proxmox server related to the NVMe drives: thousands and thousands of the warnings shown above.


After looking into this, I found a series of articles and posts about this issue, and sadly none of them seems to have a solid answer. Even worse, most replies just tell people to silence the errors, or even to disable the error recovery features entirely (absolutely bonkers).

My guess...

For my server, this issue appears to be almost entirely related to the NVMe drives overheating. I've installed some thin heatsinks, and the issue has almost completely resolved itself (so far, multiple hours with only one or two of these warnings).

I cannot confirm whether this is true for the others, but I have a feeling it's the root cause on my server, considering how hot NVMe storage devices get on Gen 4 / Gen 5.
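
One way to test the overheating guess is to log drive temperatures alongside the warnings. Below is a minimal sketch, not a definitive tool. It assumes the kernel exposes NVMe sensors through hwmon (built with NVMe hwmon support), that each such device has a `name` file containing `nvme`, and that `temp1_input` holds the composite temperature in millidegrees Celsius. The function name `read_nvme_temps` is made up for illustration.

```python
from pathlib import Path


def read_nvme_temps(hwmon_root="/sys/class/hwmon"):
    """Return {hwmon_dir_name: degrees_C} for hwmon devices named 'nvme'.

    Assumption: the usual sysfs hwmon layout (hwmonN/name, hwmonN/temp1_input).
    """
    temps = {}
    for dev in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = dev / "name"
        temp_file = dev / "temp1_input"
        # Skip entries that do not look like a temperature sensor.
        if not (name_file.is_file() and temp_file.is_file()):
            continue
        # Only keep sensors registered by the NVMe driver.
        if name_file.read_text().strip() != "nvme":
            continue
        # hwmon reports temperatures in millidegrees Celsius.
        temps[dev.name] = int(temp_file.read_text().strip()) / 1000.0
    return temps


if __name__ == "__main__":
    for sensor, celsius in read_nvme_temps().items():
        print(f"{sensor}: {celsius:.1f} C")
```

Running this periodically and comparing the readings with the timestamps of the AER lines in the kernel log would show whether the warnings cluster at high temperatures.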

Does anybody know how to read these errors and tell whether they report anything related to overheating?
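
For reference, the `aer_status` value in the log is a bitmask, and the bracketed numbers on the following lines (`[ 0] RxErr`, `[13] NonFatalErr`) are the bits that are set. Below is a minimal sketch of decoding it, assuming the value comes from the PCIe AER Correctable Error Status register (which matches the "corrected" severity in the log). The function name `decode_aer_status` is made up for illustration, and the labels are the short names the Linux kernel prints.

```python
# Bit positions in the PCIe AER Correctable Error Status register,
# labelled with the short names the Linux kernel prints.
CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}


def decode_aer_status(status: int) -> list[str]:
    """Return the names of all bits set in a correctable aer_status value."""
    names = []
    for bit in range(32):
        if status & (1 << bit):
            names.append(CORRECTABLE_BITS.get(bit, f"unknown bit {bit}"))
    return names


if __name__ == "__main__":
    print(decode_aer_status(0x00002001))  # ['RxErr', 'NonFatalErr']
```

None of these bits carries temperature information. They describe link-level events (here a receiver error at the physical layer), so the log by itself cannot confirm overheating. It can only be correlated with separate temperature readings.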
 

Attachments

  • 1700183539077.png
