Help with console message "nvme: AER: Error of this agent is reported first"

otm271

New Member
Mar 5, 2023
I'm repeatedly seeing this in the console:

nvme: 0000:01:00.0: AER: Error of this agent is reported first

From googling, I can see that this comes from PCIe Advanced Error Reporting (AER) for the NVMe drive.

The NVMe drive is a Lexar NM620.
The box is a Lenovo M900.

I have installed Proxmox twice, and I get the same error both times.
The first install used the bootable installer: "Proxmox VE 7.3 ISO Installer - Updated on 22 November 2022 - Version: 7.3-1"
The second followed this guide (with LUKS encryption at boot): https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_11_Bullseye

Here's the output of smartctl:

Code:
root@proxmox:~# smartctl /dev/nvme0 -a
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.85-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Lexar SSD NM620 512GB
Serial Number:                      
Firmware Version:                   V1.27
PCI Vendor/Subsystem ID:            0x1d97
IEEE OUI Identifier:                0xcaf25b
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            caf25b 02d00005e7
Local Time is:                      Sun Mar  5 16:39:57 2023 GMT
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005c):     DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0a):         Cmd_Eff_Lg Telmtry_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     81 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     5.00W       -        -    0  0  0  0        5     700

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         3

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        27 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    1,002,254 [513 GB]
Data Units Written:                 1,958,086 [1.00 TB]
Host Read Commands:                 17,752,740
Host Write Commands:                46,475,530
Controller Busy Time:               40
Power Cycles:                       36
Power On Hours:                     115
Unsafe Shutdowns:                   9
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Syslog repeats the following over and over, here is a sample:

Code:
Mar  5 16:53:22 proxmox kernel: [  938.373722] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar  5 16:53:22 proxmox kernel: [  938.373724] nvme 0000:01:00.0:   device [1d97:5216] error status/mask=00000041/0000e000
Mar  5 16:53:22 proxmox kernel: [  938.373725] nvme 0000:01:00.0:    [ 0] RxErr
Mar  5 16:53:22 proxmox kernel: [  938.373727] nvme 0000:01:00.0:    [ 6] BadTLP
Mar  5 16:53:22 proxmox kernel: [  938.373728] nvme 0000:01:00.0: AER:   Error of this Agent is reported first
Mar  5 16:53:22 proxmox kernel: [  938.374013] pcieport 0000:00:1b.0: AER: Multiple Corrected error received: 0000:01:00.0
Mar  5 16:53:22 proxmox kernel: [  938.374021] pcieport 0000:00:1b.0: AER: Multiple Corrected error received: 0000:01:00.0
Mar  5 16:53:22 proxmox kernel: [  938.374028] pcieport 0000:00:1b.0: AER: Multiple Corrected error received: 0000:01:00.0
Mar  5 16:53:22 proxmox kernel: [  938.374033] pcieport 0000:00:1b.0: AER: Multiple Corrected error received: 0000:01:00.0
Mar  5 16:53:22 proxmox kernel: [  938.576182] pcieport 0000:00:1b.0: AER: Multiple Corrected error received: 0000:01:00.0
Mar  5 16:53:22 proxmox kernel: [  938.576190] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar  5 16:53:22 proxmox kernel: [  938.576192] nvme 0000:01:00.0:   device [1d97:5216] error status/mask=00000001/0000e000
Mar  5 16:53:22 proxmox kernel: [  938.576194] nvme 0000:01:00.0:    [ 0] RxErr
Mar  5 16:53:24 proxmox kernel: [  940.084899] pcieport 0000:00:1b.0: AER: Multiple Corrected error received: 0000:01:00.0
Mar  5 16:53:24 proxmox kernel: [  940.084907] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar  5 16:53:24 proxmox kernel: [  940.084909] nvme 0000:01:00.0:   device [1d97:5216] error status/mask=00000001/0000e000
Mar  5 16:53:24 proxmox kernel: [  940.084911] nvme 0000:01:00.0:    [ 0] RxErr
Mar  5 16:53:24 proxmox kernel: [  940.085467] pcieport 0000:00:1b.0: AER: Corrected error received: 0000:01:00.0
Mar  5 16:53:24 proxmox kernel: [  940.085472] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar  5 16:53:24 proxmox kernel: [  940.085474] nvme 0000:01:00.0:   device [1d97:5216] error status/mask=00000001/0000e000
Mar  5 16:53:24 proxmox kernel: [  940.085489] nvme 0000:01:00.0:    [ 0] RxErr
Mar  5 16:53:25 proxmox kernel: [  941.771594] pcieport 0000:00:1b.0: AER: Multiple Corrected error received: 0000:01:00.0
Mar  5 16:53:25 proxmox kernel: [  941.771613] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Mar  5 16:53:25 proxmox kernel: [  941.771619] pcieport 0000:00:1b.0:   device [8086:a167] error status/mask=00001100/00002000
Mar  5 16:53:25 proxmox kernel: [  941.771640] pcieport 0000:00:1b.0:    [ 8] Rollover
Mar  5 16:53:25 proxmox kernel: [  941.771641] pcieport 0000:00:1b.0:    [12] Timeout
Mar  5 16:53:25 proxmox kernel: [  941.771644] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar  5 16:53:25 proxmox kernel: [  941.771646] nvme 0000:01:00.0:   device [1d97:5216] error status/mask=00000001/0000e000
Mar  5 16:53:25 proxmox kernel: [  941.771647] nvme 0000:01:00.0:    [ 0] RxErr
Mar  5 16:53:25 proxmox kernel: [  941.771648] nvme 0000:01:00.0: AER:   Error of this Agent is reported first

From what I can tell, it's some sort of error on the PCIe bus that is being corrected rather than causing a failure.
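
To make the `error status/mask` values in the log easier to read, here is a minimal sketch that decodes them into the same short names the kernel prints (`RxErr`, `BadTLP`, and so on). The bit-to-name table reflects my understanding of the PCIe AER Correctable Error Status register as labelled by the Linux kernel; the function names are just illustrative.

```python
# Bit positions of the PCIe AER Correctable Error Status register,
# using the short labels the Linux kernel prints in its AER messages.
CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}


def decode_correctable(value_hex):
    """Return the names of the bits set in a hex register value."""
    value = int(value_hex, 16)
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]


def decode_status_mask(field):
    """Split a 'status/mask' field from the log and decode both halves."""
    status_hex, mask_hex = field.split("/")
    return decode_correctable(status_hex), decode_correctable(mask_hex)


if __name__ == "__main__":
    # Values copied from the syslog sample above.
    for field in ("00000041/0000e000", "00001100/00002000"):
        status, masked = decode_status_mask(field)
        print(field, "status:", status, "masked:", masked)
```

For the values in the log, `00000041` decodes to `RxErr` and `BadTLP`, and `00001100` to `Rollover` and `Timeout`, which matches the bit lines the kernel printed next to them.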

Is there anything I can do to narrow down the cause further? For example, is this a software/driver problem or a hardware problem?
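
One small step toward narrowing it down is to count which device reports which error type. The sketch below tallies the per-bit lines (such as `[ 0] RxErr`) by PCI address. It does not diagnose the cause; it only shows which end of the link and which error type dominate. The regular expression is an assumption based on the log format in this post, and `tally_aer` is a made-up helper name.

```python
import re
from collections import Counter

# Matches lines like "... nvme 0000:01:00.0:    [ 0] RxErr".
# The pattern is an assumption based on the log lines in this post.
AER_BIT_LINE = re.compile(
    r"(?P<driver>\w+) (?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):"
    r"\s+\[\s*(?P<bit>\d+)\]\s+(?P<name>\w+)"
)


def tally_aer(lines):
    """Count AER error-bit lines per (PCI address, error name)."""
    counts = Counter()
    for line in lines:
        match = AER_BIT_LINE.search(line)
        if match:
            counts[(match.group("addr"), match.group("name"))] += 1
    return counts


# A few lines copied from the syslog sample in this post.
SAMPLE = """\
Mar  5 16:53:22 proxmox kernel: [  938.373725] nvme 0000:01:00.0:    [ 0] RxErr
Mar  5 16:53:22 proxmox kernel: [  938.373727] nvme 0000:01:00.0:    [ 6] BadTLP
Mar  5 16:53:25 proxmox kernel: [  941.771640] pcieport 0000:00:1b.0:    [ 8] Rollover
Mar  5 16:53:25 proxmox kernel: [  941.771641] pcieport 0000:00:1b.0:    [12] Timeout
Mar  5 16:53:25 proxmox kernel: [  941.771647] nvme 0000:01:00.0:    [ 0] RxErr
""".splitlines()

if __name__ == "__main__":
    for (addr, name), count in sorted(tally_aer(SAMPLE).items()):
        print(addr, name, count)
```

To run it on a real log, pass the lines of a saved copy of the syslog to `tally_aer` instead of `SAMPLE`.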

Edit: added smartctl output and syslog.
Created an account just to respond, as I encountered the exact same situation with a few M900s and NVMe drives. I updated to the latest BIOS and the errors disappeared. I tried applying the BIOS update and then rebooting into the latest Proxmox: no errors. I also reinstalled the latest Proxmox, and again the errors were gone.

Thanks for taking the time to reply. My BIOS is quite out of date (2016). I'll try that. Thank you.
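
For anyone wanting to check how old their BIOS is without rebooting into setup, here is a minimal sketch that reads the version and date Linux exposes under `/sys/class/dmi/id`. The function name and the `dmi_dir` parameter are just illustrative, and any file that is missing or unreadable comes back as `None`.

```python
from pathlib import Path


def read_bios_info(dmi_dir="/sys/class/dmi/id"):
    """Read BIOS vendor, version and date from the DMI sysfs directory.

    Returns None for any field that is missing or unreadable.
    """
    info = {}
    for field in ("bios_vendor", "bios_version", "bios_date"):
        try:
            info[field] = (Path(dmi_dir) / field).read_text().strip()
        except OSError:
            info[field] = None
    return info


if __name__ == "__main__":
    for field, value in read_bios_info().items():
        print(f"{field}: {value}")
```

The reported version can then be compared against the latest one listed on the vendor's support page for the machine.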
 
I created an account only to help anyone looking for this error too, as this topic was one of the top search results for this. I came across "AER: Error of this Agent is reported first" along with "Failed to start Journal Services" and a few others on fresh installs of Zorin/Mint.


I tried EVERYTHING I could find online, but NOTHING worked. After 2 days, I went through every possible setting in my BIOS and decided to change the CSM settings, setting everything to UEFI. It worked like a charm: not a single error since.

This might be a default setting on some Xeon motherboards.
I have a Xeon E5-2696v3 (turbo unlocked) in an E5 MR9A PRO MAX with a somewhat recent BIOS (12 Jul 2022). A BIOS update MAY solve this, but if you have a Xeon/X99/X79 setup, I recommend checking and trying this first, since updating or changing the BIOS takes a lot more effort.
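
A quick way to see how the running system was actually booted is to check for the `/sys/firmware/efi` directory, which Linux only creates when it was started through UEFI. This is a rough sketch and not a full check of the CSM settings themselves: it only tells you the boot mode of the current session. The function name is made up for illustration.

```python
import os


def booted_via_uefi(efi_dir="/sys/firmware/efi"):
    """Return True if the EFI sysfs directory exists (system booted via UEFI)."""
    return os.path.isdir(efi_dir)


if __name__ == "__main__":
    if booted_via_uefi():
        print("Booted via UEFI")
    else:
        print("No /sys/firmware/efi found: likely a legacy BIOS/CSM boot")
```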