Help with console message "nvme: AER: Error of this agent is reported first"

otm271

Mar 5, 2023
I'm repeatedly seeing this in the console:

nvme: 0000:01:00.0: AER: Error of this agent is reported first

From googling, I gather this message comes from PCIe Advanced Error Reporting (AER) for the NVMe drive.

The NVMe drive is a Lexar NM620.
The box is a Lenovo M900.

I have installed Proxmox twice and got the same error both times.
The first install used the bootable installer: "Proxmox VE 7.3 ISO Installer - Updated on 22 November 2022 - Version: 7.3-1".
The second followed this guide (with LUKS encryption at boot): https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_11_Bullseye

Here's the output of smartctl:

Code:
root@proxmox:~# smartctl /dev/nvme0 -a
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.85-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Lexar SSD NM620 512GB
Serial Number:                      
Firmware Version:                   V1.27
PCI Vendor/Subsystem ID:            0x1d97
IEEE OUI Identifier:                0xcaf25b
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            caf25b 02d00005e7
Local Time is:                      Sun Mar  5 16:39:57 2023 GMT
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005c):     DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0a):         Cmd_Eff_Lg Telmtry_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     81 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     5.00W       -        -    0  0  0  0        5     700

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         3

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        27 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    1,002,254 [513 GB]
Data Units Written:                 1,958,086 [1.00 TB]
Host Read Commands:                 17,752,740
Host Write Commands:                46,475,530
Controller Busy Time:               40
Power Cycles:                       36
Power On Hours:                     115
Unsafe Shutdowns:                   9
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Syslog repeats the following over and over. Here is a sample:

Code:
Mar  5 16:53:22 proxmox kernel: [  938.373722] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar  5 16:53:22 proxmox kernel: [  938.373724] nvme 0000:01:00.0:   device [1d97:5216] error status/mask=00000041/0000e000
Mar  5 16:53:22 proxmox kernel: [  938.373725] nvme 0000:01:00.0:    [ 0] RxErr
Mar  5 16:53:22 proxmox kernel: [  938.373727] nvme 0000:01:00.0:    [ 6] BadTLP
Mar  5 16:53:22 proxmox kernel: [  938.373728] nvme 0000:01:00.0: AER:   Error of this Agent is reported first
Mar  5 16:53:22 proxmox kernel: [  938.374013] pcieport 0000:00:1b.0: AER: Multiple Corrected error received: 0000:01:00.0
Mar  5 16:53:22 proxmox kernel: [  938.374021] pcieport 0000:00:1b.0: AER: Multiple Corrected error received: 0000:01:00.0
Mar  5 16:53:22 proxmox kernel: [  938.374028] pcieport 0000:00:1b.0: AER: Multiple Corrected error received: 0000:01:00.0
Mar  5 16:53:22 proxmox kernel: [  938.374033] pcieport 0000:00:1b.0: AER: Multiple Corrected error received: 0000:01:00.0
Mar  5 16:53:22 proxmox kernel: [  938.576182] pcieport 0000:00:1b.0: AER: Multiple Corrected error received: 0000:01:00.0
Mar  5 16:53:22 proxmox kernel: [  938.576190] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar  5 16:53:22 proxmox kernel: [  938.576192] nvme 0000:01:00.0:   device [1d97:5216] error status/mask=00000001/0000e000
Mar  5 16:53:22 proxmox kernel: [  938.576194] nvme 0000:01:00.0:    [ 0] RxErr
Mar  5 16:53:24 proxmox kernel: [  940.084899] pcieport 0000:00:1b.0: AER: Multiple Corrected error received: 0000:01:00.0
Mar  5 16:53:24 proxmox kernel: [  940.084907] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar  5 16:53:24 proxmox kernel: [  940.084909] nvme 0000:01:00.0:   device [1d97:5216] error status/mask=00000001/0000e000
Mar  5 16:53:24 proxmox kernel: [  940.084911] nvme 0000:01:00.0:    [ 0] RxErr
Mar  5 16:53:24 proxmox kernel: [  940.085467] pcieport 0000:00:1b.0: AER: Corrected error received: 0000:01:00.0
Mar  5 16:53:24 proxmox kernel: [  940.085472] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar  5 16:53:24 proxmox kernel: [  940.085474] nvme 0000:01:00.0:   device [1d97:5216] error status/mask=00000001/0000e000
Mar  5 16:53:24 proxmox kernel: [  940.085489] nvme 0000:01:00.0:    [ 0] RxErr
Mar  5 16:53:25 proxmox kernel: [  941.771594] pcieport 0000:00:1b.0: AER: Multiple Corrected error received: 0000:01:00.0
Mar  5 16:53:25 proxmox kernel: [  941.771613] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Mar  5 16:53:25 proxmox kernel: [  941.771619] pcieport 0000:00:1b.0:   device [8086:a167] error status/mask=00001100/00002000
Mar  5 16:53:25 proxmox kernel: [  941.771640] pcieport 0000:00:1b.0:    [ 8] Rollover
Mar  5 16:53:25 proxmox kernel: [  941.771641] pcieport 0000:00:1b.0:    [12] Timeout
Mar  5 16:53:25 proxmox kernel: [  941.771644] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar  5 16:53:25 proxmox kernel: [  941.771646] nvme 0000:01:00.0:   device [1d97:5216] error status/mask=00000001/0000e000
Mar  5 16:53:25 proxmox kernel: [  941.771647] nvme 0000:01:00.0:    [ 0] RxErr
Mar  5 16:53:25 proxmox kernel: [  941.771648] nvme 0000:01:00.0: AER:   Error of this Agent is reported first

From what I can tell, it's some sort of error on the PCIe bus that is being corrected (every entry reports severity=Corrected).
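The `error status/mask` pair in each entry is the raw content of the PCIe Correctable Error Status and Mask registers, and the bracketed names (`[ 0] RxErr`, `[ 6] BadTLP`, ...) are the set bits that are not masked. The sketch below is a hypothetical helper (`decode_correctable` is not part of any tool) that decodes such a pair. The bit names and the layer rules are my reading of how the Linux AER driver (`drivers/pci/pcie/aer.c`) classifies correctable errors, so treat them as an assumption and check them against the kernel source or the PCIe spec.

```python
from __future__ import annotations

# Assumed bit positions in the PCIe Correctable Error Status register,
# with the short names the Linux AER driver prints (e.g. "[ 0] RxErr").
CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad Transaction Layer Packet
    7: "BadDLLP",       # Bad Data Link Layer Packet
    8: "Rollover",      # REPLAY_NUM rollover
    12: "Timeout",      # Replay timer timeout
    13: "NonFatalErr",  # Advisory non-fatal error
    14: "CorrIntErr",   # Corrected internal error
    15: "HeaderOF",     # Header log overflow
}

# Assumed layer grouping: Receiver Error is Physical Layer; these four
# bits are Data Link Layer; anything else is Transaction Layer.
PHYSICAL_LAYER_MASK = 1 << 0
DATA_LINK_LAYER_MASK = (1 << 6) | (1 << 7) | (1 << 8) | (1 << 12)


def decode_correctable(status_mask: str) -> tuple[str, list[str]]:
    """Decode an 'SSSSSSSS/MMMMMMMM' status/mask pair from an AER log line.

    Returns (layer, names), where names lists the set bits that are not
    masked, formatted like the kernel's "[ 0] RxErr" lines.
    """
    status_hex, mask_hex = status_mask.split("/")
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    active = status & ~mask  # masked bits are not printed

    names = [
        f"[{bit:2d}] {CORRECTABLE_BITS.get(bit, 'Unknown')}"
        for bit in range(32)
        if active & (1 << bit)
    ]

    # Classify the layer from the raw status: a receiver error takes
    # precedence, then the data-link-layer bits.
    if status & PHYSICAL_LAYER_MASK:
        layer = "Physical Layer"
    elif status & DATA_LINK_LAYER_MASK:
        layer = "Data Link Layer"
    else:
        layer = "Transaction Layer"
    return layer, names


if __name__ == "__main__":
    # The three distinct status/mask pairs from the syslog sample above.
    for pair in ("00000041/0000e000", "00000001/0000e000", "00001100/00002000"):
        layer, names = decode_correctable(pair)
        print(pair, layer, names)
```

Run on the sample, this reproduces the log's own breakdown. The NVMe endpoint (01:00.0) reports receiver-side Physical Layer errors (RxErr, BadTLP). The root port (00:1b.0) reports Data Link Layer replay events (Rollover, Timeout). Both ends of the same link are therefore logging problems, not just the drive.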

Is there anything I can do to narrow down the cause further? For example, is this a software/driver problem or a hardware one?
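One way to narrow things down is to measure how often each device reports each error bit, then compare the counts before and after a change such as a BIOS update or a different M.2 slot or drive. The sketch below is a hypothetical helper (`tally_aer` is not an existing tool). It assumes the kernel log has been saved to a text file, for example from `journalctl -k` or `/var/log/syslog`, in the same format as the sample above. It counts the bracketed error names per PCI address.

```python
import re
import sys
from collections import Counter

# Matches the per-bit lines, e.g.
# "... kernel: [  938.373725] nvme 0000:01:00.0:    [ 0] RxErr"
BIT_LINE = re.compile(
    r"(?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):"
    r"\s+\[\s*\d+\]\s+(?P<name>\w+)\s*$"
)


def tally_aer(lines):
    """Count AER error-bit names per PCI address (domain:bus:device.function)."""
    counts = Counter()
    for line in lines:
        match = BIT_LINE.search(line)
        if match:
            counts[(match["addr"], match["name"])] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python3 tally_aer.py saved-kernel-log.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for (addr, name), n in sorted(tally_aer(log).items()):
            print(f"{addr}  {name:10s} {n}")
```

The regex only counts the indented `[ N] Name` lines. Summary lines such as "Multiple Corrected error received" are therefore skipped, and each reported bit is counted once.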

Edit: added smartctl output and syslog.
 
Created an account just to respond, as I ran into exactly the same situation with a few M900s and NVMe drives. I updated to the latest BIOS and the errors disappeared. After applying the BIOS update and rebooting into the latest Proxmox, there were no errors. I also reinstalled the latest Proxmox and the errors stayed gone.
 
Thanks for taking the time to reply. My BIOS is quite out of date (2016). I'll try that. Thank you.
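To confirm which BIOS is installed before and after flashing, Linux exposes the DMI/SMBIOS BIOS fields as files under `/sys/class/dmi/id/` (`bios_vendor`, `bios_version`, `bios_date`). Running `dmidecode -s bios-version` as root reports the same version string. Below is a minimal sketch, assuming a Linux host where those files exist and are readable. The `read_bios_info` helper and its `root` parameter are hypothetical; the parameter only exists so the function can be pointed at another directory for testing.

```python
from __future__ import annotations

from pathlib import Path

# BIOS fields the Linux kernel exports from DMI/SMBIOS on most x86 systems.
BIOS_FIELDS = ("bios_vendor", "bios_version", "bios_date")


def read_bios_info(root: str = "/sys/class/dmi/id") -> dict[str, str]:
    """Return BIOS vendor/version/date, using 'unavailable' for missing fields."""
    info = {}
    for field in BIOS_FIELDS:
        path = Path(root) / field
        try:
            info[field] = path.read_text().strip()
        except OSError:  # file missing or not readable on this system
            info[field] = "unavailable"
    return info


if __name__ == "__main__":
    for field, value in read_bios_info().items():
        print(f"{field}: {value}")
```

Recording this output alongside an error count from the kernel log makes it easy to tell whether a change in the AER messages lines up with the BIOS update.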
 
