NVME correctable errors?

abhabitat

New Member
Oct 1, 2024
My Proxmox install has two NVMe drives, and in the System Log I see the same message repeating over and over.
Code:
Mar 06 18:00:35 proxmox kernel: pcieport 0000:00:01.2: AER: Correctable error message received from 0000:03:00.0
Mar 06 18:00:35 proxmox kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
Mar 06 18:00:35 proxmox kernel: nvme 0000:03:00.0:   device [15b7:5006] error status/mask=00000001/0000e000
Mar 06 18:00:35 proxmox kernel: nvme 0000:03:00.0:    [ 0] RxErr                  (First)
Mar 06 18:00:35 proxmox kernel: pcieport 0000:00:01.2: AER: Correctable error message received from 0000:03:00.0
Mar 06 18:00:35 proxmox kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
Mar 06 18:00:35 proxmox kernel: nvme 0000:03:00.0:   device [15b7:5006] error status/mask=00000001/0000e000
Mar 06 18:00:35 proxmox kernel: nvme 0000:03:00.0:    [ 0] RxErr                  (First)
Mar 06 18:00:35 proxmox kernel: pcieport 0000:00:01.2: AER: Correctable error message received from 0000:03:00.0
Mar 06 18:00:35 proxmox kernel: nvme 0000:03:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
Mar 06 18:00:35 proxmox kernel: nvme 0000:03:00.0:   device [15b7:5006] error status/mask=00000001/0000e000
Mar 06 18:00:35 proxmox kernel: nvme 0000:03:00.0:    [ 0] RxErr                  (First)
First, is this an issue or something I can ignore?
How can I tell which NVMe drive this is referring to?
 
How can I tell which NVMe drive this is referring to?
It's PCI(e) device 03:00.0. You can find the dev-node by looking at the output of ls -l /dev/disk/by-path/pci-0000:03:00.0*, and then look up that dev-node in the output of ls -l /dev/disk/by-id/nvme* to find the make, model and serial number of the NVMe.
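For example, from a shell on the host something like this should narrow it down (the exact by-id names will of course depend on your drives):
Code:
# show vendor/model of the PCIe device at 03:00.0 (vendor 15b7 is SanDisk/WD)
lspci -s 0000:03:00.0

# map the PCI address to a kernel dev-node (e.g. /dev/nvme0n1)
ls -l /dev/disk/by-path/pci-0000:03:00.0*

# map dev-nodes to make/model/serial via the by-id symlinks
ls -l /dev/disk/by-id/nvme*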

First, is this an issue or something I can ignore?
Your system log might fill your entire drive, which is bad. It might also indicate a real problem. Re-seat the drive? Update the motherboard BIOS and drive firmware? Reduce the PCIe speed?
I think I might have seen an error like it when my GPU was overheating because the fan failed. I don't really know how bad this is, sorry.
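If the main worry is the repeating messages filling the disk, one stopgap (it does nothing about the underlying link errors) is to cap how much space journald may use; the 2G below is just an example value:
Code:
# in /etc/systemd/journald.conf, under the [Journal] section:
SystemMaxUse=2G

# then apply it:
systemctl restart systemd-journald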
 
@abhabitat: you didn't search the forum, right?
;)
 
My Proxmox install has two NVMe drives, and in the System Log I see the same message repeating over and over.
First, is this an issue or something I can ignore?
How can I tell which NVMe drive this is referring to?

Hi, did you sort this out in the end?

Is your drive a WD SN730/SN750 by any chance? I had (actually I still have in operation) a WD NVMe and I think it was giving me some kind of similar issue and causing problems when trying to use passthrough into an LXC, so I edited GRUB to include:

Code:
GRUB_CMDLINE_LINUX_DEFAULT="pcie_aspm=off pci=nommconf pcie_no_flr=15b7:5006"
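For anyone trying the same thing: the change only takes effect after regenerating the boot configuration and rebooting. Roughly, on a standard GRUB-booted Proxmox install (adjust if your node boots via systemd-boot, where the options live in /etc/kernel/cmdline instead):
Code:
# add the options to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
update-grub
reboot

# on systemd-boot installs (e.g. ZFS root on UEFI), after editing /etc/kernel/cmdline:
proxmox-boot-tool refresh
reboot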
 
I have a similar issue with one of my nodes in a cluster. I think I have a Samsung SSD. I tried the
Code:
pcie_aspm=off
option, but it has not worked. The node just suddenly goes offline.
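Before giving up on that option, it might be worth checking that it actually made it onto the running kernel's command line, and what the link state of the affected device looks like (swap in your own PCI address from the log):
Code:
# confirm the booted kernel really has the option
cat /proc/cmdline

# link speed/width and correctable-error status for the device from the AER messages
lspci -vvv -s 0000:03:00.0 | grep -E 'LnkSta|CESta'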