Updating Proxmox led to NVMe-Bug

coffee_engine · Nov 30, 2024

Hi all,

I have a server running Proxmox that uses four NVMe drives in a ZFS RAID-Z2 pool. Since a recent Proxmox update, the NVMe drives periodically go down and the VMs running on that datastore crash. The issue typically occurs while backing up the VMs to a Proxmox Backup Server with the integrated backup solution, but it also happens during other I/O-intensive tasks such as a zpool scrub. journalctl shows the following:

Code:

Nov 30 14:54:18 mars pmxcfs[2212]: [dcdb] notice: data verification successful
Nov 30 14:57:14 mars pvestatd[2328]: ocean: error fetching datastores - 500 read timeout
Nov 30 14:57:14 mars pvestatd[2328]: status update time (7.070 seconds)
Nov 30 14:57:24 mars pvestatd[2328]: ocean: error fetching datastores - 500 read timeout
Nov 30 14:57:24 mars pvestatd[2328]: status update time (7.068 seconds)
Nov 30 14:57:34 mars pvestatd[2328]: ocean: error fetching datastores - 500 read timeout
Nov 30 14:57:34 mars pvestatd[2328]: status update time (7.065 seconds)
Nov 30 14:57:35 mars kernel: nvme nvme3: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Nov 30 14:57:35 mars kernel: nvme nvme3: Does your device have a faulty power saving mode enabled?
Nov 30 14:57:35 mars kernel: nvme nvme3: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Nov 30 14:57:35 mars kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Nov 30 14:57:35 mars kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Nov 30 14:57:35 mars kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Nov 30 14:57:35 mars kernel: nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Nov 30 14:57:35 mars kernel: nvme nvme1: Does your device have a faulty power saving mode enabled?
Nov 30 14:57:35 mars kernel: nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Nov 30 14:57:35 mars kernel: nvme nvme2: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Nov 30 14:57:35 mars kernel: nvme nvme2: Does your device have a faulty power saving mode enabled?
Nov 30 14:57:35 mars kernel: nvme nvme2: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Nov 30 14:57:35 mars kernel: nvme 0000:05:00.0: Unable to change power state from D3cold to D0, device inaccessible
Nov 30 14:57:35 mars kernel: nvme 0000:06:00.0: Unable to change power state from D3cold to D0, device inaccessible
Nov 30 14:57:35 mars kernel: nvme nvme1: Disabling device after reset failure: -19
Nov 30 14:57:35 mars kernel: nvme nvme2: Disabling device after reset failure: -19
Nov 30 14:57:35 mars kernel: nvme 0000:04:00.0: Unable to change power state from D3cold to D0, device inaccessible
Nov 30 14:57:35 mars kernel: nvme nvme0: Disabling device after reset failure: -19
Nov 30 14:57:35 mars kernel: nvme 0000:07:00.0: Unable to change power state from D3cold to D0, device inaccessible
Nov 30 14:57:35 mars kernel: nvme nvme3: Disabling device after reset failure: -19
Nov 30 14:57:35 mars kernel: I/O error, dev nvme2n1, sector 2064186456 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Nov 30 14:57:35 mars kernel: I/O error, dev nvme0n1, sector 1066834128 op 0x0:(READ) flags 0x4000 phys_seg 4 prio class 0
Nov 30 14:57:35 mars kernel: I/O error, dev nvme0n1, sector 1066844160 op 0x0:(READ) flags 0x0 phys_seg 18 prio class 0
Nov 30 14:57:35 mars kernel: zio pool=NVMePool vdev=/dev/disk/by-id/nvme-INTEL_SSDPE2KX020T8_PHLJ326100VP2P0BGN_1-part1 error=5 type=2 offset=1056855076864 size=4096 flags=1572992
Nov 30 14:57:35 mars kernel: zio pool=NVMePool vdev=/dev/disk/by-id/nvme-INTEL_SSDPE2KX020T8_PHLJ326100VP2P0BGN_1-part1 error=5 type=1 offset=546211336192 size=126976 flags=1074267264
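
To see at a glance which controllers dropped and when, the log can be filtered down to the "controller is down" lines. This is only a rough sketch: the function name nvme_drops is made up for illustration, and it assumes journalctl's default short timestamp format (15 characters, as in the excerpt above).

```shell
#!/bin/sh
# nvme_drops (hypothetical helper): reads kernel log lines on stdin and
# prints "<timestamp> <controller>" for every "controller is down" message.
# Assumes the default short journal format, where the timestamp is the
# first 15 characters of the line (e.g. "Nov 30 14:57:35").
nvme_drops() {
    sed -n 's/^\(.\{15\}\) .* nvme \(nvme[0-9][0-9]*\): controller is down.*/\1 \2/p'
}
```

Something like `journalctl -k -b | nvme_drops` should then list one line per drop. In the excerpt above, all four controllers go down within the same second.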

I have already added "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" to the kernel boot parameters as the log suggests, but it does not help at all. I have also disabled ASPM in the BIOS and tried updating the kernel to 6.11.x, but to no avail. The only thing that seems to bring my server back to a usable state is periodically writing the following values for the PCI devices:

echo "on" > /sys/bus/pci/devices/<device_id>/power/control
echo 0 > /sys/bus/pci/devices/<device_id>/d3cold_allowed

I have made sure that my BIOS and my drive firmware are up to date. I have also updated Proxmox again to see whether a fix has been released, but unfortunately that is not the case.
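
For reference, the two sysfs writes above can be applied to every NVMe controller in one go instead of per <device_id>. The following is a minimal sketch, not a tested fix: the function name pin_nvme_power and its optional directory argument are made up for illustration (the argument only exists so the loop can be tried against a scratch directory), and it assumes NVMe controllers report a PCI class starting with 0x0108 in sysfs.

```shell
#!/bin/sh
# pin_nvme_power (hypothetical helper): for every PCI device whose class
# marks it as a non-volatile memory controller, disable runtime power
# management and forbid the D3cold state.
# $1 = PCI device directory (defaults to the real sysfs path; needs root there)
pin_nvme_power() {
    pci_root="${1:-/sys/bus/pci/devices}"
    for dev in "$pci_root"/*; do
        [ -r "$dev/class" ] || continue
        # 0x010802 is mass storage / non-volatile memory / NVMe interface
        case "$(cat "$dev/class")" in
            0x0108*) ;;
            *) continue ;;
        esac
        echo on > "$dev/power/control"
        echo 0 > "$dev/d3cold_allowed"
        echo "pinned ${dev##*/}"
    done
}
```

Calling `pin_nvme_power` as root would touch the real devices. These sysfs values do not survive a reboot, so the call would have to be repeated at boot (for example from a systemd unit or a cron @reboot entry).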

The Server is using the following hardware:

Mainboard: Supermicro X13SCH-F
CPU: Intel Xeon E-2478
Chipset: Intel C266
The affected Drives: Intel DC-P4510 (4 TB)
RAM: 128GB Kingston ECC

Proxmox-Version: 8.3.0

I would be really grateful if somebody could look into this, since it renders my Proxmox setup nearly unusable.

EDIT: Corrected the issue report.

matrix1999 · Jan 31, 2025

I am using a Supermicro X13SAE-F with a pair of NVMe drives in a ZFS configuration. Setting nvme_core.default_ps_max_latency_us=0 resolved my issue.
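
Since the parameter only helps if the running kernel actually received it, it may be worth checking the live command line. A small sketch, with made-up function names (has_kernel_param, report_nvme_params) and an optional file argument that exists only so it can be tried on a saved copy of the command line:

```shell
#!/bin/sh
# has_kernel_param (hypothetical helper): succeeds if the exact parameter $1
# appears as a whole word in the kernel command line file $2
# (defaults to the running kernel's /proc/cmdline).
has_kernel_param() {
    cmdline_file="${2:-/proc/cmdline}"
    tr -s ' ' '\n' < "$cmdline_file" | grep -qxF -- "$1"
}

# report_nvme_params (hypothetical helper): prints, for each of the two
# parameters suggested in the kernel log, whether it is on the command line.
report_nvme_params() {
    for p in nvme_core.default_ps_max_latency_us=0 pcie_aspm=off; do
        if has_kernel_param "$p" "${1:-/proc/cmdline}"; then
            echo "$p: active"
        else
            echo "$p: missing"
        fi
    done
}
```

Running `report_nvme_params` on the host shows whether both settings took effect. If one is missing, a likely cause on Proxmox is editing the wrong file: hosts that boot via systemd-boot (typically ZFS root on UEFI) take the command line from /etc/kernel/cmdline followed by `proxmox-boot-tool refresh`, while GRUB hosts use /etc/default/grub followed by `update-grub`.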

Looking at your log, it looks as though Proxmox is having issues with:
Nov 30 14:57:35 mars kernel: nvme nvme3: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug

I believe you can disable ASPM in the BIOS, IIRC. Otherwise, I would suggest shutting down the system, unplugging the power for 30 seconds, and then starting everything again (I have seen that help to reset some hardware in the past).
