Hi all,
I have a server running Proxmox, which uses four NVMe drives in a ZFS RAID-Z2 pool. Since I recently updated Proxmox, the NVMe drives periodically go down and the VMs running on that datastore crash. The issue typically occurs during a backup of the VMs with the integrated backup solution to a Proxmox Backup Server, but it also happens when another I/O-intensive task runs, such as a zpool scrub. journalctl shows the following:
Code:
Nov 30 14:54:18 mars pmxcfs[2212]: [dcdb] notice: data verification successful
Nov 30 14:57:14 mars pvestatd[2328]: ocean: error fetching datastores - 500 read timeout
Nov 30 14:57:14 mars pvestatd[2328]: status update time (7.070 seconds)
Nov 30 14:57:24 mars pvestatd[2328]: ocean: error fetching datastores - 500 read timeout
Nov 30 14:57:24 mars pvestatd[2328]: status update time (7.068 seconds)
Nov 30 14:57:34 mars pvestatd[2328]: ocean: error fetching datastores - 500 read timeout
Nov 30 14:57:34 mars pvestatd[2328]: status update time (7.065 seconds)
Nov 30 14:57:35 mars kernel: nvme nvme3: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Nov 30 14:57:35 mars kernel: nvme nvme3: Does your device have a faulty power saving mode enabled?
Nov 30 14:57:35 mars kernel: nvme nvme3: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Nov 30 14:57:35 mars kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Nov 30 14:57:35 mars kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Nov 30 14:57:35 mars kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Nov 30 14:57:35 mars kernel: nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Nov 30 14:57:35 mars kernel: nvme nvme1: Does your device have a faulty power saving mode enabled?
Nov 30 14:57:35 mars kernel: nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Nov 30 14:57:35 mars kernel: nvme nvme2: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Nov 30 14:57:35 mars kernel: nvme nvme2: Does your device have a faulty power saving mode enabled?
Nov 30 14:57:35 mars kernel: nvme nvme2: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Nov 30 14:57:35 mars kernel: nvme 0000:05:00.0: Unable to change power state from D3cold to D0, device inaccessible
Nov 30 14:57:35 mars kernel: nvme 0000:06:00.0: Unable to change power state from D3cold to D0, device inaccessible
Nov 30 14:57:35 mars kernel: nvme nvme1: Disabling device after reset failure: -19
Nov 30 14:57:35 mars kernel: nvme nvme2: Disabling device after reset failure: -19
Nov 30 14:57:35 mars kernel: nvme 0000:04:00.0: Unable to change power state from D3cold to D0, device inaccessible
Nov 30 14:57:35 mars kernel: nvme nvme0: Disabling device after reset failure: -19
Nov 30 14:57:35 mars kernel: nvme 0000:07:00.0: Unable to change power state from D3cold to D0, device inaccessible
Nov 30 14:57:35 mars kernel: nvme nvme3: Disabling device after reset failure: -19
Nov 30 14:57:35 mars kernel: I/O error, dev nvme2n1, sector 2064186456 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Nov 30 14:57:35 mars kernel: I/O error, dev nvme0n1, sector 1066834128 op 0x0:(READ) flags 0x4000 phys_seg 4 prio class 0
Nov 30 14:57:35 mars kernel: I/O error, dev nvme0n1, sector 1066844160 op 0x0:(READ) flags 0x0 phys_seg 18 prio class 0
Nov 30 14:57:35 mars kernel: zio pool=NVMePool vdev=/dev/disk/by-id/nvme-INTEL_SSDPE2KX020T8_PHLJ326100VP2P0BGN_1-part1 error=5 type=2 offset=1056855076864 size=4096 flags=1572992
Nov 30 14:57:35 mars kernel: zio pool=NVMePool vdev=/dev/disk/by-id/nvme-INTEL_SSDPE2KX020T8_PHLJ326100VP2P0BGN_1-part1 error=5 type=1 offset=546211336192 size=126976 flags=1074267264
I have already added "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" to the kernel boot parameters as suggested by the kernel message, but it does not help at all. I have also disabled ASPM in the BIOS and tried updating the kernel to 6.11.x, but to no avail. The only thing that seems to bring the server back to a usable state is periodically writing the following values for each affected PCI device:
echo "on" > /sys/bus/pci/devices/<device_id>/power/control
echo 0 > /sys/bus/pci/devices/<device_id>/d3cold_allowed
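To avoid doing this by hand after every dropout, the same workaround can be scripted for all four controllers. This is only a rough sketch; the glob assumes the drives are bound to the kernel's nvme driver, so the paths may look different on other systems:
Code:
#!/bin/bash
# Workaround sketch: keep every NVMe controller's PCI device fully powered
# and disallow the D3cold state. The glob matches the PCI devices bound to
# the "nvme" driver on my system.
for dev in /sys/bus/pci/drivers/nvme/0000:*; do
    [ -w "$dev/power/control" ] || continue
    echo on > "$dev/power/control"     # disable runtime power management
    echo 0 > "$dev/d3cold_allowed"     # forbid the D3cold power state
done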
I have made sure that the BIOS and the drive firmware are up to date. I have also tried updating again to see if there are any fixes for this, but unfortunately there are none so far.
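For completeness, this is where the boot parameters were added on my host. This is just a sketch assuming GRUB is the bootloader; on a ZFS-root install booted through proxmox-boot-tool the line would go into /etc/kernel/cmdline, followed by a proxmox-boot-tool refresh, instead:
Code:
# /etc/default/grub (relevant line only; "quiet" is the stock default)
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"

# regenerate the GRUB config, then reboot
update-grub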
The server uses the following hardware:
Mainboard: Supermicro X13SCH-F
CPU: Intel Xeon E-2478
Chipset: Intel C266
Affected drives: Intel DC P4510 (4 TB)
RAM: 128 GB Kingston ECC
Proxmox version: 8.3.0
I would be really grateful if somebody could look into this, since it renders my Proxmox setup nearly unusable.
EDIT: Corrected the issue report.