Proxmox 8.1 | Linux 6.5.11-6 | Strange NVMe error prevents boot.

sondzack

New Member
Dec 9, 2023
Hello everyone,
I've run into an issue on my Lenovo M720q Tiny where Proxmox fails to boot. The machine had been running for over a month when, overnight, it stopped working. It was unresponsive to SSH, and when I plugged in a monitor I was met with a completely frozen session and a wall of NVMe errors.

I had assumed this was the notorious NVMe power-saving issue that has been haunting Linux kernels (and I still suspect it plays a part), but I had already taken the necessary precautions and set the
Code:
nvme_core.default_ps_max_latency_us=0
parameter on the kernel command line long ago.
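
For anyone who wants to confirm that the parameter actually reached the running kernel, here is a minimal sketch. The helper name cmdline_param is made up for illustration; it only parses a command-line string and changes nothing on the system.

```shell
#!/bin/sh
# cmdline_param (hypothetical helper): print the value of a key=value
# kernel parameter found in a kernel command-line string.
#   $1 = parameter name
#   $2 = command-line text (optional; defaults to the running kernel's)
cmdline_param() {
  line=${2:-$(cat /proc/cmdline)}
  for word in $line; do
    case $word in
      "$1"=*) printf '%s\n' "${word#*=}"; return 0 ;;
    esac
  done
  return 1  # parameter not present
}

# Usage on the affected host:
#   cmdline_param nvme_core.default_ps_max_latency_us || echo "not set on this boot"
# The value the driver is using should also be readable here (assuming the
# nvme_core module exposes the parameter in sysfs on your kernel):
#   cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
```

If nothing is printed, the parameter may have been added to the wrong place. As far as I know, on a GRUB-booted Proxmox install it goes into /etc/default/grub followed by update-grub, while on a systemd-boot install (typically ZFS root) it goes into /etc/kernel/cmdline followed by proxmox-boot-tool refresh.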

I rebooted, but the system always hangs with the following message:

Code:
lvm:242 blocked for more than 120 seconds.
Not tainted 6.5.11-6-pve #1
nvme nvme0: Device not ready: aborting reset, CSTS=0x1

Odd. I booted into recovery mode and got the following errors (see attached images).
I checked the NVMe drive with Lenovo's built-in BIOS diagnostic, and it passed without issue. The BIOS also recognizes the drive without issue.

Any help would be extremely appreciated.
Kind regards.
 

Attachments

  • IMG_20240209_141540_117.jpg (666.2 KB)
  • IMG_20240209_142357_666.jpg (648.2 KB)
Hi,
I am currently encountering the same issue with Proxmox 7.4-17 (Linux 5.15.143-1).
In my opinion, it appeared with the last kernel/PVE upgrade.
 
Read the Linux kernel documentation. A value of 0 is not a tuned setting; specific drives require a specific latency value. There you will find the formula for working out the correct value for your system.
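
To put numbers on that: as far as I understand the nvme driver, a value of 0 turns APST off entirely, and any other value allows only those power states whose entry latency plus exit latency fits within default_ps_max_latency_us. A rough sketch for listing those sums from nvme-cli output follows. The helper name ps_latencies is made up, and the parsing assumes the usual "ps N : ... enlat:X exlat:Y" lines, which may differ between nvme-cli versions.

```shell
#!/bin/sh
# ps_latencies (hypothetical helper): read `nvme id-ctrl` text on stdin and
# print, for each power state, its number and enlat + exlat in microseconds.
ps_latencies() {
  awk '/^ps +[0-9]+ *:/ {
    en = ""; ex = ""
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^enlat:/) en = substr($i, 7)   # text after "enlat:"
      if ($i ~ /^exlat:/) ex = substr($i, 7)   # text after "exlat:"
    }
    if (en != "" && ex != "") print "ps" $2, en + ex
  }'
}

# Usage (needs nvme-cli and root; /dev/nvme0 taken from the log above):
#   nvme id-ctrl /dev/nvme0 | ps_latencies
```

Picking a value just below the sum of the deepest state you want to exclude keeps the shallower states available, whereas 0 excludes all of them.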
 
Same problem here with kernel 6.5.13-5.
I also tried
Code:
nvme_core.default_ps_max_latency_us=0
but it didn't work for me. The system hangs with
Code:
nvme nvme0: Device not ready: aborting reset, CSTS=0x1
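
For reference, the CSTS value in that message is the NVMe Controller Status register, and it can be decoded bit by bit. The sketch below uses the bit layout from the NVMe base specification as I remember it (RDY bit 0, CFS bit 1, SHST bits 3:2, NSSRO bit 4, PP bit 5); the function name decode_csts is made up.

```shell
#!/bin/sh
# decode_csts (hypothetical helper): split an NVMe CSTS register value
# (hex or decimal) into its status fields.
decode_csts() {
  csts=$(( $1 ))
  rdy=$(( csts & 1 ))            # controller ready
  cfs=$(( (csts >> 1) & 1 ))     # controller fatal status
  shst=$(( (csts >> 2) & 3 ))    # shutdown status
  nssro=$(( (csts >> 4) & 1 ))   # NVM subsystem reset occurred
  pp=$(( (csts >> 5) & 1 ))      # processing paused
  echo "RDY=$rdy CFS=$cfs SHST=$shst NSSRO=$nssro PP=$pp"
}

decode_csts 0x1   # the value from the log line above
```

So CSTS=0x1 means the controller reports ready with no fatal status. My reading, which is an assumption, is that the driver was waiting for the ready bit to clear during the reset and gave up when it stayed set.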
 
Did you figure out how to fix this problem?