Problem with NVME timeout and aborting

sab0

Hi all,

I have two WD Blue SN750 1TB NVMe drives in my i5-8500 machine. One is used as the boot drive, and the other is passed through to an OpenMediaVault VM where it is used primarily for seeding.

The boot NVMe has never had a problem.
The storage NVMe has random problems, and the following errors come up in syslog.

Code:
Apr 04 06:02:45 proxmox kernel: nvme nvme1: I/O 321 (I/O Cmd) QID 6 timeout, aborting
Apr 04 06:02:45 proxmox kernel: nvme nvme1: I/O 322 (I/O Cmd) QID 6 timeout, aborting
Apr 04 06:03:16 proxmox kernel: nvme nvme1: I/O 321 QID 6 timeout, reset controller
Apr 04 06:04:28 proxmox kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
Apr 04 06:04:28 proxmox kernel: nvme nvme1: Abort status: 0x371
Apr 04 06:04:28 proxmox kernel: nvme nvme1: Abort status: 0x371
Apr 04 06:04:38 proxmox kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
Apr 04 06:04:38 proxmox kernel: nvme nvme1: Disabling device after reset failure: -19
Apr 04 06:07:11 proxmox smartd[984]: Device: /dev/nvme1, open() of NVMe device failed: Resource temporarily unavailable

Potentially useful info (maybe not):
  1. This stops the VM from working, as it can no longer see the passed-through storage drive.
  2. This has happened before, but normally a restart (or shutting the node down, leaving it 5 minutes, then starting it again) sorts it out.
  3. The drive no longer shows up in the "Disks" section, even after a restart.
  4. Previously I have checked the SMART info and all seems fine.
  5. Previously I have run GParted on the drive and it checked out OK; some small adjustments were made, but no warnings came up.
  6. I have booted Windows and looked for updated firmware for the drives, but there doesn't appear to be any.
I found this page: https://wiki.archlinux.org/title/Solid_state_drive/NVMe#Troubleshooting, but being a bit of a noob to Linux I wasn't sure whether, being written for Arch, it would also apply to Debian.
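In case it matters: as far as I can tell, the Arch page is about the kernel's APST power saving, and `nvme_core.default_ps_max_latency_us` is a kernel module parameter, not anything Arch-specific, so checking it should work the same on Debian/Proxmox. A sketch of reading the current value (the sysfs path assumes the nvme_core module is loaded):

```shell
# Read the current APST latency cap in microseconds; the kernel default
# is 100000 (100 ms), and setting it to 0 disables APST entirely.
# Falls back to a message on systems without the nvme_core module.
apst=$(cat /sys/module/nvme_core/parameters/default_ps_max_latency_us 2>/dev/null)
[ -n "$apst" ] || apst="nvme_core module not loaded"
echo "APST latency cap: $apst"
```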

Any help would be much appreciated!
 
Out of curiosity, I tried smartctl -c /dev/nvme1n1 and it gave this:

Code:
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     4.20W    3.70W       -    0  0  0  0        0       0
 1 +     2.70W    2.30W       -    0  0  0  0        0       0
 2 +     1.90W    1.80W       -    0  0  0  0        0       0
 3 -   0.0250W       -        -    3  3  3  3     3900   11000
 4 -   0.0050W       -        -    4  4  4  4     5000   44000

Following that link, could adding nvme_core.default_ps_max_latency_us=3900 to the grub file be the solution?

Would the grub file then go from this:

Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt i915.enable_gvt=1"
to this?:
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt i915.enable_gvt=1 nvme_core.default_ps_max_latency_us=3900"
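If I've understood the Arch page right, the edit could be sketched like this on a scratch copy before touching the real file (3900 is just the value from the power-state table above; whether that is the right cutoff is exactly my question):

```shell
# Demonstrate the edit on a scratch copy of the GRUB line; on the real
# node you would edit /etc/default/grub itself.
tmp=$(mktemp)
printf '%s\n' 'GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt i915.enable_gvt=1"' > "$tmp"

# Append the APST cap inside the closing quote (3900 us, as proposed above).
sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT=".*\)"$/\1 nvme_core.default_ps_max_latency_us=3900"/' "$tmp"
cat "$tmp"

# On the real system, after editing /etc/default/grub:
#   update-grub        # regenerate the boot config (Debian/Proxmox)
#   reboot
#   cat /proc/cmdline  # confirm the parameter took effect
```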
 
The plot thickens... the drive would not come online no matter how many times I restarted the server. So I shut it down overnight, booted it in the morning, and IT'S BACK.

This is what SMART says:

Code:
=== START OF INFORMATION SECTION ===
Model Number:                       WD Blue SN570 1TB
Serial Number:                      22481A401535
Firmware Version:                   234110WD
PCI Vendor/Subsystem ID:            0x15b7
IEEE OUI Identifier:                0x001b44
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            001b44 4a48bba505
Local Time is:                      Fri Apr  5 08:11:24 2024 BST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     80 Celsius
Critical Comp. Temp. Threshold:     85 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     4.20W    3.70W       -    0  0  0  0        0       0
 1 +     2.70W    2.30W       -    0  0  0  0        0       0
 2 +     1.90W    1.80W       -    0  0  0  0        0       0
 3 -   0.0250W       -        -    3  3  3  3     3900   11000
 4 -   0.0050W       -        -    4  4  4  4     5000   44000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        35 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    59,608,208 [30.5 TB]
Data Units Written:                 13,459,096 [6.89 TB]
Host Read Commands:                 274,006,788
Host Write Commands:                21,613,008
Controller Busy Time:               1,009
Power Cycles:                       31
Power On Hours:                     8,833
Unsafe Shutdowns:                   11
Media and Data Integrity Errors:    0
Error Information Log Entries:      1
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Is this potentially my mobo giving up? I thought passing the drive through to a VM might be causing the issue, but if that were the problem, the drive would still show up on the node, wouldn't it?
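Next time it drops out I suppose I could at least check whether the node still sees the controller at all, with something like this (needs pciutils; just a sketch):

```shell
# If lspci still shows the controller but /sys/class/nvme does not, the
# driver gave up on it ("Disabling device after reset failure"); if even
# lspci comes up empty, the device fell off the bus, which points at the
# slot or board rather than at passthrough.
pci_out=$(lspci -nn 2>/dev/null | grep -i 'non-volatile\|nvme')
[ -n "$pci_out" ] || pci_out="no NVMe controller visible on the PCI bus"
drv_out=$(ls /sys/class/nvme 2>/dev/null)
[ -n "$drv_out" ] || drv_out="no controllers bound to the nvme driver"
printf '%s\n%s\n' "$pci_out" "$drv_out"
```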
 
Yes and no. I think it was the NVMe slot on the mobo giving up, as the board is a bit old and that NVMe drive was pretty new. I removed the drive and haven't had the problem since.
 
I am having a similar issue, except that: 1) I am running two drives of the exact same brand and model as a ZFS boot pair, and 2) I do not pass either drive through to a VM; the VMs are created and run from this ZFS RAID pair.

My issue is that the problem is very sporadic. Sometimes the system runs for weeks without issues; other times it runs for a few days and one of the drives (always the same one) gets disconnected. I tried replacing the NVMe drive and the issue remains. How can I validate whether it is a mobo fault or something else?
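The only software-side checks I can think of are comparing drive-side evidence (SMART, the controller's own error log) against link-side evidence (PCIe/AER errors in the kernel log), plus physically swapping the two drives between slots to see whether the fault follows the drive or stays with the slot. A sketch, assuming the failing device is nvme1 (adjust the name):

```shell
dev=/dev/nvme1   # assumed name of the drive that disconnects; adjust
# Drive-side evidence: SMART health and the controller's own error log
# (needs smartmontools and nvme-cli).
smart=$(smartctl -H "$dev" 2>/dev/null)
[ -n "$smart" ] || smart="smartctl unavailable or $dev missing"
echo "$smart"
elog=$(nvme error-log "$dev" 2>/dev/null | head -n 20)
[ -n "$elog" ] || elog="nvme-cli unavailable or $dev missing"
echo "$elog"
# Link-side evidence: AER / PCIe bus errors in the kernel log point at
# the slot or board rather than at the flash itself.
kmsg=$(dmesg 2>/dev/null | grep -iE 'aer|pcie bus error')
[ -n "$kmsg" ] || kmsg="no PCIe/AER errors in the kernel log"
echo "$kmsg"
```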
 
Hi,
Same issue with a Samsung PM1735 (PCIe), in a cluster of five servers, each with one NVMe (Ceph):

nvme nvme0: pci function 0000:18:00.0
nvme nvme0: Shutdown timeout set to 10 seconds
nvme nvme0: 63/0/0 default/read/poll queues
nvme 0000:18:00.0: Using 48-bit DMA addresses
nvme nvme0: resetting controller due to AER
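"resetting controller due to AER" means the PCIe layer reported an error and the driver reset the device in response, so I am looking at the AER status of that function first (0000:18:00.0 is the address from the log above; needs pciutils, run as root):

```shell
addr=0000:18:00.0   # PCI address of the NVMe function, from the log above
# UESta/CESta in the AER capability show which uncorrectable/correctable
# errors have latched; repeated correctable errors usually mean a marginal
# PCIe link (slot, riser, backplane) rather than a dying drive.
aer=$(lspci -vvv -s "$addr" 2>/dev/null | grep -iE 'Advanced Error Reporting|UESta|CESta')
[ -n "$aer" ] || aer="device $addr not present here; run this on the affected server"
echo "$aer"
```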
 
Is there any working solution? This just stopped working recently.

The log still says:
Code:
nvme nvme0: Device not ready; aborting reset, CSTS=0x1

- My NVMe disk is a Samsung 980 M.2 SSD (500 GB)
- kernel: 6.8.8-2-pve
 
Looks like you are having a different issue altogether. Did you try rebooting Proxmox? Also, check whether the problematic NVMe drive is still operational; I have had a case where the NVMe drive itself was broken.
 
