NVME QID timeout

bjwe · Feb 13, 2019

Hello everyone,

I am currently testing a configuration with Proxmox VE 5.3-8
and have a problem with two NVMe SSDs.

The base installation sits on two SATA SSDs, on which I first tested the virtual machines. There are no problems there. I then wanted to add two NVMe drives and attach them as a
ZFS mirror. Moving the virtual disks already took a very long time, and I thought ZFS might be to blame. So I destroyed the zpool and attached the NVMe drives directly as LVM. The picture is the same: the machines often freeze completely for a few seconds and then carry on. A look at dmesg shows what appears to be the cause:

Code:

[...]
[  607.451494] nvme nvme0: I/O 180 QID 28 timeout, completion polled
[  607.451505] nvme nvme0: I/O 243 QID 34 timeout, completion polled
[  607.451511] nvme nvme0: I/O 768 QID 36 timeout, completion polled
[  607.451556] nvme nvme0: I/O 37 QID 39 timeout, completion polled
[  607.451569] nvme nvme1: I/O 311 QID 42 timeout, completion polled
[  607.451575] nvme nvme0: I/O 990 QID 51 timeout, completion polled
[  607.451581] nvme nvme0: I/O 58 QID 53 timeout, completion polled
[...]

The only useful thing I could find on this:
bugs.launchpad.net/ubuntu/+source/linux/+bug/1807393

What I have already tried:
- Updating the NVMe firmware (was already up to date)
- Updating the board firmware (also up to date)
- Moving the cards to other slots
- Enabling/disabling the Option ROM
- Switching the IOMMU on/off

System details:

CPU(s): 64 x AMD EPYC 7351 16-Core Processor (2 Sockets)
Kernel version: Linux 4.15.18-10-pve #1 SMP PVE 4.15.18-32 (Sat, 19 Jan 2019 10:09:37 +0100)

The server has a Supermicro H11DSi board, 64 GB of ECC RAM, an X710 10GbE SFP+ card, two Samsung DC SSDs (SATA), and two Intel DC P4510 drives (attached directly to the PCI bus).

Base Board Information
Manufacturer: Supermicro
Product Name: H11DSi
Version: 1.01
Serial Number:
Features:
Board is a hosting board
Board is removable
Board is replaceable
Chassis Handle: 0x0003
Type: Motherboard
Contained Object Handles: 0

smartctl -a /dev/nvme0n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-10-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke

=== START OF INFORMATION SECTION ===
Model Number: INTEL SSDPE2KX010T8
Serial Number:
Firmware Version: VDV10131
PCI Vendor/Subsystem ID: 0x8086
IEEE OUI Identifier: 0x5cd2e4
Total NVM Capacity: 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size: 512

smartctl -a /dev/nvme1n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-10-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke

=== START OF INFORMATION SECTION ===
Model Number: INTEL SSDPE2KX010T8
Serial Number:
Firmware Version: VDV10131
PCI Vendor/Subsystem ID: 0x8086
IEEE OUI Identifier: 0x5cd2e4
Total NVM Capacity: 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size: 512

Does anyone have another idea what could be done without giving up the new drives?
(Or whether something still needs to be adjusted in the kernel?)

Regards, Björn

====

Addendum: It really does seem to be the kernel.
The Ubuntu Linux kernel (4.18.0-10) has no problems with the drives.

So do I just wait for a patch now?

BBTown · Mar 16, 2019

If I understood this post correctly, it is currently not possible to install Proxmox on ZFS directly on an NVMe SSD. An ext4-based installation, however, worked for me. Since I want to run my Intel NUC in a cluster, though, I need ZFS.

bjwe · Mar 17, 2019

This thread is about a different problem, though:
Various kernels (including the Proxmox kernel) have problems with this particular Intel series.

BBTown · Mar 17, 2019

bjwe said:
This thread is about a different problem, though:
Various kernels (including the Proxmox kernel) have problems with this particular Intel series.

In my case it is a current Intel NUC7i7BNH with a Core i7 processor, combined with a Samsung EVO 970 NVMe.

bjwe · Mar 17, 2019

What does this command print?

dmesg | grep nvme
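
For longer logs, counting the timeout messages per controller can show whether a single drive or all of them are affected. The following is a minimal sketch and not part of the original posts: the function name nvme_timeout_summary is made up for illustration, and it assumes the message format seen in the dmesg excerpts in this thread.

```shell
# Summarize "I/O ... QID ... timeout" messages per NVMe controller.
# Reads kernel log text on stdin and prints "<controller> <count>" lines.
# nvme_timeout_summary is a hypothetical helper name, not an existing tool.
nvme_timeout_summary() {
    awk '/nvme nvme[0-9]+: I\/O [0-9]+ QID [0-9]+ timeout/ {
        # Find the "nvme nvmeN:" pair regardless of the timestamp format.
        for (i = 1; i < NF; i++) {
            if ($i == "nvme" && $(i + 1) ~ /^nvme[0-9]+:$/) {
                dev = $(i + 1)
                sub(/:$/, "", dev)
                count[dev]++
                break
            }
        }
    }
    END { for (dev in count) print dev, count[dev] }' | sort
}
```

Usage would be `dmesg | nvme_timeout_summary`; the output is sorted as plain text, so nvme10 is listed before nvme2.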

msi1 · Jan 30, 2020

[696795.851050] nvme nvme4: I/O 601 QID 13 timeout, completion polled
[697071.821001] nvme nvme3: I/O 785 QID 5 timeout, completion polled
[697287.890559] nvme nvme4: I/O 301 QID 1 timeout, completion polled
[697683.153380] nvme nvme10: I/O 784 QID 28 timeout, completion polled
[697904.338956] nvme nvme5: I/O 286 QID 23 timeout, completion polled
[698385.110378] nvme nvme2: I/O 579 QID 5 timeout, completion polled
[699014.170862] nvme nvme8: I/O 854 QID 8 timeout, completion polled
[699651.295405] nvme nvme2: I/O 407 QID 15 timeout, completion polled
[699773.664295] nvme nvme8: I/O 330 QID 32 timeout, completion polled
[699901.153199] nvme nvme3: I/O 527 QID 11 timeout, completion polled
[700048.930258] nvme nvme2: I/O 473 QID 25 timeout, completion polled
[700237.795581] nvme nvme3: I/O 624 QID 3 timeout, completion polled

[700483.045338] nvme nvme4: I/O 202 QID 8 timeout, completion polled

~# uname -a
Linux 1u05pve32 5.3.13-1-pve #1 SMP PVE 5.3.13-1 (Thu, 05 Dec 2019 07:18:14 +0100) x86_64 GNU/Linux

msi1 · Feb 11, 2020

For me this only occurs with IOMMU enabled.
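
To check whether the IOMMU is actually in use on a host, one option is to look for populated IOMMU groups in sysfs. This is a minimal sketch under two assumptions: the groups live in the usual /sys/kernel/iommu_groups location, and an empty or missing directory means no IOMMU is active. The function name iommu_active is hypothetical.

```shell
# Report whether the kernel has any IOMMU groups populated.
# The optional argument overrides the sysfs path (useful for testing).
# iommu_active is a hypothetical helper name.
iommu_active() {
    dir="${1:-/sys/kernel/iommu_groups}"
    # Assumption: the directory is empty or absent when no IOMMU is in use.
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
        echo "IOMMU active"
    else
        echo "IOMMU inactive"
    fi
}
```

Running `iommu_active` once with the IOMMU enabled in the firmware setup and once with it disabled should show whether the setting reaches the kernel.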

Jean-Pierre · Apr 13, 2020

Hello

I have the same QID timeout error on some Supermicro servers with EPYC 7402P CPUs and Intel SSD and NVMe drives. The error occurs only on the NVMe drives.

The solution for me was to update the firmware on the NVMe drives using the Intel tool isdct from the isdct_3.0.24-1_amd64.deb package. Note that you need to run it on each drive, reboot, and then run it again on the drive; you may get errors, but it does work. I have done this on two servers, one with Proxmox 6.0 and one with 6.1, and it fixed both of them. The tool did not break any of my storage arrays or cause any issues.

If this does not work for you look at this:
https://forum.level1techs.com/t/fixing-slow-nvme-raid-performance-on-epyc/151909

Thanks

hbokh · Feb 23, 2021

Upgrading the NVMe SSDs' firmware with the "Intel® Memory and Storage Tool CLI (Command-Line Interface)" helped us.
Link: https://downloadcenter.intel.com/download/30162
Installing the Ubuntu package from the ZIP file on Debian Buster (Proxmox VE 6.3-3) worked.

Code:

dpkg -i intelmas_1.5.113-0_amd64.deb

intelmas show -o json -intelssd | jq
intelmas load -intelssd 1
intelmas load -intelssd 2
intelmas load -intelssd 3

hoss · Jul 18, 2023

I was getting:
dmesg -T
[Tue Jul 11 09:30:07 2023] nvme nvme3: I/O 867 QID 51 timeout, completion polled
[Tue Jul 11 09:30:38 2023] nvme nvme3: I/O 323 QID 23 timeout, completion polled
[Tue Jul 11 09:34:01 2023] nvme nvme3: I/O 612 QID 71 timeout, completion polled

[SOLUTION]
Solidigm bought Intel's SSD/NVMe business. Solidigm's website is NOT tech friendly (the search is terrible)!
The tool you need for the latest and greatest firmware is the "Solidigm Storage Tool":
https://www.solidigm.com/content/solidigm/us/en/support-page/drivers-downloads/ka-00085.html

It has the firmware for all drives built into the tool - this is what I did:

smartctl -a /dev/<your nvme>
=== START OF INFORMATION SECTION ===
Model Number: INTEL SSDPE2KX080T8
Serial Number: PHLJ22xxxxxxxxxxxx
Firmware Version: VDV10131 //<<<firmware here
PCI Vendor/Subsystem ID: 0x8086
IEEE OUI Identifier: 0x5cd2e4

sst show -o json -ssd
sst load -ssd <drive SN here> ( sst load -ssd PHLJ22xxxxxxxxxxxx )

smartctl -a /dev/<your nvme>
=== START OF INFORMATION SECTION ===
Model Number: INTEL SSDPE2KX080T8
Serial Number: PHLJ22xxxxxxxxxxxx
Firmware Version: VDV10184 //<<<firmware here
PCI Vendor/Subsystem ID: 0x8086

This fixed the problem - I no longer see the IO messages from the kernel AND the system is no longer lagging during heavy disk usage.
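
To confirm that an update took effect, the firmware version can be pulled out of the smartctl output before and after running the tool. This is a minimal sketch, assuming the "Firmware Version:" line format shown in the smartctl excerpts above; the function name nvme_firmware_version is hypothetical.

```shell
# Print the firmware version from `smartctl -a` output read on stdin.
# Takes the third whitespace-separated field of the "Firmware Version:" line.
# nvme_firmware_version is a hypothetical helper name.
nvme_firmware_version() {
    awk '/^Firmware Version:/ { print $3; exit }'
}
```

For example, `smartctl -a /dev/nvme0n1 | nvme_firmware_version` can be run before and after the firmware load and the two results compared.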
