[SOLVED] NVMe storage issue

We will include the revert (it has already been acked on the Ubuntu kernel list) in our next kernel update, which will happen when the next Ubuntu kernel is released (it might be a bit earlier on pvetest).
 
There'll probably be a test kernel later today which (among other fixes) reverts the one seemingly buggy NVMe commit that has been included in our released kernel.
 
That's what I was going to ask. If we could test a patched PVE kernel, that would be great; we could then give you and them feedback.
 
It will be available on pvetest later today (it did not make it on Friday, sorry).
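To get it from pvetest, the test repository has to be enabled; on a PVE 4.x / Debian Jessie system the entry should look roughly like the following (a sketch, the file name is just an example and the suite must match your release):
Code:
# /etc/apt/sources.list.d/pvetest.list (example name) - enable the pvetest repository
deb http://download.proxmox.com/debian jessie pvetest
followed by an apt-get update before installing the new pve-kernel package.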
 
I have installed the new test kernel:
Code:
 dpkg -i pve-kernel-4.4.35-2-pve_4.4.35-79_amd64.deb
and rebooted the server, but the running kernel version did not change. I hope this is correct.
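A quick way to see which kernel builds are installed versus which one is actually booted is something like this (a minimal sketch; package names may differ on your system):
Code:
# list installed pve-kernel packages and show the running kernel release and build
dpkg -l 'pve-kernel-*' | grep '^ii'
uname -r
uname -v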

EDIT:
My first results, only a few hours old, are good so far:

only one line in dmesg:
# dmesg | grep -i nvme
[ 0.893264] nvme0n1: p1 p2 p3

and so far no errors.


best regards,
maxprox

Code:
root@oprox:~# pveversion -v
proxmox-ve: 4.4-78 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-5 (running version: 4.4-5/c43015a5)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-79
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-102
pve-firmware: 1.1-10
libpve-common-perl: 4.0-85
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-71
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.1-1
pve-container: 1.0-90
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-5
lxcfs: 2.0.5-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
 
Yes, it was an ABI-compatible upgrade, so the ABI version did not get bumped. You can verify the exact kernel build/version with "uname -a".
 
Okay, I see: with uname -a the build date changed to "Fri Jan 20".
Code:
root@oprox:~# uname -a
Linux ouchprox 4.4.35-2-pve #1 SMP Fri Jan 20 13:58:17 CET 2017 x86_64 GNU/Linux

The server has been running error-free for about 20 hours.
Tonight I will put the system back into production.
 
Yes, it looks very good. My server has been running for 48 hours (half in production) without a single NVMe or device-mapper error.
My question:
Although I have a subscription, I use the pve-no-subscription repository.
Do I have to pay attention to anything at the next kernel update?
In other words, will the patch be included in future kernels?
 
Yes. Since there was no negative feedback, the kernel package from pvetest will move as-is to pve-no-subscription. Future updates will contain either the same fix or a newer version of it.
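Once it has moved there, the fixed kernel comes in through a normal upgrade, roughly:
Code:
# pull and install the updated kernel from the configured repository
apt-get update
apt-get dist-upgrade
and a reboot afterwards activates it.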
 
Maybe someone will benefit from this knowledge. We have two systems that are otherwise identical, and one of them has the NVMe disk issue: after booting, the NVMe drive shows up in dmesg, in /dev/ and in fdisk -l, but the device disconnects about 60 seconds after being probed during the boot sequence. Applying the kernel installation described above solves the problem.

The issue seems to be related to the CPU type or generation: the system with the NVMe problem has dual Xeon 2609v3 CPUs, while the one running smoothly with the default installation has dual Xeon 2620v4 CPUs.
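If someone wants to watch for the disconnect on an unpatched kernel, following the kernel log after boot is enough; a minimal sketch using standard tools:
Code:
# follow kernel messages and show only NVMe-related lines
dmesg -w | grep -i nvme
# in a second shell, check whether the namespace device is still present
ls -l /dev/nvme0n1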
 
So, six or seven months later, is your NVMe still working fine?
What version of Proxmox, or what kernel are you using?
 
;-)
Yes, the system has been working well in production since the post above.


Code:
date:
Thu Aug 17 23:02:08 CEST 2017
root@prox:~# pveversion -v
proxmox-ve: 4.4-92 (running kernel: 4.4.67-1-pve)
pve-manager: 4.4-15 (running version: 4.4-15/7599e35a)
pve-kernel-4.4.59-1-pve: 4.4.59-87
pve-kernel-4.4.67-1-pve: 4.4.67-92
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-52
qemu-server: 4.0-110
pve-firmware: 1.1-11

Code:
root@ouchprox:~#  smartctl /dev/nvme0 -x
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.67-1-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       ADATA SX8000NP
Serial Number:                      2G49xxxxxx
Firmware Version:                   C2.1.3
PCI Vendor/Subsystem ID:            0x126f
IEEE OUI Identifier:                0x000000
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          128,035,676,160 [128 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Thu Aug 17 23:07:45 2017 CEST
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0016):   Format Frmw_DL *Other*
Optional NVM Commands (0x001d):     Comp DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     70 Celsius
Critical Comp. Temp. Threshold:     81 Celsius
Namespace 1 Features (0x04):        Dea/Unw_Error

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        5       5
 1 +     4.60W       -        -    1  1  1  1       30      30
 2 +     3.80W       -        -    2  2  2  2       30      30
 3 -   0.0700W       -        -    3  3  3  3    30000   10000
 4 -   0.0050W       -        -    4  4  4  4    50000   10000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        52 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    1,742,477 [892 GB]
Data Units Written:                 1,125,620 [576 GB]
Host Read Commands:                 104,553,835
Host Write Commands:                34,886,622
Controller Busy Time:               683
Power Cycles:                       24
Power On Hours:                     5,507
Unsafe Shutdowns:                   3
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
 
I have similar issues, sometimes an unexpected reboot.
Code:
# pveversion --verbose
proxmox-ve: 5.1-32 (running kernel: 4.13.13-2-pve)
pve-manager: 5.1-41 (running version: 5.1-41/0b958203)
pve-kernel-4.13.13-2-pve: 4.13.13-32
pve-kernel-4.13.13-1-pve: 4.13.13-31
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: not correctly installed
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1

# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD EPYC 7401P 24-Core Processor
Stepping: 2
CPU MHz: 1200.000
CPU max MHz: 2000.0000
CPU min MHz: 1200.0000
BogoMIPS: 3992.60
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 64K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0-5,24-29
NUMA node1 CPU(s): 6-11,30-35
NUMA node2 CPU(s): 12-17,36-41
NUMA node3 CPU(s): 18-23,42-47
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 mwaitx cpb hw_pstate vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload overflow_recov succor smca
# dmidecode -t bios
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 3.1.1 present.
# SMBIOS implementations newer than version 3.0 are not
# fully supported by this version of dmidecode.
Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
Vendor: GIGABYTE
Version: F03e
Release Date: 09/13/2017
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 16384 kB
Characteristics:
PCI is supported
BIOS is upgradeable
BIOS shadowing is allowed
Boot from CD is supported
Selectable boot is supported
BIOS ROM is socketed
EDD is supported
5.25"/1.2 MB floppy services are supported (int 13h)
3.5"/720 kB floppy services are supported (int 13h)
3.5"/2.88 MB floppy services are supported (int 13h)
Print screen service is supported (int 5h)
Serial services are supported (int 14h)
Printer services are supported (int 17h)
ACPI is supported
USB legacy is supported
BIOS boot specification is supported
Targeted content distribution is supported
UEFI is supported
BIOS Revision: 5.13

# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 150G 0 loop
nvme1n1 259:0 0 894.3G 0 disk
nvme0n1 259:1 0 894.3G 0 disk
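For an unexpected reboot, the log of the previous boot is usually the first thing to check; assuming persistent journaling is enabled, something along these lines (a sketch, not specific to NVMe):
Code:
# error-level kernel messages from the boot that ended in the reset
journalctl -b -1 -k -p err
# the last lines the previous boot managed to write
journalctl -b -1 -n 50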
 
