I have a long-running server on Proxmox 6.2 which uses 2x ZFS-mirrored 7TB Ultrastar SN200 Series NVMe SSDs. I bought two more identical devices and set up a second server about a month ago.
I could create VMs and migrate to/from local-vmdata (the local ZFS pool on these drives).
At some point last week, the new server (which was also running a 2x ZFS mirror on these drives) write-locked the ZFS pool because the NVMe controllers no longer appeared. What is strange is that these devices still show up as NVMe devices, pass SMART checks, and report NVMe subsystems; I just can't access them as block devices.
Note that the Proxmox install is on a separate, smaller pair of mirrored 250GB SATA SSDs, so the server still boots, and the server that is not showing the NVMe block devices can still live-migrate Ceph-backed VMs (unrelated to this issue, just an FYI).
Anyway, I'm at my wit's end, since these drives were working perfectly. Did I really have two brand-new devices fail at exactly the same time? I might try booting from an Ubuntu live ISO to see if I can import the ZFS pool.
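If I do go the live-ISO route, this is roughly what I'd try (just a sketch, assuming the pool is named local-vmdata as above and that the drives actually show up outside of Proxmox):
Bash:
# From an Ubuntu live session: install the ZFS userland tools
sudo apt update && sudo apt install -y zfsutils-linux
# Scan for importable pools without actually importing anything
sudo zpool import
# Import read-only under an alternate root so nothing gets written to the pool
sudo zpool import -o readonly=on -R /mnt local-vmdata
# Confirm the datasets are visible
sudo zfs list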
There is some chatter about disabling C-State management in the BIOS: https://www.reddit.com/r/Proxmox/co...came_unresponsive_two_times_in_three/fnkd1hr/
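I haven't touched the BIOS yet, but if C-states really are the culprit, I assume the kernel-side equivalent would be something like this (untested sketch, Intel platform assumed):
Bash:
# Append these to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub to cap CPU C-states:
#   intel_idle.max_cstate=1 processor.max_cstate=1
# then regenerate the grub config and reboot
update-grub
reboot
# afterwards, check the limit the idle driver is actually using
cat /sys/module/intel_idle/parameters/max_cstate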
Kernels:
(Note that I upgraded to the latest kernels while trying to resolve the NVMe issue.)
Working: eEI310
Linux 5.4.34-1-pve #1 SMP PVE 5.4.34-2 (Thu, 07 May 2020 10:02:02 +0200)
Not Working: TO1-C001-PM001
Linux 5.4.65-1-pve #1 SMP PVE 5.4.65-1 (Mon, 21 Sep 2020 15:40:22 +0200)
Working devices in old system:
Bash:
root@eEI310:~# lspci | grep NV
02:00.0 Non-Volatile memory controller: HGST, Inc. Ultrastar SN200 Series NVMe SSD (rev 02)
04:00.0 Non-Volatile memory controller: HGST, Inc. Ultrastar SN200 Series NVMe SSD (rev 02)
81:00.0 Non-Volatile memory controller: HGST, Inc. Ultrastar SN200 Series NVMe SSD (rev 02)
82:00.0 Non-Volatile memory controller: HGST, Inc. Ultrastar SN200 Series NVMe SSD (rev 02)
Info from the NEW (failed) system:
Bash:
root@TO1-C001-PM001:~# lspci | grep NV
03:00.0 Non-Volatile memory controller: HGST, Inc. Ultrastar SN200 Series NVMe SSD (rev 02)
04:00.0 Non-Volatile memory controller: HGST, Inc. Ultrastar SN200 Series NVMe SSD (rev 02)
Bash:
04:00.0 Non-Volatile memory controller: HGST, Inc. Ultrastar SN200 Series NVMe SSD (rev 02) (prog-if 02 [NVM Express])
Subsystem: HGST, Inc. Ultrastar SN200 Series NVMe SSD
Physical Slot: 15
Flags: bus master, fast devsel, latency 0, IRQ 39, NUMA node 0
Memory at c7330000 (64-bit, non-prefetchable) [size=16K]
Memory at c7320000 (64-bit, non-prefetchable) [size=64K]
Expansion ROM at c7300000 [disabled] [size=128K]
Capabilities: [c0] Power Management version 3
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [c8] MSI: Enable- Count=1/32 Maskable+ 64bit+
Capabilities: [e0] MSI-X: Enable+ Count=129 Masked-
Capabilities: [60] Vital Product Data
Capabilities: [100] Advanced Error Reporting
Capabilities: [180] #19
Capabilities: [248] Device Serial Number 00-0c-ca-0c-01-15-e5-00
Kernel driver in use: nvme
Bash:
root@TO1-C001-PM001:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 232.9G 0 disk
├─sda1 8:1 0 1007K 0 part
├─sda2 8:2 0 512M 0 part
└─sda3 8:3 0 232.4G 0 part
sdb 8:16 0 232.9G 0 disk
├─sdb1 8:17 0 1007K 0 part
├─sdb2 8:18 0 512M 0 part
└─sdb3 8:19 0 232.4G 0 part
Bash:
root@TO1-C001-PM001:~# ls /dev/nvm*
/dev/nvme0 /dev/nvme1
The devices listed by ID are only the boot SSD pair:
Bash:
root@TO1-C001-PM001:~# ls /dev/disk/by-id/
ata-Samsung_SSD_860_EVO_250GB_S59WNJ0MB00192W ata-Samsung_SSD_860_EVO_250GB_S59WNJ0MB00262A-part2 wwn-0x5002538e39b27324
ata-Samsung_SSD_860_EVO_250GB_S59WNJ0MB00192W-part1 ata-Samsung_SSD_860_EVO_250GB_S59WNJ0MB00262A-part3 wwn-0x5002538e39b27324-part1
ata-Samsung_SSD_860_EVO_250GB_S59WNJ0MB00192W-part2 wwn-0x5002538e39b27273 wwn-0x5002538e39b27324-part2
ata-Samsung_SSD_860_EVO_250GB_S59WNJ0MB00192W-part3 wwn-0x5002538e39b27273-part1 wwn-0x5002538e39b27324-part3
ata-Samsung_SSD_860_EVO_250GB_S59WNJ0MB00262A wwn-0x5002538e39b27273-part2
ata-Samsung_SSD_860_EVO_250GB_S59WNJ0MB00262A-part1 wwn-0x5002538e39b27273-part3
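Some other checks I plan to run on the failed box, to see what the kernel itself thinks of the two controllers (a sketch; the state attribute is what I believe the stock nvme driver exposes in sysfs):
Bash:
# Controller state as seen by the nvme driver (should read "live")
cat /sys/class/nvme/nvme0/state /sys/class/nvme/nvme1/state
# Any probe-time errors or namespace complaints from the driver?
dmesg | grep -i nvme
journalctl -k -b | grep -i nvme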
Neither the working nor the new failed system lists nvme as a loaded module:
Bash:
root@TO1-C001-PM001:~# lsmod | grep nvm
Bash:
root@eEI310:~# lsmod | grep nvm
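Which I assume just means the driver is built into the pve kernel rather than shipped as a loadable module; a quick way to confirm (assuming the usual Debian locations for the kernel config and modules.builtin):
Bash:
# "=y" here means the driver is compiled in and will never show up in lsmod
grep -i nvme /boot/config-$(uname -r)
# Built-in objects are also listed here
grep -i nvme /lib/modules/$(uname -r)/modules.builtin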
I attempted to follow this thread and adjust the power-save settings for NVMe (though I think this setting might require the nvme module to be loaded? Not sure):
https://forum.proxmox.com/threads/crash-with-intel-760p-nvme.60099/
Following: https://tekbyte.net/2020/fixing-nvme-ssd-problems-on-linux/
For additional kernel bug tracking, which may or may not be related: https://bugzilla.kernel.org/show_bug.cgi?id=195039
I then set the following in /etc/default/grub and ran update-grub:
Bash:
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="Proxmox Virtual Environment"
GRUB_CMDLINE_LINUX_DEFAULT="rootdelay=30 quiet nvme_core.default_ps_max_latency_us=0"
GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/pve-1 boot=zfs"
update-grub
Then I rebooted. Note that the drives are still working in the system where the default is left at 100000.
Bash:
root@TO1-C001-PM001:~# cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
0
root@eEI310:~# cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
100000
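It's probably also worth confirming the parameter actually landed on the running kernel; the parameter file appears to be writable at runtime too, though as far as I can tell it only applies to controllers (re)probed after the change:
Bash:
# Did the GRUB change make it onto the kernel command line?
cat /proc/cmdline
# Runtime change (only affects controllers probed after the change)
echo 0 > /sys/module/nvme_core/parameters/default_ps_max_latency_us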
I installed the nvme-cli package on the new server for troubleshooting and pulled this info about the drives:
Bash:
root@TO1-C001-PM001:~# nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid : 0x1c58
ssvid : 0x1c58
sn : SDM000038D42
mn : HUSMR7676BHP3Y1
fr : KNGND101
rab : 7
ieee : 000cca
cmic : 0
mdts : 0
cntlid : 23
ver : 10201
rtd3r : 5b8d80
rtd3e : 30d400
oaes : 0x100
ctratt : 0
rrls : 0
oacs : 0xe
acl : 255
aerl : 7
frmw : 0xb
lpa : 0x3
elpe : 255
npss : 11
avscc : 0x1
apsta : 0
wctemp : 357
cctemp : 360
mtfa : 0
hmpre : 0
hmmin : 0
tnvmcap : 7687991459840
unvmcap : 0
rpmbs : 0
edstt : 0
dsto : 0
fwug : 0
kas : 0
hctma : 0
mntmt : 0
mxtmt : 0
sanicap : 0
hmminds : 0
hmmaxd : 0
nsetidmax : 0
anatt : 0
anacap : 0
anagrpmax : 0
nanagrpid : 0
sqes : 0x66
cqes : 0x44
maxcmd : 0
nn : 128
oncs : 0x3f
fuses : 0
fna : 0x2
vwc : 0
awun : 0
awupf : 0
nvscc : 1
nwpc : 0
acwu : 0
sgls : d0001
mnan : 0
subnqn : nqn.2017-03.com.wdc:nvme-solid-state-drive. VID:1C58. MN:HUSMR7676BHP3Y1 .SN:SDM000038D42
ioccsz : 0
iorcsz : 0
icdoff : 0
ctrattr : 0
msdbd : 0
ps 0 : mp:25.00W operational enlat:15000 exlat:15000 rrt:0 rrl:0
rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:24.00W operational enlat:15000 exlat:15000 rrt:1 rrl:1
rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:23.00W operational enlat:15000 exlat:15000 rrt:2 rrl:2
rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:22.00W operational enlat:15000 exlat:15000 rrt:3 rrl:3
rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:21.00W operational enlat:15000 exlat:15000 rrt:4 rrl:4
rwt:4 rwl:4 idle_power:- active_power:-
ps 5 : mp:20.00W operational enlat:15000 exlat:15000 rrt:5 rrl:5
rwt:5 rwl:5 idle_power:- active_power:-
ps 6 : mp:19.00W operational enlat:15000 exlat:15000 rrt:6 rrl:6
rwt:6 rwl:6 idle_power:- active_power:-
ps 7 : mp:18.00W operational enlat:15000 exlat:15000 rrt:7 rrl:7
rwt:7 rwl:7 idle_power:- active_power:-
ps 8 : mp:17.00W operational enlat:15000 exlat:15000 rrt:8 rrl:8
rwt:8 rwl:8 idle_power:- active_power:-
ps 9 : mp:16.00W operational enlat:15000 exlat:15000 rrt:9 rrl:9
rwt:9 rwl:9 idle_power:- active_power:-
ps 10 : mp:15.00W operational enlat:15000 exlat:15000 rrt:10 rrl:10
rwt:10 rwl:10 idle_power:- active_power:-
ps 11 : mp:14.00W operational enlat:15000 exlat:15000 rrt:11 rrl:11
rwt:11 rwl:11 idle_power:- active_power:-
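Since nn says the controller supports 128 namespaces but no nvme0n1 block device ever appears, my next step is to ask whether any namespaces are actually attached (a sketch using nvme-cli; I haven't verified what these drives return):
Bash:
# Active (attached) namespaces only
nvme list-ns /dev/nvme0
# All allocated namespaces, including ones not attached to this controller
nvme list-ns /dev/nvme0 --all
# If namespace 1 exists, dump its identify structure
nvme id-ns /dev/nvme0 -n 1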
NVMe SMART reporting says that the drive is fine; what's strange is that the namespace ID is ffffffff:
Bash:
root@TO1-C001-PM001:~# smartctl -x /dev/nvme0
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.65-1-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: HUSMR7676BHP3Y1
Serial Number: SDM000038D42
Firmware Version: KNGND101
PCI Vendor/Subsystem ID: 0x1c58
IEEE OUI Identifier: 0x000cca
Total NVM Capacity: 7,687,991,459,840 [7.68 TB]
Unallocated NVM Capacity: 0
Controller ID: 35
Number of Namespaces: 128
Local Time is: Fri Oct 23 14:57:11 2020 EDT
Firmware Updates (0x0b): 5 Slots, Slot 1 R/O
Optional Admin Commands (0x000e): Format Frmw_DL NS_Mngmt
Optional NVM Commands (0x003f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Resv
Warning Comp. Temp. Threshold: 84 Celsius
Critical Comp. Temp. Threshold: 87 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 25.00W - - 0 0 0 0 15000 15000
1 + 24.00W - - 1 1 1 1 15000 15000
2 + 23.00W - - 2 2 2 2 15000 15000
3 + 22.00W - - 3 3 3 3 15000 15000
4 + 21.00W - - 4 4 4 4 15000 15000
5 + 20.00W - - 5 5 5 5 15000 15000
6 + 19.00W - - 6 6 6 6 15000 15000
7 + 18.00W - - 7 7 7 7 15000 15000
8 + 17.00W - - 8 8 8 8 15000 15000
9 + 16.00W - - 9 9 9 9 15000 15000
10 + 15.00W - - 10 10 10 10 15000 15000
11 + 14.00W - - 11 11 11 11 15000 15000
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 0 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 0
Data Units Written: 0
Host Read Commands: 0
Host Write Commands: 0
Controller Busy Time: 0
Power Cycles: 0
Power On Hours: 354
Unsafe Shutdowns: 0
Media and Data Integrity Errors: 0
Error Information Log Entries: 12
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 38 Celsius
Temperature Sensor 2: 32 Celsius
Temperature Sensor 3: 36 Celsius
Temperature Sensor 4: 38 Celsius
Error Information (NVMe Log 0x01, max 256 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 1 - - 0x0fa6 - 0 0 -
255 1 - - 0x0fa6 - 0 0 -
Bash:
root@TO1-C001-PM001:~# nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0
temperature : 0 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 0%
data_units_read : 0
data_units_written : 0
host_read_commands : 0
host_write_commands : 0
controller_busy_time : 0
power_cycles : 0
power_on_hours : 354
unsafe_shutdowns : 0
media_errors : 0
num_err_log_entries : 12
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 38 C
Temperature Sensor 2 : 32 C
Temperature Sensor 3 : 36 C
Temperature Sensor 4 : 38 C
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
Bash:
root@TO1-C001-PM001:~# nvme smart-log /dev/nvme1
Smart Log for NVME device:nvme1 namespace-id:ffffffff
critical_warning : 0
temperature : 0 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 0%
data_units_read : 0
data_units_written : 0
host_read_commands : 0
host_write_commands : 0
controller_busy_time : 0
power_cycles : 0
power_on_hours : 370
unsafe_shutdowns : 0
media_errors : 0
num_err_log_entries : 11
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 39 C
Temperature Sensor 2 : 33 C
Temperature Sensor 3 : 38 C
Temperature Sensor 4 : 38 C
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
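Given both controllers report as live but expose no namespaces, I'm also tempted to force a namespace rescan (a sketch; the sysfs attribute is what I believe the nvme driver exposes):
Bash:
# Ask the driver to re-enumerate namespaces on each controller
nvme ns-rescan /dev/nvme0
nvme ns-rescan /dev/nvme1
# Equivalent via sysfs
echo 1 > /sys/class/nvme/nvme0/rescan_controller
echo 1 > /sys/class/nvme/nvme1/rescan_controller
# Did any block devices appear?
ls /dev/nvme*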
If I list the nvme subsystems:
Bash:
root@TO1-C001-PM001:~# nvme list-subsys
nvme-subsys0 - NQN=nqn.2017-03.com.wdc:nvme-solid-state-drive. VID:1C58. MN:HUSMR7676BHP3Y1 .SN:SDM000038D42
\
+- nvme0 pcie 0000:03:00.0 live
nvme-subsys1 - NQN=nqn.2017-03.com.wdc:nvme-solid-state-drive. VID:1C58. MN:HUSMR7676BHP3Y1 .SN:SDM000038D2E
\
+- nvme1 pcie 0000:04:00.0 live
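Next things I plan to look at, in case this is a PCIe/controller-level problem rather than the drives themselves (rough sketch; slot addresses taken from the lspci output above, and maybe those 12 error-log entries say something useful):
Bash:
# PCIe link status and any AER error bits on the two controllers
lspci -vvs 03:00.0 | grep -E 'LnkCap|LnkSta|UESta|CESta'
lspci -vvs 04:00.0 | grep -E 'LnkCap|LnkSta|UESta|CESta'
# Dump the controller error log (num_err_log_entries was 12 above)
nvme error-log /dev/nvme0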