PVE 7.1: New SMART Error on every hard disk at every reboot?

Apr 4, 2020
Hello all,

Since I upgraded to Proxmox 7.x, I have observed the following error:
at each normal reboot of my server, the number of SMART errors on each of the 5 installed disks increases by 1.


Code:
The following warning/error was logged by the smartd daemon:
Device: /dev/nvme1, number of Error Log entries increased from 33 to 34
Device info:
SAMSUNG MZQLW960HMJP-00003, S/N:S35XNB0Jxxxx, FW:CXV8601Q, 960 GB

Code:
uname -a
Linux 5.13.19-2-pve #1 SMP PVE 5.13.19-4 (Mon, 29 Nov 2021 12:10:09 +0100) x86_64 GNU/Linux

I found the following Debian bug report. Could this be the root cause and will a possible fix be implemented by the PVE team?
https://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg1823812.html


Code:
root@pegasus ~ # nvme smart-log /dev/nvme1n1
Smart Log for NVME device:nvme1n1 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 38 C
available_spare                         : 100%
available_spare_threshold               : 10%
percentage_used                         : 1%
endurance group critical warning summary: 0
data_units_read                         : 38,829,632
data_units_written                      : 72,275,936
host_read_commands                      : 10,143,470,680
host_write_commands                     : 3,643,068,506
controller_busy_time                    : 2,615
power_cycles                            : 26
power_on_hours                          : 25,940
unsafe_shutdowns                        : 9
media_errors                            : 0
num_err_log_entries                     : 34
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Temperature Sensor 1           : 38 C
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0
root@pegasus ~ # nvme error-log /dev/nvme1n1
Error Log Entries for device:nvme1n1 entries:64
.................
 Entry[ 0]
.................
error_count     : 34
sqid            : 0
cmdid           : 0x1006
status_field    : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc    : 0xffff
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[ 1]
.................
error_count     : 33
sqid            : 0
cmdid           : 0x201c
status_field    : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc    : 0xffff
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[ 2]
.................
error_count     : 32
sqid            : 0
cmdid           : 0x1016
status_field    : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc    : 0xffff
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[ 3]
.................
error_count     : 31
sqid            : 0
cmdid           : 0x201c
status_field    : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc    : 0xffff
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[ 4]
.................
error_count     : 30
sqid            : 0
cmdid           : 0xe
status_field    : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc    : 0xffff
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[ 5]
.................
error_count     : 29
sqid            : 0
cmdid           : 0xa
status_field    : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc    : 0xffff
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[ 6]
.................
error_count     : 28
sqid            : 0
cmdid           : 0x16
status_field    : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc    : 0xffff
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[ 7]
.................
error_count     : 27
sqid            : 0
cmdid           : 0x2
status_field    : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc    : 0xffff
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[ 8]
.................
error_count     : 26
sqid            : 0
cmdid           : 0x1b
status_field    : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc    : 0x28
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[ 9]
.................
error_count     : 25
sqid            : 0
cmdid           : 0x1b
status_field    : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc    : 0x28
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[10]
.................
error_count     : 24
sqid            : 0
cmdid           : 0x1b
status_field    : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc    : 0x28
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[11]

[...]

.................
 Entry[34]
.................
error_count     : 0
sqid            : 0
cmdid           : 0
status_field    : 0(SUCCESS: The command completed successfully)
parm_err_loc    : 0
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[35]
.................
error_count     : 0
sqid            : 0
cmdid           : 0
status_field    : 0(SUCCESS: The command completed successfully)
parm_err_loc    : 0
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
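
For anyone comparing drives, the long listing above can be condensed into a per-status count. This is only a minimal sketch: the function name `summarize_error_log` is made up for illustration, and it assumes the plain-text layout of `nvme error-log` shown in this thread (an `Entry[n]` line followed by `key : value` lines, with decimal `error_count` values). Other nvme-cli versions may print the log differently.

```python
import re
from collections import Counter

# "key : value" lines as printed in the listing above (assumed layout).
FIELD_RE = re.compile(r"^\s*(\w+)\s*:\s*(.*)$")
# status_field looks like "0x4004(INVALID_FIELD: ...)" or "0(SUCCESS: ...)".
STATUS_RE = re.compile(r"^(0x[0-9a-fA-F]+|\d+)\(([^:)]+)")


def summarize_error_log(text):
    """Count used error-log entries by (status name, parm_err_loc).

    Entries whose error_count is 0 are empty slots and are skipped.
    """
    counts = Counter()
    entry = {}

    def flush():
        # Record the entry collected so far, if it is a real error.
        if int(entry.get("error_count", "0")) > 0:
            m = STATUS_RE.match(entry.get("status_field", ""))
            name = m.group(2).strip() if m else "unknown"
            counts[(name, entry.get("parm_err_loc", "?"))] += 1

    for line in text.splitlines():
        if line.strip().startswith("Entry["):
            flush()          # close the previous entry
            entry = {}       # start a new one
            continue
        m = FIELD_RE.match(line)
        if m:
            entry[m.group(1)] = m.group(2).strip()
    flush()                  # close the last entry
    return counts
```

Feeding it the captured output of `nvme error-log /dev/nvme1n1` should show at a glance whether every entry is the same "invalid field" status, as in the listing above, or whether other error types are mixed in.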
 
Have you checked your NVMes for newer firmware versions?
 
Thank you for your answer.

Yes, I checked, all SSDs have the latest firmware available. I only use Samsung Enterprise SSD type PM983 and PM963. Both are showing the problem when using PVE. For reference, with Ubuntu 20.04 and Ubuntu 21.10 the problem does not occur.
 
It seems this is an incompatibility between the kernel nvme interface and the NVMe. A future firmware update might fix this issue.
 
Same here with an older Samsung 950 Pro. No negative impact detected, other than an increasing counter.
 
Same here using a Samsung PM983 (and Samsung 970 Evo Plus), PVE 7.1-10, Kernel version Linux 5.15.19-1-pve #1 SMP PVE 5.15.19-1.
 
This is not a firmware issue, if it was Samsung would have fixed it by now. The "Error Information Log Entries" has been there since forever for Samsung SSDs and it has nothing to do with drive health. I have never faced any errors regarding this on any other OS or Hypervisor.
 
This is not a firmware issue, if it was Samsung would have fixed it by now.
While your conclusion is right, the premise is not. Samsung, like most other hardware vendors, is not big on shipping updates after the product has been sold, especially if it's just for an inconvenience, so I'd never use that as proof.

I have never faced any errors regarding this on any other OS or Hypervisor.
As the reports linked above show, it's present on at least three distros that together make up the biggest share of Linux distros in use on servers (ref). Did you even try that same drive on another OS, especially one that actually ships a new enough smartmontools? If the OS doesn't have any NVMe drive health monitoring at all, this bug naturally cannot appear.

Anyhow, this is not adding any new or useful information. The upstream ticket for handling this behavior in smartmontools, to avoid such false positives, is still open here:
https://www.smartmontools.org/ticket/1222
 
The issue seems to be fixed now.

pve-kernel-5.15.19-2-pve/stable,now 5.15.19-3 amd64
smartmontools/now 7.2-pve2 amd64
 
I was facing this issue with kernel `5.15.104-1-pve`, using Samsung SSD 960 PRO. I just upgraded to `6.2.6-1-pve`, but the issue persists.
 
Same here on a new server with Proxmox 8.0 beta, kernel 6.2.16-1-pve: the error count keeps growing.

root@pbs-host:~# nvme error-log /dev/nvme0n1 |head -n 30
Error Log Entries for device:nvme0n1 entries:64
.................
Entry[ 0]
.................
error_count : 43
sqid : 0
cmdid : 0x8
status_field : 0x6002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag : 0
parm_err_loc : 0x28
lba : 0xffffffffffffffff
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0xffffffffffffffff
trtype_spec_info: 0
.................
Entry[ 1]
.................
error_count : 42
sqid : 0
cmdid : 0x1000
status_field : 0x6002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag : 0
parm_err_loc : 0x2b
lba : 0xffffffffffffffff
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0xffffffffffffffff
 
Can you provide the exact model of your NVMe?
 
I can reproduce the problem on Proxmox VE 8.0-BETA-1 using a Samsung SSD 970 (S/N:S462NF0K804667F, FW:1B2QEXP7, 1.02 TB). The error comes from the drive not supporting TP 4056 ("Namespace Types"), see github.com/linux-nvme/nvme-cli/issues/1142.

There is also this discussion at the kernel issue tracker.

To check whether this applies to your drive, you can follow the GitHub thread above and run, for example:

Code:
python3 -c 'print((0x303c033fff >> 37) & 0xf == 0x1)'

where 303c033fff comes from the cap field in the output of nvme show-regs YOUR_NVME.
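
The one-liner can also be spelled out a bit more verbosely. The sketch below is not from the GitHub thread: the function name `decode_cap_css` is made up, and the bit meanings are my reading of the NVMe base specification (CAP bits 44:37 are the "Command Sets Supported" field), so please verify them against the spec before relying on them. The `thread_check` value reproduces the exact expression from the one-liner.

```python
def decode_cap_css(cap: int) -> dict:
    """Decode the Command Sets Supported (CSS) field of the CAP register.

    Assumption: CSS occupies CAP bits 44:37, with
      bit 37 -> NVM command set supported,
      bit 43 -> one or more I/O command sets supported,
      bit 44 -> no I/O command set (admin only).
    """
    css = (cap >> 37) & 0xFF  # full 8-bit CSS field
    return {
        "css_raw": css,
        "nvm_command_set": bool(css & 0x01),
        "io_command_sets": bool(css & 0x40),
        "admin_only": bool(css & 0x80),
        # Same expression as the one-liner: low four bits of CSS equal 0x1.
        "thread_check": (cap >> 37) & 0xF == 0x1,
    }


if __name__ == "__main__":
    # Replace the value with the cap field reported for your own drive.
    print(decode_cap_css(0x303C033FFF))
```

For the cap value quoted above, `thread_check` comes out true and only the NVM command set bit is set.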
 