5.3 - Mismatch of SMART info on dual NVME system

mmassez

Member
Dec 7, 2018
Hello,

I found something strange regarding the SMART info retrieved from my Proxmox install.
I have 2 NVMe drives:
  • 1x Toshiba RC-100 240 GB as boot drive with LVM
  • 1x Samsung 970 EVO 1TB as ZFS drive
When I looked at the SMART values, they seemed strange: only 20 GB written for the Samsung, which holds the VMs, and over 600 GB for the Toshiba, which holds only the base system and some ISOs.
upload_2018-12-7_14-17-23.png

Samsung Drive:
upload_2018-12-7_14-4-29.png
Toshiba Drive:
upload_2018-12-7_14-19-5.png

However, after some digging it seems that the mapping between the controllers and the block devices is switched. The general info shown is probably read from the controllers /dev/nvme0 and /dev/nvme1, but the SMART values are taken from /dev/nvme0n1 and /dev/nvme1n1. That created the mix-up that made it look as if I had excessive writes on the boot drive and almost none on the ZFS drive.
I've added the output from smartctl below.
Code:
root@proxmox01:~# smartctl -i /dev/nvme0
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-9-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke,

=== START OF INFORMATION SECTION ===
Model Number:                       TOSHIBA-RC100
Serial Number:                      XXXXX
Firmware Version:                   ADRA0101
PCI Vendor/Subsystem ID:            0x1179
IEEE OUI Identifier:                0x00080d
Controller ID:                      0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          240,057,409,536 [240 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Fri Dec  7 14:07:02 2018 CET

root@proxmox01:~# smartctl -i /dev/nvme0n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-9-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke,

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO 1TB
Serial Number:                      XXXXX
Firmware Version:                   2B2QEXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Utilization:            399,893,221,376 [399 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Fri Dec  7 14:07:07 2018 CET

root@proxmox01:~# smartctl -i /dev/nvme1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-9-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke,

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO 1TB
Serial Number:                      XXXXX
Firmware Version:                   2B2QEXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Utilization:            399,893,426,176 [399 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Fri Dec  7 14:07:11 2018 CET

root@proxmox01:~# smartctl -i /dev/nvme1n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-9-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke,

=== START OF INFORMATION SECTION ===
Model Number:                       TOSHIBA-RC100
Serial Number:                      XXXXX
Firmware Version:                   ADRA0101
PCI Vendor/Subsystem ID:            0x1179
IEEE OUI Identifier:                0x00080d
Controller ID:                      0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          240,057,409,536 [240 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Fri Dec  7 14:07:13 2018 CET
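
The comparison above can also be done mechanically. The following is only a minimal sketch: it assumes the `smartctl -i` text has already been captured per device, and the helper names `model_number` and `mismatched_pairs` are made up for illustration. It reports every controller/namespace pair whose "Model Number" lines disagree.

```python
import re

def model_number(smartctl_info: str):
    """Return the 'Model Number' value from `smartctl -i` text, or None."""
    match = re.search(r"^Model Number:\s*(.+?)\s*$", smartctl_info, re.MULTILINE)
    return match.group(1) if match else None

def mismatched_pairs(outputs: dict):
    """outputs maps a device path to its `smartctl -i` text.

    Returns (controller, namespace_device) pairs that share a number
    (e.g. /dev/nvme0 and /dev/nvme0n1) but report different models.
    """
    pairs = []
    for dev in sorted(outputs):
        m = re.fullmatch(r"(/dev/nvme\d+)n\d+", dev)
        if m and m.group(1) in outputs:
            ctrl = m.group(1)
            if model_number(outputs[ctrl]) != model_number(outputs[dev]):
                pairs.append((ctrl, dev))
    return pairs

# Model lines copied from the smartctl output in this post.
sample = {
    "/dev/nvme0": "Model Number:                       TOSHIBA-RC100\n",
    "/dev/nvme0n1": "Model Number:                       Samsung SSD 970 EVO 1TB\n",
    "/dev/nvme1": "Model Number:                       Samsung SSD 970 EVO 1TB\n",
    "/dev/nvme1n1": "Model Number:                       TOSHIBA-RC100\n",
}
print(mismatched_pairs(sample))
```

With the values from this system, both pairs are reported as mismatched.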

The following shows the output from /sys/block. You can see that the paths are mixed up: nvme0n1 sits under nvme/nvme1, and nvme1n1 sits under nvme/nvme0. I've checked the PCI IDs as well.
Code:
root@proxmox01:~# lspci |grep -i Non-Volatile
02:00.0 Non-Volatile memory controller: Toshiba America Info Systems Device 0113 (rev 01)
65:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a808
Code:
root@proxmox01:~# ls -l /sys/block/nvme0n1
lrwxrwxrwx 1 root root 0 Dec  7 12:10 /sys/block/nvme0n1 -> ../devices/pci0000:64/0000:64:02.0/0000:65:00.0/nvme/nvme1/nvme0n1
root@proxmox01:~# ls -l /sys/block/nvme1n1
lrwxrwxrwx 1 root root 0 Dec  7 12:10 /sys/block/nvme1n1 -> ../devices/pci0000:00/0000:00:1c.2/0000:02:00.0/nvme/nvme0/nvme1n1
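
A small sketch of how the controller behind a block device could be read from these symlinks in a script. It assumes the sysfs layout shown above (`.../nvme/<controller>/<block device>`); other kernels may lay this out differently. The function names are hypothetical.

```python
import os
import re

def controller_of(sysfs_target: str):
    """Extract (controller, block_device) from a /sys/block symlink target."""
    match = re.search(r"/nvme/(nvme\d+)/(nvme\d+n\d+)$", sysfs_target)
    if match is None:
        raise ValueError(f"not an NVMe block device path: {sysfs_target}")
    return match.group(1), match.group(2)

def controller_for_block_device(name: str, sys_block: str = "/sys/block"):
    """Resolve the symlink on a live system and return the controller name."""
    return controller_of(os.path.realpath(os.path.join(sys_block, name)))[0]

# Symlink targets copied from the ls -l output above.
targets = [
    "../devices/pci0000:64/0000:64:02.0/0000:65:00.0/nvme/nvme1/nvme0n1",
    "../devices/pci0000:00/0000:00:1c.2/0000:02:00.0/nvme/nvme0/nvme1n1",
]
for target in targets:
    ctrl, blk = controller_of(target)
    print(f"{blk} is behind controller {ctrl}")
```

On a live system, `controller_for_block_device("nvme0n1")` would do the lookup directly instead of relying on the number in the device name.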

Does anybody have an idea what happened here, given that each block device is mapped to the other controller on the system? I can't find much info about this mapping on the web.

Thanks!
 
Can you try smartctl -a DEVICE
where DEVICE is /dev/nvmeXn1
and see if it matches the web interface?
 
I've checked, and it doesn't match the interface for the SMART values.
The general info on the "Disks" tab is correct, but when requesting the SMART values, they come from the other NVMe drive.

nvme0n1:
Code:
root@proxmox01:~# smartctl -a /dev/nvme0n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-9-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke,

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO 1TB
Serial Number:                      XXXX
Firmware Version:                   2B2QEXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Utilization:            400,909,557,760 [400 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Fri Dec  7 15:15:02 2018 CET
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat *Other*
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.20W       -        -    0  0  0  0        0       0
 1 +     4.30W       -        -    1  1  1  1        0       0
 2 +     2.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning:                   0x00
Temperature:                        39 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    302,402 [154 GB]
Data Units Written:                 1,216,046 [622 GB]
Host Read Commands:                 12,784,207
Host Write Commands:                26,061,912
Controller Busy Time:               29
Power Cycles:                       3
Power On Hours:                     47
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               39 Celsius
Temperature Sensor 2:               44 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
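
As a sanity check on the figures above: NVMe reports "Data Units" in units of 1000 blocks of 512 bytes, so the bracketed GB values can be recomputed by hand. A minimal sketch (the function name is made up for illustration):

```python
# One NVMe data unit is 1000 blocks of 512 bytes.
NVME_DATA_UNIT_BYTES = 512 * 1000

def data_units_to_gb(units: int) -> float:
    """Convert an NVMe 'Data Units' counter to decimal gigabytes."""
    return units * NVME_DATA_UNIT_BYTES / 1e9

# Counters copied from the smartctl -a output above.
print(data_units_to_gb(302_402))
print(data_units_to_gb(1_216_046))
```

Both results agree with the values smartctl prints in brackets once the fractional part is dropped.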

nvme1n1:
Code:
root@proxmox01:~# smartctl -a /dev/nvme1n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-9-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, 

=== START OF INFORMATION SECTION ===
Model Number:                       TOSHIBA-RC100
Serial Number:                      XXXX
Firmware Version:                   ADRA0101
PCI Vendor/Subsystem ID:            0x1179
IEEE OUI Identifier:                0x00080d
Controller ID:                      0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          240,057,409,536 [240 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Fri Dec  7 15:15:11 2018 CET
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*
Optional NVM Commands (0x0017):     Comp Wr_Unc DS_Mngmt Sav/Sel_Feat
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.30W       -        -    0  0  0  0        0       0
 1 +     2.70W       -        -    1  1  1  1        0       0
 2 +     2.30W       -        -    2  2  2  2        0       0
 3 -   0.0500W       -        -    4  4  4  4    10000   45000
 4 -   0.0050W       -        -    4  4  4  4    10000   50000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -    4096       0         0
 1 +     512       0         3

=== START OF SMART DATA SECTION ===
Read NVMe SMART/Health Information failed: NVMe Status 0x6002
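
A hedged sketch for reading the status value above. It assumes smartctl prints the status in the layout the Linux kernel uses (status code in the low 8 bits, status code type in the next 3 bits, then the "More" and "Do Not Retry" flags); the function name and the short lookup table are only for illustration.

```python
def decode_nvme_status(status: int) -> dict:
    """Split an NVMe status value into its fields (assumed Linux layout)."""
    return {
        "status_code": status & 0xFF,
        "status_code_type": (status >> 8) & 0x7,
        "more": bool(status & 0x2000),
        "do_not_retry": bool(status & 0x4000),
    }

# A few generic command status codes (status code type 0) from the NVMe spec.
GENERIC_STATUS = {
    0x00: "Successful Completion",
    0x01: "Invalid Command Opcode",
    0x02: "Invalid Field in Command",
}

fields = decode_nvme_status(0x6002)
print(fields)
print(GENERIC_STATUS.get(fields["status_code"], "unknown"))
```

Under that assumption, 0x6002 is a generic "Invalid Field in Command" with the "Do Not Retry" flag set. One plausible reading is that the drive rejects the SMART log request when it is addressed to namespace 1 instead of the controller as a whole.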

Could it be that, since the block devices are mapped as nvme0/nvme1n1 and nvme1/nvme0n1, it reads the model name from the controller (nvme0 or nvme1), but when requesting the SMART values it goes to nvme1n1 for nvme0 and nvme0n1 for nvme1, and thus mixes controller and block device for the different statistics?
 
OK, can you open a bug report here: https://bugzilla.proxmox.com/

It seems that older versions of smartctl could not handle nvmeXnY, so we dropped the namespace part,
but current versions apparently can handle it.
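
To illustrate why dropping the namespace part goes wrong on this particular system, here is a rough sketch. It is not the actual Proxmox code; `strip_namespace` is a hypothetical stand-in for "drop the nY suffix", and the mapping is taken from the /sys/block output posted earlier in this thread.

```python
import re

def strip_namespace(dev: str) -> str:
    """Drop the trailing nY namespace part, e.g. /dev/nvme0n1 -> /dev/nvme0."""
    return re.sub(r"^(/dev/nvme\d+)n\d+$", r"\1", dev)

# Controller actually behind each block device, per the /sys/block symlinks
# shown earlier in the thread.
ACTUAL_CONTROLLER = {
    "/dev/nvme0n1": "/dev/nvme1",
    "/dev/nvme1n1": "/dev/nvme0",
}

for blk, actual in ACTUAL_CONTROLLER.items():
    guessed = strip_namespace(blk)
    verdict = "ok" if guessed == actual else "wrong drive"
    print(f"{blk}: name-based guess {guessed}, actual {actual} ({verdict})")
```

With this mapping, the name-based guess lands on the other drive's controller for both block devices.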
 
Sure, I'll open a bug report. Is it sufficient if I copy the contents of both posts?
Do you happen to have an idea what is going on with the mapping between controller and block device?
I would expect that nvme0 has nvme0n1 as block device.
 
"Sure, I'll open a bug report. Is it sufficient if I copy the contents of both posts?"
Yes, this is enough.

"Do you happen to have an idea what is going on with the mapping between controller and block device?
I would expect that nvme0 has nvme0n1 as block device."
No, but could you open a second bug for this? This may be a kernel/udev bug.
 
@dcsapak I've created bug report #2020 for the SMART information in the GUI and bug report #2021 for the block id.

For the SMART values: if I use the namespace device nvme1n1 for the Toshiba, I can't get the SMART info; I only get the correct SMART info when using the controller nvme0. The Samsung gives me the correct SMART info on both controller and block device.

My guess is that this has something to do with kernel/udev too. If the naming were correct, the SMART values would probably be correct as well, since you presumably cut the nY part to get the SMART info from the controller.
 