5.3 - Mismatch of SMART info on dual NVME system

Discussion in 'Proxmox VE: Installation and configuration' started by mmassez, Dec 7, 2018.

  1. mmassez

    mmassez New Member

    Joined:
    Dec 7, 2018
    Messages:
    4
    Likes Received:
    0
    Hello,

    I found something strange regarding the SMART info retrieved from my Proxmox install.
    I have 2 NVMe drives:
    • 1x Toshiba RC-100 240 GB as boot drive with LVM
    • 1x Samsung 970 EVO 1TB as ZFS drive
    When I looked at the SMART values they seemed strange: only 20 GB written for the Samsung, which holds the VMs, and over 600 GB for the Toshiba, which holds only the base system and some ISOs.
    upload_2018-12-7_14-17-23.png

    Samsung Drive:
    upload_2018-12-7_14-4-29.png
    Toshiba Drive:
    upload_2018-12-7_14-19-5.png

    However, after some digging around it seems that the mapping between the controllers and the block devices is switched. The drive info shown is probably read from the controllers /dev/nvme0 and /dev/nvme1, but the SMART values are taken from /dev/nvme0n1 and /dev/nvme1n1. That created the mix-up that made it look like I had excessive writes on the boot drive and almost none on the ZFS drive.
    I've added the outputs from smartctl below.
    Code:
    root@proxmox01:~# smartctl -i /dev/nvme0
    smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-9-pve] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke,
    
    === START OF INFORMATION SECTION ===
    Model Number:                       TOSHIBA-RC100
    Serial Number:                      XXXXX
    Firmware Version:                   ADRA0101
    PCI Vendor/Subsystem ID:            0x1179
    IEEE OUI Identifier:                0x00080d
    Controller ID:                      0
    Number of Namespaces:               1
    Namespace 1 Size/Capacity:          240,057,409,536 [240 GB]
    Namespace 1 Formatted LBA Size:     512
    Local Time is:                      Fri Dec  7 14:07:02 2018 CET
    
    root@proxmox01:~# smartctl -i /dev/nvme0n1
    smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-9-pve] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke,
    
    === START OF INFORMATION SECTION ===
    Model Number:                       Samsung SSD 970 EVO 1TB
    Serial Number:                      XXXXX
    Firmware Version:                   2B2QEXE7
    PCI Vendor/Subsystem ID:            0x144d
    IEEE OUI Identifier:                0x002538
    Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
    Unallocated NVM Capacity:           0
    Controller ID:                      4
    Number of Namespaces:               1
    Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
    Namespace 1 Utilization:            399,893,221,376 [399 GB]
    Namespace 1 Formatted LBA Size:     512
    Local Time is:                      Fri Dec  7 14:07:07 2018 CET
    
    root@proxmox01:~# smartctl -i /dev/nvme1
    smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-9-pve] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke,
    
    === START OF INFORMATION SECTION ===
    Model Number:                       Samsung SSD 970 EVO 1TB
    Serial Number:                      XXXXX
    Firmware Version:                   2B2QEXE7
    PCI Vendor/Subsystem ID:            0x144d
    IEEE OUI Identifier:                0x002538
    Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
    Unallocated NVM Capacity:           0
    Controller ID:                      4
    Number of Namespaces:               1
    Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
    Namespace 1 Utilization:            399,893,426,176 [399 GB]
    Namespace 1 Formatted LBA Size:     512
    Local Time is:                      Fri Dec  7 14:07:11 2018 CET
    
    root@proxmox01:~# smartctl -i /dev/nvme1n1
    smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-9-pve] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke,
    
    === START OF INFORMATION SECTION ===
    Model Number:                       TOSHIBA-RC100
    Serial Number:                      XXXXX
    Firmware Version:                   ADRA0101
    PCI Vendor/Subsystem ID:            0x1179
    IEEE OUI Identifier:                0x00080d
    Controller ID:                      0
    Number of Namespaces:               1
    Namespace 1 Size/Capacity:          240,057,409,536 [240 GB]
    Namespace 1 Formatted LBA Size:     512
    Local Time is:                      Fri Dec  7 14:07:13 2018 CET
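    To compare the four device nodes at a glance, the model can be pulled out of each smartctl -i call. This is a minimal sketch, assuming smartctl prints a "Model Number:" line as in the output above. The names model_of and print_models are hypothetical helpers, and the device list is hard-coded for this two-drive system.

```shell
# Extract the "Model Number" field from `smartctl -i` output read on stdin.
model_of() {
    awk -F': +' '/Model Number:/ { print $2 }'
}

# Print the model reported by each controller node and each namespace node.
# The device list is an assumption that matches this two-drive system only.
print_models() {
    for dev in /dev/nvme0 /dev/nvme0n1 /dev/nvme1 /dev/nvme1n1; do
        printf '%-14s %s\n' "$dev" "$(smartctl -i "$dev" | model_of)"
    done
}
```

    Calling print_models as root should list one model per node. Going by the output above, /dev/nvme0 and /dev/nvme1n1 would report the Toshiba, and /dev/nvme1 and /dev/nvme0n1 the Samsung.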
    The following shows the output from /sys/block, where you can see that the paths are mixed up: nvme0n1 sits under nvme/nvme1 and nvme1n1 under nvme/nvme0. I've checked the PCI IDs as well.
    Code:
    root@proxmox01:~# lspci |grep -i Non-Volatile
    02:00.0 Non-Volatile memory controller: Toshiba America Info Systems Device 0113 (rev 01)
    65:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a808
    Code:
    root@proxmox01:~# ls -l /sys/block/nvme0n1
    lrwxrwxrwx 1 root root 0 Dec  7 12:10 /sys/block/nvme0n1 -> ../devices/pci0000:64/0000:64:02.0/0000:65:00.0/nvme/nvme1/nvme0n1
    root@proxmox01:~# ls -l /sys/block/nvme1n1
    lrwxrwxrwx 1 root root 0 Dec  7 12:10 /sys/block/nvme1n1 -> ../devices/pci0000:00/0000:00:1c.2/0000:02:00.0/nvme/nvme0/nvme1n1
    Does anybody have an idea of what happened here, since each block device is mapped to the other controller on the system? I can't find much info about the mapping on the web.
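    The controller that owns a namespace can be read from the sysfs path itself rather than guessed from the device name. This is a minimal sketch, assuming the sysfs layout shown above, where the namespace directory sits directly inside its controller directory. Other kernel versions or setups, for example with NVMe multipath, may lay this out differently. controller_of_path and list_nvme_mapping are hypothetical helper names.

```shell
# Given the target of a /sys/block/<namespace> symlink, print the name of
# the controller directory that contains the namespace.
controller_of_path() {
    basename "$(dirname "$1")"
}

# List every NVMe namespace together with the controller sysfs places it under.
list_nvme_mapping() {
    for ns in /sys/block/nvme*n*; do
        [ -e "$ns" ] || continue    # the glob matched nothing
        printf '%s -> %s\n' "${ns##*/}" \
            "$(controller_of_path "$(readlink -f "$ns")")"
    done
}
```

    For the two link targets above, controller_of_path returns nvme1 for nvme0n1 and nvme0 for nvme1n1.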

    Thanks!
     
  2. dcsapak

    dcsapak Proxmox Staff Member
    Staff Member

    Joined:
    Feb 1, 2016
    Messages:
    2,971
    Likes Received:
    268
    Can you try smartctl -a DEVICE,
    where DEVICE is /dev/nvmeXn1,
    and see if it matches the web interface?
     
  3. mmassez

    I've checked and the output doesn't match the SMART values in the interface.
    The general info on the "Disks" tab is correct, but when requesting the SMART values they come from the other NVMe drive.

    nvme0n1:
    Code:
    root@proxmox01:~# smartctl -a /dev/nvme0n1
    smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-9-pve] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke,
    
    === START OF INFORMATION SECTION ===
    Model Number:                       Samsung SSD 970 EVO 1TB
    Serial Number:                      XXXX
    Firmware Version:                   2B2QEXE7
    PCI Vendor/Subsystem ID:            0x144d
    IEEE OUI Identifier:                0x002538
    Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
    Unallocated NVM Capacity:           0
    Controller ID:                      4
    Number of Namespaces:               1
    Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
    Namespace 1 Utilization:            400,909,557,760 [400 GB]
    Namespace 1 Formatted LBA Size:     512
    Local Time is:                      Fri Dec  7 15:15:02 2018 CET
    Firmware Updates (0x16):            3 Slots, no Reset required
    Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*
    Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat *Other*
    Maximum Data Transfer Size:         512 Pages
    Warning  Comp. Temp. Threshold:     85 Celsius
    Critical Comp. Temp. Threshold:     85 Celsius
    
    Supported Power States
    St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
     0 +     6.20W       -        -    0  0  0  0        0       0
     1 +     4.30W       -        -    1  1  1  1        0       0
     2 +     2.10W       -        -    2  2  2  2        0       0
     3 -   0.0400W       -        -    3  3  3  3      210    1200
     4 -   0.0050W       -        -    4  4  4  4     2000    8000
    
    Supported LBA Sizes (NSID 0x1)
    Id Fmt  Data  Metadt  Rel_Perf
     0 +     512       0         0
    
    === START OF SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    SMART/Health Information (NVMe Log 0x02, NSID 0x1)
    Critical Warning:                   0x00
    Temperature:                        39 Celsius
    Available Spare:                    100%
    Available Spare Threshold:          10%
    Percentage Used:                    0%
    Data Units Read:                    302,402 [154 GB]
    Data Units Written:                 1,216,046 [622 GB]
    Host Read Commands:                 12,784,207
    Host Write Commands:                26,061,912
    Controller Busy Time:               29
    Power Cycles:                       3
    Power On Hours:                     47
    Unsafe Shutdowns:                   0
    Media and Data Integrity Errors:    0
    Error Information Log Entries:      0
    Warning  Comp. Temperature Time:    0
    Critical Comp. Temperature Time:    0
    Temperature Sensor 1:               39 Celsius
    Temperature Sensor 2:               44 Celsius
    
    Error Information (NVMe Log 0x01, max 64 entries)
    No Errors Logged
    
    nvme1n1:
    Code:
    root@proxmox01:~# smartctl -a /dev/nvme1n1
    smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-9-pve] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, 
    
    === START OF INFORMATION SECTION ===
    Model Number:                       TOSHIBA-RC100
    Serial Number:                      XXXX
    Firmware Version:                   ADRA0101
    PCI Vendor/Subsystem ID:            0x1179
    IEEE OUI Identifier:                0x00080d
    Controller ID:                      0
    Number of Namespaces:               1
    Namespace 1 Size/Capacity:          240,057,409,536 [240 GB]
    Namespace 1 Formatted LBA Size:     512
    Local Time is:                      Fri Dec  7 15:15:11 2018 CET
    Firmware Updates (0x12):            1 Slot, no Reset required
    Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*
    Optional NVM Commands (0x0017):     Comp Wr_Unc DS_Mngmt Sav/Sel_Feat
    Maximum Data Transfer Size:         512 Pages
    Warning  Comp. Temp. Threshold:     82 Celsius
    Critical Comp. Temp. Threshold:     85 Celsius
    
    Supported Power States
    St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
     0 +     3.30W       -        -    0  0  0  0        0       0
     1 +     2.70W       -        -    1  1  1  1        0       0
     2 +     2.30W       -        -    2  2  2  2        0       0
     3 -   0.0500W       -        -    4  4  4  4    10000   45000
     4 -   0.0050W       -        -    4  4  4  4    10000   50000
    
    Supported LBA Sizes (NSID 0x1)
    Id Fmt  Data  Metadt  Rel_Perf
     0 -    4096       0         0
     1 +     512       0         3
    
    === START OF SMART DATA SECTION ===
    Read NVMe SMART/Health Information failed: NVMe Status 0x6002
    Could it be that, since the block devices are mapped as /dev/nvme0/nvme1n1 and /dev/nvme1/nvme0n1, the model name is read from the controller (nvme0 or nvme1), but the SMART values are requested from nvme1n1 for nvme0 and from nvme0n1 for nvme1, so that controller and block device get mixed up for the different statistics?
     
  4. dcsapak

    OK, can you open a bug report here: https://bugzilla.proxmox.com/

    It seems that older versions of smartctl could not handle nvmeXnY, so we dropped the namespace part,
    but current versions apparently can handle it.
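    Why dropping the namespace part goes wrong on this system can be shown with a naive name-based mapping. This is only an illustrative sketch, not the actual Proxmox code. strip_namespace is a hypothetical helper that removes the nY suffix (and any partition suffix) from a device name.

```shell
# Naive mapping: drop the namespace (and any partition) suffix from a device
# name, e.g. /dev/nvme0n1 -> /dev/nvme0. Hypothetical stand-in for "dropping
# the namespace part"; it assumes nvmeXnY always belongs to controller nvmeX.
strip_namespace() {
    printf '%s\n' "${1%n[0-9]*}"
}
```

    Here strip_namespace /dev/nvme0n1 yields /dev/nvme0, which is the Toshiba according to the smartctl -i output in the first post, while sysfs places nvme0n1 (the Samsung) under nvme1. The name-based assumption therefore picks the wrong controller for both drives.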
     
  5. mmassez

    Sure, I'll open a bug report. Is it sufficient if I copy the contents of both posts?
    Do you happen to have an idea what is going on with the mapping between controller and block device?
    I would expect nvme0 to have nvme0n1 as its block device.
     
    #5 mmassez, Dec 7, 2018
    Last edited: Dec 7, 2018
  6. dcsapak

    Yes, this is enough.

    No, but could you open a second bug for this? This may be a kernel/udev bug.
     
  7. mmassez

    @dcsapak I've created bug report #2020 for the SMART information in the GUI and bug report #2021 for the block ID.

    For the SMART values: if I use the namespace device nvme1n1 for the Toshiba I can't get the SMART info. I only get the correct SMART info when using the controller nvme0. The Samsung gives me the correct SMART info on both the controller and the block device.

    My guess is that this has something to do with kernel/udev too. If the naming were consistent, the SMART values would probably be correct as well, since you presumably cut the nY part to get the SMART info from the controller.
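    Until the naming is sorted out, one possible workaround is to query the namespace device first and fall back to its real controller when that fails. This is a minimal sketch, assuming smartctl exits with a non-zero status when the SMART log cannot be read (not verified here for version 6.6). A non-zero status can also have other causes, so treat this as a sketch. smart_with_fallback and the SMARTCTL variable are hypothetical names.

```shell
# Command used to query SMART data; can be overridden to test without hardware.
SMARTCTL="${SMARTCTL:-smartctl}"

# Try the namespace device first, then fall back to the controller device.
# $1: namespace device, e.g. /dev/nvme1n1
# $2: the controller that owns it according to sysfs, e.g. /dev/nvme0
smart_with_fallback() {
    if out=$("$SMARTCTL" -A "$1" 2>/dev/null); then
        printf '%s\n' "$out"
    else
        "$SMARTCTL" -A "$2"
    fi
}
```

    For the Toshiba on this system the call would be smart_with_fallback /dev/nvme1n1 /dev/nvme0, run as root.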
     
    #7 mmassez, Dec 7, 2018
    Last edited: Dec 7, 2018