NVMe disappears after a few hours, can only be restored by a physical reboot

naisam

New Member
Jan 22, 2024
I have Proxmox installed on a Dell OptiPlex.
Recently the NVMe drive has been disappearing after a few hours, and the only way to bring it back is a physical reboot.

Code:
root@pve:~# smartctl -a /dev/nvme0
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.2.16-6-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       XPG GAMMIX S70 BLADE
Serial Number:                      2N242L1J4KJA
Firmware Version:                   3.2.F.83
PCI Vendor/Subsystem ID:            0x1cc1
IEEE OUI Identifier:                0x707c18
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            707c18 242010400a
Local Time is:                      Tue Jan 23 12:28:28 2024 NZDT
Firmware Updates (0x0e):            7 Slots
Optional Admin Commands (0x0016):   Format Frmw_DL Self_Test
Optional NVM Commands (0x0014):     DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x0e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     100 Celsius
Critical Comp. Temp. Threshold:     110 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W       -        -    0  0  0  0        5       5
 1 +     3.30W       -        -    1  1  1  1       50     100
 2 +     2.80W       -        -    2  2  2  2       50     200
 3 -   0.1700W       -        -    3  3  3  3      500    7500
 4 -   0.0200W       -        -    4  4  4  4     2000   70000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        51 Celsius
Available Spare:                    98%
Available Spare Threshold:          25%
Percentage Used:                    0%
Data Units Read:                    2,499,837,497 [1.27 PB]
Data Units Written:                 1,971,315 [1.00 TB]
Host Read Commands:                 35,663,715,100
Host Write Commands:                112,663,478
Controller Busy Time:               540
Power Cycles:                       78
Power On Hours:                     931
Unsafe Shutdowns:                   41
Media and Data Integrity Errors:    41
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               51 Celsius
Thermal Temp. 1 Transition Count:   176
Thermal Temp. 1 Total Time:         39422

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

free(): invalid pointer
Aborted
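As an aside on reading that SMART output: the "Data Units Read" counter is defined by the NVMe spec in units of 1000 × 512 bytes, so the ~2.5 billion units above really do convert to more than a petabyte read (consistent with the ~1.27 PB smartctl prints). A quick sanity check of the arithmetic:

```shell
# NVMe "Data Units Read" is reported in thousands of 512-byte units
# (per the NVMe spec), so bytes = units * 512 * 1000.
units_read=2499837497
bytes=$((units_read * 512 * 1000))
echo "bytes read: $bytes"
```

That is on the order of 1.28 × 10^15 bytes, i.e. roughly the 1.27 PB smartctl shows.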

When the NVMe disappears, this is the output of lvs:
Code:
 LV            VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  data          pve twi-aotz-- 360.68g             17.17  1.03                           
  root          pve -wi-ao----  96.00g                                                   
  swap          pve -wi-ao----   8.00g                                                   
  vm-215-disk-0 pve Vwi-a-tz--   4.00m data        14.06                                 
  vm-215-disk-1 pve Vwi-a-tz-- 200.00g data        30.96                                 
  vm-215-disk-2 pve Vwi-a-tz--   4.00m data        1.56
Output of pvs:
Code:
  PV         VG  Fmt  Attr PSize    PFree
  /dev/sda3  pve lvm2 a--  <488.05g 16.00g

As soon as I physically reboot the system, everything comes up normally for a few hours.

Output of lvs when rebooted:
Code:
  LV            VG            Attr       LSize   Pool          Origin Data%  Meta%  Move Log Cpy%Sync Convert
  data          pve           twi-aotz-- 360.68g                      17.27  1.03                           
  root          pve           -wi-ao----  96.00g                                                             
  swap          pve           -wi-ao----   8.00g                                                             
  vm-215-disk-0 pve           Vwi-aotz--   4.00m data                 14.06                                 
  vm-215-disk-1 pve           Vwi-aotz-- 200.00g data                 31.14                                 
  vm-215-disk-2 pve           Vwi-aotz--   4.00m data                 1.56                                   
  vm-212-disk-0 vmstoragenvme Vwi-aotz-- 200.00g vmstoragenvme        18.92                                 
  vm-213-disk-0 vmstoragenvme Vwi-aotz-- 100.00g vmstoragenvme        38.02                                 
  vmstoragenvme vmstoragenvme twi-aotz--   1.83t                      4.04   0.31

Output of pvs when rebooted:
Code:
  PV           VG            Fmt  Attr PSize    PFree 
  /dev/nvme0n1 vmstoragenvme lvm2 a--     1.86t 376.00m
  /dev/sda3    pve           lvm2 a--  <488.05g  16.00g

Could someone please help with what else I should check?
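One thing that may help pin down *when* the device drops off is logging its presence periodically, so the timestamp can be correlated with dmesg afterwards. A minimal sketch (the `check_dev` helper and log path are my own, hypothetical names):

```shell
#!/bin/sh
# Hypothetical helper: print a timestamped line stating whether the
# given device node currently exists. Run it from cron or a loop so
# the moment of disappearance can be matched against dmesg.
check_dev() {
    dev="$1"
    if [ -e "$dev" ]; then
        echo "$(date -Is) $dev present"
    else
        echo "$(date -Is) $dev MISSING"
    fi
}

# Example one-shot check; for continuous logging something like:
#   while true; do check_dev /dev/nvme0n1 >> /var/log/nvme-watch.log; sleep 60; done
check_dev /dev/nvme0n1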
Output of lsblk when operating normally:
Code:
NAME                                  MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                                     8:0    0   489G  0 disk
├─sda1                                  8:1    0  1007K  0 part
├─sda2                                  8:2    0     1G  0 part /boot/efi
└─sda3                                  8:3    0   488G  0 part
  ├─pve-swap                          253:2    0     8G  0 lvm  [SWAP]
  ├─pve-root                          253:3    0    96G  0 lvm  /
  ├─pve-data_tmeta                    253:4    0   3.7G  0 lvm 
  │ └─pve-data-tpool                  253:6    0 360.7G  0 lvm 
  │   ├─pve-data                      253:7    0 360.7G  1 lvm 
  │   ├─pve-vm--215--disk--0          253:8    0     4M  0 lvm 
  │   ├─pve-vm--215--disk--1          253:9    0   200G  0 lvm 
  │   └─pve-vm--215--disk--2          253:10   0     4M  0 lvm 
  └─pve-data_tdata                    253:5    0 360.7G  0 lvm 
    └─pve-data-tpool                  253:6    0 360.7G  0 lvm 
      ├─pve-data                      253:7    0 360.7G  1 lvm 
      ├─pve-vm--215--disk--0          253:8    0     4M  0 lvm 
      ├─pve-vm--215--disk--1          253:9    0   200G  0 lvm 
      └─pve-vm--215--disk--2          253:10   0     4M  0 lvm 
nvme0n1                               259:0    0   1.9T  0 disk
├─vmstoragenvme-vmstoragenvme_tmeta   253:0    0  15.9G  0 lvm 
│ └─vmstoragenvme-vmstoragenvme-tpool 253:11   0   1.8T  0 lvm 
│   ├─vmstoragenvme-vmstoragenvme     253:12   0   1.8T  1 lvm 
│   ├─vmstoragenvme-vm--213--disk--0  253:13   0   100G  0 lvm 
│   └─vmstoragenvme-vm--212--disk--0  253:14   0   200G  0 lvm 
└─vmstoragenvme-vmstoragenvme_tdata   253:1    0   1.8T  0 lvm 
  └─vmstoragenvme-vmstoragenvme-tpool 253:11   0   1.8T  0 lvm 
    ├─vmstoragenvme-vmstoragenvme     253:12   0   1.8T  1 lvm 
    ├─vmstoragenvme-vm--213--disk--0  253:13   0   100G  0 lvm 
    └─vmstoragenvme-vm--212--disk--0  253:14   0   200G  0 lvm

I also see these errors in dmesg:
Code:
[  993.219376] nvme0n1: I/O Cmd(0x2) @ LBA 94507248, 8 blocks, I/O Error (sct 0x2 / sc 0x81)
[  993.219382] critical medium error, dev nvme0n1, sector 94507248 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[  993.220064] nvme0n1: I/O Cmd(0x2) @ LBA 94507248, 8 blocks, I/O Error (sct 0x2 / sc 0x81)
[  993.220069] critical medium error, dev nvme0n1, sector 94507248 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 1155.743628] nvme0n1: I/O Cmd(0x2) @ LBA 94493952, 104 blocks, I/O Error (sct 0x2 / sc 0x81)
[ 1155.743638] critical medium error, dev nvme0n1, sector 94493952 op 0x0:(READ) flags 0x0 phys_seg 13 prio class 2
[ 1155.779358] nvme0n1: I/O Cmd(0x2) @ LBA 94507264, 80 blocks, I/O Error (sct 0x2 / sc 0x81)
[ 1155.779369] critical medium error, dev nvme0n1, sector 94507264 op 0x0:(READ) flags 0x0 phys_seg 10 prio class 2
[ 1155.781878] nvme0n1: I/O Cmd(0x2) @ LBA 94507136, 128 blocks, I/O Error (sct 0x2 / sc 0x81)
[ 1155.781889] critical medium error, dev nvme0n1, sector 94507136 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 2
[ 1155.791401] nvme0n1: I/O Cmd(0x2) @ LBA 94507248, 8 blocks, I/O Error (sct 0x2 / sc 0x81)
[ 1155.791411] critical medium error, dev nvme0n1, sector 94507248 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 1155.792359] nvme0n1: I/O Cmd(0x2) @ LBA 94507248, 8 blocks, I/O Error (sct 0x2 / sc 0x81)
[ 1155.792368] critical medium error, dev nvme0n1, sector 94507248 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
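For what it's worth, those medium errors can be summarised quickly to see whether they cluster. Here is a small illustration using the dmesg lines above pasted into a variable (on the live system, `dmesg | grep -o 'LBA [0-9]*' | awk '{print $2}' | sort -n | uniq` does the same):

```shell
# The dmesg lines from above, held in a variable for illustration.
log='[  993.219376] nvme0n1: I/O Cmd(0x2) @ LBA 94507248, 8 blocks, I/O Error (sct 0x2 / sc 0x81)
[ 1155.743628] nvme0n1: I/O Cmd(0x2) @ LBA 94493952, 104 blocks, I/O Error (sct 0x2 / sc 0x81)
[ 1155.779358] nvme0n1: I/O Cmd(0x2) @ LBA 94507264, 80 blocks, I/O Error (sct 0x2 / sc 0x81)
[ 1155.781878] nvme0n1: I/O Cmd(0x2) @ LBA 94507136, 128 blocks, I/O Error (sct 0x2 / sc 0x81)'

# Extract the distinct failing LBAs, sorted numerically.
lbas=$(printf '%s\n' "$log" | grep -o 'LBA [0-9]*' | awk '{print $2}' | sort -n | uniq)
printf '%s\n' "$lbas"
```

All the failures fall between LBA 94493952 and 94507264+80, a span of only about 13k sectors (~6.5 MB), which suggests a localized bad region on the media rather than a whole-controller problem.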
 
I'm starting to see similar behavior with a new NVMe SSD. Did you get it fixed, or was the drive toast?