NVMe drive disappeared

da-alb

Hi,

This morning one of my NVMe drives "disappeared" from the server, but the server's iLO reports it as online and healthy.

The drive is a WDC Gold SN600 and it shows up as /dev/nvme1n1. I'm using it in a Ceph pool.
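(In case it's relevant, this is roughly how I match the device to its Ceph OSD — just a sketch, output omitted here:)
Bash:
# List the OSDs prepared with ceph-volume and the LVM/block devices backing them
ceph-volume lvm list

# Check whether the corresponding OSD is still reported as up/in by the cluster
ceph osd tree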

Here is my pveversion -v:
Code:
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph: 15.2.8-pve2
ceph-fuse: 15.2.8-pve2
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-4
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-4
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

Output of lsblk -l (rbd devices omitted):
Code:
NAME                                                                                                  MAJ:MIN  RM   SIZE RO TYPE MOUNTPOINT
sda                                                                                                     8:0     0 447.1G  0 disk
├─sda1                                                                                                  8:1     0  1007K  0 part
├─sda2                                                                                                  8:2     0   512M  0 part /boot/efi
└─sda3                                                                                                  8:3     0 446.6G  0 part
  ├─pve-swap                                                                                          253:1     0     8G  0 lvm  [SWAP]
  ├─pve-root                                                                                          253:2     0    96G  0 lvm  /
  ├─pve-data_tmeta                                                                                    253:3     0   3.3G  0 lvm
  │ └─pve-data                                                                                        253:5     0 320.1G  0 lvm
  └─pve-data_tdata                                                                                    253:4     0 320.1G  0 lvm
    └─pve-data                                                                                        253:5     0 320.1G  0 lvm
nvme0n1                                                                                               259:0     0   1.8T  0 disk
└─nvme0n1p1                                                                                           259:5     0   1.8T  0 part
  ├─vg--local3-lv--local3_tmeta                                                                       253:0     0   112M  0 lvm
  │ └─vg--local3-lv--local3-tpool                                                                     253:7     0   1.8T  0 lvm
  │   ├─vg--local3-lv--local3                                                                         253:12    0   1.8T  0 lvm
  │   ├─vg--local3-vm--6000--disk--0                                                                  253:13    0    20G  0 lvm
  │   ├─vg--local3-vm--6006--disk--0                                                                  253:14    0    20G  0 lvm
  │   ├─vg--local3-vm--6007--disk--0                                                                  253:15    0    20G  0 lvm
  │   ├─vg--local3-vm--6010--disk--0                                                                  253:16    0    20G  0 lvm
  │   ├─vg--local3-vm--6011--disk--0                                                                  253:17    0    20G  0 lvm
  │   ├─vg--local3-vm--6013--disk--0                                                                  253:18    0    20G  0 lvm
  │   ├─vg--local3-vm--6026--disk--2                                                                  253:19    0    15G  0 lvm
  │   ├─vg--local3-vm--6004--disk--0                                                                  253:20    0    20G  0 lvm
  │   ├─vg--local3-vm--6018--disk--0                                                                  253:21    0    20G  0 lvm
  │   ├─vg--local3-vm--6023--disk--0                                                                  253:22    0    20G  0 lvm
  │   ├─vg--local3-vm--6025--disk--0                                                                  253:23    0    20G  0 lvm
  │   ├─vg--local3-vm--6025--disk--1                                                                  253:24    0    25G  0 lvm
  │   ├─vg--local3-vm--6026--disk--0                                                                  253:25    0    20G  0 lvm
  │   ├─vg--local3-vm--6026--disk--1                                                                  253:26    0    60G  0 lvm
  │   ├─vg--local3-vm--1003--disk--0                                                                  253:27    0    20G  0 lvm
  │   └─vg--local3-vm--1003--disk--1                                                                  253:28    0    60G  0 lvm
  └─vg--local3-lv--local3_tdata                                                                       253:6     0   1.8T  0 lvm
    └─vg--local3-lv--local3-tpool                                                                     253:7     0   1.8T  0 lvm
      ├─vg--local3-lv--local3                                                                         253:12    0   1.8T  0 lvm
      ├─vg--local3-vm--6000--disk--0                                                                  253:13    0    20G  0 lvm
      ├─vg--local3-vm--6006--disk--0                                                                  253:14    0    20G  0 lvm
      ├─vg--local3-vm--6007--disk--0                                                                  253:15    0    20G  0 lvm
      ├─vg--local3-vm--6010--disk--0                                                                  253:16    0    20G  0 lvm
      ├─vg--local3-vm--6011--disk--0                                                                  253:17    0    20G  0 lvm
      ├─vg--local3-vm--6013--disk--0                                                                  253:18    0    20G  0 lvm
      ├─vg--local3-vm--6026--disk--2                                                                  253:19    0    15G  0 lvm
      ├─vg--local3-vm--6004--disk--0                                                                  253:20    0    20G  0 lvm
      ├─vg--local3-vm--6018--disk--0                                                                  253:21    0    20G  0 lvm
      ├─vg--local3-vm--6023--disk--0                                                                  253:22    0    20G  0 lvm
      ├─vg--local3-vm--6025--disk--0                                                                  253:23    0    20G  0 lvm
      ├─vg--local3-vm--6025--disk--1                                                                  253:24    0    25G  0 lvm
      ├─vg--local3-vm--6026--disk--0                                                                  253:25    0    20G  0 lvm
      ├─vg--local3-vm--6026--disk--1                                                                  253:26    0    60G  0 lvm
      ├─vg--local3-vm--1003--disk--0                                                                  253:27    0    20G  0 lvm
      └─vg--local3-vm--1003--disk--1                                                                  253:28    0    60G  0 lvm
nvme1n1                                                                                               259:1     0   1.8T  0 disk
└─ceph--f107a279--77ae--4003--8523--b62d356df5bd-osd--block--95f1cb37--6324--472a--9cd7--0ba92770f3b5 253:8     0   1.8T  0 lvm
nvme3n1                                                                                               259:2     0   1.8T  0 disk
└─ceph--6ed9e93b--685a--4b3a--ae25--ca14e772d7ee-osd--block--0b46114b--f375--4ad7--9f80--914e4b802ea4 253:10    0   1.8T  0 lvm
nvme4n1                                                                                               259:3     0   1.8T  0 disk
└─ceph--cd46f637--5533--49bb--aa42--09efc22d89dc-osd--block--f74e3f69--c6a9--49fb--af63--ebac9a98bc22 253:11    0   1.8T  0 lvm

And here is dmesg | grep nvme:
Code:
[8599639.793030] nvme nvme2: I/O 677 QID 7 timeout, aborting
[8599639.793038] nvme nvme2: I/O 678 QID 7 timeout, aborting
[8599639.793040] nvme nvme2: I/O 679 QID 7 timeout, aborting
[8599639.793043] nvme nvme2: I/O 680 QID 7 timeout, aborting
[8599670.508517] nvme nvme2: I/O 677 QID 7 timeout, reset controller
[8599701.231967] nvme nvme2: I/O 0 QID 0 timeout, reset controller
[8599742.451266] nvme nvme2: Device not ready; aborting reset
[8599742.492142] nvme nvme2: Abort status: 0x371
[8599742.492144] nvme nvme2: Abort status: 0x371
[8599742.492145] nvme nvme2: Abort status: 0x371
[8599742.492146] nvme nvme2: Abort status: 0x371
[8599753.139132] nvme nvme2: Device not ready; aborting reset
[8599753.139528] nvme nvme2: Removing after probe failure status: -19
[8599763.758945] nvme nvme2: Device not ready; aborting reset
[8599763.759494] blk_update_request: I/O error, dev nvme2n1, sector 1802065328 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0
[8599763.759497] blk_update_request: I/O error, dev nvme2n1, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
[8599763.759501] blk_update_request: I/O error, dev nvme2n1, sector 2797415056 op 0x1:(WRITE) flags 0x8800 phys_seg 3 prio class 0
[8599763.759503] blk_update_request: I/O error, dev nvme2n1, sector 1802230728 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0
[8599763.759506] blk_update_request: I/O error, dev nvme2n1, sector 2285580784 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[8599763.759508] blk_update_request: I/O error, dev nvme2n1, sector 1802234456 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0
[8599763.759514] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
[8599763.759515] blk_update_request: I/O error, dev nvme2n1, sector 2797406496 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[8599763.759519] blk_update_request: I/O error, dev nvme2n1, sector 1802229048 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0
[8599763.759520] blk_update_request: I/O error, dev nvme2n1, sector 2287970768 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0

Any ideas? I can't reboot the server right now.
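Before touching the hardware, I was thinking of checking whether the controller is still visible on the PCIe bus and, if so, re-probing it from sysfs. This is a rough, untested sketch — the PCI address is a placeholder and would have to match the failed drive:
Bash:
# Check whether the controller is still enumerated by the kernel and the PCIe bus
nvme list                              # needs the nvme-cli package
lspci -nn | grep -i 'non-volatile'

# If the controller is still visible to lspci, it may be possible to re-probe it
# without a full reboot. <PCI_ADDRESS> is a placeholder (e.g. 0000:5e:00.0);
# removing the wrong device would drop a healthy disk.
echo 1 > /sys/bus/pci/devices/<PCI_ADDRESS>/remove
echo 1 > /sys/bus/pci/rescan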

Thanks
 
Could you post the output of
Bash:
smartctl -a /dev/nvme1n1
and
Bash:
smartctl -a /dev/nvme2n1
?
 
Could you post the output of
Bash:
smartctl -a /dev/nvme1n1
Code:
root@pm-81:~# smartctl -a /dev/nvme1n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.78-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       WDC-WDS192T1D0D-01AJB0
Serial Number:                      A067-hidden
Firmware Version:                   W111000D
PCI Vendor/Subsystem ID:            0x1b96
IEEE OUI Identifier:                0x0014ee
Total NVM Capacity:                 1,920,383,410,176 [1.92 TB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,920,383,410,176 [1.92 TB]
Namespace 1 Formatted LBA Size:     4096
Namespace 1 IEEE EUI-64:            0014ee 8301be8600
Local Time is:                      Wed May  5 10:49:15 2021 CEST
Firmware Updates (0x19):            4 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x001f):   Security Format Frmw_DL NS_Mngmt Self_Test
Optional NVM Commands (0x005e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Warning  Comp. Temp. Threshold:     70 Celsius
Critical Comp. Temp. Threshold:     80 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
0 +    12.00W       -        -    0  0  0  0       50      50
1 +    11.00W       -        -    1  1  1  1       50      50
2 +    10.00W       -        -    2  2  2  2       50      50
3 +     9.00W       -        -    3  3  3  3       50      50
4 +     8.00W       -        -    4  4  4  4       50      50

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
0 +    4096       0         0
1 -     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        54 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    20,776,913 [10.6 TB]
Data Units Written:                 44,976,114 [23.0 TB]
Host Read Commands:                 269,690,925
Host Write Commands:                2,425,491,679
Controller Busy Time:               215,187
Power Cycles:                       42
Power On Hours:                     5,026
Unsafe Shutdowns:                   21
Media and Data Integrity Errors:    0
Error Information Log Entries:      231
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               52 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        231     0  0x000b  0xc004  0x029            0     0     -
and
Bash:
smartctl -a /dev/nvme2n1
?
Code:
root@pm-81:~# smartctl -a /dev/nvme2n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.78-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       WDC-WDS192T1D0D-01AJB0
Serial Number:                      A061-hidden
Firmware Version:                   W111000D
PCI Vendor/Subsystem ID:            0x1b96
IEEE OUI Identifier:                0x0014ee
Total NVM Capacity:                 1,920,383,410,176 [1.92 TB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,920,383,410,176 [1.92 TB]
Namespace 1 Formatted LBA Size:     4096
Namespace 1 IEEE EUI-64:            0014ee 8300b25300
Local Time is:                      Wed May  5 10:49:21 2021 CEST
Firmware Updates (0x19):            4 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x001f):   Security Format Frmw_DL NS_Mngmt Self_Test
Optional NVM Commands (0x005e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Warning  Comp. Temp. Threshold:     70 Celsius
Critical Comp. Temp. Threshold:     80 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    12.00W       -        -    0  0  0  0       50      50
 1 +    11.00W       -        -    1  1  1  1       50      50
 2 +    10.00W       -        -    2  2  2  2       50      50
 3 +     9.00W       -        -    3  3  3  3       50      50
 4 +     8.00W       -        -    4  4  4  4       50      50

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +    4096       0         0
 1 -     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        54 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    0
Data Units Written:                 0
Host Read Commands:                 9,742
Host Write Commands:                4
Controller Busy Time:               85
Power Cycles:                       0
Power On Hours:                     5,039
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      2
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               53 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          2     0  0x000b  0xc004  0x029            0     0     -
 
Originally I wrote /dev/nvme1, but it's actually /dev/nvme2 that isn't working. I even powered off the server last night (a reboot didn't help anything). The drive now appears in lsblk -l, but dmesg shows many errors in kern.log:

Code:
kernel: [48206.099238] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48215.664426] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48215.801820] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48226.432077] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48226.552193] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48236.231765] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48236.343529] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48245.843013] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48245.962491] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48255.573382] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48255.689129] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48266.318625] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 18 prio class 0
kernel: [48266.458434] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 17 prio class 0
kernel: [48276.030256] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48276.151809] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48285.788769] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48285.919040] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48295.677259] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48295.814393] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48306.432219] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48306.558280] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48316.167333] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48316.301746] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48325.808063] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48325.937518] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48335.475759] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48335.585588] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48346.174478] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48346.282133] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48355.826348] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
kernel: [48355.933182] blk_update_request: I/O error, dev nvme2n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 20 prio class 0

Also, the S.M.A.R.T. data has been reset (I've never seen that before); only the power-on hours look correct. Data Units Written should be around 23 TB, like the other drives.
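To rule out smartctl misreading the drive, the raw NVMe log pages could also be queried directly — a sketch, assuming the nvme-cli package is installed:
Bash:
# SMART/health log straight from the controller (NVMe log page 0x02)
nvme smart-log /dev/nvme2

# Most recent entries of the controller error log (NVMe log page 0x01)
nvme error-log /dev/nvme2 --log-entries=16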
 
S.M.A.R.T. data has been reset
That actually sounds like some kind of firmware issue. Are there any firmware updates available for this drive? Could you install the misbehaving NVMe drive in a different slot? If the issues remain, you may want to contact your seller or WDC to get the drive replaced.
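If nvme-cli is installed, the firmware slots and the running revision can be checked with something like the following (a sketch; adjust the device path):
Bash:
# Firmware revisions stored in each slot and the currently active slot
nvme fw-log /dev/nvme2

# The identify-controller data also reports the running firmware revision (fr)
nvme id-ctrl /dev/nvme2 | grep -E '^fr\b'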
 
That actually sounds like some kind of firmware issue. Are there any firmware updates available for this drive? Could you install the misbehaving NVMe drive in a different slot? If the issues remain, you may want to contact your seller or WDC to get the drive replaced.
Sadly, it seems there isn't any update (WDC's site doesn't even have a download section for that drive). The drive is currently hundreds of kilometers away from me, so I can't do much right now; I'll try to work on it later tonight.
 
