Errors on NVMe Device during upgrade...bad device?

Mujizac

Member
May 23, 2017
While running an apt-get dist-upgrade, I am getting I/O errors. It's always complaining about sector 76540200. The SMART information looks pretty good, aside from the crazy number of errors on that single sector. I'm surprised by the idea that this is a failing device, but I can accept it if it is.
The machine still boots and seems okay; I just can't install anything without getting errors.
Below is some of the info I have dug up:

At the console for the system I am seeing
blk_update_request: I/O error, dev nvme0n1, sector 76540200
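
One way to see what that sector maps to (a rough sketch, not verified here; it assumes the sector falls inside an ext4 partition such as /dev/nvme0n1p1, so adjust the names for the real layout):

# which partition covers sector 76540200? (start/size are in 512-byte sectors)
cat /sys/block/nvme0n1/nvme0n1p1/start /sys/block/nvme0n1/nvme0n1p1/size
# filesystem block size on that partition (ext4)
tune2fs -l /dev/nvme0n1p1 | grep 'Block size'
# fs block = (sector - partition_start) * 512 / block_size
# e.g. with start 2048 and 4096-byte blocks: (76540200 - 2048) * 512 / 4096 = 9567269
debugfs -R "icheck 9567269" /dev/nvme0n1p1   # block -> inode
debugfs -R "ncheck NNN" /dev/nvme0n1p1       # inode NNN from icheck -> file path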

Running smartctl reports:
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.4.67-1-pve] (local build)

=== START OF INFORMATION SECTION ===
Model Number: INTEL SSDPEKKW512G7
Serial Number: BTPY65110PKZ512F
Firmware Version: PSF109C
PCI Vendor/Subsystem ID: 0x8086
IEEE OUI Identifier: 0x5cd2e4
Controller ID: 1
Number of Namespaces: 1
Namespace 1 Size/Capacity: 512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Mon Nov 20 00:41:14 2017 MST
Firmware Updates (0x12): 1 Slot, no Reset required
Optional Admin Commands (0x0006): Format Frmw_DL
Optional NVM Commands (0x001e): Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size: 32 Pages
Warning Comp. Temp. Threshold: 70 Celsius
Critical Comp. Temp. Threshold: 80 Celsius

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 9.00W - - 0 0 0 0 5 5
1 + 4.60W - - 1 1 1 1 30 30
2 + 3.80W - - 2 2 2 2 30 30
3 - 0.0700W - - 3 3 3 3 10000 300
4 - 0.0050W - - 4 4 4 4 2000 10000

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning: 0x00
Temperature: 55 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 1%
Data Units Read: 388,058 [198 GB]
Data Units Written: 12,693,111 [6.49 TB]
Host Read Commands: 117,766,680
Host Write Commands: 377,240,502
Controller Busy Time: 2,555
Power Cycles: 4
Power On Hours: 4,472
Unsafe Shutdowns: 1
Media and Data Integrity Errors: 974,015
Error Information Log Entries: 974,015
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0

Error Information (NVMe Log 0x01, max 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 974016 4 0x000c 0x0281 - 76540200 1 -
1 974015 4 0x000c 0x0281 - 76540200 1 -
2 974014 4 0x000c 0x0281 - 76540200 1 -
3 974013 4 0x000c 0x0281 - 76540200 1 -
4 974012 4 0x000c 0x0281 - 76540200 1 -
5 974011 4 0x000c 0x0281 - 76540200 1 -
6 974010 4 0x000c 0x0281 - 76540200 1 -
7 974009 4 0x000c 0x0281 - 76540200 1 -
8 974008 4 0x000c 0x0281 - 76540200 1 -
9 974007 4 0x000c 0x0281 - 76540200 1 -
10 974006 4 0x000c 0x0281 - 76540200 1 -
11 974005 4 0x000c 0x0281 - 76540200 1 -
12 974004 4 0x000c 0x0281 - 76540200 1 -
13 974003 4 0x000c 0x0281 - 76540200 1 -
14 974002 4 0x000c 0x0281 - 76540200 1 -
15 974001 4 0x000c 0x0281 - 76540200 1 -
... (48 entries not shown)
 
blk_update_request: I/O error, dev nvme0n1, sector 76540200
That sector is dead, but since your available spare is still 100%, it shouldn't affect the server. Keeping a close watch on it is a good idea, though. Check whether there is a firmware update for the device; maybe the issue is already fixed. With 'nvme-cli' you can also pull the logs from the NVMe and see if anything shows up there.
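
For example (assuming nvme-cli is installed and the controller shows up as /dev/nvme0):

# health / SMART log (NVMe log page 0x02)
nvme smart-log /dev/nvme0
# error information log (NVMe log page 0x01)
nvme error-log /dev/nvme0
# firmware slot info, shows the active firmware revision
nvme fw-log /dev/nvme0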
 

Yeah! I totally agree. What I'm unclear on is the procedure. At the moment the drive won't let go of that sector; it doesn't seem to want to mark it bad and move on. From the SSD's perspective, how do we make it do that?
 
I would go with the firmware update, as it seems strange that a single sector is being reported this often.
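
If the goal is just to get the drive to give up on that LBA, the usual trick is to rewrite it, since SSDs remap on write. A sketch of that (destructive to whatever data sits at that sector, so make sure it is expendable or backed up first; the device names are assumptions):

# the drive advertises Wr_Zero, so write-zeroes can rewrite the LBA in place
# (--block-count is zero-based: 0 means one block)
nvme write-zeroes /dev/nvme0n1 --start-block=76540200 --block-count=0
# or the same thing with dd (bs matches the 512-byte LBA format)
dd if=/dev/zero of=/dev/nvme0n1 bs=512 seek=76540200 count=1 oflag=direct
# then re-read it to confirm the sector no longer errors
dd if=/dev/nvme0n1 of=/dev/null bs=512 skip=76540200 count=1 iflag=direct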