Proxmox with WD Blue SN570 1TB NVMe SSD randomly causes I/O errors

G0ldmember

I'm a little bit lost again. I've set up a ThinkCentre Tiny M900 with a WD SN570 NVMe SSD and installed Proxmox.
It usually runs without any problems for days, sometimes for weeks, and then suddenly the LXCs start malfunctioning, and so does the host system. SSH access is no longer possible, and the web interface is not accessible either.

I still have access over MeshCentral and can get a console, but commands that touch the disk fail, mostly with "I/O error". Even reboot no longer works.

When I pull the plug and power the system on again, everything is fine, as if nothing ever happened. Neither nvme nor smartctl reports any issues with the (brand new) SSD.

Code:
Model Number:                       WD Blue SN570 1TB
Serial Number:                      232403801710
Firmware Version:                   234110WD
PCI Vendor/Subsystem ID:            0x15b7
IEEE OUI Identifier:                0x001b44
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            001b44 8b4ab89bd9
Local Time is:                      Fri Mar 22 19:04:50 2024 CET
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     80 Celsius
Critical Comp. Temp. Threshold:     85 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     4.20W    3.70W       -    0  0  0  0        0       0
 1 +     2.70W    2.30W       -    0  0  0  0        0       0
 2 +     1.90W    1.80W       -    0  0  0  0        0       0
 3 -   0.0250W       -        -    3  3  3  3     3900   11000
 4 -   0.0050W       -        -    4  4  4  4     5000   44000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        55 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    1,390,120 [711 GB]
Data Units Written:                 426,415 [218 GB]
Host Read Commands:                 6,502,671
Host Write Commands:                17,564,740
Controller Busy Time:               72
Power Cycles:                       28
Power On Hours:                     2,379
Unsafe Shutdowns:                   9
Media and Data Integrity Errors:    0
Error Information Log Entries:      5
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged


There is no overprovisioning or anything like that.

Code:
root@pve:~# pvs
  PV             VG  Fmt  Attr PSize    PFree
  /dev/nvme0n1p3 pve lvm2 a--  <930.51g 16.00g
root@pve:~# vgs
  VG  #PV #LV #SN Attr   VSize    VFree
  pve   1   5   0 wz--n- <930.51g 16.00g
root@pve:~# lvs
  LV            VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  data          pve twi-aotz-- 794.66g             1.04   0.27                           
  root          pve -wi-ao----  96.00g                                                   
  swap          pve -wi-ao----   7.63g                                                   
  vm-100-disk-0 pve Vwi-aotz--  20.00g data        9.55                                   
  vm-101-disk-0 pve Vwi-aotz-- 100.00g data        6.35

This is driving me insane.

Code:
root@pve:~# lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:16.3 Serial controller: Intel Corporation 100 Series/C230 Series Chipset Family KT Redirection (rev 31)
00:17.0 SATA controller: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] (rev 31)
00:1b.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #17 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Q170 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
01:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue SN570 NVMe SSD 1TB


Does anyone have a clue where to look for potential problems?
 
I've been using the same 1TB SSD for two years already in my small Proxmox server. As your box is also quite small, the SSD is totally fine in my view. The SN570 uses TLC flash, and the endurance values of the drive are quite OK for this use case. I/O errors should not just happen with any SSD or hard drive.

You could check the following:
- Your system seems to lock up randomly, with I/O errors and weird behaviour. Try a memory test first; faulty memory could cause this behaviour.
- Check the SMART log of your SSD using smartctl or nvme.
- If both of those checks come back clean, it may be the mainboard. Try booting from another disk, or maybe a USB stick.
- Use the NVMe drive in another computer and see if the same thing happens there.
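
For the SMART check, here is a rough sketch of how the interesting counters could be pulled out of the smartctl output. The function name smart_flags and the choice of counters are only my assumptions, not an official tool, and the device name /dev/nvme0 is guessed from the /dev/nvme0n1p3 shown in the pvs output.

```shell
#!/bin/sh
# Hypothetical helper (not part of smartmontools): reads `smartctl -a`
# output on stdin and prints every watched counter that is not zero.
smart_flags() {
    awk -F: '
        /^(Critical Warning|Unsafe Shutdowns|Media and Data Integrity Errors|Error Information Log Entries)/ {
            name = $1
            value = $2
            gsub(/[ \t,]/, "", value)   # strip padding and thousands separators
            if (value != "0" && value != "0x00")
                print name ": " value
        }
    '
}

# Assumed usage on the Proxmox host (device name is an assumption):
#   smartctl -a /dev/nvme0 | smart_flags
```

Fed the SMART section posted above, this would flag "Unsafe Shutdowns: 9" and "Error Information Log Entries: 5". The unsafe shutdowns are probably just the pulled plugs, but the error log entries may be worth a look with `nvme error-log /dev/nvme0`. It could also help to check the kernel messages from the boot that hung with `journalctl -k -b -1`, assuming the journal is persistent; since the disk was already failing at that point, the relevant lines may never have been written.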
 
