I'm a little bit lost again. I've set up a ThinkCentre Tiny M900 with a WD SN570 NVMe SSD and installed Proxmox.
It usually runs without any problems for days, sometimes for weeks, and then suddenly the LXCs start malfunctioning, and the host system as well. SSH access is no longer possible, and the web interface is not reachable either.
I still have access over MeshCentral and can get a console, but commands that touch the disk fail, mostly with "I/O error". Even reboot no longer works.
When I pull the plug and power on the system again, everything is fine, as if nothing had ever happened. Neither nvme nor smartctl reports any issues with the (brand new) SSD.
Code:
Model Number: WD Blue SN570 1TB
Serial Number: 232403801710
Firmware Version: 234110WD
PCI Vendor/Subsystem ID: 0x15b7
IEEE OUI Identifier: 0x001b44
Total NVM Capacity: 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 0
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 001b44 8b4ab89bd9
Local Time is: Fri Mar 22 19:04:50 2024 CET
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size: 128 Pages
Warning Comp. Temp. Threshold: 80 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Namespace 1 Features (0x02): NA_Fields
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     4.20W    3.70W       -    0  0  0  0        0       0
 1 +     2.70W    2.30W       -    0  0  0  0        0       0
 2 +     1.90W    1.80W       -    0  0  0  0        0       0
 3 -   0.0250W       -        -    3  3  3  3     3900   11000
 4 -   0.0050W       -        -    4  4  4  4     5000   44000
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 55 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 1,390,120 [711 GB]
Data Units Written: 426,415 [218 GB]
Host Read Commands: 6,502,671
Host Write Commands: 17,564,740
Controller Busy Time: 72
Power Cycles: 28
Power On Hours: 2,379
Unsafe Shutdowns: 9
Media and Data Integrity Errors: 0
Error Information Log Entries: 5
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
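For anyone who wants to double-check the "no issues reported" reading, here is a minimal Python sketch that pulls a few counters out of the output above and applies some basic sanity checks. It assumes the plain `Key: value` layout shown; `parse_smart` and `health_flags` are hypothetical helper names, not part of smartctl, and the checks are only a rough illustration, not a complete health assessment.

```python
import re

# Excerpt of the smartctl output above (assumed "Key: value" layout).
SMART_TEXT = """\
Critical Warning: 0x00
Temperature: 55 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Power Cycles: 28
Unsafe Shutdowns: 9
Media and Data Integrity Errors: 0
Error Information Log Entries: 5
"""


def parse_smart(text):
    """Map each field name to the first number found in its value."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        match = re.search(r"0x[0-9a-fA-F]+|\d[\d,]*", value)
        if match:
            token = match.group(0).replace(",", "")
            base = 16 if token.startswith("0x") else 10
            fields[key.strip()] = int(token, base)
    return fields


def health_flags(fields):
    """Return a list of simple warning strings (hypothetical checks)."""
    flags = []
    if fields.get("Critical Warning", 0) != 0:
        flags.append("critical warning set")
    if fields.get("Media and Data Integrity Errors", 0) > 0:
        flags.append("media errors")
    if fields.get("Available Spare", 100) <= fields.get("Available Spare Threshold", 0):
        flags.append("spare at or below threshold")
    return flags


fields = parse_smart(SMART_TEXT)
print(health_flags(fields))
print(fields["Unsafe Shutdowns"], "unsafe shutdowns over", fields["Power Cycles"], "power cycles")
```

On the values above none of these checks trip, which matches what smartctl itself says. The unsafe shutdown count is the only number that stands out, and that is expected given how often the plug had to be pulled.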
There is no overprovisioning or anything like that.
Code:
root@pve:~# pvs
  PV             VG  Fmt  Attr PSize    PFree
  /dev/nvme0n1p3 pve lvm2 a--  <930.51g 16.00g
root@pve:~# vgs
  VG  #PV #LV #SN Attr   VSize    VFree
  pve   1   5   0 wz--n- <930.51g 16.00g
root@pve:~# lvs
  LV            VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  data          pve twi-aotz-- 794.66g             1.04   0.27
  root          pve -wi-ao----  96.00g
  swap          pve -wi-ao----   7.63g
  vm-100-disk-0 pve Vwi-aotz--  20.00g data        9.55
  vm-101-disk-0 pve Vwi-aotz-- 100.00g data        6.35
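The arithmetic behind that statement can be sketched in a few lines of Python, reading "overprovisioning" as thin-pool overcommitment (that reading is an assumption). The sizes are copied by hand from the lvs output above; `overcommit_ratio` is a hypothetical helper, not an LVM tool.

```python
# Sizes in GiB, copied from the lvs output above.
POOL_SIZE_GIB = 794.66
THIN_VOLUMES_GIB = {
    "vm-100-disk-0": 20.00,
    "vm-101-disk-0": 100.00,
}


def overcommit_ratio(pool_size, volumes):
    """Sum of the thin volumes' virtual sizes divided by the pool size."""
    return sum(volumes.values()) / pool_size


allocated = sum(THIN_VOLUMES_GIB.values())
ratio = overcommit_ratio(POOL_SIZE_GIB, THIN_VOLUMES_GIB)
print(f"{allocated:.2f} GiB promised out of {POOL_SIZE_GIB:.2f} GiB (ratio {ratio:.2f})")
print("overcommitted" if ratio > 1 else "not overcommitted")
```

A ratio above 1 would mean the thin volumes could in theory outgrow the pool. Here the two volumes add up to 120 GiB against a pool of roughly 795 GiB, so the pool running full is unlikely to be the cause.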
This is driving me insane.
Code:
root@pve:~# lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:16.3 Serial controller: Intel Corporation 100 Series/C230 Series Chipset Family KT Redirection (rev 31)
00:17.0 SATA controller: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] (rev 31)
00:1b.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #17 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Q170 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
01:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue SN570 NVMe SSD 1TB
Does anyone have a clue where to look for potential problems?