Hi!
I have an issue where one of my two NVMe Drives "disappears" some time after boot.
I have two Samsung 990 PRO 4TB in a ZFS Mirror for my VM Storage. One of them seems to have this issue; the other one is perfectly fine.
It doesn't seem to be a temperature issue, like in other threads I found by googling, since both drives sit at around 34-38°C most of the time.
dmesg suggests 'Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug'. In other threads I have found people setting it to values like 5500 instead of 0. Does this make a big difference? I'm not exactly sure what this parameter does; it seems to disable the sleep modes?
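To make the 0 vs. 5500 question concrete, here is a minimal sketch of how that threshold might interact with the power states these drives report (see the smartctl output further down). It assumes that the driver skips any non-operational power state whose exit latency (Ex_Lat) is larger than nvme_core.default_ps_max_latency_us. That is an assumption about the kernel's behaviour, not something confirmed in this thread. The names POWER_STATES and allowed_idle_states are made up for illustration.

```python
# Power states copied from the smartctl output of the 990 PRO further down.
# Tuple layout: (state, operational, entry latency in us, exit latency in us)
POWER_STATES = [
    (0, True, 0, 0),
    (1, True, 0, 0),
    (2, True, 0, 0),
    (3, False, 4200, 2700),
    (4, False, 500, 21800),
]


def allowed_idle_states(max_latency_us):
    """Return the non-operational states still usable under the threshold.

    Assumption: a state is skipped when its exit latency exceeds the
    threshold; operational states (marked '+' by smartctl) are never
    idle targets.
    """
    return [
        state
        for state, operational, _entry_lat, exit_lat in POWER_STATES
        if not operational and exit_lat <= max_latency_us
    ]


# 100000 is assumed here to be the kernel default for this parameter.
for threshold in (0, 5500, 100000):
    print(threshold, allowed_idle_states(threshold))
```

Under that assumption, 5500 would still allow power state 3 but rule out state 4, while 0 rules out both sleep states.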
What's weird, I think, is that this is only happening to one drive, not both of them.
Thanks a lot in advance for any suggestions!
HW:
Ryzen 7 PRO 5750G
128GB 3200 ECC Micron RAM
ASRock Rack X570D4U-L2LT
2x Lexar NS100 Boot Drives (ZFS Mirror)
2x Samsung 990 PRO 4TB for VMs (ZFS Mirror)
SW:
Code:
proxmox-ve: 8.1.0 (running kernel: 6.5.11-7-pve)
pve-manager: 8.1.3 (running version: 8.1.3/b46aac3b42da5d15)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.5: 6.5.11-7
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-15-pve: 6.2.16-15
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx7
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.7
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.2-1
proxmox-backup-file-restore: 3.1.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.2
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-2
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.1.5
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1
This is the dmesg output:
Code:
[48196.959378] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[48196.959386] nvme nvme1: Does your device have a faulty power saving mode enabled?
[48196.959388] nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
[48197.039398] nvme 0000:2e:00.0: Unable to change power state from D3cold to D0, device inaccessible
[48197.039529] nvme nvme1: Disabling device after reset failure: -19
[48197.063402] I/O error, dev nvme1n1, sector 16020376 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[48197.063413] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729157128192 size=12288 flags=1572992
[48197.063413] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729152671744 size=94208 flags=1572992
[48197.063416] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=8201388032 size=4096 flags=1572992
[48197.063420] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3728571559936 size=40960 flags=1572992
[48197.063423] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729154637824 size=131072 flags=1572992
[48197.063429] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3728576020480 size=32768 flags=1572992
[48197.063427] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=8201383936 size=4096 flags=1572992
[48197.063432] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729157914624 size=131072 flags=1572992
[48197.063437] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729153589248 size=106496 flags=1572992
[48197.063439] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729157783552 size=131072 flags=1572992
[48197.063439] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729158045696 size=131072 flags=1572992
[48197.063443] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729157652480 size=131072 flags=1572992
[48197.063454] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729158176768 size=131072 flags=1572992
[48197.063465] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729158307840 size=131072 flags=1572992
[48197.063465] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3750801182720 size=12288 flags=1074267264
[48197.063467] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729158438912 size=131072 flags=1572992
[48197.063469] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729158569984 size=131072 flags=1572992
[48197.063471] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729158701056 size=131072 flags=1572992
[48197.063494] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.063497] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.063500] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.063501] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.063512] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.063639] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.063641] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.063642] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.077806] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.077806] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.079373] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.080430] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.081528] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.081533] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.084160] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.084464] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.095073] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.096042] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
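For anyone reading the zio lines above, here is a small sketch that tallies them by operation. The type numbers are decoded with a mapping that is assumed to match the ZFS 2.2 series (it is not taken from this thread, so check it against your version). error=5 is looked up through Python's errno table. The names ZIO_TYPES, ZIO_LINE and summarize are made up for illustration.

```python
import errno
import re
from collections import Counter

# Assumed zio type numbering for the ZFS 2.2 series; verify for your version.
ZIO_TYPES = {0: "null", 1: "read", 2: "write", 3: "free",
             4: "claim", 5: "ioctl", 6: "trim"}

# Matches the fields of a "zio pool=... vdev=..." kernel log line.
ZIO_LINE = re.compile(
    r"zio pool=(?P<pool>\S+) vdev=(?P<vdev>\S+) error=(?P<error>\d+) "
    r"type=(?P<type>\d+) offset=(?P<offset>\d+) size=(?P<size>\d+)"
)


def summarize(lines):
    """Count zio errors per (operation, errno name) pair."""
    counts = Counter()
    for line in lines:
        match = ZIO_LINE.search(line)
        if match is None:
            continue  # not a zio line, e.g. the nvme reset messages
        op = ZIO_TYPES.get(int(match["type"]), "unknown")
        err = errno.errorcode.get(int(match["error"]), "unknown")
        counts[(op, err)] += 1
    return dict(counts)


# Two lines copied from the dmesg output above.
sample = [
    "[48197.063413] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729157128192 size=12288 flags=1572992",
    "[48197.063494] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728",
]
print(summarize(sample))
```

With that mapping, the log would read as a burst of failed writes followed by failed ioctl requests, all returning EIO after the controller was disabled.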
And this is the smartctl output of both drives:
nvme0:
Code:
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 990 PRO with Heatsink 4TB
Serial Number: S7DSNJ0WA10140K
Firmware Version: 0B2QJXG7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 4,000,787,030,016 [4.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 2.0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 4,000,787,030,016 [4.00 TB]
Namespace 1 Utilization: 620,724,887,552 [620 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 4a31414320
Local Time is: Mon Dec 25 21:01:52 2023 CET
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055): Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f): S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg *Other*
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.39W       -        -    0  0  0  0        0       0
 1 +     9.39W       -        -    1  1  1  1        0       0
 2 +     9.39W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3     4200    2700
 4 -   0.0050W       -        -    4  4  4  4      500   21800
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 39 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 540,396 [276 GB]
Data Units Written: 1,311,915 [671 GB]
Host Read Commands: 15,101,770
Host Write Commands: 8,330,016
Controller Busy Time: 12
Power Cycles: 20
Power On Hours: 5
Unsafe Shutdowns: 13
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 39 Celsius
Temperature Sensor 2: 43 Celsius
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
nvme1:
Code:
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 990 PRO with Heatsink 4TB
Serial Number: S7DSNJ0WA10181J
Firmware Version: 0B2QJXG7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 4,000,787,030,016 [4.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 2.0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 4,000,787,030,016 [4.00 TB]
Namespace 1 Utilization: 595,195,559,936 [595 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 4a31414349
Local Time is: Mon Dec 25 21:01:55 2023 CET
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055): Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f): S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg *Other*
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.39W       -        -    0  0  0  0        0       0
 1 +     9.39W       -        -    1  1  1  1        0       0
 2 +     9.39W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3     4200    2700
 4 -   0.0050W       -        -    4  4  4  4      500   21800
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 38 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 58,688 [30.0 GB]
Data Units Written: 1,233,095 [631 GB]
Host Read Commands: 3,559,637
Host Write Commands: 8,641,638
Controller Busy Time: 5
Power Cycles: 21
Power On Hours: 2
Unsafe Shutdowns: 16
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 38 Celsius
Temperature Sensor 2: 43 Celsius
Error Information (NVMe Log 0x01, 16 of 64 entries)