NVMe Issue: Unable to change power state from D3cold to D0, device inaccessible

Dec 25, 2023
Hi!

I have an issue where one of my two NVMe drives "disappears" some time after boot.
I have two Samsung 990 PRO 4TB in a ZFS Mirror for my VM Storage, one of which seems to have this issue, the other one is perfectly fine.

It doesn't seem to be a temperature issue, like in other threads I found by googling, since both drives sit at around 34-38°C most of the time.

dmesg suggests `Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug`. In other threads I have found people setting the latency value to something like 5500 instead of 0. Does that make a big difference? I'm not exactly sure what this parameter does; it seems to disable the drive's sleep states?
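
As far as I understand, the value currently in effect can at least be read back at runtime. A quick check (assuming the module parameter is exposed under /sys/module, as on current kernels):

Bash:
# APST latency limit currently in effect, in microseconds (0 would disable APST entirely)
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
# confirm whether the parameter was actually passed on the kernel command line
cat /proc/cmdline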

What's weird, I think, is that this is only happening to one drive, not both of them.

Thanks a lot in advance for any suggestions!

HW:
Ryzen 7 PRO 5750G
128GB 3200 ECC Micron RAM
ASRock Rack X570D4U-L2LT
2x Lexar NS100 Boot Drives (ZFS Mirror)
2x Samsung 990 PRO 4TB for VMs (ZFS Mirror)

SW:

Code:
proxmox-ve: 8.1.0 (running kernel: 6.5.11-7-pve)
pve-manager: 8.1.3 (running version: 8.1.3/b46aac3b42da5d15)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.5: 6.5.11-7
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
proxmox-kernel-6.2.16-15-pve: 6.2.16-15
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx7
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.7
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.2-1
proxmox-backup-file-restore: 3.1.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.2
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-2
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.1.5
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1


This is the dmesg output:

Code:
[48196.959378] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[48196.959386] nvme nvme1: Does your device have a faulty power saving mode enabled?
[48196.959388] nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
[48197.039398] nvme 0000:2e:00.0: Unable to change power state from D3cold to D0, device inaccessible
[48197.039529] nvme nvme1: Disabling device after reset failure: -19
[48197.063402] I/O error, dev nvme1n1, sector 16020376 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[48197.063413] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729157128192 size=12288 flags=1572992
[48197.063413] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729152671744 size=94208 flags=1572992
[48197.063416] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=8201388032 size=4096 flags=1572992
[48197.063420] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3728571559936 size=40960 flags=1572992
[48197.063423] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729154637824 size=131072 flags=1572992
[48197.063429] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3728576020480 size=32768 flags=1572992
[48197.063427] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=8201383936 size=4096 flags=1572992
[48197.063432] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729157914624 size=131072 flags=1572992
[48197.063437] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729153589248 size=106496 flags=1572992
[48197.063439] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729157783552 size=131072 flags=1572992
[48197.063439] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729158045696 size=131072 flags=1572992
[48197.063443] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729157652480 size=131072 flags=1572992
[48197.063454] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729158176768 size=131072 flags=1572992
[48197.063465] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729158307840 size=131072 flags=1572992
[48197.063465] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3750801182720 size=12288 flags=1074267264
[48197.063467] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729158438912 size=131072 flags=1572992
[48197.063469] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729158569984 size=131072 flags=1572992
[48197.063471] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=2 offset=3729158701056 size=131072 flags=1572992
[48197.063494] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.063497] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.063500] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.063501] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.063512] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.063639] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.063641] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.063642] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.077806] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.077806] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.079373] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.080430] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.081528] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.081533] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.084160] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.084464] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.095073] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728
[48197.096042] zio pool=VMs vdev=/dev/disk/by-id/nvme-Samsung_SSD_990_PRO_with_Heatsink_4TB_S7DSNJ0WA10181J_1-part1 error=5 type=5 offset=0 size=0 flags=1049728

And this is the smartctl output of both drives:

nvme0:
Code:
=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 990 PRO with Heatsink 4TB
Serial Number:                      S7DSNJ0WA10140K
Firmware Version:                   0B2QJXG7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 4,000,787,030,016 [4.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       2.0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,000,787,030,016 [4.00 TB]
Namespace 1 Utilization:            620,724,887,552 [620 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 4a31414320
Local Time is:                      Mon Dec 25 21:01:52 2023 CET
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055):     Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg *Other*
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.39W       -        -    0  0  0  0        0       0
 1 +     9.39W       -        -    1  1  1  1        0       0
 2 +     9.39W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3     4200    2700
 4 -   0.0050W       -        -    4  4  4  4      500   21800

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        39 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    540,396 [276 GB]
Data Units Written:                 1,311,915 [671 GB]
Host Read Commands:                 15,101,770
Host Write Commands:                8,330,016
Controller Busy Time:               12
Power Cycles:                       20
Power On Hours:                     5
Unsafe Shutdowns:                   13
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               39 Celsius
Temperature Sensor 2:               43 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

nvme1:
Code:
=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 990 PRO with Heatsink 4TB
Serial Number:                      S7DSNJ0WA10181J
Firmware Version:                   0B2QJXG7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 4,000,787,030,016 [4.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       2.0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,000,787,030,016 [4.00 TB]
Namespace 1 Utilization:            595,195,559,936 [595 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 4a31414349
Local Time is:                      Mon Dec 25 21:01:55 2023 CET
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055):     Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg *Other*
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.39W       -        -    0  0  0  0        0       0
 1 +     9.39W       -        -    1  1  1  1        0       0
 2 +     9.39W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3     4200    2700
 4 -   0.0050W       -        -    4  4  4  4      500   21800

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        38 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    58,688 [30.0 GB]
Data Units Written:                 1,233,095 [631 GB]
Host Read Commands:                 3,559,637
Host Write Commands:                8,641,638
Controller Busy Time:               5
Power Cycles:                       21
Power On Hours:                     2
Unsafe Shutdowns:                   16
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               38 Celsius
Temperature Sensor 2:               43 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
 
Hi there,

I have 4 Samsung 990 Pro 4TB with heatsink attached to 4 motherboards (one Samsung 990 Pro on each MoBo), and I have exactly the same issue.

3 Samsung 990 Pro are running properly and one Samsung 990 Pro "disappears" some time after boot with more or less the same error messages.

Code:
nvme nvme0: Does your device have a faulty power saving mode enabled?
nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
nvme 0000:26:00.0: Unable to change power state from D3cold to D0, device inaccessible
nvme nvme0: Disabling device after reset failure: -19
nvme0n1: rw=34817, sector=145112, nr_sectors = 8 limit=0
nvme0n1: rw=34817, sector=145120, nr_sectors = 16 limit=0
...

I also have the same kernel version (6.5.11-7-pve) with the latest pve updates.

It seems related to the kernel but perhaps one SSD is defective.

I'll do more tests soon.

Regards
 
Hello,

I've updated all the SSDs with the latest firmware 4B2QJXD7.
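
In case it helps, this is roughly how I check which firmware revision is actually active after the update (a sketch, assuming nvme-cli is installed; the device path will of course differ on your systems):

Bash:
# firmware revision as reported by the controller
nvme id-ctrl /dev/nvme0 | grep "^fr "
# firmware slot log: shows which slot is active and what each slot contains
nvme fw-log /dev/nvme0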

I'll send you an update if it works or not.

Regards
 
I just registered here to say that I have exactly the same issue on a plain Debian 12 (Bookworm) system.

I replaced my 980 Pro 2 TB with a 990 Pro 4 TB (with heatsink) on December 24th. Everything ran well until January 3rd, when I had the first crash (`nvme: unable to change power state from d3cold to d0 device inaccessible`). Since then I have had this problem roughly every day.

With the 980 Pro I never had any stability issues. The box ran 24/7 - only occasional reboots for system updates every few weeks or months.

I'm using firmware 0B2QJXG7, which is the up-to-date version for my model (the "G" at the end indicates that it is a V8 model, so there is no 4B... firmware; see this reddit post: https://www.reddit.com/r/buildapc/comments/1857wpw/comment/kbp32g2/ )

I have now gone back to kernel 6.1.55 (because that was still on my system), but I don't think this will help. I will try the suggested nvme parameter. (I never saw this help text before, because this is my system SSD and the box froze with a kernel panic every time it happened.)

There is also a Linux kernel bug, but I am not sure if it is our problem: https://bugzilla.kernel.org/show_bug.cgi?id=217705

UPDATE: My CPU is a Xeon E3-1245. Because this is very different from the Ryzen 7 PRO mentioned above, we can probably rule out a CPU/chipset problem. It looks like a problem with that particular SSD.
 
I can confirm setting nvme_core.default_ps_max_latency_us=0 seems to mitigate the issue. The drives have been operational for 3 days so far, although disabling the sleep states isn't ideal.
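
For anyone else wanting to try this on Proxmox, setting the parameter goes roughly like this (a sketch; which file you edit depends on how the host boots):

Bash:
# GRUB-booted hosts: append the parameter to GRUB_CMDLINE_LINUX_DEFAULT, e.g.
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0"
nano /etc/default/grub
update-grub

# hosts booted via proxmox-boot-tool (e.g. ZFS root on UEFI): append the parameter
# to the single line in /etc/kernel/cmdline, then refresh
nano /etc/kernel/cmdline
proxmox-boot-tool refresh

# reboot, then verify the parameter is active
cat /proc/cmdline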

Regards,
Trace
 
I've updated all my SSDs to the latest firmware version, i.e. 4B2QJXD7, and it works so far.

Please let me know if that works for you.

Regards
 
Thanks for checking in. I am currently unable to use fwupd because my system apparently still uses legacy boot without UEFI. I will attempt the firmware update after reinstalling Proxmox on the host with UEFI enabled.
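
For reference, the flow I would expect to use once the host boots via UEFI is something like this (a sketch, assuming Samsung actually publishes this firmware on LVFS):

Bash:
# refresh metadata from LVFS, list devices and any pending firmware updates
fwupdmgr refresh
fwupdmgr get-devices
fwupdmgr get-updates
# apply the update (usually staged and activated on the next reboot)
fwupdmgr update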

Regards,
Trace
 
Thanks for checking in. I am currently unable to use fwupd because my system apparently still uses legacy boot without UEFI. I will attempt the firmware update after reinstalling Proxmox on the host with UEFI enabled.

Regards,
Trace
Looking at your smartctl outputs, you also have V8 SSDs where 0B2QJXG7 is the current version. (See reddit link in my previous post)

(Also, I do have a UEFI system and fwupdmgr does not show any available updates)
 
Looking at your smartctl outputs, you also have V8 SSDs where 0B2QJXG7 is the current version. (See reddit link in my previous post)

(Also, I do have a UEFI system and fwupdmgr does not show any available updates)
I forgot to mention this in my earlier post, but there appears to be a follow-up from the poster in the techpowerup thread (specifically this comment, dated after the reddit comment was posted), in which they say there should be a firmware update from 0B2QJXG7 to 4B2QJXD7, if I'm not mistaken.

Sadly, I don't currently have the ability to check this using Samsung Magician

Regards,
Trace
 
I forgot to mention this in my earlier post, but there appears to be a follow-up from the poster in the techpowerup thread (specifically this comment, dated after the reddit comment was posted), in which they say there should be a firmware update from 0B2QJXG7 to 4B2QJXD7, if I'm not mistaken.

Sadly, I don't currently have the ability to check this using Samsung Magician

Regards,
Trace
You are right! I just tried it.

I removed the SSD from my Linux box, put it on a PCIe card, and mounted it in my Windows box. Immediately after booting, Magician wanted to update the firmware. I am running 4B2QJXD7 now.

I removed the kernel parameters and am running the current Debian kernel now (6.1.69-1). Let's see if it is fixed now...
 
Just got around to updating the SSDs too; I'm also on 4B2QJXD7 now. Interestingly, the ZFS fragmentation of the pool went from 11% to 3%.
Hope this resolves the issues.
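
(In case anyone wants to compare, the fragmentation value can be read straight from zpool, e.g.:)

Bash:
# "VMs" is the mirror pool the 990 PROs are in
zpool list -o name,size,allocated,capacity,fragmentation,health VMs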

Thanks a lot to both of you for checking in.

Regards,
Trace
 
Unfortunately, I had another crash last night. This time there was no kernel panic, but the device went read-only. I could not see the initial error message because the console was full of errors due to the read-only file system. I rebooted and the system is running again, but I don't have much hope. Guess I will have to add the kernel command line parameter again (`nvme_core.default...`).

I'm already thinking about what to do now... keep the SSD with sleep states disabled and wait for a firmware update, OR return it, ask for a replacement, and then hope that I get one without this bug. (For both of you, it was always the same SSD that failed, correct?)

Samsung acknowledging that this is a known bug and announcing a fix would make the decision somewhat easier...
 
Yes, in my case it was the same SSD that failed. I'll keep them running for the moment, hoping the issue is somehow resolved, like with YAGA.
I'm currently getting an unused one (same model) from a friend; I'll see if the issue persists and possibly switch it out.

Regards,
Trace
 
Cool, that will be interesting to see. I appreciate that you both report your findings here. I will do the same.
 
My server just crashed again, so I have now activated `nvme_core.default_ps_max_latency_us=0` (but not `pcie_aspm=off`).
 
I previously also only had "nvme_core.default_ps_max_latency_us=0" set, without "pcie_aspm=off".
My drives are still online, no crash / zfs mirror degradation so far since updating the firmware on Sunday

Regards,
Trace
 
My drives are still online, no crash / zfs mirror degradation so far since updating the firmware on Sunday

Regards,
Trace
I suppose you are currently running without the kernel parameter? Could you double-check `/proc/cmdline`?
Unfortunately, for me, upgrading the firmware did not help. I still had the crashes without the nvme parameter.
Now that I have added the nvme parameter again, the system is stable.
 
Yes, I'm running without it:
Bash:
root@pve:~# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.5.11-7-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet iommu=pt

The only change from the default there is iommu=pt, to pass through an HBA card.

I'd be interested in seeing how long the drives actually stay in the sleep states and whether it really matters much, but I don't think I can look that up the way powertop shows CPU idle states.
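
As far as I can tell, the closest thing is a point-in-time snapshot via nvme-cli rather than residency counters (a sketch; feature 0x02 is Power Management, 0x0c is Autonomous Power State Transition):

Bash:
# current power state of the controller
nvme get-feature /dev/nvme1 -f 0x02 -H
# APST table the kernel programmed (which idle states may be entered, and after how much idle time)
nvme get-feature /dev/nvme1 -f 0x0c -H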

Regards,
Trace
 
So the same drive just failed again, with the same error as before. I'll also revert to using the kernel parameter for now.
I might be able to swap out the drive this weekend and will post here when I do.
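
(To identify which physical drive maps to which controller before pulling anything, the serial from the ZFS vdev path can be matched against the controller list, e.g.:)

Bash:
# serial and model number per NVMe namespace
nvme list
# the by-id symlinks ZFS uses contain the same serial
ls -l /dev/disk/by-id/ | grep nvme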

Regards,
Trace
 
So the same drive just failed again, with the same error as before. I'll also revert to using the kernel parameter for now.
I might be able to swap out the drive this weekend and will post here when I do.
Sorry to hear that!
My drive has been running without problems since I added the parameter 3 days ago.
 
