Today it happened again. Looking forward to help from the professionals too.
Hi, I have a B650D4U with the latest BIOS and BMC. I downgraded to the previous version and have had no failures or errors so far.
Maybe it helps you to investigate the error.
Hi everyone, for now I think disabling `Pci Express Clock Gating` may be the solution. As I understand it, PCI Express clock gating turns off the clock signal when the device is idle to save power. Maybe some NVMe devices do not support this.
I'm pretty sure that this was the problem. Normally we had the error on one of our 10 servers 1-2 times per week (at random). Now there have been no issues for weeks, with no other changes.
Are you sure this has fixed it, and it's not just a long time since you had the error? I had long, long stretches like this too.
I also definitely got the error while ASPM was disabled (at BIOS level).
All that, and I'm now sat 100% stable with ASPM enabled on kernel 6.5.13-6-pve.
Thanks! But... I already have those in my GRUB file, and can confirm APSTE is disabled for the two 990 Pro drives, but I get the following for my two U.2 drives (output below the quoted steps):
I found a solution that worked for me:
- Install nvme-cli: `apt-get install nvme-cli`
- Check `nvme get-feature /dev/nvme0 -f 0xc -H` (scroll up; the first line reads "Autonomous Power State Transition Enable (APSTE): Enabled")
- `nano /etc/kernel/cmdline`
- Append `nvme_core.default_ps_max_latency_us=0 pcie_aspm=off` to the existing line
- The full line then reads: `root=ZFS=rpool/ROOT/pve-1 boot=zfs nvme_core.default_ps_max_latency_us=0 pcie_aspm=off`
- `proxmox-boot-tool refresh`
- Reboot
- Check `nvme get-feature /dev/nvme0 -f 0xc -H` again; it should now show "Autonomous Power State Transition Enable (APSTE): Disabled"
No problems so far, for weeks now.
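The steps above could be scripted roughly as follows. This is only a minimal sketch, assuming a Proxmox host that boots through `proxmox-boot-tool` and keeps its kernel command line in `/etc/kernel/cmdline`, as in the post. The helper names `add_param` and `apste_state` are made up for illustration and are not part of nvme-cli or Proxmox.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Print the command line "$1" with parameter "$2" appended,
# unless "$2" is already present as a whole word.
add_param() {
    case " $1 " in
        *" $2 "*) printf '%s\n' "$1" ;;
        *)        printf '%s\n' "$1 $2" ;;
    esac
}

# Read `nvme get-feature ... -f 0xc -H` output on stdin and print
# the APSTE state ("Enabled" or "Disabled") from the first matching line.
apste_state() {
    sed -n 's/.*(APSTE): *//p' | head -n 1
}

# Usage on a real host (run as root; commented out so that running
# this file only defines the helpers and changes nothing):
#   line=$(cat /etc/kernel/cmdline)
#   line=$(add_param "$line" "nvme_core.default_ps_max_latency_us=0")
#   line=$(add_param "$line" "pcie_aspm=off")
#   printf '%s\n' "$line" > /etc/kernel/cmdline
#   proxmox-boot-tool refresh   # then reboot
#   nvme get-feature /dev/nvme0 -f 0xc -H | apste_state
```

Because `add_param` skips parameters that are already present, running it twice should not duplicate entries such as `boot=zfs` on the command line.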
# nvme get-feature /dev/nvme2 -f 0xc -H
NVMe status: Invalid Field in Command: A reserved coded value or an unsupported value in a defined field(0x2)
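An `Invalid Field in Command` reply to feature 0xc may simply mean the drive does not implement APST at all, in which case there is nothing to disable. Controllers advertise APST support in the `apsta` field of `nvme id-ctrl`. Here is a minimal sketch that checks that field before querying the feature; the helper `apst_supported` is hypothetical, and it assumes the usual `name : value` layout of nvme-cli output.

```shell
#!/bin/sh
# Hypothetical helper for illustration only.
# Reads `nvme id-ctrl` output on stdin and succeeds (exit 0) if the
# apsta field is present and non-zero, i.e. the drive advertises APST.
apst_supported() {
    val=$(sed -n 's/^apsta[[:space:]]*:[[:space:]]*//p' | head -n 1)
    case "$val" in
        ""|0|0x0) return 1 ;;   # field missing or zero: no APST
        *)        return 0 ;;
    esac
}

# Usage on a real host (commented out so running this file is harmless):
#   if nvme id-ctrl /dev/nvme2 | apst_supported; then
#       nvme get-feature /dev/nvme2 -f 0xc -H
#   else
#       echo "nvme2 does not advertise APST; feature 0xc is not available"
#   fi
```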
Sadly enough, that's the one drive I'm running the OS from. I will have to replace it and rebuild.
We didn't find any solution for the Samsung 990 Pro problem; we tried everything. They just seem to get dropped every ~30 days (the longest runtime we got). We ended up replacing all of them in our servers with drives from a different vendor.
Interestingly, the issue itself seems to be related to the Samsung 990 Pro firmware. For example: we had multiple servers running 990 Pros and updated them (patch day). All of them dropped at least one SSD about 30 days after that, and all servers did so at almost the same time (max. 1 hour difference, which could be due to the uptimes not being 100% identical). So I guess it's some counter or something overflowing and causing this. This can probably only be fixed by Samsung... but as it's not a server/datacenter product, who knows if they will do anything...
This thread was/is about the Samsung 990 Pro dropping out of the system (with the error message from the subject of the thread). If your system crashes, then you probably have different logs/effects and should open a different thread.
System crashed overnight again.
My bad, this thread came up after I searched for a system hanging and high load average numbers. I'll start a new thread.