[SOLVED] Are SSD NVMe issues resurfacing with the latest PVE kernel upgrade?

Chipping in here (unfortunately) with a Seagate FireCuda 520 1TB nvme dropping after about 4 years without issues:

Code:
Jun 27 18:49:07 pve1 kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Jun 27 18:49:07 pve1 kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Jun 27 18:49:07 pve1 kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
Jun 27 18:49:07 pve1 kernel: nvme 0000:02:00.0: enabling device (0000 -> 0002)
Jun 27 18:49:07 pve1 kernel: nvme nvme0: Disabling device after reset failure: -19

pve-manager/9.2.3/d0fde103346cf89a (running kernel: 7.0.12-1-pve)

Luckily, all is fine again after a reboot.

For reference:
I've been updating Proxmox versions and kernels more or less right after they became available for the past year or so (I ran Debian on this system before) without any nvme issues (or any other issues actually).
The system was sending backups at the time of the drop, but nothing out of the ordinary.
I have two of these nvme's in a mirror and each is on a separate PCIe-M.2 (SST-ECM20) adapter card.
Both nvme's have always been on what turns out to be the latest firmware already (STNSC016).
SMART is (still) fine and the drives both have about 4% wear.

I haven't applied the suggested kernel parameters yet, but in the last 4 years I also didn't need them.
Same for changing the CPU power governor to performance, which has always been on powersave without problems.
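
In case it helps anyone who does want to try those parameters: a minimal Python sketch for checking which of the three options from the kernel message are already on the running kernel's command line. The function name `missing_params` is just something I made up for illustration, and this only inspects the command line, it doesn't change anything.

```python
# Parameters quoted verbatim in the kernel message above.
SUGGESTED = (
    "nvme_core.default_ps_max_latency_us=0",
    "pcie_aspm=off",
    "pcie_port_pm=off",
)


def missing_params(cmdline: str) -> list[str]:
    """Return the suggested parameters that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [p for p in SUGGESTED if p not in present]
```

On a running system this could be called as `missing_params(open("/proc/cmdline").read())`; an empty list would mean all three are already set.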

As it is exceptionally hot right now, could this be a temperature thing after all?
The system reports an ambient temperature of 29 degrees, where this is normally about 10 degrees lower.
The nvme that dropped now reports ~57 Celsius idle and the mirrored one that didn't drop shows ~50 Celsius idle.
Under the backup load this was probably a bit more. That's definitely hotter than usual but afaik not yet alarming. Unless it is?

Edit: looking into this, the dropped drive has Thermal Management log values >0 where the other one doesn't:
Code:
Thermal Management T1 Trans Count       : 11
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 713
Thermal Management T2 Total Time        : 0
I can't find when these happened (yet).
Although throttling shouldn't cause a drop, investing in a pair of heat sinks probably won't hurt.
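
To put the counters above in perspective, here is a rough Python sketch that pulls the Thermal Management values out of smart-log style text and works out the average time per throttling event. It assumes the field labels look exactly like the lines I pasted, and that the Total Time values are in seconds (which is how I understand the NVMe spec, but I haven't verified it for this drive). The function names are my own.

```python
import re

# Matches lines such as "Thermal Management T1 Trans Count       : 11".
_PATTERN = re.compile(
    r"^\s*Thermal Management (T[12]) (Trans Count|Total Time)\s*:\s*(\d+)",
    re.MULTILINE,
)


def parse_thermal_log(text: str) -> dict[str, int]:
    """Extract the Thermal Management counters, keyed like 'T1 Trans Count'."""
    return {f"{level} {field}": int(value) for level, field, value in _PATTERN.findall(text)}


def average_time(counters: dict[str, int], level: str = "T1"):
    """Average Total Time per transition, or None if there were no transitions."""
    count = counters.get(f"{level} Trans Count", 0)
    total = counters.get(f"{level} Total Time", 0)
    return total / count if count else None
```

If the seconds assumption holds, the values from my drive would mean roughly a minute of light (T1) throttling per event, and no heavy (T2) throttling at all.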

Edit2:
Some temperature related posts here too:

Edit3:
Fired up a Grafana dashboard to monitor the nvme/ssd temps for a few days, before and after installing heat sinks. Besides a modest decrease in base temperature the heat sinks seem to diminish temporary increases quite well. Before, running a backup easily increased the nvme temperature by 5-10 degrees Celsius. After installing heat sinks this increase is only 1-2 degrees.
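
For anyone wanting to collect similar numbers without a full dashboard, a minimal Python sketch of one way to read the temperatures: it assumes the kernel exposes the drives through the hwmon interface under /sys/class/hwmon with the name "nvme" and a temp1_input file in millidegrees Celsius. This is not necessarily how my Grafana setup gets its data, and `read_nvme_temps` is a made-up name.

```python
from pathlib import Path


def read_nvme_temps(hwmon_root: str = "/sys/class/hwmon") -> dict[str, float]:
    """Return {hwmon directory name: temperature in degrees Celsius} for nvme devices."""
    temps = {}
    for dev in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = dev / "name"
        temp_file = dev / "temp1_input"
        # Skip entries that lack either file or belong to another sensor driver.
        if not (name_file.is_file() and temp_file.is_file()):
            continue
        if name_file.read_text().strip() != "nvme":
            continue
        # hwmon reports millidegrees Celsius.
        temps[dev.name] = int(temp_file.read_text().strip()) / 1000.0
    return temps
```

Calling `read_nvme_temps()` from a cron job or a small loop and logging the result with a timestamp should be enough to spot the kind of 5-10 degree jumps during backups mentioned above.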
 