NVMe SSD suddenly no longer recognized

phdb44

New Member
Dec 15, 2025
Hello, I'm encountering a rather unusual problem.
My setup: Ryzen 7 5700X, 64GB RAM, a 4TB Samsung 990 Pro NVMe SSD, two 120GB SSDs (one for ISOs and one for Snap), two 3TB HDDs (for backups), and a 650W power supply.

My VMs: 45 VMs, 9 of which are running permanently. Occasionally I start a few more for testing, but that is infrequent. The 45 VMs use 27% of the disk space. As for memory usage, I don't recall ever exceeding 50%. The VMs are primarily Linux machines running in text mode (no GUI). Sometimes I run one or two graphical VMs (Windows or Linux), but again, that is infrequent.

My problem: All of this worked perfectly for several months. Then one day, the NVMe SSD was no longer recognized. I removed the NVMe SSD and tested it, and it worked perfectly. I reassembled everything, wiped all the drives in the server, and immediately reinstalled the latest version of Proxmox.
Everything worked fine again for several months. Then, once again, the NVMe SSD suddenly stopped being recognized.
I unplugged the server, brought it back to my office, turned it back on... and everything was fine! The SSD was recognized!

Could someone explain what might be causing this, which steps I could take to trace and understand it, and most importantly, how to anticipate and fix it?

Thank you in advance for your help :).
 
Add to that: a flaky bus controller on the motherboard that starts working again once it cools down or the capacitors are fully discharged. Also check the PSU (try an alternate one) and the RAM.
 
Thank you very much for these interesting comments and advice :)
To begin, I installed nvme-cli and started a process that logs the drive temperature every minute, just to see what happens as events unfold...
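For anyone who wants to do the same, here is a minimal sketch of that kind of per-minute temperature logging. It is not the exact process mentioned above. It assumes that nvme-cli is installed, that the script runs as root, and that the JSON output of `nvme smart-log` contains a `temperature` field in Kelvin (true for the nvme-cli versions I know of, but worth checking against your own output). The function names, the device path and the log file path are placeholders made up for this example.

```python
import json
import subprocess
import time
from datetime import datetime


def parse_temperature_c(smart_log_json: str) -> int:
    """Return the composite temperature from smart-log JSON, in whole degrees Celsius."""
    data = json.loads(smart_log_json)
    # Assumption: "temperature" is reported in Kelvin; 273 is a whole-degree offset.
    return int(data["temperature"]) - 273


def read_temperature_c(device: str) -> int:
    """Query the drive once through nvme-cli and return its temperature."""
    result = subprocess.run(
        ["nvme", "smart-log", device, "--output-format=json"],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return parse_temperature_c(result.stdout)


def log_forever(device: str, logfile: str, interval_s: int = 60) -> None:
    """Append one timestamped line per interval; failed reads are logged too."""
    while True:
        stamp = datetime.now().isoformat(timespec="seconds")
        try:
            line = f"{stamp} {device} {read_temperature_c(device)} C"
        except (subprocess.SubprocessError, OSError, ValueError, KeyError) as exc:
            # A failed read is the interesting event: it may mark the moment
            # the drive drops off the bus, so record it instead of crashing.
            line = f"{stamp} {device} READ FAILED: {exc}"
        with open(logfile, "a") as fh:
            fh.write(line + "\n")
        time.sleep(interval_s)
```

Calling `log_forever("/dev/nvme0", "/var/log/nvme-temp.log")` starts the loop (both paths are examples). Keeping the failed reads in the same file means the last good temperature and the first failure end up next to each other, which is what you want to look at after the drive disappears. Since the log is only useful if it survives the failure, it should live on a drive other than the NVMe SSD being monitored.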