NVMe drive data randomly disappearing from servers

pumapumapumas

So a week ago today, on host 009 in a 5-node cluster, the metadata for the local drive that hosted all the VM storage just magically disappeared. I may have misdiagnosed the issue: assuming it was a failed drive (this host is really old), I retired the host, rebuilt my VMs, and migrated them to a much newer host.

Today on a much newer host, 007, it did something similar: everything associated with the NVMe disappeared except the drive itself. Even the partition is gone.

Total mind blown......

I recently switched to the newest version of Proxmox to take advantage of the awesome GPU passthrough support, but otherwise this has been a reliable cluster for years.
1. Are there any known issues with the new version of Proxmox that could cause this?
2. Are there any known gotchas on the new version with naming additional local VM storage local-lvm1, 2, 3, etc.? (pve1/data1)

I have been thinking about this for a while and I just can't imagine what would cause this.
I thought maybe the kernel wasn't seeing the drive, or was malfunctioning, so I booted an Ubuntu desktop "try" session, and from there too there was no partition on the drive. It's literally GONE.

system info:
pve-manager/8.4.1/2a5fa54a8503f96d (running kernel: 6.8.12-10-pve)

Thanks in advance for any advice you might have.
 
I have seen data disappear from SSDs (without PLP) due to a power outage. Especially when trimming or otherwise rearranging data in the background, SSDs are very vulnerable to unexpected power loss (that wipes much more than just a few recent files).
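One way to test the power-loss theory on an NVMe drive is to watch its unsafe-shutdown counter between incidents. A minimal sketch, assuming smartmontools is installed and that `smartctl -j -a` reports the NVMe health log under the `nvme_smart_health_information_log` key; the function names are made up for illustration:

```python
import json
import subprocess


def read_health(device):
    """Return the NVMe health log as a dict, via `smartctl -j -a <device>`."""
    # smartctl uses a bitmask exit status, so a non-zero code is not treated as fatal.
    result = subprocess.run(
        ["smartctl", "-j", "-a", device],
        capture_output=True, text=True, check=False,
    )
    report = json.loads(result.stdout)
    return report.get("nvme_smart_health_information_log", {})


def new_unsafe_shutdowns(before, after):
    """Number of unsafe shutdowns the drive recorded between two snapshots."""
    return after.get("unsafe_shutdowns", 0) - before.get("unsafe_shutdowns", 0)
```

Taking one snapshot now with `read_health("/dev/nvme0")` (the device path is a placeholder) and another after the next incident would show whether the drive itself thinks it lost power, even if the host never went down.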
 
Typically I use the Crucial 2TB Gen4 PCIe drive for the local VM storage, and Ceph RBD for various other things in HA mode.

https://www.amazon.com/gp/product/B0B25ML2FH/ref=ewc_pr_img_1?smid=ATVPDKIKX0DER

What's weird is that it's not just removing a few files, it's removing the entirety of the drive.
The first time, it removed all metadata from an entire drive ("local-lvm1"). The data was still there, but worthless without the metadata. It was just literally GONE.
The second time, on a totally different host with nothing reused from the old host (the old host was removed from the cluster), it removed the entire partition from the local VM storage drive "local-lvm1".
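Since the metadata lived on a separate drive from the root disk, LVM may still hold archived copies of the volume group metadata on the root filesystem; `vgcfgrestore --list pve1` shows the same information. A minimal sketch for listing them, assuming the default archive directory `/etc/lvm/archive`, the volume group name `pve1` from the storage path mentioned earlier, and a hypothetical helper name. It only lists files and restores nothing:

```python
from pathlib import Path


def list_vg_archives(vg_name, archive_dir="/etc/lvm/archive"):
    """Return archived LVM metadata files for one volume group, in name order."""
    directory = Path(archive_dir)
    if not directory.is_dir():
        return []
    # Assumption: archive files are named "<vgname>_<sequence>-<random>.vg"
    # with a zero-padded sequence, so name order follows creation order.
    return sorted(directory.glob(f"{vg_name}_*.vg"))
```

If `list_vg_archives("pve1")` comes back empty on the host, there is nothing to restore from; if it returns files, their timestamps at least narrow down when the volume group last looked sane.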

Both incidents have caused me rework, so I am really hoping to find a root cause. So far I am totally drawing blanks.
Both times it happened on a Sunday, exactly one week apart.

I see that power outages have been mentioned multiple times, and I saw the same concern online, but both times this happened I was at my workstation and saw it happen immediately in my monitoring. No power flicker, no storms....
 
What's weird is that it's not just removing a few files, it's removing the entirety of the drive.
That's what I tried to say before. Consumer SSDs (like your QLC drive, which is terrible for VMs by the way!) can lose all data easily on unexpected power loss as it moves "data at rest" around (including the drive's internal data allocation tables).

I see that power outages have been mentioned multiple times, and I saw the same concern online, but both times this happened I was at my workstation and saw it happen immediately in my monitoring. No power flicker, no storms....
I cannot explain that, sorry. There have been reports of drives dropping out because the PSU cannot deliver the necessary power when, for example, all drives are writing and the CPU load is also high. I fear that you cannot look at the logs from around that time, as they are on the same drive? Any error message on a connected physical display?
Maybe use better drives in the future? Or continuously send your Proxmox logs to another server to investigate when this happens again?
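For the log-shipping idea, the usual route would be an existing tool such as rsyslog or systemd-journal-upload. Purely as an illustration of the principle, here is a minimal sketch that follows the local journal and sends each entry to a remote syslog server over UDP. The function names and the target host are assumptions, not anything from your setup:

```python
import json
import logging
import logging.handlers
import subprocess


def format_entry(raw_line):
    """Turn one `journalctl -o json` line into a short text line."""
    entry = json.loads(raw_line)
    unit = entry.get("SYSLOG_IDENTIFIER", "unknown")
    message = entry.get("MESSAGE", "")
    return f"{unit}: {message}"


def forward_journal(host, port=514):
    """Follow the local journal and send every entry to a remote syslog server (UDP)."""
    logger = logging.getLogger("journal-forward")
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.SysLogHandler(address=(host, port)))
    proc = subprocess.Popen(
        ["journalctl", "-f", "-o", "json"],
        stdout=subprocess.PIPE, text=True,
    )
    for line in proc.stdout:
        logger.info(format_entry(line))
```

Running `forward_journal("192.0.2.10")` (a placeholder address) on each node would leave a copy of the kernel and storage messages on another machine, so they survive even if the local drive disappears again.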
 