Had the literal worst experience with Proxmox VE (iSCSI LVM datastore corrupted)

beta_2017

New Member
Apr 10, 2025
With the recent shitcom dumpster fire, I wanted to test how Proxmox would look in my personal homelab and then give my findings to my team at work. I have 2 identical hosts, plus a third host with a TrueNAS Core install serving iSCSI datastores to them over 10G DAC cables.

I set up one of the hosts to run Proxmox and started the migration, which, I will say, was awesome. I had some issues getting the initial network set up and running, but after I got the networks how I wanted them, I set up the iSCSI connection to the first host (not multipathed, since I didn't have redundant links to either of the hosts, but marked as shared in Proxmox) so I could get storage going for the VMs.
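Roughly, the result in /etc/pve/storage.cfg looked like this (reconstructed from memory; the storage IDs, portal address, target IQN, and volume ID are placeholders, not my exact values):

    iscsi: truenas-iscsi
            portal 10.0.0.10
            target iqn.2005-10.org.freenas.ctl:proxmox
            content none

    lvm: vm-datastore
            vgname vg_vmdata
            base truenas-iscsi:0.0.0.scsi-<wwid>
            shared 1
            content images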

I didn't have enough room on my TrueNAS to do the migration, so I had a spare QNAP with spinnys that held the big-boy VMs while I migrated smaller VMs to a smaller datastore that I could run side-by-side with the VMFS datastores I had from ESXi. I then installed Proxmox on the other host and made a cluster - same config, aside from different IP addresses, obviously. The iSCSI datastores I had on the first host were immediately detected and usable on the second, allowing for hot migration (which is a shitload faster than VMware, nice!!), HA, the works...

I created a single datastore that had all the VMs running on it... which I now know is a terrible idea for IOPS (and which I, like an idiot, didn't really think through). Once I noticed that everything slowed to a crawl whenever any VM was doing literally anything, I decided I should make another datastore. This is where everything went to shit.

I'll list my process; hopefully someone can tell me where I fucked up:

(To preface: I had a single iSCSI target in VMware with multiple datastores (extents) under it. I intended to do the same in Proxmox, because that's what I expected to work without issue.)

  1. I went into TrueNAS and made another datastore volume, with a completely different LUN ID that had never been known to Proxmox, and placed it under the same target I had created previously.
  2. I then went to Proxmox and told it to refresh storage; I restarted iscsiadm too, because the LUN wasn't coming up right away. I did not restart iscsid.
  3. I didn't see the new LUN under available storage, so I migrated the VMs off one of the hosts and rebooted it.
  4. When that host came up, all the VMs went from green to ? in the console. I was wondering what was up with that, because they all seemed to be running fine.
    1. I now know that they may all have looked like they were running, but man oh man, they were NOT.
  5. I then dug deeper in the CLI to look at the available LVMs, and the "small" datastore that I was using during the migration was just gone. 100% nonexistent. I then had a mild hernia.
  6. I rebooted, restarted iscsid, iscsiadm, Proxmox's services... all to no avail.
    1. During this time, the iSCSI path was up; it just wasn't seeing the LVMs.
  7. I got desperate and started looking at filesystem recovery.
    1. I did a testdisk scan on the storage attached via iSCSI; it didn't see anything for the first ~200 blocks of the datastore, but all of the VMs' files appeared intact, with no practical way for me to recover them (I determined it would have taken too much time to extract and re-migrate).
  8. Whatever happened between steps 1-4 corrupted the LVM headers to the point of no recovery. I tried all of the LVM recovery commands (the usual flow is sketched after this list), none of which worked because the UUID of the LVM was gone...
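For reference, the recovery flow I was attempting was roughly the standard LVM metadata restore (the VG name, device, and archive file are placeholders; this only works if /etc/lvm/archive still holds metadata backups and the old PV UUID is known):

    # list any archived metadata for the volume group
    vgcfgrestore --list vg_vmdata
    # recreate the PV label with its old UUID, taken from an archive file
    pvcreate --uuid <old-pv-uuid> --restorefile /etc/lvm/archive/vg_vmdata_XXXXX.vg /dev/sdX
    # then restore the VG metadata (dry-run first)
    vgcfgrestore --test -f /etc/lvm/archive/vg_vmdata_XXXXX.vg vg_vmdata

In my case this went nowhere, because nothing on disk matched the archived UUIDs anymore.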
I said enough is enough, disaster-recovered back to VMware from Veeam (got NFR keys to keep the lab running; thank god I didn't delete the backup chains from the VMware environment), and haven't given Proxmox a second thought since.

Something as simple as adding an iSCSI LUN to the same target absolutely destroying a completely separate datastore??? What am I missing?! Was it actually because I didn't set up multipathing?? It was bizarre, and quite literally the scariest thing I've ever dealt with, and I want to learn from it so that if we do decide to move to Proxmox at work in the future, this doesn't happen again.

TL;DR - I (or Proxmox, idk) corrupted the LVM header of an entire "production" datastore full of VM data after adding a second LUN to an existing target in Proxmox, and I could not recover the LVM.
 
Hi @beta_2017, welcome to the forum.

I am sad to hear about your experience. That said, my initial recommendation is to keep emotion and irrelevant information out of a technical problem report when looking for help.

It appears from your report that you created a two-node cluster and then rebooted one of the nodes. Note that losing 50% of your cluster's node population means that neither side retains a majority. When HA is enabled, the remaining members of the cluster (or member, in your case) must take an action, such as a self-reboot or a service stop, to prevent a potential HA split-brain.
Whether or not that happened in your case is unclear.

It is recommended to have a three-node cluster or, at the very least, to enable "maintenance" mode prior to maintenance.
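For example (a sketch; the maintenance command requires a reasonably recent PVE version, and the node name is a placeholder):

    # check quorum and vote counts before rebooting a node
    pvecm status
    # drain a node before planned maintenance...
    ha-manager crm-command node-maintenance enable <nodename>
    # ...and bring it back afterwards
    ha-manager crm-command node-maintenance disable <nodename>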

I then went to Proxmox and told it to refresh storage; I restarted iscsiadm too, because the LUN wasn't coming up right away. I did not restart iscsid.
You likely would have been better off running pvesm scan iscsi <portal> and lsscsi, and checking dmesg.
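Something along these lines (the portal address is a placeholder):

    # what does PVE itself see on the portal?
    pvesm scan iscsi <portal>
    # which SCSI devices does the kernel currently know about?
    lsscsi
    # did the new LUN attach, or did something error out?
    dmesg | tail -n 50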
When that host came up, all the VMs went from green to ? in the console.
An indication of communication issues within the cluster.
the "small" datastore that I was using during the migration was just gone
An indication of iSCSI communication issues. Together with the prior point, possibly a network configuration problem.
During this time, the iSCSI path was up; it just wasn't seeing the LVMs.
Are you saying the LUNs were visible via "lsscsi", "lsblk", and other methods, yet "lvs/pvs/vgs" would report nothing?
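For example, checking each layer in turn narrows down where visibility stops:

    lsscsi        # SCSI layer: are the LUNs present as devices?
    lsblk         # block layer: right disks, right sizes?
    pvs -a        # LVM layer: are the PV labels still detected?
    vgs; lvs      # and do the VG/LVs still exist on top of them?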
TL;DR - I (or Proxmox, idk) corrupted the LVM header of an entire "production" datastore full of VM data after adding a second LUN to an existing target in Proxmox, and I could not recover the LVM.
It is not clear to me that you did anything with LVM. It seems that you added a second LUN to a target and then lost access after a manual reboot.
Extending a volume group (roughly the equivalent of ESXi's "adding an extent") requires manual steps in the Linux shell, outside of the PVE interface.
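As a rough sketch (device, VG, and storage names are placeholders), the flow is either extending the existing volume group, or, arguably cleaner for keeping datastores separate, creating a new VG with its own PVE storage entry:

    # option A: grow the existing volume group with the new LUN
    pvcreate /dev/sdX
    vgextend vg_vmdata /dev/sdX

    # option B: keep the datastores separate - new VG plus new PVE storage
    vgcreate vg_vmdata2 /dev/sdX
    pvesm add lvm vm-datastore2 --vgname vg_vmdata2 --shared 1 --content images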

Based on the totality of your report, it seems that your system stopped seeing the LVM disk signature after you manipulated the SAN configuration via the SAN interface. Absent any logs, it is just as likely that the SAN manipulation caused the data loss. I am not aware of any process in PVE that would overwrite the first 200 blocks of a disk after a reboot.

Given the lab nature of your setup, I would recommend starting from scratch. Either configure your environment using best practices, or try to repeat your steps to produce a reproduction of the issue. Keeping a good log of steps, commands, and outputs would allow the community to provide assistance.
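For example, something as simple as wrapping each session in script(1) captures every command and its output (the log path is arbitrary):

    script -a /root/pve-repro-session.log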

Cheers


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi, unfortunately it's hard to tell in hindsight what exactly went wrong here without logs or diagnostic output. To add to what @bbgeek17 wrote, I have a few questions to understand your steps better:
With the recent shitcom dumpster fire, I wanted to test how Proxmox would look in my personal homelab and then give my findings to my team at work. I have 2 identical hosts, plus a third host with a TrueNAS Core install serving iSCSI datastores to them over 10G DAC cables.

I set up one of the hosts to run Proxmox and started the migration, which, I will say, was awesome. I had some issues getting the initial network set up and running, but after I got the networks how I wanted them, I set up the iSCSI connection to the first host (not multipathed, since I didn't have redundant links to either of the hosts, but marked as shared in Proxmox) so I could get storage going for the VMs.
Just to make sure, do I understand correctly that you
  • set up an iSCSI storage for your TrueNAS target (was "Use LUNs directly" unchecked?)
  • set up an LVM storage pointing to a new LVM volume group on top of one big iSCSI LUN
  • and then created the VM disks in that LVM storage?
Do you still happen to have any of the pvs/vgs/lvs output?
I didn't have enough room on my TrueNAS to do the migration, so I had a spare QNAP with spinnys that held the big-boy VMs while I migrated smaller VMs to a smaller datastore that I could run side-by-side with the VMFS datastores I had from ESXi. I then installed Proxmox on the other host and made a cluster - same config, aside from different IP addresses, obviously. The iSCSI datastores I had on the first host were immediately detected and usable on the second, allowing for hot migration (which is a shitload faster than VMware, nice!!), HA, the works...
So at this point, you had two LUNs on the TrueNAS (one for VMFS/ESXi, one for your two-node Proxmox VE cluster) and one LUN on the QNAP. Was the QNAP one used as storage for ESXi or Proxmox VE (or, put differently, were the big VMs running on ESXi or Proxmox VE at this point)?
  1. I went into TrueNAS and made another datastore volume, with a completely different LUN ID that had never been known to Proxmox, and placed it under the same target I had created previously.
  2. I then went to Proxmox and told it to refresh storage; I restarted iscsiadm too, because the LUN wasn't coming up right away. I did not restart iscsid.
  3. I didn't see the new LUN under available storage, so I migrated the VMs off one of the hosts and rebooted it.
Can you clarify what you mean by "refresh storage"? Where did you check for the new LUN under "available storage"?
  4. When that host came up, all the VMs went from green to ? in the console. I was wondering what was up with that, because they all seemed to be running fine.
    1. I now know that they may all have looked like they were running, but man oh man, they were NOT.
  5. I then dug deeper in the CLI to look at the available LVMs, and the "small" datastore that I was using during the migration was just gone. 100% nonexistent. I then had a mild hernia.
  6. I rebooted, restarted iscsid, iscsiadm, Proxmox's services... all to no avail.
    1. During this time, the iSCSI path was up; it just wasn't seeing the LVMs.
  7. I got desperate and started looking at filesystem recovery.
    1. I did a testdisk scan on the storage attached via iSCSI; it didn't see anything for the first ~200 blocks of the datastore, but all of the VMs' files appeared intact, with no practical way for me to recover them (I determined it would have taken too much time to extract and re-migrate).
I realize it's unlikely, but still: do you happen to have logs from this period, from either of the two nodes? The fact that the VMs were displayed grey in the GUI indicates that there were some issues with pvestatd, and the logs would contain useful further information.
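If you do, something along these lines would be a good starting point (the time window is an example):

    # pvestatd and related daemon messages from the current boot
    journalctl -b -u pvestatd -u pvedaemon
    # or from the previous boot, if the incident was before the last reboot
    journalctl -b -1 -u pvestatd
    # or a specific window around the incident
    journalctl -u pvestatd --since "2025-04-09" --until "2025-04-11"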