We have a 3-node Proxmox cluster. The nodes are set up almost identically: each is a Dell PowerEdge R640 with directly attached SSDs configured as two hardware RAID-5 arrays, and on each of those arrays an LVM-thin volume is built in Proxmox.
Twice in relatively short succession, the LVM metadata of some of the volumes on these nodes has been corrupted or simply lost.
When that happens this is the status:
- On the VMs
- The VMs hosted on that LVM storage are still running normally
  - VMs can be powered down, but not powered back on (Proxmox complains about the missing storage)
- In Proxmox GUI
- Status is Unknown in GUI sidebar
- Disk is seen under `Disks`
  - The second time, the `Usage` column of the affected disk was reported as `Device Mapper` instead of `LVM`, unlike the unaffected one
- Under LVM and LVM-Thin the affected LVM is not seen at all
- In the Shell
- `lsblk` still lists the disk
  - `vgscan` does not report the affected VG at all
  - `lvscan` does not list the affected LVs at all, not even as INACTIVE
- `/etc/pve/storage.cfg` still shows the setup like before
- When trying to restore
- `/etc/lvm/backup` is as I would expect it to be
  - `vgcfgrestore` complains that it can't find a device with that specific `UUID`
- `pvcreate` complains about the same
---
I can restore the LVM metadata using the steps documented by Red Hat here: https://docs.redhat.com/en/document...n-an-lvm-physical-volume_troubleshooting-lvm
Basically I need to power down the VMs, remove the stale device-mapper nodes with `dmsetup`, then run `pvcreate` with the missing UUID (`--uuid`) pointed at the last backup file (`--restorefile`), and finally run `vgcfgrestore`.
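For reference, the sequence condenses to something like the sketch below. The VG name, UUID, and device node are placeholders, not my real values, and the commands are only printed (dry run) so nothing destructive happens by accident; drop the `echo`s to actually execute:

```shell
#!/bin/sh
# Dry-run sketch of the restore sequence. VG, UUID, and device below are
# hypothetical placeholders -- substitute the values from your own setup
# and from the vgcfgrestore error message.
VG="vg_thin"                              # placeholder VG name
PV_UUID="<uuid-from-error-message>"       # the UUID vgcfgrestore says is missing
DEV="/dev/sdb"                            # underlying RAID device
BACKUP="/etc/lvm/backup/$VG"              # default LVM metadata backup location

echo "qm shutdown <vmid>                 # stop each VM on the affected storage"
echo "dmsetup remove <mapper-device>     # for each stale thin-LV mapping"
echo "pvcreate --uuid $PV_UUID --restorefile $BACKUP $DEV"
echo "vgcfgrestore $VG                   # restore VG metadata from the backup"
echo "vgchange -ay $VG                   # reactivate the logical volumes"
```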
But my question is: does anyone have an idea what might be causing this, and how I can prevent it in the future? I'm a bit at a loss right now. Also, if anyone knows a way to restore things without powering down the VMs, that would be great.
I don't even really know where I'd begin looking for the cause. But here's what happened before the issue occurred each time:
The first time this happened, a new node (PRX04) had been added to the cluster and two new LVM-thin storages were created on that server's disks. Afterwards, the LVM metadata was lost on the *other* node (PRX03).
The second time, a VM was live-migrated from one node (PRX04) to another (PRX02). Afterwards, on *both* PRX04 (the source node) and PRX03 (an unrelated node), the LVM metadata was lost on one of the two LVMs on each of them. But not on PRX02, the destination node.