[SOLVED] TASK ERROR: no such logical volume local-lvm/local-lvm

filou

New Member
Jan 13, 2024
HELP!

Proxmox seems to have 'lost' my local-lvm storage somehow. Most of my VMs and containers live on this drive. In the Proxmox GUI, the storage shows as 'Unknown' in the sidebar and its Status shows Active: No. Most of my VMs are still running, but I am afraid to shut down or restart the host. If I try to start a VM that is not already running, I get the error: TASK ERROR: no such logical volume local-lvm/local-lvm. I cannot seem to mount it or otherwise get Proxmox to recognize the local-lvm storage, even though the running VMs are stored on that drive!

I have no idea how to troubleshoot this situation. Any help would be GREATLY appreciated!

Thank you.

Screenshot 2024-04-03 at 12.53.45 AM.png
Screenshot 2024-04-03 at 12.54.01 AM.png

I can see it in lsblk
Code:
lsblk
NAME                                    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                                       8:0    0   1.9T  0 disk
└─sda1                                    8:1    0   1.9T  0 part
  ├─local--lvm-local--lvm_tmeta         252:0    0  15.9G  0 lvm 
  │ └─local--lvm-local--lvm-tpool       252:2    0   1.8T  0 lvm 
  │   ├─local--lvm-local--lvm           252:3    0   1.8T  1 lvm 
  │   ├─local--lvm-vm--101--disk--0     252:4    0  1000G  0 lvm 
  │   │ ├─local--lvm-vm--101--disk--0p1 252:11   0     1M  0 part
  │   │ └─local--lvm-vm--101--disk--0p2 252:12   0 976.6G  0 part
  │   ├─local--lvm-vm--103--disk--0     252:5    0    50G  0 lvm 
  │   ├─local--lvm-vm--102--disk--0     252:6    0     8G  0 lvm 
  │   ├─local--lvm-vm--105--disk--0     252:7    0     8G  0 lvm 
  │   ├─local--lvm-vm--106--disk--0     252:8    0     8G  0 lvm 
  │   ├─local--lvm-vm--107--disk--0     252:9    0     8G  0 lvm 
  │   └─local--lvm-vm--108--disk--0     252:10   0     2G  0 lvm 
  └─local--lvm-local--lvm_tdata         252:1    0   1.8T  0 lvm 
    └─local--lvm-local--lvm-tpool       252:2    0   1.8T  0 lvm 
      ├─local--lvm-local--lvm           252:3    0   1.8T  1 lvm 
      ├─local--lvm-vm--101--disk--0     252:4    0  1000G  0 lvm 
      │ ├─local--lvm-vm--101--disk--0p1 252:11   0     1M  0 part
      │ └─local--lvm-vm--101--disk--0p2 252:12   0 976.6G  0 part
      ├─local--lvm-vm--103--disk--0     252:5    0    50G  0 lvm 
      ├─local--lvm-vm--102--disk--0     252:6    0     8G  0 lvm 
      ├─local--lvm-vm--105--disk--0     252:7    0     8G  0 lvm 
      ├─local--lvm-vm--106--disk--0     252:8    0     8G  0 lvm 
      ├─local--lvm-vm--107--disk--0     252:9    0     8G  0 lvm 
      └─local--lvm-vm--108--disk--0     252:10   0     2G  0 lvm

And fdisk -l
Code:
Disk /dev/sda: 1.86 TiB, 2048408248320 bytes, 4000797360 sectors
Disk model: Samsung SSD 850
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: CB63BC22-2519-44F1-A915-4CF8D8F6EFDA


Device     Start        End    Sectors  Size Type
/dev/sda1   2048 4000796671 4000794624  1.9T Linux LVM
 
Not knowing your full setup/installation:

Is that the full output from lsblk or have you redacted some of it?
I don't see a root or a swap. (Nor any boot partition)

Do you boot from a different drive?

Your naming conventions also seem to have been manually adjusted. Normally the PVE installation will create an LVM named data with a corresponding pve-data-tpool; you've got something else entirely. What is the installation history behind this?

What does lvscan show?

What does df -h show?

What does (GUI) Datacenter, Storage show?

I expect you have full (external) backups of your VMs & LXCs.

Maybe I'm completely missing something about your setup.
 
Here is the full output from lsblk, which shows the boot drives: the (two) NVME drives set up as ZFS.
Code:
lsblk
NAME                                    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                                       8:0    0   1.9T  0 disk
└─sda1                                    8:1    0   1.9T  0 part
  ├─local--lvm-local--lvm_tmeta         252:0    0  15.9G  0 lvm 
  │ └─local--lvm-local--lvm-tpool       252:2    0   1.8T  0 lvm 
  │   ├─local--lvm-local--lvm           252:3    0   1.8T  1 lvm 
  │   ├─local--lvm-vm--101--disk--0     252:4    0  1000G  0 lvm 
  │   │ ├─local--lvm-vm--101--disk--0p1 252:11   0     1M  0 part
  │   │ └─local--lvm-vm--101--disk--0p2 252:12   0 976.6G  0 part
  │   ├─local--lvm-vm--103--disk--0     252:5    0    50G  0 lvm 
  │   ├─local--lvm-vm--102--disk--0     252:6    0     8G  0 lvm 
  │   ├─local--lvm-vm--105--disk--0     252:7    0     8G  0 lvm 
  │   ├─local--lvm-vm--106--disk--0     252:8    0     8G  0 lvm 
  │   ├─local--lvm-vm--107--disk--0     252:9    0     8G  0 lvm 
  │   └─local--lvm-vm--108--disk--0     252:10   0     2G  0 lvm 
  └─local--lvm-local--lvm_tdata         252:1    0   1.8T  0 lvm 
    └─local--lvm-local--lvm-tpool       252:2    0   1.8T  0 lvm 
      ├─local--lvm-local--lvm           252:3    0   1.8T  1 lvm 
      ├─local--lvm-vm--101--disk--0     252:4    0  1000G  0 lvm 
      │ ├─local--lvm-vm--101--disk--0p1 252:11   0     1M  0 part
      │ └─local--lvm-vm--101--disk--0p2 252:12   0 976.6G  0 part
      ├─local--lvm-vm--103--disk--0     252:5    0    50G  0 lvm 
      ├─local--lvm-vm--102--disk--0     252:6    0     8G  0 lvm 
      ├─local--lvm-vm--105--disk--0     252:7    0     8G  0 lvm 
      ├─local--lvm-vm--106--disk--0     252:8    0     8G  0 lvm 
      ├─local--lvm-vm--107--disk--0     252:9    0     8G  0 lvm 
      └─local--lvm-vm--108--disk--0     252:10   0     2G  0 lvm 
sdb                                       8:16   0   3.6T  0 disk
└─sdb1                                    8:17   0   3.6T  0 part /mnt/pve/passport
sdc                                       8:32   1     0B  0 disk
zd0                                     230:0    0  1000G  0 disk
├─zd0p1                                 230:1    0     1M  0 part
└─zd0p2                                 230:2    0  1000G  0 part
nvme0n1                                 259:0    0 931.5G  0 disk
├─nvme0n1p1                             259:1    0  1007K  0 part
├─nvme0n1p2                             259:2    0     1G  0 part
└─nvme0n1p3                             259:3    0 930.5G  0 part
nvme1n1                                 259:4    0 931.5G  0 disk
├─nvme1n1p1                             259:5    0  1007K  0 part
├─nvme1n1p2                             259:6    0     1G  0 part
└─nvme1n1p3                             259:7    0 930.5G  0 part

lvscan shows nothing, which means my backup disk (seen above as sdb1) is ALSO unavailable...

df -h shows:
Code:
df -h
Filesystem        Size  Used Avail Use% Mounted on
udev               32G     0   32G   0% /dev
tmpfs             6.3G  1.5M  6.3G   1% /run
rpool/ROOT/pve-1  174G  2.5G  171G   2% /
tmpfs              32G   46M   32G   1% /dev/shm
tmpfs             5.0M     0  5.0M   0% /run/lock
efivarfs          192K   54K  134K  29% /sys/firmware/efi/efivars
rpool             171G  128K  171G   1% /rpool
rpool/ROOT        171G  128K  171G   1% /rpool/ROOT
rpool/data        171G  128K  171G   1% /rpool/data
rpool/pveconf     171G  3.8M  171G   1% /rpool/pveconf
/dev/sdb1          64Z   64Z  3.1T 100% /mnt/pve/passport
/dev/fuse         128M   20K  128M   1% /etc/pve
tmpfs             6.3G     0  6.3G   0% /run/user/0

My backup disk is /dev/sdb1 64Z 64Z 3.1T 100% /mnt/pve/passport, so something is obviously not right there too! The size is completely wrong, although the available space matches what would be expected on that drive. A corrupted partition table? I can cd /mnt/pve/passport, but the directory appears empty.

(GUI) Datacenter, Storage: Screenshot 2024-04-03 at 4.20.08 AM.png

I am really worried. The data seems to still be there(?), since most of the VMs are currently running. I just can't seem to mount the drive(s) any more... What are my options to reattach the lvm storage? Can I extract backups from the currently running VMs somehow??? Worst nightmare.
 
Let's first look at your backups. I understand this is sdb1, supposedly mounted on /mnt/pve/passport. I assume this is the storage passport in your Storage config (shown in the GUI).

Normally, I would expect such output from df -h if a device had been written to in raw-block mode only, with no FS on it: even though df -h shows it, it cannot read any correct FS-derived sizes. So how was/is this drive set up? What FS does it use? What is the health of this drive? Maybe try: smartctl -a /dev/sdb & fsck /dev/sdb1

Also, let's look at cat /etc/fstab



After this you're going to have to check your main ZFS Zpool/s on the NVME's.
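The suggested ZFS checks can be sketched roughly as below (a hedged sketch, assuming the pool is named rpool as the df -h output above suggests; wrapped in a function so nothing runs until you call it on the PVE host):

```shell
# Sketch only -- these must run on the PVE host itself; `rpool` is the
# pool name visible in the df -h output above.
check_zfs_health() {
  zpool status -x rpool    # prints "pool 'rpool' is healthy" when all is well
  zpool scrub rpool        # kick off a full data-integrity scan (background)
  zpool status rpool       # re-check to watch scrub progress / error counters
}
```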
 
Yes, sdb1 is mounted on /mnt/pve/passport, which corresponds to the storage passport in the Storage config (shown in the GUI). This was set up in the GUI; the file system is ext4. But that location shows no files, and in the GUI all my backups and ISO files are gone. The GUI shows that something happened recently that makes the system think this drive has 75.56 ZB and 100% usage.

Screenshot 2024-04-03 at 10.17.17 AM.png

smartctl -a /dev/sdb passes with no errors:
Code:
smartctl -a /dev/sdb
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.13-1-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Elements / My Passport (USB, AF)
Device Model:     WDC WD40NMZW-11GX6S1
Serial Number:    WD-WX21D98AAR6H
LU WWN Device Id: 5 0014ee 60910297e
Firmware Version: 01.01A01
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-3 (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Apr  3 09:50:18 2024 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED


General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (20820) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 610) minutes.
SCT capabilities:              (0x30b5) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.


SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       2800
  4 Start_Stop_Count        0x0032   052   052   000    Old_age   Always       -       48022
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   065   064   000    Old_age   Always       -       26011
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       315
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       75
193 Load_Cycle_Count        0x0032   149   149   000    Old_age   Always       -       154010
194 Temperature_Celsius     0x0022   130   097   000    Old_age   Always       -       22
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0


SMART Error Log Version: 1
No Errors Logged

fsck /dev/sdb1 gives the error message "Cannot continue" and aborts:
Code:
fsck /dev/sdb1
fsck from util-linux 2.38.1
e2fsck 1.47.0 (5-Feb-2023)
/dev/sdb1 is mounted.
e2fsck: Cannot continue, aborting.

cat /etc/fstab is essentially empty:
Code:
cat /etc/fstab
# <file system> <mount point> <type> <options> <dump> <pass>
proc /proc proc defaults 0 0

I have been looking into vgcfgrestore and testdisk, but am unsure how to use either of those utilities and don't want to do anything to further exacerbate the problems.
 
cat /etc/fstab # <file system> <mount point> <type> <options> <dump> <pass> proc /proc proc defaults 0 0
The empty fstab probably means you're going to need to reinstall. However, I imagine the backup drive is in all likelihood actually OK. It's your OS/FS that is messed up, so it's not mounting the filesystem correctly.
I want you to try fsck -n /dev/sdb1 to check the disk one more time.

This was set up in the GUI, file system is ext4
This will make it quite easy - after reinstall, to attach the drive & retrieve the backups.
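For reference, remounting an ext4 drive after a reinstall takes a single fstab entry (or the GUI's Directory storage wizard). A purely illustrative line, with a placeholder UUID that you would replace with the output of blkid /dev/sdb1:

```
# /etc/fstab -- illustrative entry only; substitute the real UUID from blkid
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /mnt/pve/passport  ext4  defaults  0  2
```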
 
Sorry, I forgot to ask originally: how long had the system been installed & running when all this started? Did you suffer a power outage or some other event?
Another point: I understand you installed the OS/BOOT in ZFS on the 2 NVMEs (mirrored). What type of NVMEs are these?

EDIT: Looking at the image (GUI) the event happened around 10:30pm on 2024-04-02. Any recollection? Power/System/Backup event done around then?
 
Is there some type of utility with which I could connect the (removable) backup drive to another machine and recover the VM backups? If I can somehow access those backups, I would be able to very easily restore the affected VMs.

The other option is to get the local-lvm/local-lvm back online somehow. This would obviously be the optimal solution.

fdisk -l shows:
Code:
Disk /dev/sda: 1.86 TiB, 2048408248320 bytes, 4000797360 sectors
Disk model: Samsung SSD 850
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: CB63BC22-2519-44F1-A915-4CF8D8F6EFDA


Device     Start        End    Sectors  Size Type
/dev/sda1   2048 4000796671 4000794624  1.9T Linux LVM
GPT PMBR size mismatch (4294967294 != 7813969919) will be corrected by write.

I don't know what you mean by reinstall. You mean reinstall PVE from scratch?? My main Proxmox install seems OK; the host is booted and running from the NVMEs. It's only the local-lvm storage that is offline, with a corrupted file system. Yes, I installed the OS/BOOT in ZFS on the 2 NVMEs (mirrored): 2x Samsung 980 Pro (1 TB). I would strongly prefer to avoid reinstallation at all costs, since my most important VM is stored in the ZFS storage and is currently unaffected by the partition problems of the two other drives...

The system has been running without issue for months. No recent power outages. Current uptime is 10 days.

I am really at a loss how to proceed. Thank you btw for taking the time to respond.
 
Output of fsck -n /dev/sdb1 (currently running)
Code:
fsck -n /dev/sdb1                                           
fsck from util-linux 2.38.1
e2fsck 1.47.0 (5-Feb-2023)
Warning!  /dev/sdb1 is mounted.
ext2fs_open2: Bad magic number in super-block
fsck.ext2: Superblock invalid, trying backup blocks...
/dev/sdb1 was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
 
Is there some type of utility that I could connect the (removable) backup drive to another machine and recover the VM backups? If I can somehow access those backups, I would be able to very easily restore the effected VMs.
Assuming the drive is working - which I imagine it is - (the fsck -n output you show would indicate it's "probably" healthy; take into account that officially it's still mounted) - since it's ext4 format (& was attached as vanilla directory storage in PVE), all the backup files "should be" intact and easily accessible by connecting it to another Linux PC & mounting it. You should try issuing umount /dev/sdb1 on the PVE host before doing so; I'm not sure that will be successful. The backups will probably be found on the drive (sdb1) in /dump/
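The retrieval procedure on another Linux PC can be sketched as follows (hedged: /dev/sdX1 is a placeholder device name, and the function is only defined here, not invoked):

```shell
# Hypothetical sketch for a separate Linux machine. /dev/sdX1 is a
# placeholder -- confirm the real device name with `lsblk` first.
recover_backups() {
  dev="${1:?usage: recover_backups /dev/sdX1}"
  sudo mkdir -p /mnt/recovery
  sudo mount -o ro "$dev" /mnt/recovery   # read-only: alters nothing on disk
  ls /mnt/recovery/dump/                  # PVE keeps vzdump archives in dump/
}
```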
My main Proxmox install seems OK
I'm not sure of that. Your (almost) empty fstab is very concerning. If you know why this is, by all means try recreating it. I also know nothing of the NVMEs' health & ZFS pool situation.
2x Samsung 980 Pro (1tb)
These are consumer-grade NVMEs with a DWPD of about 0.33. That's pretty bad for a ZFS mirror situation. But this probably isn't your main focus/concern right now.
I would strongly prefer to avoid reinstallation at all costs, since my most important VM is stored in the ZFS storage
If you have proper & working backups, this should not concern you (except for the time-consuming affair + downtime). I very much hope you realize that a mirrored ZFS does not constitute a backup in any way (add to that the non-enterprise media used).

The system has been running without issue for months. No recent power outages. Current uptime is 10 days.
I don't know what HW you're running on, but it's possible you have a power-supply / HW problem.



----------

I'm now going to finally tell you what I would do if I were in your situation:

1. Try rebooting the system. It may actually work. Can't quantify the chance.
2. Check the backups on a different system. (Using above-mentioned procedure).
3. Reinstall from scratch, using enterprise HW.

(Also if you reinstall; record/document every step you take for your own info - this ALWAYS proves the most invaluable info in the future).

This is my personal opinion - DO NOT RELY ON ME - maybe you'll find some other path or option(s)


----------
 
One thing I would point out about step 1 above: I wouldn't just plain reboot; rather shutdown, then power down/up.
 
Thank you sincerely for all of this invaluable info. I am currently waiting on another spare drive to run ddrescue, and also to ensure I have a working backup of the VM located on the ZFS pool before continuing further.
Then I will try to shutdown/reboot the host. If that fails and I also cannot recover the backup drive, I will have to consider the best way forward to rebuild/reinstall.

These are consumer grade NVMEs that have a DWPD of about 0.33 - Thats pretty bad for a ZFS mirror situation. But probably this isn't your main focus/concern right now.
I realize that, but they are both brand new drives with wearout showing 0% and 1% respectively.
If you have proper & working backups, this should not concern you, (except the time-consuming affair + downtime). I very much hope you realize that a mirrored ZFS does not constitute a backup in any way (add to that the non-enterprise media used).
Yes, it's exactly that: extremely time-consuming to set up from scratch, and the downtime would be an issue for that particular VM.

fsck -n /dev/sdb1 finally finished, with errors. It's an extremely long output, but it ends with:
Code:
Group 29791 block bitmap does not match checksum.
IGNORED.
Group 29793 block bitmap does not match checksum.
IGNORED.
Group 29794 block bitmap does not match checksum.
IGNORED.
Group 29795 block bitmap does not match checksum.
IGNORED.
Group 29796 block bitmap does not match checksum.
IGNORED.
Group 29797 block bitmap does not match checksum.
IGNORED.
Group 29798 block bitmap does not match checksum.
IGNORED.
Group 29799 block bitmap does not match checksum.
IGNORED.
Group 29800 block bitmap does not match checksum.
IGNORED.
Group 29801 block bitmap does not match checksum.
IGNORED.
Group 29802 block bitmap does not match checksum.
IGNORED.
Group 29803 block bitmap does not match checksum.
IGNORED.
Group 29804 block bitmap does not match checksum.
IGNORED.
Group 29805 block bitmap does not match checksum.
IGNORED.
Group 29806 block bitmap does not match checksum.
IGNORED.

/dev/sdb1: ********** WARNING: Filesystem still has errors **********
/dev/sdb1: 11/244187136 files (9.1% non-contiguous), 15616265/976745728 blocks
 
I am currently waiting on another spare drive to run ddrescue
Just remember: don't run ddrescue on a mounted drive.
As I said above, I still believe your sdb drive "should" be OK. I would probably try it without attempting ddrescue; you can always ddrescue it later. I would probably make a plain zipped dd image of it before trying further rescue attempts. This way you can always revert it. However, the choice is obviously yours.

fsck -n /dev/sdb1 finally finished, with errors.
As above, this output may not be accurate at all.

I'd also like to point out that in my personal (home) PVE I also regularly copy all the backups to another medium (usually external USB), which then gets removed from the system, so I have another backup. In a production/datacenter setup there is a similar mechanism. One other thing I do from time to time (usually before a major change/update) is make a complete dd image of my main boot NVME. This way I can always revert the system in an emergency. It does take some downtime, but it's well worth it.

Anyway enough ranting,
Best of luck.
 
Update: Seems the backup disk is toast. Or rather, I tried mounting it on another Linux machine and checked it in gparted, and there does not seem to be any partition. I believe the partition map on that drive was deleted or overwritten somehow.
 
EDIT: Looking at the image (GUI) the event happened around 10:30pm on 2024-04-02. Any recollection? Power/System/Backup event done around then?
I finally spoke with the other user of this system and have more information as to HOW this apparently happened.

They were shrinking a VM, which upon inspection of dmesg seems to have gone as expected; the VM booted successfully after the operation. However, at the end of that process they ran gdisk /dev/sda, mistakenly entering the command in the wrong terminal window, i.e. they ran gdisk /dev/sda on PVE instead of inside the VM. I believe this must be the root of the partitioning issues within the OS.

Ref: https://dallas.lu/pve-reduce-ubuntu-vm-disk-size/
 
Doesn't sound good.


To summarize it would appear your system suffers from various problems:

1. Host drive /dev/sda (containing Local-Lvm /root /swap etc ) got tampered with.

2. The fstab appears empty.

3. Backup drive /dev/sdb is also corrupted.

1. & 2. are possibly connected, but 3. would seem like an independent issue.


You probably want to save as much data as possible from within your "critical" running VM on the NVMEs. If it is reachable from within your network, there may be various ways of trying to do as complete a system backup as possible.
Also, maybe some of the ZFS gurus have innovative ways of getting a system copy of the VM off the mirrored ZFS? This has to be done on a mostly non-operational PVE, which is beyond my scope.

Point 3. above is as concerning as the other points, and in a way is the most system-critical and has to be researched independently as to how it came about. Configuration/Settings/Management errors will always happen at some point in a system. For this we have backups that we simply can't afford to lose.


I again wish you the best of luck.


Edit: something that occurred to me concerning point 2. above. Possibly in your ZFS-boot system you use some other form (scripting or autofs maybe) of mounting sda & sdb. This may explain the single line of proc /proc proc defaults 0 0 in your fstab. So maybe we can narrow down the current fault(s) of your PVE.
 
OK, post mortem:

TL;DR
I was able to successfully re-initialize /dev/sda as a physical volume using pvcreate and then restore the lvm metadata using vgcfgrestore.

Once I tracked down the source of the partition issues the system was experiencing, it was slightly easier to develop a solution. The original issue occurred after mistakenly running gdisk /dev/sda (pertaining to the LVM volume) inside PVE rather than inside a VM. Unfortunately, the user also ran gdisk /dev/sdb (pertaining to the backup disk), which more-or-less simultaneously nuked the backup drive, turning a small problem into a much bigger one. Obviously, the two consecutive mistakes severely complicated the situation!

However, the good news is that we could rule out device failure or OS issues and attribute this to human error. The data on the drive(s) was still intact; we would just have to recover the partition map and volume metadata.

The two pertinent commands are:

The pvcreate command initializes a disk or partition as a physical volume. Physical volumes are used to create a volume group, and LVM logical volumes are created on the volume group.

vgcfgrestore restores the metadata of a volume group from a text backup file produced by vgcfgbackup. (Automatic archives are located in /etc/lvm/archive/; the most recent backup is in /etc/lvm/backup/.)
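For orientation, an abridged sketch of what such a metadata backup file looks like (structure only; apart from the PV UUID used in the pvcreate step below, all values here are illustrative):

```
# Generated by LVM2 -- abridged, illustrative excerpt
contents = "Text Format Volume Group"
version = 1

local-lvm {
	id = "..."
	seqno = 12
	format = "lvm2"

	physical_volumes {
		pv0 {
			id = "fv9EoP-lKzG-XXpU-rFdj-EjqU-wwSm-keZI89"
			device = "/dev/sda1"
		}
	}

	logical_volumes {
		vm-101-disk-0 {
			segment_count = 1
			...
		}
	}
}
```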

----------

Steps to recover:

1)
In a tmux session, ran ddrescue as a preliminary precaution to clone /dev/sda to a new drive before further manipulation, in order to preserve the current device state in case additional problems arose during the restore process (unmount the drive before continuing):

ddrescue /dev/sda /dev/sdc /var/log/ddrescue.log

Result:
Code:
ddrescue /dev/sda /dev/sdc /var/log/ddrescue.log

GNU ddrescue 1.27
Press Ctrl-C to interrupt
Initial status (read from mapfile)
rescued: 2031 GB, tried: 0 B, bad-sector: 0 B, bad areas: 0

Current status
     ipos:    2048 GB, non-trimmed:        0 B,  current rate:    188 MB/s
     opos:    2048 GB, non-scraped:        0 B,  average rate:    518 MB/s
non-tried:        0 B,  bad-sector:        0 B,    error rate:       0 B/s
  rescued:    2048 GB,   bad areas:        0,        run time:         32s
pct rescued:  100.00%, read errors:        0,  remaining time:         n/a
                              time since last successful read:         n/a
Copying non-tried blocks... Pass 1 (forwards)

(second pane: tail of /var/log/ddrescue.log, showing the mapfile)
# Mapfile. Created by GNU ddrescue version 1.27
# Command line: ddrescue /dev/sda /dev/sdc /var/log/ddrescue.log --force
# Start time:   2024-04-03 23:19:30
# Current time: 2024-04-04 05:07:54
# Copying non-tried blocks... Pass 1 (forwards)
# current_pos  current_status  current_pass
0x1D8F2750000     ?               1
#      pos        size  status
0x00000000  0x1D8F2760000  +
0x1D8F2760000  0x3FC2F6000  ?
tail: /var/log/ddrescue.log: file truncated

This process took approximately 6 hours in total for the 2TB drive. After completion and verification, remove the clone from the system.
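One hedged way to do that verification is to compare checksums of source and clone (with both unmounted). The idea, demonstrated here on two scratch files standing in for /dev/sda and /dev/sdc:

```shell
# Demo of the verification idea on scratch files; on the real host you
# would run sha256sum against the unmounted block devices themselves
# (which takes hours for a 2TB drive).
src=$(mktemp); dst=$(mktemp)
dd if=/dev/urandom of="$src" bs=1024 count=64 2>/dev/null
cp "$src" "$dst"                          # stands in for the ddrescue pass
a=$(sha256sum "$src" | cut -d' ' -f1)
b=$(sha256sum "$dst" | cut -d' ' -f1)
[ "$a" = "$b" ] && echo "clone verified"  # prints only when hashes match
rm -f "$src" "$dst"
```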


2) List available restore files for volume group local-lvm:

vgcfgrestore -l local-lvm


3) Choose a restore file to work with (/etc/lvm/backup/local-lvm is the latest; you can select an earlier restore point from /etc/lvm/archive/ if applicable) and create the PV.

pvcreate /dev/sda1 --uuid fv9EoP-lKzG-XXpU-rFdj-EjqU-wwSm-keZI89 --restorefile /etc/lvm/backup/local-lvm


4) Wipe and replace the PMBR and GPT signatures (pvcreate prompts for this):

Code:
pvcreate /dev/sda1 --uuid fv9EoP-lKzG-XXpU-rFdj-EjqU-wwSm-keZI89 --restorefile /etc/lvm/backup/local-lvm
  WARNING: Couldn't find device with uuid fv9EoP-lKzG-XXpU-rFdj-EjqU-wwSm-keZI89.
WARNING: gpt signature detected on /dev/sda1 at offset 2048406846976. Wipe it? [y/n]: y
  Wiping gpt signature on /dev/sda1.
WARNING: PMBR signature detected on /dev/sda1 at offset 510. Wipe it? [y/n]: y
  Wiping PMBR signature on /dev/sda1.
  Physical volume "/dev/sda1" successfully created.


5) vgcfgrestore to restore the LVM metadata:
Code:
vgcfgrestore local-lvm --force
  WARNING: Forced restore of Volume Group local-lvm with thin volumes.
  Restored volume group local-lvm.


6) lvscan to check if the deleted logical volume was restored:
Code:
lvscan
  ACTIVE            '/dev/local-lvm/local-lvm' [1.83 TiB] inherit
  inactive          '/dev/local-lvm/vm-101-disk-0' [1000.00 GiB] inherit
  inactive          '/dev/local-lvm/vm-103-disk-0' [50.00 GiB] inherit
  inactive          '/dev/local-lvm/vm-102-disk-0' [8.00 GiB] inherit
  inactive          '/dev/local-lvm/vm-105-disk-0' [8.00 GiB] inherit
  inactive          '/dev/local-lvm/vm-106-disk-0' [8.00 GiB] inherit
  inactive          '/dev/local-lvm/vm-108-disk-0' [2.00 GiB] inherit


7) lvchange with the -ay option activates a logical volume in the volume group (repeat this for each LV in the group).

lvchange -ay local-lvm/vm-101-disk-0
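With many LVs this step can be scripted. A dry-run sketch (not a command from this thread) that turns lvscan output into one lvchange -ay invocation per inactive LV; the sample text mimics the lvscan output above, and on a real host you would pipe lvscan itself and execute the printed commands:

```shell
# build_activation_cmds: turn lvscan-style output into `lvchange -ay` commands.
build_activation_cmds() {
  awk -v q="'" '/inactive/ {
    gsub(q, "", $2)          # strip the quotes around the LV path
    sub("/dev/", "", $2)     # lvchange wants VG/LV, not /dev/VG/LV
    print "lvchange -ay " $2
  }'
}

# Sample input mimicking the lvscan output shown above (dry run only).
sample="  ACTIVE            '/dev/local-lvm/local-lvm' [1.83 TiB] inherit
  inactive          '/dev/local-lvm/vm-101-disk-0' [1000.00 GiB] inherit
  inactive          '/dev/local-lvm/vm-103-disk-0' [50.00 GiB] inherit"

printf '%s\n' "$sample" | build_activation_cmds
```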


8) Once all LVs are active (lvscan again to check), you can start the VMs/Containers.


9) Reformat and reinitialize backup disk, and (re)create backups of all VMs/Containers.


10) A final check: run fsck on the file systems inside the VMs.


----------

Takeaways, maybe obvious but they bear repeating:
–Always double-check which terminal window you are entering potentially destructive commands into.
–NEVER keep the backup disk attached to the system during FS manipulations.

Hope this may help someone in the future who stumbles into a similar situation.

Refs:
https://www.golinuxcloud.com/pvcreate-command-in-linux/
https://www.golinuxcloud.com/vgcfgrestore-recover-lvm-without-backup/
https://www.thegeekdiary.com/how-to-recover-deleted-logical-volume-lv-in-lvm-using-vgcfgrestore/
 
I'm really happy for you that you managed to fix your problem. I like your complete documentation of the solution; it may really help someone in the future.

You probably should edit the thread title (top lower right hand side "Edit") & add the tag [SOLVED].

Edit: Just out of interest, as above, why was your fstab in fact empty? Was it also edited by mistake?
 
Edit: Just out of interest, as above, why was your fstab in fact empty? Was it also edited by mistake?
I am not 100% certain, but it is still empty; however, both drives (sda, sdb) mount automatically on system startup. I think, as you mentioned above, because the main OS is installed on ZFS, they are mounted automatically somehow without fstab? It's a very vanilla installation, so no additional scripting is involved.

It looks like fstab has not been updated since June 2023, which AFAIK was the setup date for this server.
Code:
stat -c "%y" /etc/fstab
2023-06-24 22:41:50.083471448 -0400
 
