All LXCs and VMs fail to start

geoffh

I have ProxMox 6.3-3 running a test environment. There are 17 LXCs and 5 VMs. At most, I would run perhaps 3-4 simultaneously. The host machine has been running with no LXCs or VMs started for the past 5 days or so (i.e. it has just been idle whilst I was working on something else).
When I got back to it, I used the web interface to log in and was going to start a number of machines to test something.
I first noticed that there were a number of updates, so I installed them. Given there was nothing running and it had been a few months since rebooting the machine, I decided to do that.
Once rebooted, I tried to start an LXC, which failed. I then tried a VM and it failed too.
I then tried all LXCs and VMs sequentially and they all failed to start.
01 Image of machines.JPG

I shut down the machine and, after it came back up, tried again, but got the same results.
I then looked to see if the drives were available and they were.
02 pvesm status.JPG

When trying to start an LXC, the following error message is provided:
03 LXC error message.JPG

If you look at the status tab, it shows:
04 status.JPG


I am not sure where to go from here to find the solution.

As an aside, within the last 30 days, I have sequentially started each of the LXCs and VMs successfully and ran update/upgrade commands on each (or the equivalent).

The only other change that I made in this intervening period was to move the location of the ProxMox_NAS (pve-test) from one shared folder group to another on the NAS for load balancing. The IP address remained the same and the target mount point name remained the same.

05 NAS.JPG

For the new share to be recognised I needed to remove the original entry and re-add it with the new location on the NAS (even though all the names were the same (& IP Addr)). I thought a reboot would have reconnected the logical links. But the above has been the only change since updating/upgrading the LXCs and VMs and the actual updates to ProxMox via the console.

It just seems strange that all LXCs and VMs would stop at the same time. I would be grateful for any guidance about things to try, where to look etc.

Regards
Geoff
 
hi,

what error do you get when starting a VM? does it complain about anything specific?

your storage change might have been the cause but to be sure we need to check the configs:
* /etc/pve/storage.cfg
* pct config CTID (from any CT)
* qm config VMID (from any VM)
* a failing VM start task log
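
The first three items in the list above can be gathered in one go. This is only a minimal sketch: the function name `collect_diag` and the `DRY_RUN` switch are made up for this example, and the task log is not covered here (it still has to be copied from the task viewer).

```shell
# Sketch: print the configuration asked for above as one text report.
# collect_diag and DRY_RUN are hypothetical names used only in this example.
collect_diag() {
    ctid=$1
    vmid=$2
    for cmd in "cat /etc/pve/storage.cfg" "pct config $ctid" "qm config $vmid"; do
        echo "### $cmd"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            continue    # only list what would be run
        fi
        # Run the command; keep going even if one of them fails.
        $cmd 2>&1 || echo "(command failed with status $?)"
    done
}
```

On the host this could be run as `collect_diag 101 116 > /tmp/pve-diag.txt`, with the IDs replaced by any one container and any one VM, and the resulting file pasted as text rather than as a screenshot.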
 
When trying to start a VM, I get the following:

06 VM Start - Output.JPG

And the status is:
07 VM Start Status.JPG

The requested storage.cfg is:
08 config.JPG

The pct config for the same LXC in the original post is:
09 101 config.JPG

The qm config for the VM at the start of this post (116) is:
10 116 config.JPG

And the log file entry for the attempt to start the VM is:
11 log file.JPG

I hope this is helpful

Regards
Geoff
 
As requested:

root@pve-test:~# pvs
PV VG Fmt Attr PSize PFree
/dev/sda3 pve lvm2 a-- <446.63g <16.00g
root@pve-test:~# vgs
VG #PV #LV #SN Attr VSize VFree
pve 1 39 0 wz--n- <446.63g <16.00g
root@pve-test:~# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
data pve twi---tz-- <324.02g
root pve -wi-ao---- 96.00g
snap_vm-101-disk-0_Ubuntu2004R1 pve Vri---tz-k 20.00g data vm-101-disk-0
snap_vm-103-disk-0_Ubuntu2004ScratchR1 pve Vri---tz-k 20.00g data vm-103-disk-0
snap_vm-104-disk-0_UInfluxDBBase pve Vri---tz-k 20.00g data vm-104-disk-0
snap_vm-110-disk-0_UGrafanaBaseR0 pve Vri---tz-k 20.00g data vm-110-disk-0
snap_vm-112-disk-0_UGrafanaScratchR0 pve Vri---tz-k 20.00g data vm-112-disk-0
snap_vm-113-disk-0_KaliLinuxBaseR1 pve Vri---tz-k 32.00g data vm-113-disk-0
snap_vm-114-disk-0_KaliLinuxScratchR0 pve Vri---tz-k 32.00g data vm-114-disk-0
snap_vm-115-disk-0_Windows10ProBaseR1 pve Vri---tz-k 50.00g data vm-115-disk-0
snap_vm-117-disk-0_DockerPortainer20Base pve Vri---tz-k 50.00g data vm-117-disk-0
snap_vm-120-disk-0_UbuntuDockerYachtR0 pve Vri---tz-k 50.00g data vm-120-disk-0
snap_vm-123-disk-0_BasiczshInstalled pve Vri---tz-k 20.00g data vm-123-disk-0
snap_vm-123-disk-0_Ubuntu2004OnMyZshR0 pve Vri---tz-k 20.00g data vm-123-disk-0
snap_vm-124-disk-0_R1_HADataLogger pve Vri---tz-k 20.00g data vm-124-disk-0
swap pve -wi-ao---- 4.00g
vm-101-disk-0 pve Vwi---tz-- 20.00g data
vm-102-disk-0 pve Vwi---tz-- 10.00g data
vm-103-disk-0 pve Vwi---tz-- 20.00g data
vm-104-disk-0 pve Vwi---tz-- 20.00g data
vm-105-disk-0 pve Vwi---tz-- 50.00g data
vm-108-disk-0 pve Vwi---tz-- 10.00g data
vm-109-disk-0 pve Vwi---tz-- 20.00g data
vm-110-disk-0 pve Vwi---tz-- 20.00g data
vm-111-disk-0 pve Vwi---tz-- 50.00g data
vm-112-disk-0 pve Vwi---tz-- 20.00g data
vm-113-disk-0 pve Vwi---tz-- 32.00g data
vm-113-state-KaliLinuxBaseR1 pve Vwi---tz-- <6.49g data
vm-114-disk-0 pve Vwi---tz-- 32.00g data
vm-115-disk-0 pve Vwi---tz-- 50.00g data
vm-116-disk-0 pve Vwi---tz-- 50.00g data
vm-117-disk-0 pve Vwi---tz-- 50.00g data
vm-118-disk-0 pve Vwi---tz-- 50.00g data
vm-119-disk-0 pve Vwi---tz-- 50.00g data
vm-120-disk-0 pve Vwi---tz-- 50.00g data
vm-121-disk-0 pve Vwi---tz-- 50.00g data
vm-122-disk-0 pve Vwi---tz-- 25.00g data
vm-123-disk-0 pve Vwi---tz-- 20.00g data
vm-124-disk-0 pve Vwi---tz-- 20.00g data
root@pve-test:~#
 
root@pve-test:~# ls -al /dev/pve/
total 0
drwxr-xr-x 2 root root 80 Feb 5 12:52 .
drwxr-xr-x 20 root root 4520 Feb 10 17:02 ..
lrwxrwxrwx 1 root root 7 Feb 5 12:53 root -> ../dm-1
lrwxrwxrwx 1 root root 7 Feb 5 12:53 swap -> ../dm-0
root@pve-test:~#
 
As Requested:

root@pve-test:~# ls -al /dev/dm*
brw-rw---- 1 root disk 253, 0 Feb 5 12:53 /dev/dm-0
brw-rw---- 1 root disk 253, 1 Feb 5 12:53 /dev/dm-1
root@pve-test:~#
root@pve-test:~# dmsetup ls
pve-swap (253:0)
pve-root (253:1)
root@pve-test:~#
root@pve-test:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 447.1G 0 disk
├─sda1 8:1 0 1007K 0 part
├─sda2 8:2 0 512M 0 part /boot/efi
└─sda3 8:3 0 446.6G 0 part
├─pve-swap 253:0 0 4G 0 lvm [SWAP]
└─pve-root 253:1 0 96G 0 lvm /
sdb 8:16 0 447.1G 0 disk
└─sdb1 8:17 0 1K 0 part
root@pve-test:~# ^C
root@pve-test:~#
 
Just in case there was some confusion with the output of the lsblk command - it was complete above:

root@pve-test:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 447.1G 0 disk
├─sda1 8:1 0 1007K 0 part
├─sda2 8:2 0 512M 0 part /boot/efi
└─sda3 8:3 0 446.6G 0 part
├─pve-swap 253:0 0 4G 0 lvm [SWAP]
└─pve-root 253:1 0 96G 0 lvm /
sdb 8:16 0 447.1G 0 disk
└─sdb1 8:17 0 1K 0 part
root@pve-test:~#
root@pve-test:~#
 
My last response to you was at 0310 local time. I read the paper, didn't trust myself to comprehend it properly at that time of the morning :)
So have re-read it once awake properly.

So working through the paper:

root@pve-test:~# cd /dev/mapper
root@pve-test:/dev/mapper# vgscan --mknodes
Reading all physical volumes. This may take a while...
Found volume group "pve" using metadata type lvm2
root@pve-test:/dev/mapper#
root@pve-test:/dev/mapper# ls -la
total 0
drwxr-xr-x 2 root root 100 Feb 10 17:02 .
drwxr-xr-x 20 root root 4520 Feb 10 17:02 ..
crw------- 1 root root 10, 236 Feb 5 12:53 control
lrwxrwxrwx 1 root root 7 Feb 5 12:53 pve-root -> ../dm-1
lrwxrwxrwx 1 root root 7 Feb 5 12:53 pve-swap -> ../dm-0
root@pve-test:/dev/mapper#

I then tried to start a LXC and VM resulting in error messages in both cases.
I then moved to the next suggestion:

root@pve-test:~# /etc/init.d/lvm2 start
root@pve-test:~#

I then tried to start a LXC and VM resulting in error messages in both cases.

I then moved to the next suggestion (where 17 is in the left hand margin). It seemed that they were talking themselves out of their suggestion and then finishing off with "It hardly seems worth the effort though when a reboot will fix it too." So that is what I did - rebooted ProxMox.

I then tried to start a LXC and VM resulting in error messages in both cases.

I then moved to the next suggestion (just below 45 in the left hand margin):

root@pve-test:~# apt get install lvm2
E: Invalid operation get
root@pve-test:~# apt install lvm2
Reading package lists... Done
Building dependency tree
Reading state information... Done
lvm2 is already the newest version (2.03.02-pve4).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
root@pve-test:~#
root@pve-test:~# vgscan --mknodes -v
Reading all physical volumes. This may take a while...
Found volume group "pve" using metadata type lvm2
root@pve-test:~#
root@pve-test:~# lvscan -v
ACTIVE '/dev/pve/swap' [4.00 GiB] inherit
ACTIVE '/dev/pve/root' [96.00 GiB] inherit
inactive '/dev/pve/data' [<324.02 GiB] inherit
inactive '/dev/pve/vm-102-disk-0' [10.00 GiB] inherit
inactive '/dev/pve/vm-105-disk-0' [50.00 GiB] inherit
inactive '/dev/pve/vm-108-disk-0' [10.00 GiB] inherit
inactive '/dev/pve/vm-111-disk-0' [50.00 GiB] inherit
inactive '/dev/pve/vm-101-disk-0' [20.00 GiB] inherit
inactive '/dev/pve/vm-103-disk-0' [20.00 GiB] inherit
inactive '/dev/pve/vm-104-disk-0' [20.00 GiB] inherit
inactive '/dev/pve/snap_vm-104-disk-0_UInfluxDBBase' [20.00 GiB] inherit
inactive '/dev/pve/vm-109-disk-0' [20.00 GiB] inherit
inactive '/dev/pve/vm-110-disk-0' [20.00 GiB] inherit
inactive '/dev/pve/snap_vm-110-disk-0_UGrafanaBaseR0' [20.00 GiB] inherit
inactive '/dev/pve/vm-112-disk-0' [20.00 GiB] inherit
inactive '/dev/pve/snap_vm-112-disk-0_UGrafanaScratchR0' [20.00 GiB] inherit
inactive '/dev/pve/vm-113-disk-0' [32.00 GiB] inherit
inactive '/dev/pve/vm-114-disk-0' [32.00 GiB] inherit
inactive '/dev/pve/vm-115-disk-0' [50.00 GiB] inherit
inactive '/dev/pve/snap_vm-115-disk-0_Windows10ProBaseR1' [50.00 GiB] inherit
inactive '/dev/pve/vm-116-disk-0' [50.00 GiB] inherit
inactive '/dev/pve/vm-117-disk-0' [50.00 GiB] inherit
inactive '/dev/pve/snap_vm-117-disk-0_DockerPortainer20Base' [50.00 GiB] inherit
inactive '/dev/pve/vm-118-disk-0' [50.00 GiB] inherit
inactive '/dev/pve/vm-119-disk-0' [50.00 GiB] inherit
inactive '/dev/pve/vm-113-state-KaliLinuxBaseR1' [<6.49 GiB] inherit
inactive '/dev/pve/snap_vm-113-disk-0_KaliLinuxBaseR1' [32.00 GiB] inherit
inactive '/dev/pve/snap_vm-114-disk-0_KaliLinuxScratchR0' [32.00 GiB] inherit
inactive '/dev/pve/snap_vm-103-disk-0_Ubuntu2004ScratchR1' [20.00 GiB] inherit
inactive '/dev/pve/vm-120-disk-0' [50.00 GiB] inherit
inactive '/dev/pve/snap_vm-120-disk-0_UbuntuDockerYachtR0' [50.00 GiB] inherit
inactive '/dev/pve/vm-121-disk-0' [50.00 GiB] inherit
inactive '/dev/pve/vm-122-disk-0' [25.00 GiB] inherit
inactive '/dev/pve/snap_vm-101-disk-0_Ubuntu2004R1' [20.00 GiB] inherit
inactive '/dev/pve/snap_vm-123-disk-0_BasiczshInstalled' [20.00 GiB] inherit
inactive '/dev/pve/snap_vm-123-disk-0_Ubuntu2004OnMyZshR0' [20.00 GiB] inherit
inactive '/dev/pve/vm-124-disk-0' [20.00 GiB] inherit
inactive '/dev/pve/snap_vm-124-disk-0_R1_HADataLogger' [20.00 GiB] inherit
root@pve-test:~#
root@pve-test:~#
root@pve-test:~# vgchange -a y pve
Check of pool pve/data failed (status:1). Manual repair required!
2 logical volume(s) in volume group "pve" now active
root@pve-test:~#
root@pve-test:~#

All the "inactive" entries and the "Check of pool pve/data failed (status:1). Manual repair required!" message seem to be the problem.
I searched on "check of pool failed (status:1). Manual repair required" and got conflicting information. I got nervous about proceeding without some guidance (from those with much more experience) as to the steps from here, in case I screw it up.
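
For a long `lvscan` listing like the one above, a short summary makes the pattern easier to see. The helper below is a rough sketch (the name `lv_state_summary` is made up) that assumes the line format shown in the output above: state, quoted device path, size, policy.

```shell
# Summarise lvscan output read from stdin: count ACTIVE vs inactive
# volumes and list the inactive ones. Hypothetical helper for illustration.
lv_state_summary() {
    awk -v q="'" '
        $1 == "ACTIVE"   { active++ }
        $1 == "inactive" {
            inactive++
            gsub(q, "", $2)          # strip the quotes around the path
            names = names " " $2
        }
        END {
            printf "active=%d inactive=%d\n", active, inactive
            if (inactive) print "inactive:" names
        }'
}
```

Usage would be `lvscan | lv_state_summary`. If `data` (the thin pool itself) is in the inactive list, every thin volume inside it will be inactive as well, which matches what the listing shows.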

regards
Geoff
 
Sounds like you have some sort of LVM corruption. You are going to have to decide how valuable your data is vs your time. And if you have a backup of important data - I would start restores going.

The proper way to go about recovering is to start with "what happened". That means going with a fine-tooth comb over every single log line in the system to find when the issue first occurred, what happened just before this event, etc. There must be something in the logs, even from when you try to activate the volume and it fails.
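
As a starting point for that log review, something like the filter below can cut the noise down. It is a sketch only: the function name and the keyword list are assumptions, and it is no substitute for reading the surrounding lines.

```shell
# Keep only log lines on stdin that mention LVM, device-mapper or
# thin-pool activity. The keyword list is a guess for this sketch.
lvm_log_lines() {
    grep -i -E 'lvm|thin|device-mapper|dm-[0-9]+|pvscan|vgchange'
}
```

For example, `journalctl -b --no-pager | lvm_log_lines` covers the current boot and `dmesg | lvm_log_lines` the kernel ring buffer. Earlier boots (`journalctl -b -1`) are only available if the journal is stored persistently on that machine.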

After that you would need to research the various sources on the internet on possible diagnostics/recovery methods. There are many reasons why you could have data corruption - bad disk, disk without power loss prevention and loss of power, concurrent/unsynchronized LVM manipulation, bad configuration, even bad memory. Or you could have simply run out of metadata space.
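
On a healthy system, the metadata-space case can be watched for before it bites. The sketch below assumes the three-column output of `lvs --noheadings -o lv_name,data_percent,metadata_percent`; the function name and the 80% default threshold are arbitrary choices for this example.

```shell
# Read "name data% meta%" rows from stdin and flag thin pools at or above
# a usage threshold (first argument, default 80). Hypothetical helper.
check_thin_usage() {
    awk -v limit="${1:-80}" '
        NF == 0 { next }
        NF < 3  { print $1 ": usage not reported (pool inactive?)"; next }
        ($2 + 0 >= limit) || ($3 + 0 >= limit) {
            print $1 ": WARNING data=" $2 "% meta=" $3 "%"
            next
        }
        { print $1 ": ok data=" $2 "% meta=" $3 "%" }'
}
```

A possible invocation is `lvs --noheadings -o lv_name,data_percent,metadata_percent pve/data | check_thin_usage 80`. With the pool inactive, as in this thread, the percentage columns are empty, so this only helps as a preventive check on a working pool.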

LVM is part of the core OS; Proxmox is just a client app for LVM. As you've seen, many people have run into LVM issues with or without Proxmox.

https://www.google.com/search?q=lvm++Manual+repair+required!+site:forum.proxmox.com

Good luck
 
Thank you for your help.

As I indicated in the first post this was just a test machine.

Whilst I have some 40+ years experience in the IT industry none of it was in the Unix/Linux world (apart from 1 semester at Uni in the 70's and that was very, very high level). The remainder of that experience has been in the mainframe world (early on) and the Windows world later on.
I wanted to learn about Linux (and its variants) and now I've the time to do so. Whilst I also have lots of experience in the VM world, I had none in ProxMox, but again it seemed interesting.

I set up the test machine specifically to learn more about Linux and ProxMox before putting anything into production. As you can see from the list of machines, I've been exploring the extent of its capability. However, being a test machine, I was only backing up some of the LXCs and VMs. What I was doing was taking snapshots on all the machines - mainly because there was no data that could be lost (apart from the LXCs and VMs themselves) - the only exception was data in the 3CXPBX - but that's easy to recreate.

So whilst I'd like to go back and find out what went wrong, I think it's an exercise in diminishing returns - it would be quicker to blow the ProxMox instance away and rebuild it - I have plenty of notes that I took when building the test environment. Only this time I will take regular backups of the entire environment.

I may see if any of the backups can be restored - but to get to a clean environment, I think the "nuclear" option will be best :-0)
 
I tried to restore CT101 (LXC) from what I thought would have been a good known point and it returned with:

recovering backed-up configuration from 'ProxMox_NAS:backup/vzdump-lxc-101-2021_01_06-23_00_04.tar.zst'
Formatting '/mnt/pve/ProxMox_NAS/images/101/vm-101-disk-0.raw', fmt=raw size=21474836480
Creating filesystem with 5242880 4k blocks and 1310720 inodes
Filesystem UUID: 0122b19a-b290-4dad-a995-e1ac8112322f
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
Check of pool pve/data failed (status:1). Manual repair required!
TASK ERROR: unable to restore CT 101 - lvremove snapshot 'pve/snap_vm-101-disk-0_Ubuntu2004R1' error: Failed to update pool pve/data.


I then tried to restore 116 (VM) from the earliest listed backup file and the following error message was displayed:

restore vma archive: zstd -q -d -c /mnt/pve/ProxMox_NAS/dump/vzdump-qemu-116-2020_12_23-23_59_54.vma.zst | vma extract -v -r /var/tmp/vzdumptmp22681.fifo - /var/tmp/vzdumptmp22681
CFG: size: 659 name: qemu-server.conf
DEV: dev_id=1 size: 53687091200 devname: drive-sata0
CTIME: Wed Dec 23 23:59:58 2020
Check of pool pve/data failed (status:1). Manual repair required!
no lock found trying to remove 'create' lock
TASK ERROR: command 'set -o pipefail && zstd -q -d -c /mnt/pve/ProxMox_NAS/dump/vzdump-qemu-116-2020_12_23-23_59_54.vma.zst | vma extract -v -r /var/tmp/vzdumptmp22681.fifo - /var/tmp/vzdumptmp22681' failed: lvcreate 'pve/vm-116-disk-1' error: Aborting. Failed to locally activate thin pool pve/data.

"Nuc" time - start again

Again, I thank you for your assistance in trying to resolve this problem.
 
Just from reading through these posts, I think it somehow got messed up when you changed around the NAS stuff. Hope you were able to reload and take another stab at this. Proxmox is solid and does just about anything you could ever need!
 
I disagree with @Zombie, it's unlikely that the NAS had anything to do with LVM metadata corruption.

@geoffh, I should have been clearer - before you restore you have to wipe and redo your LVM groups: most likely completely remove all the ghost slices, repair the metadata, and fsck your file systems.
Until you are 100% sure you have a clean disk setup I would not restore/write/save anything on that disk.
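
To make the two routes concrete, here is a sketch that only prints candidate commands and runs nothing. The function name, the default pool name `pve/data` (taken from this thread) and the `300G` size are assumptions for illustration. The sequence has not been tested against this machine, and the "recreate" route destroys the pool and every thin volume in it.

```shell
# Print (do not run) a candidate command sequence for a broken thin pool.
#   repair   : attempt a metadata repair, then reactivate the volume group
#   recreate : remove the pool with all its thin volumes, create an empty one
# thin_pool_plan and its SIZE argument are hypothetical.
thin_pool_plan() {
    mode=$1
    pool=${2:-pve/data}
    size=${3:-300G}
    vg=${pool%%/*}      # part before the slash, e.g. pve
    lv=${pool##*/}      # part after the slash, e.g. data
    case $mode in
        repair)
            echo "lvconvert --repair $pool"
            echo "vgchange -ay $vg"
            ;;
        recreate)
            echo "lvremove $pool"
            echo "lvcreate --type thin-pool -L $size -n $lv $vg"
            ;;
        *)
            echo "usage: thin_pool_plan repair|recreate [VG/POOL] [SIZE]" >&2
            return 1
            ;;
    esac
}
```

Running `thin_pool_plan repair` or `thin_pool_plan recreate pve/data 300G` just lists the steps for review; whether `lvremove` succeeds on a pool in this state, and what size the new pool should be, would need checking on the actual host before anything is executed.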
 
It may not be, @bbgeek17, but it's just strange that the issue popped up right after the NAS change. It could instead be related to the issue you linked above. I don't know unless we can look at some logs.