[Critical] VM Dead after creating snapshot. :'(

dhairyasharma

New Member
Feb 6, 2015
Hi there everyone,

Now here is what happened. This node had been running perfectly since 2013; I created 8 VMs back then and they were all up, and still are. Then I created a snapshot of VM 105 and removed it again, and after that the VM was dead.


When I consoled into the VM using VNC, the VM was stuck at boot with the error "Boot failed, this hard drive is not bootable".

Here is the task log from the web UI:

Creating snapshot:
Formatting '/var/lib/vz/images/105/vm-105-state-feb.raw', fmt=raw size=7017070592
TASK OK

Removing Snapshot:
TASK OK


VM config:
root@instant:~# cat /etc/pve/qemu-server/105.conf
balloon: 2048
bootdisk: ide0
cores: 2
cpu: host
cpuunits: 100
ide0: local:105/vm-105-disk-1.qcow2,format=qcow2,cache=writeback,size=76G
ide2: none,media=cdrom
memory: 3096
name: server
net0: e1000=E6:FB:7E:BD:43:06,bridge=vmbr0
onboot: 1
ostype: l26
sockets: 1


Output of pveversion -v:
root@instant:~# pveversion -v
proxmox-ve-2.6.32: 3.2-136 (running kernel: 3.10.0-1-pve)
pve-manager: 3.3-5 (running version: 3.3-5/bfebec03)
pve-kernel-2.6.32-32-pve: 2.6.32-136
pve-kernel-2.6.32-27-pve: 2.6.32-121
pve-kernel-3.10.0-1-pve: 3.10.0-5
pve-kernel-2.6.32-28-pve: 2.6.32-124
pve-kernel-2.6.32-31-pve: 2.6.32-132
pve-kernel-2.6.32-26-pve: 2.6.32-114
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-1
pve-cluster: 3.0-15
qemu-server: 3.3-3
pve-firmware: 1.1-3
libpve-common-perl: 3.0-19
libpve-access-control: 3.0-15
libpve-storage-perl: 3.0-25
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-10
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

Kernel:
root@instant:~# uname -a
Linux instant. 3.10.0-1-pve #1 SMP Tue Dec 17 13:12:13 CET 2013 x86_64 GNU/Linux


file:
root@instant:/var/lib/vz/images/105# file vm-105-disk-1.qcow2
vm-105-disk-1.qcow2: QEMU QCOW Image (unknown version)

Fdisk:
root@instant:/var/lib/vz/images/105# fdisk -l vm-105-disk-1.qcow2
Disk vm-105-disk-1.qcow2: 66.8 GB, 66801762304 bytes
255 heads, 63 sectors/track, 8121 cylinders, total 130472192 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk vm-105-disk-1.qcow2 doesn't contain a valid partition table
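
(Side note for anyone reading later: running fdisk directly on the qcow2 file inspects the qcow2 container, not the guest disk inside it, so "doesn't contain a valid partition table" is expected here. To see the real partitions you have to map the image first, e.g. with qemu-nbd as I try further down.)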


ls:
root@instant:/var/lib/vz/images/105# ls -l vm-105-disk-1.qcow2
-rw-r--r-- 1 root root 66801762304 Feb 6 06:31 vm-105-disk-1.qcow2


When I ran qemu-img check I got thousands of errors (leak errors etc.):
qemu-img check vm-105-disk-1.qcow2

So I used the -r all flag to repair the disk, which fixed all of the errors:
qemu-img check -r all vm-105-disk-1.qcow2
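
(In hindsight, I should have made a copy of the image before letting -r all rewrite anything, e.g. something like:
cp -a vm-105-disk-1.qcow2 vm-105-disk-1.qcow2.bak
so there would still be an untouched original to fall back on.)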


Then I ran qemu-img check again:

root@instant:/var/lib/vz/images/105# qemu-img check vm-105-disk-1.qcow2
No errors were found on the image.
303104/1245184 = 24.34% allocated, 0.00% fragmented, 0.00% compressed clusters
Image end offset: 66800386048

Still no luck.

After that I decided to use the qemu-nbd command to mount the disk and run fsck.
Here is what happened next:

root@instant:/var/lib/vz/images/105# modprobe nbd
FATAL: Module nbd not found.
root@instant:~# lsmod | grep nbd
root@instant:~#

Though I am able to view the LVM partitions via testdisk.

I don't know if I'm on the right track or not, and I don't know what to do next either. Can anybody please help me?
 
OK, I switched the kernel; now I can use qemu-nbd to mount the image.

Trying my luck again. Let's see if it works.
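
For anyone following along, this is roughly the sequence I'm using (the nbd device and partition names are just examples):

modprobe nbd max_part=8
qemu-nbd --connect=/dev/nbd0 vm-105-disk-1.qcow2
fdisk -l /dev/nbd0 (the guest partitions appear as /dev/nbd0p1, /dev/nbd0p2, ...)
fsck -n /dev/nbd0p1 (read-only check first)
qemu-nbd --disconnect /dev/nbd0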
 
Thanks to testdisk, I have added the partitions back. What I did next was attach this disk image to another VM running CentOS, and luckily that system is able to find the LVMs. But I can't mount them. Can anyone help me with this?
 
Hi dhairyasharma,
I can help you with my own HOWTO, which I wrote back when I had a similar problem, though I'm not sure it's exactly like yours. I'll paste it here:

When a snapshot in a qcow2 image is broken and cannot be deleted, the only way is to copy the data from that image to another.
Here is how.
We assume this is Ubuntu and that the root partition's image is the broken one (the most complicated case); other cases are similar.
We also assume the system still starts OK and the data is OK; the problem is only with the snapshots.
We won't use a live CD, although there is a way with a live CD too, maybe even simpler.
The approach is this: copy the broken image, create a healthy image of the same size, start the machine,
and from within it copy the contents of the copied image to the new healthy image with dd. This is needed
because you cannot copy the root filesystem while it is in use. If the problematic image is not the root's one, you
can do the same without copying the broken image first, OR you can use a live CD, your choice.

1. Shut down the domain.
virsh shutdown ubuntu-server (or log in and shut down from inside)
virsh list (to be sure it is down)

2. Copy the image to another one
cd /home/images
cp ubuntu-server.img ubuntu-server1.img

3. Create another image, to which we are going to copy the content
qemu-img create -f qcow2 -o preallocation=metadata ubuntu-server2.img 10737418240
(the last argument is the size; take it in bytes from qemu-img info ubuntu-server.img)
ls -lash ubuntu-server2.img (to confirm it is not yet allocated: the left number is the space actually used on disk,
the right is the virtual size plus metadata blocks and so on, and is normally larger than the size you asked for)
ls -l ubuntu-server2.img (to get the image file size in bytes; do NOT take it from qemu-img info,
because that reports the internal virtual size of the image, not the file itself)
fallocate -l theSizeInBytesFromAbove ubuntu-server2.img (to allocate the image for faster writing)
ls -lash ubuntu-server2.img (to confirm it is allocated: the left size should now equal the right)
qemu-img info ubuntu-server2.img (to double-confirm with the qemu-img command)

4. Open Virtmanager from your graphical interface and connect to the VM host

5. Stop and Start the storage pool to autodetect the new images

6. Now add these images as additional VirtIO storage devices in the domain.
Here you have to give the size of the image in MB (Ubuntu 10.04); you can calculate it from the qemu-img info output.

7. Start the domain.

8. FIRST get info on which /dev/vd... devices are used by the system now, because they may have changed after the addition of the images.

9. Get info about the /dev/vd.. names of the images you just added
parted -l

10. The new empty image (ubuntu-server2.img) should show up as something like "Error: /dev/vdd: unrecognised disk label",
and the image you copied (ubuntu-server1.img) should be seen too.

11. Now copy the contents of ubuntu-server1.img to ubuntu-server2.img
dd if=/dev/vd. (the name of ubuntu-server1.img) of=/dev/vd. (the name of ubuntu-server2.img) (use the whole disk, e.g. /dev/vda, not a partition like /dev/vda1)
Note: If you don't want to use dd but copy the data plainly, you must create identical partitions on ubuntu-server2.img
and create filesystems there. Then mount the partitions from ubuntu-server1.img and ubuntu-server2.img, and copy the
content with "cp -af ...". Then, if this is the root partition, you must take a look at the "grubFixHOWTO.txt" file to
restore GRUB and replace UUIDs accordingly.
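
For example, if the copied image shows up in the guest as /dev/vdc and the new empty one as /dev/vdd (device names here are just examples, check yours with parted -l first), the copy would be:

dd if=/dev/vdc of=/dev/vdd bs=1M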

12. Confirm with parted -l again that the new device (vdd in our example) is now identical to the source

13. Poweroff the domain
poweroff

14. Now rename the old images out of the way
mv ubuntu-server.img old.ubuntu-server.img
mv ubuntu-server1.img old.ubuntu-server1.img

15. Rename the new image to the name of the old original broken image, so it can be used by the domain, like the original one
mv ubuntu-server2.img ubuntu-server.img

16. From Virtmanager, remove the temporary images you added before (ubuntu-server1.img and ubuntu-server2.img), otherwise
the domain will not start and will give an error like:
error: Failed to start domain ubuntu-server-pdc
error: monitor socket did not show up.: Connection refused

17. Start the domain and see if it boots successfully and everything is OK
virsh start ubuntu-server


That's it!
Now you have an image without snapshots, but it is clean and healthy.
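
(You can double-check that the new image really has no internal snapshots left with: qemu-img snapshot -l ubuntu-server.img; an empty list means you are clean.)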
 
I see you've managed to repair something. It would be helpful to share what you did (for others with the same problem), and maybe describe your problem more precisely.
So, what are the errors when mounting?
 
Many thanks for replying, toshko4 :)


[root@static /]# fdisk -l


Disk /dev/sda: 107.4 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00074945


Device Boot Start End Blocks Id System
/dev/sda1 * 1 64 512000 83 Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2 64 13055 104344576 8e Linux LVM


Disk /dev/sdb: 81.6 GB, 81604378624 bytes
1 heads, 1 sectors/track, 159383552 cylinders, total 159383552 sectors
Units = cylinders of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000


Device Boot Start End Blocks Id System
/dev/sdb1 * 1028095 105885694 52428800 83 Linux
/dev/sdb2 125829121 157276350 15723615 8e Linux LVM


Disk /dev/mapper/VolGroup-lv_root: 53.7 GB, 53687091200 bytes
255 heads, 63 sectors/track, 6527 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000




Disk /dev/mapper/VolGroup-lv_swap: 4227 MB, 4227858432 bytes
255 heads, 63 sectors/track, 514 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000




Disk /dev/mapper/VolGroup-lv_home: 48.9 GB, 48930750464 bytes
255 heads, 63 sectors/track, 5948 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000


sdb2 is the faulty disk.


I can activate the VG with the --partial flag:

[root@static /]# vgchange -a y vg_server --partial
PARTIAL MODE. Incomplete logical volumes will be processed.
Couldn't find device with uuid RRj0hg-XbT0-uc0t-7mlG-7Jp3-M6NJ-HkHe3b.
3 logical volume(s) in volume group "vg_server" now active


[root@static /]# vgscan
Reading all physical volumes. This may take a while...
/dev/vg_server/lv_root: read failed after 0 of 4096 at 53687025664: Input/output error
/dev/vg_server/lv_root: read failed after 0 of 4096 at 53687083008: Input/output error
/dev/vg_server/lv_root: read failed after 0 of 4096 at 0: Input/output error
/dev/vg_server/lv_root: read failed after 0 of 4096 at 4096: Input/output error
/dev/vg_server/lv_home: read failed after 0 of 4096 at 0: Input/output error
/dev/vg_server/lv_home: read failed after 0 of 4096 at 4096: Input/output error
/dev/vg_server/lv_swap: read failed after 0 of 4096 at 3187605504: Input/output error
/dev/vg_server/lv_swap: read failed after 0 of 4096 at 3187662848: Input/output error
/dev/vg_server/lv_swap: read failed after 0 of 4096 at 0: Input/output error
/dev/vg_server/lv_swap: read failed after 0 of 4096 at 4096: Input/output error
Couldn't find device with uuid RRj0hg-XbT0-uc0t-7mlG-7Jp3-M6NJ-HkHe3b.
Found volume group "vg_server" using metadata type lvm2
Found volume group "VolGroup" using metadata type lvm2


[root@static /]# lvscan
/dev/vg_server/lv_root: read failed after 0 of 4096 at 53687025664: Input/output error
/dev/vg_server/lv_root: read failed after 0 of 4096 at 53687083008: Input/output error
/dev/vg_server/lv_root: read failed after 0 of 4096 at 0: Input/output error
/dev/vg_server/lv_root: read failed after 0 of 4096 at 4096: Input/output error
/dev/vg_server/lv_home: read failed after 0 of 4096 at 0: Input/output error
/dev/vg_server/lv_home: read failed after 0 of 4096 at 4096: Input/output error
/dev/vg_server/lv_swap: read failed after 0 of 4096 at 3187605504: Input/output error
/dev/vg_server/lv_swap: read failed after 0 of 4096 at 3187662848: Input/output error
/dev/vg_server/lv_swap: read failed after 0 of 4096 at 0: Input/output error
/dev/vg_server/lv_swap: read failed after 0 of 4096 at 4096: Input/output error
Couldn't find device with uuid RRj0hg-XbT0-uc0t-7mlG-7Jp3-M6NJ-HkHe3b.
ACTIVE '/dev/vg_server/lv_root' [50.00 GiB] inherit
ACTIVE '/dev/vg_server/lv_home' [18.54 GiB] inherit
ACTIVE '/dev/vg_server/lv_swap' [2.97 GiB] inherit
ACTIVE '/dev/VolGroup/lv_root' [50.00 GiB] inherit
ACTIVE '/dev/VolGroup/lv_home' [45.57 GiB] inherit
ACTIVE '/dev/VolGroup/lv_swap' [3.94 GiB] inherit


[root@static /]# lvdisplay
/dev/vg_server/lv_root: read failed after 0 of 4096 at 53687025664: Input/output error
/dev/vg_server/lv_root: read failed after 0 of 4096 at 53687083008: Input/output error
/dev/vg_server/lv_root: read failed after 0 of 4096 at 0: Input/output error
/dev/vg_server/lv_root: read failed after 0 of 4096 at 4096: Input/output error
/dev/vg_server/lv_home: read failed after 0 of 4096 at 0: Input/output error
/dev/vg_server/lv_home: read failed after 0 of 4096 at 4096: Input/output error
/dev/vg_server/lv_swap: read failed after 0 of 4096 at 3187605504: Input/output error
/dev/vg_server/lv_swap: read failed after 0 of 4096 at 3187662848: Input/output error
/dev/vg_server/lv_swap: read failed after 0 of 4096 at 0: Input/output error
/dev/vg_server/lv_swap: read failed after 0 of 4096 at 4096: Input/output error
Couldn't find device with uuid RRj0hg-XbT0-uc0t-7mlG-7Jp3-M6NJ-HkHe3b.
--- Logical volume ---
LV Path /dev/vg_server/lv_root
LV Name lv_root
VG Name vg_server
LV UUID 6ilnt9-NDlW-l0DT-hX9p-FtDy-C7sy-4AwTmn
LV Write Access read/write
LV Creation host, time localhost.localdomain, 2014-03-31 07:43:25 -0400
LV Status available
# open 0
LV Size 50.00 GiB
Current LE 12800
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:4


--- Logical volume ---
LV Path /dev/vg_server/lv_home
LV Name lv_home
VG Name vg_server
LV UUID 3Ojh5O-NB5W-apjm-qZw9-qmjF-pXjj-KR1wUJ
LV Write Access read/write
LV Creation host, time localhost.localdomain, 2014-03-31 07:43:35 -0400
LV Status available
# open 0
LV Size 18.54 GiB
Current LE 4746
Segments 2
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:6


--- Logical volume ---
LV Path /dev/vg_server/lv_swap
LV Name lv_swap
VG Name vg_server
LV UUID GHWlsX-oZQb-2jYg-h1PF-m1vL-U3QO-gFU7vR
LV Write Access read/write
LV Creation host, time localhost.localdomain, 2014-03-31 07:43:39 -0400
LV Status available
# open 0
LV Size 2.97 GiB
Current LE 760
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:8


--- Logical volume ---
LV Path /dev/VolGroup/lv_root
LV Name lv_root
VG Name VolGroup
LV UUID G7t2rA-0Kaf-NSVK-yQ96-z5FW-lDeQ-NShoUn
LV Write Access read/write
LV Creation host, time localhost.localdomain, 2015-02-08 04:05:28 -0500
LV Status available
# open 1
LV Size 50.00 GiB
Current LE 12800
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:0


--- Logical volume ---
LV Path /dev/VolGroup/lv_home
LV Name lv_home
VG Name VolGroup
LV UUID HjJglm-FrLe-QH5e-PjRa-2OG9-cft2-MSNkgi
LV Write Access read/write
LV Creation host, time localhost.localdomain, 2015-02-08 04:05:38 -0500
LV Status available
# open 1
LV Size 45.57 GiB
Current LE 11666
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:2


--- Logical volume ---
LV Path /dev/VolGroup/lv_swap
LV Name lv_swap
VG Name VolGroup
LV UUID UQcKoG-cKtd-zvle-3riA-92Aa-0BhC-8iakqK
LV Write Access read/write
LV Creation host, time localhost.localdomain, 2015-02-08 04:05:48 -0500
LV Status available
# open 1
LV Size 3.94 GiB
Current LE 1008
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:1

But when I try to mount it, I get this error:
[root@static /]# mount /dev/vg_server/lv_home /home/a
mount: you must specify the filesystem type

I even tried to specify the filesystem type:
[root@static /]# mount -t ext4 /dev/vg_server/lv_home /home/a
mount: wrong fs type, bad option, bad superblock on /dev/mapper/vg_server-lv_home,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so


[root@static /]# fsck /dev/vg_server/lv_home
fsck from util-linux-ng 2.17.2
e2fsck 1.41.12 (17-May-2010)
fsck.ext2: Attempt to read block from filesystem resulted in short read while trying to open /dev/mapper/vg_server-lv_home
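
Next I'm thinking of trying to recreate the missing PV from an LVM metadata backup, if I can find one (these normally live under /etc/lvm/archive on the system that created the VG; the archive file name below is hypothetical, and /dev/sdb2 is assumed to be where the lost PV lived):

pvcreate --uuid "RRj0hg-XbT0-uc0t-7mlG-7Jp3-M6NJ-HkHe3b" --restorefile /etc/lvm/archive/vg_server_00001.vg /dev/sdb2
vgcfgrestore -f /etc/lvm/archive/vg_server_00001.vg vg_server
vgchange -a y vg_server
fsck -n /dev/vg_server/lv_home (read-only check first)

Does that sound like the right track?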
 
Hi toshko4,

Thanks for your reply. My issue is a bit different from yours. The main issue I'm facing is that the original disk image I have is corrupted and not booting up. Whenever I try to boot the VM with the original image, it just shows "Boot Failed, Not a bootable device".
 
Hmm, sorry, but I'm not familiar with LVM, so I can't help you there. In any case, you should first copy the image file and experiment on the copy, NOT on the original. That way, if anything goes wrong or a repair tool breaks it, you still have a "healthy" image.
 
After running qemu-img check on the image, I found some errors:

160 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.
409496/1245184 = 32.89% allocated, 0.18% fragmented, 0.00% compressed clusters
Image end offset: 66800386048

Tried repairing it with -r leaks first, then with -r all:

The following inconsistencies were found and repaired:


0 leaked clusters
86 corruptions


Double checking the fixed image now...
No errors were found on the image.
409496/1245184 = 32.89% allocated, 0.18% fragmented, 0.00% compressed clusters
Image end offset: 66800386048

Now what?
 
Trying to re-create the image with qemu-img convert.

(The original image belongs to VM 105, but I have made a copy and am trying it on a new VM, ID 108.)

I converted the disk to raw and then back to qcow2:
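
Roughly these two commands (going by the file names below):

qemu-img convert -f qcow2 -O raw vm-105-disk-1.qcow2 vm-108-disk-2.raw
qemu-img convert -f raw -O qcow2 vm-108-disk-2.raw vm-108-disk-4.qcow2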

root@instant:/var/lib/vz/images/108# qemu-img info vm-108-disk-2.raw
image: vm-108-disk-2.raw
file format: raw
virtual size: 76G (81604378624 bytes)
disk size: 3.7G



root@instant:/var/lib/vz/images/108# qemu-img info vm-108-disk-4.qcow2
image: vm-108-disk-4.qcow2
file format: qcow2
virtual size: 76G (81604378624 bytes)
disk size: 3.7G
cluster_size: 65536
Format specific information:
compat: 1.1
lazy refcounts: false


Now going to boot it up. Let's see if it works.
 
Hi, and thanks for your reply, Udo.

Yes, I have tried converting it to raw but it didn't help. I haven't tried extracting the LVM out yet, though; I'm going to give that a try now.
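
The plan, roughly (the nbd device and output path are just examples; --partial because one PV is still missing):

modprobe nbd max_part=8
qemu-nbd --connect=/dev/nbd0 vm-105-disk-1.qcow2
vgscan
vgchange -a y vg_server --partial
dd if=/dev/vg_server/lv_home of=/root/lv_home.raw bs=1M

and then run fsck / testdisk against the extracted lv_home.raw copy.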
 
