[Critical] VM Dead after creating snapshot. :'(

Discussion in 'Proxmox VE: Installation and configuration' started by dhairyasharma, Feb 6, 2015.

  1. dhairyasharma

    dhairyasharma New Member

    Joined:
    Feb 6, 2015
    Messages:
    10
    Likes Received:
    0
    Hi there everyone,

    Here is what happened. This node had been running perfectly since 2013. I created 8 VMs back then and they were all up; they still are. Then I created a snapshot of VM 105 and removed it, and after that the VM would no longer boot.


    When I consoled into the VM over VNC, it was stuck at boot with the error "boot failed, This hard drive is not bootable".

    Here is the task log from the web UI:

    Creating snapshot:
    Formatting '/var/lib/vz/images/105/vm-105-state-feb.raw', fmt=raw size=7017070592
    TASK OK

    Removing Snapshot:
    TASK OK


    VM config:
    root@instant:~# cat /etc/pve/qemu-server/105.conf
    balloon: 2048
    bootdisk: ide0
    cores: 2
    cpu: host
    cpuunits: 100
    ide0: local:105/vm-105-disk-1.qcow2,format=qcow2,cache=writeback,size=76G
    ide2: none,media=cdrom
    memory: 3096
    name: server
    net0: e1000=E6:FB:7E:BD:43:06,bridge=vmbr0
    onboot: 1
    ostype: l26
    sockets: 1


    Output of pveversion -v:
    root@instant:~# pveversion -v
    proxmox-ve-2.6.32: 3.2-136 (running kernel: 3.10.0-1-pve)
    pve-manager: 3.3-5 (running version: 3.3-5/bfebec03)
    pve-kernel-2.6.32-32-pve: 2.6.32-136
    pve-kernel-2.6.32-27-pve: 2.6.32-121
    pve-kernel-3.10.0-1-pve: 3.10.0-5
    pve-kernel-2.6.32-28-pve: 2.6.32-124
    pve-kernel-2.6.32-31-pve: 2.6.32-132
    pve-kernel-2.6.32-26-pve: 2.6.32-114
    lvm2: 2.02.98-pve4
    clvm: 2.02.98-pve4
    corosync-pve: 1.4.7-1
    openais-pve: 1.1.4-3
    libqb0: 0.11.1-2
    redhat-cluster-pve: 3.2.0-2
    resource-agents-pve: 3.9.2-4
    fence-agents-pve: 4.0.10-1
    pve-cluster: 3.0-15
    qemu-server: 3.3-3
    pve-firmware: 1.1-3
    libpve-common-perl: 3.0-19
    libpve-access-control: 3.0-15
    libpve-storage-perl: 3.0-25
    pve-libspice-server1: 0.12.4-3
    vncterm: 1.1-8
    vzctl: 4.0-1pve6
    vzprocps: 2.0.11-2
    vzquota: 3.1-2
    pve-qemu-kvm: 2.1-10
    ksm-control-daemon: 1.1-1
    glusterfs-client: 3.5.2-1

    kernel :
    root@instant:~# uname -a
    Linux instant. 3.10.0-1-pve #1 SMP Tue Dec 17 13:12:13 CET 2013 x86_64 GNU/Linux


    file:
    root@instant:/var/lib/vz/images/105# file vm-105-disk-1.qcow2
    vm-105-disk-1.qcow2: QEMU QCOW Image (unknown version)

    Fdisk:
    root@instant:/var/lib/vz/images/105# fdisk -l vm-105-disk-1.qcow2
    Disk vm-105-disk-1.qcow2: 66.8 GB, 66801762304 bytes
    255 heads, 63 sectors/track, 8121 cylinders, total 130472192 sectors
    Units = sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0x00000000
    Disk vm-105-disk-1.qcow2 doesn't contain a valid partition table


    ls:
    root@instant:/var/lib/vz/images/105# ls -l vm-105-disk-1.qcow2
    -rw-r--r-- 1 root root 66801762304 Feb 6 06:31 vm-105-disk-1.qcow2


    When I ran qemu-img check I got thousands of errors (leak errors etc.):
    qemu-img check vm-105-disk-1.qcow2

    So I used the -r all flag to repair the disk, and that cleared all of the reported errors:
    qemu-img check -r all vm-105-disk-1.qcow2


    Then I ran qemu-img check again:

    root@instant:/var/lib/vz/images/105# qemu-img check vm-105-disk-1.qcow2
    No errors were found on the image.
    303104/1245184 = 24.34% allocated, 0.00% fragmented, 0.00% compressed clusters
    Image end offset: 66800386048

    Still no luck.
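    As an aside, those check figures can be cross-checked with pure arithmetic (no qemu needed): with qcow2's default 64 KiB cluster size (the "cluster_size: 65536" shown later in the thread), the 1245184 total clusters correspond exactly to the 76 GiB virtual disk in the VM config, so the image header still describes the right geometry and the damage is in the refcount/data tables:

```shell
# Cross-check the qemu-img check figures: total clusters x cluster size
# should equal the virtual disk size from the VM config.
total_clusters=1245184
cluster_size=65536                    # 64 KiB, the qcow2 default
virtual_bytes=$((total_clusters * cluster_size))
echo "$virtual_bytes"                 # prints: 81604378624
echo $((76 * 1024 * 1024 * 1024))     # 76 GiB, also 81604378624
```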

    After that I decided to use qemu-nbd to mount the disk and run fsck.
    Here is what happened next:

    root@instant:/var/lib/vz/images/105# modprobe nbd
    FATAL: Module nbd not found.
    root@instant:~# lsmod | grep ndb
    root@instant:~#

    Though I am able to view the LVM partitions via testdisk.

    I don't know if I'm on the right track, and I don't know what to do next either. Can anybody please help me?
     
    #1 dhairyasharma, Feb 6, 2015
    Last edited: Feb 8, 2015
  2. dhairyasharma

  3. dhairyasharma

    OK, I switched the kernel, and now I can use qemu-nbd to mount the image.

    Trying my luck again. Let's see if it works.
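    For anyone following along, the usual qemu-nbd inspection sequence looks roughly like this. It's only a sketch: the path is a hypothetical copy of the image, the device names are illustrative, and it's guarded so it does nothing where qemu-nbd or root privileges are missing:

```shell
# Illustrative qemu-nbd sequence -- run it against a COPY of the image.
# Guarded to be a no-op without qemu-nbd, root, or the file itself.
img=/var/lib/vz/images/105/copy-of-vm-105-disk-1.qcow2   # hypothetical copy
if command -v qemu-nbd >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ] && [ -f "$img" ]; then
    modprobe nbd max_part=16       # module is "nbd" (the lsmod above grepped "ndb")
    qemu-nbd -c /dev/nbd0 "$img"   # expose the qcow2 as a block device
    partprobe /dev/nbd0            # re-read the partition table
    fsck -n /dev/nbd0p1            # read-only filesystem check first
    qemu-nbd -d /dev/nbd0          # disconnect when done
fi
```

    The -n flag keeps fsck read-only; drop it only once you are certain you are working on a copy.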
     
  4. dhairyasharma

    Thanks to testdisk, I have added the partitions back. Next I attached this disk image to another VM running CentOS, and luckily that system is able to find the LVM volumes. But I can't mount them. Can anyone help me with this?
     
  5. toshko4

    toshko4 New Member

    Joined:
    Feb 8, 2015
    Messages:
    3
    Likes Received:
    0
    Hi dhairyasharma,
    I can help with a HOWTO I wrote back when I had a similar problem, though I'm not sure it is exactly like yours. I'll paste it here:

    When a snapshot in a qcow2 image is broken and cannot be deleted, the only way out is to copy the data from that image to a new one.
    Here is how.
    We assume the guest is Ubuntu and the image holding the root partition is broken (the most complicated case; the other cases are similar),
    but the system still starts fine and the data is intact: the only problem is with the snapshots.
    We won't use a live CD, although there is a way with a live CD too, which may be simpler.
    The approach is this: copy the broken image, create a healthy image of the same size, start the machine,
    and from inside it copy the contents of the copied image onto the new healthy image with dd. This is needed
    because you cannot copy the root filesystem while it is in use. If the problematic image is not the root one, you
    can do the same without copying the broken image first, or use a live CD; your choice.

    1. Shutdown the domain.
    virsh shutdown ubuntu-server (or log in and shut down from inside)
    virsh list (to be sure it is down)

    2. Copy the image to another one
    cd /home/images
    cp ubuntu-server.img ubuntu-server1.img

    3. Create another image, to which we are going to copy the content:
    qemu-img create -f qcow2 -o preallocation=metadata ubuntu-server2.img 10737418240
    (the last argument is the size in bytes; take it from qemu-img info on the original image)
    ls -lash ubuntu-server2.img (to confirm it is not yet allocated: the size on the left is the actual file size,
    the one on the right is the virtual size plus metadata blocks and is normally larger than the size you asked for)
    ls -l ubuntu-server2.img (to get the file size in bytes; do NOT take it from qemu-img info,
    because that reports the internal virtual size of the image, not the file itself)
    fallocate -l theSizeInBytesFromAbove ubuntu-server2.img (to preallocate the file for faster writing)
    ls -lash ubuntu-server2.img (to confirm it is allocated: the size on the left should now equal the one on the right)
    qemu-img info ubuntu-server2.img (to double-check with the qemu-img command)
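    As an aside, the sparse-versus-allocated check in step 3 can be rehearsed on any plain file with just coreutils/util-linux, no qemu involved; a small throwaway demo:

```shell
# Demo of step 3's allocate-and-verify idea on a temporary file:
# a sparse file has a large apparent size but few allocated blocks;
# fallocate (or a dd fallback) makes the two match.
f=$(mktemp)
truncate -s 10M "$f"                      # sparse: apparent size 10 MiB
apparent=$(stat -c %s "$f")               # 10485760 bytes
fallocate -l "$apparent" "$f" 2>/dev/null || dd if=/dev/zero of="$f" bs=1M count=10 status=none
allocated=$(( $(stat -c %b "$f") * 512 )) # %b counts 512-byte blocks
echo "$apparent $allocated"
rm -f "$f"
```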

    4. Open Virtmanager from your graphical interface and connect to the VM host

    5. Stop and Start the storage pool to autodetect the new images

    6. Now add these images as another VirtIO storage devices in the domain.
    Here we have to give the size of the image in MB (ubuntu 10.04), you can calculate it from the qemu-img info command.

    7. Start the domain.

    8. FIRST check which /dev/vd... devices are used by the system now, because they may have changed after adding the images.

    9. Get info about the /dev/vd.. names of the images you just added
    parted -l

    10. The new empty image (ubuntu-server2.img) should show up as something like "Error: /dev/vdd: unrecognised disk label",
    and the image you copied (ubuntu-server1.img) should be visible too.

    11. Now copy the contents of ubuntu-server1.img to ubuntu-server2.img:
    dd if=/dev/vd.(the device of ubuntu-server1.img) of=/dev/vd.(the device of ubuntu-server2.img) (use the whole disk, e.g. /dev/vdc, not /dev/vdc1)
    Note: If you don't want to use dd but prefer to copy the data file-by-file, you must create identical partitions on ubuntu-server2.img
    and create filesystems there. Then mount the partitions from ubuntu-server1.img and ubuntu-server2.img and copy the
    content with "cp -af ...". Then, if this is the root partition, you must take a look at the "grubFixHOWTO.txt" file to
    restore GRUB and replace UUIDs accordingly.

    12. Confirm with parted -l again that the new device (e.g. /dev/vdd) is now identical to the source.

    13. Poweroff the domain
    poweroff

    14. Now rename the old images to something
    mv ubuntu-server.img old.ubuntu-server.img
    mv ubuntu-server1.img old.ubuntu-server1.img

    15. Rename the new image to the name of the old original broken image, so it can be used by the domain, like the original one
    mv ubuntu-server2.img ubuntu-server.img

    16. From Virtmanager, remove the temporary images you added earlier (ubuntu-server1.img and ubuntu-server2.img); otherwise
    the domain will not start and will give an error like:
    error: Failed to start domain ubuntu-server-pdc
    error: monitor socket did not show up.: Connection refused

    17. Start the domain and see if it boots successfully and everything is OK
    virsh start ubuntu-server


    That's it!
    Now you have an image without snapshots, but it is clean and healthy.
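    One small addition: step 11's dd copy can be rehearsed safely on two throwaway files before touching the real /dev/vd* devices; the flags carry over unchanged:

```shell
# Rehearsal of the step 11 dd copy on plain files instead of /dev/vd*.
src=$(mktemp); dst=$(mktemp)
head -c 1M /dev/urandom > "$src"              # stand-in for the source disk
dd if="$src" of="$dst" bs=64K conv=fsync status=none
same=$(cmp -s "$src" "$dst" && echo identical)
echo "$same"                                  # prints: identical
rm -f "$src" "$dst"
```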
     
  6. toshko4

    I see you've managed to repair something; it would be helpful to share what you did (for others with the same problem) and perhaps describe your problem more precisely.
    So, what errors do you get when mounting?
     
  7. dhairyasharma

    Many thanks for replying, toshko4 :)


    [root@static /]# fdisk -l


    Disk /dev/sda: 107.4 GB, 107374182400 bytes
    255 heads, 63 sectors/track, 13054 cylinders
    Units = cylinders of 16065 * 512 = 8225280 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0x00074945


    Device Boot Start End Blocks Id System
    /dev/sda1 * 1 64 512000 83 Linux
    Partition 1 does not end on cylinder boundary.
    /dev/sda2 64 13055 104344576 8e Linux LVM


    Disk /dev/sdb: 81.6 GB, 81604378624 bytes
    1 heads, 1 sectors/track, 159383552 cylinders, total 159383552 sectors
    Units = cylinders of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0x00000000


    Device Boot Start End Blocks Id System
    /dev/sdb1 * 1028095 105885694 52428800 83 Linux
    /dev/sdb2 125829121 157276350 15723615 8e Linux LVM


    Disk /dev/mapper/VolGroup-lv_root: 53.7 GB, 53687091200 bytes
    255 heads, 63 sectors/track, 6527 cylinders
    Units = cylinders of 16065 * 512 = 8225280 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0x00000000




    Disk /dev/mapper/VolGroup-lv_swap: 4227 MB, 4227858432 bytes
    255 heads, 63 sectors/track, 514 cylinders
    Units = cylinders of 16065 * 512 = 8225280 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0x00000000




    Disk /dev/mapper/VolGroup-lv_home: 48.9 GB, 48930750464 bytes
    255 heads, 63 sectors/track, 5948 cylinders
    Units = cylinders of 16065 * 512 = 8225280 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0x00000000


    sdb2 is the faulty disk.


    I can activate the VG with the --partial flag:

    [root@static /]# vgchange -a y vg_server --partial
    PARTIAL MODE. Incomplete logical volumes will be processed.
    Couldn't find device with uuid RRj0hg-XbT0-uc0t-7mlG-7Jp3-M6NJ-HkHe3b.
    3 logical volume(s) in volume group "vg_server" now active
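    A read-only first step for that missing-UUID warning (a hedged sketch: these are the standard LVM metadata locations, and nothing here writes anything): find the metadata archive that still references the lost PV, which is what a later pvcreate --restorefile / vgcfgrestore recovery would need:

```shell
# Look (read-only) for an LVM metadata archive that still references
# the PV UUID reported above; prints nothing if no archive matches.
uuid="RRj0hg-XbT0-uc0t-7mlG-7Jp3-M6NJ-HkHe3b"
grep -l "$uuid" /etc/lvm/archive/*.vg /etc/lvm/backup/* 2>/dev/null || true
# A matching file could then drive a restore -- destructive, so only on
# a copy of the disk:
#   pvcreate --uuid "$uuid" --restorefile <matching archive> /dev/sdb2
#   vgcfgrestore vg_server
```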


    [root@static /]# vgscan
    Reading all physical volumes. This may take a while...
    /dev/vg_server/lv_root: read failed after 0 of 4096 at 53687025664: Input/output error
    /dev/vg_server/lv_root: read failed after 0 of 4096 at 53687083008: Input/output error
    /dev/vg_server/lv_root: read failed after 0 of 4096 at 0: Input/output error
    /dev/vg_server/lv_root: read failed after 0 of 4096 at 4096: Input/output error
    /dev/vg_server/lv_home: read failed after 0 of 4096 at 0: Input/output error
    /dev/vg_server/lv_home: read failed after 0 of 4096 at 4096: Input/output error
    /dev/vg_server/lv_swap: read failed after 0 of 4096 at 3187605504: Input/output error
    /dev/vg_server/lv_swap: read failed after 0 of 4096 at 3187662848: Input/output error
    /dev/vg_server/lv_swap: read failed after 0 of 4096 at 0: Input/output error
    /dev/vg_server/lv_swap: read failed after 0 of 4096 at 4096: Input/output error
    Couldn't find device with uuid RRj0hg-XbT0-uc0t-7mlG-7Jp3-M6NJ-HkHe3b.
    Found volume group "vg_server" using metadata type lvm2
    Found volume group "VolGroup" using metadata type lvm2


    [root@static /]# lvscan
    /dev/vg_server/lv_root: read failed after 0 of 4096 at 53687025664: Input/output error
    /dev/vg_server/lv_root: read failed after 0 of 4096 at 53687083008: Input/output error
    /dev/vg_server/lv_root: read failed after 0 of 4096 at 0: Input/output error
    /dev/vg_server/lv_root: read failed after 0 of 4096 at 4096: Input/output error
    /dev/vg_server/lv_home: read failed after 0 of 4096 at 0: Input/output error
    /dev/vg_server/lv_home: read failed after 0 of 4096 at 4096: Input/output error
    /dev/vg_server/lv_swap: read failed after 0 of 4096 at 3187605504: Input/output error
    /dev/vg_server/lv_swap: read failed after 0 of 4096 at 3187662848: Input/output error
    /dev/vg_server/lv_swap: read failed after 0 of 4096 at 0: Input/output error
    /dev/vg_server/lv_swap: read failed after 0 of 4096 at 4096: Input/output error
    Couldn't find device with uuid RRj0hg-XbT0-uc0t-7mlG-7Jp3-M6NJ-HkHe3b.
    ACTIVE '/dev/vg_server/lv_root' [50.00 GiB] inherit
    ACTIVE '/dev/vg_server/lv_home' [18.54 GiB] inherit
    ACTIVE '/dev/vg_server/lv_swap' [2.97 GiB] inherit
    ACTIVE '/dev/VolGroup/lv_root' [50.00 GiB] inherit
    ACTIVE '/dev/VolGroup/lv_home' [45.57 GiB] inherit
    ACTIVE '/dev/VolGroup/lv_swap' [3.94 GiB] inherit


    [root@static /]# lvdisplay
    /dev/vg_server/lv_root: read failed after 0 of 4096 at 53687025664: Input/output error
    /dev/vg_server/lv_root: read failed after 0 of 4096 at 53687083008: Input/output error
    /dev/vg_server/lv_root: read failed after 0 of 4096 at 0: Input/output error
    /dev/vg_server/lv_root: read failed after 0 of 4096 at 4096: Input/output error
    /dev/vg_server/lv_home: read failed after 0 of 4096 at 0: Input/output error
    /dev/vg_server/lv_home: read failed after 0 of 4096 at 4096: Input/output error
    /dev/vg_server/lv_swap: read failed after 0 of 4096 at 3187605504: Input/output error
    /dev/vg_server/lv_swap: read failed after 0 of 4096 at 3187662848: Input/output error
    /dev/vg_server/lv_swap: read failed after 0 of 4096 at 0: Input/output error
    /dev/vg_server/lv_swap: read failed after 0 of 4096 at 4096: Input/output error
    Couldn't find device with uuid RRj0hg-XbT0-uc0t-7mlG-7Jp3-M6NJ-HkHe3b.
    --- Logical volume ---
    LV Path /dev/vg_server/lv_root
    LV Name lv_root
    VG Name vg_server
    LV UUID 6ilnt9-NDlW-l0DT-hX9p-FtDy-C7sy-4AwTmn
    LV Write Access read/write
    LV Creation host, time localhost.localdomain, 2014-03-31 07:43:25 -0400
    LV Status available
    # open 0
    LV Size 50.00 GiB
    Current LE 12800
    Segments 1
    Allocation inherit
    Read ahead sectors auto
    - currently set to 256
    Block device 253:4


    --- Logical volume ---
    LV Path /dev/vg_server/lv_home
    LV Name lv_home
    VG Name vg_server
    LV UUID 3Ojh5O-NB5W-apjm-qZw9-qmjF-pXjj-KR1wUJ
    LV Write Access read/write
    LV Creation host, time localhost.localdomain, 2014-03-31 07:43:35 -0400
    LV Status available
    # open 0
    LV Size 18.54 GiB
    Current LE 4746
    Segments 2
    Allocation inherit
    Read ahead sectors auto
    - currently set to 256
    Block device 253:6


    --- Logical volume ---
    LV Path /dev/vg_server/lv_swap
    LV Name lv_swap
    VG Name vg_server
    LV UUID GHWlsX-oZQb-2jYg-h1PF-m1vL-U3QO-gFU7vR
    LV Write Access read/write
    LV Creation host, time localhost.localdomain, 2014-03-31 07:43:39 -0400
    LV Status available
    # open 0
    LV Size 2.97 GiB
    Current LE 760
    Segments 1
    Allocation inherit
    Read ahead sectors auto
    - currently set to 256
    Block device 253:8


    --- Logical volume ---
    LV Path /dev/VolGroup/lv_root
    LV Name lv_root
    VG Name VolGroup
    LV UUID G7t2rA-0Kaf-NSVK-yQ96-z5FW-lDeQ-NShoUn
    LV Write Access read/write
    LV Creation host, time localhost.localdomain, 2015-02-08 04:05:28 -0500
    LV Status available
    # open 1
    LV Size 50.00 GiB
    Current LE 12800
    Segments 1
    Allocation inherit
    Read ahead sectors auto
    - currently set to 256
    Block device 253:0


    --- Logical volume ---
    LV Path /dev/VolGroup/lv_home
    LV Name lv_home
    VG Name VolGroup
    LV UUID HjJglm-FrLe-QH5e-PjRa-2OG9-cft2-MSNkgi
    LV Write Access read/write
    LV Creation host, time localhost.localdomain, 2015-02-08 04:05:38 -0500
    LV Status available
    # open 1
    LV Size 45.57 GiB
    Current LE 11666
    Segments 1
    Allocation inherit
    Read ahead sectors auto
    - currently set to 256
    Block device 253:2


    --- Logical volume ---
    LV Path /dev/VolGroup/lv_swap
    LV Name lv_swap
    VG Name VolGroup
    LV UUID UQcKoG-cKtd-zvle-3riA-92Aa-0BhC-8iakqK
    LV Write Access read/write
    LV Creation host, time localhost.localdomain, 2015-02-08 04:05:48 -0500
    LV Status available
    # open 1
    LV Size 3.94 GiB
    Current LE 1008
    Segments 1
    Allocation inherit
    Read ahead sectors auto
    - currently set to 256
    Block device 253:1

    But when I try to mount it, I get this error
    [root@static /]# mount //dev/vg_server/lv_home /home/a
    mount: you must specify the filesystem type

    I even tried specifying the filesystem type:
    [root@static /]# mount -t ext4 /dev/vg_server/lv_home /home/a
    mount: wrong fs type, bad option, bad superblock on /dev/mapper/vg_server-lv_home,
    missing codepage or helper program, or other error
    In some cases useful info is found in syslog - try
    dmesg | tail or so


    [root@static /]# fsck /dev/vg_server/lv_home
    fsck from util-linux-ng 2.17.2
    e2fsck 1.41.12 (17-May-2010)
    fsck.ext2: Attempt to read block from filesystem resulted in short read while trying to open /dev/mapper/vg_server-lv_home
     
  8. dhairyasharma

    Hi toshko4

    Thanks for your reply. My issue is a bit different from yours: the original disk image is corrupted and does not boot. Whenever I try to boot the VM with the original image, it just shows "Boot Failed, Not a bootable device".
     
  9. toshko4

    Hmm, sorry, but I'm not familiar with LVM, so I can't help you there. In any case, you should first copy the image file and experiment on the copy, NOT on the original, so that if anything goes wrong, or a repair tool breaks it, you still have a "healthy" image.
     
  10. dhairyasharma

    I am not experimenting on the original file. I made a copy and am using that to fix this issue.
     
  11. dhairyasharma

    After running qemu-img check on the image I found some errors:

    160 errors were found on the image.
    Data may be corrupted, or further writes to the image may corrupt it.
    409496/1245184 = 32.89% allocated, 0.18% fragmented, 0.00% compressed clusters
    Image end offset: 66800386048

    I tried repairing it with -r leaks first, then with -r all:

    The following inconsistencies were found and repaired:


    0 leaked clusters
    86 corruptions


    Double checking the fixed image now...
    No errors were found on the image.
    409496/1245184 = 32.89% allocated, 0.18% fragmented, 0.00% compressed clusters
    Image end offset: 66800386048

    Now what?
     
  12. dhairyasharma

    I'm trying to re-create the image with qemu-img convert.

    (The original image belonged to VM 105, but I made a copy and am trying to run it on a new VM, ID 108.)

    I converted the disk to raw and then back to qcow2:
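    The commands for that round trip were presumably along these lines (a sketch: the source filename vm-108-disk-1.qcow2 is a guess, since only the raw and final qcow2 names appear in the qemu-img info output; guarded so it is a no-op where qemu-img or the file is absent):

```shell
# Presumed qcow2 -> raw -> qcow2 round trip; the source name is hypothetical.
src=vm-108-disk-1.qcow2
if command -v qemu-img >/dev/null 2>&1 && [ -f "$src" ]; then
    qemu-img convert -f qcow2 -O raw   "$src"            vm-108-disk-2.raw
    qemu-img convert -f raw   -O qcow2 vm-108-disk-2.raw vm-108-disk-4.qcow2
fi
```

    Converting forces qemu-img to walk every guest-visible cluster and write a fresh image, so the old refcount tables and any stale snapshot records are left behind in the process.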

    root@instant:/var/lib/vz/images/108# qemu-img info vm-108-disk-2.raw
    image: vm-108-disk-2.raw
    file format: raw
    virtual size: 76G (81604378624 bytes)
    disk size: 3.7G



    root@instant:/var/lib/vz/images/108# qemu-img info vm-108-disk-4.qcow2
    image: vm-108-disk-4.qcow2
    file format: qcow2
    virtual size: 76G (81604378624 bytes)
    disk size: 3.7G
    cluster_size: 65536
    Format specific information:
    compat: 1.1
    lazy refcounts: false

    Now I'm going to boot it up; let's see if it works.
     
  13. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,835
    Likes Received:
    159
  14. dhairyasharma

    Hi, and thanks for your reply, Udo.

    Yes, I tried converting it to raw, but it didn't help. I haven't tried extracting the LVM volumes yet, though. I'm going to give that a try now.
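    For the record, "extracting the LVM out" of a raw image usually goes something like this (an illustrative sketch: the image and VG names are taken from earlier in the thread, kpartx availability is assumed, root is required, and it should only ever run against a copy):

```shell
# Map the raw image's partitions and pull a logical volume out -- on a COPY.
img=vm-108-disk-2.raw
if [ "$(id -u)" -eq 0 ] && [ -f "$img" ] && command -v kpartx >/dev/null 2>&1; then
    kpartx -av "$img"                    # creates /dev/mapper/loopNp1, loopNp2 ...
    vgscan && vgchange -ay vg_server     # activate the guest's volume group
    # salvage the LV contents, padding over unreadable blocks:
    dd if=/dev/vg_server/lv_home of=lv_home.img bs=4M conv=sync,noerror
    vgchange -an vg_server && kpartx -d "$img"
fi
```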
     