Corrupt Filesystem after snapshot

Discussion in 'Proxmox VE: Installation and configuration' started by cryptolukas, Jan 23, 2017.

  1. cryptolukas

    cryptolukas New Member

    Joined:
    Dec 8, 2016
    Messages:
    10
    Likes Received:
    1
    I made a snapshot. Afterwards, the server was no longer usable.

    When I try to restart the VM, I receive the following error:

    Code:
    kvm: -drive file=/var/lib/vz/images/200/vm-200-disk-1.qcow2,if=none,id=drive-virtio0,format=qcow2,cache=none,aio=native,detect-zeroes=on: qcow2: Image is corrupt; cannot be opened read/write
    TASK ERROR: start failed: command '/usr/bin/kvm -id 200 -chardev 'socket,id=qmp,path=/var/run/qemu-server/200.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/200.pid -daemonize -smbios 'type=1,uuid=0b764250-f58c-48c5-b6ce-cda2ad04da12' -name websrv02 -smp '4,sockets=2,cores=2,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga cirrus -vnc unix:/var/run/qemu-server/200.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 2048 -k de -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:95df8a998b30' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -drive 'file=/var/lib/vz/images/200/vm-200-disk-1.qcow2,if=none,id=drive-virtio0,format=qcow2,cache=none,aio=native,detect-zeroes=on' -device 'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap200i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=3A:92:E5:8B:5B:79,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300'' failed: exit code 1
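
    For reference, the image can be inspected offline with qemu-img while the VM is stopped. A rough check, using the path from the error above; any repair should only be attempted on a copy of the file:

    Code:
    qemu-img info /var/lib/vz/images/200/vm-200-disk-1.qcow2
    qemu-img check /var/lib/vz/images/200/vm-200-disk-1.qcow2
    # if errors are reported, copy the image first before trying "qemu-img check -r all"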
     
  2. w3ph

    w3ph New Member

    Joined:
    Aug 20, 2011
    Messages:
    29
    Likes Received:
    0
    I've been consistently running into a similar situation with Proxmox 4.4. Snapshots of VMs that use local-lvm storage always work, but snapshots of VMs that use .qcow2 images on NFS storage often wind up damaging the image, requiring repair with qemu-img. This didn't happen with Proxmox 3.x. It's happening with 3 different file servers (two FreeNAS/TrueNAS and one Synology). Disk image corruption isn't happening except when we try to make snapshots. I'm running tests to see if I can figure out whether this only affects big images (500 GB) or also small (32 GB) ones.

    Our workaround for now is to move the VM's storage to local lvm-thin when we need to make a snapshot, then move it back to NFS if we need the lvm-thin space once we're finished with whatever made us need the snapshot.
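
    If it helps, the disk move can also be done from the CLI with qm move_disk. A rough sketch only; VM ID 200 and the storage names local-lvm / nfs-store are placeholders for your own IDs and storages:

    Code:
    # move the disk onto lvm-thin before snapshotting, deleting the source copy
    qm move_disk 200 virtio0 local-lvm --delete 1
    # ...take the snapshot, do the work that needed it, remove the snapshot again...
    # then move the disk back to the NFS storage, keeping qcow2 format
    qm move_disk 200 virtio0 nfs-store --format qcow2 --delete 1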
     
  3. w3ph

    w3ph New Member

    Joined:
    Aug 20, 2011
    Messages:
    29
    Likes Received:
    0
    After more tests: the corruption that happens to .qcow2 images when making a snapshot involves CentOS 6 and 7 VMs that were set up with virtio disks on NFS storage. I can reproduce this 100% of the time - it isn't a subtle bug.

    Snapshots of CentOS VMs that were set up using scsi as the disk type are not getting corrupted.

    pve-manager/4.4-13/7ea56165 (running kernel: 4.4.49-1-pve)

    For now, my workaround is to migrate virtio disk type VMs that need snapshots to lvm-thin, where the snapshots always work, and only attempt snapshots on .qcow2 VMs that use scsi disk type.

    I've only tested with CentOS 6 and 7 so far, so I don't know whether this affects Ubuntu or Debian yet.
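
    To find which VMs on a node are still using virtio-blk disks (and are therefore candidates for this workaround), the VM configs can be grepped directly - a rough sketch, assuming the standard config path and an example VM ID:

    Code:
    # VM configs that contain a virtioN disk entry
    grep -l '^virtio[0-9]' /etc/pve/qemu-server/*.conf
    # or, per VM, show only the disk lines
    qm config 200 | grep -E '^(virtio|scsi|ide|sata)[0-9]+:'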
     
  4. strausmann

    strausmann New Member

    Joined:
    Aug 25, 2010
    Messages:
    4
    Likes Received:
    0
    Good evening,

    I have the same problem. My setup is as follows:

    PVE Manager version: pve-manager/4.4-13/7ea56165
    KVM OS: CloudLinux 7.3
    KVM settings: local storage / qcow2 format / cache: writethrough / size: 100 GB / VirtIO

    The image is no longer accessible. If I run "qemu-img check vm-110-disk-1.qcow2", I get this error:

    qemu-img: Check failed: Can not allocate memory

    Why does checking the image give this error? How can I save the image? Unfortunately, the R1Soft backup has left me in the lurch - just when you need it.

    I need urgent help...

    Thank you

    Regards,

    Bjorn
     
  5. wbumiller

    wbumiller Proxmox Staff Member
    Staff Member

    Joined:
    Jun 23, 2015
    Messages:
    643
    Likes Received:
    82
    Can you also provide the output of `qemu-img info /path/to/qcow2` and `qemu-img snapshot -l /path/to/qcow2` please?

    Edit:
    Also: did the VM crash during the creation of the snapshot? Could you include the syslog messages from around that time?
     
  6. strausmann

    strausmann New Member

    Joined:
    Aug 25, 2010
    Messages:
    4
    Likes Received:
    0
    Hello Wolfgang,

    here is the output:

    qemu-img info vm-110-disk-1.qcow2

    image: vm-110-disk-1.qcow2
    file format: qcow2
    virtual size: 100G (107374182400 bytes)
    disk size: 68G
    cluster_size: 65536
    Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: true

    qemu-img snapshot -l vm-110-disk-1.qcow2

    no output
     
  7. strausmann

    strausmann New Member

    Joined:
    Aug 25, 2010
    Messages:
    4
    Likes Received:
    0
    Jun 3 22:21:46 pmn01 kernel: [4224931.799151] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:21:46 pmn01 kernel: [4224931.803318] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:21:46 pmn01 kernel: [4224931.807482] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:21:46 pmn01 kernel: [4224931.816703] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:21:46 pmn01 kernel: [4224931.908318] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:21:46 pmn01 kernel: [4224931.917225] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:21:46 pmn01 kernel: [4224932.081222] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:21:46 pmn01 kernel: [4224932.092298] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:21:46 pmn01 kernel: [4224932.097200] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:21:46 pmn01 kernel: [4224932.552718] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:21:46 pmn01 kernel: [4224932.561933] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:21:46 pmn01 kernel: [4224932.574442] audit: type=1400 audit(1496521306.854:5503899): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3433 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
    Jun 3 22:21:46 pmn01 kernel: [4224932.594628] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:21:46 pmn01 kernel: [4224932.602857] audit: type=1400 audit(1496521306.882:5503903): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3431 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
    Jun 3 22:21:47 pmn01 kernel: [4224933.216564] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:21:47 pmn01 kernel: [4224933.229052] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:21:47 pmn01 kernel: [4224933.237380] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:21:47 pmn01 kernel: [4224933.249249] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:21:47 pmn01 kernel: [4224933.270956] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:21:47 pmn01 kernel: [4224933.459627] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:21:48 pmn01 pvedaemon[2239]: <root@pam> starting task UPID:pmn01:00006F16:192E7134:59331A5C:qmdelsnapshot:110:root@pam:
    Jun 3 22:21:48 pmn01 pvedaemon[28438]: <root@pam> delete snapshot VM 110: PleskUpdate
    Jun 3 22:21:48 pmn01 pvedaemon[28438]: VM is locked (snapshot)
    Jun 3 22:21:48 pmn01 pvedaemon[2239]: <root@pam> end task UPID:pmn01:00006F16:192E7134:59331A5C:qmdelsnapshot:110:root@pam: VM is locked (snapshot)
    Jun 3 22:21:52 pmn01 kernel: [4224937.872056] audit: type=1400 audit(1496521312.150:5503904): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3433 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
    Jun 3 22:21:59 pmn01 kernel: [4224945.101388] audit: type=1400 audit(1496521319.381:5503909): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3437 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
    Jun 3 22:21:59 pmn01 kernel: [4224945.630237] audit: type=1400 audit(1496521319.913:5503910): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3431 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
    Jun 3 22:21:59 pmn01 kernel: [4224945.652358] audit: type=1400 audit(1496521319.933:5503913): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3440 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
    Jun 3 22:22:01 pmn01 CRON[28560]: (root) CMD (/usr/local/rtm/bin/rtm 28 > /dev/null 2> /dev/null)
    Jun 3 22:22:04 pmn01 kernel: [4224950.664829] audit: type=1400 audit(1496521324.945:5503927): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3427 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
    Jun 3 22:22:05 pmn01 kernel: [4224951.503214] audit: type=1400 audit(1496521325.785:5503930): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3439 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
    Jun 3 22:22:08 pmn01 kernel: [4224953.746215] audit: type=1400 audit(1496521328.025:5503933): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3430 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
    Jun 3 22:22:08 pmn01 kernel: [4224954.009628] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:22:16 pmn01 kernel: [4224962.273051] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:22:17 pmn01 kernel: [4224963.465466] audit: type=1400 audit(1496521337.745:5503942): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3441 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
    Jun 3 22:22:20 pmn01 kernel: [4224965.749942] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:22:22 pmn01 kernel: [4224968.311855] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:22:22 pmn01 kernel: [4224968.609310] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:22:23 pmn01 kernel: [4224968.925544] audit: type=1400 audit(1496521343.205:5503946): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3432 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
    Jun 3 22:22:28 pmn01 kernel: [4224974.576551] audit: type=1400 audit(1496521348.857:5503951): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3428 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
    Jun 3 22:22:29 pmn01 kernel: [4224975.012274] audit: type=1400 audit(1496521349.293:5503952): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3434 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
    Jun 3 22:22:38 pmn01 kernel: [4224983.999280] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:22:40 pmn01 kernel: [4224986.429879] audit: type=1400 audit(1496521360.708:5503953): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3441 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
    Jun 3 22:22:40 pmn01 kernel: [4224986.664979] audit: type=1400 audit(1496521360.944:5503954): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3437 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
    Jun 3 22:22:43 pmn01 kernel: [4224989.084109] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:22:44 pmn01 kernel: [4224989.726488] audit: type=1400 audit(1496521364.004:5503955): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3427 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
    Jun 3 22:22:58 pmn01 systemd-timesyncd[31442]: interval/delta/delay/jitter/drift 2048s/+0.000s/0.014s/0.006s/-25ppm
    Jun 3 22:22:59 pmn01 kernel: [4225005.127975] audit: type=1400 audit(1496521379.404:5503961): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3434 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
    Jun 3 22:23:01 pmn01 CRON[29741]: (root) CMD (/usr/local/rtm/bin/rtm 28 > /dev/null 2> /dev/null)
    Jun 3 22:23:05 pmn01 kernel: [4225010.752322] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:23:05 pmn01 kernel: [4225010.753730] hcp: ERROR: FALLOCATE FAILED!
    Jun 3 22:23:05 pmn01 kernel: [4225011.232657] audit: type=1400 audit(1496521385.508:5503965): app
     
  8. w3ph

    w3ph New Member

    Joined:
    Aug 20, 2011
    Messages:
    29
    Likes Received:
    0
    I had the same problem, where VMs on NFS storage, qcow2 image, virtio disk would be corrupted when I tried to take a snapshot. In some cases the images were repairable with qemu-img but in other cases I had to restore from backup because the image was so damaged. Nasty bug. This affected only qcow2 images on NFS. Local-lvm wasn't affected.

    The fix was to shut down the VM, detach the virtio disk under Hardware (it doesn't go away, it just gets listed as unused), double-click the unused disk and re-attach it as SCSI, then go into Options and set the boot order to the SCSI volume (the boot order will still say virtio, so the volume won't be found and boot will fail unless you change this).
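
    The same change can also be scripted with qm set - a rough sketch only; the VM ID and the volume name below are just examples from this thread, so adapt them and try it on a test VM first:

    Code:
    qm stop 110
    # detach the virtio disk - it should show up again as unused0 in the config (verify on a test VM first)
    qm set 110 --delete virtio0
    # re-attach the same volume as a SCSI disk on the virtio-scsi controller
    qm set 110 --scsihw virtio-scsi-pci --scsi0 nfs-store:110/vm-110-disk-1.qcow2
    # point the boot order at the new disk
    qm set 110 --bootdisk scsi0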

    This won't fix your corrupted image, but it so far has prevented it from happening again.
     
  9. coudert

    coudert New Member
    Proxmox Subscriber

    Joined:
    Jun 22, 2017
    Messages:
    2
    Likes Received:
    0
    Hi guys,

    we have a similar issue: when we create a snapshot, the disk is corrupted.

    If we start the VM, we get this message:

    Task viewer: VM 64115237 - Start
    kvm: -drive file=/mnt/pve/vmdisk-nfs-emcspb1-02/images/237/vm-237-disk-1.qcow2,if=none,id=drive-virtio0,format=qcow2,cache=none,aio=native,detect-zeroes=on: qcow2: Image is corrupt; cannot be opened read/write


    Conditions:
    VMs on NFS storage / virtio driver / qcow2 format / disk size 80 GB / snapshot with RAM.

    qemu-img check to see the status of the qcow2 image:

    qemu-img check vm-237-disk-1.qcow2

    ERROR cluster 16048 refcount=2 reference=3
    ERROR cluster 16049 refcount=2 reference=3
    ERROR cluster 16050 refcount=2 reference=3
    ERROR cluster 16051 refcount=2 reference=3
    ...
    ERROR OFLAG_COPIED data cluster: l2_entry=c818d0000 refcount=1
    ERROR OFLAG_COPIED data cluster: l2_entry=fb4a50000 refcount=1

    292 errors were found on the image.
    Data may be corrupted, or further writes to the image may corrupt it.

    27 leaked clusters were found on the image.
    This means waste of disk space, but no harm to data.
    819200/819200 = 100.00% allocated, 9.56% fragmented, 0.00% compressed clusters
    Image end offset: 87233527808


    Sometimes we can save the VM with these commands: "qemu-img check -r vm-237-disk-1.qcow2" and "qm unlock 237".
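
    For anyone trying the same repair: qemu-img's -r switch expects "leaks" or "all" as an argument, and it is safer to work on a copy first. A rough sequence with the file above (the .bak name is just an example):

    Code:
    cp vm-237-disk-1.qcow2 vm-237-disk-1.qcow2.bak   # keep an untouched copy of the corrupt image
    qemu-img check -r all vm-237-disk-1.qcow2        # repair errors and leaked clusters in place
    qm unlock 237                                    # clear the leftover snapshot lock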

    After a repair and unlock, we have this status:

    qemu-img info vm-237-disk-1.qcow2
    image: vm-237-disk-1.qcow2
    file format: qcow2
    virtual size: 50G (53687091200 bytes)
    disk size: 80G
    cluster_size: 65536
    Snapshot list:
    ID TAG VM SIZE DATE VM CLOCK
    1 centreon2_1 0 2017-06-19 12:59:41 72:49:04.217
    2 centreon2_2 0 2017-06-21 19:10:14 00:05:59.367
    Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false


    Info on Proxmox version:
    pve-manager/4.4-13/7ea56165 (running kernel: 4.4.62-1-pve)

    Regards,
    Stéphane C.

    Note: we never saw this issue with Proxmox 3.3.5.
     
  10. remark

    remark Member

    Joined:
    May 4, 2011
    Messages:
    91
    Likes Received:
    6
    Running out of space on host?
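
    That can be checked quickly with df on the local VM directory and the NFS mount, plus pvesm status for the storage view (the mount name below is just the one from the error message earlier in the thread):

    Code:
    df -h /var/lib/vz /mnt/pve/vmdisk-nfs-emcspb1-02
    pvesm status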
     
  11. coudert

    coudert New Member
    Proxmox Subscriber

    Joined:
    Jun 22, 2017
    Messages:
    2
    Likes Received:
    0
    Hi,

    No problem with local or NFS disk space - there are several TB free.

    Thanks,
    Stéphane C.
     
  12. afrugone

    afrugone Member

    Joined:
    Nov 26, 2008
    Messages:
    99
    Likes Received:
    0
    I've just had the same problem. I need to recover this server, or at least the files on it. I don't have any recent backup. This is the second time this has happened to me. Please help me.
     
  13. David Wilson

    David Wilson New Member
    Proxmox Subscriber

    Joined:
    Dec 26, 2017
    Messages:
    8
    Likes Received:
    0
    Good day guys,

    Season's greetings to you.

    I can confirm that we experienced the same scary problem after running a snapshot on a VM with qcow2 disk images stored on NFS, presented to the VM as "VirtIO SCSI". Others experiencing a similar problem seem to report it only when using "VirtIO Block", whereas we hit it with "VirtIO SCSI".
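
    The two setups are easy to mix up; in the VM config under /etc/pve/qemu-server/<vmid>.conf they look roughly like this (illustrative values only, not taken from our machines):

    Code:
    # "VirtIO Block": the disk itself sits on the virtio-blk bus
    virtio0: nfs-store:237/vm-237-disk-1.qcow2,cache=none,size=80G

    # "VirtIO SCSI": a SCSI disk behind a virtio-scsi controller
    scsihw: virtio-scsi-pci
    scsi0: nfs-store:237/vm-237-disk-1.qcow2,cache=none,size=80G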
     
  14. taenzerme

    taenzerme Member
    Proxmox Subscriber

    Joined:
    Sep 18, 2013
    Messages:
    35
    Likes Received:
    0
    Hi all, Hi David,

    have you found the reason for the problem? We experienced the same problem with qcow2 on NFS (Synology storage) - but not always. I cloned some VMs and took snapshots without any problems, yet other VMs crashed and corrupted the filesystem immediately.

    Best
    Sebastian
     
  15. afrugone

    afrugone Member

    Joined:
    Nov 26, 2008
    Messages:
    99
    Likes Received:
    0
    Never use virtio with qcow2 over NFS - it is a very bad idea, get away from that setup. If you make an online backup you can lose your VM. It happened to me three times until I found this problem.
     
  16. taenzerme

    taenzerme Member
    Proxmox Subscriber

    Joined:
    Sep 18, 2013
    Messages:
    35
    Likes Received:
    0
    @afrugone ... and I just took 3 snapshots of a Debian VM on NFS w/ qcow2 without any problems. I can't reproduce it in general, that's why I'm asking.
     
  17. afrugone

    afrugone Member

    Joined:
    Nov 26, 2008
    Messages:
    99
    Likes Received:
    0
    I've had both cases, but for me it is a very dangerous situation; better to use SCSI rather than virtio as the disk type.
     
  18. David Wilson

    David Wilson New Member
    Proxmox Subscriber

    Joined:
    Dec 26, 2017
    Messages:
    8
    Likes Received:
    0
    Thank you for your reply Sebastian.

    What you've mentioned regarding your experience with NFS and Synology is interesting. We are using NFS on QNAP.
    Sadly I haven't had a chance to investigate further yet but am very eager to find a fix.
     
  19. David Wilson

    David Wilson New Member
    Proxmox Subscriber

    Joined:
    Dec 26, 2017
    Messages:
    8
    Likes Received:
    0
    Thank you.
    It seems that people in this forum have reported the problem when using "VirtIO". I experienced the problem when using "VirtIO SCSI".
    I don't use a Synology NAS - I use a Qnap NAS, which we are looking to replace.
     
  20. Antony Street

    Antony Street New Member
    Proxmox Subscriber

    Joined:
    Dec 14, 2017
    Messages:
    8
    Likes Received:
    0
    I also experienced this bug today and have changed all of my Linux VMs to scsi. Our NFS is on a brand new CentOS server and so I'm not sure it's related to QNAP or Synology. Has anyone had this happen to a Windows VM?
     