Corrupt Filesystem after snapshot

cryptolukas

New Member
Dec 8, 2016
10
1
3
30
I made a snapshot. Afterwards, the server was no longer usable.

When I try to restart the system, I receive the following error:

Code:
kvm: -drive file=/var/lib/vz/images/200/vm-200-disk-1.qcow2,if=none,id=drive-virtio0,format=qcow2,cache=none,aio=native,detect-zeroes=on: qcow2: Image is corrupt; cannot be opened read/write
TASK ERROR: start failed: command '/usr/bin/kvm -id 200 -chardev 'socket,id=qmp,path=/var/run/qemu-server/200.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/200.pid -daemonize -smbios 'type=1,uuid=0b764250-f58c-48c5-b6ce-cda2ad04da12' -name websrv02 -smp '4,sockets=2,cores=2,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga cirrus -vnc unix:/var/run/qemu-server/200.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 2048 -k de -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:95df8a998b30' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -drive 'file=/var/lib/vz/images/200/vm-200-disk-1.qcow2,if=none,id=drive-virtio0,format=qcow2,cache=none,aio=native,detect-zeroes=on' -device 'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap200i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=3A:92:E5:8B:5B:79,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300'' failed: exit code 1
 

w3ph

New Member
Aug 20, 2011
29
0
1
I've been consistently running into a similar situation with Proxmox 4.4. Snapshots of VMs that use local-lvm storage always work, but snapshots of VMs that use .qcow2 images on NFS storage often wind up damaging the image, requiring repair with qemu-img. This didn't happen with Proxmox 3.x. It's happening with 3 different file servers (two FreeNAS/TrueNAS and one Synology). Disk image corruption isn't happening except when we try to make snapshots. I'm running tests to see whether this only affects big images (500 GB) or small ones (32 GB) as well.

Our workaround for now is to move the VM's storage to local lvm-thin when we need to make a snapshot, then move it back to NFS, if we need the lvm-thin space, once we're finished with whatever made us need the snapshot.
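For anyone who wants to script this workaround, here is a rough sketch from the CLI. The VM ID (200), snapshot name, and storage IDs (local-lvm, nfs01) are examples, adjust them for your setup; the sketch just prints the commands instead of running them, so remove the `echo` inside the function to execute for real.

```shell
# Hypothetical sketch of the move-to-lvm-thin workaround. VM ID, snapshot
# name, and storage IDs are examples. Prints the commands rather than
# executing them; drop the `echo` inside the function to run for real.
snapshot_via_lvmthin() {
  vmid=$1; snapname=$2; tmp_storage=$3; nfs_storage=$4
  echo "qm move_disk $vmid virtio0 $tmp_storage"   # move NFS qcow2 to lvm-thin
  echo "qm snapshot $vmid $snapname"               # snapshot is safe on lvm-thin
  echo "qm delsnapshot $vmid $snapname"            # drop it before moving back
  echo "qm move_disk $vmid virtio0 $nfs_storage"   # return to NFS when finished
}
snapshot_via_lvmthin 200 pre-upgrade local-lvm nfs01
```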
 

w3ph

New Member
Aug 20, 2011
29
0
1
After more tests, the corruption that happens to .qcow2 images when making a snapshot involves CentOS 6 and 7 VMs that were set up using virtio disks, when storage is NFS. I can reproduce this 100% of the time - it isn't a subtle bug.

Snapshots of CentOS VMs that were set up using scsi as the disk type are not getting corrupted.

pve-manager/4.4-13/7ea56165 (running kernel: 4.4.49-1-pve)

For now, my workaround is to migrate virtio disk type VMs that need snapshots to lvm-thin, where the snapshots always work, and only attempt snapshots on .qcow2 VMs that use scsi disk type.

I've only tested with CentOS 6 and 7 so far so I don't know whether this affects Ubuntu or Debian yet.
 

strausmann

New Member
Aug 25, 2010
4
0
1
Good evening,

I have the same problem. The conditions with me are as follows:

PVE Manager version: pve-manager/4.4-13/7ea56165
KVM OS: CloudLinux 7.3
KVM settings: local storage / qcow2 format / cache: writethrough / size: 100 GB / VirtIO

The image is no longer accessible. If I run "qemu-img check vm-110-disk-1.qcow2", it gives me the error:

qemu-img: Check failed: Cannot allocate memory

Why do I get this error when I check the image? How can I save the image? Unfortunately, the R1Soft backup has let me down, just when I needed it.

I need urgent help...

Thank you

Regards,

Bjorn
 

wbumiller

Proxmox Staff Member
Staff member
Jun 23, 2015
645
84
28
Can you also provide the output of `qemu-img info /path/to/qcow2` and `qemu-img snapshot -l /path/to/qcow2` please?

Edit:
Also: did the VM crash during the creation of the snapshot? Could you include the syslog messages from around that time?
 

strausmann

New Member
Aug 25, 2010
4
0
1
Hello Wolfgang,

here is the output:

qemu-img info vm-110-disk-1.qcow2

image: vm-110-disk-1.qcow2
file format: qcow2
virtual size: 100G (107374182400 bytes)
disk size: 68G
cluster_size: 65536
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: true

qemu-img snapshot -l vm-110-disk-1.qcow2

no output
 

strausmann

New Member
Aug 25, 2010
4
0
1
Jun 3 22:21:46 pmn01 kernel: [4224931.799151] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224931.803318] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224931.807482] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224931.816703] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224931.908318] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224931.917225] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224932.081222] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224932.092298] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224932.097200] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224932.552718] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224932.561933] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224932.574442] audit: type=1400 audit(1496521306.854:5503899): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3433 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:21:46 pmn01 kernel: [4224932.594628] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224932.602857] audit: type=1400 audit(1496521306.882:5503903): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3431 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:21:47 pmn01 kernel: [4224933.216564] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:47 pmn01 kernel: [4224933.229052] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:47 pmn01 kernel: [4224933.237380] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:47 pmn01 kernel: [4224933.249249] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:47 pmn01 kernel: [4224933.270956] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:47 pmn01 kernel: [4224933.459627] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:48 pmn01 pvedaemon[2239]: <root@pam> starting task UPID:pmn01:00006F16:192E7134:59331A5C:qmdelsnapshot:110:root@pam:
Jun 3 22:21:48 pmn01 pvedaemon[28438]: <root@pam> delete snapshot VM 110: PleskUpdate
Jun 3 22:21:48 pmn01 pvedaemon[28438]: VM is locked (snapshot)
Jun 3 22:21:48 pmn01 pvedaemon[2239]: <root@pam> end task UPID:pmn01:00006F16:192E7134:59331A5C:qmdelsnapshot:110:root@pam: VM is locked (snapshot)
Jun 3 22:21:52 pmn01 kernel: [4224937.872056] audit: type=1400 audit(1496521312.150:5503904): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3433 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:21:59 pmn01 kernel: [4224945.101388] audit: type=1400 audit(1496521319.381:5503909): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3437 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:21:59 pmn01 kernel: [4224945.630237] audit: type=1400 audit(1496521319.913:5503910): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3431 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:21:59 pmn01 kernel: [4224945.652358] audit: type=1400 audit(1496521319.933:5503913): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3440 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:01 pmn01 CRON[28560]: (root) CMD (/usr/local/rtm/bin/rtm 28 > /dev/null 2> /dev/null)
Jun 3 22:22:04 pmn01 kernel: [4224950.664829] audit: type=1400 audit(1496521324.945:5503927): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3427 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:05 pmn01 kernel: [4224951.503214] audit: type=1400 audit(1496521325.785:5503930): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3439 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:08 pmn01 kernel: [4224953.746215] audit: type=1400 audit(1496521328.025:5503933): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3430 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:08 pmn01 kernel: [4224954.009628] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:22:16 pmn01 kernel: [4224962.273051] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:22:17 pmn01 kernel: [4224963.465466] audit: type=1400 audit(1496521337.745:5503942): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3441 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:20 pmn01 kernel: [4224965.749942] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:22:22 pmn01 kernel: [4224968.311855] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:22:22 pmn01 kernel: [4224968.609310] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:22:23 pmn01 kernel: [4224968.925544] audit: type=1400 audit(1496521343.205:5503946): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3432 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:28 pmn01 kernel: [4224974.576551] audit: type=1400 audit(1496521348.857:5503951): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3428 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:29 pmn01 kernel: [4224975.012274] audit: type=1400 audit(1496521349.293:5503952): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3434 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:38 pmn01 kernel: [4224983.999280] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:22:40 pmn01 kernel: [4224986.429879] audit: type=1400 audit(1496521360.708:5503953): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3441 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:40 pmn01 kernel: [4224986.664979] audit: type=1400 audit(1496521360.944:5503954): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3437 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:43 pmn01 kernel: [4224989.084109] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:22:44 pmn01 kernel: [4224989.726488] audit: type=1400 audit(1496521364.004:5503955): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3427 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:58 pmn01 systemd-timesyncd[31442]: interval/delta/delay/jitter/drift 2048s/+0.000s/0.014s/0.006s/-25ppm
Jun 3 22:22:59 pmn01 kernel: [4225005.127975] audit: type=1400 audit(1496521379.404:5503961): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3434 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:23:01 pmn01 CRON[29741]: (root) CMD (/usr/local/rtm/bin/rtm 28 > /dev/null 2> /dev/null)
Jun 3 22:23:05 pmn01 kernel: [4225010.752322] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:23:05 pmn01 kernel: [4225010.753730] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:23:05 pmn01 kernel: [4225011.232657] audit: type=1400 audit(1496521385.508:5503965): app
 

w3ph

New Member
Aug 20, 2011
29
0
1
I had the same problem, where VMs on NFS storage, qcow2 image, virtio disk would be corrupted when I tried to take a snapshot. In some cases the images were repairable with qemu-img but in other cases I had to restore from backup because the image was so damaged. Nasty bug. This affected only qcow2 images on NFS. Local-lvm wasn't affected.

The fix was to shut down the VM, delete the virtio disk from Hardware (the image doesn't go away, it just gets listed as unused), double-click the unused image and add it back as SCSI, then go into Options and set the boot order to use the SCSI volume (the boot order will still say virtio, so the volume won't be found and boot will fail unless you do this).

This won't fix your corrupted image, but it so far has prevented it from happening again.
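The GUI steps above can be sketched from the command line as well. This is a hypothetical equivalent, not something taken from this thread: the VM ID and volume name are examples, and the option names should be double-checked against your PVE version. It prints the commands instead of running them.

```shell
# Hypothetical CLI equivalent of the virtio-to-SCSI switch described
# above. VM ID and volume name are examples; verify option names against
# your PVE version. Prints the commands rather than executing them.
convert_virtio_to_scsi() {
  vmid=$1; volume=$2
  echo "qm shutdown $vmid"
  echo "qm set $vmid --delete virtio0"   # disk reappears as 'unused0'
  echo "qm set $vmid --scsi0 $volume"    # re-attach the same volume as SCSI
  echo "qm set $vmid --bootdisk scsi0"   # boot order still points at virtio0 otherwise
  echo "qm start $vmid"
}
convert_virtio_to_scsi 110 local:110/vm-110-disk-1.qcow2
```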
 
Jun 22, 2017
2
0
1
45
Hi guys,

we have a similar issue: when we create a snapshot, the disk is corrupted.

If we start the VM, we get this message:

Task viewer: VM 64115237 - Start
kvm: -drive file=/mnt/pve/vmdisk-nfs-emcspb1-02/images/237/vm-237-disk-1.qcow2,if=none,id=drive-virtio0,format=qcow2,cache=none,aio=native,detect-zeroes=on: qcow2: Image is corrupt; cannot be opened read/write


Conditions:
VMs on NFS storage / virtio driver / qcow2 format / disk size 80 GB / snapshot with RAM.

qemu-img check shows the status of the qcow2 image:

qemu-img check vm-237-disk-1.qcow2

ERROR cluster 16048 refcount=2 reference=3
ERROR cluster 16049 refcount=2 reference=3
ERROR cluster 16050 refcount=2 reference=3
ERROR cluster 16051 refcount=2 reference=3
...
ERROR OFLAG_COPIED data cluster: l2_entry=c818d0000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=fb4a50000 refcount=1

292 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.

27 leaked clusters were found on the image.
This means waste of disk space, but no harm to data.
819200/819200 = 100.00% allocated, 9.56% fragmented, 0.00% compressed clusters
Image end offset: 87233527808


Sometimes we can save the VM with these commands: "qemu-img check -r vm-237-disk-1.qcow2" and "qm unlock 237".
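For reference, a sketch of that repair sequence. The image path and VM ID are examples, and since `qemu-img check -r` rewrites metadata in place, it's worth keeping an untouched copy first. It prints the commands rather than executing them.

```shell
# Sketch of the repair sequence quoted above. Image path and VM ID are
# examples. `qemu-img check -r` modifies the image in place, so copy it
# first. Prints the commands rather than executing them.
repair_qcow2() {
  img=$1; vmid=$2
  echo "cp $img $img.bak"            # untouched copy, in case repair makes it worse
  echo "qemu-img check -r all $img"  # fix refcount errors and leaked clusters
  echo "qm unlock $vmid"             # clear the stale snapshot lock
}
repair_qcow2 /var/lib/vz/images/237/vm-237-disk-1.qcow2 237
```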

After a repair and unlock, we have this status:

qemu-img info vm-237-disk-1.qcow2
image: vm-237-disk-1.qcow2
file format: qcow2
virtual size: 50G (53687091200 bytes)
disk size: 80G
cluster_size: 65536
Snapshot list:
ID TAG VM SIZE DATE VM CLOCK
1 centreon2_1 0 2017-06-19 12:59:41 72:49:04.217
2 centreon2_2 0 2017-06-21 19:10:14 00:05:59.367
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false


Info on proxmox version :
pve-manager/4.4-13/7ea56165 (running kernel: 4.4.62-1-pve)

Regards,
Stéphane C.

Note: we never saw this issue with Proxmox 3.3.5.
 

afrugone

Member
Nov 26, 2008
99
0
6
I've just had the same problem. I need to recover this server, or at least the files on it, and I don't have any recent backup. This is the second time it's happened to me. Please help me.
 
Dec 26, 2017
8
0
1
40
Good day guys,

Season's greetings to you.

I can confirm that we experienced the same scary problem after running a snapshot on a VM with qcow2 disk images stored on NFS, presented to the VM as "Virtio SCSI". Others experiencing a similar problem seem to report the problem only occurring when using "Virtio Block" whereas we experienced the problem using "VirtIO SCSI".
 
Hi all, Hi David,

have you found the reason for the problem? We experienced the same problem with qcow2 on NFS (Synology storages) - but not always. I cloned some VMs and took snapshots without any problems. Yet other Vms crashed and corrupted the filesystem immediately.

Best
Sebastian
 

afrugone

Member
Nov 26, 2008
99
0
6
Never use virtio with qcow2 over NFS; it's a very bad idea, get away from that setup. If you make an online backup you can lose your VM. It happened to me three times until I found this problem.
 

afrugone

Member
Nov 26, 2008
99
0
6
I've seen both cases, but for me it's a very dangerous situation; better to use SCSI, not virtio, as the disk type.
 
Dec 26, 2017
8
0
1
40
Thank you for your reply Sebastian.

What you've mentioned regarding your experience with NFS and Synology is interesting. We are using NFS on QNAP.
Sadly I haven't had a chance to investigate further yet but am very eager to find a fix.
 
Dec 26, 2017
8
0
1
40
Thank you.
It seems that people in this forum have reported the problem when using "VirtIO" (block), whereas I experienced it when using "VirtIO SCSI".
I don't use a Synology NAS; I use a QNAP NAS, which we are looking to replace.
 
Dec 14, 2017
8
0
1
38
I also experienced this bug today and have changed all of my Linux VMs to scsi. Our NFS is on a brand new CentOS server and so I'm not sure it's related to QNAP or Synology. Has anyone had this happen to a Windows VM?
 
