Corrupt Filesystem after snapshot

cryptolukas

New Member
Dec 8, 2016
10
1
3
30
I made a snapshot. Afterwards, the server was no longer usable.

When I try to restart the system, I receive the following error:

Code:
kvm: -drive file=/var/lib/vz/images/200/vm-200-disk-1.qcow2,if=none,id=drive-virtio0,format=qcow2,cache=none,aio=native,detect-zeroes=on: qcow2: Image is corrupt; cannot be opened read/write
TASK ERROR: start failed: command '/usr/bin/kvm -id 200 -chardev 'socket,id=qmp,path=/var/run/qemu-server/200.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/200.pid -daemonize -smbios 'type=1,uuid=0b764250-f58c-48c5-b6ce-cda2ad04da12' -name websrv02 -smp '4,sockets=2,cores=2,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga cirrus -vnc unix:/var/run/qemu-server/200.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 2048 -k de -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:95df8a998b30' -drive 'if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -drive 'file=/var/lib/vz/images/200/vm-200-disk-1.qcow2,if=none,id=drive-virtio0,format=qcow2,cache=none,aio=native,detect-zeroes=on' -device 'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap200i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=3A:92:E5:8B:5B:79,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300'' failed: exit code 1
 

w3ph

New Member
Aug 20, 2011
29
0
1
I've been consistently running into a similar situation with Proxmox 4.4. Snapshots of VMs that use local-lvm storage always work, but snapshots of VMs that use .qcow2 images on NFS storage often wind up damaging the image, requiring repair with qemu-img. This didn't happen with Proxmox 3.x. It's happening with 3 different file servers (two FreeNAS/TrueNAS and one Synology). Disk image corruption isn't happening except when we try to make snapshots. I'm running tests to see whether this only affects big images (500 GB) or small ones (32 GB) as well.

Our workaround for now is to move the VM's storage to local lvm-thin when we need to make a snapshot, then move it back to NFS, if we need the lvm-thin space, once we're finished with whatever made us need the snapshot.
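For anyone who wants to script this workaround, here is a rough sketch from the CLI. The VM ID (200), snapshot name, and storage IDs (local-lvm, nfs01) are examples, adjust them for your setup; the sketch just prints the commands instead of running them, so remove the `echo` inside the function to execute for real.

```shell
# Hypothetical sketch of the move-to-lvm-thin workaround. VM ID, snapshot
# name, and storage IDs are examples. Prints the commands rather than
# executing them; drop the `echo` inside the function to run for real.
snapshot_via_lvmthin() {
  vmid=$1; snapname=$2; tmp_storage=$3; nfs_storage=$4
  echo "qm move_disk $vmid virtio0 $tmp_storage"   # move NFS qcow2 to lvm-thin
  echo "qm snapshot $vmid $snapname"               # snapshot is safe on lvm-thin
  echo "qm delsnapshot $vmid $snapname"            # drop it before moving back
  echo "qm move_disk $vmid virtio0 $nfs_storage"   # return to NFS when finished
}
snapshot_via_lvmthin 200 pre-upgrade local-lvm nfs01
```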
 

w3ph

New Member
Aug 20, 2011
29
0
1
After more tests, the corruption that happens to .qcow2 images when making a snapshot involves CentOS 6 and 7 VMs that were set up using virtio disks, when storage is NFS. I can reproduce this 100% of the time - it isn't a subtle bug.

Snapshots of CentOS VMs that were set up using scsi as the disk type are not getting corrupted.

pve-manager/4.4-13/7ea56165 (running kernel: 4.4.49-1-pve)

For now, my workaround is to migrate virtio disk type VMs that need snapshots to lvm-thin, where the snapshots always work, and only attempt snapshots on .qcow2 VMs that use scsi disk type.

I've only tested with CentOS 6 and 7 so far so I don't know whether this affects Ubuntu or Debian yet.
 

strausmann

New Member
Aug 25, 2010
4
0
1
Good evening,

I have the same problem. The conditions with me are as follows:

PVE Manager version: pve-manager/4.4-13/7ea56165
KVM OS: CloudLinux 7.3
KVM settings: local storage / qcow2 format / cache: writethrough / size: 100 GB / VirtIO

The image is no longer accessible. If I run "qemu-img check vm-110-disk-1.qcow2", it gives me the error:

qemu-img: Check failed: Cannot allocate memory

Why do I get this error when I check the image? How can I save the image? Unfortunately, the R1Soft backup has let me down, just when I needed it.

I need urgent help...

Thank you

Regards,

Bjorn
 

wbumiller

Proxmox Staff Member
Staff member
Jun 23, 2015
645
84
28
Can you also provide the output of `qemu-img info /path/to/qcow2` and `qemu-img snapshot -l /path/to/qcow2` please?

Edit:
Also: did the VM crash during the creation of the snapshot? Could you include the syslog messages from around that time?
 

strausmann

New Member
Aug 25, 2010
4
0
1
Hello Wolfgang,

here is the output:

qemu-img info vm-110-disk-1.qcow2

image: vm-110-disk-1.qcow2
file format: qcow2
virtual size: 100G (107374182400 bytes)
disk size: 68G
cluster_size: 65536
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: true

qemu-img snapshot -l vm-110-disk-1.qcow2

no output
 

strausmann

New Member
Aug 25, 2010
4
0
1
Jun 3 22:21:46 pmn01 kernel: [4224931.799151] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224931.803318] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224931.807482] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224931.816703] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224931.908318] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224931.917225] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224932.081222] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224932.092298] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224932.097200] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224932.552718] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224932.561933] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224932.574442] audit: type=1400 audit(1496521306.854:5503899): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3433 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:21:46 pmn01 kernel: [4224932.594628] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:46 pmn01 kernel: [4224932.602857] audit: type=1400 audit(1496521306.882:5503903): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3431 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:21:47 pmn01 kernel: [4224933.216564] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:47 pmn01 kernel: [4224933.229052] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:47 pmn01 kernel: [4224933.237380] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:47 pmn01 kernel: [4224933.249249] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:47 pmn01 kernel: [4224933.270956] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:47 pmn01 kernel: [4224933.459627] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:21:48 pmn01 pvedaemon[2239]: <root@pam> starting task UPID:pmn01:00006F16:192E7134:59331A5C:qmdelsnapshot:110:root@pam:
Jun 3 22:21:48 pmn01 pvedaemon[28438]: <root@pam> delete snapshot VM 110: PleskUpdate
Jun 3 22:21:48 pmn01 pvedaemon[28438]: VM is locked (snapshot)
Jun 3 22:21:48 pmn01 pvedaemon[2239]: <root@pam> end task UPID:pmn01:00006F16:192E7134:59331A5C:qmdelsnapshot:110:root@pam: VM is locked (snapshot)
Jun 3 22:21:52 pmn01 kernel: [4224937.872056] audit: type=1400 audit(1496521312.150:5503904): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3433 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:21:59 pmn01 kernel: [4224945.101388] audit: type=1400 audit(1496521319.381:5503909): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3437 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:21:59 pmn01 kernel: [4224945.630237] audit: type=1400 audit(1496521319.913:5503910): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3431 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:21:59 pmn01 kernel: [4224945.652358] audit: type=1400 audit(1496521319.933:5503913): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3440 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:01 pmn01 CRON[28560]: (root) CMD (/usr/local/rtm/bin/rtm 28 > /dev/null 2> /dev/null)
Jun 3 22:22:04 pmn01 kernel: [4224950.664829] audit: type=1400 audit(1496521324.945:5503927): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3427 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:05 pmn01 kernel: [4224951.503214] audit: type=1400 audit(1496521325.785:5503930): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3439 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:08 pmn01 kernel: [4224953.746215] audit: type=1400 audit(1496521328.025:5503933): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3430 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:08 pmn01 kernel: [4224954.009628] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:22:16 pmn01 kernel: [4224962.273051] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:22:17 pmn01 kernel: [4224963.465466] audit: type=1400 audit(1496521337.745:5503942): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3441 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:20 pmn01 kernel: [4224965.749942] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:22:22 pmn01 kernel: [4224968.311855] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:22:22 pmn01 kernel: [4224968.609310] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:22:23 pmn01 kernel: [4224968.925544] audit: type=1400 audit(1496521343.205:5503946): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3432 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:28 pmn01 kernel: [4224974.576551] audit: type=1400 audit(1496521348.857:5503951): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3428 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:29 pmn01 kernel: [4224975.012274] audit: type=1400 audit(1496521349.293:5503952): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3434 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:38 pmn01 kernel: [4224983.999280] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:22:40 pmn01 kernel: [4224986.429879] audit: type=1400 audit(1496521360.708:5503953): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3441 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:40 pmn01 kernel: [4224986.664979] audit: type=1400 audit(1496521360.944:5503954): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3437 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:43 pmn01 kernel: [4224989.084109] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:22:44 pmn01 kernel: [4224989.726488] audit: type=1400 audit(1496521364.004:5503955): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3427 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:22:58 pmn01 systemd-timesyncd[31442]: interval/delta/delay/jitter/drift 2048s/+0.000s/0.014s/0.006s/-25ppm
Jun 3 22:22:59 pmn01 kernel: [4225005.127975] audit: type=1400 audit(1496521379.404:5503961): apparmor="DENIED" operation="sendmsg" profile="/usr/sbin/named" name="/run/systemd/journal/dev-log" pid=3434 comm="named" requested_mask="w" denied_mask="w" fsuid=109 ouid=0
Jun 3 22:23:01 pmn01 CRON[29741]: (root) CMD (/usr/local/rtm/bin/rtm 28 > /dev/null 2> /dev/null)
Jun 3 22:23:05 pmn01 kernel: [4225010.752322] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:23:05 pmn01 kernel: [4225010.753730] hcp: ERROR: FALLOCATE FAILED!
Jun 3 22:23:05 pmn01 kernel: [4225011.232657] audit: type=1400 audit(1496521385.508:5503965): app
 

w3ph

New Member
Aug 20, 2011
29
0
1
I had the same problem, where VMs on NFS storage, qcow2 image, virtio disk would be corrupted when I tried to take a snapshot. In some cases the images were repairable with qemu-img but in other cases I had to restore from backup because the image was so damaged. Nasty bug. This affected only qcow2 images on NFS. Local-lvm wasn't affected.

The fix was to shut down the VM, delete the virtio disk from Hardware (the image doesn't go away, it just gets listed as unused), double-click the unused image and add it back as SCSI, then go into Options and set the boot order to use the SCSI volume (the boot order will still say virtio, so the volume won't be found and boot will fail unless you do this).

This won't fix your corrupted image, but it so far has prevented it from happening again.
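The GUI steps above can be sketched from the command line as well. This is a hypothetical equivalent, not something taken from this thread: the VM ID and volume name are examples, and the option names should be double-checked against your PVE version. It prints the commands instead of running them.

```shell
# Hypothetical CLI equivalent of the virtio-to-SCSI switch described
# above. VM ID and volume name are examples; verify option names against
# your PVE version. Prints the commands rather than executing them.
convert_virtio_to_scsi() {
  vmid=$1; volume=$2
  echo "qm shutdown $vmid"
  echo "qm set $vmid --delete virtio0"   # disk reappears as 'unused0'
  echo "qm set $vmid --scsi0 $volume"    # re-attach the same volume as SCSI
  echo "qm set $vmid --bootdisk scsi0"   # boot order still points at virtio0 otherwise
  echo "qm start $vmid"
}
convert_virtio_to_scsi 110 local:110/vm-110-disk-1.qcow2
```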
 
Jun 22, 2017
2
0
1
45
Hi guys,

we have a similar issue: when we create a snapshot, the disk is corrupted.

If we start the VM, we get this message:

Task viewer: VM 64115237 - Start
kvm: -drive file=/mnt/pve/vmdisk-nfs-emcspb1-02/images/237/vm-237-disk-1.qcow2,if=none,id=drive-virtio0,format=qcow2,cache=none,aio=native,detect-zeroes=on: qcow2: Image is corrupt; cannot be opened read/write


Conditions:
VMs on NFS storage / virtio driver / qcow2 format / disk size 80 GB / snapshot with RAM.

qemu-img check shows the status of the qcow2 image:

qemu-img check vm-237-disk-1.qcow2

ERROR cluster 16048 refcount=2 reference=3
ERROR cluster 16049 refcount=2 reference=3
ERROR cluster 16050 refcount=2 reference=3
ERROR cluster 16051 refcount=2 reference=3
...
ERROR OFLAG_COPIED data cluster: l2_entry=c818d0000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=fb4a50000 refcount=1

292 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.

27 leaked clusters were found on the image.
This means waste of disk space, but no harm to data.
819200/819200 = 100.00% allocated, 9.56% fragmented, 0.00% compressed clusters
Image end offset: 87233527808


Sometimes we can save the VM with these commands: "qemu-img check -r vm-237-disk-1.qcow2" and "qm unlock 237".
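For reference, a sketch of that repair sequence. The image path and VM ID are examples, and since `qemu-img check -r` rewrites metadata in place, it's worth keeping an untouched copy first. It prints the commands rather than executing them.

```shell
# Sketch of the repair sequence quoted above. Image path and VM ID are
# examples. `qemu-img check -r` modifies the image in place, so copy it
# first. Prints the commands rather than executing them.
repair_qcow2() {
  img=$1; vmid=$2
  echo "cp $img $img.bak"            # untouched copy, in case repair makes it worse
  echo "qemu-img check -r all $img"  # fix refcount errors and leaked clusters
  echo "qm unlock $vmid"             # clear the stale snapshot lock
}
repair_qcow2 /var/lib/vz/images/237/vm-237-disk-1.qcow2 237
```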

After a repair and unlock, we have this status:

qemu-img info vm-237-disk-1.qcow2
image: vm-237-disk-1.qcow2
file format: qcow2
virtual size: 50G (53687091200 bytes)
disk size: 80G
cluster_size: 65536
Snapshot list:
ID TAG VM SIZE DATE VM CLOCK
1 centreon2_1 0 2017-06-19 12:59:41 72:49:04.217
2 centreon2_2 0 2017-06-21 19:10:14 00:05:59.367
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false


Info on proxmox version :
pve-manager/4.4-13/7ea56165 (running kernel: 4.4.62-1-pve)

Regards,
Stéphane C.

Note: we never saw this issue with Proxmox 3.3.5.
 

afrugone

Member
Nov 26, 2008
99
0
6
I've just had the same problem. I need to recover this server, or at least the files on it, and I don't have any recent backup. This is the second time it's happened to me. Please help me.
 
Dec 26, 2017
8
0
1
40
Good day guys,

Season's greetings to you.

I can confirm that we experienced the same scary problem after running a snapshot on a VM with qcow2 disk images stored on NFS, presented to the VM as "Virtio SCSI". Others experiencing a similar problem seem to report the problem only occurring when using "Virtio Block" whereas we experienced the problem using "VirtIO SCSI".
 
Hi all, Hi David,

have you found the reason for the problem? We experienced the same problem with qcow2 on NFS (Synology storages) - but not always. I cloned some VMs and took snapshots without any problems. Yet other Vms crashed and corrupted the filesystem immediately.

Best
Sebastian
 

afrugone

Member
Nov 26, 2008
99
0
6
Never use virtio with qcow2 over NFS; it's a very bad idea, get away from that setup. If you make an online backup you can lose your VM. It happened to me three times until I found this problem.
 

afrugone

Member
Nov 26, 2008
99
0
6
I've seen both cases, but for me it's a very dangerous situation; better to use SCSI, not virtio, as the disk type.
 
Dec 26, 2017
8
0
1
40
Thank you for your reply Sebastian.

What you've mentioned regarding your experience with NFS and Synology is interesting. We are using NFS on QNAP.
Sadly I haven't had a chance to investigate further yet but am very eager to find a fix.
 
Dec 26, 2017
8
0
1
40
Thank you.
It seems that people in this forum have reported the problem when using "VirtIO" (block), whereas I experienced it when using "VirtIO SCSI".
I don't use a Synology NAS; I use a QNAP NAS, which we are looking to replace.
 
Dec 14, 2017
8
0
1
38
I also experienced this bug today and have changed all of my Linux VMs to scsi. Our NFS is on a brand new CentOS server and so I'm not sure it's related to QNAP or Synology. Has anyone had this happen to a Windows VM?
 
