qcow2 corruption after snapshot or heavy disk I/O

tafkaz

Hi,
we have just run into a really odd problem here and wondered whether anyone else is seeing the same thing or has any answers:

- Proxmox Virtual Environment 4.4-12/e71b7a74 on Supermicro server hardware with an LSI RAID adapter.
RAID1 with two 3 TB disks, no errors in the RAID, no errors in the (ECC) RAM
- KVM guest running Univention (a maintained Debian distribution), 16 GB RAM and a 500 GB qcow2 vdisk
- It worked fine for quite some time, but started behaving strangely 2 or 3 weeks ago.
- At first it shut itself down during running snapshot backups. After a restart, fsck finds some fixable errors and the machine then works OK again.
- In the last 2-3 days, taking a snapshot through the Proxmox GUI has completely corrupted the whole qcow2 vdisk.
- The resulting corrupted file cannot be repaired with qemu-img check (read-only file???)
- Booting the VM from a rescue disk and running fsck shows a huge number of errors about files not being allocated; fixing them leaves an empty disk with loads of entries in "lost+found" -> not usable

I have now converted the qcow2 from a backup to raw, and hope I am safe....
I only hope we never end up in a bigger disaster here...
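
For readers who want to do the same qcow2-to-raw conversion, here is a minimal sketch of the command involved, wrapped in Python so it can be scripted. The file names are hypothetical placeholders, and the sketch only builds and prints the command; it assumes the VM is shut down (or that the source is a restored backup copy) before anything is actually run.

```python
import shlex
from typing import List


def qcow2_to_raw_command(src: str, dst: str) -> List[str]:
    """Build a qemu-img command that copies a qcow2 image into a raw file."""
    # -f names the source format explicitly instead of letting qemu-img probe it,
    # -O selects the output format, -p only prints a progress indicator.
    return ["qemu-img", "convert", "-p", "-f", "qcow2", "-O", "raw", src, dst]


# Hypothetical file names, for illustration only.
cmd = qcow2_to_raw_command("vm-100-disk-1.qcow2", "vm-100-disk-1.raw")
print(shlex.join(cmd))
```

Note that a raw file has no internal snapshot support, so after this change snapshots depend on what the underlying storage offers.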

Help, ideas, anything is much appreciated.
Thanks, Sascha
 
Yes, seems so. I will replace that one.
But if that's not the main problem, then we should investigate the real cause of the qcow2 failures, I guess.
Such a worrying issue...

thanks
Sascha
 
Just got news from the datacenter (Hetzner). Both HDs passed their extended tests and don't show any critical errors.
So I guess we have something else here?

Thanks
Sascha
 
I would replace the Seagate ASAP.
Too many errors. The long test only checks for surface errors (unreadable sectors), but you also have tons of seek errors; it seems that something bad will happen on the mechanical side before long.
 
I'm able to reproduce this, and it isn't a hardware problem. pve-manager/4.4-13/7ea56165 (running kernel: 4.4.44-1-pve), with CentOS 6 and 7 VMs using virtio disk driver, qcow2 images, on NFS (have been able to reproduce reliably on 3 different NFS servers). All is well, then you take a snapshot, check disk image with 'qemu-img check' and discover many leaked clusters and hundreds (sometimes thousands) of corruptions. The VM's OS thinks it has a corrupted disk and things start to break quickly. If you're lucky the OS puts its volumes in read-only mode. Attempting to roll back to the snapshot fails because of the corruption.
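
A check like the one described above can be scripted so it runs after every snapshot. The following is a rough sketch, not a tested tool: it assumes `qemu-img check --output=json` is available and reads the `corruptions`, `leaks` and `check-errors` fields of its report. The function names are made up for this example.

```python
import json
import subprocess
from typing import Dict


def summarize_check(report: Dict) -> str:
    """Turn a parsed qemu-img check report into a one-line verdict."""
    # "corruptions" and "leaks" may be absent from the report; treat that as 0.
    corruptions = report.get("corruptions", 0)
    leaks = report.get("leaks", 0)
    check_errors = report.get("check-errors", 0)
    if corruptions or check_errors:
        return f"damaged: {corruptions} corruptions, {leaks} leaked clusters"
    if leaks:
        return f"leaks only: {leaks} leaked clusters"
    return "clean"


def check_image(path: str) -> Dict:
    """Run qemu-img check on an image and return the parsed JSON report."""
    # qemu-img exits non-zero when it finds problems, so the exit code is not
    # treated as a failure here. If the image cannot be opened at all, stdout
    # is not valid JSON and json.loads raises.
    proc = subprocess.run(
        ["qemu-img", "check", "--output=json", path],
        capture_output=True,
        text=True,
    )
    return json.loads(proc.stdout)


# Figures of the kind reported in this thread, fed in by hand.
print(summarize_check({"corruptions": 27822, "leaks": 9275, "check-errors": 0}))
```

One caveat worth keeping in mind: the qemu-img documentation warns that querying an image while another process is modifying it may see inconsistent state, so a check taken with the VM shut down is more trustworthy than one taken while it is running.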

This does not happen when we use local storage (lvm-thin), only NFS and .qcow2. This behavior started with Proxmox 4 - with 3.x we were always able to make and revert to snapshots when we needed them.

Today I created a new CentOS 7 VM with a large (500 GB) disk using the SCSI driver (the default), with a .qcow2 image on NFS, and so far I have not been able to corrupt the .qcow2 image by taking snapshots, so maybe virtio is the culprit.
 
Upgraded: pve-manager/4.4-13/7ea56165 (running kernel: 4.4.49-1-pve)

This didn't make a difference:

VM 1, CentOS 7, virtio 500 GB qcow2 disk, gets corrupted 100% of the time when I try to take a snapshot. Sample qemu-img check results below.

VM 2, CentOS 7, virtio 500 GB qcow2 disk, does not get corrupted. I can make multiple snapshots, roll back, etc. with no corruption or leaks detected by qemu-img check.

The action that triggers the corruption problem is taking a snapshot. Left alone, I'm seeing no corruption in normal operation.

Now that I understand that I can take snapshots if I move the disk image to lvm-thin or if I create the VM with scsi disk, I can work around this, but it's a nasty bug if you aren't expecting it. I'm not sure when this started to happen since we don't take snapshots all that often, but it didn't happen with Proxmox 3.x and is happening now with 4.4. I don't know yet whether the VM's OS matters (we've had the problem with CentOS 6 and 7 VMs but that could be sampling error since the majority of the VMs we've been working with have used CentOS 6 or 7). I'll run some tests with Ubuntu VMs when I have a minute.

Here's the damage that happened after taking a snapshot with a CentOS 7 virtio qcow2 VM:

Image end offset: 537288376320
ERROR cluster 7988422 refcount=1 reference=2
ERROR cluster 7996583 refcount=1 reference=2
ERROR cluster 7996616 refcount=1 reference=2
ERROR cluster 7996653 refcount=1 reference=2
ERROR cluster 7998877 refcount=1 reference=2
ERROR cluster 8000268 refcount=1 reference=2
ERROR cluster 8000269 refcount=1 reference=2
ERROR cluster 8001931 refcount=1 reference=2
ERROR cluster 8001990 refcount=1 reference=2
ERROR cluster 8021195 refcount=1 reference=2
ERROR cluster 8029356 refcount=1 reference=2
ERROR cluster 8029357 refcount=1 reference=2
ERROR cluster 8029358 refcount=1 reference=2
ERROR cluster 8029359 refcount=1 reference=2
ERROR cluster 8029362 refcount=1 reference=2
ERROR cluster 8029878 refcount=1 reference=2
ERROR cluster 8032113 refcount=1 reference=2
ERROR cluster 8032137 refcount=1 reference=2
ERROR cluster 8032141 refcount=1 reference=2
ERROR cluster 8032147 refcount=1 reference=2
ERROR cluster 8032151 refcount=1 reference=2
ERROR cluster 8033532 refcount=1 reference=2
ERROR cluster 8033759 refcount=1 reference=2
ERROR cluster 8034310 refcount=1 reference=2
ERROR cluster 8034311 refcount=1 reference=2
ERROR cluster 8034312 refcount=1 reference=2
ERROR cluster 8034313 refcount=1 reference=2
ERROR cluster 8034459 refcount=1 reference=2
ERROR cluster 8034460 refcount=1 reference=2
ERROR cluster 8037582 refcount=1 reference=2
ERROR cluster 8037619 refcount=1 reference=2
ERROR cluster 8037620 refcount=1 reference=2
ERROR cluster 8037621 refcount=1 reference=2
ERROR cluster 8041788 refcount=1 reference=2
ERROR cluster 8064589 refcount=1 reference=2
ERROR cluster 8064590 refcount=1 reference=2
ERROR OFLAG_COPIED L2 cluster: l1_index=975 l1_entry=79e4c60000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7a04a70000 refcount=1
ERROR OFLAG_COPIED L2 cluster: l1_index=976 l1_entry=7a04c80000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7a04ed0000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7a0d9d0000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7a130c0000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7a130d0000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7a198b0000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7a19c60000 refcount=1
ERROR OFLAG_COPIED L2 cluster: l1_index=979 l1_entry=7a64cb0000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7a84ac0000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7a84ad0000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7a84ae0000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7a84af0000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7a84b20000 refcount=1
ERROR OFLAG_COPIED L2 cluster: l1_index=981 l1_entry=7aa4ce0000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7aa4f30000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7aa4f40000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7aa4f50000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7aa4f60000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7aa4f70000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7aa4f80000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7ab53c0000 refcount=1
ERROR OFLAG_COPIED L2 cluster: l1_index=1007 l1_entry=7de4ee0000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7e04cf0000 refcount=1
ERROR OFLAG_COPIED L2 cluster: l1_index=1008 l1_entry=7e04f00000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7e05120000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7e05190000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7e05330000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7e07940000 refcount=1

66 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.
8388608/8388608 = 100.00% allocated, 0.00% fragmented, 0.00% compressed clusters
Image end offset: 549840879616
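
The listing above contains 36 refcount mismatches and 30 OFLAG_COPIED errors (6 on L2 tables, 24 on data clusters), which adds up to the 66 reported. The two groups appear to describe the same clusters: shifting an `l1_entry`/`l2_entry` offset down by the cluster size gives a cluster number from the first group. The sketch below tallies a few lines copied from the output and does that arithmetic. It assumes the qcow2 default cluster size of 64 KiB, which is not stated in the thread, and `tally` is a name made up for this example.

```python
import re
from collections import Counter

# Assumption: the image uses the qcow2 default cluster size of 64 KiB.
CLUSTER_SIZE = 64 * 1024
# Bits 9-55 of an L1/L2 entry hold the host offset (qcow2 format specification).
OFFSET_MASK = 0x00FFFFFFFFFFFE00

REFCOUNT_RE = re.compile(r"ERROR cluster (\d+) refcount=(\d+) reference=(\d+)")
COPIED_RE = re.compile(
    r"ERROR OFLAG_COPIED (L2|data) cluster: .*?l[12]_entry=([0-9a-f]+) refcount=\d+"
)


def tally(text):
    """Count error lines by kind and collect the cluster indexes they name."""
    counts = Counter()
    refcount_clusters = set()
    copied_clusters = set()
    for line in text.splitlines():
        line = line.strip()
        m = REFCOUNT_RE.fullmatch(line)
        if m:
            counts["refcount mismatch"] += 1
            refcount_clusters.add(int(m.group(1)))
            continue
        m = COPIED_RE.fullmatch(line)
        if m:
            counts["OFLAG_COPIED " + m.group(1)] += 1
            # Convert the hexadecimal host offset into a cluster index.
            offset = int(m.group(2), 16) & OFFSET_MASK
            copied_clusters.add(offset // CLUSTER_SIZE)
    return counts, refcount_clusters, copied_clusters


# Six lines copied from the check output above.
SAMPLE = """\
ERROR cluster 7988422 refcount=1 reference=2
ERROR cluster 7996583 refcount=1 reference=2
ERROR cluster 7996616 refcount=1 reference=2
ERROR OFLAG_COPIED L2 cluster: l1_index=975 l1_entry=79e4c60000 refcount=1
ERROR OFLAG_COPIED data cluster: l2_entry=7a04a70000 refcount=1
ERROR OFLAG_COPIED L2 cluster: l1_index=976 l1_entry=7a04c80000 refcount=1
"""

counts, refcount_clusters, copied_clusters = tally(SAMPLE)
print(dict(counts))
print(sorted(copied_clusters))
print(copied_clusters == refcount_clusters)
```

For these sampled lines the two sets are identical, e.g. `0x79e4c60000 / 65536 = 7988422`, which supports the 64 KiB assumption. Each affected cluster is referenced twice while its refcount says once.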
 
A quick question. Do you have the qemu guest agent installed in your VMs?

No, it wasn't. I added it just now in one of the test VMs (CentOS 7, virtio disk). Installed on the guest, shut down, enabled via Proxmox gui, restarted. Verified that qemu-guest-agent was running on the guest. Took snapshot, no improvement:

Per qemu-img check:

9275 leaked clusters
27822 corruptions
 
That's the plan for new VMs, but we have a bunch of existing VMs configured for virtio that need snapshots when we do updates. I tried changing the disk type on one of those to scsi but it wouldn't boot. If there's an easy way to fix that, we can go back and change the older VMs from virtio to scsi and not worry.

EDIT - I was forgetting to set the boot order after removing and re-adding the disk using scsi. So it looks like just switching to scsi will do the trick and we have a workaround.
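
To make the two steps concrete (re-attach the disk as scsi, then point the boot disk at it), here is an illustrative sketch that applies both to the text of a VM config. It is an assumption-laden example and not a recommendation to edit configs by hand: the storage name, VM ID and snapshot name are hypothetical, `switch_virtio_to_scsi` is a made-up helper, and it assumes the Proxmox 4.x layout with `virtioN:` disk lines, a `bootdisk:` line and `[snapshot]` sections. In the thread the change was made through the GUI.

```python
def switch_virtio_to_scsi(config_text: str, index: int = 0) -> str:
    """Rename virtioN to scsiN and update bootdisk in the current config only."""
    old_key, new_key = f"virtio{index}", f"scsi{index}"
    out = []
    in_snapshot = False
    for line in config_text.splitlines():
        # Snapshot sections start with "[name]"; leave everything after the
        # first one untouched so saved snapshot configs stay as they were.
        if line.startswith("["):
            in_snapshot = True
        if not in_snapshot:
            if line.startswith(old_key + ":"):
                line = new_key + line[len(old_key):]
            elif line.startswith("bootdisk:") and line.split(":", 1)[1].strip() == old_key:
                # Forgetting this step is what left the VM unbootable above.
                line = f"bootdisk: {new_key}"
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical config text, for illustration only.
SAMPLE_CONF = """\
bootdisk: virtio0
memory: 4096
virtio0: nfs-store:101/vm-101-disk-1.qcow2,size=500G

[before-update]
bootdisk: virtio0
virtio0: nfs-store:101/vm-101-disk-1.qcow2,size=500G
"""

print(switch_virtio_to_scsi(SAMPLE_CONF), end="")
```

The sketch does not check whether a `scsi0` entry already exists, and it says nothing about whether the guest has the needed SCSI driver available at boot; both would need checking on a real system.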

I'm intrigued that this problem, which started for us when we migrated the VMs from a Proxmox 3.x cluster to a 4.x one, isn't being noticed by more people. Maybe the answer is that most people use the default (scsi) and those who've chosen virtio don't often make snapshots.
 
I had the impression that virtio was recommended if the VM's OS supports it, so we've been using virtio for CentOS/Ubuntu/Debian VMs until now. I haven't measured the speed with scsi vs virtio, but after going back and forth with some test VMs over the past few days I have the impression that there isn't a significant difference. There's a far more noticeable impact using .qcow2 vs. .raw, and speed of the NFS server's disks makes a big difference too.

On principle I'm concerned that the qemu-corruption-when-creating-snapshot bug is there in Prox 4 vs 3, but we do have a workaround in changing to scsi, so we're good. Thanks for all your suggestions - the discussion helped me think it through.
 
There are several benefits to using virtio-scsi instead of virtio (i.e. virtio-blk):
  • virtio development has stopped and all efforts now go into virtio-scsi
  • performance is more or less on par - at most 1-2% better with virtio
  • virtio-scsi passes unmap (trim, discard) through to the disk controller, so freed space can be reclaimed
  • virtio-scsi is the default for disks in Proxmox 4
  • virtio vs virtio-scsi is more or less like comparing SATA with SAS: virtio-scsi is an intelligent controller