Hi everyone,
Small 'fun' question. It's a new one for me; I've been gently banging my head on it for the last few days.
I've got a Proxmox (ver 6.4.15) host at a client site (yes, I know, I need to upgrade it to 7.latest) which has been in service for a few years.
Generally it works smoothly (Dell 2U rackmount box, dual quad-core Xeon, Dell PERC hardware RAID, pretty standard hardware). I do have a bcache volume here, which is a bit atypical, but in my experience bcache has been solid.
There was a major storm here over the weekend, and power was out for ~48 hours. It looks like the Proxmox host failed to shut down gracefully. My guess is that one Windows VM didn't play nice when it was asked to shut down by the UPS / apcupsd daemon, which is supposed to do a graceful shutdown when the UPS battery gets low.
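For context, the apcupsd side of that is configured roughly along these lines (values here are illustrative, not what is actually on the box):
Code:
# /etc/apcupsd/apcupsd.conf -- illustrative values only, not the actual config on this box
UPSTYPE usb
# start a graceful shutdown when battery charge drops below 10%...
BATTERYLEVEL 10
# ...or when estimated runtime drops below 5 minutes
MINUTES 5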
Once power came back on Monday, we brought the Proxmox host back up:
-- at first it was grumpy about the bcache config; after a reboot it was OK and Proxmox booted
-- all of the VMs on the Proxmox host were OK except for one - the rest all started up without complaint
-- the one VM did not start gracefully. It is a KVM virtual machine running ClearOS (a Linux NAS-appliance sort of distro). It has a stock ClearOS (RHEL/CentOS under the hood) install, which means LVM and an XFS filesystem inside the VM. Whee.
-- first I had to run an XFS repair inside the VM (roughly the command shown in the sketch just after the error output below), which took ~15 min, made 'lots' of corrections it identified, then finished and claimed it was done
-- powered the VM off, then back on
-- then I got an error at the Proxmox level, reading approximately thus:
Code:
kvm: -drive file=/bcache-tank/images/107/vm-107-disk-0.qcow2,if=none,id=drive-scsi0,format=qcow2,cache=none,aio=native,detect-zeroes=on: qcow2: Image is corrupt; cannot be opened read/write
TASK ERROR: start failed: QEMU exited with code 1
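For reference, the XFS repair step mentioned in the list above was roughly this, run inside the VM with the filesystem unmounted (the LV path here is a placeholder, not the exact one I used):
Code:
# inside the ClearOS VM, with the affected filesystem unmounted (path is a placeholder)
umount /dev/mapper/clearos-data
xfs_repair /dev/mapper/clearos-data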
So I attempted a check-and-fix on the QCOW file, basically:
Code:
qemu-img check -r all vm-107-disk-0.qcow2
This takes about 5 minutes, and then it spits out a happy-seeming message:
Code:
root@proxmox:/bcache-tank/images/107# qemu-img check -r all vm-107-disk-0.qcow2
No errors were found on the image.
67108864/67108864 = 100.00% allocated, 0.00% fragmented, 0.00% compressed clusters
Image end offset: 4398717861888
root@proxmox:/bcache-tank/images/107#
So then I restart the VM from the Proxmox WebUI. And bouf, it boots up normally; there is no complaint from Proxmox saying the QCOW is sad.
This was on Monday. Then the VM was acting weird this morning, so I had to shut it down and start it up again.
It refused to respond to a shutdown, so I had to do a 'stop'. The VM's console was not responsive; the VM appeared to be super busy doing IO/reads and was ignoring me.
It stopped promptly.
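For the record, I did this from the WebUI, but it amounts to roughly this on the CLI (107 being the VMID from the error above):
Code:
qm shutdown 107   # graceful shutdown request via the guest
qm stop 107       # hard stop, roughly equivalent to pulling the power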
Then I try to start it, and again Proxmox throws the error about the bad QCOW file.
Then I do the QCOW repair scan via the CLI on Proxmox,
and it again reports that all is well, 100% OK, no errors found,
and then I start the VM; it boots up happily with no errors.
So.
I am a bit puzzled:
why does Proxmox complain the QCOW is bad, but then when I do a repair scan it 'blesses' the image, and after that Proxmox will talk to the QCOW again?
It seems inconsistent, but the pattern of behavior is there.
I am a bit baffled.
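My only guess so far (unconfirmed) is that QEMU is setting the 'corrupt' flag in the qcow2 header when it hits trouble at runtime, and the 'check -r all' pass clears that flag even when it finds nothing else wrong. Next time it happens I plan to look at the header before running the repair, with something like:
Code:
qemu-img info /bcache-tank/images/107/vm-107-disk-0.qcow2
# look for "corrupt: true" under "Format specific information"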
The Plan-B scenario is that I can restore from the last good backup - I have a Proxmox Backup Server linked here, and it has a backup from Thursday night.
There is just some modest drama because I don't have enough storage to spin up two copies of the VM at once, as it has a pretty huge disk,
so that likely means I have to spin up a staging Proxmox host, juggle VMs around, etc.
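The restore side of that plan would be roughly this on the staging host (everything in angle brackets is a placeholder):
Code:
# restore the Thursday backup from PBS to a fresh VMID on the staging host (placeholders throughout)
qmrestore <backup-volume-id> <new-vmid> --storage <target-storage>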
If there is a graceful way to resolve this that doesn't involve all that, it would be nice,
but at the end of the day I can't have this VM crashing out randomly and rejecting the QCOW file until I pat it on the head / verify it / and only then will it boot.
So,
I'm wondering if anyone has ever seen this kind of fun thing,
or has any comments about QCOW corruption that isn't really corruption, or at least not so clearly apparent.
Thank you if you have managed to read this far!
Tim