Hi everyone,
Small 'fun' question. It's a new one for me; I've been gently banging my head on it for the last few days.
I've got a Proxmox (ver 6.4.15) host at a client site (yes, I know, I need to upgrade it to 7.latest) which has been in service for a few years.
Generally it works smoothly (Dell 2U rackmount box, dual quad-core Xeon, Dell PERC hardware RAID, pretty standard hardware). I do have a bcache volume here, which is a bit atypical, but in my experience bcache has been solid.
There was a major storm here over the weekend, and power was out for ~48 hours. It looks like the Proxmox host failed to shut down gracefully. My guess is that one Windows VM didn't play nice when it was asked to shut down by the UPS / apcupsd daemon, which is supposed to do a graceful shutdown when the UPS battery gets low.
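For context, the apcupsd side of that is configured roughly along these lines (values here are illustrative, not what is actually on the box):
Code:
# /etc/apcupsd/apcupsd.conf -- illustrative values only, not the actual config on this box
UPSTYPE usb
# start a graceful shutdown when battery charge drops below 10%...
BATTERYLEVEL 10
# ...or when estimated runtime drops below 5 minutes
MINUTES 5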
Once power came back on Monday, we brought the Proxmox host back up:
-- at first it was grumpy about the bcache config; after a reboot it was OK and Proxmox booted
-- all of the VMs on the Proxmox host were OK except for one - the rest all started up without complaint
-- the one VM did not start gracefully. It is a KVM virtual machine running ClearOS (a Linux NAS-appliance sort of distro). It has a stock ClearOS (RHEL/CentOS under the hood) install, which means LVM and an XFS filesystem inside the VM. Whee.
-- first I had to run an XFS repair inside the VM (roughly the command shown in the sketch just after the error output below), which took ~15 min, made 'lots' of corrections it identified, then finished and claimed it was done
-- powered the VM off, then back on
-- then I got an error at the Proxmox level, reading approximately thus:
Code:
kvm: -drive file=/bcache-tank/images/107/vm-107-disk-0.qcow2,if=none,id=drive-scsi0,format=qcow2,cache=none,aio=native,detect-zeroes=on: qcow2: Image is corrupt; cannot be opened read/write
TASK ERROR: start failed: QEMU exited with code 1
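For reference, the XFS repair step mentioned in the list above was roughly this, run inside the VM with the filesystem unmounted (the LV path here is a placeholder, not the exact one I used):
Code:
# inside the ClearOS VM, with the affected filesystem unmounted (path is a placeholder)
umount /dev/mapper/clearos-data
xfs_repair /dev/mapper/clearos-data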
So I attempted a check-and-fix on the QCOW file, basically:
Code:
qemu-img check -r all vm-107-disk-0.qcow2
This takes about 5 minutes, and then it spits out a happy-seeming message:
Code:
root@proxmox:/bcache-tank/images/107# qemu-img check -r all vm-107-disk-0.qcow2
No errors were found on the image.
67108864/67108864 = 100.00% allocated, 0.00% fragmented, 0.00% compressed clusters
Image end offset: 4398717861888
root@proxmox:/bcache-tank/images/107#
So then I restart the VM from the Proxmox WebUI. And bouf, it boots up normally; there is no complaint from Proxmox saying the QCOW is sad.
This was on Monday. Then the VM was acting weird this morning, so I had to shut it down and start it up again.
It refused to respond to a shutdown, so I had to do a 'stop'. The VM's console was not responsive; the VM appeared to be super busy doing IO/reads and was ignoring me.
It stopped promptly.
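For the record, I did this from the WebUI, but it amounts to roughly this on the CLI (107 being the VMID from the error above):
Code:
qm shutdown 107   # graceful shutdown request via the guest
qm stop 107       # hard stop, roughly equivalent to pulling the power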
Then I try to start it, and again Proxmox throws the error about the bad QCOW file.
Then I do the QCOW repair scan via the CLI on Proxmox,
and it again reports that all is well, 100% OK, no errors found,
and then I start the VM; it boots up happily with no errors.
So.
I am a bit puzzled:
why does Proxmox complain the QCOW is bad, but then when I do a repair scan it 'blesses' the image, and after that Proxmox will talk to the QCOW again?
It seems inconsistent, but the pattern of behavior is there.
I am a bit baffled.
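My only guess so far (unconfirmed) is that QEMU is setting the 'corrupt' flag in the qcow2 header when it hits trouble at runtime, and the 'check -r all' pass clears that flag even when it finds nothing else wrong. Next time it happens I plan to look at the header before running the repair, with something like:
Code:
qemu-img info /bcache-tank/images/107/vm-107-disk-0.qcow2
# look for "corrupt: true" under "Format specific information"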
The Plan-B scenario is that I can restore from the last good backup - I have a Proxmox Backup Server linked here, and it has a backup from Thursday night.
There is just some modest drama because I don't have enough storage to spin up two copies of the VM at once, as it has a pretty huge disk,
so that likely means I have to spin up a staging Proxmox host, juggle VMs around, etc.
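The restore side of that plan would be roughly this on the staging host (everything in angle brackets is a placeholder):
Code:
# restore the Thursday backup from PBS to a fresh VMID on the staging host (placeholders throughout)
qmrestore <backup-volume-id> <new-vmid> --storage <target-storage>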
If there is a graceful way to resolve this that doesn't involve all that, it would be nice,
but at the end of the day I can't have this VM crashing out randomly and rejecting the QCOW file until I pat it on the head / verify it / and only then will it boot.
So,
I'm wondering if anyone has ever seen this kind of fun thing,
or has any comments about QCOW corruption that isn't really corruption, or at least not so clearly apparent.
Thank you if you have managed to read this far!
Tim