file corruption on qcow vmdisks

Nemo

Member
Nov 4, 2011
Hello,

We are experiencing massive issues with file corruption on QEMU VMs. It seems to have started about one month ago, which might correspond with our last PVE upgrade.
The corrupted files can be detected via
Code:
debsums -c
on the Debian guests. Also,
Code:
qemu-img check <image.qcow2>
shows massive corruption.
On the guest, I see files with the correct size and timestamp, but containing only NUL bytes. Some are also only partially nulled; I found one containing 4096 NUL bytes from the start, followed by the normal content from that offset onward. Mostly, these files seem to have been created shortly before a reboot (kernel modules or freshly installed software).
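To flag files that consist entirely of NUL bytes, something like the following sketch can help; the function name `find_nulled` is made up, and this is just how I'd scan a guest, not a tool from this thread:

```shell
# Hypothetical helper: flag files whose entire content is NUL bytes by
# comparing each file against an equal-length slice of /dev/zero.
find_nulled() {
    # $1: directory to scan
    find "$1" -type f -size +0c | while read -r f; do
        size=$(stat -c %s "$f")
        # cmp -s -n limits the (silent) comparison to the file's length;
        # /dev/zero supplies an endless stream of NUL bytes.
        if cmp -s -n "$size" "$f" /dev/zero; then
            echo "all-NUL: $f"
        fi
    done
}
```

Partially nulled files (like the 4096-byte case above) won't be caught by this; for those, debsums -c against the package database is still the more reliable detector.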
At first we suspected that activating discard on the images and running fstrim was the culprit, but now that looks like a temporal coincidence. We also suspected something wrong with the underlying GlusterFS, but it never showed anything like this in the years before and hadn't been updated for a while, so we have ruled that out for now. These are very preliminary guesses about the cause, however; I post them mostly to find analogies in case someone else is experiencing phenomena like this.

Sorry about the coarse posting; I wanted to get the information out to corroborate or debunk my observations. Feel free to ask about any detail you're interested in.

pveversion is pve-manager/5.4-13/aee6f0ec (running kernel: 4.15.18-23-pve), so at the moment it's fresh apart from libpve-common-perl/stable 5.0-56, which we haven't updated yet (we try to reduce power cycles as much as possible).

One bug report strikes me as possibly related:
https://bugs.launchpad.net/qemu/+bug/1846427

However, the versions seem to differ (although I don't know whether and how they correlate):

Code:
~# dpkg -l | grep qemu
ii  pve-qemu-kvm                         3.0.1-4                        amd64        Full virtualization on x86 hardware
ii  qemu-server                          5.0-54                         amd64        Qemu Server Tools

That's about all I got so far...
 
What's the underlying filesystem? And was the image pre-allocated? Can you please post a pveversion -v? Is there anything visible in the logs?
 
I reproduced the issue on glusterfs and local storage with ext4. Nothing in the logs.
~# pveversion -v:
proxmox-ve: 5.4-2 (running kernel: 4.15.18-23-pve)
pve-manager: 5.4-13 (running version: 5.4-13/aee6f0ec)
pve-kernel-4.15: 5.4-11
pve-kernel-4.15.18-23-pve: 4.15.18-51
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-17-pve: 4.15.18-43
pve-kernel-4.15.18-16-pve: 4.15.18-41
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-12
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-56
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-14
libpve-storage-perl: 5.0-44
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-7
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-38
pve-container: 2.0-41
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-7
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-4
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-54
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3

Since the issue is also being handled in customer support (ticket 9253328), it's probably more economical to put it on hold here. Of course I'll follow up if something comes up.
For the time being, switching discard on and stopping/starting the VM (via the Proxmox interface) does not seem to cause corruption outright, but it greatly increases the risk, so I'd advise caution there.
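To see which of a VM's disks have discard enabled, one can grep the VM config; the VM ID 100 below is a placeholder:

```shell
# Placeholder VM ID; discard appears as a per-drive option in the config.
qm config 100 | grep discard
```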
 
As an update: we switched off discard on all guests. We could still provoke corruption by creating machines with discard on, but it became more difficult and rare. So it might depend on the load, or on the number of parallel VMs with discard on; hard to say. That said, we have started recreating every single VM on the cluster, since we also experienced file corruption on machines with discard switched off, often after apt upgrades, sometimes with no obvious trigger. We suspect that data might incidentally land on corrupted blocks in the qcow2 image. We also found corruption in images created via qemu-img convert from seemingly error-free (per qemu-img check) image files. This corruption only showed up via debsums -c.
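Since qemu-img check only validates qcow2 metadata, not payload data, a content-level comparison after a convert can catch this class of corruption. A sketch, with placeholder filenames:

```shell
# Convert, then compare the guest-visible contents of both images.
# 'qemu-img compare' reads the actual payload data, unlike 'qemu-img check',
# which only inspects the qcow2 metadata structures.
qemu-img convert -O qcow2 src.qcow2 dst.qcow2
qemu-img compare src.qcow2 dst.qcow2
```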
All in all, this is a worrying issue. We have meanwhile upgraded GlusterFS and Proxmox. I might try to provoke corruption again one day, but for the moment we are rather busy rescuing data and systems.
 
This corruption only showed via debsums -c.
This seems to be hardware-related. Are the firmware and BIOS of the hardware up to date?
 
BIOS versions are from 2017 and 2018, across five machines and two different sets of hardware. It would be improbable for all of them to show the same symptoms at the same time. But still something to keep in mind.
 
pveversion is pve-manager/5.4-13/aee6f0ec (running kernel: 4.15.18-23-pve), so at the moment it's fresh apart from libpve-common-perl/stable 5.0-56, which we haven't updated yet (we try to reduce power cycles as much as possible).
Please also upgrade, as PVE 5 will be EoL around July. And it will bring a newer pve-qemu-kvm package. ;)
https://pve.proxmox.com/wiki/FAQ
 
PVE is 6.1.7 meanwhile, and GlusterFS 7.1. We still have images showing fresh corruption, but my belief at the moment is that it only shows up whenever one of the zeroed blocks is written to, so these are not really new errors. Right now we're much too busy keeping operations running to do further tests, but I might find time for that somewhat later.
 
PVE is 6.1.7 meanwhile, and GlusterFS 7.1. We still have images showing fresh corruption, but my belief at the moment is that it only shows up whenever one of the zeroed blocks is written to, so these are not really new errors. Right now we're much too busy keeping operations running to do further tests, but I might find time for that somewhat later.
This is with discard on or off?
 
PVE is 6.1.7 meanwhile, and GlusterFS 7.1. We still have images showing fresh corruption, but my belief at the moment is that it only shows up whenever one of the zeroed blocks is written to, so these are not really new errors. Right now we're much too busy keeping operations running to do further tests, but I might find time for that somewhat later.

Hello Nemo,

We have the same issues; at the moment it seems that VM linked clones are affected. We have NVMe storage under GlusterFS, but on HDDs we see no such issues.

After some tests: with Gluster 5.5 on Debian Buster, on XFS with GlusterFS defaults, files inside qcow2 images get corrupted.
First solution: use RAW images.
Second solution: turn off the performance.write-behind option on the affected volume.
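In Gluster CLI terms, the second workaround looks like this (the volume name is a placeholder):

```shell
# Disable the write-behind translator on the affected volume
gluster volume set <VOLNAME> performance.write-behind off
# Confirm the option is now off
gluster volume get <VOLNAME> performance.write-behind
```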

Best regards,
Peter
 