Hello,
We are experiencing massive issues with file corruption on QEMU VMs. It seems to have started about a month ago, which might coincide with our last PVE upgrade.
The corrupted files can be detected via
Code:
debsums -c
on the Debian guests; also,
Code:
qemu-img check <image.qcow2>
shows massive corruption.
On the guest, I see files with the correct size and timestamp that contain only NUL bytes. Some are only partially nulled; I found one whose first 4096 bytes are NUL, followed by the normal content from that offset onward. Mostly, these files seem to have been created shortly before a reboot (kernel modules or freshly installed software).
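For anyone who wants to look for the same pattern on their own guests, here is a rough sketch of how such files could be found. It assumes GNU cmp, stat and find (as on Debian). The names classify_nul, scan_dir and BLOCK are made up for illustration, and the 4096-byte block size is simply taken from the one file described above.

```shell
#!/bin/sh
# Sketch: classify a file as "all-nul", "leading-nul" or "ok".
# Assumes GNU cmp/stat/find. classify_nul, scan_dir and BLOCK are
# illustrative names, not part of any existing tool.
BLOCK=4096

classify_nul() {
    f=$1
    size=$(stat -c %s -- "$f") || return 1
    if [ "$size" -eq 0 ]; then
        echo ok            # empty files are not counted as corrupted
    elif cmp -s -n "$size" -- "$f" /dev/zero; then
        echo all-nul       # every byte of the file is NUL
    elif [ "$size" -gt "$BLOCK" ] && cmp -s -n "$BLOCK" -- "$f" /dev/zero; then
        echo leading-nul   # first BLOCK bytes are NUL, some non-NUL data follows
    else
        echo ok
    fi
}

# Print every suspicious regular file below a directory, one per line,
# as "<class><TAB><path>". File names containing newlines are not handled.
scan_dir() {
    find "$1" -xdev -type f | while IFS= read -r f; do
        r=$(classify_nul "$f") || continue
        [ "$r" = ok ] || printf '%s\t%s\n' "$r" "$f"
    done
}
```

Something like `scan_dir /lib/modules` would then list candidates. Hits are only candidates: some files legitimately consist of NUL bytes, so the output should be cross-checked against debsums.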
At first we suspected that enabling discard on the images and running fstrim was the culprit, but that now looks like a coincidence in timing. We also suspected something wrong with the underlying GlusterFS, but it never showed anything like this in previous years and hasn't been updated for a while, so we ruled that out for now. These are very preliminary guesses about the cause, however; I'm posting them mostly to find parallels in case someone else is seeing something similar.
Sorry about the coarse posting; I wanted to get info out to corroborate or debunk my observations. Feel free to ask me about any detail you're interested in.
pveversion is pve-manager/5.4-13/aee6f0ec (running kernel: 4.15.18-23-pve), so atm it's up to date apart from libpve-common-perl/stable 5.0-56, which we haven't updated yet (we try to reduce power cycles as much as possible).
One bug report strikes me as possibly related:
https://bugs.launchpad.net/qemu/+bug/1846427
However, versions seem to differ (although I don't know if and how the versions correlate):
Code:
~# dpkg -l | grep qemu
ii pve-qemu-kvm 3.0.1-4 amd64 Full virtualization on x86 hardware
ii qemu-server 5.0-54 amd64 Qemu Server Tools
That's about all I got so far...