I have found that creating a Snapshot of a VM with a qcow2 disk can corrupt the guest filesystem when the VM is doing heavy disk writes. Everything works fine when there are no disk writes.
The error is repeatable and not a one-off.
Although I have no intention of using qcow2 in production, the idea that running a Snapshot could destroy a VM's disk is horrifying to me. Completely horrifying. Especially since, in another thread, people mention that merely running a backup onto a shared disk can corrupt the filesystem.
Backups and snapshots must be bullet-proof. 100% bullet-proof. The idea that doing something that should protect my data may instead actually destroy it is giving me sleepless nights.
So, I'm posting this here in the hope that someone will reassure me that this is very unusual and not something I'm likely to see in real life. I need to have 100% faith in a product or a technology in order to use it.
The qcow2 disk I'm testing is stored locally in a Directory on an ext4 filesystem on an LVM logical volume, which is not something I would normally use. Still, it COULD be something I might use in the future.
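For anyone wanting to reproduce the same layout, a Directory storage can be added from the host shell; the storage ID and mount path below are hypothetical, not my actual ones:
Code:
# Hypothetical storage ID and path - substitute your own ext4-on-LVM mount point
pvesm add dir qcow2-test --path /mnt/ext4-lvm --content images
# Confirm the new storage is active
pvesm status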
I'm using a CentOS 7 VM, built using the "Generic Cloud" CentOS 7 image from CentOS.org as a base.
The disk interface is VirtIO-SCSI, with no adjustments to the VM's fstab. The VM uses XFS - that's the default for CentOS 7 and is what the Generic Cloud image ships with.
The QEMU Guest Agent is installed. Both the VM and Proxmox (5.2-8) are 100% up to date with all released updates.
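As I understand it, with the agent enabled Proxmox asks the guest to fs-freeze before taking the Snapshot, so it seems relevant here. A quick sanity check would be (103 is my VM's ID):
Code:
# On the Proxmox host: the agent should answer
qm agent 103 ping
# Inside the guest: the agent service should be running
systemctl is-active qemu-guest-agent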
The physical disk itself is fine, but it is a slow 3 Gbit/s 500 GB SATA drive connected directly to the motherboard of an old (and also slow) Dell R200 - no RAID controller.
Creating a Snapshot of the VM when idle works fine.
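For reference, I'm taking the Snapshot from the GUI; the CLI equivalent would be something along these lines (the snapshot name is just an example):
Code:
# Take a snapshot of VM 103 (name is arbitrary)
qm snapshot 103 beforetest
# List existing snapshots / remove the test snapshot
qm listsnapshot 103
qm delsnapshot 103 beforetest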
But if I run bonnie++ then try creating a Snapshot, it goes horribly wrong:
Code:
[root@fc1 ~]# bonnie++ -d /tmp -s 4G -n 0 -m TEST -f -b -u centos
Using uid:1000, gid:1000.
Writing intelligently...Can't write block.: Input/output error
Can't write block 135241.
Can't sync file.
[root@fc1 ~]#
[root@fc1 ~]#
[root@fc1 ~]# ls -al
-bash: ls: command not found
[root@fc1 ~]#
In the Proxmox GUI, I get a window saying: VM 103 qmp command 'snapshot-drive' failed - Error while creating snapshot on 'drive-scsi0'
At this point the disk appears to be trashed. No files can be found via the CLI, and it won't boot. I have not tried to mount the qcow2 disk via nbd to see what's left in it, if anything.
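If anyone thinks it's worth a post-mortem, this is roughly how I'd go about inspecting the image; the disk path is a guess at the usual Directory-storage layout, so adjust as needed:
Code:
# On the Proxmox host, with VM 103 stopped. Path is hypothetical.
DISK=/mnt/ext4-lvm/images/103/vm-103-disk-1.qcow2
# Check the qcow2 container itself for corruption
qemu-img check "$DISK"
# Expose the image as a block device, read-only
modprobe nbd max_part=8
qemu-nbd --read-only --connect=/dev/nbd0 "$DISK"
fdisk -l /dev/nbd0
# XFS with a dirty log needs norecovery for a read-only mount
mkdir -p /mnt/inspect
mount -o ro,norecovery /dev/nbd0p1 /mnt/inspect
# ...look around, then clean up
umount /mnt/inspect
qemu-nbd --disconnect /dev/nbd0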
I should point out that running bonnie++ without trying to do a Snapshot at the same time works just fine.
Also, when I repeat the test with a .raw disk on lvm-thin, I get no errors and no issues.
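For anyone repeating that comparison, the disk can be moved between storages from the CLI; 'local-lvm' below is the default lvm-thin storage name on a stock install, which may differ:
Code:
# Move VM 103's disk to lvm-thin (stored as raw automatically on that storage)
qm move_disk 103 scsi0 local-lvm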
I see that similar qcow2 corruption under heavy load was reported previously here: https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865 - but that was with the old VirtIO driver and a much older Proxmox.
Since my test is somewhat artificial (quite old hardware, qcow2, bonnie++, etc.), I'm not sure there is any point in digging deeper. But if it would help anybody, I'm willing to do so if someone can guide me on exactly what you'd like me to do - I am in unfamiliar territory here.