[TUTORIAL] Understanding QCOW2 Risks with QEMU cache=none in Proxmox

bbgeek17

Hey everyone,

A few recent developments prompted us to examine QCOW2’s behavior and reliability characteristics more closely:

1. Community feedback
  • There are various community discussions questioning the reliability of QCOW2. We have customers (predating our native integration) interested in using QCOW on LVM.
2. Integrity testing failures with QCOW/LVM snapshots
  • When we ran our data integrity tests against the tech preview of QCOW2/LVM snapshots, we observed consistent failures starting immediately after the first snapshot was taken.
3. Confusing Documentation
  • The existing resources documenting the behavior and semantics of the various cache modes lack clarity.

After extensive lab testing, we now have a clear understanding of QEMU and QCOW2 behavior, as well as the inherent risks.

Lessons Learned:

Compared to physical storage devices, QCOW2 exhibits unusual write semantics due to delayed metadata updates.

The integrity issues with LVM snapshots arose from a common misconception that cache=none disables write caching entirely. In reality, this assumption only holds for RAW disks. QEMU/QCOW2 defers metadata updates and maintains cached metadata structures that remain volatile for much longer than expected, even across a guest reboot!
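
For reference, here is how the shorthand cache modes decompose into QEMU's underlying block-layer flags, per the cache-mode table in the QEMU documentation. This is an illustrative -drive spelling with an example file name, not the exact command line Proxmox generates:
Code:
# "cache=none" means O_DIRECT on the host device, but writeback caching stays
# enabled in the QEMU block layer, so QCOW2 metadata can sit in QEMU's memory
# until the guest issues a flush:
-drive file=vm-100-disk-0.qcow2,format=qcow2,cache.writeback=on,cache.direct=on,cache.no-flush=off

# "cache=directsync" additionally turns writeback off, so a write completes
# only after it (and any metadata it requires) reaches stable storage:
-drive file=vm-100-disk-0.qcow2,format=qcow2,cache.writeback=off,cache.direct=on,cache.no-flush=off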

Subcluster allocation in the new snapshot chain feature ("Volume as Snapshot Chains") significantly increases metadata churn. It amplifies the risk of torn writes and data inconsistency after power loss or unplanned guest termination.
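
As a point of reference for the metadata churn: the overlay created for a volume-chain snapshot is formatted with 128 KiB clusters and extended_l2=on, which subdivides each cluster into 32 independently tracked subclusters (see the "Formatting" line in the snapshot output further down). A hand-rolled qemu-img equivalent would look roughly like the sketch below; the file names are only examples:
Code:
# Illustrative only: create a qcow2 overlay comparable to the one shown in the
# snapshot log (128 KiB clusters, extended_l2 subclusters, metadata preallocation)
qemu-img create -f qcow2 \
  -o cluster_size=128k,extended_l2=on,preallocation=metadata \
  -b snap_vm-100-disk-0_snapshot1.qcow2 -F qcow2 \
  vm-100-disk-0.qcow2 3G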

Technical Report:

We've published a technical article summarizing what we've learned, including a reproducible experiment that demonstrates the semantics leading to corruption on power loss:
Please feel free to ask questions, and we'll do our best to answer. If you spot a gap in our understanding, let us know.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
We have received several requests to provide a procedure to reproduce our results. Here are the steps you can use in your lab:

  • Create a Linux VM (we will use Alpine for a smaller footprint)
Code:
qm create 100 --name vm100 --memory 256 --sockets 1 --boot c --bootdisk scsi0 \
  --onboot no --scsihw virtio-scsi-single --net0 virtio,bridge=vmbr0,firewall=1 --ide2 local-lvm:cloudinit \
  --agent enabled=1 --sshkeys /root/.ssh/authorized_keys --serial0 socket --vga serial0 \
  --cicustom meta=local:snippets/alpine.metadata.100.701528,user=local:snippets/alpine.userdata.100.701528 \
  --ipconfig0 ip=dhcp \
  --scsi0 local-lvm:0,aio=native,iothread=1,import-from=/mnt/pve/nfs/template/iso/nocloud_alpine-3.22.0-x86_64-bios-cloudinit-r0.qcow2
  • Create a 3GB "data" disk on an LVM storage pool with "Allow Snapshots as Volume-Chain" enabled. Assign the disk to the VM:
Code:
pvesm alloc testvg 100 '' 3G
qm disk rescan --vmid 100
qm set 100 --scsi1 testvg:vm-100-disk-0.qcow2,cache=none
Note that "cache=none" is the default and will not be visible if you examine the VM config later; a quick way to verify the effective cache flags is shown below.
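If you want to double-check what the running QEMU actually received, two handy options (exact output formatting varies with the QEMU version) are the generated command line and the human monitor:
Code:
# Print the KVM command line Proxmox generates for VM 100
qm showcmd 100 --pretty

# With the VM running, inspect the block layer via the human monitor:
qm monitor 100
# at the monitor prompt, run:
info block
# for the drive backed by the qcow2 volume, a cache mode of
# "writeback, direct" corresponds to cache=none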
  • Boot the VM and record the disk names. In our case, the "data" disk is /dev/sdb
  • Fill the data disk with ones: doas badblocks -w -b 4096 -p 1 -t 0x11111111 /dev/sdb
  • Examine the disk's data: doas hexdump -C -n $((1024*18024)) /dev/sdb
Code:
doas hexdump -C -n $((1024*18024)) /dev/sdb
00000000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
*
0119a000
  • Simulate hardware or power failure by killing the process: pkill -9 -f 'kvm -id 100'
  • Boot the VM and confirm that the disk's data is as expected (still all ones): doas hexdump -C -n $((1024*18024)) /dev/sdb
  • Initiate a VM snapshot: qm snapshot 100 snapshot1
Code:
snapshotting 'drive-scsi0' (local-lvm:vm-100-disk-0)
Logical volume "snap_vm-100-disk-0_snapshot1" created.
snapshotting 'drive-scsi1' (testvg:vm-100-disk-0.qcow2)
external qemu snapshot
Creating a new current volume with snapshot1 as backing snap
Renamed "vm-100-disk-0.qcow2" to "snap_vm-100-disk-0_snapshot1.qcow2" in volume group "testvg"
Rounding up size to full physical extent 3.00 GiB
Logical volume "vm-100-disk-0.qcow2" created.
Formatting '/dev/testvg/vm-100-disk-0.qcow2', fmt=qcow2 cluster_size=131072 extended_l2=on preallocation=metadata compression_type=zlib size=3221225472 backing_file=snap_vm-100-disk-0_snapshot1.qcow2 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
blockdev replace current by snapshot1
blockdev-snapshot: reopen current with snapshot1 backing image

  • Write a new, recognizable pattern to the disk: doas badblocks -w -b 4096 -p 1 -t 0xDEADBEEF /dev/sdb
  • Examine the disk to confirm you can read the expected data: doas hexdump -C -n $((1024*18024)) /dev/sdb
Code:
00000000  de ad be ef de ad be ef  de ad be ef de ad be ef  |................|
*
0119a000
  • Kill the VM: pkill -9 -f 'kvm -id 100'
  • After restarting the VM, examine the data and observe that it has reverted to the pre-write state and all of the new data has been lost (a host-side check you can run afterwards is sketched after the output):
Code:
doas hexdump -C -n $((1024*18024)) /dev/sdb
00000000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
*
0119a000
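
If you want to dig a bit deeper after the failed run, you can inspect the overlay from the host with standard qemu-img tooling. A quick sketch, assuming the VM is stopped and the volume-chain overlay is the LV shown in the snapshot log (/dev/testvg/vm-100-disk-0.qcow2):
Code:
# Check the overlay's metadata consistency
qemu-img check -f qcow2 /dev/testvg/vm-100-disk-0.qcow2

# Show which ranges the overlay has actually allocated; the DEADBEEF writes
# are absent after the crash, consistent with the deferred/volatile metadata
# behavior described above.
qemu-img map -f qcow2 --output=human /dev/testvg/vm-100-disk-0.qcow2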



Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox