[TUTORIAL] Understanding QCOW2 Risks with QEMU cache=none in Proxmox

bbgeek17

Hey everyone,

A few recent developments prompted us to examine QCOW2’s behavior and reliability characteristics more closely:

1. Community feedback
  • There are various community discussions questioning the reliability of QCOW2. We have customers (predating our native integration) interested in using QCOW on LVM.
2. Integrity testing failures with QCOW/LVM snapshots
  • When we ran our data integrity tests against the tech preview of QCOW2/LVM snapshots, we observed consistent failures starting immediately after the first snapshot was taken.
3. Confusing Documentation
  • The existing resources documenting the behavior and semantics of the various cache modes lack clarity.

After extensive lab testing, we now have a clear understanding of QEMU and QCOW2 behavior, as well as the inherent risks.

Lessons Learned:

Compared to physical storage devices, QCOW2 exhibits unusual write semantics due to delayed metadata updates.

The integrity issues with LVM snapshots arose from a common misconception that cache=none disables write caching entirely. In reality, this assumption only holds for RAW disks. QEMU/QCOW2 defers metadata updates and maintains cached metadata structures that remain volatile for much longer than expected, even across a guest reboot!
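
For reference, here is how the shorthand cache modes decompose into QEMU's underlying block-layer flags, per the cache-mode table in the QEMU documentation. This is an illustrative -drive spelling with an example file name, not the exact command line Proxmox generates:
Code:
# "cache=none" means O_DIRECT on the host device, but writeback caching stays
# enabled in the QEMU block layer, so QCOW2 metadata can sit in QEMU's memory
# until the guest issues a flush:
-drive file=vm-100-disk-0.qcow2,format=qcow2,cache.writeback=on,cache.direct=on,cache.no-flush=off

# "cache=directsync" additionally turns writeback off, so a write completes
# only after it (and any metadata it requires) reaches stable storage:
-drive file=vm-100-disk-0.qcow2,format=qcow2,cache.writeback=off,cache.direct=on,cache.no-flush=off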

Subcluster allocation in the new snapshot chain feature ("Volume as Snapshot Chains") significantly increases metadata churn. It amplifies the risk of torn writes and data inconsistency after power loss or unplanned guest termination.
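
As a point of reference for the metadata churn: the overlay created for a volume-chain snapshot is formatted with 128 KiB clusters and extended_l2=on, which subdivides each cluster into 32 independently tracked subclusters (see the "Formatting" line in the snapshot output further down). A hand-rolled qemu-img equivalent would look roughly like the sketch below; the file names are only examples:
Code:
# Illustrative only: create a qcow2 overlay comparable to the one shown in the
# snapshot log (128 KiB clusters, extended_l2 subclusters, metadata preallocation)
qemu-img create -f qcow2 \
  -o cluster_size=128k,extended_l2=on,preallocation=metadata \
  -b snap_vm-100-disk-0_snapshot1.qcow2 -F qcow2 \
  vm-100-disk-0.qcow2 3G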

Technical Report:

We've published a technical article summarizing what we've learned, including a reproducible experiment that demonstrates the semantics leading to corruption on power loss:
Please feel free to ask questions, and we'll do our best to answer. If you spot a gap in our understanding, let us know.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
We have received several requests to provide a procedure to reproduce our results. Here are the steps you can use in your lab:

  • Create a Linux VM (we will use Alpine for a smaller footprint)
Code:
qm create 100 --name vm100 --memory 256 --sockets 1 --boot c --bootdisk scsi0 \
  --onboot no --scsihw virtio-scsi-single --net0 virtio,bridge=vmbr0,firewall=1 --ide2 local-lvm:cloudinit \
  --agent enabled=1 --sshkeys /root/.ssh/authorized_keys --serial0 socket --vga serial0 \
  --cicustom meta=local:snippets/alpine.metadata.100.701528,user=local:snippets/alpine.userdata.100.701528 \
  --ipconfig0 ip=dhcp \
  --scsi0 local-lvm:0,aio=native,iothread=1,import-from=/mnt/pve/nfs/template/iso/nocloud_alpine-3.22.0-x86_64-bios-cloudinit-r0.qcow2
  • Create a 3GB "data" disk on an LVM storage pool with "Allow Snapshots as Volume-Chain" enabled. Assign the disk to the VM:
Code:
pvesm alloc testvg 100 '' 3G
qm disk rescan --vmid 100
qm set 100 --scsi1 testvg:vm-100-disk-0.qcow2,cache=none
Note that "cache=none" is the default and will not be visible if you examine the VM config later; a quick way to verify the effective cache flags is shown below.
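If you want to double-check what the running QEMU actually received, two handy options (exact output formatting varies with the QEMU version) are the generated command line and the human monitor:
Code:
# Print the KVM command line Proxmox generates for VM 100
qm showcmd 100 --pretty

# With the VM running, inspect the block layer via the human monitor:
qm monitor 100
# at the monitor prompt, run:
info block
# for the drive backed by the qcow2 volume, a cache mode of
# "writeback, direct" corresponds to cache=none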
  • Boot the VM and record the disk names. In our case, the "data" disk is /dev/sdb
  • Fill the data disk with ones: doas badblocks -w -b 4096 -p 1 -t 0x11111111 /dev/sdb
  • Examine the disk's data: doas hexdump -C -n $((1024*18024)) /dev/sdb
Code:
doas hexdump -C -n $((1024*18024)) /dev/sdb
00000000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
*
0119a000
  • Simulate hardware or power failure by killing the process: pkill -9 -f 'kvm -id 100'
  • Boot the VM and confirm that the disk's data is as expected (still all ones): doas hexdump -C -n $((1024*18024)) /dev/sdb
  • Initiate a VM snapshot: qm snapshot 100 snapshot1
Code:
snapshotting 'drive-scsi0' (local-lvm:vm-100-disk-0)
Logical volume "snap_vm-100-disk-0_snapshot1" created.
snapshotting 'drive-scsi1' (testvg:vm-100-disk-0.qcow2)
external qemu snapshot
Creating a new current volume with snapshot1 as backing snap
Renamed "vm-100-disk-0.qcow2" to "snap_vm-100-disk-0_snapshot1.qcow2" in volume group "testvg"
Rounding up size to full physical extent 3.00 GiB
Logical volume "vm-100-disk-0.qcow2" created.
Formatting '/dev/testvg/vm-100-disk-0.qcow2', fmt=qcow2 cluster_size=131072 extended_l2=on preallocation=metadata compression_type=zlib size=3221225472 backing_file=snap_vm-100-disk-0_snapshot1.qcow2 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
blockdev replace current by snapshot1
blockdev-snapshot: reopen current with snapshot1 backing image

  • Write a new, recognizable pattern to the disk: doas badblocks -w -b 4096 -p 1 -t 0xDEADBEEF /dev/sdb
  • Examine the disk to confirm you can read the expected data: doas hexdump -C -n $((1024*18024)) /dev/sdb
Code:
00000000  de ad be ef de ad be ef  de ad be ef de ad be ef  |................|
*
0119a000
  • Kill the VM: pkill -9 -f 'kvm -id 100'
  • After restarting the VM, examine the data and observe that it has reverted to the pre-write state and all of the new data has been lost (a host-side check you can run afterwards is sketched after the output):
Code:
doas hexdump -C -n $((1024*18024)) /dev/sdb
00000000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
*
0119a000
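
If you want to dig a bit deeper after the failed run, you can inspect the overlay from the host with standard qemu-img tooling. A quick sketch, assuming the VM is stopped and the volume-chain overlay is the LV shown in the snapshot log (/dev/testvg/vm-100-disk-0.qcow2):
Code:
# Check the overlay's metadata consistency
qemu-img check -f qcow2 /dev/testvg/vm-100-disk-0.qcow2

# Show which ranges the overlay has actually allocated; the DEADBEEF writes
# are absent after the crash, consistent with the deferred/volatile metadata
# behavior described above.
qemu-img map -f qcow2 --output=human /dev/testvg/vm-100-disk-0.qcow2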



Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox