[TUTORIAL] Understanding QCOW2 Risks with QEMU cache=none in Proxmox

bbgeek17

Hey everyone,

A few recent developments prompted us to examine QCOW2’s behavior and reliability characteristics more closely:

1. Community feedback
  • There are various community discussions questioning the reliability of QCOW2. We have customers (predating our native integration) interested in using QCOW on LVM.
2. Integrity testing failures with QCOW/LVM snapshots
  • When we ran our data integrity tests against the tech preview of QCOW2/LVM snapshots, we observed consistent failures starting immediately after the first snapshot was taken.
3. Confusing Documentation
  • The existing resources documenting the behavior and semantics of the various cache modes lack clarity.

After extensive lab testing, we now have a clear understanding of QEMU and QCOW2 behavior, as well as the inherent risks.

Lessons Learned:

Compared to physical storage devices, QCOW2 exhibits unusual write semantics due to delayed metadata updates.

The integrity issues with LVM snapshots arose from a common misconception that cache=none disables write caching entirely. In reality, that assumption only holds for RAW disks. With QCOW2, QEMU defers metadata updates and keeps cached metadata structures that remain volatile far longer than expected, even across guest reboots!
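For context, the cache=... setting only controls how QEMU opens and flushes the image; it does not change how QCOW2 handles its own metadata in memory. The mapping below follows the QEMU documentation, and the last line is an illustrative (not Proxmox-generated) invocation showing the explicit form of cache=none: cache.direct=on only bypasses the host page cache for data I/O, while QCOW2 metadata still lives inside the QEMU process until the guest issues a flush.
Code:
# Cache mode shortcuts -> underlying QEMU flags (per the QEMU documentation):
#   writethrough : cache.writeback=off  cache.direct=off  cache.no-flush=off
#   none         : cache.writeback=on   cache.direct=on   cache.no-flush=off
#   writeback    : cache.writeback=on   cache.direct=off  cache.no-flush=off
#   directsync   : cache.writeback=off  cache.direct=on   cache.no-flush=off
#   unsafe       : cache.writeback=on   cache.direct=off  cache.no-flush=on
# Illustrative only: explicit equivalent of cache=none for a qcow2 drive.
qemu-system-x86_64 -m 256 -drive file=disk.qcow2,format=qcow2,if=virtio,cache.writeback=on,cache.direct=on,cache.no-flush=off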

Subcluster allocation in the new snapshot-chain feature ("Allow Snapshots as Volume-Chain") significantly increases metadata churn, amplifying the risk of torn writes and data inconsistency after power loss or unplanned guest termination.
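If you want to see subcluster allocation in isolation, you can create a standalone image with the same format options that show up in the snapshot log later in this post. This is just a sketch; the path and size are examples.
Code:
# 128 KiB clusters split into 32 subclusters of 4 KiB each (extended_l2=on).
qemu-img create -f qcow2 -o cluster_size=131072,extended_l2=on /tmp/subcluster-demo.qcow2 1G
# "extended l2: true" appears under "Format specific information".
qemu-img info /tmp/subcluster-demo.qcow2
# Allocation map; with extended L2 entries, allocation is tracked per subcluster.
qemu-img map --output=json /tmp/subcluster-demo.qcow2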

Technical Report:

We've published a technical article summarizing what we've learned, including a reproducible experiment that demonstrates the semantics leading to corruption on power loss:
Please feel free to ask questions, and we'll do our best to answer. If you spot a gap in our understanding, let us know.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
We have received several requests to provide a procedure to reproduce our results. Here are the steps you can use in your lab:

  • Create a Linux VM (we will use Alpine for a smaller footprint)
Code:
qm create 100 --name vm100 --memory 256 --sockets 1 --boot c --bootdisk scsi0 \
  --onboot no --scsihw virtio-scsi-single --net0 virtio,bridge=vmbr0,firewall=1 --ide2 local-lvm:cloudinit \
  --agent enabled=1 --sshkeys /root/.ssh/authorized_keys --serial0 socket --vga serial0 \
  --cicustom meta=local:snippets/alpine.metadata.100.701528,user=local:snippets/alpine.userdata.100.701528 \
  --ipconfig0 ip=dhcp \
  --scsi0 local-lvm:0,aio=native,iothread=1,import-from=/mnt/pve/nfs/template/iso/nocloud_alpine-3.22.0-x86_64-bios-cloudinit-r0.qcow2
  • Create a 3GB "data" disk on an LVM storage pool with "Allow Snapshots as Volume-Chain" enabled. Assign the disk to the VM:
Code:
pvesm alloc testvg 100 '' 3G
qm disk rescan --vmid 100
qm set 100 --scsi1 testvg:vm-100-disk-0.qcow2,cache=none
Note that "cache=none" is default and will not be visible if you examine the VM config later.
  • Boot the VM and record the disk names. In our case, the "data" disk is /dev/sdb
  • Fill the disk with ones: doas badblocks -w -b 4096 -p 1 -t 0x11111111 /dev/sdb
  • Examine the disk's data and confirm the expected pattern: doas hexdump -C -n $((1024*18024)) /dev/sdb
Code:
doas hexdump -C -n $((1024*18024)) /dev/sdb
00000000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
*
0119a000
  • Simulate hardware or power failure by killing the process: pkill -9 -f 'kvm -id 100'
  • Boot the VM and confirm that the disk's data is as expected (still all ones): doas hexdump -C -n $((1024*18024)) /dev/sdb
  • Initiate a VM snapshot: qm snapshot 100 snapshot1
Code:
snapshotting 'drive-scsi0' (local-lvm:vm-100-disk-0)
Logical volume "snap_vm-100-disk-0_snapshot1" created.
snapshotting 'drive-scsi1' (testvg:vm-100-disk-0.qcow2)
external qemu snapshot
Creating a new current volume with snapshot1 as backing snap
Renamed "vm-100-disk-0.qcow2" to "snap_vm-100-disk-0_snapshot1.qcow2" in volume group "tesvg"
Rounding up size to full physical extent 3.00 GiB
Logical volume "vm-100-disk-0.qcow2" created.
Formatting '/dev/tesvg/vm-100-disk-0.qcow2', fmt=qcow2 cluster_size=131072 extended_l2=on preallocation=metadata compression_type=zlib size=3221225472 backing_file=snap_vm-100-disk-0_snapshot1.qcow2 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
blockdev replace current by snapshot1
blockdev-snapshot: reopen current with snapshot1 backing image

  • Write a new, recognizable pattern to the disk: doas badblocks -w -b 4096 -p 1 -t 0xDEADBEEF /dev/sdb
  • Examine the disk to confirm you can read the expected data: doas hexdump -C -n $((1024*18024)) /dev/sdb
Code:
00000000  de ad be ef de ad be ef  de ad be ef de ad be ef  |................|
*
0119a000
  • Kill the VM: pkill -9 -f 'kvm -id 100'
  • After restarting the VM, examine the data and observe that it has reverted to the pre-write state; all of the newly written data has been lost:
Code:
doas hexdump -C -n $((1024*18024)) /dev/sdb
00000000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
*
0119a000
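If you want to see where the lost writes ended up (or rather, did not), you can inspect the snapshot chain from the host while the VM is stopped. The device paths below follow the names printed by the snapshot log above; adjust the volume group path and make sure the LVs are active in your lab.
Code:
# Overlay plus its backing snapshot (paths follow the snapshot log above).
qemu-img info --backing-chain /dev/testvg/vm-100-disk-0.qcow2
# Which clusters/subclusters the overlay actually has allocated.
qemu-img map --output=json /dev/testvg/vm-100-disk-0.qcow2
# Metadata consistency check of the overlay.
qemu-img check /dev/testvg/vm-100-disk-0.qcow2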



 
Thanks a lot. I have read quite a few of your website articles, and while I'm not a Blockbridge customer and probably never will be (much smaller setup here), I have appreciated your KB articles a lot.

But now a question has arisen: Is LVM-Thin as dangerous as QCOW2 is? I mean, does it also never flush metadata writes unless it's instructed to do so by the guest file system?
 
But now a question has arisen: Is LVM-Thin as dangerous as QCOW2 is? I mean, does it also never flush metadata writes unless it's instructed to do so by the guest file system?
Hi @Kurgan,

Thanks for the great question. LVM is considerably more sophisticated than QEMU/QCOW, though I'm not an expert in its internal architecture. My assumption is that it uses a mix of demand-based and timer-based flushing. I'll try to dig into this and follow up with more detail after the holiday.


 
Hi @Kurgan,

Based on a quick review, LVM-thin appears to offer better durability characteristics than QCOW2 for a few reasons:

LVM-thin stores its metadata in a B-tree and applies updates transactionally (though not via a traditional journal). This design tends to preserve metadata ordering during writeback, reducing the risk of inconsistent or reordered indirect mappings after a power loss.

LVM-thin also flushes its metadata at regular intervals, keeping the in-flight metadata window small and limiting the amount of state that can be lost during an unexpected shutdown.

Unlike QCOW2, LVM-thin does not implement sub-cluster allocation. Sub-cluster-sized writes are handled through copy-on-write at the block level, which avoids much of the metadata churn associated with QCOW2’s finer-grained allocation. This reduces, although does not entirely eliminate, the likelihood of torn or partially updated data.

In summary, LVM-thin is generally safer than QCOW2 in these failure scenarios, but it still falls short of the guarantees provided by enterprise-grade storage systems.
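For anyone who wants to poke at this themselves, here are a couple of starting points; the pool and volume group names below are the defaults from a stock PVE install and are only placeholders for your setup:
Code:
# Thin-pool status includes the current metadata transaction id and
# used/total metadata blocks (see the thin-pool target in `man dmsetup`).
dmsetup status pve-data-tpool
# Per-LV data and metadata usage.
lvs -a -o name,segtype,data_percent,metadata_percent pve
# Offline check of the thin metadata device (pool must be inactive,
# or point thin_check at a metadata snapshot instead).
thin_check /dev/mapper/pve-data_tmeta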


 
Hi @Alwin Antreich,

Good question. I’d expect both cache=directsync and cache=writethrough to offer stronger consistency guarantees, but I’ll run some tests and report back.

For what it’s worth, some of the confusion around cache=none stems from the link you shared (we also reference it in our article). It describes how QEMU interacts with storage and explains why a guest advertises a write-back cache, but the reasoning isn’t entirely complete. When QCOW is the backend, the behavior is driven much more by QCOW’s volatile metadata than by the underlying storage device itself.
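For anyone who wants to repeat the reproduction above with a stricter mode, switching the data disk is a one-liner (same volume naming as in the procedure):
Code:
qm set 100 --scsi1 testvg:vm-100-disk-0.qcow2,cache=directsync
# or
qm set 100 --scsi1 testvg:vm-100-disk-0.qcow2,cache=writethrough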


 