[TUTORIAL] Inside Proxmox VE 9 SAN Snapshot Support

bbgeek17

Hi Everyone,

I'm excited to share a deep dive into one of the notable new features in PVE 9: the enhanced snapshot functionality designed for legacy SANs. This feature represents a meaningful step forward in expanding PVE's compatibility with traditional enterprise storage setups, something many of us have been eager to see... and something I've answered a lot of questions about over the years ;)

Inside Proxmox VE 9 SAN Snapshot Support: https://kb.blockbridge.com/technote/proxmox-qcow-snapshots-on-lvm/

Our latest technote dives into the architecture, how the feature operates, and some limitations to keep in mind. It's important to highlight that this is currently a technology preview, so treat it accordingly.

For some background, many enterprise PVE users have had to rely on file-based storage like NFS for VM snapshots because of its compatibility with QCOW2 disk images. Legacy SANs, with their static LUN provisioning, have posed a tough challenge because of QCOW2's dynamic nature. PVE 9's new approach aims to bridge that gap using clever LVM management, which could open doors for many users relying on traditional SANs.

If you're considering testing this feature out or want to understand the nuts and bolts, this article may be a good read. As always, let me know if you have questions, corrections, or if we've missed anything!

Enjoy, The Blockbridge Team!


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi,
thanks for the article, I hope you enjoy my work ^_^

About performance: there are still improvements possible, mostly around the QEMU L2 cache size. Currently it really depends on the qcow2 volume size; it works well up to around ~256GB. For bigger volumes, we need to add an option in QEMU to increase the L2 metadata cache. (It should be easy to add, and from my tests it really helps to reach ~90% of raw performance with bigger volumes.)
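For reference, here is roughly where the ~256GB figure comes from, assuming the default 64 KiB qcow2 cluster size and 8-byte L2 table entries:

# one 8-byte L2 entry maps one 64 KiB cluster, so fully caching an image's
# L2 metadata needs about: virtual_size * 8 / 65536 bytes
#   default 32 MiB cache -> covers 32 MiB * 8192 = 256 GiB of virtual disk
#   a 1 GiB cache        -> covers roughly 8 TiB
echo $(( 8 * 1024**4 * 8 / 65536 / 1024**2 ))   # MiB of L2 cache to fully map an 8 TiB image -> 1024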

Another possible improvement is to keep raw for the main volume and use qcow2 only for the snapshots (that will need more work because of snapshot volume renaming).

I'm also planning "thin provisioning", with the LVM volume smaller than the qcow2 virtual size and dynamic resizing of the LVM volume in chunks of 1GB, for example.
 
About:

"When a snapshot is deleted, the system merges data from the previous QCOW image and LVM logical volume into the new QCOW image and logical volume. This merge process allocates additional storage in the new logical volume but does not release any storage from the old one."

The "Wipe Removed Volumes" option should fill the old snapshot space with zeros.

If the storage supports discard, another way could be to set `issue_discards = 1` in lvm.conf, but I'll look at doing it with the blkdiscard command on snapshot removal in upcoming patches, to avoid the need to tune lvm.conf manually. (Maybe reuse the VM's discard option to enable the feature, something like that.)
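For anyone who wants to experiment with the lvm.conf route in the meantime, a minimal sketch (the option lives in the devices section, and it only helps if the SAN actually honours unmap):

# /etc/lvm/lvm.conf
devices {
    issue_discards = 1   # send discards to the PV when an LV is removed or reduced
}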
 
Hi Spirit,

Thank you for your hard work! I especially appreciate how cleanly the feature is integrated, with no excess knobs or tunables. Great job by the whole team!

We're aligned on the QEMU caching. Larger caches can help somewhat, but it's worth considering how the extra memory footprint will interact with chained snapshots.

FWIW, we've had hundreds of inquiries from people with legacy SANs looking for a fix to this exact problem. Most of these SANs are either near EOL or have only modest performance. In those cases, correctness is a far bigger concern than raw speed.

The main issue we see is ensuring the raw device is fully zeroed before layering a QCOW on top. QCOW doesn't serialize metadata and data writes, and it lacks a journal. If the backing device doesn't read zeros, a power loss or process termination can lead to data corruption.

Addressing this is tricky. Zeroing semantics for unmap/discard vary by device and implementation, but they can usually be checked via the device's VPD pages. Offloaded zeroing is worth exploring, though older systems might not support it, and multiple device vendors have buggy implementations that the kernel disabled via quirks.
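For anyone who wants to check what their own array reports, a quick sketch with sg3_utils (replace /dev/sdX with the SAN LUN):

# Logical Block Provisioning VPD page (0xb2): LBPRZ=1 means unmapped
# blocks are guaranteed to read back as zeros
sg_vpd --page=lbpv /dev/sdX

# Block Limits VPD page (0xb0): WRITE SAME / unmap limits and granularity
sg_vpd --page=bl /dev/sdX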

On our side, we could detect when zeroing is needed automatically. But building a vendor-specific fix just for this case doesn't seem like the right long-term approach.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi Spirit,

Thank you for your hard work! I especially appreciate how cleanly the feature is integrated, with no excess knobs or tunables. Great job by the whole team!

We're aligned on the QEMU caching. Larger caches can help somewhat, but it's worth considering how the extra memory footprint will interact with chained snapshots.
Each snapshot has its own cache. (Note that the cache value is a maximum, by default 32MB; nothing is loaded until you begin to read different parts of the disk.) Also, metadata that hasn't been used for 10 minutes is evicted.
I just sent a simple patch to the mailing list to increase the max cache size to 1GB by default (this should allow 8TB images). I don't know if we need to make it tunable or not.
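For context, the knob in question is the qcow2 driver's l2-cache-size option; on a plain QEMU command line it looks roughly like this (device path and sizes are only an illustration, not what PVE generates):

qemu-system-x86_64 -m 4096 \
  -drive file=/dev/vg_san/vm-100-disk-0,format=qcow2,l2-cache-size=1G,cache=none,aio=native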

FWIW, we've had hundreds of inquiries from people with legacy SANs looking for a fix to this exact problem. Most of these SANs are either near EOL or have only modest performance. In those cases, correctness is a far bigger concern than raw speed.
Yes, I have a lot of customers with the same problem too; that was the main reason for this work ^_^

The main issue we see is ensuring the raw device is fully zeroed before layering a QCOW on top. QCOW doesn't serialize metadata and data writes, and it lacks a journal. If the backing device doesn't read zeros, a power loss or process termination can lead to data corruption.
Note that on image clone (from a template, for example), qemu-img convert should zero the disk (I'm 100% sure it does some kind of trimming, and I think it zeroes too). But for a new empty image, or when creating a new image for a snapshot, it's not done.
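As a generic illustration of that clone path (image names and the raw output format are just an example, not the exact invocation PVE uses):

# -n reuses the existing LV instead of trying to create a target; convert
# detects zeroed/unallocated regions in the source and issues efficient
# zero writes where the target device supports them
qemu-img convert -p -n -f qcow2 -O raw template.qcow2 /dev/vg_san/vm-100-disk-0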
Addressing this is tricky. Zeroing semantics for unmap/discard vary by device and implementation, but they can usually be checked via the device's VPD pages.
Yes, I know. At minimum I would like to add an option on the storage to use fast discard, while also keeping "safe removal" zeroing, but maybe use them for image create/snapshot. (This could slow down snapshot creation, but that's not a problem.)

Offloaded zeroing is worth exploring, though older systems might not support it, and multiple device vendors have buggy implementations that the kernel disabled via quirks.

On our side, we could detect when zeroing is needed automatically. But building a vendor-specific fix just for this case doesn't seem like the right long-term approach.


I'll try to work on this next week, I'll keep you in touch !
 
I have done tests with blkdiscard --zeroout; from my tests, it's around 4-5x faster than the current cstream code (without throttling).
Another benefit: it seems to skip already-zeroed blocks, so if you create, delete, create, delete, ... it's a lot faster the second time if the blocks have not been rewritten.
And I can easily add a knob to disable zeroing and use a true discard if the storage supports it.
(I have looked at the Red Hat oVirt code, and they also use blkdiscard.)
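The two variants under discussion, with a hypothetical snapshot LV name:

# zero the whole LV via zero-out requests (the kernel falls back to writing
# zero pages if the device has no native write-zeroes support)
blkdiscard --zeroout /dev/vg_san/snap_vm-100-disk-0

# or, when the SAN supports unmap and a plain discard is acceptable
blkdiscard /dev/vg_san/snap_vm-100-disk-0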
 
@bbgeek17

in your article:

"When a snapshot is created, subsequent writes are redirected to a new LVM logical volume. On a thin-provisioned SAN, storage for these writes is allocated on demand. When a snapshot is deleted, the system merges data from the previous QCOW image and LVM logical volume into the new QCOW image and logical volume. This merge process allocates additional storage in the new logical volume but does not release any storage from the old one."

Currently, if you enable "Wipe Removed Volumes", it also zeroes the snapshot volume after the merge.
Is that enough for you to release the storage space on the SAN side? (I'll also replace the current cstream code with blkdiscard, with zeroing or discard, for this part.)
 
Hi Spirit,

It'll be a little while before the team can revisit this (and I'm currently OoO). That said, increasing the default cache size likely introduces serious real-world risks. I believe it creates a denial-of-service vulnerability and makes cluster-wide resource scheduling extremely difficult. With a relatively simple workload, an attacker could deplete system memory just by creating snapshots during modest I/O or even using multiple VMs running on the same host. Since qemu cache memory consumption will correlate with snapshots and I/O patterns, this is going to be a problem.

I'm happy to test it when I'm back, but this needs careful consideration from the PVE team. To be candid, we had already identified the default cache size as a vulnerability in the context of the chained snapshot model. This change makes it much worse. At a minimum, memory usage needs to be bounded. In practice, it really should also be deterministic.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi Spirit,

It'll be a little while before the team can revisit this (and I'm currently OoO). That said, increasing the default cache size likely introduces serious real-world risks. I believe it creates a denial-of-service vulnerability and makes cluster-wide resource scheduling extremely difficult. With a relatively simple workload, an attacker could deplete system memory just by creating snapshots during modest I/O or even using multiple VMs running on the same host. Since qemu cache memory consumption will correlate with snapshots and I/O patterns, this is going to be a problem.
The cache value is a maximum; it's only allocated when you need to load specific metadata (and unused metadata is flushed after 10 minutes). It's not too different from ZFS, by the way, where you also need memory to handle metadata.
So, yes, if you read a full disk with 8TB of data in less than 10 minutes, it'll use 1GB of memory.

I'm happy to test it when I'm back, but this needs careful consideration from the PVE team. To be candid, we had already identified the default cache size as a vulnerability in the context of the chained snapshot model. This change makes it much worse. At a minimum, memory usage needs to be bounded. In practice, it really should also be deterministic.


It could be a configurable option, if users want to prioritize memory consumption over performance.
 
(I forgot to say that for snapshots, we use qcow2 sub-allocated clusters (extended_l2=on) with a 128k cluster size, so the metadata is 32x smaller than for the base image (64k clusters without sub-allocated clusters).)

Around 4MB of memory for a 1TB image:
https://www.youtube.com/watch?v=NfgLCdtkRus
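For reference, creating such an image by hand looks roughly like this (path and size are illustrative; the upstream qemu-img spelling of the option is extended_l2):

# 128 KiB clusters split into 32 subclusters of 4 KiB each, so small writes
# allocate at 4 KiB granularity instead of a full cluster
qemu-img create -f qcow2 -o cluster_size=128k,extended_l2=on /dev/vg_san/snap_vm-100-disk-0 100G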

Also, the memory needed is not cumulative; it is allocated in the cache of the image where the cluster is actually read.

For example:

baseimage (data at cluster1) ---> snap1 (cluster1 empty): on read, the data will be read from the baseimage (and fill the baseimage cache), but not the snap1 cache.

Of course, if you read the baseimage, then create a new snap, rewrite the data on snap1 and read it again, the cache will be filled on both baseimage and snap1. (But after 10 minutes it'll be released from the baseimage cache, as it's not read anymore.)
 
Good job @spirit
I would suggest that this plugin should be displayed only when we add LVM-based storage, over in Datacenter -> Storage, since I think this is meant to be used only with LVM + SAN, right?

Cheers
It's used for any qcow2 storage, including file storage (local, NFS, ...). External snapshots allow taking and deleting snapshots without interruption; the current internal snapshots freeze the VM when deleting a snapshot, for example.
 
It's used for any qcow2 storage, including file storage (local, NFS, ...). External snapshots allow taking and deleting snapshots without interruption; the current internal snapshots freeze the VM when deleting a snapshot, for example.
Oh, I see... I was under the assumption that this was only for LVM in combination with a SAN, because that's a common scenario in the enterprise.
But I am glad that this will work on other storage types as well.

Thanks