[TUTORIAL] Inside Proxmox VE 9 SAN Snapshot Support

bbgeek17

Hi Everyone,

I'm excited to share a deep dive into one of the notable new features in PVE 9: the enhanced snapshot functionality designed for legacy SANs. This feature represents a meaningful step forward in expanding PVE's compatibility with traditional enterprise storage setups, something many of us have been eager to see... and something I've answered a lot of questions about over the years ;)

Inside Proxmox VE 9 SAN Snapshot Support: https://kb.blockbridge.com/technote/proxmox-qcow-snapshots-on-lvm/

Our latest technote dives into the architecture, how the feature operates, and some limitations to keep in mind. It's important to highlight that this is currently a technology preview, so treat it accordingly.

For some background, many enterprise PVE users have had to rely on file-based storage like NFS for VM snapshots because of its compatibility with QCOW2 disk images. Legacy SANs, with their static LUN provisioning, have posed a tough challenge because of QCOW2's dynamic nature. PVE 9's new approach aims to bridge that gap using clever LVM management, which could open doors for many users relying on traditional SANs.

If you're considering testing this feature out or want to understand the nuts and bolts, this article may be a good read. As always, let me know if you have questions, corrections, or if we've missed anything!

Enjoy, The Blockbridge Team!


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi,
thanks for the article, I hope you enjoy my work ^_^

About performance: there are still improvements possible, mostly around the QEMU L2 cache size. Currently it really depends on the qcow2 volume size; it works well up to around ~256GB. For bigger volumes, we need to add an option in QEMU to increase the L2 metadata cache. (It should be easy to add, and from my tests it really helps to reach ~90% of raw performance with bigger volumes.)
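For reference, here is roughly where the ~256GB figure comes from, assuming the default 64 KiB qcow2 cluster size and 8-byte L2 table entries:

# one 8-byte L2 entry maps one 64 KiB cluster, so fully caching an image's
# L2 metadata needs about: virtual_size * 8 / 65536 bytes
#   default 32 MiB cache -> covers 32 MiB * 8192 = 256 GiB of virtual disk
#   a 1 GiB cache        -> covers roughly 8 TiB
echo $(( 8 * 1024**4 * 8 / 65536 / 1024**2 ))   # MiB of L2 cache to fully map an 8 TiB image -> 1024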

Another possible improvement is to keep raw for the main volume and use qcow2 only for the snapshots (that will need more work because of snapshot volume renaming).

I'm also planning "thin provisioning", with the LVM volume smaller than the qcow2 virtual size and dynamic resizing of the LVM volume in chunks of 1GB, for example.
 
About:

"When a snapshot is deleted, the system merges data from the previous QCOW image and LVM logical volume into the new QCOW image and logical volume. This merge process allocates additional storage in the new logical volume but does not release any storage from the old one."

The "Wipe Removed Volumes" option should fill the old snapshot space with zeros.

If the storage supports discard, another way could be to set `issue_discards = 1` in lvm.conf, but I'll look at doing it with the blkdiscard command on snapshot removal in upcoming patches, to avoid the need to tune lvm.conf manually. (Maybe reuse the VM's discard option to enable the feature, something like that.)
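For anyone who wants to experiment with the lvm.conf route in the meantime, a minimal sketch (the option lives in the devices section, and it only helps if the SAN actually honours unmap):

# /etc/lvm/lvm.conf
devices {
    issue_discards = 1   # send discards to the PV when an LV is removed or reduced
}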
 
Hi Spirit,

Thank you for your hard work! I especially appreciate how cleanly the feature is integrated, with no excess knobs or tunables. Great job by the whole team!

We're aligned on the QEMU caching. Larger caches can help somewhat, but it's worth considering how the extra memory footprint will interact with chained snapshots.

FWIW, we've had hundreds of inquiries from people with legacy SANs looking for a fix to this exact problem. Most of these SANs are either near EOL or have only modest performance. In those cases, correctness is a far bigger concern than raw speed.

The main issue we see is ensuring the raw device is fully zeroed before layering a QCOW on top. QCOW doesn't serialize metadata and data writes, and it lacks a journal. If the backing device doesn't read zeros, a power loss or process termination can lead to data corruption.

Addressing this is tricky. Zeroing semantics for unmap/discard vary by device and implementation, but they can usually be checked via the device's VPD pages. Offloaded zeroing is worth exploring, though older systems might not support it, and multiple device vendors have buggy implementations that the kernel disabled via quirks.
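For anyone who wants to check what their own array reports, a quick sketch with sg3_utils (replace /dev/sdX with the SAN LUN):

# Logical Block Provisioning VPD page (0xb2): LBPRZ=1 means unmapped
# blocks are guaranteed to read back as zeros
sg_vpd --page=lbpv /dev/sdX

# Block Limits VPD page (0xb0): WRITE SAME / unmap limits and granularity
sg_vpd --page=bl /dev/sdX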

On our side, we could detect when zeroing is needed automatically. But building a vendor-specific fix just for this case doesn't seem like the right long-term approach.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi Spirit,

Thank you for your hard work! I especially appreciate how cleanly the feature is integrated, with no excess knobs or tunables. Great job by the whole team!

We're aligned on the QEMU caching. Larger caches can help somewhat, but it's worth considering how the extra memory footprint will interact with chained snapshots.
Each snapshot has its own cache. (Note that the cache value is a maximum, by default 32MB; nothing is loaded until you begin to read different parts of the disk.) Also, metadata that hasn't been used for 10 minutes is evicted.
I just sent a simple patch to the mailing list to increase the max cache size to 1GB by default (this should allow 8TB images). I don't know if we need to make it tunable or not.
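For context, the knob in question is the qcow2 driver's l2-cache-size option; on a plain QEMU command line it looks roughly like this (device path and sizes are only an illustration, not what PVE generates):

qemu-system-x86_64 -m 4096 \
  -drive file=/dev/vg_san/vm-100-disk-0,format=qcow2,l2-cache-size=1G,cache=none,aio=native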

FWIW, we've had hundreds of inquiries from people with legacy SANs looking for a fix to this exact problem. Most of these SANs are either near EOL or have only modest performance. In those cases, correctness is a far bigger concern than raw speed.
Yes, I have a lot of customers with the same problem too; that was the main reason for this work ^_^

The main issue we see is ensuring the raw device is fully zeroed before layering a QCOW on top. QCOW doesn't serialize metadata and data writes, and it lacks a journal. If the backing device doesn't read zeros, a power loss or process termination can lead to data corruption.
Note that on image clone (from a template, for example), qemu-img convert should zero the disk (I'm 100% sure it does some kind of trimming, and I think it zeroes too). But for a new empty image, or when creating a new image for a snapshot, it's not done.
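As a generic illustration of that clone path (image names and the raw output format are just an example, not the exact invocation PVE uses):

# -n reuses the existing LV instead of trying to create a target; convert
# detects zeroed/unallocated regions in the source and issues efficient
# zero writes where the target device supports them
qemu-img convert -p -n -f qcow2 -O raw template.qcow2 /dev/vg_san/vm-100-disk-0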
Addressing this is tricky. Zeroing semantics for unmap/discard vary by device and implementation, but they can usually be checked via the device's VPD pages.
Yes, I know. At minimum I would like to add an option on the storage to use fast discard, while also keeping "safe removal" zeroing, but maybe use them for image create/snapshot. (This could slow down snapshot creation, but that's not a problem.)

Offloaded zeroing is worth exploring, though older systems might not support it, and multiple device vendors have buggy implementations that the kernel disabled via quirks.

On our side, we could detect when zeroing is needed automatically. But building a vendor-specific fix just for this case doesn't seem like the right long-term approach.


I'll try to work on this next week, I'll keep you in touch !
 
I have done tests with blkdiscard --zeroout; from my tests, it's around 4-5x faster than the current cstream code (without throttling).
Another benefit: it seems to skip already-zeroed blocks, so if you create, delete, create, delete, ... it's a lot faster the second time if the blocks have not been rewritten.
And I can easily add a knob to disable zeroing and use a true discard if the storage supports it.
(I have looked at the Red Hat oVirt code, and they also use blkdiscard.)
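The two variants under discussion, with a hypothetical snapshot LV name:

# zero the whole LV via zero-out requests (the kernel falls back to writing
# zero pages if the device has no native write-zeroes support)
blkdiscard --zeroout /dev/vg_san/snap_vm-100-disk-0

# or, when the SAN supports unmap and a plain discard is acceptable
blkdiscard /dev/vg_san/snap_vm-100-disk-0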
 
@bbgeek17

in your article:

"When a snapshot is created, subsequent writes are redirected to a new LVM logical volume. On a thin-provisioned SAN, storage for these writes is allocated on demand. When a snapshot is deleted, the system merges data from the previous QCOW image and LVM logical volume into the new QCOW image and logical volume. This merge process allocates additional storage in the new logical volume but does not release any storage from the old one."

Currently, if you enable "Wipe Removed Volumes", it also zeroes the snapshot volume after the merge.
Is that enough for you to release the storage space on the SAN side? (I'll also replace the current cstream code with blkdiscard, with zeroing or discard, for this part.)
 
Hi Spirit,

It'll be a little while before the team can revisit this (and I'm currently OoO). That said, increasing the default cache size likely introduces serious real-world risks. I believe it creates a denial-of-service vulnerability and makes cluster-wide resource scheduling extremely difficult. With a relatively simple workload, an attacker could deplete system memory just by creating snapshots during modest I/O or even using multiple VMs running on the same host. Since qemu cache memory consumption will correlate with snapshots and I/O patterns, this is going to be a problem.

I'm happy to test it when I'm back, but this needs careful consideration from the PVE team. To be candid, we had already identified the default cache size as a vulnerability in the context of the chained snapshot model. This change makes it much worse. At a minimum, memory usage needs to be bounded. In practice, it really should also be deterministic.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi Spirit,

It'll be a little while before the team can revisit this (and I'm currently OoO). That said, increasing the default cache size likely introduces serious real-world risks. I believe it creates a denial-of-service vulnerability and makes cluster-wide resource scheduling extremely difficult. With a relatively simple workload, an attacker could deplete system memory just by creating snapshots during modest I/O or even using multiple VMs running on the same host. Since qemu cache memory consumption will correlate with snapshots and I/O patterns, this is going to be a problem.
The cache value is a maximum; it's only allocated when you need to load specific metadata (and unused metadata is flushed after 10 minutes). It's not too different from ZFS, by the way, where you also need memory to handle metadata.
So, yes, if you read a full disk with 8TB of data in less than 10 minutes, it'll use 1GB of memory.

I'm happy to test it when I'm back, but this needs careful consideration from the PVE team. To be candid, we had already identified the default cache size as a vulnerability in the context of the chained snapshot model. This change makes it much worse. At a minimum, memory usage needs to be bounded. In practice, it really should also be deterministic.


It could be a configurable option, if users want to prioritize memory consumption over performance.
 
(I forgot to say that for snapshots, we use qcow2 sub-allocated clusters (extended_l2=on) with a 128k cluster size, so the metadata is 32x smaller than for the base image (64k clusters without sub-allocated clusters).)

Around 4MB of memory for a 1TB image:
https://www.youtube.com/watch?v=NfgLCdtkRus
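For reference, creating such an image by hand looks roughly like this (path and size are illustrative; the upstream qemu-img spelling of the option is extended_l2):

# 128 KiB clusters split into 32 subclusters of 4 KiB each, so small writes
# allocate at 4 KiB granularity instead of a full cluster
qemu-img create -f qcow2 -o cluster_size=128k,extended_l2=on /dev/vg_san/snap_vm-100-disk-0 100G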

Also, the memory needed is not cumulative; it is allocated in the cache of the image where the cluster is actually read.

For example:

baseimage (data at cluster1) ---> snap1 (cluster1 empty): on read, the data will be read from the baseimage (and fill the baseimage cache), but not the snap1 cache.

Of course, if you read the baseimage, then create a new snap, rewrite the data on snap1 and read it again, the cache will be filled on both baseimage and snap1. (But after 10 minutes it'll be released from the baseimage cache, as it's not read anymore.)
 
Good job @spirit
I would suggest that this plugin should be displayed only when we add LVM-based storage, over in Datacenter -> Storage, since I think this is meant to be used only with LVM + SAN, right?

Cheers
It's used for any qcow2 storage, including file storage (local, NFS, ...). External snapshots allow taking and deleting snapshots without interruption; the current internal snapshots freeze the VM when deleting a snapshot, for example.
 
It's used for any qcow2 storage, including file storage (local, NFS, ...). External snapshots allow taking and deleting snapshots without interruption; the current internal snapshots freeze the VM when deleting a snapshot, for example.
Oh, I see... I was under the assumption that this was only for LVM in combination with a SAN, because that's a common scenario in the enterprise.
But I am glad that this will work on other storage types as well.

Thanks