Deleting snapshots from qcow2

j_s · Feb 5, 2024

I'm currently writing a script I can invoke from the command line to delete old snapshots in a qcow2 file. I often have more than 10 old snapshots.

If I have an arbitrary qcow2 for a VM with arbitrary snapshots going back some months, is there a difference between deleting snapshots from oldest to newest vs newest to oldest?

My intention is to keep the most recent two snapshots and discard all others, except those that have the word "PROTECTED" in the snapshot description.

As an example, a VM I am currently cleaning up has 12 snapshots, from random times between Nov 4th 2023 and yesterday. I'm keeping the Nov 4th snapshot, the one from yesterday, and the one from last Thursday. The other 9 can be deleted.
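A minimal sketch of the retention rule described above, in Python. The `Snapshot` record and the `snapshots_to_delete` function are hypothetical names used for illustration; how you list snapshots and read their descriptions depends on your tooling and is not shown here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Snapshot:
    # Hypothetical record; fill it from whatever tool lists your snapshots.
    name: str
    created: datetime
    description: str = ""


def snapshots_to_delete(snapshots, keep_recent=2, marker="PROTECTED"):
    """Return the snapshots to delete, ordered oldest to newest."""
    ordered = sorted(snapshots, key=lambda s: s.created)
    # The newest `keep_recent` snapshots are always kept.
    recent = ordered[-keep_recent:] if keep_recent > 0 else []
    recent_names = {s.name for s in recent}
    # Everything else goes, unless its description carries the marker.
    return [
        s for s in ordered
        if s.name not in recent_names and marker not in s.description
    ]
```

With 12 snapshots where only the oldest is marked "PROTECTED", this would return the 9 in between, matching the example above.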

Right now they delete oldest to newest, but I'm wondering if newest to oldest might be faster.

Thanks!

dakralex · Aug 14, 2024

I haven't checked the source code on this, but logically it would make sense that deleting QEMU snapshots is fastest when done from oldest to newest, as the snapshots need to be written to the initial snapshot disk image. It can also depend heavily on which underlying storage you use, and it would take further investigation into how QEMU does the snapshot deletion internally to be sure.

FYI, but off-topic: it would be advisable to use backups for data redundancy, and you could also set up pruning and/or retention jobs if that meets your criteria.

j_s · Aug 15, 2024

So I did a series of tests: local tests and remote tests over NFS. Let's talk about the local tests first:

The same VM was duplicated 4 times for all 4 experiments. The virtual disk is 200GB in size, but the qcow2 file is just over 500GB in total size.

Here's the list of snapshot names, which aren't too useful, but they do include date stamps in the name:

Feb_4_2024
Feb_9_2024
Before_12_5
Feb_15_2024
Feb_29_2024
Mar_10_2024
Mar_23_2024
Apr_8_2024
May_5_2024
June_6_2024
June_17_2024
June_21_2024
July_2_2024
July_2_2024b
July_18_2024
July_27_2024
July_31_2024

For the purposes of this test, this node in my HA cluster had no VMs running and was not the master. That minimized the workload on the machine and made the snapshot deletion process as fast and repeatable as possible. Using top I could see that the deletion process is single-threaded, so my relatively weak D-2123IT CPUs will definitely create a bottleneck.

The local storage was a Samsung 970 Pro 1TB NVMe drive. The local storage is ZFS formatted, and autotrim was enabled. The local storage is exclusively for the OS, and I never run VMs directly off the local storage. I used the "time" command to get the real time for each process to complete.
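For anyone who wants to repeat this kind of measurement from a script, here is a rough sketch in Python. It assumes the deletions are done with `qemu-img snapshot -d` directly on an image that no running VM is using; the post does not say which command was actually timed, so treat that as an assumption. `delete_command` and `timed_deletions` are hypothetical helper names.

```python
import subprocess
import time


def delete_command(image_path, snapshot_name):
    # Assumption: `qemu-img snapshot -d <name> <image>` is the deletion command.
    return ["qemu-img", "snapshot", "-d", snapshot_name, image_path]


def timed_deletions(image_path, snapshot_names, run=subprocess.run):
    """Delete snapshots in the given order; return (name, seconds) pairs."""
    timings = []
    for name in snapshot_names:
        start = time.perf_counter()
        # check=True makes a failed deletion raise instead of being timed silently.
        run(delete_command(image_path, name), check=True)
        timings.append((name, time.perf_counter() - start))
    return timings
```

Passing the list oldest to newest, or reversed, gives the two orderings compared in this post. The `run` parameter exists only so the loop can be tried with a stand-in function before pointing it at a real image.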

The proxmox host is a D-2123IT based system with 256GB of RAM.

Code:

Feb_4_2024      26 minutes 50 seconds
Feb_9_2024      25 minutes 8 seconds
Before_12_5     29 minutes 24 seconds
Feb_15_2024     25 minutes 45 seconds
Feb_29_2024     29 minutes 34 seconds
Mar_10_2024     30 minutes 38 seconds 
Mar_23_2024     37 minutes 41 seconds
Apr_8_2024      43 minutes 6 seconds
May_5_2024      55 minutes 58 seconds
June_6_2024     37 minutes 14 seconds
June_17_2024    36 minutes 9 seconds
June_21_2024    43 minutes 16 seconds
July_2_2024     30 minutes 21 seconds
July_2_2024b    31 minutes 46 seconds
July_18_2024    31 minutes 40 seconds
July_27_2024    27 minutes 40 seconds
July_31_2024    30 minutes 44 seconds


Total time: 573 minutes 3 seconds

Deleting them in the reverse order:

Code:

July_31_2024     28 minutes 11 seconds
July_27_2024     29 minutes 13 seconds
July_18_2024     29 minutes 41 seconds
July_2_2024b     29 minutes 39 seconds
July_2_2024      24 minutes 7 seconds
June_21_2024     35 minutes 8 seconds
June_17_2024     38 minutes 28 seconds
June_6_2024      43 minutes 41 seconds
May_5_2024       50 minutes 2 seconds
Apr_8_2024       30 minutes 23 seconds
Mar_23_2024      29 minutes 53 seconds
Mar_10_2024      29 minutes 50 seconds
Feb_29_2024      29 minutes 38 seconds
Feb_15_2024      24 minutes 55 seconds
Before_12_5      23 minutes 17 seconds
Feb_9_2024       29 minutes 58 seconds
Feb_4_2024       178 minutes 46 seconds

Total time: 695 minutes 0 seconds

So the difference between sequential and reverse-sequential deletion was pretty significant. Going sequentially was about 122 minutes (roughly 18%) faster. I'm not really sure why the last snapshot took so incredibly long when going in reverse order, but it took a ridiculous amount of time to complete. If it hadn't been for that last snapshot deletion, it would have been a very close race.

Now for the tests over my network. The Proxmox host has 10Gb networking to a TrueNAS Core system with an all-SSD zpool running with 3 mirrored vdevs and an slog. The protocol is NFS. Here's how the numbers came out, oldest to newest first:

Code:

Feb_4_2024      7 minutes 29 seconds
Feb_9_2024      5 minutes 47 seconds
Before_12_5     9 minutes 26 seconds
Feb_15_2024     7 minutes 44 seconds
Feb_29_2024     9 minutes 4 seconds
Mar_10_2024     10 minutes 12 seconds 
Mar_23_2024     15 minutes 40 seconds
Apr_8_2024      21 minutes 8 seconds
May_5_2024      29 minutes 35 seconds
June_6_2024     11 minutes 21 seconds
June_17_2024    9 minutes 49 seconds
June_21_2024    17 minutes 1 seconds
July_2_2024     8 minutes 10 seconds
July_2_2024b    10 minutes 47 seconds
July_18_2024    11 minutes 36 seconds
July_27_2024    8 minutes 20 seconds
July_31_2024    8 minutes 17 seconds


Total time: 201 minutes 36 seconds

Deleting them in the reverse order:

Code:

July_31_2024     5 minutes 58 seconds
July_27_2024     6 minutes 54 seconds
July_18_2024     6 minutes 42 seconds
July_2_2024b     6 minutes 9 seconds
July_2_2024      8 minutes 26 seconds
June_21_2024     7 minutes 49 seconds
June_17_2024     9 minutes 14 seconds
June_6_2024      12 minutes 42 seconds
May_5_2024       21 minutes 5 seconds
Apr_8_2024       9 minutes 49 seconds
Mar_23_2024      8 minutes 35 seconds
Mar_10_2024      8 minutes 48 seconds
Feb_29_2024      8 minutes 58 seconds
Feb_15_2024      7 minutes 2 seconds
Before_12_5      5 minutes 21 seconds
Feb_9_2024       8 minutes 41 seconds
Feb_4_2024       147 minutes 5 seconds

Total time: 289 minutes 24 seconds

Yes, NFS was faster than local storage. But it's not fair to compare local vs NFS, because there are obvious differences: NFS is a network sharing protocol, and the storage on the TrueNAS is significantly faster than the 970 Pro.

But comparing NFS to NFS, the sequential deletion is still faster. In this case, it's about 30% faster, or 88 minutes. That's pretty significant. Again though, that last snapshot deletion is a real killer for some reason.
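The savings quoted for both comparisons can be checked from the reported totals. This is just a small arithmetic sketch in Python; `to_seconds` and `compare` are hypothetical helper names, and the percentage is taken relative to the slower (reverse) run.

```python
import re


def to_seconds(text):
    """Parse a duration such as '573 minutes 3 seconds' into seconds."""
    match = re.fullmatch(r"(\d+) minutes (\d+) seconds", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


def compare(sequential, reverse):
    """Return (minutes saved, percent saved relative to the reverse run)."""
    fast, slow = to_seconds(sequential), to_seconds(reverse)
    saved = slow - fast
    return saved / 60, 100 * saved / slow


# Totals as reported in this post.
local = compare("573 minutes 3 seconds", "695 minutes 0 seconds")
nfs = compare("201 minutes 36 seconds", "289 minutes 24 seconds")
print(f"local: {local[0]:.1f} min saved ({local[1]:.1f}%)")
print(f"NFS:   {nfs[0]:.1f} min saved ({nfs[1]:.1f}%)")
```

That works out to roughly 122 minutes (about 17.5%) locally and roughly 88 minutes (about 30.3%) over NFS.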

I have no idea what all of this means in the big picture, but I definitely will be deleting snapshots from oldest to newest from now on.

And I'll be making it a higher priority to not keep 6+ months worth of snapshots on a VM. That was totally an accident and I plan to be more diligent in the future!

waltar · Aug 15, 2024

That's a little bit funny, isn't it? I love to log in to an NFS file server backed by XFS and make a reflink copy of a VM as a snapshot in milliseconds, perhaps monthly as you've shown. If I want to delete one, it again takes less than 1s per snapshot, as it's only 1 file ...

dakralex · Aug 16, 2024

FYI, as I have just realized: you are running one copy-on-write (CoW) layer on top of another, in this case qcow2 on top of ZFS, even though ZFS provides the same snapshotting functionality for guests. This will usually result in performance issues and wear out SSDs faster.

Kingneutron · Aug 17, 2024

j_s said:
I have no idea what all of this means in the big picture, but I definitely will be deleting snapshots from oldest to newest from now on.

And I'll be making it a higher priority to not keep 6+ months worth of snapshots on a VM. That was totally an accident and I plan to be more diligent in the future!

I wouldn't keep a VM snapshot more than ~2 weeks at the outside (depending on how much I/O the VM is doing) if you don't want to risk performance slowdowns. Possibly less than that if the backing storage is spinning HDs.

They're just not meant to run that way; a VM snapshot is for when you need to back out quickly after a bad update or something. Beyond that, you have backups. Cloning the VM into several separate states in a lab is maybe more the way to go if you might need a particular save-point to go back to.
