So I did a series of tests: local tests and remote tests over NFS. Let's talk about the local tests first:
The same VM was duplicated 4 times for all 4 experiments. The virtual disk is 200GB in size, but the qcow2 file is just over 500GB in total size.
Here's the list of snapshot names, which aren't too useful, but they do include date stamps in the name:
Feb_4_2024
Feb_9_2024
Before_12_5
Feb_15_2024
Feb_29_2024
Mar_10_2024
Mar_23_2024
Apr_8_2024
May_5_2024
June_6_2024
June_17_2024
June_21_2024
July_2_2024
July_2_2024b
July_18_2024
July_27_2024
July_31_2024
For the purposes of this test, this node in my HA cluster had no VMs running and was not the master. That was to minimize the workload on the machine and maximize the performance and repeatability of the snapshot deletion process. Using top, I could see that the deletion process is single-threaded, so my relatively weak D-2123IT will definitely be a bottleneck.
The local storage was a Samsung 970 Pro 1TB NVMe drive. It is ZFS formatted, and autotrim was enabled. The local storage is exclusively for the OS, and I never run VMs directly off it. I used the "time" command to get the real time for each deletion to complete.
The Proxmox host is a D-2123IT based system with 256GB of RAM.
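For anyone who wants to repeat this, here is a rough sketch of how the timing run could be scripted. It is not the exact procedure used above: it assumes the snapshots are removed with `qm delsnapshot`, the VM ID `9001` is made up, and by default it only prints the commands instead of running them.

```python
import subprocess
import time

# Snapshot names from the list above, oldest first.
SNAPSHOTS = [
    "Feb_4_2024", "Feb_9_2024", "Before_12_5", "Feb_15_2024",
    "Feb_29_2024", "Mar_10_2024", "Mar_23_2024", "Apr_8_2024",
    "May_5_2024", "June_6_2024", "June_17_2024", "June_21_2024",
    "July_2_2024", "July_2_2024b", "July_18_2024", "July_27_2024",
    "July_31_2024",
]


def build_commands(vmid, snapshots, reverse=False):
    """Return one `qm delsnapshot` command per snapshot, in deletion order."""
    order = list(reversed(snapshots)) if reverse else list(snapshots)
    return [["qm", "delsnapshot", str(vmid), name] for name in order]


def timed_delete(vmid, snapshots, reverse=False, dry_run=True):
    """Run (or, in dry-run mode, just print) each deletion.

    Returns a list of (snapshot_name, elapsed_seconds) pairs.
    """
    results = []
    for cmd in build_commands(vmid, snapshots, reverse):
        start = time.perf_counter()
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # Assumption: run on the Proxmox host with sufficient privileges.
            subprocess.run(cmd, check=True)
        results.append((cmd[-1], time.perf_counter() - start))
    return results
```

Calling `timed_delete(9001, SNAPSHOTS)` lists the oldest-to-newest commands, and `reverse=True` flips the order. Pass `dry_run=False` only on a throwaway copy of the VM.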
Deleting them in sequential order (oldest to newest):
Code:
Feb_4_2024 26 minutes 50 seconds
Feb_9_2024 25 minutes 8 seconds
Before_12_5 29 minutes 24 seconds
Feb_15_2024 25 minutes 45 seconds
Feb_29_2024 29 minutes 34 seconds
Mar_10_2024 30 minutes 38 seconds
Mar_23_2024 37 minutes 41 seconds
Apr_8_2024 43 minutes 6 seconds
May_5_2024 55 minutes 58 seconds
June_6_2024 37 minutes 14 seconds
June_17_2024 36 minutes 9 seconds
June_21_2024 43 minutes 16 seconds
July_2_2024 30 minutes 21 seconds
July_2_2024b 31 minutes 46 seconds
July_18_2024 31 minutes 40 seconds
July_27_2024 27 minutes 40 seconds
July_31_2024 30 minutes 44 seconds
Total time: 573 minutes 3 seconds
Deleting them in the reverse order:
Code:
July_31_2024 28 minutes 11 seconds
July_27_2024 29 minutes 13 seconds
July_18_2024 29 minutes 41 seconds
July_2_2024b 29 minutes 39 seconds
July_2_2024 24 minutes 7 seconds
June_21_2024 35 minutes 8 seconds
June_17_2024 38 minutes 28 seconds
June_6_2024 43 minutes 41 seconds
May_5_2024 50 minutes 2 seconds
Apr_8_2024 30 minutes 23 seconds
Mar_23_2024 29 minutes 53 seconds
Mar_10_2024 29 minutes 50 seconds
Feb_29_2024 29 minutes 38 seconds
Feb_15_2024 24 minutes 55 seconds
Before_12_5 23 minutes 17 seconds
Feb_9_2024 29 minutes 58 seconds
Feb_4_2024 178 minutes 46 seconds
Total time: 695 minutes 0 seconds
So the difference between sequential and reverse-sequential deletion was pretty significant: going sequentially was over 100 minutes (about 18%) faster. I'm not really sure why the last snapshot took so incredibly long when going in reverse order, but it took a ridiculous amount of time to complete. If it hadn't been for that last snapshot deletion, it would have been a very close race.
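To show where the "over 100 minutes" and "18%" figures come from, here is a small sketch that redoes the arithmetic from the two local totals reported above. The helper names are just for illustration, and the percentage is taken relative to the slower (reverse) run.

```python
def to_seconds(minutes, seconds):
    """Convert a 'M minutes S seconds' reading into seconds."""
    return minutes * 60 + seconds


def compare(sequential_s, reverse_s):
    """Return (seconds saved, percent saved) relative to the reverse run."""
    saved = reverse_s - sequential_s
    percent = 100 * saved / reverse_s
    return saved, percent


# Totals as reported for the local tests.
local_seq = to_seconds(573, 3)
local_rev = to_seconds(695, 0)

saved, percent = compare(local_seq, local_rev)
print(f"saved {saved // 60} min {saved % 60} s ({percent:.1f}% less time)")
# prints: saved 121 min 57 s (17.5% less time)
```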
Now for the tests over my network. The Proxmox host has 10Gb networking to a TrueNAS Core system with an all-SSD zpool made up of 3 mirrored vdevs and an SLOG. The protocol is NFS. Here's how the numbers came out, in sequential order first:
Code:
Feb_4_2024 7 minutes 29 seconds
Feb_9_2024 5 minutes 47 seconds
Before_12_5 9 minutes 26 seconds
Feb_15_2024 7 minutes 44 seconds
Feb_29_2024 9 minutes 4 seconds
Mar_10_2024 10 minutes 12 seconds
Mar_23_2024 15 minutes 40 seconds
Apr_8_2024 21 minutes 8 seconds
May_5_2024 29 minutes 35 seconds
June_6_2024 11 minutes 21 seconds
June_17_2024 9 minutes 49 seconds
June_21_2024 17 minutes 1 seconds
July_2_2024 8 minutes 10 seconds
July_2_2024b 10 minutes 47 seconds
July_18_2024 11 minutes 36 seconds
July_27_2024 8 minutes 20 seconds
July_31_2024 8 minutes 17 seconds
Total time: 201 minutes 36 seconds
And deleting them in the reverse order:
Code:
July_31_2024 5 minutes 58 seconds
July_27_2024 6 minutes 54 seconds
July_18_2024 6 minutes 42 seconds
July_2_2024b 6 minutes 9 seconds
July_2_2024 8 minutes 26 seconds
June_21_2024 7 minutes 49 seconds
June_17_2024 9 minutes 14 seconds
June_6_2024 12 minutes 42 seconds
May_5_2024 21 minutes 5 seconds
Apr_8_2024 9 minutes 49 seconds
Mar_23_2024 8 minutes 35 seconds
Mar_10_2024 8 minutes 48 seconds
Feb_29_2024 8 minutes 58 seconds
Feb_15_2024 7 minutes 2 seconds
Before_12_5 5 minutes 21 seconds
Feb_9_2024 8 minutes 41 seconds
Feb_4_2024 147 minutes 5 seconds
Total time: 289 minutes 24 seconds
Yes, NFS was faster than local storage. But it's not fair to compare local vs. NFS because there are obvious differences: NFS is a network sharing protocol, and the storage on the TrueNAS is significantly faster than the 970 Pro.
But comparing NFS to NFS, the sequential deletion is still faster. In this case, it's about 30% faster, or 88 minutes. That's pretty significant. Again though, that last snapshot deletion is a real killer for some reason.
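To put a number on how much that last deletion hurts, this sketch works out what share of each reverse-order run was spent on the final snapshot (Feb_4_2024), using only the times reported above. The helper names are illustrative.

```python
def to_seconds(minutes, seconds):
    """Convert a 'M minutes S seconds' reading into seconds."""
    return minutes * 60 + seconds


def share_of_total(part_s, total_s):
    """Percentage of the total run taken by one deletion."""
    return 100 * part_s / total_s


# (time for the final Feb_4_2024 deletion, total time) for each reverse run.
runs = {
    "local reverse": (to_seconds(178, 46), to_seconds(695, 0)),
    "NFS reverse": (to_seconds(147, 5), to_seconds(289, 24)),
}

shares = {name: share_of_total(last, total) for name, (last, total) in runs.items()}
for name, pct in shares.items():
    print(f"{name}: final deletion took {pct:.1f}% of the run")
# prints:
# local reverse: final deletion took 25.7% of the run
# NFS reverse: final deletion took 50.8% of the run
```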
I have no idea what all of this means in the big picture, but I definitely will be deleting snapshots from oldest to newest from now on.
And I'll be making it a higher priority to not keep 6+ months' worth of snapshots on a VM. That was totally an accident and I plan to be more diligent in the future!