The Hassle (and Joy) of Snapshots in a ZFS Cluster

Asano

I’ve been running a fairly active but small budget Proxmox cluster for about a year now. Depending on the current project workload it consists of only 3-4 nodes, and it uses ZFS because its replication capabilities allow nearly HA-like failover times (5 to 10 minutes in my case) at a budget far lower than a setup with real HA storage would require. And honestly, this is great!

Obviously I’d like to use the other ZFS advantages as much as possible too. Besides replication, that mainly means snapshots. Along the way I used various scripts that talk to the Proxmox API or to ZFS directly, as well as eve4pve-autosnap, and came across some Proxmox-induced problems which I’d like to share here for other people in the same situation, and for the Proxmox team in case they are interested.

Problem
In a setup like the one described above there is basically only one ‘real’ problem regarding snapshots: Proxmox snapshots are not reliably stable. They can and will fail for various reasons, and when they do they always leave behind a locked VM with a messed-up snapshot state.

This only happens with a chance of 0.05% to 0.1% when taking (or deleting) a snapshot, which sounds like a small number. However, it is not… In my cluster I have about 30 VMs, and since this is ZFS and snapshots are extremely cheap in every resource, I take automated snapshots of every VM every hour. That adds up to over 5,000 snapshots a week. When doing this through the Proxmox API, hardly a week goes by without at least 1 or 2 VMs hitting a “bad snapshot”. This is bad because it means no further snapshots will be taken and, even worse, backups will be skipped as well due to the locked VM.

Regarding the reasons why this happens, I have to speculate a bit. The easiest to explain, but also least frequent, one I’ve seen is a “dataset busy” error. Since replication is enabled and the cluster is fairly active, I can imagine this happening when a snapshot coincides badly with a replication run. Most of the time, however, I saw errors from which I couldn’t derive a good reason, and errors during `delete` were more frequent than during `prepare`. I had the feeling that the error chance was higher during the cluster backup, but that’s also just a feeling.

What could Proxmox do better
  • In case snapshot creation or deletion fails, the API should perform an automatic rollback. The manual process today is quite tedious: check whether the Proxmox snapshot still has a corresponding ZFS snapshot and, if so, destroy it; delete the snapshot from the VM’s config file; unlock the VM (see the sketch after this list).

  • Reduce the actual problems which lead to bad snapshots.

  • [Nice-to-have] Add an ‘exclude from snapshot’ feature like ‘exclude from backup’ and ‘exclude from replication’, since if a VM has a disk on another storage, for example in qcow2, you most definitely don’t want to automatically snapshot it every hour or so.

  • [Nice-to-have] Add a built-in and maintained snapshot schedule.

  • [Nice-to-have] Add an easy way (preferably per API) to boot up a VM clone from a snapshot so that we could automate stuff like backups from within a VM more easily.
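
For reference, the manual cleanup mentioned in the first bullet looks roughly like this. This is a hedged sketch: the VM ID 100 and the snapshot name `auto-1` are placeholders, and the config edit is manual.

Code:
#!/bin/bash
# 1. Check whether the failed Proxmox snapshot still has a corresponding
#    ZFS snapshot (-d 1 lists only this dataset's snapshots) ...
zfs list -t snapshot -d 1 rpool/data/vm-100-disk-1

# 2. ... and if so, destroy it.
zfs destroy rpool/data/vm-100-disk-1@auto-1

# 3. Remove the [auto-1] section (and any parent/snapstate lines that
#    refer to it) from the VM config, e.g. with an editor:
nano /etc/pve/qemu-server/100.conf

# 4. Unlock the VM so snapshots and backups can run again.
qm unlock 100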

What others/we can do until Proxmox maybe improves stuff
  • Do not use the Proxmox API for automated snapshots (and obviously also don’t use programs like eve4pve-autosnap, which also use the Proxmox API)!

    Instead, take automated snapshots directly via ZFS. In case of an error like “dataset busy”, the only thing that happens is that one cycle is skipped and no further harm is done. By their nature ZFS snapshots are crash-consistent, and even VMs with quite active DBs should be able to recover automatically via the DBMS’s crash recovery, even if the VM is totally unaware of the snapshot being taken. I’m no expert here, so I can’t say for sure, but I have used a few MongoDB and MySQL snapshots created in this manner and they always recovered without me doing anything special.

  • My solution, which I now use for all VMs (and have used on a selected few for some months to compare it with the Proxmox API approaches), is a cronjob in a container which is able to `ssh root@proxmox-nodes` and runs the following scripts:

    cron-task.sh
    Code:
    #!/bin/bash
    ssh root@10.0.1.1 'bash -s' < /home/user/snap-vms.sh
    ssh root@10.0.1.2 'bash -s' < /home/user/snap-vms.sh
    ssh root@10.0.1.3 'bash -s' < /home/user/snap-vms.sh
    # and/or whatever IPs your cluster nodes have

    snap-vms.sh
    Code:
    #!/bin/bash
    
    declare -a normal=(
    "vm-100-disk-1"
    "vm-101-disk-1"
    "subvol-102-disk-1"
    # and whatever further disks you want to include
    )
    
    declare -a short=(
    "vm-103-disk-1"
    # and whatever further disks you want to include
    )
    
    # for normal, 6 snapshots are kept
    for i in "${normal[@]}"
    do
    	if zfs list -t all | grep -q 'rpool/data/'"$i"; then
    		if zfs list -t all | grep -q 'rpool/data/'"$i"'@auto-6'; then
    			zfs destroy rpool/data/"$i"@auto-6
    		fi
    		zfs rename rpool/data/"$i"@auto-5 rpool/data/"$i"@auto-6
    		zfs rename rpool/data/"$i"@auto-4 rpool/data/"$i"@auto-5
    		zfs rename rpool/data/"$i"@auto-3 rpool/data/"$i"@auto-4
    		zfs rename rpool/data/"$i"@auto-2 rpool/data/"$i"@auto-3
    		zfs rename rpool/data/"$i"@auto-1 rpool/data/"$i"@auto-2
    		zfs snap rpool/data/"$i"@auto-1
    	fi
    done
    
    # for short 3 snapshots are kept
    for i in "${short[@]}"
    do
    	if zfs list -t all | grep -q 'rpool/data/'"$i"; then
    		if zfs list -t all | grep -q 'rpool/data/'"$i"'@auto-3'; then
    			zfs destroy rpool/data/"$i"@auto-3
    		fi
    		zfs rename rpool/data/"$i"@auto-2 rpool/data/"$i"@auto-3
    		zfs rename rpool/data/"$i"@auto-1 rpool/data/"$i"@auto-2
    		zfs snap rpool/data/"$i"@auto-1
    	fi
    done
 
Reduce the actual problems which lead to bad snapshots.

This is mainly your hardware (please elaborate more on this). Obviously, with an increased number of snapshots, simple `zfs list` commands will slow down if not all metadata is directly available in memory. The internal mechanisms in PVE tolerate a specific timeout, and if an operation exceeds it, the PVE-internal operation fails. This is just a simple asynchronous operation on a busy dataset, as you already described yourself. Besides waiting longer, there is nothing PVE can do here. On our pool of roughly 50k snapshots, a simple `zfs list -t all` takes minutes.
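
As an aside, one way to keep such listings fast when a script only needs one dataset (a hedged example; the dataset name is a placeholder) is to limit the recursion instead of walking the whole pool:

Code:
# Walks every dataset and snapshot in the pool - slow with tens of
# thousands of snapshots:
zfs list -t all

# Lists only this dataset's snapshots (-d 1 limits recursion depth),
# which stays fast regardless of how many snapshots the pool holds:
zfs list -t snapshot -d 1 -o name rpool/data/vm-100-disk-1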

I have a similar setup to yours, but use snapshot names based on time, so that you can simply delete all quarter-hourly snapshots while keeping the ones at full hours, then at days, weeks, months, etc. This involves much less renaming, and if something fails I always know what I can delete. I also built in a mechanism to determine whether snapshots can be deleted further if nothing has changed. This is obviously much simpler for LX(C) containers thanks to the `zfs diff` command, but it can also be done by inspecting the `zfs send`/`receive` difference of zvols.
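
A minimal sketch of that time-based scheme (my own illustration, not the exact script used here; the dataset, naming pattern, and retention are assumptions):

Code:
#!/bin/bash
ds="rpool/data/vm-100-disk-1"

# Take a quarter-hourly snapshot named after the current time, e.g.
# rpool/data/vm-100-disk-1@auto-2019-01-21_11-15
zfs snapshot "$ds@auto-$(date +%Y-%m-%d_%H-%M)"

# Prune: drop quarter-hourly snapshots older than one day, but keep the
# ones taken at the full hour (minute == 00). No renaming chain needed,
# and a failed run just leaves one extra snapshot behind.
cutoff=$(date -d '1 day ago' +%Y-%m-%d_%H-%M)
zfs list -H -t snapshot -d 1 -o name "$ds" | while read -r snap; do
	stamp=${snap##*@auto-}
	[ "$stamp" = "$snap" ] && continue   # not one of our auto snapshots
	minute=${stamp##*-}
	if [[ "$stamp" < "$cutoff" && "$minute" != "00" ]]; then
		zfs destroy "$snap"
	fi
done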

You can also snapshot your whole pool with one command (`zfs snapshot -r rpool@2019-01-21_11-11-11`), which is also very nice. Based on the quarter-hourly snapshots, I implemented replication myself, so that my own snapshots do not interfere with pve-zsync.
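
The replication part can be as simple as an incremental `zfs send` piped into `zfs receive` on the other node; a hedged sketch (host name, dataset, and snapshot names are placeholders, not the actual setup):

Code:
# Assuming the older snapshot already exists on the target node, send
# only the delta between it and the newest snapshot:
zfs send -i rpool/data/vm-100-disk-1@2019-01-21_11-00-00 \
	rpool/data/vm-100-disk-1@2019-01-21_11-15-00 \
	| ssh root@target-node zfs receive -F rpool/replica/vm-100-disk-1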

I'm not saying that pve-zsync is bad, but it does not play well with other tools involved in snapshot creation. It does its job admirably, but only if you don't intervene.
 
My solution, which I now use for all VMs (and have used on a selected few for some months to compare it with the Proxmox API approaches), is a cronjob in a container which is able to `ssh root@proxmox-nodes` and runs the following scripts:

Hello @Asano,

Maybe I have another idea ;) or maybe not ;)

Instead of defining in your own scripts which datasets you want to snapshot and how many snapshots you want to keep,
you can create, for ANY (or some) of your datasets, 2 custom dataset properties (ZFS user properties, whose names must contain a ':', e.g. a `custom:` prefix):
1. custom:snap_enable, with 2 possible values: ON / OFF
2. custom:snap_count, with a numeric value X (so X=3 for a maximum of 3 snapshots that you want to keep)
... and maybe others if you want (VM/CT id, node_creation_id, orphan_vdisk)

Now, on any of your hosts, after you create these custom ZFS properties, you can run a cron job which will walk all of your datasets and read the snap_enable property (if it is ON, take a snapshot, and after the snapshot is finished, check snap_count and prune accordingly). You could also consider deleting all snapshots of a dataset that has changed from snap_enable=ON to snap_enable=OFF (up to you, maybe with a new property like default_snapshots_off = {0, 3, ...}). And ZFS user properties can be used just like any other ZFS property!

So with this "feature" you exclude any CT/VM (or only one vdisk of a multi-vdisk CT/VM) that you do not care about. At any time you can change the desired ZFS properties as you want, so you do not need to touch your master snapshot script. Also, if you move your ZFS pool to a new PMX node, you do not need to modify the master snapshot script.


You can use the same idea to trigger a replication!

zfs set custom:snap_enable=ON rpool/{vm/ct vdisk}
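
A minimal cron-job sketch of this property-driven idea (my reading of it; the property names, pool layout, and retention logic are assumptions):

Code:
#!/bin/bash
# Walk all datasets under rpool/data; snapshot those that have opted in
# via the custom:snap_enable user property, then trim old snapshots to
# the per-dataset custom:snap_count limit.
zfs list -H -o name -r rpool/data | while read -r ds; do
	enabled=$(zfs get -H -o value custom:snap_enable "$ds")
	[ "$enabled" = "ON" ] || continue

	zfs snapshot "$ds@auto-$(date +%Y-%m-%d_%H-%M)"

	keep=$(zfs get -H -o value custom:snap_count "$ds")
	[[ "$keep" =~ ^[0-9]+$ ]] || continue   # no count set: keep everything

	# Newest first (-S creation); destroy everything past the limit.
	zfs list -H -t snapshot -d 1 -o name -S creation "$ds" \
		| grep '@auto-' | tail -n +$((keep + 1)) \
		| xargs -r -n1 zfs destroy
done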

Good luck!
 
This is mainly your hardware

I'm not so sure about the "mainly" part here. I logged the output of my shell scripts that used `zfs` directly for a long time as well, and the only "unexpected" errors I ever came across were the "dataset busy" errors I also mentioned above. But those were super rare, and when they happened with `qm snapshot` they always led to a messed-up snapshot state in `prepare`. The big majority of messed-up states, however, were in `delete` - and I have no idea why a `qm delsnapshot` would get stuck, since `zfs destroy` never had a problem. So maybe there is room for improvement on PVE's end as well? But I really don't know exactly what PVE is doing, so maybe you are entirely right on that point...

Besides waiting longer, there is nothing PVE can do here.

Well, cleaning up after the timeout would be a nice thing to do, would it not ;-)
I agree with you that there are always hardware/usage-induced errors which PVE cannot prevent, but those do happen in the real world, so handling them should be addressed.


I was also looking at `pve-zsync` as an option, but although it says it can run alongside PVE storage replication, I had the feeling it's not so well integrated. However, building a custom snapshot schedule with the current Proxmox version is definitely possible with it.

@guletz using ZFS properties also sounds like a great tool - I didn't even know about them until now (-:
 
