Fiona ... How is Backup Fleecing Going?

Apr 27, 2024
Sorry to bug you. There are so many people waiting for this feature.
Can you give us some crumbs?

--------------------------------------------------------------------------
See here.
https://lists.proxmox.com/pipermail/pve-devel/2024-January/061470.html

When a backup for a VM is started, QEMU will install a
"copy-before-write" filter in its block layer. This filter ensures
that upon new guest writes, old data still needed for the backup is
sent to the backup target first. The guest write blocks until this
operation is finished so guest IO to not-yet-backed-up sectors will be
limited by the speed of the backup target.

With backup fleecing, such old data is cached in a fleecing image
rather than sent directly to the backup target. This can help guest IO
performance and even prevent hangs in certain scenarios, at the cost
of requiring more storage space.
 
That one link is the only bit of actual info I've found.
I'd love to get anything else.
Directions on how to set it up would be lovely, but anything, really.
--------------------------------------------------------------------------
It's actually widely mentioned in all the changelogs ;-)

https://www.proxmox.com/en/about/press-releases/proxmox-virtual-environment-8-2
https://pve.proxmox.com/wiki/Roadmap#Proxmox_VE_8.2

And in the docs, of course:
https://pve.proxmox.com/pve-docs/chapter-vzdump.html#_vm_backup_fleecing

But yes, how to set it up is not well described at all; only the Roadmap mentions it, in one sentence:
Fleecing can be configured for a datacenter-wide backup job in the GUI, and be used through the CLI and API.
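
For the CLI route, the vzdump fleecing option looks like this (a sketch based on the vzdump docs; the storage name is just an example, check `man vzdump` on your version):

```
# One-off backup of VM 100 with fleecing enabled, caching on storage "local-zfs"
# (VM ID and storage name are examples):
vzdump 100 --fleecing enabled=1,storage=local-zfs
```

The same option can be set node-wide in /etc/vzdump.conf as `fleecing: enabled=1,storage=local-zfs`, or per backup job in the GUI's Advanced tab.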

The docs describe it in general terms but fail to mention where to configure it.
There is a small screenshot showing it in the backup job's advanced settings, but no text pointing it out!
https://pve.proxmox.com/pve-docs/images/screenshot/gui-cluster-backup-edit-04-advanced.png

That is right next to the last paragraph in https://pve.proxmox.com/pve-docs/chapter-vzdump.html#vzdump_jobs
But while the screenshot was updated for fleecing, the text was not.
Code:
There are a few settings for tuning backup performance (some of which are exposed in the Advanced tab in the UI). The most notable is bwlimit for limiting IO bandwidth. The amount of threads used for the compressor can be controlled with the pigz (replacing gzip), respectively, zstd setting. Furthermore, there are ionice (when the BFQ scheduler is used) and, as part of the performance setting, max-workers (affects VM backups only) and pbs-entries-max (affects container backups only). See the configuration options for details.
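
For reference, those settings go in /etc/vzdump.conf as node-wide defaults (values below are illustrative placeholders, not recommendations):

```
# /etc/vzdump.conf -- illustrative values, not recommendations
bwlimit: 100000                # I/O bandwidth limit for backups
ionice: 7                      # only effective with the BFQ scheduler
pigz: 4                        # use pigz instead of gzip, with this many threads
zstd: 4                        # zstd compressor threads
performance: max-workers=8,pbs-entries-max=1048576
```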
 
--------------------------------------------------------------------------
At that time, it was sort of a .0 release for Proxmox.
Fiona says that fleecing itself is part of the qm spec.
So this is their implementation of it.

TBH ... it was not great. But now it's better. Somewhat.

The first version of fleecing met the immediate need, which was that VMs could be unusable during backups.
They behaved much better with fleecing turned on.

It also had two critical bugs that combined in an unfortunate way.
  • The first bug: if the ZFS system didn't respond within 5 ms to the command to delete the fleecing cache file, the file was left in place.
    • This has been resolved in the latest version.
  • The second bug: the fleecing implementation doesn't account for the possibility of defunct cache files sitting there, so the next time the backup runs and tries to save to the same filename, it chokes with a bad-filename error.
    • This is not resolved. I've heard better garbage-management strategies are in development, but I guess there has been some debate about the best way to deal with the trash. In the meantime, backups still fail if they encounter garbage from previous failures. Clean it up yourself for now.
 
--------------------------------------------------------------------------
Regarding the second bug: it also happens if a backup fails in the middle of a run (crash, server issues, etc.). Is there a command to make sure the fleecing cache is flushed to disk, or how can we safely remove/delete the dangling disk?
 
--------------------------------------------------------------------------
Heh. Dangling disk. I like it.
Yes. Anything that causes the backup to fail and not clean up properly will cause the fleecing cache file to be left in place.
Subsequent backups will fail until you fix it.

The fix is annoying.

Part 1
The VM has the cache file locked until you either reboot the VM or (it's weird that this works) rename the file.
So bounce the affected VM or script something.

Part 2
If you did not rename the file, it's associated with the VM, and you can't delete it from the datastore GUI.
You also won't see it on the VM's config page. Open the console.
Code:
zfs list        # get the name ... it's gonna say something about fleecing
zfs destroy thefleecingthing
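
Sketching the two parts as a script, with simulated `zfs list` output (all dataset names here are hypothetical; on a real host, run `zfs list` itself and substitute the real names):

```shell
# Simulated `zfs list -o name -H` output; on a real host, run the command
# instead. Dataset names are hypothetical.
cat <<'EOF' > /tmp/zfs-datasets.txt
rpool/data/vm-100-disk-0
rpool/data/vm-100-fleecing-0
rpool/data/vm-101-disk-0
EOF

# Part 2: find the leftover fleecing image(s).
grep -i fleecing /tmp/zfs-datasets.txt

# Part 1 + cleanup, on a real host (renaming first releases the VM's hold
# on the image without a reboot, per the workaround above):
#   zfs rename rpool/data/vm-100-fleecing-0 rpool/data/stale-fleecing-0
#   zfs destroy rpool/data/stale-fleecing-0
```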
 