Fiona ... How is Backup Fleecing Going?

Apr 27, 2024
Sorry to bug you. There are so many people waiting for this feature.
Can you give us some crumbs?

--------------------------------------------------------------------------
See here.
https://lists.proxmox.com/pipermail/pve-devel/2024-January/061470.html

When a backup for a VM is started, QEMU will install a
"copy-before-write" filter in its block layer. This filter ensures
that upon new guest writes, old data still needed for the backup is
sent to the backup target first. The guest write blocks until this
operation is finished so guest IO to not-yet-backed-up sectors will be
limited by the speed of the backup target.

With backup fleecing, such old data is cached in a fleecing image
rather than sent directly to the backup target. This can help guest IO
performance and even prevent hangs in certain scenarios, at the cost
of requiring more storage space.
 
That one link is the only bit of actual info I've found.
I'd love to get anything else.
Directions on how to set it up would be lovely, but anything, really.

It's actually widely mentioned in all the changelogs ;-)

https://www.proxmox.com/en/about/press-releases/proxmox-virtual-environment-8-2
https://pve.proxmox.com/wiki/Roadmap#Proxmox_VE_8.2

And in the docs, of course:
https://pve.proxmox.com/pve-docs/chapter-vzdump.html#_vm_backup_fleecing

But yes, how to set it up is not well described at all - only the Roadmap mentions it in one sentence:
Fleecing can be configured for a datacenter-wide backup job in the GUI, and be used through the CLI and API.
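For anyone who just wants the CLI route: vzdump grew a fleecing option in 8.2. A minimal sketch (the VM ID and storage names are placeholders, double-check the exact syntax in `man vzdump` on your version):

Code:
# one-off backup of VM 100 with fleecing enabled,
# using "local-zfs" to hold the temporary fleecing image
# and "pbs" as the actual backup target (both names are examples)
vzdump 100 --storage pbs --fleecing enabled=1,storage=local-zfs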

The docs generally describe it, but fail to mention where to configure it.
There is a small screenshot showing it in the backup job advanced settings, but no text indicating it!
https://pve.proxmox.com/pve-docs/images/screenshot/gui-cluster-backup-edit-04-advanced.png

That is right next to the last paragraph in https://pve.proxmox.com/pve-docs/chapter-vzdump.html#vzdump_jobs.
But while the screenshot was updated for fleecing, the text was not:
Code:
There are a few settings for tuning backup performance (some of which are exposed in the Advanced tab in the UI). The most notable is bwlimit for limiting IO bandwidth. The amount of threads used for the compressor can be controlled with the pigz (replacing gzip), respectively, zstd setting. Furthermore, there are ionice (when the BFQ scheduler is used) and, as part of the performance setting, max-workers (affects VM backups only) and pbs-entries-max (affects container backups only). See the configuration options for details.
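Those same knobs can also be set node-wide in /etc/vzdump.conf instead of per job; a rough sketch with example values (the exact keys and units are in `man vzdump.conf`):

Code:
# /etc/vzdump.conf -- node-wide defaults, values below are only examples
# I/O bandwidth limit in KiB/s (~50 MiB/s)
bwlimit: 51200
# zstd compressor threads (0 would mean half of the available cores)
zstd: 4
# I/O priority, only effective with the BFQ scheduler
ionice: 7
# parallel workers for VM backups
performance: max-workers=8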
 
At that time, it was sort of a .0 release for Proxmox.
Fiona says that fleecing itself is part of the qm spec.
So this is their implementation of it.

TBH ... it was not great. But now it's better. Somewhat.

The first version of fleecing addressed the immediate need, which was that VMs could be unusable during backups.
They behaved much better with fleecing turned on.

It also had two critical bugs that combined in an unfortunate way.
  • The first bug was that if the ZFS system didn't respond within 5ms to the command to delete the fleecing cache file, the file was left in place.
    • This has been resolved in the latest version.
  • The second bug is that the fleecing implementation doesn't know about the possibility of defunct cache files sitting there, and just chokes with a bad-filename error the next time the backup runs and tries to save the same filename.
    • This is not resolved. I've heard of better garbage management strategies in development, but I guess they had some debate about the best way to deal with the trash. In the meantime, backups still fail if they encounter garbage from previous failures. Clean them up yourself for now; a quick way to spot leftovers is shown below.
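Assuming ZFS-backed fleecing storage, a quick check for leftovers from a failed run looks roughly like this (names are illustrative; the leftover volumes have "fleece" in their name):

Code:
# list ZFS datasets and look for leftover fleecing images
zfs list -o name | grep -i fleece
# no output = you're clean; a hit like rpool/data/vm-100-fleece-0
# means the next backup of that VM will fail until it is removed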
 
Regarding the second bug: it also happens if a backup fails in the middle of a run (crash, server issues, etc.). Is there a command to make sure the fleecing cache is flushed to disk, or how can we safely remove/delete the dangling disk?
 
Heh. Dangling disk. I like it.
Yes. Anything that causes the backup to fail and not clean up properly will cause the fleecing cache file to be left in place.
Subsequent backups will fail until you fix it.

The fix is annoying.

Part 1
The VM has the cache file locked until you either reboot the VM or (it's weird that this works) rename the file.
So bounce the affected VM or script something.

Part 2
If you did not rename the file, it's associated with the VM, and you can't delete it from the datastore GUI.
You also won't see it on the VM's config page. Open the console.
zfs list ... get the name. It's gonna say something about fleecing.
zfs destroy thefleecingthing
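So, roughly (the dataset name is only an example; use whatever zfs list actually shows on your pool):

Code:
# find the leftover fleecing volume; its name mentions fleecing
zfs list -o name | grep -i fleece
# e.g. rpool/data/vm-100-fleece-0   (illustrative)

# once the VM has been bounced (or the volume renamed) so it is no
# longer held open, destroy it
zfs destroy rpool/data/vm-100-fleece-0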
 
Hello,

I would like to give my feedback here in the hope it is useful for others too.

I have fleecing enabled, and in the General tab the backup mode is set to "stop". What happens when backing up virtual machines is (I guess):

1. The VM shuts down.
2. Fleecing is "started" so new writes go to the temporary storage.
3. The backup is started from the VM storage.
4. The VM is started again while the backup is still running.

In my case, both the VM storage and the VM backup are on the same NFS server, separate from the server running Proxmox. This is an important detail. The Proxmox server and the storage server are on a separate physical network.
That network goes through a 10G switch (a managed Cisco switch).
Backup speed is limited to 30 Mbps.
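In CLI terms, such a job roughly corresponds to something like the following (the VM ID, storage names and the bandwidth limit in KiB/s are placeholders, not my real values):

Code:
# stop-mode backup of VM 100 to an NFS-backed storage, with fleecing
# on a local storage and a bandwidth cap (~30 Mbit/s here)
vzdump 100 --mode stop --storage nfs-backup \
    --fleecing enabled=1,storage=local-zfs --bwlimit 3750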

The amount of time the VM (and the service) is stopped has been reduced to just a few seconds (10 to 20 seconds).
BUT sometimes (randomly) the VM does not start properly due to network timeouts and has to be manually restarted later.

I still don't know if it is possible to delay the backup process for a while, so the VM has some time to start correctly before the backup consumes the full available bandwidth.

Hope this feedback is useful to someone.

Best
Ignacio
 
If your network is congested to the point of packet loss (presuming you're not using the same server for both backups and images), then I would suggest looking at some kind of priority queue in your network stack, either on the Proxmox side or on the network switch/router. And I would not dynamically increase your backup speed, because a timeout/packet loss while your VM is operational would probably be worse than a slow backup.

Ideally though you would get a better network fabric.

However, I'm confused: you said it's the same NFS server, so you're not making backups, you're just making copies. It could also be that your NFS server has issues servicing that many requests, and that is another issue altogether.
 
Hi,
However, I'm confused: you said it's the same NFS server, so you're not making backups, you're just making copies
no, Proxmox VE needs to read the data via QEMU's block layer and then write the data to the backup target, so in fact it will go over the network twice! Using the same storage to store data and backups is not really recommended in any case.
 
Hi,

no, Proxmox VE needs to read the data via QEMU's block layer and then write the data to the backup target, so in fact it will go over the network twice! Using the same storage to store data and backups is not really recommended in any case.
Yes, but it's not considered a backup if source and target are the same. And that may be causing the issue: if your storage can't handle the read/write load, that can cause problems even if the network is fine.
 
Yes, but it's not considered a backup if source and target are the same. And that may be causing the issue: if your storage can't handle the read/write load, that can cause problems even if the network is fine.
Ah, reading your sentence again, I see how it was meant now :)