Fleecing - "Orphaned" fleece files

Indy-Admin

New Member
May 9, 2024
Hello all,
So I was testing the new fleecing function in my home lab.

During a backup operation of a large VM, I was forced to hit "stop" in the middle of the backup job because the VM itself was performing some unanticipated write operations that were ultimately going to fill up the storage before the backup job could finish.

The backup job terminated but left the fleecing files in place, and they are consuming significant disk capacity. I can see them under local-lvm -> VM Disks, but cannot remove them from there.

Is there a command-line option to detach and delete these fleecing disk files from the VM? The feature seems to be so new that I can't find any existing info.
 
Which kind of storage was the fleecing image put on?
Can you post the task log? (I could not reproduce an orphaned fleecing image here.)


You can remove volumes with

Code:
pvesm free <volume>

so that should also work for the fleecing image
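
For example, a rough sketch of how that could look (the storage name local-lvm and the exact volume name here are assumptions; check the output of pvesm list for the actual names):

Code:
pvesm list local-lvm | grep fleece      # look for leftover fleecing images
pvesm free local-lvm:vm-106-fleece-0    # free one of them (example volume ID)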
 
Here is the portion of the log covering the affected VM. Fleecing was pointed at the first of two LVM-Thin volumes. Storage is an SSD mirror on top of a hardware Dell RAID controller. I cannot remember what filesystem I chose, but it was not ZFS.

INFO: Starting Backup of VM 106 (qemu)
INFO: Backup started at 2024-05-08 07:54:12
INFO: status = running
INFO: VM Name: GCRFS02
INFO: include disk 'scsi1' 'local-lvm:vm-106-disk-1' 200G
INFO: include disk 'scsi2' 'SATA:vm-106-disk-2' 3584G
INFO: include disk 'scsi3' 'SATA:vm-106-disk-3' 500G
INFO: include disk 'efidisk0' 'local-lvm:vm-106-disk-0' 4M
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: skip unused drive 'SATA:vm-106-disk-0' (not included into backup)
INFO: creating Proxmox Backup Server archive 'vm/106/2024-05-08T11:54:12Z'
WARNING: You have not turned on protection against thin pools running out of space.
WARNING: Set activation/thin_pool_autoextend_threshold below 100 to trigger automatic extension of thin pools before they get full.
Logical volume "vm-106-fleece-0" created.
WARNING: Sum of all thin volume sizes (<2.89 TiB) exceeds the size of thin pool pve/data and the size of whole volume group (1.86 TiB).
WARNING: You have not turned on protection against thin pools running out of space.
WARNING: Set activation/thin_pool_autoextend_threshold below 100 to trigger automatic extension of thin pools before they get full.
Logical volume "vm-106-fleece-1" created.
WARNING: Sum of all thin volume sizes (<6.39 TiB) exceeds the size of thin pool pve/data and the size of whole volume group (1.86 TiB).
WARNING: You have not turned on protection against thin pools running out of space.
WARNING: Set activation/thin_pool_autoextend_threshold below 100 to trigger automatic extension of thin pools before they get full.
Logical volume "vm-106-fleece-2" created.
WARNING: Sum of all thin volume sizes (<6.88 TiB) exceeds the size of thin pool pve/data and the size of whole volume group (1.86 TiB).
INFO: drive-scsi1: attaching fleecing image local-lvm:vm-106-fleece-0 to QEMU
INFO: drive-scsi2: attaching fleecing image local-lvm:vm-106-fleece-1 to QEMU
INFO: drive-scsi3: attaching fleecing image local-lvm:vm-106-fleece-2 to QEMU
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '2196b3b6-1279-452d-be59-4d77df1a6bf8'
INFO: resuming VM again
INFO: efidisk0: dirty-bitmap status: created new
INFO: scsi1: dirty-bitmap status: created new
INFO: scsi2: dirty-bitmap status: created new
INFO: scsi3: dirty-bitmap status: created new
INFO: 0% (3.7 GiB of 4.2 TiB) in 3s, read: 1.2 GiB/s, write: 32.0 MiB/s
INFO: 1% (43.0 GiB of 4.2 TiB) in 41s, read: 1.0 GiB/s, write: 538.9 KiB/s
INFO: 2% (86.9 GiB of 4.2 TiB) in 1m 16s, read: 1.3 GiB/s, write: 3.2 MiB/s
INFO: 3% (129.3 GiB of 4.2 TiB) in 1m 52s, read: 1.2 GiB/s, write: 4.3 MiB/s
INFO: 4% (172.8 GiB of 4.2 TiB) in 2m 27s, read: 1.2 GiB/s, write: 4.9 MiB/s
INFO: 5% (214.9 GiB of 4.2 TiB) in 3m 2s, read: 1.2 GiB/s, write: 4.8 MiB/s
INFO: 6% (257.3 GiB of 4.2 TiB) in 3m 39s, read: 1.1 GiB/s, write: 4.6 MiB/s
INFO: 7% (300.7 GiB of 4.2 TiB) in 4m 17s, read: 1.1 GiB/s, write: 4.5 MiB/s
INFO: 8% (343.5 GiB of 4.2 TiB) in 4m 53s, read: 1.2 GiB/s, write: 4.8 MiB/s
INFO: 9% (386.6 GiB of 4.2 TiB) in 5m 30s, read: 1.2 GiB/s, write: 4.6 MiB/s
INFO: 10% (429.1 GiB of 4.2 TiB) in 6m 8s, read: 1.1 GiB/s, write: 4.5 MiB/s
INFO: 11% (472.1 GiB of 4.2 TiB) in 6m 44s, read: 1.2 GiB/s, write: 4.7 MiB/s
INFO: 12% (514.1 GiB of 4.2 TiB) in 12m 15s, read: 129.9 MiB/s, write: 43.1 MiB/s
INFO: 13% (556.9 GiB of 4.2 TiB) in 26m 22s, read: 51.8 MiB/s, write: 51.4 MiB/s
INFO: 14% (599.8 GiB of 4.2 TiB) in 39m 13s, read: 56.9 MiB/s, write: 56.7 MiB/s
INFO: 15% (642.6 GiB of 4.2 TiB) in 52m 32s, read: 54.9 MiB/s, write: 54.8 MiB/s
INFO: 16% (685.5 GiB of 4.2 TiB) in 1h 6m 15s, read: 53.3 MiB/s, write: 53.3 MiB/s
INFO: 17% (728.3 GiB of 4.2 TiB) in 1h 20m 47s, read: 50.3 MiB/s, write: 50.2 MiB/s
INFO: 18% (771.2 GiB of 4.2 TiB) in 1h 36m 6s, read: 47.8 MiB/s, write: 47.2 MiB/s
INFO: 19% (814.0 GiB of 4.2 TiB) in 1h 50m 43s, read: 50.1 MiB/s, write: 49.6 MiB/s
INFO: 20% (856.8 GiB of 4.2 TiB) in 2h 3m 58s, read: 55.1 MiB/s, write: 53.7 MiB/s
INFO: 21% (899.7 GiB of 4.2 TiB) in 2h 18m 39s, read: 49.8 MiB/s, write: 48.1 MiB/s
INFO: 22% (942.5 GiB of 4.2 TiB) in 2h 34m 57s, read: 44.8 MiB/s, write: 44.1 MiB/s
INFO: 23% (985.4 GiB of 4.2 TiB) in 2h 50m 55s, read: 45.8 MiB/s, write: 45.4 MiB/s
INFO: 24% (1.0 TiB of 4.2 TiB) in 3h 6m 44s, read: 46.2 MiB/s, write: 45.5 MiB/s
INFO: 25% (1.0 TiB of 4.2 TiB) in 3h 20m 50s, read: 51.9 MiB/s, write: 51.6 MiB/s
INFO: 26% (1.1 TiB of 4.2 TiB) in 3h 35m 40s, read: 49.3 MiB/s, write: 48.9 MiB/s
INFO: 27% (1.1 TiB of 4.2 TiB) in 3h 49m 53s, read: 51.4 MiB/s, write: 51.1 MiB/s
INFO: 28% (1.2 TiB of 4.2 TiB) in 4h 4m 42s, read: 49.4 MiB/s, write: 49.1 MiB/s
INFO: 29% (1.2 TiB of 4.2 TiB) in 4h 17m 53s, read: 55.5 MiB/s, write: 55.4 MiB/s
INFO: 30% (1.3 TiB of 4.2 TiB) in 4h 32m 59s, read: 48.4 MiB/s, write: 48.0 MiB/s
INFO: 31% (1.3 TiB of 4.2 TiB) in 4h 47m 37s, read: 50.0 MiB/s, write: 49.8 MiB/s
INFO: 32% (1.3 TiB of 4.2 TiB) in 5h 1m 32s, read: 52.5 MiB/s, write: 52.4 MiB/s
INFO: 33% (1.4 TiB of 4.2 TiB) in 5h 16m 26s, read: 49.1 MiB/s, write: 48.7 MiB/s
INFO: 34% (1.4 TiB of 4.2 TiB) in 5h 32m 20s, read: 46.0 MiB/s, write: 45.9 MiB/s
INFO: 35% (1.5 TiB of 4.2 TiB) in 5h 50m 53s, read: 39.4 MiB/s, write: 39.1 MiB/s
INFO: 36% (1.5 TiB of 4.2 TiB) in 6h 10m 37s, read: 37.1 MiB/s, write: 37.0 MiB/s
INFO: 37% (1.5 TiB of 4.2 TiB) in 6h 31m 37s, read: 34.8 MiB/s, write: 34.8 MiB/s
INFO: 38% (1.6 TiB of 4.2 TiB) in 6h 53m 16s, read: 33.8 MiB/s, write: 33.7 MiB/s
INFO: 39% (1.6 TiB of 4.2 TiB) in 7h 14m 49s, read: 33.9 MiB/s, write: 33.9 MiB/s
INFO: 40% (1.7 TiB of 4.2 TiB) in 7h 37m 28s, read: 32.3 MiB/s, write: 32.1 MiB/s
INFO: 41% (1.7 TiB of 4.2 TiB) in 8h 1m 41s, read: 30.2 MiB/s, write: 30.1 MiB/s
INFO: 42% (1.8 TiB of 4.2 TiB) in 8h 23m 49s, read: 33.0 MiB/s, write: 33.0 MiB/s
INFO: 43% (1.8 TiB of 4.2 TiB) in 8h 46m 59s, read: 31.6 MiB/s, write: 30.7 MiB/s
INFO: 44% (1.8 TiB of 4.2 TiB) in 9h 11m 29s, read: 29.8 MiB/s, write: 29.7 MiB/s
INFO: 45% (1.9 TiB of 4.2 TiB) in 9h 34m 55s, read: 31.2 MiB/s, write: 30.9 MiB/s
INFO: 46% (1.9 TiB of 4.2 TiB) in 9h 58m 49s, read: 30.6 MiB/s, write: 30.5 MiB/s
INFO: 47% (2.0 TiB of 4.2 TiB) in 10h 24m 12s, read: 28.8 MiB/s, write: 28.8 MiB/s
INFO: 48% (2.0 TiB of 4.2 TiB) in 10h 49m 15s, read: 29.2 MiB/s, write: 28.7 MiB/s
INFO: 49% (2.0 TiB of 4.2 TiB) in 11h 13m 7s, read: 30.7 MiB/s, write: 30.5 MiB/s
INFO: 50% (2.1 TiB of 4.2 TiB) in 11h 38m 6s, read: 29.3 MiB/s, write: 29.1 MiB/s
INFO: 51% (2.1 TiB of 4.2 TiB) in 12h 4m 18s, read: 27.9 MiB/s, write: 27.7 MiB/s
INFO: 52% (2.2 TiB of 4.2 TiB) in 12h 28m 17s, read: 30.5 MiB/s, write: 30.1 MiB/s
INFO: 53% (2.2 TiB of 4.2 TiB) in 12h 51m 10s, read: 32.0 MiB/s, write: 31.2 MiB/s
ERROR: interrupted by signal
INFO: aborting backup job
INFO: resuming VM again
 
We have a comparable issue here. When the PVE host or PBS crashes for whatever reason during a backup with fleecing enabled, all the fleecing disks stay behind and never get deleted. The next backup for this VM then fails with an "already existing" fleecing disk error.

Sometimes, with qm rescan, the disks are detected and assigned to the VM as unused, so manual deletion is possible. But sometimes they just sit in the storage and block the backup. Deleting them via the GUI is not possible because the VMID exists... weird.
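
A minimal sketch of that rescan-and-unlink path (VMID 106 and the unused0 key are just examples; adjust them to the affected VM):

Code:
qm rescan --vmid 106                  # make the leftover fleecing image show up as an unused disk
qm config 106 | grep unused           # see which unusedN key it was assigned to
qm disk unlink 106 --idlist unused0   # remove the leftover unused disk entry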
 
Sometimes with qm rescan the disks are detected and assigned to the VM as unused
this just saved me a lot of time :)

Sadly I couldn't free the images right away but had to reboot the individual VMs. So for anyone with the same problem, these are the steps I took:

Bash:
qm rescan
store=z_fleecing   # i have an lvmthin which is used for those
pvesm list "$store" | awk 'NR>1 {print $NF}' | uniq | xargs -n 1 qm reboot
pvesm list "$store" | sed '1d;s/ .*//g' | xargs -n 1 pvesm free
That left me with lingering disk entries in the configuration of each VM:
Bash:
root@pve ~ # grep -r unused /etc/pve/qemu-server
/etc/pve/qemu-server/103.conf:unused0: z_fleecing:vm-103-fleece-0
/etc/pve/qemu-server/103.conf:unused1: z_fleecing:vm-103-fleece-1
/etc/pve/qemu-server/103.conf:unused2: z_fleecing:vm-103-fleece-2
/etc/pve/qemu-server/103.conf:unused3: z_fleecing:vm-103-fleece-3
/etc/pve/qemu-server/107.conf:unused0: z_fleecing:vm-107-fleece-0
/etc/pve/qemu-server/104.conf:unused0: z_fleecing:vm-104-fleece-0
/etc/pve/qemu-server/104.conf:unused1: z_fleecing:vm-104-fleece-1
/etc/pve/qemu-server/104.conf:unused2: z_fleecing:vm-104-fleece-2
/etc/pve/qemu-server/102.conf:unused0: z_fleecing:vm-102-fleece-0
/etc/pve/qemu-server/102.conf:unused1: z_fleecing:vm-102-fleece-1
/etc/pve/qemu-server/105.conf:unused0: z_fleecing:vm-105-fleece-0
/etc/pve/qemu-server/106.conf:unused0: z_fleecing:vm-106-fleece-0
/etc/pve/qemu-server/100.conf:unused0: z_fleecing:vm-100-fleece-0
/etc/pve/qemu-server/100.conf:unused1: z_fleecing:vm-100-fleece-1
/etc/pve/qemu-server/100.conf:unused2: z_fleecing:vm-100-fleece-2
And only after that did I learn about qm disk unlink $vmid --idlist unused*.
So I guess, if that ever happens again, the steps I would take would look like this:
Bash:
#!/bin/sh
# NOTICE - THIS WILL REBOOT YOUR VM AUTOMATICALLY!!!!!!
# the storage which is used for fleecing
store='z_fleecing'
qm rescan
# this could probably be done for each vm but that seemed to work so i leave it up here
pvesm list "$store" |sed '1d;s/ .*//g' |xargs -n 1 pvesm free
# get all vms which have a disk associated in $store
# (with a dedicated fleecing store that is easy, but it might get nasty where you use
# the normal images store for fleecing as well, so i revised the whole thing)
for vmid in $(pvesm list "$store" |awk 'NR>1 {print $NF}' |uniq); do
    # looks for disks in configuration where the name begins with ^unused,
    # the storage is $store and end with the match "fleece-NUMBER"
    # e.g. /etc/pve/qemu-server/103.conf:unused0: z_fleecing:vm-103-fleece-0
    idlist="$(qm config "$vmid" |awk -F: -v store="$store" '
        $1 ~ "^unused") &&
        $2 ~ store &&
        $3 ~ "fleece-[[:digit:]]+$"
        {printf"%s,",$1}' |
            sed 's/,$//g')"
    # shellcheck disable=SC2086,SC2236  # NO!
    if [ ! -z $idlist ]; then
        # until the vm isn't shutdown or rebooted the fleece will be locked
        qm reboot "$vmid"
        # wait for the vm to be back online
        while true; do qm agent "$vmid" ping && return ; sleep 5; done
        # temporarily disable protection to be able to remove the unused disks
        qm set "$vmid" --protection 0
        # remove the unused disks
        qm disk unlink "$vmid" --idlist "$idlist"
        # enable protection again
        qm set "$vmid" --protection 1
    fi
done

And because I didn't do it in that particular order, I ended up with this... (yes, I'm a one-liner guy by trade :p)

Bash:
store=z_fleecing; for vm in $(qm list |awk 'NR>1{print$1}'); do idlist=$(qm config $vm |awk -F: -v store=$store '($2 ~ store){printf"%s,",$1}'|sed 's/,$//g'); if [ ! -z $idlist ]; then qm set $vm --protection 0; COPYPASTEqmBLOCKADE disk unlink $vm --idlist $idlist; qm set $vm --protection 1; fi ; done


After all that, I noticed in PBS that there were still jobs/snapshots lingering (a spinning symbol) for all affected VMs, so I additionally ran this:

Bash:
root@pve ~ # pvesm list pbs |awk '($4 == 1)'
pbs:backup/vm/101/2024-12-07T03:43:28Z pbs-vm  backup               1 101
pbs:backup/vm/102/2024-12-07T03:55:50Z pbs-vm  backup               1 102
pbs:backup/vm/103/2024-12-07T04:29:04Z pbs-vm  backup               1 103
pbs:backup/vm/104/2024-12-07T05:22:37Z pbs-vm  backup               1 104
pbs:backup/vm/105/2024-12-07T06:06:01Z pbs-vm  backup               1 105
pbs:backup/vm/106/2024-12-07T06:29:01Z pbs-vm  backup               1 106
pbs:backup/vm/107/2024-12-07T06:52:07Z pbs-vm  backup               1 107
root@pve ~ # pvesm list pbs |awk '($4 == 1) {print $1}' |xargs -n 1 pvesm free
Removed volume 'pbs:backup/vm/101/2024-12-07T03:43:28Z'
Removed volume 'pbs:backup/vm/102/2024-12-07T03:55:50Z'
Removed volume 'pbs:backup/vm/103/2024-12-07T04:29:04Z'
Removed volume 'pbs:backup/vm/104/2024-12-07T05:22:37Z'
Removed volume 'pbs:backup/vm/105/2024-12-07T06:06:01Z'
Removed volume 'pbs:backup/vm/106/2024-12-07T06:29:01Z'
Removed volume 'pbs:backup/vm/107/2024-12-07T06:52:07Z'
 
Gawd. Yes, there's a problem with fleecing files if the backup fails. This is a known issue.
@fiona says there's a fix in the pipeline, but it didn't make it into this new PBS 3.3 version.
But ... I don't even understand what the previous poster was doing. Don't do that.

First, get the list of fleecing files. Unless you have an active backup running, all of these can be nuked.
Code:
zfs list | grep fleece

Nuke 'em.
Code:
zfs destroy {result from above}

And then, surprise! Some of them won't delete. Locked disk.
You have two options.

Rename them. To anything. Maybe something that doesn't have the vmid in it ... ? But anything at all will work to get the backups running again.

or

Power the VM all the way off. It will release the lock on the fleecing disk, allowing you to zfs destroy it.
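
A minimal sketch of both options (the VMID 106 and the dataset path are placeholders; use whatever zfs list returned above):

Code:
# option 1: rename the leftover fleecing dataset out of the way
zfs rename rpool/data/vm-106-fleece-0 rpool/data/old-vm-106-fleece-0

# option 2: power the VM all the way off so the lock is released, then destroy the dataset
qm shutdown 106
zfs destroy rpool/data/vm-106-fleece-0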


Edit ... Obviously this advice applies only to cases where your fleecing files are on ZFS datastores. Although, this has been the case for every instance I've seen reported. I don't know it to be a ZFS-specific issue, and it may well occur on LVM with a different set of quirks. I'm a ZFSer.
 
But ... I don't even understand what the previous poster was doing. Don't do that.
You are free to ask.

Edit ... Obviously this advice applies only to cases where your fleecing files are on ZFS datastores. Although, this has been the case for every instance I've seen reported. I don't know it to be a ZFS-specific issue, and it may well occur on LVM with a different set of quirks. I'm a ZFSer.
is that somewhat related to this? :p
store=z_fleecing # i have an lvmthin which is used for those
Well, I'm a ZFSer at heart too. But this is my first PVE system, I'm used to TrueNAS Core (BSD!) + ESX, and the system PVE runs on hasn't got disks for actual storage, so I installed it with LVM instead of ZFS. Anyhow, the backups worked for a couple of months until, wait for it, PBS 3.3! These lingering fleecing disks were only the result of the backup being botched completely. After much trial and error I have now reverted to 3.2.9-1, set pinning for proxmox-backup-{manager,client}, restarted everything, and it is now working again without a hitch. I have already done multiple snapshot and stop-mode backups with no more issues.
For now I will stick to that version at least for a couple of weeks / until the end of the year, and will try again with a newer release in '25, I guess.
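
For reference, a minimal sketch of what such a pin might look like (the file name is an assumption; the version string is the one mentioned above):

Code:
# /etc/apt/preferences.d/pbs-pin  (hypothetical file name)
Package: proxmox-backup-client proxmox-backup-manager
Pin: version 3.2.9-1
Pin-Priority: 1001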
 
You are free to ask.
Don't care. That was all needlessly complex and likely to send people down the wrong path.

is that somewhat related to this?
Might be. This is a quote from the dev that wrote the feature.


https://forum.proxmox.com/threads/orphan-fleecing-files-make-backups-fail.151061/#post-684318
If the disk did not finish detaching from QEMU yet (i.e. the detach code doesn't wait long enough and the cleanup routine will continue), ZFS will still consider it busy.

Here's the bugzilla. It's ZFS left and right.
https://bugzilla.proxmox.com/show_bug.cgi?id=5440

Again, I don't know this to be ZFS-specific, but after all that wild code you spit out ... well I'll just go with what the Devs say.
 
https://forum.proxmox.com/threads/orphan-fleecing-files-make-backups-fail.151061/#post-684318
If the disk did not finish detaching from QEMU yet (i.e. the detach code doesn't wait long enough and the cleanup routine will continue), ZFS will still consider it busy.

Here's the bugzilla. It's ZFS left and right.
https://bugzilla.proxmox.com/show_bug.cgi?id=5440

Again, I don't know this to be ZFS-specific, but after all that wild code you spit out ... well I'll just go with what the Devs say.
Thanks for the info. And yeah, that all seems ZFS-related, but mine came up because the whole backup process failed with timeouts and left all those fleecing disks behind.
But that is also what this topic is about, is it not?
And all this shell-foo is just to clean up the mess after you are left with stuck fleecing disks, without ever looking at the web GUI or doing something iffy (IMHO) like editing /etc/pve/.../$VMID.conf manually, when you can just use the provided tools.

But after looking at the other issues, I have now completely revised the "script" to make it safe to use, AFAICT. @fabian, could you check that? :p
 
Cool.

Yes, we explore technical topics in detail and tear the vendor's creations apart to inspect them. We do all of that here. There are folks with truly advanced knowledge frequenting this forum.

And we are also the first destination for a vast army of n00b homelabbers. And then there's the random professional that does all this stuff for a living. That's the audience here. A very mixed bag. Personally, I try to NOT leave any code landmines in the forum. I keep it safe, or if not, indicate dangerous-don't-do-this-in-Prod sort of activities as such.

BTW ... Mucking about on the PBS datastore ... that's a fraught activity. Unless you really want to dig into .chunks mechanisms, I've found it better to just learn how it works and then use it effectively. It's a darned interesting topic, but a much deeper rabbit hole than some might expect.
 
AFAIK fleecing volume cleanup has been improved in the meantime, @fiona knows more. Unfinished snapshots on the PBS side are removed once a newer finished one exists (when the group is pruned).
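
For example, a rough sketch of pruning such a group from the client side (the repository string and keep settings are assumptions; adjust to your setup):

Code:
# keep only the last few finished snapshots of VM 106's group
proxmox-backup-client prune vm/106 --keep-last 3 \
    --repository root@pam@pbs.example.com:datastore1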
 
AFAIK fleecing volume cleanup has been improved in the meantime, @fiona knows more..
No, the patches were not applied, as we decided on a bit different approach (track the fleecing images in a special config section rather than an internal-only config option) and I haven't gotten around to finishing this yet.
 
Thanks for the clarification!

@crpb that script looks rather dangerous and broken to me - first it removes all volumes from the storage (if the storage contains anything other than leftover fleecing images, you just lost a lot of data?), and then it queries the storage again (which should now be empty) and loops over the result (which should do nothing?). There are like three variants in your post too, so I am not sure which one I am supposed to look at.
 
