Fleecing - "Orphaned" fleece files

Indy-Admin

New Member
May 9, 2024
Hello all,
So I was testing the new fleecing function in my home lab.

During a backup operation of a large VM, I was forced to hit "stop" in the middle of the backup job because the VM itself was performing some unanticipated write operations that were ultimately going to fill up the storage before the backup job could finish.

The backup job terminated but left the fleecing files in place, and they are consuming significant disk capacity. I can see them under local-lvm -> VM Disks, but cannot remove them from there.

Is there a command-line option to detach and delete these fleecing disk files from the VM? The feature seems to be so new that I can't find any existing info.
 
Which kind of storage was the fleecing image put on?
Can you post the task log? (I could not reproduce an orphaned fleecing image here.)


You can remove volumes with

Code:
pvesm free <volume>

so that should also work for the fleecing image
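
For example, a rough sketch of how that could look (the storage name local-lvm and the exact volume name here are assumptions; check the output of pvesm list for the actual names):

Code:
pvesm list local-lvm | grep fleece      # look for leftover fleecing images
pvesm free local-lvm:vm-106-fleece-0    # free one of them (example volume ID)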
 
Here is the portion of the log covering the affected VM. Fleecing was pointed at the first of two LVM-Thin volumes. Storage is an SSD mirror on top of a hardware Dell RAID controller. I cannot remember what filesystem I chose, but it was not ZFS.

INFO: Starting Backup of VM 106 (qemu)
INFO: Backup started at 2024-05-08 07:54:12
INFO: status = running
INFO: VM Name: GCRFS02
INFO: include disk 'scsi1' 'local-lvm:vm-106-disk-1' 200G
INFO: include disk 'scsi2' 'SATA:vm-106-disk-2' 3584G
INFO: include disk 'scsi3' 'SATA:vm-106-disk-3' 500G
INFO: include disk 'efidisk0' 'local-lvm:vm-106-disk-0' 4M
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: skip unused drive 'SATA:vm-106-disk-0' (not included into backup)
INFO: creating Proxmox Backup Server archive 'vm/106/2024-05-08T11:54:12Z'
WARNING: You have not turned on protection against thin pools running out of space.
WARNING: Set activation/thin_pool_autoextend_threshold below 100 to trigger automatic extension of thin pools before they get full.
Logical volume "vm-106-fleece-0" created.
WARNING: Sum of all thin volume sizes (<2.89 TiB) exceeds the size of thin pool pve/data and the size of whole volume group (1.86 TiB).
WARNING: You have not turned on protection against thin pools running out of space.
WARNING: Set activation/thin_pool_autoextend_threshold below 100 to trigger automatic extension of thin pools before they get full.
Logical volume "vm-106-fleece-1" created.
WARNING: Sum of all thin volume sizes (<6.39 TiB) exceeds the size of thin pool pve/data and the size of whole volume group (1.86 TiB).
WARNING: You have not turned on protection against thin pools running out of space.
WARNING: Set activation/thin_pool_autoextend_threshold below 100 to trigger automatic extension of thin pools before they get full.
Logical volume "vm-106-fleece-2" created.
WARNING: Sum of all thin volume sizes (<6.88 TiB) exceeds the size of thin pool pve/data and the size of whole volume group (1.86 TiB).
INFO: drive-scsi1: attaching fleecing image local-lvm:vm-106-fleece-0 to QEMU
INFO: drive-scsi2: attaching fleecing image local-lvm:vm-106-fleece-1 to QEMU
INFO: drive-scsi3: attaching fleecing image local-lvm:vm-106-fleece-2 to QEMU
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '2196b3b6-1279-452d-be59-4d77df1a6bf8'
INFO: resuming VM again
INFO: efidisk0: dirty-bitmap status: created new
INFO: scsi1: dirty-bitmap status: created new
INFO: scsi2: dirty-bitmap status: created new
INFO: scsi3: dirty-bitmap status: created new
INFO: 0% (3.7 GiB of 4.2 TiB) in 3s, read: 1.2 GiB/s, write: 32.0 MiB/s
INFO: 1% (43.0 GiB of 4.2 TiB) in 41s, read: 1.0 GiB/s, write: 538.9 KiB/s
INFO: 2% (86.9 GiB of 4.2 TiB) in 1m 16s, read: 1.3 GiB/s, write: 3.2 MiB/s
INFO: 3% (129.3 GiB of 4.2 TiB) in 1m 52s, read: 1.2 GiB/s, write: 4.3 MiB/s
INFO: 4% (172.8 GiB of 4.2 TiB) in 2m 27s, read: 1.2 GiB/s, write: 4.9 MiB/s
INFO: 5% (214.9 GiB of 4.2 TiB) in 3m 2s, read: 1.2 GiB/s, write: 4.8 MiB/s
INFO: 6% (257.3 GiB of 4.2 TiB) in 3m 39s, read: 1.1 GiB/s, write: 4.6 MiB/s
INFO: 7% (300.7 GiB of 4.2 TiB) in 4m 17s, read: 1.1 GiB/s, write: 4.5 MiB/s
INFO: 8% (343.5 GiB of 4.2 TiB) in 4m 53s, read: 1.2 GiB/s, write: 4.8 MiB/s
INFO: 9% (386.6 GiB of 4.2 TiB) in 5m 30s, read: 1.2 GiB/s, write: 4.6 MiB/s
INFO: 10% (429.1 GiB of 4.2 TiB) in 6m 8s, read: 1.1 GiB/s, write: 4.5 MiB/s
INFO: 11% (472.1 GiB of 4.2 TiB) in 6m 44s, read: 1.2 GiB/s, write: 4.7 MiB/s
INFO: 12% (514.1 GiB of 4.2 TiB) in 12m 15s, read: 129.9 MiB/s, write: 43.1 MiB/s
INFO: 13% (556.9 GiB of 4.2 TiB) in 26m 22s, read: 51.8 MiB/s, write: 51.4 MiB/s
INFO: 14% (599.8 GiB of 4.2 TiB) in 39m 13s, read: 56.9 MiB/s, write: 56.7 MiB/s
INFO: 15% (642.6 GiB of 4.2 TiB) in 52m 32s, read: 54.9 MiB/s, write: 54.8 MiB/s
INFO: 16% (685.5 GiB of 4.2 TiB) in 1h 6m 15s, read: 53.3 MiB/s, write: 53.3 MiB/s
INFO: 17% (728.3 GiB of 4.2 TiB) in 1h 20m 47s, read: 50.3 MiB/s, write: 50.2 MiB/s
INFO: 18% (771.2 GiB of 4.2 TiB) in 1h 36m 6s, read: 47.8 MiB/s, write: 47.2 MiB/s
INFO: 19% (814.0 GiB of 4.2 TiB) in 1h 50m 43s, read: 50.1 MiB/s, write: 49.6 MiB/s
INFO: 20% (856.8 GiB of 4.2 TiB) in 2h 3m 58s, read: 55.1 MiB/s, write: 53.7 MiB/s
INFO: 21% (899.7 GiB of 4.2 TiB) in 2h 18m 39s, read: 49.8 MiB/s, write: 48.1 MiB/s
INFO: 22% (942.5 GiB of 4.2 TiB) in 2h 34m 57s, read: 44.8 MiB/s, write: 44.1 MiB/s
INFO: 23% (985.4 GiB of 4.2 TiB) in 2h 50m 55s, read: 45.8 MiB/s, write: 45.4 MiB/s
INFO: 24% (1.0 TiB of 4.2 TiB) in 3h 6m 44s, read: 46.2 MiB/s, write: 45.5 MiB/s
INFO: 25% (1.0 TiB of 4.2 TiB) in 3h 20m 50s, read: 51.9 MiB/s, write: 51.6 MiB/s
INFO: 26% (1.1 TiB of 4.2 TiB) in 3h 35m 40s, read: 49.3 MiB/s, write: 48.9 MiB/s
INFO: 27% (1.1 TiB of 4.2 TiB) in 3h 49m 53s, read: 51.4 MiB/s, write: 51.1 MiB/s
INFO: 28% (1.2 TiB of 4.2 TiB) in 4h 4m 42s, read: 49.4 MiB/s, write: 49.1 MiB/s
INFO: 29% (1.2 TiB of 4.2 TiB) in 4h 17m 53s, read: 55.5 MiB/s, write: 55.4 MiB/s
INFO: 30% (1.3 TiB of 4.2 TiB) in 4h 32m 59s, read: 48.4 MiB/s, write: 48.0 MiB/s
INFO: 31% (1.3 TiB of 4.2 TiB) in 4h 47m 37s, read: 50.0 MiB/s, write: 49.8 MiB/s
INFO: 32% (1.3 TiB of 4.2 TiB) in 5h 1m 32s, read: 52.5 MiB/s, write: 52.4 MiB/s
INFO: 33% (1.4 TiB of 4.2 TiB) in 5h 16m 26s, read: 49.1 MiB/s, write: 48.7 MiB/s
INFO: 34% (1.4 TiB of 4.2 TiB) in 5h 32m 20s, read: 46.0 MiB/s, write: 45.9 MiB/s
INFO: 35% (1.5 TiB of 4.2 TiB) in 5h 50m 53s, read: 39.4 MiB/s, write: 39.1 MiB/s
INFO: 36% (1.5 TiB of 4.2 TiB) in 6h 10m 37s, read: 37.1 MiB/s, write: 37.0 MiB/s
INFO: 37% (1.5 TiB of 4.2 TiB) in 6h 31m 37s, read: 34.8 MiB/s, write: 34.8 MiB/s
INFO: 38% (1.6 TiB of 4.2 TiB) in 6h 53m 16s, read: 33.8 MiB/s, write: 33.7 MiB/s
INFO: 39% (1.6 TiB of 4.2 TiB) in 7h 14m 49s, read: 33.9 MiB/s, write: 33.9 MiB/s
INFO: 40% (1.7 TiB of 4.2 TiB) in 7h 37m 28s, read: 32.3 MiB/s, write: 32.1 MiB/s
INFO: 41% (1.7 TiB of 4.2 TiB) in 8h 1m 41s, read: 30.2 MiB/s, write: 30.1 MiB/s
INFO: 42% (1.8 TiB of 4.2 TiB) in 8h 23m 49s, read: 33.0 MiB/s, write: 33.0 MiB/s
INFO: 43% (1.8 TiB of 4.2 TiB) in 8h 46m 59s, read: 31.6 MiB/s, write: 30.7 MiB/s
INFO: 44% (1.8 TiB of 4.2 TiB) in 9h 11m 29s, read: 29.8 MiB/s, write: 29.7 MiB/s
INFO: 45% (1.9 TiB of 4.2 TiB) in 9h 34m 55s, read: 31.2 MiB/s, write: 30.9 MiB/s
INFO: 46% (1.9 TiB of 4.2 TiB) in 9h 58m 49s, read: 30.6 MiB/s, write: 30.5 MiB/s
INFO: 47% (2.0 TiB of 4.2 TiB) in 10h 24m 12s, read: 28.8 MiB/s, write: 28.8 MiB/s
INFO: 48% (2.0 TiB of 4.2 TiB) in 10h 49m 15s, read: 29.2 MiB/s, write: 28.7 MiB/s
INFO: 49% (2.0 TiB of 4.2 TiB) in 11h 13m 7s, read: 30.7 MiB/s, write: 30.5 MiB/s
INFO: 50% (2.1 TiB of 4.2 TiB) in 11h 38m 6s, read: 29.3 MiB/s, write: 29.1 MiB/s
INFO: 51% (2.1 TiB of 4.2 TiB) in 12h 4m 18s, read: 27.9 MiB/s, write: 27.7 MiB/s
INFO: 52% (2.2 TiB of 4.2 TiB) in 12h 28m 17s, read: 30.5 MiB/s, write: 30.1 MiB/s
INFO: 53% (2.2 TiB of 4.2 TiB) in 12h 51m 10s, read: 32.0 MiB/s, write: 31.2 MiB/s
ERROR: interrupted by signal
INFO: aborting backup job
INFO: resuming VM again
 
We have a comparable issue here. When the PVE host or PBS crashes for whatever reason during a backup with fleecing enabled, all the fleecing disks stay behind and never get deleted. The next backup for this VM then fails with an "already existing" fleecing disk error.

Sometimes, with qm rescan, the disks are detected and assigned to the VM as unused, so manual deletion is possible. But sometimes they just sit in the storage and block the backup. Deleting them via the GUI is not possible because the VMID exists... weird.
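
A minimal sketch of that rescan-and-unlink path (VMID 106 and the unused0 key are just examples; adjust them to the affected VM):

Code:
qm rescan --vmid 106                  # make the leftover fleecing image show up as an unused disk
qm config 106 | grep unused           # see which unusedN key it was assigned to
qm disk unlink 106 --idlist unused0   # remove the leftover unused disk entry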
 
Sometimes with qm rescan the disks are detected and assigned to the VM as unused
this just saved me a lot of time :)

Sadly I couldn't free the images right away but had to reboot the individual VMs. So for anyone with the same problem, these are the steps I took:

Bash:
qm rescan
store=z_fleecing   # i have an lvmthin which is used for those
pvesm list "$store" | awk 'NR>1 {print $NF}' | uniq | xargs -n 1 qm reboot
pvesm list "$store" | sed '1d;s/ .*//g' | xargs -n 1 pvesm free
That left me with lingering disk entries in the configuration of each VM:
Bash:
root@pve ~ # grep -r unused /etc/pve/qemu-server
/etc/pve/qemu-server/103.conf:unused0: z_fleecing:vm-103-fleece-0
/etc/pve/qemu-server/103.conf:unused1: z_fleecing:vm-103-fleece-1
/etc/pve/qemu-server/103.conf:unused2: z_fleecing:vm-103-fleece-2
/etc/pve/qemu-server/103.conf:unused3: z_fleecing:vm-103-fleece-3
/etc/pve/qemu-server/107.conf:unused0: z_fleecing:vm-107-fleece-0
/etc/pve/qemu-server/104.conf:unused0: z_fleecing:vm-104-fleece-0
/etc/pve/qemu-server/104.conf:unused1: z_fleecing:vm-104-fleece-1
/etc/pve/qemu-server/104.conf:unused2: z_fleecing:vm-104-fleece-2
/etc/pve/qemu-server/102.conf:unused0: z_fleecing:vm-102-fleece-0
/etc/pve/qemu-server/102.conf:unused1: z_fleecing:vm-102-fleece-1
/etc/pve/qemu-server/105.conf:unused0: z_fleecing:vm-105-fleece-0
/etc/pve/qemu-server/106.conf:unused0: z_fleecing:vm-106-fleece-0
/etc/pve/qemu-server/100.conf:unused0: z_fleecing:vm-100-fleece-0
/etc/pve/qemu-server/100.conf:unused1: z_fleecing:vm-100-fleece-1
/etc/pve/qemu-server/100.conf:unused2: z_fleecing:vm-100-fleece-2
And only after that did I learn about qm disk unlink $vmid --idlist unused*.
So I guess, if that ever happens again, the steps I would take would look like this:
Bash:
#!/bin/sh
# NOTICE - THIS WILL REBOOT YOUR VM AUTOMATICALLY!!!!!!
# the storage which is used for fleecing
store='z_fleecing'
qm rescan
# this could probably be done for each vm but that seemed to work so i leave it up here
pvesm list "$store" |sed '1d;s/ .*//g' |xargs -n 1 pvesm free
# get all vms which have a disk associated in $store
# (with a dedicated fleecing store that is easy, but it might get nasty where you use
# the normal images store for fleecing as well, so i revised the whole thing)
for vmid in $(pvesm list "$store" |awk 'NR>1 {print $NF}' |uniq); do
    # looks for disks in configuration where the name begins with ^unused,
    # the storage is $store and end with the match "fleece-NUMBER"
    # e.g. /etc/pve/qemu-server/103.conf:unused0: z_fleecing:vm-103-fleece-0
    idlist="$(qm config "$vmid" |awk -F: -v store="$store" '
        $1 ~ "^unused") &&
        $2 ~ store &&
        $3 ~ "fleece-[[:digit:]]+$"
        {printf"%s,",$1}' |
            sed 's/,$//g')"
    # shellcheck disable=SC2086,SC2236  # NO!
    if [ ! -z $idlist ]; then
        # until the vm isn't shutdown or rebooted the fleece will be locked
        qm reboot "$vmid"
        # wait for the vm to be back online
        while true; do qm agent "$vmid" ping && return ; sleep 5; done
        # temporarily disable protection to be able to remove the unused disks
        qm set "$vmid" --protection 0
        # remove the unused disks
        qm disk unlink "$vmid" --idlist "$idlist"
        # enable protection again
        qm set "$vmid" --protection 1
    fi
done

And because I didn't do it in that particular order, I ended up with this... (yes, I'm a one-liner guy by trade :p)

Bash:
store=z_fleecing; for vm in $(qm list |awk 'NR>1{print$1}'); do idlist=$(qm config $vm |awk -F: -v store=$store '($2 ~ store){printf"%s,",$1}'|sed 's/,$//g'); if [ ! -z $idlist ]; then qm set $vm --protection 0; COPYPASTEqmBLOCKADE disk unlink $vm --idlist $idlist; qm set $vm --protection 1; fi ; done


After all that, I noticed in PBS that there were still jobs/snapshots lingering (a spinning symbol) for all affected VMs, so I additionally ran this:

Bash:
root@pve ~ # pvesm list pbs |awk '($4 == 1)'
pbs:backup/vm/101/2024-12-07T03:43:28Z pbs-vm  backup               1 101
pbs:backup/vm/102/2024-12-07T03:55:50Z pbs-vm  backup               1 102
pbs:backup/vm/103/2024-12-07T04:29:04Z pbs-vm  backup               1 103
pbs:backup/vm/104/2024-12-07T05:22:37Z pbs-vm  backup               1 104
pbs:backup/vm/105/2024-12-07T06:06:01Z pbs-vm  backup               1 105
pbs:backup/vm/106/2024-12-07T06:29:01Z pbs-vm  backup               1 106
pbs:backup/vm/107/2024-12-07T06:52:07Z pbs-vm  backup               1 107
root@pve ~ # pvesm list pbs |awk '($4 == 1) {print $1}' |xargs -n 1 pvesm free
Removed volume 'pbs:backup/vm/101/2024-12-07T03:43:28Z'
Removed volume 'pbs:backup/vm/102/2024-12-07T03:55:50Z'
Removed volume 'pbs:backup/vm/103/2024-12-07T04:29:04Z'
Removed volume 'pbs:backup/vm/104/2024-12-07T05:22:37Z'
Removed volume 'pbs:backup/vm/105/2024-12-07T06:06:01Z'
Removed volume 'pbs:backup/vm/106/2024-12-07T06:29:01Z'
Removed volume 'pbs:backup/vm/107/2024-12-07T06:52:07Z'
 
Gawd. Yes, there's a problem with fleecing files if the backup fails. This is a known issue.
@fiona says there's a fix in the pipeline, but it didn't make it into this new PBS 3.3 version.
But ... I don't even understand what the previous poster was doing. Don't do that.

First, get the list of fleecing files. Unless you have an active backup running, all of these can be nuked.
Code:
zfs list | grep fleece

Nuke 'em.
Code:
zfs destroy {result from above}

And then, surprise! Some of them won't delete. Locked disk.
You have two options.

Rename them. To anything. Maybe something that doesn't have the vmid in it ... ? But anything at all will work to get the backups running again.

or

Power the VM all the way off. It will release the lock on the fleecing disk, allowing you to zfs destroy it.
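
A minimal sketch of both options (the VMID 106 and the dataset path are placeholders; use whatever zfs list returned above):

Code:
# option 1: rename the leftover fleecing dataset out of the way
zfs rename rpool/data/vm-106-fleece-0 rpool/data/old-vm-106-fleece-0

# option 2: power the VM all the way off so the lock is released, then destroy the dataset
qm shutdown 106
zfs destroy rpool/data/vm-106-fleece-0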


Edit ... Obviously this advice applies only to cases where your fleecing files are on ZFS datastores. Although, this has been the case for every instance I've seen reported. I don't know it to be a ZFS-specific issue, and it may well occur on LVM with a different set of quirks. I'm a ZFSer.
 
But ... I don't even understand what the previous poster was doing. Don't do that.
You are free to ask.

Edit ... Obviously this advice applies only to cases where your fleecing files are on ZFS datastores. Although, this has been the case for every instance I've seen reported. I don't know it to be a ZFS-specific issue, and it may well occur on LVM with a different set of quirks. I'm a ZFSer.
is that somewhat related to this? :p
store=z_fleecing # i have an lvmthin which is used for those
Well, I'm a ZFSer at heart too. But this is my first PVE system, I'm used to TrueNAS Core (BSD!) + ESX, and the system PVE runs on hasn't got disks for actual storage, so I installed it with LVM instead of ZFS. Anyhow, the backups worked for a couple of months until, wait for it, PBS 3.3! These lingering fleecing disks were only the result of the backup being botched completely. After much trial and error I have now reverted to 3.2.9-1, set pinning for proxmox-backup-{manager,client}, restarted everything, and it is now working again without a hitch. I have already done multiple snapshot and stop-mode backups with no more issues.
For now I will stick to that version at least for a couple of weeks / until the end of the year, and will try again with a newer release in '25, I guess.
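
For reference, a minimal sketch of what such a pin might look like (the file name is an assumption; the version string is the one mentioned above):

Code:
# /etc/apt/preferences.d/pbs-pin  (hypothetical file name)
Package: proxmox-backup-client proxmox-backup-manager
Pin: version 3.2.9-1
Pin-Priority: 1001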
 
You are free to ask.
Don't care. That was all needlessly complex and likely to send people down the wrong path.

is that somewhat related to this?
Might be. This is a quote from the dev that wrote the feature.


https://forum.proxmox.com/threads/orphan-fleecing-files-make-backups-fail.151061/#post-684318
If the disk did not finish detaching from QEMU yet (i.e. the detach code doesn't wait long enough and the cleanup routine will continue), ZFS will still consider it busy.

Here's the bugzilla. It's ZFS left and right.
https://bugzilla.proxmox.com/show_bug.cgi?id=5440

Again, I don't know this to be ZFS-specific, but after all that wild code you spit out ... well I'll just go with what the Devs say.
 
https://forum.proxmox.com/threads/orphan-fleecing-files-make-backups-fail.151061/#post-684318
If the disk did not finish detaching from QEMU yet (i.e. the detach code doesn't wait long enough and the cleanup routine will continue), ZFS will still consider it busy.

Here's the bugzilla. It's ZFS left and right.
https://bugzilla.proxmox.com/show_bug.cgi?id=5440

Again, I don't know this to be ZFS-specific, but after all that wild code you spit out ... well I'll just go with what the Devs say.
Thanks for the info. And yeah, that all seems ZFS-related, but mine came up because the whole backup process failed with timeouts and left all those fleecing disks behind.
But that is also what this topic is about, is it not?
And all this shell-foo is just to clean up the mess after you are left with stuck fleecing disks, without ever looking at the web GUI or doing something iffy (IMHO) like editing /etc/pve/.../$VMID.conf manually, when you can just use the provided tools.

But after looking at the other issues, I have now completely revised the "script" to make it safe to use, AFAICT. @fabian, could you check that? :p
 
Cool.

Yes, we explore technical topics in detail and tear the vendor's creations apart to inspect them. We do all of that here. There are folks with truly advanced knowledge frequenting this forum.

And we are also the first destination for a vast army of n00b homelabbers. And then there's the random professional that does all this stuff for a living. That's the audience here. A very mixed bag. Personally, I try to NOT leave any code landmines in the forum. I keep it safe, or if not, indicate dangerous-don't-do-this-in-Prod sort of activities as such.

BTW ... Mucking about on the PBS datastore ... that's a fraught activity. Unless you really want to dig into .chunks mechanisms, I've found it better to just learn how it works and then use it effectively. It's a darned interesting topic, but a much deeper rabbit hole than some might expect.
 
AFAIK fleecing volume cleanup has been improved in the meantime, @fiona knows more. Unfinished snapshots on the PBS side are removed once a newer finished one exists (when the group is pruned).
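
For example, a rough sketch of pruning such a group from the client side (the repository string and keep settings are assumptions; adjust to your setup):

Code:
# keep only the last few finished snapshots of VM 106's group
proxmox-backup-client prune vm/106 --keep-last 3 \
    --repository root@pam@pbs.example.com:datastore1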
 
AFAIK fleecing volume cleanup has been improved in the meantime, @fiona knows more..
No, the patches were not applied, as we decided on a bit different approach (track the fleecing images in a special config section rather than an internal-only config option) and I haven't gotten around to finishing this yet.
 
Thanks for the clarification!

@crpb that script looks rather dangerous and broken to me - first it removes all volumes from the storage (if the storage contains anything other than leftover fleecing images, you just lost a lot of data?), and then it queries the storage again (which should now be empty) and loops over the result (which should do nothing?). There are like three variants in your post too, so I am not sure which one I am supposed to look at.
 
