Can't remove replicated VM disk

alc

Active Member
Feb 18, 2020
After launching a replication, the task failed and I received the following error:
Code:
end replication job with error: No common base snapshot on volume(s) local-zfs:vm-102-disk-0
Note that there are no snapshots associated with this VM.
I've also had this error with other VMs and CTs, none of which had snapshots.

Then when trying to remove the destination disk, I get "zfs error: cannot destroy 'rpool/vm-102-disk-0': dataset is busy".
Also note that the replication seems to be borked, as the destination drive shows a size of 0B.

I've waited a few hours and tried again with the same results.
I've also removed the replication tasks (and they did go away successfully, but I still couldn't delete the target disk).

What else can I try?
 

1. Check if the dataset is in use

  • Run the following command to check active usage (for a zvol, see also the block-device checks sketched at the end of this post):
    lsof | grep rpool/vm-102-disk-0

  • Ensure no mounts or locks are holding it:
    zfs get mounted rpool/vm-102-disk-0

  • If mounted, unmount it:
    zfs unmount rpool/vm-102-disk-0

2. Verify Replication State

  • Check for lingering replication configurations:
    cat /etc/pve/replication.cfg

  • If there are still replication entries for the affected VM/CT, remove them manually.

3. Force Delete the ZFS Dataset

  • If the dataset is still "busy":
    zfs destroy -f rpool/vm-102-disk-0

  • If it still refuses:
    zfs destroy -f -r rpool/vm-102-disk-0

4. Restart ZFS Services

  • Restart ZFS-related services:
    systemctl restart zfs.target

  • If the issue persists, reboot the node:
    reboot

5. Fix Future Replication Issues

Since replication requires snapshots:

  • Create a manual snapshot before starting a new replication (use the ZFS dataset path rather than the local-zfs storage ID; a verification sketch follows at the end of this post):
    zfs snapshot rpool/vm-102-disk-0@base
  • Then recreate the replication task in the Proxmox UI.
  • If issues persist, check system logs:
    journalctl -xe | grep zfs
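
Regarding step 5: a quick way to verify the snapshot actually exists before re-creating the replication job (a sketch only; Proxmox replication normally manages its own snapshots, typically named like __replicate_<job>_<timestamp>__, so a manual @base snapshot is mainly a sanity check):
Code:
zfs snapshot rpool/vm-102-disk-0@base          # manual snapshot from the bullet above
zfs list -t snapshot -r rpool/vm-102-disk-0    # verify it is listed before re-creating the job

And regarding steps 1 and 3: since vm-102-disk-0 is a zvol (a block device, not a mounted filesystem), lsof on the dataset name will usually miss whatever keeps it busy. A rough sketch of zvol-level checks, using the rpool/vm-102-disk-0 path from the error message; adjust to your layout:
Code:
# The zvol is exposed as a block device under /dev/zvol/<pool>/<dataset>
ls -l /dev/zvol/rpool/vm-102-disk-0         # resolves to /dev/zdN
lsblk /dev/zvol/rpool/vm-102-disk-0         # any child devices (partitions, LVM, ...)?
fuser -v /dev/zvol/rpool/vm-102-disk-0      # processes holding the device open
ls /sys/block/zd*/holders/                  # kernel-level holders (device-mapper etc.)
# A clone depending on the dataset can also block a destroy:
zfs list -H -o name,origin -r rpool | grep vm-102-disk-0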
 
Hi,
the replication uses ZFS snapshots to transfer the data. If there is no common snapshot between source and target, replication will not be possible. Please share the output of zpool history | grep vm-102-disk-0 on the target as well as pveversion -v and zfs list -r -t all rpool/data/vm-102-disk-0 on both source and target.

Is there any task referencing the disk still running, ps aux | grep -e pvesr -e vm-102-disk-0 -e zfs?
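
If it helps to narrow this down, here is a rough way to compare snapshot names between the two nodes (assuming the rpool/data/vm-102-disk-0 path mentioned above; adjust if the zvol lives directly under the pool):
Code:
# On the source node:
zfs list -H -o name -t snapshot -r rpool/data/vm-102-disk-0 | sed 's/.*@//' | sort > /tmp/snaps-source
# On the target node (then copy one file to the other node, e.g. with scp):
zfs list -H -o name -t snapshot -r rpool/data/vm-102-disk-0 | sed 's/.*@//' | sort > /tmp/snaps-target
comm -12 /tmp/snaps-source /tmp/snaps-target    # names present on both sides = candidate base snapshots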
 
Hello, sorry for the very long delay.
After an upgrade, I still can't delete the problematic disk image.

Here are the requested outputs, pasted externally when too long:

zpool history | grep vm-102-disk-0 on the target node [pastebin]
pveversion -v on the target [pastebin]
pveversion -v on the source [pastebin]

zfs list -r -t all rpool/vm-102-disk-0 on the target:
Code:
NAME                  USED  AVAIL  REFER  MOUNTPOINT
rpool/vm-102-disk-0  1.02G  1.42T     8K  -

zfs list -r -t all rpool/vm-102-disk-0 on the source:
Code:
NAME                  USED  AVAIL  REFER  MOUNTPOINT
rpool/vm-102-disk-0  1.02G  1.20T   445M  -


Also, thank you @shbaek, I've tried all the commands you provided, but with no success. Here are the results:
  • Run the following command to check active usage:
    lsof | grep rpool/vm-102-disk-0
On the target:
- No output
On the source:
- An extremely long list of "lsof: no pwd entry for UID 100100" messages, with various UIDs (100100, 100033, 100133, ...)
  • Ensure no mounts or locks are holding it:
    zfs get mounted rpool/vm-102-disk-0
On both target and source:
Code:
NAME                 PROPERTY  VALUE    SOURCE
rpool/vm-102-disk-0  mounted   -        -
  • If mounted, unmount it:
    zfs unmount rpool/vm-102-disk-0
On both target and source:
Code:
cannot open 'rpool/vm-102-disk-0': operation not applicable to datasets of this type
  • Check for lingering replication configurations:
    cat /etc/pve/replication.cfg
On both target and source:
- No mention of VM 102
  • If the dataset is still "busy":
    zfs destroy -f rpool/vm-102-disk-0
On both target and source:
cannot destroy 'rpool/vm-102-disk-0': dataset is busy
  • If it still refuses:
    zfs destroy -f -r rpool/vm-102-disk-0
On both target and source:
cannot destroy 'rpool/vm-102-disk-0': dataset is busy
  • Restart ZFS-related services:
    systemctl restart zfs.target

  • If the issue persists, reboot the node:
    reboot
Rebooting had no effect.
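Note that the "mounted -" value and the "operation not applicable to datasets of this type" message seem expected here, since a VM disk is a zvol rather than a filesystem, so there is nothing to unmount. A quick check of the dataset type:
Code:
zfs get type rpool/vm-102-disk-0    # reports "volume" for a VM disk zvol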
 
So the last operation in that zpool history output was a removal on March 4th; it's very strange that the image still seems to exist then:
2025-03-04.13:13:04 zfs destroy -r rpool/vm-102-disk-0

What does zpool status -v say? Do you see the image mentioned when you run ps aux | grep vm-102-disk-0?
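
In case it is useful, those checks as a single block (the last line is an extra, optional check that the zvol's block device still exists; path as used earlier in the thread):
Code:
zpool status -v                          # pool health and any reported errors
ps aux | grep vm-102-disk-0              # processes still referencing the image
ls -l /dev/zvol/rpool/vm-102-disk-0      # does the block device still exist?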