Clone process hangs or cfs lock cannot be acquired

robertlukan

We are using Proxmox VE 8.4.1 with the enterprise subscription repos. I have noticed that the clone process either does not start or hangs at the same place (the % done for a disk) for several VMs. We have tried quite a few VMs and the results are mostly negative (oddly, it works for a few of them). However, if we create a new VM and clone it, it works fine. We are running hyperconverged Ceph RBD with Ceph 19.2.1. Ceph does not show any errors, and an fio test from within one of the VMs is fine, including read and write performance. Adding a new disk to an existing VM works, restoring a VM works, and both offline and online migration work fine. The setup had been working until yesterday: about 3 months on 8.4.1 and about 6 months on version 8 before that.

More oddly, cloning a live VM works fine (the same VM whose offline clone fails). We have tried the same operation from the command line, with the same results. We have also rebooted one host and performed the operations on that host: again the same results.
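For reference, the CLI form of the clone we tried looks roughly like this (the VMIDs and the storage ID "rbd-storage" are placeholders for our actual values):

Code:
# full (offline) clone of VM 100 into a new VM 101 on the same RBD storage
qm clone 100 101 --full --storage rbd-storage --name test-clone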

We have already reported this issue to Proxmox, so this post is for the community, in case someone has encountered anything similar.

The output of a failed job is very simple: the clone job just stops showing progress. It is stuck at some percentage, always the same one if we repeat the process. It "never" finishes; we canceled the job after 20 minutes. After the cancellation, the disk that was not finished has to be cleaned up manually.
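In case anyone needs to do the same cleanup, this is roughly what we do; the pool name and image name are placeholders and depend on your setup:

Code:
# list images in the pool and look for the half-copied disk of the new VMID
rbd -p rbd-pool ls | grep vm-101
# remove the leftover image (double-check the name first)
rbd -p rbd-pool rm vm-101-disk-0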

I guess the first question is: what is the difference between a live clone and an offline clone (copy job)?
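From what I can tell (not confirmed by Proxmox), the offline copy job is driven by the management layer, essentially a qemu-img convert going through userspace librbd, while a live clone is a drive-mirror block job executed inside the running QEMU process. A simplified sketch of what the offline path would look like (pool and image names are placeholders, and PVE passes additional auth options in practice):

Code:
qemu-img convert -p -f raw -O raw \
    rbd:rbd-pool/vm-100-disk-0:conf=/etc/pve/ceph.conf \
    rbd:rbd-pool/vm-101-disk-0:conf=/etc/pve/ceph.conf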
 
I am not sure if anyone else will face the same issue, but maybe this will help someone. We have upgraded PVE to 8.4.14; unfortunately, Ceph is still on 19.2.1, as Proxmox has not yet pushed 19.2.3 to the enterprise repo.

An interesting "workaround" is to reboot/power off one host, so that Ceph is degraded; then the clone process works fine, regardless of which host is offline.

The second workaround is to clone to an external storage such as NFS and then move the disk back to RBD.
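Roughly, with placeholder storage IDs ("nfs-storage", "rbd-storage") and VMIDs:

Code:
# full clone onto the NFS storage instead of RBD
qm clone 100 101 --full --storage nfs-storage
# then move the clone's disk back to RBD (adjust the disk key, e.g. scsi0)
qm disk move 101 scsi0 rbd-storage --delete 1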

The third workaround is to use the rbd command in the CLI to clone the image and manually map/attach the disk to the new VM.
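A sketch of this approach, with placeholder pool, image, and storage names; note that rbd clone needs a protected snapshot as its source:

Code:
# snapshot the source disk and protect it so it can be cloned
rbd snap create rbd-pool/vm-100-disk-0@clonebase
rbd snap protect rbd-pool/vm-100-disk-0@clonebase
# clone it as the new VM's disk; flatten to detach it from the parent
rbd clone rbd-pool/vm-100-disk-0@clonebase rbd-pool/vm-101-disk-0
rbd flatten rbd-pool/vm-101-disk-0
# attach the existing image to the new VM's config
qm set 101 --scsi0 rbd-storage:vm-101-disk-0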

With the help of the Proxmox support team, we have found one more workaround: enabling krbd makes cloning work again.
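Enabling it is a one-line storage setting ("rbd-storage" again being a placeholder for the storage ID); note that running VMs only pick it up once their disks are re-activated, e.g. after a stop/start or a migration:

Code:
# switch the RBD storage from userspace librbd to the kernel rbd client
pvesm set rbd-storage --krbd 1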

We have not opted in to the new kernel.

We are somewhat in a dilemma: try the new kernel with 8.4, enable the kernel RBD client, or do nothing (i.e. keep using an intermediary storage for the cloning process) and wait until PVE 9.1.x comes out.

So far the easiest workaround looks to be enabling krbd. I am not sure what the performance/stability implications are. Can anyone share their opinion?