Deleting several VMs may leave disks behind

tschmidt

I'm working on automatically deploying VMs via the API. To update them I delete the old VM and clone a new one from an up-to-date template. While doing this I noticed that when deleting several VMs at once, some may leave their disks behind.

I think I first had it happen with 5 VMs, but for testing it's easier to just use a hundred or so; then most of them left their disks behind.

I'm using a 4-node hyper-converged Ceph cluster running the up-to-date no-subscription repository, and I call the HTTP API directly from Python.
 
Here's a simplified demonstrator (the real code watches the tasks for completion and makes sure there are never more than 5 running at once, most likely of mixed types like clones, etc.).
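
Roughly, it boils down to firing off the destroy calls and collecting the task IDs. This is only a minimal sketch: host, API token, and VMID range are placeholders, and the completion tracking and the 5-task limit are left out here.

Python:
import requests

# Placeholders only: adjust host, node, token, and VMID range to your cluster.
PVE_HOST = "https://pve1.example.com:8006"
NODE = "pve1"
TOKEN = "root@pam!apitest=<secret>"
VMIDS = range(100, 200)   # the throw-away test VMs

session = requests.Session()
session.headers.update({"Authorization": f"PVEAPIToken={TOKEN}"})
session.verify = False    # lab cluster with a self-signed certificate

upids = []
for vmid in VMIDS:
    # DELETE .../qemu/{vmid} starts an asynchronous qmdestroy task and only
    # returns its UPID; whether the disks actually get removed is decided later.
    r = session.delete(f"{PVE_HOST}/api2/json/nodes/{NODE}/qemu/{vmid}",
                       params={"purge": 1})
    r.raise_for_status()
    upids.append(r.json()["data"])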
 


Have you checked the task log of one of the VMs that failed to get all its disks removed?
 
No, I only ever checked the status (which said OK), but the task output indeed reveals that there is a problem:

Code:
trying to acquire cfs lock 'storage-pool1' ...
trying to acquire cfs lock 'storage-pool1' ...
trying to acquire cfs lock 'storage-pool1' ...
trying to acquire cfs lock 'storage-pool1' ...
trying to acquire cfs lock 'storage-pool1' ...
trying to acquire cfs lock 'storage-pool1' ...
trying to acquire cfs lock 'storage-pool1' ...
trying to acquire cfs lock 'storage-pool1' ...
trying to acquire cfs lock 'storage-pool1' ...
Could not remove disk 'pool1:base-101-disk-0/vm-199-disk-0', check manually: cfs-lock 'storage-pool1' error: got lock request timeout
purging VM 199 from related configurations..
TASK OK

I'll see if I can post the status too; the Proxmox UI doesn't even allow copy-and-pasting the task ID.

EDIT:
JSON:
{
    "data": {
        "type": "qmdestroy",
        "user": "root@pam",
        "pstart": 24283591,
        "exitstatus": "OK",
        "status": "stopped",
        "pid": 1383677,
        "node": "pve1",
        "tokenid": "apitest",
        "id": "199",
        "starttime": 1721635829,
        "upid": "UPID:pve1:00151CFD:017289C7:669E13F5:qmdestroy:199:root@pam!apitest:"
    }
}
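
For reference, the status and log can also be pulled straight from the API instead of the GUI. Again just a sketch with the same placeholder host and token as above; the UPID is the one returned by the destroy call.

Python:
import requests

PVE_HOST = "https://pve1.example.com:8006"   # placeholder
NODE = "pve1"
HEADERS = {"Authorization": "PVEAPIToken=root@pam!apitest=<secret>"}  # placeholder
UPID = "UPID:pve1:00151CFD:017289C7:669E13F5:qmdestroy:199:root@pam!apitest:"

# Task status: the same JSON as above; exitstatus is "OK" despite the leftover disk.
status = requests.get(f"{PVE_HOST}/api2/json/nodes/{NODE}/tasks/{UPID}/status",
                      headers=HEADERS, verify=False).json()["data"]
print(status["exitstatus"])

# Task log: this is where the 'Could not remove disk' line actually shows up.
log = requests.get(f"{PVE_HOST}/api2/json/nodes/{NODE}/tasks/{UPID}/log",
                   headers=HEADERS, verify=False).json()["data"]
for entry in log:
    print(entry["t"])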
 
Yeah, this is caused by contention on the storage layer. Those failures should probably be made more visible in the task status by being properly logged as warnings.
 
Definitely, but even better would be higher timeouts and/or retries.

But thanks for your help; now at least I'll have an easier time working around it.
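
For anyone hitting the same thing, the workaround I'm aiming for looks roughly like this (again only a sketch with placeholder connection details): run the destroys strictly one after the other to reduce the lock contention, and grep each task log for the 'Could not remove disk' line, since the exit status alone won't show it.

Python:
import time
import requests

PVE_HOST = "https://pve1.example.com:8006"   # placeholder
NODE = "pve1"
HEADERS = {"Authorization": "PVEAPIToken=root@pam!apitest=<secret>"}  # placeholder

def wait_for_task(upid):
    """Poll the task until it leaves the 'running' state and return its status."""
    while True:
        r = requests.get(f"{PVE_HOST}/api2/json/nodes/{NODE}/tasks/{upid}/status",
                         headers=HEADERS, verify=False)
        data = r.json()["data"]
        if data["status"] != "running":
            return data
        time.sleep(1)

def destroy_serially(vmids):
    """Run the qmdestroy tasks one at a time and flag any left-behind disks."""
    leftovers = []
    for vmid in vmids:
        r = requests.delete(f"{PVE_HOST}/api2/json/nodes/{NODE}/qemu/{vmid}",
                            params={"purge": 1}, headers=HEADERS, verify=False)
        upid = r.json()["data"]
        wait_for_task(upid)
        # exitstatus is "OK" even when a disk is left behind, so check the log.
        log = requests.get(f"{PVE_HOST}/api2/json/nodes/{NODE}/tasks/{upid}/log",
                           headers=HEADERS, verify=False).json()["data"]
        for entry in log:
            if "Could not remove disk" in entry["t"]:
                leftovers.append((vmid, entry["t"]))
    return leftovers

Serialising is slower, of course, but it should help avoid the cfs lock timeouts, and the log check catches anything that still slips through.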
 