API VM update fails while Proxmox UI succeeds

layer7.net

Hi,

this is from:

pve-manager/8.1.10/4b06efb5db453f29 (running kernel: 6.5.13-3-pve)


A qcow2-based template on an NFS server is linked-cloned on the same NFS server.

Doing this via the API results in a task with this output:

Code:
create full clone of drive ide0 (nfs-server:321/vm-321-cloudinit.qcow2)
Formatting '/mnt/pve/nfs-server/images/457/vm-457-cloudinit.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off preallocation=metadata compression_type=zlib size=4194304 lazy_refcounts=off refcount_bits=16
create linked clone of drive virtio0 (nfs-server:321/base-321-disk-0.qcow2)
clone 321/base-321-disk-0.qcow2: images, vm-457-disk-0.qcow2, 457 to vm-457-disk-0.qcow2 (base=../321/base-321-disk-0.qcow2)
Formatting '/mnt/pve/nfs-server/images/457/vm-457-disk-0.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2147483648 backing_file=../321/base-321-disk-0.qcow2 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
TASK OK

49 seconds later, we change the IO limits:

Code:
update VM 457: -virtio0 nfs-server:321/base-321-disk-0.qcow2/457/vm-457-disk-0.qcow2,iops_rd=50000,iops_wr=50000,mbps_rd=244,mbps_wr=244,discard=on
TASK ERROR: volume 'nfs-server:321/base-321-disk-0.qcow2/457/vm-457-disk-0.qcow2' does not exist

This fails.
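
For reference, the sequence we perform via the API is roughly equivalent to the following pvesh calls (just a sketch: the node name "node1", the VM name and the waiting step are placeholders, our actual client talks to the HTTP API directly):

Code:
# 1) clone template 321 into the new VM 457 (a linked clone is the default for templates) -- returns a UPID
pvesh create /nodes/node1/qemu/321/clone --newid 457 --name test-457

# 2) ...poll the returned UPID until the task has finished...

# 3) set the IO limits on the cloned disk -- this is the call that fails for us
pvesh create /nodes/node1/qemu/457/config \
  --virtio0 'nfs-server:321/base-321-disk-0.qcow2/457/vm-457-disk-0.qcow2,iops_rd=50000,iops_wr=50000,mbps_rd=244,mbps_wr=244,discard=on'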

The essentially identical process, however, does not fail when executed manually through the Proxmox UI.


Via the Proxmox UI:


Code:
create full clone of drive ide0 (nfs-server:321/vm-321-cloudinit.qcow2)
Formatting '/mnt/pve/nfs-server/images/457/vm-457-cloudinit.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off preallocation=metadata compression_type=zlib size=4194304 lazy_refcounts=off refcount_bits=16
create linked clone of drive virtio0 (nfs-server:321/base-321-disk-0.qcow2)
clone 321/base-321-disk-0.qcow2: images, vm-457-disk-0.qcow2, 457 to vm-457-disk-0.qcow2 (base=../321/base-321-disk-0.qcow2)
Formatting '/mnt/pve/nfs-server/images/457/vm-457-disk-0.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2147483648 backing_file=../321/base-321-disk-0.qcow2 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
TASK OK


23 seconds later:


Code:
update VM 457: -virtio0 nfs-server:321/base-321-disk-0.qcow2/457/vm-457-disk-0.qcow2,discard=on,size=2G,mbps_rd=244,mbps_wr=244,iops_rd=50000,iops_wr=50000
TASK OK


So the question now is: what could be possible reasons for the API behaving differently from the actions triggered by the Proxmox UI?

Any suggestions for further debugging are welcome. We have already ruled out performance issues and similar kinds of root causes.

Thank you!
 
Compare the actual requests made (the one from the UI can be retrieved via the browser's dev console).
 
Hi Fabian,

thank you for your time!

The POST request from the browser is:


virtio0"nfs-server:322/base-322-disk-0.qcow2/110/vm-110-disk-0.qcow2,discard=on,iops_rd=50001,iops_wr=50001,mbps_rd=251,mbps_wr=252,size=2G"
digest"a0ab606e900238de1632c6834b53c5268c18076b"
background_delay"5"


We have now set background_delay to 30.

We don't use the digest, as there is nothing that can come in between.

The issue is still there, no change, even when we wait for the maximum duration.
 
Hi,

OK, after some deeper debugging, it turned out that the TASK OK from the clone is actually the problem:

Code:
create full clone of drive ide0 (nfs-server:321/vm-321-cloudinit.qcow2)
Formatting '/mnt/pve/nfs-server/images/457/vm-457-cloudinit.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off preallocation=metadata compression_type=zlib size=4194304 lazy_refcounts=off refcount_bits=16
create linked clone of drive virtio0 (nfs-server:321/base-321-disk-0.qcow2)
clone 321/base-321-disk-0.qcow2: images, vm-457-disk-0.qcow2, 457 to vm-457-disk-0.qcow2 (base=../321/base-321-disk-0.qcow2)
Formatting '/mnt/pve/nfs-server/images/457/vm-457-disk-0.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2147483648 backing_file=../321/base-321-disk-0.qcow2 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
TASK OK

That TASK OK is actually simply a lie.

When the Proxmox task reported OK, the file did not exist in the filesystem:

Code:
# ls -la /mnt/pve/nfs-server/images/457/vm-457-disk-0.qcow2
ls: cannot access '/mnt/pve/nfs-server/images/457/vm-457-disk-0.qcow2': No such file or directory

Unfortunately, the clone command does not seem to have a parameter to wait for background tasks. Maybe it wouldn't be bad to add one...

Thank you @fabian for your time and support!
 
I am not sure which API client you are using, but the clone endpoint most definitely spawns a worker task and returns that task's ID:

https://git.proxmox.com/?p=qemu-ser...9d4126967a5034e36621d2fae12948c;hb=HEAD#l3948

maybe you need to poll the task to wait for its completion before issuing the next request?
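
For example, a minimal polling loop could look like this (just a sketch: node name and UPID are placeholders, and jq is assumed to be available):

Code:
# placeholder UPID as returned by the clone call
UPID='UPID:node1:000FF00F:0BC00F00:66000000:qmclone:457:root@pam:'

# wait until the task has stopped, then check its exit status
while [ "$(pvesh get /nodes/node1/tasks/"$UPID"/status --output-format json | jq -r .status)" != "stopped" ]; do
  sleep 1
done
pvesh get /nodes/node1/tasks/"$UPID"/status --output-format json | jq -r .exitstatus   # should print "OK"
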
Hi,

thank you for pointing us to this!

After every request we make, we check the status of the task via its UPID before we issue the next command.

So for now that does not seem to be the case.

During our debugging we did:

- issue clone -> wait for the task to be done via code
- issue config -> this fails with the error (we observed this through the Proxmox UI)

-> Right after seeing the error in the Proxmox UI task history, we manually checked on the CLI whether the file exists.

And this file was not there (as shown above), even several seconds after the clone reported OK.

AFAIK we already had this behavior in another situation, where the API returned OK (meaning we queried the task ID and checked that it was finished) while in fact the task was not fully done. That's why we were forced to build sleep delays into the code to give things some extra time to really settle.

So unfortunately it seems that the API replies are not always 100% precise.
 
I am fairly certain you are doing something wrong with your API access ;) The way task workers work rules out your sequence of events: either you are mixing up task IDs, or something else is going on..

If the clone task were still running when you tried the config update, the config update would fail, because the clone task would still be holding an flock and the config file would be locked as well.
 
Hi,

I am also wondering myself.

But in fact, when we see this in the Proxmox UI:

Code:
create full clone of drive ide0 (nfs-server:321/vm-321-cloudinit.qcow2)
Formatting '/mnt/pve/nfs-server/images/457/vm-457-cloudinit.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off preallocation=metadata compression_type=zlib size=4194304 lazy_refcounts=off refcount_bits=16
create linked clone of drive virtio0 (nfs-server:321/base-321-disk-0.qcow2)
clone 321/base-321-disk-0.qcow2: images, vm-457-disk-0.qcow2, 457 to vm-457-disk-0.qcow2 (base=../321/base-321-disk-0.qcow2)
Formatting '/mnt/pve/nfs-server/images/457/vm-457-disk-0.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2147483648 backing_file=../321/base-321-disk-0.qcow2 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
TASK OK

this was executed on the CLI (on the right server), ending with this result:

Code:
# ls -la /mnt/pve/nfs-server/images/457/vm-457-disk-0.qcow2
ls: cannot access '/mnt/pve/nfs-server/images/457/vm-457-disk-0.qcow2': No such file or directory

There was no API call or anything else involved in this. I am just giving feedback here on what happened, in a reproducible way.

IMHO, after the Proxmox UI gives TASK OK, this file must already exist on the filesystem. I did not dive any deeper by observing the host machine's processes actually creating this file during the clone, so I cannot give any feedback on what happened on the filesystem during this time.

Maybe the combination of NFS + qcow2 template + qcow2 linked clone creates a special situation where Proxmox reports back TASK OK while the file is not yet _really_ on the filesystem.

When I have time I will dive into this deeper and check what happens on the filesystem.

In any case, thank you very much for your effort and time, it's highly appreciated!
 

Like I said, that seems very unlikely (the task is only marked as completed when the worker process exits, and the worker process is the one creating the file in a blocking fashion), unless you are testing the file's existence from a different node and have some weird caching issue with your NFS server ;)
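
A quick way to rule that out would be something like the following, run on the node that executed the clone task right after it reports TASK OK (a sketch; the path is taken from the output above):

Code:
# check the file on the node that ran the clone worker, immediately after TASK OK
stat /mnt/pve/nfs-server/images/457/vm-457-disk-0.qcow2

# and check the NFS mount options in use (a large actimeo could explain stale
# attribute caches when looking from a *different* node)
grep nfs-server /proc/mounts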
 
