API update vm will fail while proxmox UI will succeed

layer7.net · Apr 11, 2024

Hi,

this is from:

pve-manager/8.1.10/4b06efb5db453f29 (running kernel: 6.5.13-3-pve)

A qcow2 based template on an NFS server is link cloned on the same NFS server

Doing this via API will result in tasks with this output:

Code:

create full clone of drive ide0 (nfs-server:321/vm-321-cloudinit.qcow2)
Formatting '/mnt/pve/nfs-server/images/457/vm-457-cloudinit.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off preallocation=metadata compression_type=zlib size=4194304 lazy_refcounts=off refcount_bits=16
create linked clone of drive virtio0 (nfs-server:321/base-321-disk-0.qcow2)
clone 321/base-321-disk-0.qcow2: images, vm-457-disk-0.qcow2, 457 to vm-457-disk-0.qcow2 (base=../321/base-321-disk-0.qcow2)
Formatting '/mnt/pve/nfs-server/images/457/vm-457-disk-0.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2147483648 backing_file=../321/base-321-disk-0.qcow2 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
TASK OK

49 seconds later changing the IO limits:

Code:

update VM 457: -virtio0 nfs-server:321/base-321-disk-0.qcow2/457/vm-457-disk-0.qcow2,iops_rd=50000,iops_wr=50000,mbps_rd=244,mbps_wr=244,discard=on
TASK ERROR: volume 'nfs-server:321/base-321-disk-0.qcow2/457/vm-457-disk-0.qcow2' does not exist

Which will fail...

While the basically identical process will not fail when executed manually through the proxmox UI:

via proxmox UI:

Code:

create full clone of drive ide0 (nfs-server:321/vm-321-cloudinit.qcow2)
Formatting '/mnt/pve/nfs-server/images/457/vm-457-cloudinit.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off preallocation=metadata compression_type=zlib size=4194304 lazy_refcounts=off refcount_bits=16
create linked clone of drive virtio0 (nfs-server:321/base-321-disk-0.qcow2)
clone 321/base-321-disk-0.qcow2: images, vm-457-disk-0.qcow2, 457 to vm-457-disk-0.qcow2 (base=../321/base-321-disk-0.qcow2)
Formatting '/mnt/pve/nfs-server/images/457/vm-457-disk-0.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2147483648 backing_file=../321/base-321-disk-0.qcow2 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
TASK OK

23 seconds later:

Code:

update VM 457: -virtio0 nfs-server:321/base-321-disk-0.qcow2/457/vm-457-disk-0.qcow2,discard=on,size=2G,mbps_rd=244,mbps_wr=244,iops_rd=50000,iops_wr=50000
TASK OK

So the question would be now, what could be possible reasons why the API would behave differently from the actions triggered by the Proxmox UI.

Any suggestions for further debugging are welcome. We could already rule out performance issues and similar kinds of root causes.

Thank you!

fabian · Apr 11, 2024

compare the actual requests made (the one by the UI can be retrieved via the browser's dev console)..

layer7.net · Apr 11, 2024

fabian said:
compare the actual requests made (the one by the UI can be retrieved via the browser's dev console)..

Hi Fabian,

thank you for your time!

The post request from the browser is:

virtio0	"nfs-server:322/base-322-disk-0.qcow2/110/vm-110-disk-0.qcow2,discard=on,iops_rd=50001,iops_wr=50001,mbps_rd=251,mbps_wr=252,size=2G"
digest	"a0ab606e900238de1632c6834b53c5268c18076b"
background_delay	"5"

We added now background_delay 30

Digest we dont use, as there is nothing that can come in between.

Issue is still there, no change, even we wait the maximum duration.

layer7.net · Apr 11, 2024

Hi,

ok after some deeper debugging it turned out that the TASK OK from the clone is actually the problem:

Code:

create full clone of drive ide0 (nfs-server:321/vm-321-cloudinit.qcow2) Formatting '/mnt/pve/nfs-server/images/457/vm-457-cloudinit.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off preallocation=metadata compression_type=zlib size=4194304 lazy_refcounts=off refcount_bits=16
create linked clone of drive virtio0 (nfs-server:321/base-321-disk-0.qcow2) clone 321/base-321-disk-0.qcow2: images, vm-457-disk-0.qcow2, 457 to vm-457-disk-0.qcow2 (base=../321/base-321-disk-0.qcow2)
Formatting '/mnt/pve/nfs-server/images/457/vm-457-disk-0.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2147483648 backing_file=../321/base-321-disk-0.qcow2 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
TASK OK

Is actually simply a lie.

When the proxmox task reports an OK, the file file did not exist in the filesystem:

Code:

# ls -la /mnt/pve/nfs-server/images/457/vm-457-disk-0.qcow2
ls: cannot access '/mnt/pve/nfs-server/457/vm-457-disk-0.qcow2': No such file or directory

Unfortunately the clone command does not have a waiting parameter for background tasks as it seems to me. Maybe it wont be bad to add this...

Thank you @fabian for your time and support!

fabian · Apr 11, 2024

I am not sure which API client you are using, but the clone endpoint most definitely spawns a worker task and returns that task's ID:

https://git.proxmox.com/?p=qemu-ser...9d4126967a5034e36621d2fae12948c;hb=HEAD#l3948

maybe you need to poll the task to wait for its completion before issuing the next request?

layer7.net · Apr 11, 2024

fabian said:
I am not sure which API client you are using, but the clone endpoint most definitely spawns a worker task and returns that task's ID:

https://git.proxmox.com/?p=qemu-ser...9d4126967a5034e36621d2fae12948c;hb=HEAD#l3948

maybe you need to poll the task to wait for its completion before issuing the next request?

Hi,

thank you for pointing us to this!

After every request we do, we are checking using UPID the status of the task before we issue the next command.

So for now that does not seem to be the case.

During our debug we did:

- issue clone -> waiting for task done via code
- issue config -> failing with the error ( we observed this through proxmox UI )

-> checking manually right after we see the error in the proxmox UI task history the existence of the file on the cli.

And this file was not there ( as shown above ), even several seconds after clone gave back an OK

AFIAK we had this behavior already in another situation, where the API returned an OK ( means query the task ID and checking for it to be finished ) while in fact the task was not fully done. Thats why we were forced to build in sleep delays in the code to give things some extra time to settle things for real.

So unfortunately it seems that the API replies are not always 100% precise.

fabian · Apr 12, 2024

I am fairly certain you are doing something wrong with your API access

the way task workers work rules out your sequence of events. either you are mixing up task IDs, or something else is going on..

if the clone task were still running when you tried the config update, the config update would fail because the clone task is still holding both an flock and the config file is locked as well.

layer7.net · Apr 12, 2024

Hi,

i am also wondering myself.

But in fact, when we see this in the ProxmoxUI:

Code:

create full clone of drive ide0 (nfs-server:321/vm-321-cloudinit.qcow2) Formatting '/mnt/pve/nfs-server/images/457/vm-457-cloudinit.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off preallocation=metadata compression_type=zlib size=4194304 lazy_refcounts=off refcount_bits=16
create linked clone of drive virtio0 (nfs-server:321/base-321-disk-0.qcow2) clone 321/base-321-disk-0.qcow2: images, vm-457-disk-0.qcow2, 457 to vm-457-disk-0.qcow2 (base=../321/base-321-disk-0.qcow2)
Formatting '/mnt/pve/nfs-server/images/457/vm-457-disk-0.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2147483648 backing_file=../321/base-321-disk-0.qcow2 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
TASK OK

this here was executed on the cli ( on the right server ) ending with this result:

Code:

# ls -la /mnt/pve/nfs-server/images/457/vm-457-disk-0.qcow2
ls: cannot access '/mnt/pve/nfs-server/457/vm-457-disk-0.qcow2': No such file or directory

There was no API call or what so ever involved in this. I am just giving here feedback on what happend in a reproduceable way.

IMHO after the proxmox UI will give TASK OK, this file must have been existed on the filesystem. I did not dive in it even deeper by observing the hostmachines processes to actually create this file during clone to give any feedback what happend during this time on the filesystem.

Maybe the combination nfs + qcow2 template + qcow2 linked clone creates a special situation where proxmox will report back TASK OK while its not yet _really_ on the filesystem.

When i have time i will dive into that deeper and check what happens on the filesystem.

In any way, thank you very much for your effort and time, its highly appreciated!

dcsapak · Apr 12, 2024

layer7.net said:
Formatting '/mnt/pve/nfs-server/images/457/vm-457-disk-0.qcow2'

vs

layer7.net said:
ls: cannot access '/mnt/pve/nfs-server/457/vm-457-disk-0.qcow2': No such file or directory

there is an 'images' folder missing in the 'ls' output

fabian · Apr 12, 2024

layer7.net said:
There was no API call or what so ever involved in this. I am just giving here feedback on what happend in a reproduceable way.

IMHO after the proxmox UI will give TASK OK, this file must have been existed on the filesystem. I did not dive in it even deeper by observing the hostmachines processes to actually create this file during clone to give any feedback what happend during this time on the filesystem.

Maybe the combination nfs + qcow2 template + qcow2 linked clone creates a special situation where proxmox will report back TASK OK while its not yet _really_ on the filesystem.

When i have time i will dive into that deeper and check what happens on the filesystem.

In any way, thank you very much for your effort and time, its highly appreciated!

like I said, that seems very unlikely (the task is only marked as completed when the worker process exits, and the worker process is the one creating the file in a blocking fashion), unless you are testing the file existence from a different node and have some weird caching issue with your NFS server

Search

Search

API update vm will fail while proxmox UI will succeed

layer7.net

Member

fabian

Proxmox Staff Member

layer7.net

Member

layer7.net

Member

fabian

Proxmox Staff Member

layer7.net

Member

fabian

Proxmox Staff Member

layer7.net

Member

dcsapak

Proxmox Staff Member

fabian

Proxmox Staff Member

We value your privacy