/storage/content API call not working - 596

kenosis · Feb 13, 2024

Hi all,

Specifically for one of the nodes in the cluster, the /storage/content call (https://domain.tld:8006/api2/json/nodes/node-1/storage/local/content) is not working. In the pveproxy log it returns a "596" error, which seems to indicate a "resource unavailable" error. Other API calls for this node are working fine, including the various other /storage/ calls. When executing

Code:

pvesh get /nodes/node-1/storage/local/content

this also works fine, albeit with some delay. Timing from the "time" command prefix:

Code:

real    0m35.039s
user    0m17.313s
sys    0m18.277s

For the other node, with similar spec, the pvesh /storage/content call comes back in 13 seconds. There are definitely more VMs active on node-1, but looking at read/write behaviour it's not immediately obvious why the difference should be so large.

My guess is that because the content call on node-1 takes over 30 seconds, it may hit some type of timeout when going through the pveproxy / API. Can anyone confirm that this is the case? Is there also a way to change such a timeout setting?

We are running 7.4-16.
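
To see how often the proxy answers this path with 596, a rough sketch like the one below could count the matching log entries. It rests on assumptions: that the access log is at /var/log/pveproxy/access.log and follows the common log format (request path in the 7th whitespace-separated field, status code in the 9th). `count_status` is a hypothetical helper name, not part of PVE.

```shell
# Hypothetical helper: count log lines for a request path that ended
# with a given HTTP status.
# Assumption: common log format, i.e. field 7 = request path and
# field 9 = status code. Adjust the field numbers if your log differs.
count_status() {
    logfile=$1   # e.g. /var/log/pveproxy/access.log (assumed location)
    path=$2      # path fragment to match, e.g. /storage/local/content
    status=$3    # status code to count, e.g. 596
    awk -v p="$path" -v s="$status" \
        'index($7, p) > 0 && $9 == s { n++ } END { print n + 0 }' "$logfile"
}

# Example (on the affected node):
#   count_status /var/log/pveproxy/access.log /storage/local/content 596
```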

bbgeek17 · Feb 13, 2024

What is the storage behind this? If it's network based, it's possible there is more latency/packet loss between this node and storage.
The timeouts are hardcoded afaik.
Examine the logs on both sides and measure "pvesm status ..." and "pvesm list ..." responses; a network trace may be very revealing.

Similar threads:
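
One way to take those measurements is a small timing wrapper. This is only a sketch: `elapsed_seconds` is a hypothetical helper, it has whole-second resolution, and the storage name `local` in the example is taken from the first post.

```shell
# Hypothetical helper: run a command, discard its output, and print the
# wall-clock time it took in whole seconds.
elapsed_seconds() {
    start=$(date +%s)
    # The exit status is ignored on purpose; only the duration matters here.
    "$@" >/dev/null 2>&1 || true
    end=$(date +%s)
    echo $((end - start))
}

# Example, run on each node and compare the results:
#   elapsed_seconds pvesm status
#   elapsed_seconds pvesm list local
```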
https://forum.proxmox.com/threads/proxmox-api-performance-issues.120747/
https://forum.proxmox.com/threads/pve-rest-api-liefert-zu-viele-http-596-und-599-status-codes.97169/

Blockbridge: Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox

kenosis · Feb 13, 2024

It's local storage on each server. That's why I was looking at IOPS for the node that has 30+ seconds response time. Granted, it's limited by SATA-600 so that is not optimal and I want to change that in the future.

But that's why I'm wondering about the timeout settings. I can accept that we've overloaded the hardware (though it doesn't seem that way now), but if I can change the timeout, at least that can be a bridge towards new server architecture with NVMe drives.

bbgeek17 · Feb 13, 2024

There are several types of local storage. As was mentioned before, try to reproduce the issue with pvesm or pvesh, to exclude external/network factors. Run it in a loop 1000 times. If the issue does not reproduce, then perhaps you have a network problem. Move your API client directly to PVE and try to reproduce it. If you can't again, that lends more credence to an unstable network. The next step would be basic network troubleshooting or a network trace.
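
For the loop test, something along these lines could work. Treat it as a sketch: `repeat_and_count` is a hypothetical helper, timings have whole-second resolution, and the 30-second limit in the example is only the threshold suspected earlier in the thread.

```shell
# Hypothetical helper: run a command N times and report how many runs
# failed and how many took at least LIMIT seconds.
repeat_and_count() {
    runs=$1
    limit=$2
    shift 2
    failed=0
    slow=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s)
        "$@" >/dev/null 2>&1 || failed=$((failed + 1))
        took=$(( $(date +%s) - start ))
        if [ "$took" -ge "$limit" ]; then
            slow=$((slow + 1))
        fi
        i=$((i + 1))
    done
    echo "runs=$runs failed=$failed slow=$slow"
}

# Example:
#   repeat_and_count 1000 30 pvesh get /nodes/node-1/storage/local/content
```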

Good luck

PS you can review various timeout settings by looking at PVE code:
/usr/share/perl5/PVE# grep -Ri timeout
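
To narrow that search, something like the following lists only lines that assign a numeric timeout, with file name and line number. It is a sketch: `find_numeric_timeouts` is a hypothetical helper, and the pattern only matches the Perl `timeout => N` form, so other spellings will be missed.

```shell
# Hypothetical helper: recursively list lines of the form "timeout => <number>".
find_numeric_timeouts() {
    dir=$1   # e.g. /usr/share/perl5/PVE
    grep -rnE 'timeout[[:space:]]*=>[[:space:]]*[0-9]+' "$dir"
}

# Example:
#   find_numeric_timeouts /usr/share/perl5/PVE
```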

kenosis · Feb 13, 2024

It's not network-based storage, and this was not a networking issue.

The problem was as I suspected. In APIServer/AnyEvent.pm there is a general timeout for HTTP requests defined which is set at 30 seconds:

Code:

        my $w; $w = http_request(
            $method => $target,
            headers => $headers,
            timeout => 30,
            recurse => 0,
            proxy => undef, # avoid use of $ENV{HTTP_PROXY}

When changing this to 45 seconds and restarting the various services, the /content call loads without a problem. As I indicated, the content list of the storage takes over 30 seconds to load, even with pvesh. Weirdly though, when loading the different content types separately, the total time does not add up to over 30 seconds. For example:

/content?content=images takes 16s, with the others (vztmpl, iso, backup) taking less than a second each. Only when the full /content call is used does it go over 30 seconds.

So still unclear to me why the listing is taking so long, but perhaps consideration could be given to this particular scenario and have the pveproxy provide a clearer error message.

Edit: Just to add, the total /content call currently takes 32 seconds and a bit, which is just about double the total of constituent calls. I wonder if in the /content call it somehow loads /content?content=images twice? Since that would very neatly explain the time difference. It's pure speculation on my part however.

Edit2: Same for the other node. /content takes 12 seconds, /content?content=images 6 seconds, vztmpl, iso, snippets and backup all less than 100ms. My guess is /content somehow lists images twice.
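
For anyone who wants to repeat the comparison, it could be scripted roughly as follows. This is a sketch under assumptions: `compare_content_timings` and `time_cmd` are hypothetical helpers, `--content` is assumed to be how pvesh passes the `content` query parameter, and timings have whole-second resolution.

```shell
# Hypothetical helpers: time the listing for each content type separately,
# then the full listing, and print both totals for comparison.
# Assumption: pvesh accepts the API's "content" parameter as --content.
PVESH=${PVESH:-pvesh}   # overridable, e.g. for a dry run with a stub

time_cmd() {
    start=$(date +%s)
    # The exit status is ignored on purpose; only the duration matters here.
    "$@" >/dev/null 2>&1 || true
    echo $(( $(date +%s) - start ))
}

compare_content_timings() {
    path="/nodes/$1/storage/$2/content"
    sum=0
    for type in images vztmpl iso backup snippets; do
        t=$(time_cmd "$PVESH" get "$path" --content "$type")
        echo "$type: ${t}s"
        sum=$((sum + t))
    done
    full=$(time_cmd "$PVESH" get "$path")
    echo "sum of per-type calls: ${sum}s"
    echo "full listing: ${full}s"
}

# Example:
#   compare_content_timings node-1 local
```

If the full listing consistently comes out at about twice the sum of the per-type calls, that would fit the guess above that images are listed twice, though it would not prove it.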

bbgeek17 · Feb 13, 2024

Although PVE staff are present in the forum, if you feel that you stumbled upon a bug or code deficiency, the appropriate avenue to bring it to the PVE developers' attention is to file a bug at https://bugzilla.proxmox.com/.

If you decide to do so, I recommend making a very clear description, with all command examples, reporting full pveversion output and all supporting information in a neat, organized way.

good luck
