Too many HTTP 599 errors in pveproxy, pve-manager, PVE API

Hi,

I already tried running the server independently, but the errors still pop up.

We have 1x 10 GbE NIC for storage, 1x 10 GbE NIC for the network, and 1x 1 GbE NIC for cluster traffic only.

As you can see, the average latency is less than 1 ms; the occasional max values are probably caused by something else.

Is there anything else that could cause this situation (the 599 errors)?
 
This is a report from omping -c 600 -i 1.

The multicast ping results look good as well:

10.1.20.11 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.115/0.166/4.321/0.177
10.1.20.11 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.125/0.193/4.337/0.177
10.1.20.12 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.069/0.145/0.397/0.027
10.1.20.12 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.079/0.166/0.445/0.030
10.1.20.13 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.090/0.160/0.969/0.045
10.1.20.13 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.104/0.185/1.017/0.050
10.1.20.15 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.067/0.134/0.406/0.026
10.1.20.15 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.098/0.154/0.454/0.029
10.1.20.16 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.103/0.155/2.591/0.111
10.1.20.16 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.114/0.171/2.610/0.112
10.1.20.17 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.090/0.151/0.301/0.036
10.1.20.17 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.104/0.164/0.313/0.036
 
root@BT1HS01:~# cat /var/log/daemon.log | grep 'error'
...
Mar 19 03:23:38 BT1HS01 pveproxy[18486]: internal error at /usr/share/perl5/PVE/RESTHandler.pm line 378.
Mar 19 10:55:01 BT1HS01 pveproxy[44052]: EV: error in callback (ignoring): Can't call method "push_write" on an undefined value at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 255.
...

I checked the pvedaemon log as well; there are lots of the same errors, like "RESTHandler.pm line 378".
 
Mar 19 03:23:38 BT1HS01 pveproxy[18486]: internal error at /usr/share/perl5/PVE/RESTHandler.pm line 378.
hm - this error happens when decomposing the path component of an API call results in something for which there is no method - it's rather strange for that to pop up during regular use...:

* If you're using a browser, please try to access the site with a fresh profile, with all plugins disabled and the cache cleared (or install another browser)
* please make sure that the pve packages are indeed installed correctly (I would probably use debsums for that - a quick way to do this is sketched after this list)
* if you're using some other client to access the API (some script relying on curl/pvesh) - please check whether there are updates for it, or whether the problem lies there
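for example, a quick python sketch of that check (untested here - the package name patterns are just examples, extend them as needed):

import subprocess

# list the installed packages matching the given name patterns
pkgs = subprocess.run(
    ["dpkg-query", "-W", "-f", "${Package}\n", "pve-*", "libpve-*"],
    capture_output=True, text=True, check=True,
).stdout.split()

# debsums -s ("silent") only reports files whose checksum differs from the
# one shipped in the package - no output means everything verified fine
res = subprocess.run(["debsums", "-s", *pkgs], capture_output=True, text=True)
print(res.stdout + res.stderr or "all checksums OK")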

hope this helps!
 
- Enterprise repo, latest version.
- debsums: all "OK"
- Our self-developed remote program uses curl to access the REST API.

We make a lot of calls to "/nodes/xx/qemu/xx/status/current" and "/nodes/xx/qemu/x/rrddata?timeframe=hour&cf=AVERAGE" to update the VMs' bandwidth and running status.
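For reference, this is roughly what our program does for each VM (simplified Python sketch - the host, node, VMID and ticket are placeholders, and the ticket is obtained from /access/ticket beforehand):

import requests

HOST = "https://10.1.20.11:8006"   # any node address (placeholder)
NODE, VMID = "xx", "xx"            # placeholders, as above
TICKET = "PVE:..."                 # ticket previously returned by POST /access/ticket

s = requests.Session()
s.cookies.set("PVEAuthCookie", TICKET)
s.verify = False                   # self-signed certificate on the cluster

# current running status of one VM
status = s.get(f"{HOST}/api2/json/nodes/{NODE}/qemu/{VMID}/status/current").json()

# one hour of RRD samples (we read the netin/netout counters from these)
rrd = s.get(f"{HOST}/api2/json/nodes/{NODE}/qemu/{VMID}/rrddata",
            params={"timeframe": "hour", "cf": "AVERAGE"}).json()

print(status["data"]["status"], len(rrd["data"]))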

This is important for us - how can we fix these timeout issues?

If we lose too much rrddata, the bandwidth usage calculation will be incorrect.

A few questions:
- By design, does PVE impose a timeout when getting the running state of a VM or when getting its rrddata, or is this a bug?
- If we use "qm status xxx", can that also return a timeout?
- If we use pvesh to get rrddata, the QEMU configuration, or the QEMU running status, can that time out as well?

We just want to fix the timeout issues and make the API stable.
I found that pvesh is faster than the REST API.

When we use curl to query the API, it always takes more than 30s.
I wrote a test Python program: it calls pvesh, and when we use curl on the same server (which runs our self-developed API) to access that Python program, it only takes around 5-7s.
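Here is roughly how I compared the two paths (simplified sketch of my test; node, VMID and ticket are placeholders again):

import subprocess, time
import requests

HOST, TICKET = "https://10.1.20.11:8006", "PVE:..."   # placeholders
NODE, VMID = "xx", "xx"

# local pvesh call (no pveproxy, no TLS)
t0 = time.time()
subprocess.run(["pvesh", "get", f"/nodes/{NODE}/qemu/{VMID}/status/current"],
               capture_output=True, check=True)
print("pvesh:", round(time.time() - t0, 3), "s")

# the same request through pveproxy / the REST API
t0 = time.time()
requests.get(f"{HOST}/api2/json/nodes/{NODE}/qemu/{VMID}/status/current",
             cookies={"PVEAuthCookie": TICKET}, verify=False)
print("REST :", round(time.time() - t0, 3), "s")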

So, the last question: does pvesh respond to an API request in the same way as pveproxy (pvedaemon)?

xx@pve [21/03/2019:04:23:41 -0400] "GET /api2/json/nodes/xx/qemu/xx/status/current HTTP/1.1" 599 -
xx@pve [21/03/2019:04:49:29 -0400] "GET /api2/json/nodes/xx/qemu/xx/rrddata?timeframe=hour&cf=AVERAGE HTTP/1.1" 599 -
 
* We tried to reproduce the issue locally - and only by firing 10k+ requests at the API in parallel did we manage to get 599 errors in some cases.
* However, the journal then had quite a few messages about pveproxy running into the open file limit (ulimit/nofile) - please check the complete journal for potential issues; a quick per-process check is sketched below.

* In my other tests curl was faster than pvesh for a single request (0.018s vs. 0.5s) - but ymmv - pvesh needs to load quite a bit of the perl environment, but has the upside of not doing the network I/O (and probably rather small overhead for TLS).
In any case, 30s sounds very long for a single request - is your network maybe overloaded (30s is also the builtin timeout of pveproxy)?
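something like this (quick python sketch, run as root on a node) shows the effective limit and the number of file descriptors each pveproxy process currently has open - that is more telling than the shell's ulimit output:

import glob, subprocess

# pgrep matches the process title, so this also catches the pveproxy workers
pids = subprocess.run(["pgrep", "pveproxy"], capture_output=True, text=True).stdout.split()

for pid in pids:
    with open(f"/proc/{pid}/limits") as f:
        nofile = next(line.strip() for line in f if line.startswith("Max open files"))
    fds = len(glob.glob(f"/proc/{pid}/fd/*"))
    print(f"pveproxy {pid}: {fds} fds open ({nofile})")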

as for walking through each VM - you would probably be faster querying /cluster/resources from a node - it contains the netin and netout fields for each guest
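e.g. something like this (sketch, reusing the placeholders from the sketches above - please double-check the exact field names in the API viewer for your version):

import requests

HOST, TICKET = "https://10.1.20.11:8006", "PVE:..."   # placeholders

# one call returns every guest in the cluster, including the netin/netout counters
resources = requests.get(f"{HOST}/api2/json/cluster/resources",
                         cookies={"PVEAuthCookie": TICKET}, verify=False).json()["data"]

for res in resources:
    if res.get("type") == "qemu":
        print(res["vmid"], res.get("status"), res.get("netin"), res.get("netout"))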

hope that helps!
 
root@2JWSS42:~# ulimit -aH
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 773418
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 773418
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

We have enough network resources, but we block SMTP in our network and only allow TCP ports 8006 and 80 inbound on the cluster's Internet connection.
(There are no restrictions on the cluster storage and management networks, nor on the VM network.)
 
root@2JWSS42:~# journalctl -xe | grep -v 'pmxcfs\|@pam\|pvesr\|replication\|--'
Mar 22 14:53:52 2JWSS42 pveproxy[43166]: Clearing outdated entries from certificate cache
Mar 22 14:53:53 2JWSS42 pveproxy[34937]: Clearing outdated entries from certificate cache
Mar 22 14:53:56 2JWSS42 pveproxy[34937]: worker exit
Mar 22 14:53:56 2JWSS42 pveproxy[9808]: worker 34937 finished
Mar 22 14:53:56 2JWSS42 pveproxy[9808]: starting 1 worker(s)
Mar 22 14:53:56 2JWSS42 pveproxy[9808]: worker 24794 started
Mar 22 14:54:23 2JWSS42 pveproxy[44251]: Clearing outdated entries from certificate cache
Mar 22 14:54:34 2JWSS42 pveproxy[24794]: Clearing outdated entries from certificate cache
Mar 22 14:56:15 2JWSS42 login[36264]: pam_unix(login:session): session opened for user root by root(uid=0)
Mar 22 14:56:15 2JWSS42 systemd-logind[12161]: New session 326 of user root.
Mar 22 14:56:15 2JWSS42 systemd[1]: Started Session 326 of user root.
Mar 22 14:56:15 2JWSS42 login[36269]: ROOT LOGIN on '/dev/pts/1'
Mar 22 15:01:23 2JWSS42 systemd-logind[12161]: Removed session 326.
Mar 22 15:01:54 2JWSS42 login[17541]: pam_unix(login:session): session opened for user root by root(uid=0)
Mar 22 15:01:54 2JWSS42 systemd-logind[12161]: New session 327 of user root.
Mar 22 15:01:54 2JWSS42 systemd[1]: Started Session 327 of user root.
Mar 22 15:01:54 2JWSS42 login[17546]: ROOT LOGIN on '/dev/pts/1'
Mar 22 15:04:43 2JWSS42 smartd[12171]: Device: /dev/bus/0 [megaraid_disk_11] [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 78 to 79
Mar 22 15:04:43 2JWSS42 smartd[12171]: Device: /dev/bus/0 [megaraid_disk_11] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 78 to 79
Mar 22 15:07:27 2JWSS42 pveproxy[9808]: worker 44251 finished
Mar 22 15:07:27 2JWSS42 pveproxy[9808]: starting 1 worker(s)
Mar 22 15:07:27 2JWSS42 pveproxy[9808]: worker 47366 started
Mar 22 15:07:28 2JWSS42 pveproxy[47365]: got inotify poll request in wrong process - disabling inotify
Mar 22 15:07:30 2JWSS42 pveproxy[47365]: worker exit
Mar 22 15:08:07 2JWSS42 pveproxy[47366]: Clearing outdated entries from certificate cache
Mar 22 15:14:00 2JWSS42 pveproxy[43166]: worker exit
Mar 22 15:14:00 2JWSS42 pveproxy[9808]: worker 43166 finished
Mar 22 15:14:00 2JWSS42 pveproxy[9808]: starting 1 worker(s)
Mar 22 15:14:00 2JWSS42 pveproxy[9808]: worker 33872 started

root@BT1HS01:~# journalctl -xe | grep 'pveproxy'
Mar 22 15:05:16 BT1HS01 pveproxy[18742]: worker 21810 finished
Mar 22 15:05:16 BT1HS01 pveproxy[18742]: starting 1 worker(s)
Mar 22 15:05:16 BT1HS01 pveproxy[18742]: worker 29732 started
Mar 22 15:05:19 BT1HS01 pveproxy[29728]: got inotify poll request in wrong process - disabling inotify
Mar 22 15:05:24 BT1HS01 pveproxy[29728]: worker exit
Mar 22 15:13:11 BT1HS01 pveproxy[3641]: internal error at /usr/share/perl5/PVE/RESTHandler.pm line 378.
Mar 22 15:13:28 BT1HS01 pveproxy[29732]: internal error at /usr/share/perl5/PVE/RESTHandler.pm line 378.
Mar 22 15:13:37 BT1HS01 pveproxy[5734]: internal error at /usr/share/perl5/PVE/RESTHandler.pm line 378.
Mar 22 15:13:47 BT1HS01 pveproxy[3641]: internal error at /usr/share/perl5/PVE/RESTHandler.pm line 378.
Mar 22 15:15:03 BT1HS01 pveproxy[5734]: internal error at /usr/share/perl5/PVE/RESTHandler.pm line 378.
Mar 22 15:15:12 BT1HS01 pveproxy[3641]: internal error at /usr/share/perl5/PVE/RESTHandler.pm line 378.
Mar 22 15:15:30 BT1HS01 pveproxy[3641]: internal error at /usr/share/perl5/PVE/RESTHandler.pm line 378.
Mar 22 15:15:39 BT1HS01 pveproxy[3641]: internal error at /usr/share/perl5/PVE/RESTHandler.pm line 378.
Mar 22 15:15:48 BT1HS01 pveproxy[29732]: internal error at /usr/share/perl5/PVE/RESTHandler.pm line 378.
 
