Too many HTTP 599 in pveproxy, pve-manager, API

Sptrs_CA

Dear all,

we are running an 8-node cluster.
The issue appeared after we re-installed 2 nodes and rejoined them to the cluster with the same hostnames and IP addresses.

Since then there have been lots of HTTP 599 TOO MANY REDIRECTIONS and HTTP 596 BROKEN PIPE errors on the API and in the Manager, even when accessing these nodes' web UI directly. (If I go through another node in the same cluster, it is even worse.)

root@BT1HS01:~# tail -f /var/log/pveproxy/access.log | grep ' 599 '

Self-dev API Server - whmcs@pve [17/03/2019:06:28:04 -0400] "GET /api2/json/nodes/BT1HS01/qemu/133/status/current HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:28:41 -0400] "PUT /api2/json/nodes/BT1HS01/qemu/344/config HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:28:50 -0400] "GET /api2/json/nodes/BT1HS01/qemu/1915/status/current HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:29:24 -0400] "PUT /api2/json/nodes/BT1HS01/qemu/356/config HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:29:25 -0400] "PUT /api2/json/nodes/BT1HS01/qemu/354/config HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:29:49 -0400] "PUT /api2/json/access/acl HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:30:37 -0400] "PUT /api2/json/nodes/BT1HS01/qemu/356/config HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:30:56 -0400] "PUT /api2/json/access/acl HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:31:25 -0400] "PUT /api2/json/access/acl HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:32:10 -0400] "GET /api2/json/nodes/BT1HS01/qemu/356/status/current HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:32:19 -0400] "GET /api2/json/nodes/BT1HS01/qemu/2094/rrddata?timeframe=hour&cf=AVERAGE HTTP/1.1" 599-

Self-dev API Server - whmcs@pve [17/03/2019:06:32:28 -0400] "GET /api2/json/nodes/BT1HS01/qemu/344/status/current HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:32:56 -0400] "PUT /api2/json/nodes/BT1HS01/qemu/341/config HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:33:26 -0400] "PUT /api2/json/nodes/BT1HS01/qemu/356/config HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:33:51 -0400] "GET /api2/json/nodes/BT1HS01/tasks/UPID:BT1HS01:00002398:01388B9E:5C8E207E:qmclone:10229900:whmcs@pve:/status HTTP/1.1" 599 -



root@7Y29D02:~# tail -f /var/log/pveproxy/access.log | grep ' 599 '

Self-dev API Server - whmcs@pve [17/03/2019:06:31:01 -0400] "PUT /api2/json/nodes/7Y29D02/qemu/339/config HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:32:09 -0400] "GET /api2/json/nodes/7Y29D02/qemu/363/rrddata?timeframe=hour&cf=AVERAGE HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:33:19 -0400] "GET /api2/json/nodes/7Y29D02/storage?enabled=1 HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:33:33 -0400] "GET /api2/json/nodes/7Y29D02/qemu/443/rrddata?timeframe=hour&cf=MAX HTTP/1.1" 599



I tried to get the reinstalled nodes running on the same version as the rest of the cluster, but it did not help. I even separated one node from the cluster and ran it independently. We have now migrated all the VMs to other nodes and are going to reinstall it. Does anyone know what could have caused this?

proxmox-ve: 5.3-1 (running kernel: 4.15.18-10-pve)
pve-manager: 5.3-8 (running version: 5.3-8/2929af8e)
pve-kernel-4.15: 5.3-1
pve-kernel-4.15.18-10-pve: 4.15.18-32
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-44
libpve-guest-common-perl: 2.0-19
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-36
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-33
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 3.10.1-1
qemu-server: 5.0-45
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1
 
I have already upgraded to the newest version.

proxmox-ve: 5.3-1 (running kernel: 4.15.18-9-pve)
pve-manager: 5.3-11 (running version: 5.3-11/d4907f84)
pve-kernel-4.15: 5.3-2
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.15.18-9-pve: 4.15.18-30
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-47
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-12
libpve-storage-perl: 5.0-38
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-23
pve-cluster: 5.0-33
pve-container: 2.0-34
pve-docs: 5.3-3
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-18
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-2
pve-xtermjs: 3.10.1-2
qemu-server: 5.0-47
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1
 
hm - rejoining a cluster node with the same name/IP can be tricky.

My first guess would be that you did not include the nodes in '/etc/hosts' (or that they are not resolvable to their IPs by other means (DNS)).
- you can check '/etc/pve/.members' for the IP/hostname mapping PVE expects (a quick check is sketched below).

- else I'd take a closer look at the corosync configuration.
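
For example (just a sketch, using getent; BT1HS01 is one of your node names, the IP is only an example, repeat for every cluster node):

# on each node, for every cluster node name:
getent hosts BT1HS01      # should resolve to the cluster IP, e.g. 10.1.20.11
cat /etc/pve/.members     # compare against the "ip" entries PVE expects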
 
Hi,

I checked it. Everything seems to be running well, but the REST API is very slow and frequently times out.
If we retry multiple times, the REST API eventually returns the correct result.

The hosts file on every node is correct, but each one only contains its own record.
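
One way to quantify the slowness (a rough sketch; pvesh ships with PVE, and VMID 133 is just taken from the access log above) is to time the same request locally on the node:

time pvesh get /nodes/BT1HS01/qemu/133/status/current

If this is consistently fast while the same call over HTTP times out, the bottleneck is more likely in pveproxy/pvedaemon than in the underlying API handler.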

The pvecm status

Quorum information
------------------
Date: Mon Mar 18 14:33:55 2019
Quorum provider: corosync_votequorum
Nodes: 8
Node ID: 0x00000001
Ring ID: 1/2516
Quorate: Yes

Votequorum information
----------------------
Expected votes: 8
Highest expected: 8
Total votes: 8
Quorum: 5
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.1.20.11 (local)
0x00000002 1 10.1.20.12
0x00000003 1 10.1.20.13
0x00000004 1 10.1.20.14
0x00000005 1 10.1.20.15
0x00000006 1 10.1.20.16
0x00000008 1 10.1.20.31
 
what is the output of:
* `cat /etc/pve/.members`?
* does the situation improve if you add every node to the /etc/hosts of every node?
 
All the 599s above were reported while accessing the node directly.
(They are also reported when going through the cluster.)
root@BT1HS01:~# cat /etc/pve/.members
{
"nodename": "BT1HS01",
"version": 24,
"cluster": { "name": "Common-Instance", "version": 24, "nodes": 8, "quorate": 1 },
"nodelist": {
"6MCYD22": { "id": 2, "online": 1, "ip": "10.1.20.12"},
"BT1HS01": { "id": 1, "online": 1, "ip": "10.1.20.11"},
"2JWSS42": { "id": 4, "online": 1, "ip": "10.1.20.14"},
"GX3TG32": { "id": 5, "online": 1, "ip": "10.1.20.15"},
"7Y29D02": { "id": 6, "online": 1, "ip": "10.1.20.16"},
"HBBJ762": { "id": 8, "online": 1, "ip": "10.1.20.31"},
"C2GWF62": { "id": 9, "online": 1, "ip": "10.1.20.32"},
"HZ0TG32": { "id": 3, "online": 1, "ip": "10.1.20.13"}
}
}
 
* Are all relevant services running on the nodes? (In any case, please take a careful look at pvedaemon (the daemon handling privileged requests) and pveproxy (the daemon listening on port 8006 and forwarding certain requests to pvedaemon).) See the example commands below.
* Do you see any other problems in the logs? (`journalctl -r`)
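
For example (a generic sketch, nothing node-specific assumed):

systemctl status pveproxy pvedaemon pve-cluster corosync
journalctl -r -u pvedaemon     # privileged API daemon
journalctl -r -u pveproxy      # HTTP daemon on port 8006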
 
Hi,

Yes, I did.

Sometimes I get "pvedaemon[21194]: VM 1926 qmp command failed - VM 1926 ...." (the rest is unreadable in the xterm.js console).

There are fewer "VM xxx qmp command failed" entries than 599 entries.

Most of the log entries look normal, though:

Mar 18 15:05:52 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:51 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:51 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:48 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:46 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:45 BT1HS01 pvedaemon[21194]: VM 1926 qmp command failed - VM 1926 q
Mar 18 15:05:44 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:41 BT1HS01 pvedaemon[13119]: <root@pam> successful auth for user 'w
Mar 18 15:05:38 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:37 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:35 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:35 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:30 BT1HS01 pvedaemon[40250]: <root@pam> successful auth for user 'w
Mar 18 15:05:30 BT1HS01 pvedaemon[21194]: <root@pam> successful auth for user 'w
Mar 18 15:05:28 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:26 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:22 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:19 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:19 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:17 BT1HS01 pmxcfs[33180]: [status] notice: received log
 
hm - the logs seem unrelated...

in any case - maybe still try to add all hosts to /etc/hosts (PVE does quite a bit of resolving, even when the request is not proxied to another node).
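
Based on the node names and IPs from your '/etc/pve/.members' output, something like this on every node should do (a sketch; extend the entries with FQDNs if you use them):

10.1.20.11  BT1HS01
10.1.20.12  6MCYD22
10.1.20.13  HZ0TG32
10.1.20.14  2JWSS42
10.1.20.15  GX3TG32
10.1.20.16  7Y29D02
10.1.20.31  HBBJ762
10.1.20.32  C2GWF62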
 
I checked.
pvedaemon reported some timeouts; I thought the 599s might be caused by that.

I'll try adding all hosts to the hosts file. Is there a better way to keep this file in sync across the nodes?
 
did you restart the services?

Edit: sorry - that won't help in that case ...

hmm - My next step would probably be to try to strace the pvedaemon (with children) and see where it hangs...
sorry
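
Something along these lines (a sketch; attach on the affected node while reproducing a slow request - already-running worker processes may need their own -p):

# follow the main pvedaemon and any children it forks, with timestamps
strace -f -tt -o /tmp/pvedaemon.strace -p $(pgrep -o -x pvedaemon)

Long gaps between syscalls (or workers blocking in read/connect) should show where the time is being spent.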
 
Hmm - how long do operations on the pmxcfs take on nodes which run into timeouts:
* `time cat /etc/pve/qemu-server/$vmid.conf` (for a vmid which exists on that node)
* `time touch /etc/pve/test`
* `time rm /etc/pve/test`

does the journal show anything relevant for corosync/pmxcfs:
* `journalctl -r -u corosync.service`
* `journalctl -r -u pve-cluster.service`
 
root@BT1HS01:~# time touch /etc/pve/test

real 0m0.011s
user 0m0.002s
sys 0m0.002s
root@BT1HS01:~# time rm /etc/pve/test

real 0m0.004s
user 0m0.001s
sys 0m0.001s

root@BT1HS01:~# time cat /etc/pve/qemu-server/101.conf
real 0m0.002s
user 0m0.002s
sys 0m0.000s

-- Logs begin at Thu 2019-03-14 21:31:19 EDT, end at Tue 2019-03-19 13:39:38 EDT
Mar 19 03:02:00 BT1HS01 corosync[10499]: [TOTEM ] Retransmit List: 2b8242 2b824
Mar 19 03:02:00 BT1HS01 corosync[10499]: notice [TOTEM ] Retransmit List: 2b824
Mar 19 03:02:00 BT1HS01 corosync[10499]: [TOTEM ] Retransmit List: 2b8242 2b824
Mar 19 03:02:00 BT1HS01 corosync[10499]: notice [TOTEM ] Retransmit List: 2b824
Mar 19 03:02:00 BT1HS01 corosync[10499]: [TOTEM ] Retransmit List: 2b8242 2b824
Mar 19 03:02:00 BT1HS01 corosync[10499]: notice [TOTEM ] Retransmit List: 2b824
Mar 19 03:02:00 BT1HS01 corosync[10499]: [TOTEM ] Retransmit List: 2b8242
Mar 19 03:02:00 BT1HS01 corosync[10499]: notice [TOTEM ] Retransmit List: 2b824
Mar 18 17:07:19 BT1HS01 corosync[10499]: [MAIN ] Completed service synchroniza
Mar 18 17:07:19 BT1HS01 corosync[10499]: [QUORUM] Members[9]: 1 2 3 4 5 6 7 8 9
Mar 18 17:07:19 BT1HS01 corosync[10499]: [CPG ] downlist left_list: 0 receive
Mar 18 17:07:19 BT1HS01 corosync[10499]: [CPG ] downlist left_list: 0 receive
Mar 18 17:07:19 BT1HS01 corosync[10499]: [CPG ] downlist left_list: 0 receive
Mar 18 17:07:19 BT1HS01 corosync[10499]: [CPG ] downlist left_list: 0 receive
Mar 18 17:07:19 BT1HS01 corosync[10499]: notice [MAIN ] Completed service sync
Mar 18 17:07:19 BT1HS01 corosync[10499]: notice [QUORUM] Members[9]: 1 2 3 4 5


-- Logs begin at Thu 2019-03-14 21:31:19 EDT, end at Tue 2019-03-19 13:40:01 EDT
Mar 19 13:40:00 BT1HS01 pmxcfs[22613]: [status] notice: received log
Mar 19 13:40:00 BT1HS01 pmxcfs[22613]: [status] notice: received log
Mar 19 13:39:55 BT1HS01 pmxcfs[22613]: [status] notice: received log
Mar 19 13:39:49 BT1HS01 pmxcfs[22613]: [status] notice: received log
Mar 19 13:39:47 BT1HS01 pmxcfs[22613]: [status] notice: received log
Mar 19 13:39:44 BT1HS01 pmxcfs[22613]: [status] notice: received log
 
The times do look ok, however:
Mar 19 03:02:00 BT1HS01 corosync[10499]: [TOTEM ] Retransmit List: 2b8242 2b824
Mar 19 03:02:00 BT1HS01 corosync[10499]: notice [TOTEM ] Retransmit List: 2b824
Mar 19 03:02:00 BT1HS01 corosync[10499]: [TOTEM ] Retransmit List: 2b8242 2b824
Mar 19 03:02:00 BT1HS01 corosync[10499]: notice [TOTEM ] Retransmit List: 2b824
Mar 19 03:02:00 BT1HS01 corosync[10499]: [TOTEM ] Retransmit List: 2b8242 2b824
These retransmits might indicate a problem with your cluster network (it being overloaded or having too much latency in general).

Please check the reference documentation and run the omping tests (need to be run on all nodes at the same time):
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network
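
From memory the quick test is along these lines (please double-check against the docs; run it on all nodes at the same time, listing all cluster IPs, here taken from your pvecm output):

omping -c 10000 -i 0.001 -F -q 10.1.20.11 10.1.20.12 10.1.20.13 10.1.20.14 10.1.20.15 10.1.20.16 10.1.20.31 10.1.20.32

followed by the longer ~10 minute multicast test:

omping -c 600 -i 1 -q 10.1.20.11 10.1.20.12 10.1.20.13 10.1.20.14 10.1.20.15 10.1.20.16 10.1.20.31 10.1.20.32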
 
10.1.20.11 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.057/0.125/8.884/0.183
10.1.20.11 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.068/0.155/8.894/0.187
10.1.20.12 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.058/0.104/0.895/0.024
10.1.20.12 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.065/0.131/0.893/0.035
10.1.20.13 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.059/0.127/3.300/0.087
10.1.20.13 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.066/0.132/3.311/0.086
10.1.20.14 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.060/0.124/3.900/0.073
10.1.20.14 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.056/0.110/3.894/0.070
10.1.20.15 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.059/0.116/0.711/0.031
10.1.20.15 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.064/0.124/0.775/0.031
10.1.20.16 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.079/0.214/11.698/0.493
10.1.20.16 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.088/0.222/11.667/0.493
10.1.20.31 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.078/0.114/0.370/0.017
10.1.20.31 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.084/0.130/0.397/0.027
10.1.20.32 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.081/0.140/0.817/0.031
10.1.20.32 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.086/0.143/0.852/0.032
 
I suspect the pvedaemon API workers may be getting stuck because our API client issues too many "guest-ping" calls. Does guest-ping consume that much time/resources?
We get a lot of these timeout reports:
VM 2014 qmp command failed - VM 2014 qmp command 'guest-ping' failed - got timeout

Also, is there any other way to test whether the qemu-guest-agent is installed in a customer's VM or not?
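
One option, assuming `qm agent` is available in this qemu-server version, would be a single manual ping from the host (VMID 2014 from above is just an example):

qm agent 2014 ping && echo "agent reachable" || echo "no agent / not responding"

An exit code of 0 means the agent answered; an error or timeout suggests the agent is not installed or not running in the guest.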


I removed the guest-ping calls and tested again. The 599 errors are still occurring.
 
