Too many HTTP 599 in pveproxy, pve-manager, API

Sptrs_CA

Dear all,

we are running an 8-node cluster.
The issue appeared after we re-installed 2 nodes and rejoined them to the cluster with the same hostnames and IP addresses.

Since then there have been lots of HTTP 599 TOO MANY REDIRECTIONS and HTTP 596 BROKEN PIPE errors on the API and in the Manager, even when accessing these nodes' web UI directly. (If I go through another node in the same cluster, it is even worse.)

root@BT1HS01:~# tail -f /var/log/pveproxy/access.log | grep ' 599 '

Self-dev API Server - whmcs@pve [17/03/2019:06:28:04 -0400] "GET /api2/json/nodes/BT1HS01/qemu/133/status/current HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:28:41 -0400] "PUT /api2/json/nodes/BT1HS01/qemu/344/config HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:28:50 -0400] "GET /api2/json/nodes/BT1HS01/qemu/1915/status/current HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:29:24 -0400] "PUT /api2/json/nodes/BT1HS01/qemu/356/config HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:29:25 -0400] "PUT /api2/json/nodes/BT1HS01/qemu/354/config HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:29:49 -0400] "PUT /api2/json/access/acl HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:30:37 -0400] "PUT /api2/json/nodes/BT1HS01/qemu/356/config HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:30:56 -0400] "PUT /api2/json/access/acl HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:31:25 -0400] "PUT /api2/json/access/acl HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:32:10 -0400] "GET /api2/json/nodes/BT1HS01/qemu/356/status/current HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:32:19 -0400] "GET /api2/json/nodes/BT1HS01/qemu/2094/rrddata?timeframe=hour&cf=AVERAGE HTTP/1.1" 599-

Self-dev API Server - whmcs@pve [17/03/2019:06:32:28 -0400] "GET /api2/json/nodes/BT1HS01/qemu/344/status/current HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:32:56 -0400] "PUT /api2/json/nodes/BT1HS01/qemu/341/config HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:33:26 -0400] "PUT /api2/json/nodes/BT1HS01/qemu/356/config HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:33:51 -0400] "GET /api2/json/nodes/BT1HS01/tasks/UPID:BT1HS01:00002398:01388B9E:5C8E207E:qmclone:10229900:whmcs@pve:/status HTTP/1.1" 599 -



root@7Y29D02:~# tail -f /var/log/pveproxy/access.log | grep ' 599 '

Self-dev API Server - whmcs@pve [17/03/2019:06:31:01 -0400] "PUT /api2/json/nodes/7Y29D02/qemu/339/config HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:32:09 -0400] "GET /api2/json/nodes/7Y29D02/qemu/363/rrddata?timeframe=hour&cf=AVERAGE HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:33:19 -0400] "GET /api2/json/nodes/7Y29D02/storage?enabled=1 HTTP/1.1" 599 -

Self-dev API Server - whmcs@pve [17/03/2019:06:33:33 -0400] "GET /api2/json/nodes/7Y29D02/qemu/443/rrddata?timeframe=hour&cf=MAX HTTP/1.1" 599



I tried to get the reinstalled nodes running on the same version as the rest of the cluster, but it did not help. I even separated one node from the cluster and ran it independently. We have now migrated all the VMs to other nodes and are going to reinstall it. Does anyone know what could have caused this?

proxmox-ve: 5.3-1 (running kernel: 4.15.18-10-pve)
pve-manager: 5.3-8 (running version: 5.3-8/2929af8e)
pve-kernel-4.15: 5.3-1
pve-kernel-4.15.18-10-pve: 4.15.18-32
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-44
libpve-guest-common-perl: 2.0-19
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-36
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-33
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 3.10.1-1
qemu-server: 5.0-45
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1
 
I have already upgraded to the newest version.

proxmox-ve: 5.3-1 (running kernel: 4.15.18-9-pve)
pve-manager: 5.3-11 (running version: 5.3-11/d4907f84)
pve-kernel-4.15: 5.3-2
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.15.18-9-pve: 4.15.18-30
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-47
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-12
libpve-storage-perl: 5.0-38
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-23
pve-cluster: 5.0-33
pve-container: 2.0-34
pve-docs: 5.3-3
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-18
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-2
pve-xtermjs: 3.10.1-2
qemu-server: 5.0-47
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1
 
hm - rejoining a cluster node with the same name/IP can be tricky.

My first guess would be that you did not include the nodes in '/etc/hosts' (or that they are not resolvable to their IPs by other means (DNS)).
- you can check '/etc/pve/.members' for the IP/hostname mapping PVE expects (a quick check is sketched below).

- else I'd take a closer look at the corosync configuration.
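
For example (just a sketch, using getent; BT1HS01 is one of your node names, the IP is only an example, repeat for every cluster node):

# on each node, for every cluster node name:
getent hosts BT1HS01      # should resolve to the cluster IP, e.g. 10.1.20.11
cat /etc/pve/.members     # compare against the "ip" entries PVE expects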
 
Hi,

I checked it. Everything seems to be running well, but the REST API is very slow and frequently times out.
If we retry multiple times, the REST API eventually returns the correct result.

The hosts file on every node is correct, but each one only contains its own record.
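
One way to quantify the slowness (a rough sketch; pvesh ships with PVE, and VMID 133 is just taken from the access log above) is to time the same request locally on the node:

time pvesh get /nodes/BT1HS01/qemu/133/status/current

If this is consistently fast while the same call over HTTP times out, the bottleneck is more likely in pveproxy/pvedaemon than in the underlying API handler.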

The pvecm status

Quorum information
------------------
Date: Mon Mar 18 14:33:55 2019
Quorum provider: corosync_votequorum
Nodes: 8
Node ID: 0x00000001
Ring ID: 1/2516
Quorate: Yes

Votequorum information
----------------------
Expected votes: 8
Highest expected: 8
Total votes: 8
Quorum: 5
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.1.20.11 (local)
0x00000002 1 10.1.20.12
0x00000003 1 10.1.20.13
0x00000004 1 10.1.20.14
0x00000005 1 10.1.20.15
0x00000006 1 10.1.20.16
0x00000008 1 10.1.20.31
 
what is the output of:
* `cat /etc/pve/.members`?
* does the situation improve if you add every node to the /etc/hosts of every node?
 
All the 599s above were reported while accessing the node directly.
(They are also reported when going through the cluster.)
root@BT1HS01:~# cat /etc/pve/.members
{
"nodename": "BT1HS01",
"version": 24,
"cluster": { "name": "Common-Instance", "version": 24, "nodes": 8, "quorate": 1 },
"nodelist": {
"6MCYD22": { "id": 2, "online": 1, "ip": "10.1.20.12"},
"BT1HS01": { "id": 1, "online": 1, "ip": "10.1.20.11"},
"2JWSS42": { "id": 4, "online": 1, "ip": "10.1.20.14"},
"GX3TG32": { "id": 5, "online": 1, "ip": "10.1.20.15"},
"7Y29D02": { "id": 6, "online": 1, "ip": "10.1.20.16"},
"HBBJ762": { "id": 8, "online": 1, "ip": "10.1.20.31"},
"C2GWF62": { "id": 9, "online": 1, "ip": "10.1.20.32"},
"HZ0TG32": { "id": 3, "online": 1, "ip": "10.1.20.13"}
}
}
 
* Are all relevant services running on the nodes? (In any case, please take a careful look at pvedaemon (the daemon handling privileged requests) and pveproxy (the daemon listening on port 8006 and forwarding certain requests to pvedaemon).) See the example commands below.
* Do you see any other problems in the logs? (`journalctl -r`)
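
For example (a generic sketch, nothing node-specific assumed):

systemctl status pveproxy pvedaemon pve-cluster corosync
journalctl -r -u pvedaemon     # privileged API daemon
journalctl -r -u pveproxy      # HTTP daemon on port 8006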
 
Hi,

Yes, I did.

Sometimes I get "pvedaemon[21194]: VM 1926 qmp command failed - VM 1926 ...." (the rest is unreadable in the xterm.js console).

There are fewer "VM xxx qmp command failed" entries than 599 entries.

Most of the log entries look normal, though:

Mar 18 15:05:52 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:51 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:51 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:48 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:46 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:45 BT1HS01 pvedaemon[21194]: VM 1926 qmp command failed - VM 1926 q
Mar 18 15:05:44 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:41 BT1HS01 pvedaemon[13119]: <root@pam> successful auth for user 'w
Mar 18 15:05:38 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:37 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:35 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:35 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:30 BT1HS01 pvedaemon[40250]: <root@pam> successful auth for user 'w
Mar 18 15:05:30 BT1HS01 pvedaemon[21194]: <root@pam> successful auth for user 'w
Mar 18 15:05:28 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:26 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:22 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:19 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:19 BT1HS01 pmxcfs[33180]: [status] notice: received log
Mar 18 15:05:17 BT1HS01 pmxcfs[33180]: [status] notice: received log
 
hm - the logs seem unrelated...

in any case - maybe still try to add all hosts to /etc/hosts (PVE does quite a bit of resolving, even when the request is not proxied to another node).
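
Based on the node names and IPs from your '/etc/pve/.members' output, something like this on every node should do (a sketch; extend the entries with FQDNs if you use them):

10.1.20.11  BT1HS01
10.1.20.12  6MCYD22
10.1.20.13  HZ0TG32
10.1.20.14  2JWSS42
10.1.20.15  GX3TG32
10.1.20.16  7Y29D02
10.1.20.31  HBBJ762
10.1.20.32  C2GWF62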
 
I checked.
pvedaemon reported some timeouts; I thought the 599s might be caused by that.

I'll try adding all hosts to the hosts file. Is there a better way to keep this file in sync across the nodes?
 
did you restart the services?

Edit: sorry - that won't help in that case ...

hmm - My next step would probably be to try to strace the pvedaemon (with children) and see where it hangs...
sorry
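
Something along these lines (a sketch; attach on the affected node while reproducing a slow request - already-running worker processes may need their own -p):

# follow the main pvedaemon and any children it forks, with timestamps
strace -f -tt -o /tmp/pvedaemon.strace -p $(pgrep -o -x pvedaemon)

Long gaps between syscalls (or workers blocking in read/connect) should show where the time is being spent.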
 
Hmm - how long do operations on the pmxcfs take on nodes which run into timeouts:
* `time cat /etc/pve/qemu-server/$vmid.conf` (for a vmid which exists on that node)
* `time touch /etc/pve/test`
* `time rm /etc/pve/test`

does the journal show anything relevant for corosync/pmxcfs:
* `journalctl -r -u corosync.service`
* `journalctl -r -u pve-cluster.service`
 
root@BT1HS01:~# time touch /etc/pve/test

real 0m0.011s
user 0m0.002s
sys 0m0.002s
root@BT1HS01:~# time rm /etc/pve/test

real 0m0.004s
user 0m0.001s
sys 0m0.001s

root@BT1HS01:~# time cat /etc/pve/qemu-server/101.conf
real 0m0.002s
user 0m0.002s
sys 0m0.000s

-- Logs begin at Thu 2019-03-14 21:31:19 EDT, end at Tue 2019-03-19 13:39:38 EDT
Mar 19 03:02:00 BT1HS01 corosync[10499]: [TOTEM ] Retransmit List: 2b8242 2b824
Mar 19 03:02:00 BT1HS01 corosync[10499]: notice [TOTEM ] Retransmit List: 2b824
Mar 19 03:02:00 BT1HS01 corosync[10499]: [TOTEM ] Retransmit List: 2b8242 2b824
Mar 19 03:02:00 BT1HS01 corosync[10499]: notice [TOTEM ] Retransmit List: 2b824
Mar 19 03:02:00 BT1HS01 corosync[10499]: [TOTEM ] Retransmit List: 2b8242 2b824
Mar 19 03:02:00 BT1HS01 corosync[10499]: notice [TOTEM ] Retransmit List: 2b824
Mar 19 03:02:00 BT1HS01 corosync[10499]: [TOTEM ] Retransmit List: 2b8242
Mar 19 03:02:00 BT1HS01 corosync[10499]: notice [TOTEM ] Retransmit List: 2b824
Mar 18 17:07:19 BT1HS01 corosync[10499]: [MAIN ] Completed service synchroniza
Mar 18 17:07:19 BT1HS01 corosync[10499]: [QUORUM] Members[9]: 1 2 3 4 5 6 7 8 9
Mar 18 17:07:19 BT1HS01 corosync[10499]: [CPG ] downlist left_list: 0 receive
Mar 18 17:07:19 BT1HS01 corosync[10499]: [CPG ] downlist left_list: 0 receive
Mar 18 17:07:19 BT1HS01 corosync[10499]: [CPG ] downlist left_list: 0 receive
Mar 18 17:07:19 BT1HS01 corosync[10499]: [CPG ] downlist left_list: 0 receive
Mar 18 17:07:19 BT1HS01 corosync[10499]: notice [MAIN ] Completed service sync
Mar 18 17:07:19 BT1HS01 corosync[10499]: notice [QUORUM] Members[9]: 1 2 3 4 5


-- Logs begin at Thu 2019-03-14 21:31:19 EDT, end at Tue 2019-03-19 13:40:01 EDT
Mar 19 13:40:00 BT1HS01 pmxcfs[22613]: [status] notice: received log
Mar 19 13:40:00 BT1HS01 pmxcfs[22613]: [status] notice: received log
Mar 19 13:39:55 BT1HS01 pmxcfs[22613]: [status] notice: received log
Mar 19 13:39:49 BT1HS01 pmxcfs[22613]: [status] notice: received log
Mar 19 13:39:47 BT1HS01 pmxcfs[22613]: [status] notice: received log
Mar 19 13:39:44 BT1HS01 pmxcfs[22613]: [status] notice: received log
 
The times do look ok, however:
Mar 19 03:02:00 BT1HS01 corosync[10499]: [TOTEM ] Retransmit List: 2b8242 2b824
Mar 19 03:02:00 BT1HS01 corosync[10499]: notice [TOTEM ] Retransmit List: 2b824
Mar 19 03:02:00 BT1HS01 corosync[10499]: [TOTEM ] Retransmit List: 2b8242 2b824
Mar 19 03:02:00 BT1HS01 corosync[10499]: notice [TOTEM ] Retransmit List: 2b824
Mar 19 03:02:00 BT1HS01 corosync[10499]: [TOTEM ] Retransmit List: 2b8242 2b824
These retransmits might indicate a problem with your cluster network (it being overloaded or having too much latency in general).

Please check the reference documentation and run the omping tests (need to be run on all nodes at the same time):
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network
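
From memory the quick test is along these lines (please double-check against the docs; run it on all nodes at the same time, listing all cluster IPs, here taken from your pvecm output):

omping -c 10000 -i 0.001 -F -q 10.1.20.11 10.1.20.12 10.1.20.13 10.1.20.14 10.1.20.15 10.1.20.16 10.1.20.31 10.1.20.32

followed by the longer ~10 minute multicast test:

omping -c 600 -i 1 -q 10.1.20.11 10.1.20.12 10.1.20.13 10.1.20.14 10.1.20.15 10.1.20.16 10.1.20.31 10.1.20.32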
 
10.1.20.11 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.057/0.125/8.884/0.183
10.1.20.11 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.068/0.155/8.894/0.187
10.1.20.12 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.058/0.104/0.895/0.024
10.1.20.12 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.065/0.131/0.893/0.035
10.1.20.13 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.059/0.127/3.300/0.087
10.1.20.13 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.066/0.132/3.311/0.086
10.1.20.14 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.060/0.124/3.900/0.073
10.1.20.14 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.056/0.110/3.894/0.070
10.1.20.15 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.059/0.116/0.711/0.031
10.1.20.15 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.064/0.124/0.775/0.031
10.1.20.16 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.079/0.214/11.698/0.493
10.1.20.16 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.088/0.222/11.667/0.493
10.1.20.31 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.078/0.114/0.370/0.017
10.1.20.31 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.084/0.130/0.397/0.027
10.1.20.32 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.081/0.140/0.817/0.031
10.1.20.32 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.086/0.143/0.852/0.032
 
I suspect the pvedaemon API workers may be getting stuck because our API client issues too many "guest-ping" calls. Does guest-ping consume that much time/resources?
We get a lot of these timeout reports:
VM 2014 qmp command failed - VM 2014 qmp command 'guest-ping' failed - got timeout

Also, is there any other way to test whether the qemu-guest-agent is installed in a customer's VM or not?
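
One option, assuming `qm agent` is available in this qemu-server version, would be a single manual ping from the host (VMID 2014 from above is just an example):

qm agent 2014 ping && echo "agent reachable" || echo "no agent / not responding"

An exit code of 0 means the agent answered; an error or timeout suggests the agent is not installed or not running in the guest.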


I removed the guest-ping calls and tested again. The 599 errors are still occurring.
 
