[SOLVED] WebUI unresponsive in cluster after all VMs became grayed out (pveproxy failure)

EmilioMoreno

Well-Known Member
Feb 27, 2020
I have a 9-node production cluster based on PVE 6 (pve-manager/6.4-13/9f411e79 (running kernel: 5.4.143-1-pve)). The oldest server in the cluster has been up & running for 563 days, while the newest has been up for 201 days so far. Two days ago all servers in the cluster became grayed out, but I was still able to access node information and network config. Today I cannot even log in to the web interface on any of the servers, as I get

" Connection failure. Network error or Proxmox VE services not running"

All the VMs and containers work OK.

pvecm status reports OK (9 out of 9 votes, quorate on all the servers)
pvesm status shows all local storages without any issues
corosync-cfgtool -s shows all nodes as connected (or localhost)
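
For reference, these are the checks I ran on each node (all of them came back clean):
Code:
pvecm status
pvesm status
corosync-cfgtool -s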

The following services are active and running:
Code:
systemctl status pvestatd
systemctl status pve-cluster
systemctl status corosync
systemctl status pveproxy
systemctl status pvedaemon
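
If it helps, a quick one-liner to check them all in one go could be something like this (just a convenience, not part of the original troubleshooting):
Code:
for s in pvestatd pve-cluster corosync pveproxy pvedaemon; do echo -n "$s: "; systemctl is-active "$s"; done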

However, pveproxy shows the following in the journal log on all servers:
Code:
Jun 01 00:00:00 pve-10 systemd[1]: Reloading PVE API Proxy Server.
Jun 01 00:01:30 pve-10 systemd[1]: pveproxy.service: Reload operation timed out. Killing reload process.
Jun 01 00:01:30 pve-10 systemd[1]: Reload failed for PVE API Proxy Server.

I tried to restart the pveproxy service on one server, with the following result:

Code:
Jun 01 23:00:29 pve-10 systemd[1]: pveproxy.service: Stopping timed out. Terminating.
Jun 01 23:00:29 pve-10 pveproxy[1627]: received signal TERM
Jun 01 23:00:29 pve-10 pveproxy[1627]: server closing
Jun 01 23:00:29 pve-10 pveproxy[47818]: worker exit
Jun 01 23:00:29 pve-10 pveproxy[47816]: worker exit
Jun 01 23:00:29 pve-10 pveproxy[1627]: worker 47817 finished
Jun 01 23:00:29 pve-10 pveproxy[1627]: worker 47816 finished
Jun 01 23:00:29 pve-10 pveproxy[1627]: worker 47818 finished
Jun 01 23:00:29 pve-10 pveproxy[1627]: server stopped
Jun 01 23:01:59 pve-10 systemd[1]: pveproxy.service: State 'stop-sigterm' timed out. Killing.
Jun 01 23:01:59 pve-10 systemd[1]: pveproxy.service: Killing process 3509 (pveproxy) with signal SIGKILL.
Jun 01 23:01:59 pve-10 systemd[1]: pveproxy.service: Killing process 25740 (pveproxy) with signal SIGKILL.
Jun 01 23:03:30 pve-10 systemd[1]: pveproxy.service: Processes still around after SIGKILL. Ignoring.
Jun 01 23:05:00 pve-10 systemd[1]: pveproxy.service: State 'stop-final-sigterm' timed out. Killing.
Jun 01 23:05:00 pve-10 systemd[1]: pveproxy.service: Killing process 25740 (pveproxy) with signal SIGKILL.
Jun 01 23:05:00 pve-10 systemd[1]: pveproxy.service: Killing process 3509 (pveproxy) with signal SIGKILL.
Jun 01 23:06:30 pve-10 systemd[1]: pveproxy.service: Processes still around after final SIGKILL. Entering failed mode.
Jun 01 23:06:30 pve-10 systemd[1]: pveproxy.service: Failed with result 'timeout'.
Jun 01 23:06:30 pve-10 systemd[1]: Stopped PVE API Proxy Server.
Jun 01 23:06:30 pve-10 systemd[1]: pveproxy.service: Found left-over process 25740 (pveproxy) in control group while starting unit. Ignoring.
Jun 01 23:06:30 pve-10 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 01 23:06:30 pve-10 systemd[1]: pveproxy.service: Found left-over process 3509 (pveproxy) in control group while starting unit. Ignoring.
Jun 01 23:06:30 pve-10 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 01 23:06:30 pve-10 systemd[1]: Starting PVE API Proxy Server...
Jun 01 23:07:00 pve-10 pvecm[5797]: got timeout


Any ideas? This is a production cluster, so restarting all nodes is not really an option for us.

This is the output of pveversion -v for the server on which I am attempting the restart:

Code:
root@pve-10:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.203-1-pve)
pve-manager: 6.4-15 (running version: 6.4-15/af7986e6)
pve-kernel-5.4: 6.4-20
pve-kernel-helper: 6.4-20
pve-kernel-5.4.203-1-pve: 5.4.203-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.5-pve2~bpo10+1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve2~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-5
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.14-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-2
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.7-pve1
 
Long story short: stopping (yes, stopping) the services on one node fixed the problem on all the nodes. I do not know how or why, but

Code:
systemctl stop pve-cluster && systemctl stop corosync && systemctl stop pvedaemon && systemctl stop pveproxy && systemctl stop pvestatd

and everything worked fine again.
Some extra info, in case it helps anyone:

1.- The pveproxy service had failed on all cluster nodes with the following message:
Code:
Jun 01 00:01:30 pve-3 systemd[1]: pveproxy.service: Reload operation timed out. Killing reload process.
Jun 01 00:01:30 pve-3 systemd[1]: Reload failed for PVE API Proxy Server.
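
Something along these lines should show the failed reload in the journal on each node:
Code:
journalctl -u pveproxy.service --since "2 days ago" | grep -iE "reload|failed"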

2.- Nothing was listening on port 8006. A telnet to localhost on this port resulted in connection refused on all of the servers.
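
Something like this can be used to verify it (I used telnet; the ss check is just another way to see the same thing):
Code:
ss -tlnp | grep 8006
telnet localhost 8006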

3.- All nodes had the same processes running as root. Actually, the pveproxy restart from around 0:00, some 24 hours earlier, looked weird:
Code:
root      6847  0.0  0.0 282420 95144 ?        Ds   May31   0:00 /usr/bin/perl -T /usr/bin/pvesr run --mail 1
root     25740  0.0  0.0 280748 93384 ?        Ds   May31   0:00 /usr/bin/perl -T /usr/bin/pveproxy restart
root     26280  0.0  0.0  86168  2236 ?        Ssl  00:01   0:08 /usr/sbin/pvefw-logger
root     23331  0.0  0.0 290212 97504 ?        Ds   04:26   0:00 /usr/bin/perl /usr/bin/pveupdate
root     46649  0.0  0.0 276808 89376 ?        D    22:42   0:00 /usr/bin/perl /usr/bin/pvestatd status
These commands were still running on all the servers, all 9 of them.
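
Note the D in the STAT column: those processes were in uninterruptible sleep (likely blocked waiting on I/O, possibly on the /etc/pve fuse mount), which would also explain why systemd's later SIGKILL had no effect on them. A quick way to spot such processes is something like:
Code:
# list processes stuck in uninterruptible sleep (STAT starting with D)
ps -eo pid,stat,wchan,comm | awk 'NR==1 || $2 ~ /^D/'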

4.- Restarting only the pveproxy service was clearly a mistake; it failed with a timeout, but systemd restarted it again and again and again:
Code:
Jun 02 00:36:49 pve-10 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 02 00:36:49 pve-10 systemd[1]: Starting PVE API Proxy Server...
Jun 02 00:37:19 pve-10 pvecm[33755]: got timeout
Jun 02 00:38:19 pve-10 systemd[1]: pveproxy.service: Start-pre operation timed out. Terminating.
Jun 02 00:39:23 pve-10 pvecm[33755]: interrupted by unexpected signal
Jun 02 00:39:23 pve-10 systemd[1]: pveproxy.service: Failed with result 'timeout'.
Jun 02 00:39:23 pve-10 systemd[1]: Failed to start PVE API Proxy Server.
Jun 02 00:39:23 pve-10 systemd[1]: pveproxy.service: Service RestartSec=100ms expired, scheduling restart.

and every time it was restarted, a new pvecm updatecerts command was launched too and kept running:


Code:
root      6847  0.0  0.0 282420 95144 ?        Ds   May31   0:00 /usr/bin/perl -T /usr/bin/pvesr run --mail 1
root     25740  0.0  0.0 280748 93384 ?        Ds   May31   0:00 /usr/bin/perl -T /usr/bin/pveproxy restart
root     26280  0.0  0.0  86168  2236 ?        Ssl  00:01   0:08 /usr/sbin/pvefw-logger
root     23331  0.0  0.0 290212 97504 ?        Ds   04:26   0:00 /usr/bin/perl /usr/bin/pveupdate
root     46649  0.0  0.0 276808 89376 ?        D    22:42   0:00 /usr/bin/perl /usr/bin/pvestatd status
root      1387  0.0  0.0 301272 96616 ?        D    22:53   0:00 /usr/bin/perl -T /usr/sbin/pct list
root      3509  0.0  0.0 280728 93384 ?        Ds   22:58   0:00 /usr/bin/perl -T /usr/bin/pveproxy stop
root      5798  0.0  0.0  68860 44772 ?        D    23:06   0:00 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
root      7662  0.0  0.0  68876 44612 ?        D    23:12   0:00 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
root      9542  0.0  0.0  68856 44520 ?        D    23:18   0:00 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
root     11366  0.5  0.0  68724 51660 ?        Ss   23:24   0:00 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
root     11367  0.0  0.0  68856 44748 ?        D    23:24   0:00  \_ /usr/bin/perl /usr/bin/pvecm updatecerts --silent


This was only happening on the server where I tried to restart the service.
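
If I read the pveproxy unit file correctly, pvecm updatecerts --silent is its ExecStartPre step, which would explain why every restart attempt spawned another one that then hung as well; this can be checked with:
Code:
systemctl cat pveproxy.service | grep -i execstartpre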

5.- After executing the stop sequence above (next time I will launch the commands individually to see which one is responsible; see the sketch at the end), I was expecting it to take some time to stop all the services, as I had done a test round in my test cluster and it took about 30-45 seconds, but the command sequence finished in under 10 seconds. I checked the status of pve-cluster and it was running (?!), so I checked the running processes on the system:

Code:
root     22285  0.0  0.0  86168  2360 ?        Ssl  00:00   0:00 /usr/sbin/pvefw-logger
root     34543  0.2  0.0 597516 51384 ?        Ssl  00:39   0:03 /usr/bin/pmxcfs
root     34556  1.5  0.0 574768 179280 ?       SLsl 00:39   0:26 /usr/sbin/corosync -f
www-data 34584  0.0  0.0 355184 124976 ?       Ss   00:39   0:00 pveproxy
www-data 34585  0.1  0.0 369620 141420 ?       S    00:39   0:03  \_ pveproxy worker
www-data 34586  0.5  0.0 370332 143060 ?       S    00:39   0:09  \_ pveproxy worker
www-data 34587  0.1  0.0 372448 143672 ?       S    00:39   0:02  \_ pveproxy worker

There was no trace at all of the pvecm updatecerts, pveupdate or pvestatd status processes... and when I checked the web interface, everything was working as expected.
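
For the record, next time I would run the stops one by one with a check in between, roughly like this (untested sketch, same order as the one-liner above):
Code:
for s in pve-cluster corosync pvedaemon pveproxy pvestatd; do
    systemctl stop "$s"
    echo "--- after stopping $s ---"
    # any processes still stuck in uninterruptible sleep?
    ps -eo pid,stat,comm | awk 'NR==1 || $2 ~ /^D/'
done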

Anyway, losing control of a cluster this way (no way to start or restart VMs across all the servers in the cluster) is scary.