[SOLVED] WebUI unresponsive in cluster after all vms became grayed out. (pveproxy failure)

Feb 27, 2020
I have a 9-node production cluster based on PVE 6 (pve-manager/6.4-13/9f411e79 (running kernel: 5.4.143-1-pve)). The oldest server in the cluster has been up and running for 563 days, while the newest has been up for 201 days so far. Two days ago all servers in the cluster became grayed out, but I was still able to access node information and network config. Today I cannot even log in to the web interface on any of the servers, as I get

"Connection failure. Network error or Proxmox VE services not running"

All the VMs and containers are working fine.

pvecm status reports OK (9 out of 9 votes, quorate) on all the servers
pvesm status shows all local storages without any issues
corosync-cfgtool -s shows all nodes as connected (or localhost)

The following services are active and running:
systemctl status pvestatd
systemctl status pve-cluster
systemctl status corosync
systemctl status pveproxy
systemctl status pvedaemon
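
The five status checks above can be collapsed into one loop. This is only a sketch: `check_pve_services` and the `SYSTEMCTL` override are hypothetical names introduced here for illustration, and `systemctl is-active` only reports the unit state. In this incident the units were reported as active even though pveproxy was not working.

```shell
# Print the unit state of each service checked above.
# SYSTEMCTL is a hypothetical override hook (useful for dry runs);
# it defaults to the real systemctl binary.
check_pve_services() {
    for svc in pvestatd pve-cluster corosync pveproxy pvedaemon; do
        # is-active exits non-zero for inactive/failed units, so ignore the status
        state=$("${SYSTEMCTL:-systemctl}" is-active "$svc" 2>/dev/null) || true
        printf '%s: %s\n' "$svc" "${state:-unknown}"
    done
}
```

Calling `check_pve_services` on a node prints one `name: state` line per unit.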

However, pveproxy shows the following in the journal log on all servers:
Jun 01 00:00:00 pve-10 systemd[1]: Reloading PVE API Proxy Server.
Jun 01 00:01:30 pve-10 systemd[1]: pveproxy.service: Reload operation timed out. Killing reload process.
Jun 01 00:01:30 pve-10 systemd[1]: Reload failed for PVE API Proxy Server.

I have tried to restart the pveproxy service in one server with the following result:

Code:
Jun 01 23:00:29 pve-10 systemd[1]: pveproxy.service: Stopping timed out. Terminating.
Jun 01 23:00:29 pve-10 pveproxy[1627]: received signal TERM
Jun 01 23:00:29 pve-10 pveproxy[1627]: server closing
Jun 01 23:00:29 pve-10 pveproxy[47818]: worker exit
Jun 01 23:00:29 pve-10 pveproxy[47816]: worker exit
Jun 01 23:00:29 pve-10 pveproxy[1627]: worker 47817 finished
Jun 01 23:00:29 pve-10 pveproxy[1627]: worker 47816 finished
Jun 01 23:00:29 pve-10 pveproxy[1627]: worker 47818 finished
Jun 01 23:00:29 pve-10 pveproxy[1627]: server stopped
Jun 01 23:01:59 pve-10 systemd[1]: pveproxy.service: State 'stop-sigterm' timed out. Killing.
Jun 01 23:01:59 pve-10 systemd[1]: pveproxy.service: Killing process 3509 (pveproxy) with signal SIGKILL.
Jun 01 23:01:59 pve-10 systemd[1]: pveproxy.service: Killing process 25740 (pveproxy) with signal SIGKILL.
Jun 01 23:03:30 pve-10 systemd[1]: pveproxy.service: Processes still around after SIGKILL. Ignoring.
Jun 01 23:05:00 pve-10 systemd[1]: pveproxy.service: State 'stop-final-sigterm' timed out. Killing.
Jun 01 23:05:00 pve-10 systemd[1]: pveproxy.service: Killing process 25740 (pveproxy) with signal SIGKILL.
Jun 01 23:05:00 pve-10 systemd[1]: pveproxy.service: Killing process 3509 (pveproxy) with signal SIGKILL.
Jun 01 23:06:30 pve-10 systemd[1]: pveproxy.service: Processes still around after final SIGKILL. Entering failed mode.
Jun 01 23:06:30 pve-10 systemd[1]: pveproxy.service: Failed with result 'timeout'.
Jun 01 23:06:30 pve-10 systemd[1]: Stopped PVE API Proxy Server.
Jun 01 23:06:30 pve-10 systemd[1]: pveproxy.service: Found left-over process 25740 (pveproxy) in control group while starting unit. Ignoring.
Jun 01 23:06:30 pve-10 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 01 23:06:30 pve-10 systemd[1]: pveproxy.service: Found left-over process 3509 (pveproxy) in control group while starting unit. Ignoring.
Jun 01 23:06:30 pve-10 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 01 23:06:30 pve-10 systemd[1]: Starting PVE API Proxy Server...
Jun 01 23:07:00 pve-10 pvecm[5797]: got timeout


Any ideas? This is a production cluster, so restarting all nodes is not really an option for us.

This is the output of pveversion -v for the server on which I am trying the restart:

Code:
root@pve-10:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.203-1-pve)
pve-manager: 6.4-15 (running version: 6.4-15/af7986e6)
pve-kernel-5.4: 6.4-20
pve-kernel-helper: 6.4-20
pve-kernel-5.4.203-1-pve: 5.4.203-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.5-pve2~bpo10+1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve2~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-5
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.14-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-2
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.7-pve1
 
Long story short: stopping (yes, stopping) the services on one node fixed the problem on all the nodes. I do not know how or why, but I ran

systemctl stop pve-cluster && systemctl stop corosync && systemctl stop pvedaemon && systemctl stop pveproxy && systemctl stop pvestatd

and everything worked fine again.
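
To find out which of the chained stops is the one that matters, the same units could be stopped one at a time with the duration of each step printed. This is a rough sketch, not a tested recovery procedure: `stop_services_individually`, `SYSTEMCTL` and `STOP_TIMEOUT` are hypothetical names, and it assumes the coreutils `timeout` command is available.

```shell
# Stop the same units as the one-liner above, one by one, timing each step.
# SYSTEMCTL and STOP_TIMEOUT are hypothetical knobs, not Proxmox settings.
stop_services_individually() {
    for svc in pve-cluster corosync pvedaemon pveproxy pvestatd; do
        start=$(date +%s)
        # give up on a single unit after STOP_TIMEOUT seconds (default 120)
        if timeout "${STOP_TIMEOUT:-120}" "${SYSTEMCTL:-systemctl}" stop "$svc"; then
            result=ok
        else
            result="failed or timed out"
        fi
        end=$(date +%s)
        printf '%s: %s after %ss\n' "$svc" "$result" "$((end - start))"
    done
}
```

Unlike the `&&` chain, the loop continues after a failed stop, so every unit gets a result line.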
Some extra info in case it helps anyone:

1.- The pveproxy service had failed on all cluster nodes with the following message:

Jun 01 00:01:30 pve-3 systemd[1]: pveproxy.service: Reload operation timed out. Killing reload process.
Jun 01 00:01:30 pve-3 systemd[1]: Reload failed for PVE API Proxy Server.

2.- Nothing was listening on port 8006. A telnet to localhost on this port resulted in "connection refused" on every server.
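
The same check can be done without telnet by looking at the listening sockets. A minimal sketch, assuming `ss` from iproute2 is installed; `port_is_listening` is a hypothetical helper that reads the output of `ss -ltn` on stdin.

```shell
# Succeed if the given TCP port appears as a local listening port.
# $1: port number; stdin: output of `ss -ltn` (header line included).
port_is_listening() {
    awk -v port="$1" '
        NR > 1 {
            # column 4 is "Local Address:Port"; the port is the part after the last colon
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

Usage on a node: `ss -ltn | port_is_listening 8006 && echo "pveproxy port is open" || echo "nothing listening on 8006"`.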

3.- All nodes had the same processes running as root. In fact, the pveproxy restart from 0:00, 24 hours earlier, looked weird:
Code:
root      6847  0.0  0.0 282420 95144 ?        Ds   May31   0:00 /usr/bin/perl -T /usr/bin/pvesr run --mail 1
root     25740  0.0  0.0 280748 93384 ?        Ds   May31   0:00 /usr/bin/perl -T /usr/bin/pveproxy restart
root     26280  0.0  0.0  86168  2236 ?        Ssl  00:01   0:08 /usr/sbin/pvefw-logger
root     23331  0.0  0.0 290212 97504 ?        Ds   04:26   0:00 /usr/bin/perl /usr/bin/pveupdate
root     46649  0.0  0.0 276808 89376 ?        D    22:42   0:00 /usr/bin/perl /usr/bin/pvestatd status
these commands were still running in all the servers, all 9.
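
Note the `D` in the STAT column of the stuck commands: in `ps` output this marks uninterruptible sleep, and a process in that state does not act on signals (SIGKILL included) until the blocking call returns, which matches the "Processes still around after SIGKILL" lines in the journal. A small sketch for listing such processes; `list_dstate` is a hypothetical helper name.

```shell
# Print only processes whose state starts with "D" (uninterruptible sleep).
# stdin: output of `ps -eo pid,stat,args` (header line included).
list_dstate() {
    awk 'NR > 1 && $2 ~ /^D/ { print }'
}
```

Usage on a node: `ps -eo pid,stat,args | list_dstate`.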

4.- Restarting only the pveproxy service was clearly a mistake: it failed with a timeout, but systemd restarted it again and again and again
Code:
Jun 02 00:36:49 pve-10 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 02 00:36:49 pve-10 systemd[1]: Starting PVE API Proxy Server...
Jun 02 00:37:19 pve-10 pvecm[33755]: got timeout
Jun 02 00:38:19 pve-10 systemd[1]: pveproxy.service: Start-pre operation timed out. Terminating.
Jun 02 00:39:23 pve-10 pvecm[33755]: interrupted by unexpected signal
Jun 02 00:39:23 pve-10 systemd[1]: pveproxy.service: Failed with result 'timeout'.
Jun 02 00:39:23 pve-10 systemd[1]: Failed to start PVE API Proxy Server.
Jun 02 00:39:23 pve-10 systemd[1]: pveproxy.service: Service RestartSec=100ms expired, scheduling restart.

and every time it was restarted, a new pvecm updatecerts command was launched too and kept running.


Code:
root      6847  0.0  0.0 282420 95144 ?        Ds   May31   0:00 /usr/bin/perl -T /usr/bin/pvesr run --mail 1
root     25740  0.0  0.0 280748 93384 ?        Ds   May31   0:00 /usr/bin/perl -T /usr/bin/pveproxy restart
root     26280  0.0  0.0  86168  2236 ?        Ssl  00:01   0:08 /usr/sbin/pvefw-logger
root     23331  0.0  0.0 290212 97504 ?        Ds   04:26   0:00 /usr/bin/perl /usr/bin/pveupdate
root     46649  0.0  0.0 276808 89376 ?        D    22:42   0:00 /usr/bin/perl /usr/bin/pvestatd status
root      1387  0.0  0.0 301272 96616 ?        D    22:53   0:00 /usr/bin/perl -T /usr/sbin/pct list
root      3509  0.0  0.0 280728 93384 ?        Ds   22:58   0:00 /usr/bin/perl -T /usr/bin/pveproxy stop
root      5798  0.0  0.0  68860 44772 ?        D    23:06   0:00 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
root      7662  0.0  0.0  68876 44612 ?        D    23:12   0:00 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
root      9542  0.0  0.0  68856 44520 ?        D    23:18   0:00 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
root     11366  0.5  0.0  68724 51660 ?        Ss   23:24   0:00 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
root     11367  0.0  0.0  68856 44748 ?        D    23:24   0:00  \_ /usr/bin/perl /usr/bin/pvecm updatecerts --silent


This was only happening on the server where I tried to restart the service.

5.- After executing the command sequence above (next time I will launch the commands individually to see which one is responsible), I was expecting it to take some time before stopping all services, as I had done a test round in my test cluster and it took about 30-45 seconds, but the command sequence ended in less than 10 seconds. I checked the status of pve-cluster and it was running (¿?), so I checked the running processes in the system and

Code:
root     22285  0.0  0.0  86168  2360 ?        Ssl  00:00   0:00 /usr/sbin/pvefw-logger
root     34543  0.2  0.0 597516 51384 ?        Ssl  00:39   0:03 /usr/bin/pmxcfs
root     34556  1.5  0.0 574768 179280 ?       SLsl 00:39   0:26 /usr/sbin/corosync -f
www-data 34584  0.0  0.0 355184 124976 ?       Ss   00:39   0:00 pveproxy
www-data 34585  0.1  0.0 369620 141420 ?       S    00:39   0:03  \_ pveproxy worker
www-data 34586  0.5  0.0 370332 143060 ?       S    00:39   0:09  \_ pveproxy worker
www-data 34587  0.1  0.0 372448 143672 ?       S    00:39   0:02  \_ pveproxy worker

There was no trace at all of the pvecm updatecerts, pveupdate or pvestatd status processes, and when I checked the web interface everything was working as expected.

Anyway, losing control of a cluster this way (no way to start or restart VMs across all the servers in a cluster) is scary.
 
