I have a 9-node production cluster based on PVE 6 (pve-manager/6.4-13/9f411e79 (running kernel: 5.4.143-1-pve)). The oldest server in the cluster has been up and running for 563 days, while the newest has been up for 201 days so far. Two days ago all servers in the cluster became grayed out, but I was still able to access node information and network config. Today I cannot even log in to the web interface on any of the servers, as I get
"Connection failure. Network error or Proxmox VE services not running"
All the VMs and containers work OK.
pvecm status reports OK (9 out of 9 votes, quorate) on all the servers
pvesm status shows all local storages without any issues
corosync-cfgtool -s shows all nodes as connected (or localhost)
The following services are active and running:
systemctl status pvestatd
systemctl status pve-cluster
systemctl status corosync
systemctl status pveproxy
systemctl status pvedaemon
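The five checks above can be collapsed into one loop. This is only a minimal sketch, assuming a POSIX-style shell and the unit names listed above; `check_pve_services` is a hypothetical helper name. Keep in mind that `systemctl is-active` reports the unit state as systemd sees it, so a service can show as active while its worker processes are hung.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the systemd state of each PVE service listed above.
check_pve_services() {
    local svc state rc=0
    for svc in pvestatd pve-cluster corosync pveproxy pvedaemon; do
        # is-active prints the unit state (active, inactive, failed, ...)
        # and exits non-zero when the unit is not active.
        state=$(systemctl is-active "$svc" 2>/dev/null) || rc=1
        printf '%-12s %s\n' "$svc" "${state:-unknown}"
    done
    # Return 1 if at least one unit was not active, 0 otherwise.
    return "$rc"
}
```

Running `check_pve_services` on each node prints one line per service and returns non-zero if any of them is not active.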
However, pveproxy shows the following in the journal on all servers:
Jun 01 00:00:00 pve-10 systemd[1]: Reloading PVE API Proxy Server.
Jun 01 00:01:30 pve-10 systemd[1]: pveproxy.service: Reload operation timed out. Killing reload process.
Jun 01 00:01:30 pve-10 systemd[1]: Reload failed for PVE API Proxy Server.
I have tried to restart the pveproxy service on one server, with the following result:
Code:
Jun 01 23:00:29 pve-10 systemd[1]: pveproxy.service: Stopping timed out. Terminating.
Jun 01 23:00:29 pve-10 pveproxy[1627]: received signal TERM
Jun 01 23:00:29 pve-10 pveproxy[1627]: server closing
Jun 01 23:00:29 pve-10 pveproxy[47818]: worker exit
Jun 01 23:00:29 pve-10 pveproxy[47816]: worker exit
Jun 01 23:00:29 pve-10 pveproxy[1627]: worker 47817 finished
Jun 01 23:00:29 pve-10 pveproxy[1627]: worker 47816 finished
Jun 01 23:00:29 pve-10 pveproxy[1627]: worker 47818 finished
Jun 01 23:00:29 pve-10 pveproxy[1627]: server stopped
Jun 01 23:01:59 pve-10 systemd[1]: pveproxy.service: State 'stop-sigterm' timed out. Killing.
Jun 01 23:01:59 pve-10 systemd[1]: pveproxy.service: Killing process 3509 (pveproxy) with signal SIGKILL.
Jun 01 23:01:59 pve-10 systemd[1]: pveproxy.service: Killing process 25740 (pveproxy) with signal SIGKILL.
Jun 01 23:03:30 pve-10 systemd[1]: pveproxy.service: Processes still around after SIGKILL. Ignoring.
Jun 01 23:05:00 pve-10 systemd[1]: pveproxy.service: State 'stop-final-sigterm' timed out. Killing.
Jun 01 23:05:00 pve-10 systemd[1]: pveproxy.service: Killing process 25740 (pveproxy) with signal SIGKILL.
Jun 01 23:05:00 pve-10 systemd[1]: pveproxy.service: Killing process 3509 (pveproxy) with signal SIGKILL.
Jun 01 23:06:30 pve-10 systemd[1]: pveproxy.service: Processes still around after final SIGKILL. Entering failed mode.
Jun 01 23:06:30 pve-10 systemd[1]: pveproxy.service: Failed with result 'timeout'.
Jun 01 23:06:30 pve-10 systemd[1]: Stopped PVE API Proxy Server.
Jun 01 23:06:30 pve-10 systemd[1]: pveproxy.service: Found left-over process 25740 (pveproxy) in control group while starting unit. Ignoring.
Jun 01 23:06:30 pve-10 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 01 23:06:30 pve-10 systemd[1]: pveproxy.service: Found left-over process 3509 (pveproxy) in control group while starting unit. Ignoring.
Jun 01 23:06:30 pve-10 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 01 23:06:30 pve-10 systemd[1]: Starting PVE API Proxy Server...
Jun 01 23:07:00 pve-10 pvecm[5797]: got timeout
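Processes that survive SIGKILL, as PIDs 3509 and 25740 do in the log above, are typically blocked inside the kernel in uninterruptible sleep (state `D`), often waiting on I/O. A minimal sketch for listing such processes follows. It assumes the procps `ps` and awk are available, and `filter_dstate` and `list_dstate` are hypothetical helper names. The `wchan` column hints at the kernel function a process is waiting in, which may help narrow down what it is blocked on.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep the header line plus every row whose
# STAT column (second field) starts with D (uninterruptible sleep).
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Hypothetical helper: list all processes in uninterruptible sleep,
# together with their kernel wait channel and command line.
list_dstate() {
    ps -eo pid,stat,wchan:32,args | filter_dstate
}
```

Calling `list_dstate` on the affected node shows whether the left-over pveproxy processes appear in that list.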
Any ideas? This is a production cluster, so restarting all nodes is not really an option for us.
This is the output of pveversion -v on the server where I am trying the restart:
Code:
root@pve-10:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.203-1-pve)
pve-manager: 6.4-15 (running version: 6.4-15/af7986e6)
pve-kernel-5.4: 6.4-20
pve-kernel-helper: 6.4-20
pve-kernel-5.4.203-1-pve: 5.4.203-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.5-pve2~bpo10+1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve2~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-5
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.14-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-2
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.7-pve1
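The version banner quoted at the top of this post (pve-manager 6.4-13, kernel 5.4.143-1-pve) differs from the output above (pve-manager 6.4-15, kernel 5.4.203-1-pve), so the nodes may not all be at the same package level. A minimal sketch for comparing two nodes follows, assuming the output of `pveversion -v` has been saved to one file per node. `pkg_versions`, `diff_versions` and the file names in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reduce `pveversion -v` output on stdin to sorted
# "package version" pairs, dropping any trailing details in parentheses.
pkg_versions() {
    sed -nE 's/^([^: ]+): ([^ ]+).*/\1 \2/p' | sort
}

# Hypothetical helper: show packages whose versions differ between two
# saved outputs. Returns diff's exit status (0 = identical, 1 = differences).
diff_versions() {
    local a b rc=0
    a=$(mktemp) && b=$(mktemp) || return 2
    pkg_versions < "$1" > "$a"
    pkg_versions < "$2" > "$b"
    diff "$a" "$b" || rc=$?
    rm -f "$a" "$b"
    return "$rc"
}
```

For example, after saving the output on two nodes as `pve-10.txt` and `pve-01.txt` (made-up names), `diff_versions pve-10.txt pve-01.txt` prints only the packages that differ.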