PVE Cluster - One node `pveproxy` crashing seemingly randomly

ndom91

Hi All,

We have a 5-node PVE cluster, and lately the first node becomes unreachable in the Proxmox web UI about once a day. See screenshot:

1617196896935.png


About once a day, the node becomes "unreachable", as seen in the screenshot, and cannot be reached via the web UI from any other node. Its own web UI seems to load, but does not let me log in. When I attempt to log in, it spins for a bit and then comes back with the error "Connection failure. Network error or Proxmox VE services not running?".

A reboot fixes this, but then it always happens again the next day at some seemingly random point in time.

I can still ssh to it, the box is up and running, and the VMs are also running.

It just seems like whatever pve service is responsible for notifying the others is dead.

I've checked the systemd services for:
  • pveproxy
  • pve-manager
  • pvedaemon
And they're all 'active' and running, not showing any error messages in their systemd journals.

Anything else I can do to troubleshoot this?

Thanks in advance!

EDIT: So I've narrowed it down to `pveproxy`. It doesn't seem to be up, although it's running many copies of the "binary" (a Perl script).

See the `systemctl status pveproxy` output in the screenshot below. This is after having tried to SIGKILL those processes and restart the systemd service a few times. None of the processes can be killed, even with `kill -9`. Not sure what's going on here. Note that the PID file /var/run/pveproxy/pveproxy.pid is NOT there at this point in time.

1617308223953.png
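
Processes that ignore `kill -9` are often in uninterruptible sleep (state `D`), where no signal is acted on until the blocking kernel call returns. Whether that is what is happening here is an assumption, not something the screenshots confirm. A minimal sketch for checking it, assuming Python 3.9+ on the node; the function names are made up for illustration:

```python
from pathlib import Path


def parse_proc_stat(stat_line: str) -> tuple[int, str, str]:
    """Return (pid, command name, state letter) from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split around the LAST ')' and the FIRST '('.
    head, _, tail = stat_line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_text), comm, state


def uninterruptible_processes(proc_root: str = "/proc") -> list[tuple[int, str]]:
    """List (pid, comm) for every process currently in state 'D'."""
    root = Path(proc_root)
    if not root.is_dir():
        return []
    stuck = []
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue  # skip /proc/self, /proc/sys, ...
        try:
            pid, comm, state = parse_proc_stat((entry / "stat").read_text())
        except (OSError, ValueError, IndexError):
            continue  # process exited while we were reading it
        if state == "D":
            stuck.append((pid, comm))
    return sorted(stuck)


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

If the `pvecm` PIDs show up in that list, they are blocked inside the kernel, and the interesting question becomes what they are waiting on rather than how to kill them.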
 
How can I tell?

The corosync systemd service is alive and well, and the network connection between the nodes is fine - they're mostly in one rack and all connected through one top-of-the-rack access switch.

EDIT - Ah, when checking the status, the service is "alive", but there seem to have been some multicast errors:

Code:
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2021-03-30 19:14:27 CEST; 22h ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
Main PID: 9579 (corosync)
    Tasks: 9 (limit: 4915)
   Memory: 401.8M
   CGroup: /system.slice/corosync.service
           └─9579 /usr/sbin/corosync -f

Mar 30 20:28:15 nt-pve corosync[9579]:   [KNET  ] link: host: 4 link: 0 is down
Mar 30 20:28:15 nt-pve corosync[9579]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Mar 30 20:28:15 nt-pve corosync[9579]:   [KNET  ] host: host: 4 has no active links
Mar 30 20:28:18 nt-pve corosync[9579]:   [KNET  ] rx: host: 4 link: 0 is up
Mar 30 20:28:18 nt-pve corosync[9579]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Mar 30 20:28:47 nt-pve corosync[9579]:   [TOTEM ] Token has not been received in 3712 ms
Mar 30 20:29:38 nt-pve corosync[9579]:   [CPG   ] *** 0x5648bf2ba9e0 can't mcast to group pve_dcdb_v1 state:1, error:12
Mar 30 20:29:38 nt-pve corosync[9579]:   [CPG   ] *** 0x5648bf2ba9e0 can't mcast to group pve_dcdb_v1 state:1, error:12
Mar 30 20:29:38 nt-pve corosync[9579]:   [CPG   ] *** 0x5648bf2ba9e0 can't mcast to group pve_dcdb_v1 state:1, error:12
Mar 30 22:40:48 nt-pve corosync[9579]:   [CPG   ] *** 0x5648bf2ba9e0 can't mcast to group pve_dcdb_v1 state:1, error:12

^^ The colors didn't survive the copy and paste, but those last four "CPG"-prefixed lines are red in the console.
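
To see how often these errors occur over a longer stretch of the journal, the lines can be tallied by subsystem tag. A minimal sketch, assuming Python 3 and the log line format shown above; `summarize` is a made-up helper, not part of any Proxmox or corosync tooling:

```python
import re
import sys
from collections import Counter

# Matches the "[SUBSYS ]" tag that follows the "corosync[PID]:" prefix.
TAG = re.compile(r"corosync\[\d+\]:\s+\[(\w+)\s*\]\s+(.*)")


def summarize(lines):
    """Return (message count per subsystem, lines containing "can't mcast")."""
    per_subsystem = Counter()
    mcast_failures = []
    for line in lines:
        match = TAG.search(line)
        if match is None:
            continue  # not a corosync log line
        per_subsystem[match.group(1)] += 1
        if "can't mcast" in match.group(2):
            mcast_failures.append(line.rstrip("\n"))
    return per_subsystem, mcast_failures


if __name__ == "__main__":
    counts, failures = summarize(sys.stdin)
    for subsystem, count in counts.most_common():
        print(f"{subsystem}: {count}")
    print(f"can't mcast lines: {len(failures)}")
```

Saved under a name of your choice (say `corosync_summary.py`), it could be fed with `journalctl -u corosync --no-pager | python3 corosync_summary.py`.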
 
Okay so pveproxy seems frozen/down somehow with a bunch of instances of the following command:

/usr/bin/perl /usr/bin/pvecm updatecerts --silent

Maybe it's getting stuck trying to renew an un-renewable cert? I can't kill any of these, unfortunately. Even a `kill -9 PID` doesn't take them down.

Current systemctl status pveproxy output:

Code:
root@nt-pve:~# systemctl status pveproxy
● pveproxy.service - PVE API Proxy Server
   Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; vendor preset: enabled)
   Active: deactivating (final-sigkill) (Result: timeout) since Thu 2021-04-01 15:08:40 CEST; 2h 43min
  Process: 11333 ExecStartPre=/usr/bin/pvecm updatecerts --silent (code=killed, signal=KILL)
    Tasks: 28 (limit: 4915)
   Memory: 1.2G
   CGroup: /system.slice/pveproxy.service
           ├─ 1510 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─ 2011 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─ 7819 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─10618 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─11334 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─14083 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─14484 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─14999 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─16195 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─17389 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─17550 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─18852 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─20006 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─20558 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─22631 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─24967 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─26066 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─27179 /usr/bin/perl -T /usr/bin/pveproxy restart
           ├─27833 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─28186 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─28361 /usr/bin/perl -T /usr/bin/pveproxy stop
           ├─29346 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─29747 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─30305 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─30941 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─31170 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           ├─32180 /usr/bin/perl /usr/bin/pvecm updatecerts --silent
           └─32368 /usr/bin/perl /usr/bin/pvecm updatecerts --silent

Apr 01 17:51:13 nt-pve systemd[1]: pveproxy.service: Killing process 26066 (pvecm) with signal SIGKILL
Apr 01 17:51:13 nt-pve systemd[1]: pveproxy.service: Killing process 22631 (pvecm) with signal SIGKILL
Apr 01 17:51:13 nt-pve systemd[1]: pveproxy.service: Killing process 18852 (pvecm) with signal SIGKILL
Apr 01 17:51:13 nt-pve systemd[1]: pveproxy.service: Killing process 14083 (pvecm) with signal SIGKILL
Apr 01 17:51:13 nt-pve systemd[1]: pveproxy.service: Killing process 14484 (pvecm) with signal SIGKILL
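
Since `pvecm updatecerts` works on files under /etc/pve, which is the pmxcfs cluster filesystem, one thing that might be worth ruling out is whether that mount still responds at all. This is a guess at a diagnostic, not a confirmed cause. A minimal sketch, assuming Python 3 on the node; `probe_path` and the 5-second timeout are made up for illustration:

```python
import subprocess
import sys


def probe_path(path: str, timeout_s: float = 5.0) -> str:
    """Return "ok", "error" or "timeout" for a stat() of path."""
    # Run the stat in a child process, so that a hung mount blocks the child
    # and not this script.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        returncode = child.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Deliberately no kill() followed by wait() here: if the child is
        # stuck in uninterruptible sleep, wait() would block as well.
        return "timeout"
    return "ok" if returncode == 0 else "error"


if __name__ == "__main__":
    print("/etc/pve:", probe_path("/etc/pve"))
```

A "timeout" result would point at the cluster filesystem rather than at the certificates themselves. Note that `subprocess.run(..., timeout=...)` is avoided on purpose, because it kills and then waits for the child after a timeout.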
 
I am encountering pretty much the same problem. Could you share how you resolved this in the end?
 
