pveproxy not starting and qm list won't work

danielb1

New Member
Sep 11, 2020
Hi,

I'm having problems with our PVE cluster. We're currently running Proxmox 5.4 (we'll upgrade to 6 once the cluster is healthy again). pveproxy fails to start on all of the nodes, commands like "qm list" hang on every node, and browsing some directories inside /etc/pve freezes my SSH session completely.
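In case it helps, these are roughly the commands I'm using to check the related services on each node (picking pve-cluster and corosync is my own guess at what's relevant; I haven't pasted the output because the sessions tend to hang):

systemctl status pveproxy pvedaemon pve-cluster corosync
journalctl -u pveproxy -u pve-cluster --since today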

From syslog:

Sep 11 06:42:45 pve1 systemd[1]: Stopped PVE API Proxy Server.
Sep 11 06:42:45 pve1 systemd[1]: pveproxy.service: Unit entered failed state.
Sep 11 06:42:45 pve1 systemd[1]: pveproxy.service: Failed with result 'timeout'.
Sep 11 06:42:45 pve1 systemd[1]: Starting PVE API Proxy Server...
Sep 11 06:44:15 pve1 systemd[1]: pveproxy.service: Start operation timed out. Terminating.
Sep 11 06:45:45 pve1 systemd[1]: pveproxy.service: State 'stop-final-sigterm' timed out. Killing.
Sep 11 06:45:45 pve1 systemd[1]: pveproxy.service: Killing process 18378 (pveproxy) with signal SIGKILL.
Sep 11 06:45:45 pve1 systemd[1]: pveproxy.service: Killing process 9774 (pveproxy worker) with signal SIGKILL.
Sep 11 06:45:45 pve1 systemd[1]: pveproxy.service: Killing process 22110 (pveproxy worker) with signal SIGKILL.
Sep 11 06:45:45 pve1 systemd[1]: pveproxy.service: Killing process 23064 (pveproxy worker) with signal SIGKILL.
Sep 11 06:45:45 pve1 systemd[1]: pveproxy.service: Killing process 887 (pveproxy) with signal SIGKILL.
Sep 11 06:45:45 pve1 systemd[1]: pveproxy.service: Killing process 7596 (pveproxy) with signal SIGKILL.
Sep 11 06:47:16 pve1 systemd[1]: pveproxy.service: Processes still around after final SIGKILL. Entering failed mode.
Sep 11 06:47:16 pve1 systemd[1]: Failed to start PVE API Proxy Server.
Sep 11 06:47:16 pve1 systemd[1]: pveproxy.service: Unit entered failed state.
Sep 11 06:47:16 pve1 systemd[1]: pveproxy.service: Failed with result 'timeout'
 
I noticed this in the syslog from yesterday:

pve1 pvestatd[4554]: storage 'NFS1' is not online

However, I can see that the storage is mounted and accessible from the PVE node.
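To double-check the NFS side from the node itself, I've been testing the export directly with something like the following (the NAS address is a placeholder for our FreeNAS box, and the mount path assumes the default Proxmox location for a storage called NFS1):

showmount -e 192.168.1.10      # placeholder FreeNAS IP - lists the exports the NAS is offering
rpcinfo -p 192.168.1.10        # placeholder FreeNAS IP - checks that the NFS/mountd RPC services answer
ls /mnt/pve/NFS1               # default mount point Proxmox uses for an NFS storage with this ID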
 
Hi,

I restarted and tweaked some NFS settings in FreeNAS, and I don't see any NFS-related errors at the moment.

However, on our first node, I can see a lot of these:

Sep 11 09:47:51 pve1 pmxcfs[3870]: [status] notice: cpg_send_message retried 100 times
Sep 11 09:47:51 pve1 pmxcfs[3870]: [status] crit: cpg_send_message failed: 6
Sep 11 09:47:52 pve1 pmxcfs[3870]: [status] notice: cpg_send_message retry 10
Sep 11 09:47:53 pve1 pmxcfs[3870]: [status] notice: cpg_send_message retry 20
Sep 11 09:47:54 pve1 pmxcfs[3870]: [status] notice: cpg_send_message retry 30

What is causing these messages, and how can I get rid of them?
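From what I understand, pmxcfs sends its status updates over corosync (CPG), so my plan is to check quorum and ring status on the first node with something like:

pvecm status
corosync-cfgtool -s            # shows the ring/link status for this node
journalctl -u corosync --since "2 hours ago"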
 
I also noticed that "pvesm status" is very slow on our first/main node. It's a lot faster on the other nodes.
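To narrow down which storage is slowing it down, I was planning to time the call per storage, roughly like this (I'm assuming the --storage option is available on 5.4):

time pvesm status
time pvesm status --storage NFS1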