PveProxy crashes, unable to start it

plofkat

pveproxy crashed on all my servers during the night.
All my VMs are up (along with the disk access issue), but the web interface is down.

I can SSH directly to the hosts, but if I attempt to run "pveproxy start", everything stops responding.

It seems the issue occurs if I leave a noVNC console open for a long time.
 
please include relevant information in your posts if you want help, e.g. output of "pveversion -v", error messages, content of log and configuration files, ...
 
pveversion -v
proxmox-ve: 4.4-86 (running kernel: 4.4.49-1-pve)
pve-manager: 4.4-13 (running version: 4.4-13/7ea56165)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.49-1-pve: 4.4.49-86
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-49
qemu-server: 4.0-110
pve-firmware: 1.1-11
libpve-common-perl: 4.0-94
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-97
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80

systemctl status pveproxy
● pveproxy.service - PVE API Proxy Server
Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled)
Active: failed (Result: timeout) since Tue 2017-06-13 09:04:07 SAST; 8min ago
Main PID: 31491 (code=exited, status=0/SUCCESS)

Jun 13 09:01:06 swk-prox00.namaquawines.local systemd[1]: pveproxy.service start operation timed out. Terminating.
Jun 13 09:02:36 swk-prox00.namaquawines.local systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
Jun 13 09:04:07 swk-prox00.namaquawines.local systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
Jun 13 09:04:07 swk-prox00.namaquawines.local systemd[1]: Failed to start PVE API Proxy Server.
Jun 13 09:04:07 swk-prox00.namaquawines.local systemd[1]: Unit pveproxy.service entered failed state.
root@swk-prox00:~# ps aux |grep pveproxy
root 4360 0.0 0.2 239652 66408 ? Ds 08:00 0:00 /usr/bin/perl -T /usr/bin/pveproxy start
root 4757 0.0 0.2 239548 66072 ? Ds 08:06 0:00 /usr/bin/perl -T /usr/bin/pveproxy start
root 10143 0.1 0.2 239600 66260 ? Ds 08:59 0:00 /usr/bin/perl -T /usr/bin/pveproxy start
root 11532 0.0 0.0 12732 2200 pts/0 S+ 09:13 0:00 grep pveproxy
root 28078 0.0 0.2 239588 66144 ? Ds 06:25 0:00 /usr/bin/perl -T /usr/bin/pveproxy stop
root 28592 0.0 0.2 239648 66032 ? Ds 06:32 0:00 /usr/bin/perl -T /usr/bin/pveproxy start
root 29950 0.0 0.2 239604 66232 ? Ds 06:44 0:00 /usr/bin/perl -T /usr/bin/pveproxy start
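
A note on the listing above: every pveproxy process shows `Ds` in the STAT column. `D` is uninterruptible sleep (usually waiting on I/O), and a process in that state does not act on SIGKILL until the blocking call returns, which would fit the "still around after final SIGKILL" message. A minimal Python sketch for picking such processes out of saved `ps aux` output, assuming the standard 11-column layout; the function name is made up for illustration:

```python
def uninterruptible(ps_lines):
    """Return (pid, command) for processes whose STAT field starts with 'D'."""
    stuck = []
    for line in ps_lines:
        # ps aux columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # skip the header line and anything malformed
        if fields[7].startswith("D"):
            stuck.append((int(fields[1]), fields[10]))
    return stuck


# Three lines copied from the listing above.
sample = [
    "root 4360 0.0 0.2 239652 66408 ? Ds 08:00 0:00 /usr/bin/perl -T /usr/bin/pveproxy start",
    "root 11532 0.0 0.0 12732 2200 pts/0 S+ 09:13 0:00 grep pveproxy",
    "root 28078 0.0 0.2 239588 66144 ? Ds 06:25 0:00 /usr/bin/perl -T /usr/bin/pveproxy stop",
]

for pid, command in uninterruptible(sample):
    print(pid, command)
```

Starting more pveproxy instances while the old ones are stuck in `D` state most likely just adds to the pile, as the listing suggests.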
 
systemctl start pveproxy
Job for pveproxy.service failed. See 'systemctl status pveproxy.service' and 'journalctl -xn' for details.

journalctl -xn
-- Logs begin at Sun 2017-04-09 13:58:20 SAST, end at Tue 2017-06-13 09:20:22 SAST. --
Jun 13 09:15:52 swk-prox00.namaquawines.local systemd[1]: Starting PVE API Proxy Server...
-- Subject: Unit pveproxy.service has begun with start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit pveproxy.service has begun starting up.
Jun 13 09:17:01 swk-prox00.namaquawines.local CRON[11826]: pam_unix(cron:session): session opened for user root by (uid=0)
Jun 13 09:17:01 swk-prox00.namaquawines.local CRON[11827]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jun 13 09:17:01 swk-prox00.namaquawines.local CRON[11826]: pam_unix(cron:session): session closed for user root
Jun 13 09:17:22 swk-prox00.namaquawines.local systemd[1]: pveproxy.service start operation timed out. Terminating.
Jun 13 09:17:30 swk-prox00.namaquawines.local pvestatd[24187]: status update time (5.772 seconds)
Jun 13 09:18:52 swk-prox00.namaquawines.local systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
Jun 13 09:20:22 swk-prox00.namaquawines.local systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
Jun 13 09:20:22 swk-prox00.namaquawines.local systemd[1]: Failed to start PVE API Proxy Server.
-- Subject: Unit pveproxy.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit pveproxy.service has failed.
--
-- The result is failed.
Jun 13 09:20:22 swk-prox00.namaquawines.local systemd[1]: Unit pveproxy.service entered failed state.
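
The timestamps in this journal excerpt are evenly spaced, which suggests systemd is hitting the same timeout at each stage (start, SIGTERM, SIGKILL) rather than pveproxy failing with an error of its own. A small Python sketch of the arithmetic, using only the four timestamps shown above; the helper name is made up for illustration:

```python
from datetime import datetime

# (time, event) pairs copied from the journalctl output above.
events = [
    ("09:15:52", "Starting PVE API Proxy Server"),
    ("09:17:22", "start operation timed out"),
    ("09:18:52", "stop-final-sigterm timed out"),
    ("09:20:22", "still around after final SIGKILL"),
]


def gaps_seconds(events):
    """Return the number of seconds between each pair of consecutive events."""
    times = [datetime.strptime(stamp, "%H:%M:%S") for stamp, _ in events]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


print(gaps_seconds(events))  # [90, 90, 90]
```

Three gaps of exactly 90 seconds are consistent with a 90-second systemd timeout applied at each step, so the process never responded at any point.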
 
please provide the full log of the pve services (e.g., "journalctl -b -u 'pve*'" or similar). is this a clustered setup? the symptoms look like a hanging cluster file system, which in turn causes pveproxy to block when it accesses /etc/pve.
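
One way to test the "hanging /etc/pve" theory without getting the shell itself stuck is to probe the path from a background thread and give up after a timeout. This is only a sketch in Python: the function name and the 5-second default are assumptions, and on a file system that is really hung the probe thread stays blocked in the background.

```python
import os
import threading


def path_responds(path, timeout=5.0):
    """Return True if listing `path` finishes within `timeout` seconds."""
    done = threading.Event()

    def probe():
        try:
            os.listdir(path)
        except OSError:
            pass  # an error is still an answer; a hang is what we look for
        done.set()

    # Daemon thread: a blocked probe is left behind instead of being joined.
    threading.Thread(target=probe, daemon=True).start()
    return done.wait(timeout)


if __name__ == "__main__":
    print("/etc/pve responds:", path_responds("/etc/pve"))
```

A result of `False` would mean the listing did not return in time, which would match pveproxy blocking on the same path.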
 
Hi Fabian,

This is indeed a clustered setup.
 

Attachments

  • journal.zip (390.6 KB)
I'd check whether multicast is working reliably in your network (https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network) and restart the pve-cluster service. your log shows messages like "Jun 07 06:53:46 vwk-prox01.namaquawines.local pmxcfs[1979]: [status] notice: remove message from non-member 1/1986" as well as retry messages. the last message pmxcfs logged is that it received a sync request, but nothing about handling it.
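
To get a feel for how often those "non-member" messages show up, the journal can be saved to a text file and counted. A minimal Python sketch, assuming the message format is exactly as in the line quoted above; the function name is made up for illustration and the retry messages are not covered because their wording is not quoted here:

```python
import re
from collections import Counter

# Pattern built from the single log line quoted in this thread.
NON_MEMBER = re.compile(
    r"pmxcfs\[\d+\]: \[status\] notice: remove message from non-member (\S+)"
)


def non_member_counts(journal_lines):
    """Count 'remove message from non-member' notices per sender id."""
    counts = Counter()
    for line in journal_lines:
        match = NON_MEMBER.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "Jun 07 06:53:46 vwk-prox01.namaquawines.local pmxcfs[1979]: "
    "[status] notice: remove message from non-member 1/1986",
]
print(dict(non_member_counts(sample)))
```

A count that climbs during periods of heavy network load would support the idea that cluster traffic is being delayed or dropped.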
 
Fabian, you are a gentleman and a scholar.
Please pass on my recommendation to HR that you should receive a raise and promotion immediately.

I think you have just shown me the root cause of all the issues I have been having with Proxmox.
Corosync is currently on the same network as my storage cluster, which would explain why everything goes so horribly wrong as soon as server load and network traffic increase.

I will split it off to a separate network as recommended by the wiki (which I should have read much more attentively before upgrading).

I assume it would be best to shut down all VMs on the cluster before I make any changes?
 
I do have one big problem: one of my servers is located off site (about 6 km away, across a river and on the other side of town). There is a site-to-site link (600 Mbps wireless), but there is no way to connect that server to the new physical network I have to create for corosync.

It seems that from version 2 to version 4, Proxmox has gone from the best possible solution for our needs to the worst possible.
Any suggestions would be welcome, for it seems I am now up the proverbial creek with no paddle.
 
the cluster network does not care (much) about bandwidth, but it is rather latency sensitive. the upstream recommendation is <= 2 milliseconds. maybe it is possible to split that one node out of the cluster and operate it as a standalone node? you'd lose migration capabilities, but you could set up some kind of network storage between the single node and the cluster to have off-site backup and restore capabilities.
 
Losing migration would be extremely inconvenient, however it seems to be the only practical solution.
Under regular load, the latency between us and the remote site stays between 2 and 4 ms, but under heavy load it increases to as much as 20 ms.

This is most likely the leading cause of all the issues I am having.
Local latency rarely exceeds 1ms under heavy load.
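
Putting the numbers from this thread next to the 2 ms recommendation, as a rough Python sketch; the function name is made up for illustration, and the lists hold only the round figures mentioned above, not real measurements:

```python
LIMIT_MS = 2.0  # upstream recommendation quoted earlier in this thread


def over_limit(samples_ms, limit_ms=LIMIT_MS):
    """Return the latency samples that exceed the recommended limit."""
    return [sample for sample in samples_ms if sample > limit_ms]


remote_site_ms = [2, 4, 20]  # regular load low/high, then heavy load
local_ms = [1]               # local latency under heavy load

print("remote over limit:", over_limit(remote_site_ms))  # [4, 20]
print("local over limit:", over_limit(local_ms))         # []
```

The remote link is at or above the limit even under regular load, while the local network stays below it.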
 
These are the troubleshooting steps I used on my PVE server:

  1. Update the system first: run apt update && apt dist-upgrade -y && apt upgrade -y
  2. Run pveproxy status
  3. Run pveproxy start
  4. Run pveproxy status again
  5. Run pveversion -v
  6. Run journalctl -xn
 
