PveProxy crashes, unable to start it

plofkat

Active Member

Jun 13, 2017

#1

Pveproxy crashed on all my servers during the night.
All my VMs are up (along with the disk access issue), but the web interface is down.

I can SSH directly to the hosts, but if I attempt to run "pveproxy start", everything stops responding.

The issue seems to occur if I leave a noVNC console open for a long time.

fabian

Proxmox Staff Member

Staff member

Jun 13, 2017

#2

please include relevant information in your posts if you want help, e.g. output of "pveversion -v", error messages, content of log and configuration files, ...

plofkat

Active Member

Jun 13, 2017

#3

pveversion -v
proxmox-ve: 4.4-86 (running kernel: 4.4.49-1-pve)
pve-manager: 4.4-13 (running version: 4.4-13/7ea56165)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.49-1-pve: 4.4.49-86
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-49
qemu-server: 4.0-110
pve-firmware: 1.1-11
libpve-common-perl: 4.0-94
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-97
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80

systemctl status pveproxy
● pveproxy.service - PVE API Proxy Server
Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled)
Active: failed (Result: timeout) since Tue 2017-06-13 09:04:07 SAST; 8min ago
Main PID: 31491 (code=exited, status=0/SUCCESS)

Jun 13 09:01:06 swk-prox00.namaquawines.local systemd[1]: pveproxy.service start operation timed out. Terminating.
Jun 13 09:02:36 swk-prox00.namaquawines.local systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
Jun 13 09:04:07 swk-prox00.namaquawines.local systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
Jun 13 09:04:07 swk-prox00.namaquawines.local systemd[1]: Failed to start PVE API Proxy Server.
Jun 13 09:04:07 swk-prox00.namaquawines.local systemd[1]: Unit pveproxy.service entered failed state.
root@swk-prox00:~# ps aux |grep pveproxy
root 4360 0.0 0.2 239652 66408 ? Ds 08:00 0:00 /usr/bin/perl -T /usr/bin/pveproxy start
root 4757 0.0 0.2 239548 66072 ? Ds 08:06 0:00 /usr/bin/perl -T /usr/bin/pveproxy start
root 10143 0.1 0.2 239600 66260 ? Ds 08:59 0:00 /usr/bin/perl -T /usr/bin/pveproxy start
root 11532 0.0 0.0 12732 2200 pts/0 S+ 09:13 0:00 grep pveproxy
root 28078 0.0 0.2 239588 66144 ? Ds 06:25 0:00 /usr/bin/perl -T /usr/bin/pveproxy stop
root 28592 0.0 0.2 239648 66032 ? Ds 06:32 0:00 /usr/bin/perl -T /usr/bin/pveproxy start
root 29950 0.0 0.2 239604 66232 ? Ds 06:44 0:00 /usr/bin/perl -T /usr/bin/pveproxy start
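
The "Ds" in the STAT column of the ps output above marks uninterruptible sleep (D) for a session leader (s). A process in that state does not act on signals until the blocking kernel call returns, which matches the "still around after final SIGKILL" line from systemd. A minimal sketch for listing such processes, assuming a procps-style ps; the helper name filter_d_state is made up for illustration:

```shell
#!/bin/sh
# filter_d_state (hypothetical helper): keep only lines whose second
# whitespace-separated field (the process state) starts with "D".
# Expects "PID STAT COMMAND..." lines on stdin.
filter_d_state() {
    awk '$2 ~ /^D/'
}
```

On a live host this could be fed with: ps -eo pid=,stat=,args= | filter_d_state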

plofkat

Active Member

Jun 13, 2017

#4

systemctl start pveproxy
Job for pveproxy.service failed. See 'systemctl status pveproxy.service' and 'journalctl -xn' for details.

journalctl -xn
-- Logs begin at Sun 2017-04-09 13:58:20 SAST, end at Tue 2017-06-13 09:20:22 SAST. --
Jun 13 09:15:52 swk-prox00.namaquawines.local systemd[1]: Starting PVE API Proxy Server...
-- Subject: Unit pveproxy.service has begun with start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit pveproxy.service has begun starting up.
Jun 13 09:17:01 swk-prox00.namaquawines.local CRON[11826]: pam_unix(cron:session): session opened for user root by (uid=0)
Jun 13 09:17:01 swk-prox00.namaquawines.local CRON[11827]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jun 13 09:17:01 swk-prox00.namaquawines.local CRON[11826]: pam_unix(cron:session): session closed for user root
Jun 13 09:17:22 swk-prox00.namaquawines.local systemd[1]: pveproxy.service start operation timed out. Terminating.
Jun 13 09:17:30 swk-prox00.namaquawines.local pvestatd[24187]: status update time (5.772 seconds)
Jun 13 09:18:52 swk-prox00.namaquawines.local systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
Jun 13 09:20:22 swk-prox00.namaquawines.local systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
Jun 13 09:20:22 swk-prox00.namaquawines.local systemd[1]: Failed to start PVE API Proxy Server.
-- Subject: Unit pveproxy.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit pveproxy.service has failed.
--
-- The result is failed.
Jun 13 09:20:22 swk-prox00.namaquawines.local systemd[1]: Unit pveproxy.service entered failed state.

fabian

Proxmox Staff Member

Staff member

Jun 13, 2017

#5

please provide the full log of the pve services (e.g., "journalctl -b -u 'pve*'" or similar). is this a clustered setup? the symptoms look like a hanging cluster file system (which in turn causes pveproxy to block when accessing /etc/pve).
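
One way to test the hanging-filesystem theory without getting the shell itself stuck is to probe /etc/pve from a background job and stop waiting after a few seconds. This is only a sketch under assumptions: probe_cmd is a hypothetical helper, not a Proxmox tool, and a plain "timeout 5 ls /etc/pve" is avoided here because it may itself wait forever on a child that cannot be killed.

```shell
#!/bin/sh
# probe_cmd (hypothetical helper): run a command in the background and
# report "ok", "error", or "hung" without waiting on it forever.
# Usage: probe_cmd SECONDS COMMAND [ARGS...]
probe_cmd() {
    limit=$1
    shift
    marker=$(mktemp) || return 2
    # The subshell records the command's exit status in the marker file
    # once the command returns; a blocked command never gets that far.
    ( "$@" && rc=0 || rc=$?; echo "$rc" > "$marker" ) </dev/null >/dev/null 2>&1 &
    waited=0
    while [ ! -s "$marker" ] && [ "$waited" -lt "$limit" ]; do
        sleep 1
        waited=$((waited + 1))
    done
    if [ ! -s "$marker" ]; then
        # The marker is left in place: the background job may still write it.
        echo "hung"
    elif [ "$(cat "$marker")" -eq 0 ]; then
        rm -f "$marker"
        echo "ok"
    else
        rm -f "$marker"
        echo "error"
    fi
}
```

For example, "probe_cmd 5 ls /etc/pve" printing "hung" would be consistent with a blocked cluster file system, while "ok" would point elsewhere.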

plofkat

Active Member

Jun 13, 2017

#6

Hi Fabian,

This is indeed a clustered setup.

fabian

Proxmox Staff Member

Staff member

Jun 13, 2017

#7

I'd check if multicast is working reliably in your network (https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network) and restart the pve-cluster service. your log shows messages like "Jun 07 06:53:46 vwk-prox01.namaquawines.local pmxcfs[1979]: [status] notice: remove message from non-member 1/1986" and retry messages. the last message pmxcfs logged is that it received a sync request, but nothing about handling it.
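
To get a rough idea of how often the notice quoted above shows up, the journal can be piped through a small counter. A minimal sketch; count_nonmember_msgs is a made-up helper name, and the match string is taken verbatim from the quoted log line:

```shell
#!/bin/sh
# count_nonmember_msgs (hypothetical helper): count lines on stdin that
# contain the pmxcfs notice quoted in the post above.
count_nonmember_msgs() {
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps the helper's exit status clean in that case.
    grep -c 'remove message from non-member' || true
}
```

Since pmxcfs runs under the pve-cluster service, a possible invocation is: journalctl -b -u pve-cluster | count_nonmember_msgs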

plofkat

Active Member

Jun 13, 2017

#8

Fabian, you are a gentleman and a scholar.
Please pass on my recommendation to HR that you should receive a raise and promotion immediately.

I think you have just shown me the root cause of all the issues I have been having with proxmox.
Corosync is currently on the same network as my storage cluster, which would explain why everything goes so horribly wrong as soon as server load and network traffic increase.

I will split it off to a separate network as recommended by the wiki (which I should have read with much more attention before upgrading).

I assume it would be best to shut down all VMs on the cluster before I make any changes?

plofkat

Active Member

Jun 13, 2017

#9

fabian said:
I'd check if multicast is working reliably in your network (https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network) and restart the pve-cluster service. your log shows messages like "Jun 07 06:53:46 vwk-prox01.namaquawines.local pmxcfs[1979]: [status] notice: remove message from non-member 1/1986" and retry messages. the last message pmxcfs logged is that it received a sync request, but nothing about handling it.

I do have one big problem: one of my servers is located off site (about 6 km away, across a river and on the other side of town). There is a site-to-site link (600 Mbps wireless), but there is no way to connect that server to the new physical network I have to create for corosync.

It seems that from version 2 to version 4, Proxmox has gone from the best possible solution for our needs to the worst possible.
Any suggestions would be welcome, for it seems I am now up the proverbial creek with no paddle.

fabian

Proxmox Staff Member

Staff member

Jun 14, 2017

#10

plofkat said:
I do have one big problem: one of my servers is located off site (about 6 km away, across a river and on the other side of town). There is a site-to-site link (600 Mbps wireless), but there is no way to connect that server to the new physical network I have to create for corosync.

It seems that from version 2 to version 4, Proxmox has gone from the best possible solution for our needs to the worst possible.
Any suggestions would be welcome, for it seems I am now up the proverbial creek with no paddle.

the cluster network does not care (much) about bandwidth, but it is rather latency sensitive. the upstream recommendation is <= 2 milliseconds. maybe it is possible to split that one node out of the cluster and operate it as a standalone node? you'd lose migration capabilities, but you could set up some kind of network storage between the single node and the cluster to have off-site backup and restore capabilities.
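
The 2 ms figure above can be checked against real measurements by pulling the average and maximum round-trip time out of a ping summary. A rough sketch, assuming the iputils ping summary format ("rtt min/avg/max/mdev = ..."); rtt_avg_max and within_limit are hypothetical helper names:

```shell
#!/bin/sh
# rtt_avg_max (hypothetical helper): print "AVG MAX" in milliseconds from
# the summary line of iputils ping output read on stdin. The summary line
# has the form: rtt min/avg/max/mdev = A/B/C/D ms
rtt_avg_max() {
    # Splitting on "/" puts avg in field 5 and max in field 6.
    awk -F'/' '/^rtt / { print $5, $6 }'
}

# within_limit VALUE_MS LIMIT_MS: succeed if VALUE_MS <= LIMIT_MS.
within_limit() {
    awk -v v="$1" -v l="$2" 'BEGIN { exit !(v + 0 <= l + 0) }'
}
```

A possible use is "ping -c 100 REMOTE_NODE | rtt_avg_max" (REMOTE_NODE being a placeholder for the other cluster member), followed by "within_limit" on the maximum value with a limit of 2. Running it once when the link is idle and once under load shows whether the link stays within the recommendation in both cases.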

plofkat

Active Member

Jun 14, 2017

#11

Losing migration would be extremely inconvenient, however it seems to be the only practical solution.
Under regular load, the latency between us and the remote site stays between 2 and 4 ms; under heavy load, it increases to as much as 20 ms.

This is most likely the leading cause of all the issues I am having.
Local latency rarely exceeds 1ms under heavy load.

Maulana Noor

Member

Apr 2, 2019

#12

This is how I troubleshot it on my PVE server:

Update first --> apt update && apt dist-upgrade -y && apt upgrade -y
run --> pveproxy status
run --> pveproxy start
run --> pveproxy status
run --> pveversion -v
run --> journalctl -xn
