pveproxy becomes blocked and cannot be killed

brickmasterj

New Member
Oct 18, 2015
19
1
3
NB: a zombie process is something different than a process blocked for I/O.
If you have a pveproxy process blocked, please provide the following output:

ps faxl | grep pveproxy
Pardon, I meant blocked. I get the same output as in the original post. In any case:
Code:
root@pve:~# ps faxl | grep pveproxy
0     0  3372  3242  20   0  12732  1792 pipe_w S+   pts/3      0:00          \_ grep pveproxy
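(One pitfall in the transcript above: the only match is the grep process itself, because grep's pattern matches its own command line. A sketch of a pgrep-based check that avoids that false hit, not a PVE tool:)

```shell
# pgrep matches process names/command lines without matching itself;
# -a prints the full command line of each hit. If nothing matches,
# pgrep exits non-zero and the fallback message is printed.
pgrep -a pveproxy || echo "no pveproxy process found"
```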
 

manu

Proxmox Staff Member
Mar 3, 2015
806
66
28
In your case it looks like pveproxy is not even running.
What is the output of

service pveproxy status

?

If it is not running, please start it.
 

brickmasterj

New Member
Oct 18, 2015
19
1
3
In your case it looks like pveproxy is not even running.
What is the output of

service pveproxy status

?

If it is not running, please start it.
Exactly the same as in the original post:
Code:
root@pve:~# service pveproxy status
● pveproxy.service - PVE API Proxy Server
   Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled)
   Active: failed (Result: timeout) since Tue 2016-09-27 14:58:46 CEST; 21h ago
Main PID: 830 (code=exited, status=0/SUCCESS)

Sep 27 14:55:45 pve1 systemd[1]: pveproxy.service start operation timed out. Terminating.
Sep 27 14:57:16 pve1 systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
Sep 27 14:58:46 pve1 systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
Sep 27 14:58:46 pve1 systemd[1]: Failed to start PVE API Proxy Server.
Sep 27 14:58:46 pve1 systemd[1]: Unit pveproxy.service entered failed state.
 

manu

Proxmox Staff Member
Mar 3, 2015
806
66
28
If you have a pveproxy process blocked, it should appear in the process list with state D.
Are you sure there is no 'pveproxy' string in your list of processes (ps faxl)?

If there is no pve proxy process, try to start the daemon manually with

service pveproxy start
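(A side note: a blocked process can also be found without grepping for its name. A sketch that lists every process currently in uninterruptible sleep, assuming a Linux system with procps ps:)

```shell
# A STAT value starting with D means uninterruptible sleep; the wchan
# column shows the kernel function the process is waiting in, which
# hints at the blocked I/O path. The awk filter keeps the header line
# plus any D-state rows.
ps -eo pid,stat,wchan:20,cmd | awk 'NR==1 || $2 ~ /^D/'
```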
 

arifww

New Member
Dec 1, 2015
5
0
1
Hi, I have the same problem:

I cannot kill pveproxy.

Code:
0     0 18538 16346  20   0  12728  2036 pipe_w S+   pts/0      0:00          \_ grep pveproxy
4     0 21250     1  20   0 235228 65904 filena Ds   ?          0:00 /usr/bin/perl -T /usr/bin/pveproxy stop
4     0 21593     1  20   0 235152 65760 filena Ds   ?          0:00 /usr/bin/perl -T /usr/bin/pveproxy start
4     0 22147     1  20   0 235184 65840 filena Ds   ?          0:00 /usr/bin/perl -T /usr/bin/pveproxy start
4     0 16805     1  20   0 235204 65752 filena Ds   ?          0:00 /usr/bin/perl -T /usr/bin/pveproxy start

I can still access /etc/pve, and pvecm status also seems good.

Code:
Quorum information
------------------
Date:             Mon Oct 17 15:27:57 2016
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000005
Ring ID:          156
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.10.101
0x00000002          1 192.168.10.102
0x00000003          1 192.168.10.103
0x00000004          1 192.168.10.104
0x00000005          1 192.168.10.105 (local)
 

manu

Proxmox Staff Member
Mar 3, 2015
806
66
28
@arifww

what is the output of
the df command on your system ?
does it hang on a filesystem ?
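(A note on that check: a hung mount makes df itself block in D state, so a script probing for the problem should not call df bare. A sketch using coreutils timeout(1):)

```shell
# If any mounted filesystem is unreachable, df blocks instead of failing,
# so bound it with a timeout and treat the timeout itself as the symptom.
if timeout 5 df -h >/dev/null 2>&1; then
    echo "all filesystems responsive"
else
    echo "df hung or failed -- suspect a stale mount (NFS or /etc/pve)"
fi
```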
 

arifww

New Member
Dec 1, 2015
5
0
1
@manu

The df output seems OK:

Code:
Filesystem           1K-blocks     Used Available Use% Mounted on
udev                     10240        0     10240   0% /dev
tmpfs                  6586756    82932   6503824   2% /run
/dev/dm-0             10190136  1205008   8444456  13% /
tmpfs                 16466884    55008  16411876   1% /dev/shm
tmpfs                     5120        0      5120   0% /run/lock
tmpfs                 16466884        0  16466884   0% /sys/fs/cgroup
/dev/mapper/pve-data 249923436 98994288 138210660  42% /var/lib/vz
tmpfs                      100        0       100   0% /run/lxcfs/controllers
cgmfs                      100        0       100   0% /run/cgmanager/fs
/dev/fuse                30720       44     30676   1% /etc/pve

The system is otherwise OK; I can access it normally via SSH.
 

Jura

Member
Dec 20, 2010
1
0
21
I can confirm this problem.
Will provide the requested information ASAP.
/etc/init.d/pve-cluster restart
usually helps
 

manu

Proxmox Staff Member
Mar 3, 2015
806
66
28
The thing is that when a process is blocked in state D (waiting for I/O), it can be caused by many different things. Most of the time the cause is an unreachable filesystem with a file handle still open (hence the request to check with df that all filesystems are accessible): either an NFS mount, or the FUSE filesystem in /etc/pve.

So if you have a pveproxy process blocked:

Check whether the process is in state D:

ps faxl | grep pveproxy
1 33 7475 1 20 0 332612 88092 hrtime Ss ? 0:00 pveproxy

(Here I have state Ss, for instance, on my system.)

If it is not in state D, restart with
service pveproxy restart

If it is in state D, check your filesystems:
df -h

If the df command hangs, restart the pve-cluster service.

If this does not help, you need to reboot the host. A process waiting for I/O cannot be cleanly killed; that's Unix design (which is why the exact process state is called *uninterruptible sleep*).
In my experience this comes most of the time from flaky networks (for example, NFS mounts when the network went down).
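(The decision steps above can be sketched as a tiny shell helper. classify_state is a hypothetical function, not part of PVE; it maps the STAT column from `ps faxl` to the suggested action:)

```shell
#!/bin/sh
# Map the ps STAT value of pveproxy to the action suggested above:
#   D*               -> uninterruptible sleep: check the filesystems first
#   "" (no process)  -> the daemon is not running: start it
#   anything else    -> a plain restart is safe
classify_state() {
    case "$1" in
        D*) echo "check filesystems (df -h); if df hangs, restart pve-cluster" ;;
        "") echo "service pveproxy start" ;;
        *)  echo "service pveproxy restart" ;;
    esac
}

# ps prints nothing (and exits non-zero) when no pveproxy process exists.
classify_state "$(ps -o stat= -C pveproxy | head -n 1)"
```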
 

timdau

Member
Feb 24, 2017
5
0
6
52
Ran into this problem today, running a new-ish kernel: 4.4.6-1-pve.

Running /etc/init.d/pve-cluster restart on one node fixed the issue on all nodes. Did I just happen to guess correctly on the broken node?

pvecm status said everything was fine.

I could read files inside /etc/pve

After I restarted I got about 117k log entries like this one:

Feb 24 10:35:49 pve2 pmxcfs[14393]: [status] notice: remove message from non-member 1/3027

This has happened twice now, it scares the heck out of me each time.
 

brickmasterj

New Member
Oct 18, 2015
19
1
3
After having experienced this issue a number of times, I have started to collect some metrics on the sort of machines it happens on. One thing immediately stood out: of the roughly 10 machines I have actively and extensively run PVE on, this issue has in my case occurred exclusively on machines with AMD CPUs and chipsets, and without ECC RAM. All my machines ran some form of ZFS pool and ranged from mid-range AMD to high-end Xeons.

I would be very interested to hear whether other people who have encountered this issue can report the architecture and/or RAM type they are using; maybe there is a correlation (non-ECC RAM, for example, would be an easy suspect).
 

timdau

Member
Feb 24, 2017
5
0
6
52
Yeah, I am sorry. I wish it were otherwise. It could still be ZFS, but I don't understand how it could be related to the cluster.
 

axion.joey

Active Member
Dec 29, 2009
78
2
28
I just saw this exact same issue running the latest version of Proxmox.

pveversion -v
proxmox-ve: 4.4-84 (running kernel: 4.4.44-1-pve)
pve-manager: 4.4-13 (running version: 4.4-13/7ea56165)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.44-1-pve: 4.4.44-84
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-109
pve-firmware: 1.1-10
libpve-common-perl: 4.0-94
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-96
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80

Running on a Dell R620 with Intel processors and ECC RAM. At the time of the crash there were only 3 LXCs running on the host, all of which were mostly idle. Has anyone found a solution for this problem?
 

Gert

Member
Jul 27, 2015
16
1
23
43
Centurion, South Africa
www.huge.co.za
This happens to 2 of the 3 nodes in my cluster every couple of days, always the same 2 nodes and always at the same time.

Proxmox sends this email:

/etc/cron.daily/logrotate:
Job for pveproxy.service failed. See 'systemctl status pveproxy.service' and 'journalctl -xn' for details.
Job for spiceproxy.service failed. See 'systemctl status spiceproxy.service' and 'journalctl -xn' for details.
error: error running shared postrotate script for '/var/log/pveproxy/access.log '
run-parts: /etc/cron.daily/logrotate exited with return code 1

Every time, the RAM is strangely over-utilized on both nodes for no apparent reason.

root@proxmox:~# free -m
                    total       used       free     shared    buffers     cached
Mem:                32149      31804        345         74      12111        119
-/+ buffers/cache:             19573      12576
Swap:                8191        308       7883


root@proxmox:~# systemctl status pveproxy.service
● pveproxy.service - PVE API Proxy Server
Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled)
Active: failed (Result: timeout) since Fri 2017-04-14 06:49:08 SAST; 1h 34min ago
Main PID: 29521 (code=exited, status=0/SUCCESS)

Apr 14 06:46:08 proxmox systemd[1]: pveproxy.service start operation timed out. Terminating.
Apr 14 06:47:38 proxmox systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
Apr 14 06:49:08 proxmox systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
Apr 14 06:49:08 proxmox systemd[1]: Failed to start PVE API Proxy Server.
Apr 14 06:49:08 proxmox systemd[1]: Unit pveproxy.service entered failed state.


To get things running I have to do the following:

sync; echo 3 > /proc/sys/vm/drop_caches

root@proxmox:~# free -m
                    total       used       free     shared    buffers     cached
Mem:                32149      19327      12821         74         45         89
-/+ buffers/cache:             19192      12956
Swap:                8191        308       7883

root@proxmox:~# /etc/init.d/pve-cluster restart
[ ok ] Restarting pve-cluster (via systemctl): pve-cluster.service.
root@proxmox:~# pveproxy restart


All is now good again, with no need for a reboot.
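(For what it's worth, the "used" column from free(1) counts page cache, which the kernel drops on demand; before resorting to drop_caches it is worth checking MemAvailable from /proc/meminfo, present since kernel 3.14. A sketch:)

```shell
# MemAvailable estimates how much memory is usable for new workloads
# without swapping, already accounting for reclaimable cache -- a better
# signal than "used" for deciding whether the node is really out of RAM.
awk '/^MemAvailable:/ {printf "available: %d MiB\n", $2/1024}' /proc/meminfo
```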
 

Haru

New Member
Apr 29, 2017
5
0
1
32
Anything new here? I'm facing this problem. Rebooting all nodes helps, but after a few minutes it happens again. I have a 4-node cluster. Is there any way to remove the cluster-related services on all nodes and make them standalone again (just Proxmox without any clustering)?
 

brickmasterj

New Member
Oct 18, 2015
19
1
3
Anything new here? I'm facing this problem. Rebooting all nodes helps, but after a few minutes it happens again. I have a 4-node cluster. Is there any way to remove the cluster-related services on all nodes and make them standalone again (just Proxmox without any clustering)?
Nope, we have since replaced all the production machines this error occurred on, as there seems to be no fix for this fairly common problem (they were older AMD machines anyway). My test machine, however, still runs into this mystery of a problem every few weeks or so if left on long enough...

The only thing I can add is that my suspicion still lies with ZFS, because I've only seen the error occur on ZFS RAIDZ machines...
 
