pveproxy enters blocked state and cannot be killed

NB: a zombie process is something different from a process blocked on I/O.

If you have a pveproxy process blocked, please provide the output of:

ps faxl | grep pveproxy
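
If ps faxl is too noisy, you can also ask ps for just the state and wait channel of pveproxy directly (only an illustration, not something you strictly need for the output above):
Code:
# STAT starting with D = uninterruptible sleep (blocked on I/O), S = sleeping, Z = zombie
ps -C pveproxy -o pid,stat,wchan:20,cmd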
Pardon, I meant blocked. I get the same output as in the original post. In any case:
Code:
root@pve:~# ps faxl | grep pveproxy
0     0  3372  3242  20   0  12732  1792 pipe_w S+   pts/3      0:00          \_ grep pveproxy
 
In your case it looks like pveproxy is not even running.
What is the output of

service pveproxy status

?

If it is not running, please start it.
 
Exactly the same as in the original post:
Code:
root@pve:~# service pveproxy status
● pveproxy.service - PVE API Proxy Server
   Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled)
   Active: failed (Result: timeout) since Tue 2016-09-27 14:58:46 CEST; 21h ago
Main PID: 830 (code=exited, status=0/SUCCESS)

Sep 27 14:55:45 pve1 systemd[1]: pveproxy.service start operation timed out. Terminating.
Sep 27 14:57:16 pve1 systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
Sep 27 14:58:46 pve1 systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
Sep 27 14:58:46 pve1 systemd[1]: Failed to start PVE API Proxy Server.
Sep 27 14:58:46 pve1 systemd[1]: Unit pveproxy.service entered failed state.
 
If you have a pveproxy blocked, it should appear in the process list with the state D.
Are you sure there is no 'pveproxy' string in your list of processes (ps faxl)?

If there is no pveproxy process, try to start the daemon manually with

service pveproxy start
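
And if you want to see whether anything at all is stuck in state D (not only pveproxy), a quick filter like this works; the awk test on the STAT column is my own addition:
Code:
# keep the header plus every process whose state starts with D
ps -eo pid,stat,wchan:20,cmd | awk 'NR==1 || $2 ~ /^D/'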
 
Hi, I have the same problem.

Cannot kill pveproxy.

Code:
0     0 18538 16346  20   0  12728  2036 pipe_w S+   pts/0      0:00          \_ grep pveproxy
4     0 21250     1  20   0 235228 65904 filena Ds   ?          0:00 /usr/bin/perl -T /usr/bin/pveproxy stop
4     0 21593     1  20   0 235152 65760 filena Ds   ?          0:00 /usr/bin/perl -T /usr/bin/pveproxy start
4     0 22147     1  20   0 235184 65840 filena Ds   ?          0:00 /usr/bin/perl -T /usr/bin/pveproxy start
4     0 16805     1  20   0 235204 65752 filena Ds   ?          0:00 /usr/bin/perl -T /usr/bin/pveproxy start

I can still access /etc/pve, and pvecm status also seems good.

Code:
Quorum information
------------------
Date:             Mon Oct 17 15:27:57 2016
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000005
Ring ID:          156
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.10.101
0x00000002          1 192.168.10.102
0x00000003          1 192.168.10.103
0x00000004          1 192.168.10.104
0x00000005          1 192.168.10.105 (local)
 
@arifww

What is the output of the df command on your system?
Does it hang on a filesystem?
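
If a plain df hangs it can be hard to tell which mount is the culprit; something like this probes each mountpoint separately (just a sketch -- the 5-second timeout is arbitrary, and timeout/stat are assumed to be the coreutils versions):
Code:
# probe every mountpoint; a hung one makes stat block until the timeout fires
awk '{print $2}' /proc/mounts | while read -r mp; do
    timeout 5 stat -f "$mp" >/dev/null 2>&1 || echo "possibly hung: $mp"
done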
 
@manu

df output seems OK:

Code:
Filesystem           1K-blocks     Used Available Use% Mounted on
udev                     10240        0     10240   0% /dev
tmpfs                  6586756    82932   6503824   2% /run
/dev/dm-0             10190136  1205008   8444456  13% /
tmpfs                 16466884    55008  16411876   1% /dev/shm
tmpfs                     5120        0      5120   0% /run/lock
tmpfs                 16466884        0  16466884   0% /sys/fs/cgroup
/dev/mapper/pve-data 249923436 98994288 138210660  42% /var/lib/vz
tmpfs                      100        0       100   0% /run/lxcfs/controllers
cgmfs                      100        0       100   0% /run/cgmanager/fs
/dev/fuse                30720       44     30676   1% /etc/pve

System is OK, I can access it (via SSH) normally.
 
I confirm this problem.
Will provide the requested information ASAP.
/etc/init.d/pve-cluster restart
usually helps
 
The thing is that when a process is blocked in state D (waiting for I/O), it can be caused by many different things. Most of the time the cause is an unreachable filesystem on which a file handle is still open (hence the request to check with df that all filesystems are accessible), either an NFS mount or the fuse filesystem in /etc/pve.

So if you have a pveproxy process blocked (see the sketch at the end of this post):
Check if the process is in state D

ps faxl | grep pveproxy
1 33 7475 1 20 0 332612 88092 hrtime Ss ? 0:00 pveproxy

here I have the state Ss for instance on my system

If not in state D, restart with
service pveproxy restart

if in state D, check your filesystems with
df -h

if the df command hangs, then restart the pve-cluster service

If this does not help, you need to reboot the host. A process waiting for I/O cannot be cleanly killed, that's Unix design (that's why the exact process state is called *uninterruptible sleep*).
From my experience this comes most of the time from flaky networks (for example, NFS mounts when the network went down).
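
Put together, the checks above could look roughly like this -- a minimal sketch, assuming systemd is in use (the 10-second timeout and the use of systemctl instead of the init script are my own choices):
Code:
#!/bin/sh
# 1) is any pveproxy process stuck in uninterruptible sleep (state D)?
ps -C pveproxy -o pid,stat,wchan:20,cmd

# 2) run df with a timeout so this script itself does not block on a dead mount
if ! timeout 10 df -h >/dev/null 2>&1; then
    echo "df hangs -> a filesystem is unreachable, restarting pve-cluster"
    systemctl restart pve-cluster   # same as /etc/init.d/pve-cluster restart
    systemctl restart pveproxy
fi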
 
Ran into this problem today, running a new-ish kernel, 4.4.6-1-pve.

Running /etc/init.d/pve-cluster restart on one node fixed the issue on all nodes. Did I just happen to guess correctly on the broken node?

pvecm status said everything was fine.

I could read files inside /etc/pve

After I restarted I got about 117k log entries like this one:

Feb 24 10:35:49 pve2 pmxcfs[14393]: [status] notice: remove message from non-member 1/3027

This has happened twice now; it scares the heck out of me each time.
 
After having experienced this issue any number of times, I have started to collect some metrics on the sort of machines it happens on. One thing that immediately stood out is that, of the roughly 10 machines I have actively and extensively run PVE on, this issue has in my case occurred exclusively on machines with AMD CPUs and chipsets, and without ECC RAM. All my machines ran some form of ZFS pool and ranged from mid-range AMD to high-end Xeons.

I would be very interested to hear whether the other people who have encountered this issue can report on the architecture and/or RAM type they are using; maybe there is a correlation (no ECC RAM, for example, would be an easy suspect).
 
Yeah, I am sorry. I wish it was otherwise. It could still be ZFS, but I don't understand how it could be related to the cluster.
 
I just saw this exact same issue running the latest version of Proxmox.

pveversion -v
proxmox-ve: 4.4-84 (running kernel: 4.4.44-1-pve)
pve-manager: 4.4-13 (running version: 4.4-13/7ea56165)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.44-1-pve: 4.4.44-84
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-109
pve-firmware: 1.1-10
libpve-common-perl: 4.0-94
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-96
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80

Running on a Dell R620 with Intel processors and ECC RAM. At the time of the crash there were only 3 LXCs running on the host, all of which were mostly idle. Has anyone found a solution for this problem?
 
This happens to 2 of my 3 nodes in my cluster every couple of days, always the same 2 nodes and always at the same time.

Proxmox sends this email:

/etc/cron.daily/logrotate:
Job for pveproxy.service failed. See 'systemctl status pveproxy.service' and 'journalctl -xn' for details.
Job for spiceproxy.service failed. See 'systemctl status spiceproxy.service' and 'journalctl -xn' for details.
error: error running shared postrotate script for '/var/log/pveproxy/access.log '
run-parts: /etc/cron.daily/logrotate exited with return code 1

Every time, the RAM is strangely over-utilized on both nodes for no apparent reason.

root@proxmox:~# free -m
             total       used       free     shared    buffers     cached
Mem:         32149      31804        345         74      12111        119
-/+ buffers/cache:      19573      12576
Swap:         8191        308       7883


root@proxmox:~# systemctl status pveproxy.service
● pveproxy.service - PVE API Proxy Server
   Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled)
   Active: failed (Result: timeout) since Fri 2017-04-14 06:49:08 SAST; 1h 34min ago
 Main PID: 29521 (code=exited, status=0/SUCCESS)

Apr 14 06:46:08 proxmox systemd[1]: pveproxy.service start operation timed out. Terminating.
Apr 14 06:47:38 proxmox systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
Apr 14 06:49:08 proxmox systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
Apr 14 06:49:08 proxmox systemd[1]: Failed to start PVE API Proxy Server.
Apr 14 06:49:08 proxmox systemd[1]: Unit pveproxy.service entered failed state.


To get things running I have to do the following:

sync; echo 3 > /proc/sys/vm/drop_caches

root@proxmox:~# free -m
             total       used       free     shared    buffers     cached
Mem:         32149      19327      12821         74         45         89
-/+ buffers/cache:      19192      12956
Swap:         8191        308       7883

root@proxmox:~# /etc/init.d/pve-cluster restart
[ ok ] Restarting pve-cluster (via systemctl): pve-cluster.service.
root@proxmox:~# pveproxy restart


All is now good again, and there is no need for a reboot.
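
For what it's worth, the same recovery sequence as one small script (my own consolidation of the commands above; the comments are only my understanding of what each step does):
Code:
#!/bin/sh
sync                                  # flush dirty pages first so nothing is lost
echo 3 > /proc/sys/vm/drop_caches     # drop clean page cache, dentries and inodes
/etc/init.d/pve-cluster restart       # restart the cluster filesystem (pmxcfs)
pveproxy restart                      # then restart the API proxy itself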
 
Anything new here? I'm facing this problem. Rebooting all nodes helps, but after a few minutes it happens again. I have a 4-node cluster. Is there any way to remove the cluster-related services on all nodes and make them standalone again (just Proxmox without any clustering)?
 
Nope, we have since replaced all the production machines this error occurred on, as there seems to be no possible fix for this fairly common problem (they were older AMD machines anyway). My testing machine, however, still runs into this mystery of a problem every few weeks or so if left on long enough...

The only thing I can add is that my suspicion still lies with ZFS, because I've only seen the error occur on ZFS RAIDZ machines...
 
