pveproxy become blocked state and cannot be killed

dirk.nilius

Member
Nov 5, 2015
47
0
6
Berlin, Germany
Symptoms:
- web UI not accessible

Code:
systemctl status pveproxy

● pveproxy.service - PVE API Proxy Server
Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled)
Active: failed (Result: timeout) since Mon 2015-11-09 06:49:06 CET; 3h 46min ago
Process: 7836 ExecReload=/usr/bin/pveproxy restart (code=exited, status=0/SUCCESS)
Main PID: 27630 (code=exited, status=0/SUCCESS)

Nov 09 06:46:06 ckc-b-p0005 systemd[1]: pveproxy.service start operation timed out. Terminating.
Nov 09 06:47:36 ckc-b-p0005 systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
Nov 09 06:49:06 ckc-b-p0005 systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
Nov 09 06:49:06 ckc-b-p0005 systemd[1]: Failed to start PVE API Proxy Server.
Nov 09 06:49:06 ckc-b-p0005 systemd[1]: Unit pveproxy.service entered failed state.

Code:
ps -aux | grep pveproxy

root 11474 0.0 0.0 14188 1856 pts/5 S+ 10:37 0:00 grep pveproxy
root 22952 0.0 0.1 235244 62224 ? Ds 06:25 0:00 /usr/bin/perl -T /usr/bin/pveproxy stop
root 25449 0.0 0.1 235256 62296 ? Ds 06:32 0:00 /usr/bin/perl -T /usr/bin/pveproxy start
root 29533 0.0 0.1 235224 62208 ? Ds 06:44 0:00 /usr/bin/perl -T /usr/bin/pveproxy start

Code:
kill -9 22952 25449 29533
-> no effect

Code:
reboot
-> system stuck with waiting for running jobs message

had to do a hard power off

proxmox-ve: 4.0-19 (running kernel: 4.2.3-2-pve)
pve-manager: 4.0-57 (running version: 4.0-57/cc7c2b53)
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.2.3-2-pve: 4.2.3-19
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-24
qemu-server: 4.0-35
pve-firmware: 1.1-7
libpve-common-perl: 4.0-36
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-29
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-12
pve-container: 1.0-20
pve-firewall: 2.0-13
pve-ha-manager: 1.0-13
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.4-3
lxcfs: 0.10-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve4~jessie

syslog:

Nov 9 06:27:50 ckc-b-p0003 kernel: [168917.277697] Tainted: P O 4.2.3-2-pve #1
Nov 9 06:27:50 ckc-b-p0003 kernel: [168917.277744] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 9 06:27:50 ckc-b-p0003 kernel: [168917.277824] pveproxy D 000000000000000e 0 10150 1 0x00000004
Nov 9 06:27:50 ckc-b-p0003 kernel: [168917.277829] ffff880507173dc8 0000000000000086 ffff88105aa65280 ffff881042378dc0
Nov 9 06:27:50 ckc-b-p0003 kernel: [168917.277834] 00000000ffffffff ffff881041c306b0 ffff880507173de8 ffffffff818036a7
Nov 9 06:27:50 ckc-b-p0003 kernel: [168917.277836] Call Trace:
Nov 9 06:27:50 ckc-b-p0003 kernel: [168917.277845] [<ffffffff818036a7>] schedule+0x37/0x80
Nov 9 06:27:50 ckc-b-p0003 kernel: [168917.277849] [<ffffffff8180395e>] schedule_preempt_disabled+0xe/0x10
Nov 9 06:27:50 ckc-b-p0003 kernel: [168917.277851] [<ffffffff81805665>] __mutex_lock_slowpath+0xb5/0x120
Nov 9 06:27:50 ckc-b-p0003 kernel: [168917.277853] [<ffffffff818056f3>] mutex_lock+0x23/0x40
Nov 9 06:27:50 ckc-b-p0003 kernel: [168917.277856] [<ffffffff8120d19a>] filename_create+0x7a/0x160
Nov 9 06:27:50 ckc-b-p0003 kernel: [168917.277858] [<ffffffff8120e133>] SyS_mkdir+0x53/0x100
Nov 9 06:27:50 ckc-b-p0003 kernel: [168917.277861] [<ffffffff818077b2>] entry_SYSCALL_64_fastpath+0x16/0x75
Nov 9 06:28:04 ckc-b-p0003 systemd[1]: pveproxy.service stop-sigterm timed out. Killing.
Nov 9 06:29:34 ckc-b-p0003 systemd[1]: pveproxy.service still around after SIGKILL. Ignoring.
Nov 9 06:29:50 ckc-b-p0003 kernel: [169037.161915] Tainted: P O 4.2.3-2-pve #1
Nov 9 06:29:50 ckc-b-p0003 kernel: [169037.162046] ffff880507173dc8 0000000000000086 ffff88105aa65280 ffff881042378dc0
Nov 9 06:29:50 ckc-b-p0003 kernel: [169037.162053] Call Trace:
Nov 9 06:29:50 ckc-b-p0003 kernel: [169037.162068] [<ffffffff81805665>] __mutex_lock_slowpath+0xb5/0x120
Nov 9 06:29:50 ckc-b-p0003 kernel: [169037.162075] [<ffffffff8120e133>] SyS_mkdir+0x53/0x100
Nov 9 06:30:25 ckc-b-p0003 rrdcached[1774]: flushing old values
Nov 9 06:30:25 ckc-b-p0003 rrdcached[1774]: rotating journals
Nov 9 06:30:25 ckc-b-p0003 rrdcached[1774]: started new journal /var/lib/rrdcached/journal/rrd.journal.1447047025.319497
Nov 9 06:30:25 ckc-b-p0003 rrdcached[1774]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1447039825.319469
Nov 9 06:31:04 ckc-b-p0003 systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
Nov 9 06:31:50 ckc-b-p0003 kernel: [169157.046251] pveproxy D 000000000000000e 0 10150 1 0x00000004
Nov 9 06:31:50 ckc-b-p0003 kernel: [169157.046261] 00000000ffffffff ffff881041c306b0 ffff880507173de8 ffffffff818036a7
Nov 9 06:31:50 ckc-b-p0003 kernel: [169157.046276] [<ffffffff8180395e>] schedule_preempt_disabled+0xe/0x10
Nov 9 06:31:50 ckc-b-p0003 kernel: [169157.046284] [<ffffffff8120d19a>] filename_create+0x7a/0x160
Nov 9 06:32:34 ckc-b-p0003 systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
Nov 9 06:32:34 ckc-b-p0003 systemd[1]: Unit pveproxy.service entered failed state.
Nov 9 06:33:50 ckc-b-p0003 kernel: [169276.930396] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 9 06:33:50 ckc-b-p0003 kernel: [169276.930485] ffff881041c30600 ffff880507174000 ffff881041c306ac ffff881042378dc0
Nov 9 06:33:50 ckc-b-p0003 kernel: [169276.930499] [<ffffffff818036a7>] schedule+0x37/0x80
Nov 9 06:33:50 ckc-b-p0003 kernel: [169276.930506] [<ffffffff818056f3>] mutex_lock+0x23/0x40
Nov 9 06:33:50 ckc-b-p0003 kernel: [169276.930514] [<ffffffff818077b2>] entry_SYSCALL_64_fastpath+0x16/0x75
Nov 9 06:34:05 ckc-b-p0003 systemd[1]: pveproxy.service start operation timed out. Terminating.
Nov 9 06:35:35 ckc-b-p0003 systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
Nov 9 06:35:50 ckc-b-p0003 kernel: [169396.814485] INFO: task pveproxy:10150 blocked for more than 120 seconds.
Nov 9 06:35:50 ckc-b-p0003 kernel: [169396.814542] Tainted: P O 4.2.3-2-pve #1
Nov 9 06:35:50 ckc-b-p0003 kernel: [169396.814590] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 9 06:35:50 ckc-b-p0003 kernel: [169396.814671] pveproxy D 000000000000000e 0 10150 1 0x00000004
Nov 9 06:35:50 ckc-b-p0003 kernel: [169396.814679] ffff881041c30600 ffff880507174000 ffff881041c306ac ffff881042378dc0
Nov 9 06:35:50 ckc-b-p0003 kernel: [169396.814692] [<ffffffff818036a7>] schedule+0x37/0x80
Nov 9 06:35:50 ckc-b-p0003 kernel: [169396.814700] [<ffffffff818056f3>] mutex_lock+0x23/0x40
Nov 9 06:35:50 ckc-b-p0003 kernel: [169396.814708] [<ffffffff818077b2>] entry_SYSCALL_64_fastpath+0x16/0x75
Nov 9 06:35:50 ckc-b-p0003 kernel: [169396.814807] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 9 06:35:50 ckc-b-p0003 kernel: [169396.814892] ffff881041c30600 ffff880c465c4000 ffff881041c306ac ffff881008fec4c0
Nov 9 06:35:50 ckc-b-p0003 kernel: [169396.814898] [<ffffffff818036a7>] schedule+0x37/0x80
Nov 9 06:35:50 ckc-b-p0003 kernel: [169396.814903] [<ffffffff818056f3>] mutex_lock+0x23/0x40
Nov 9 06:35:50 ckc-b-p0003 kernel: [169396.814909] [<ffffffff818077b2>] entry_SYSCALL_64_fastpath+0x16/0x75
Nov 9 06:37:05 ckc-b-p0003 systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
Nov 9 06:37:05 ckc-b-p0003 systemd[1]: Failed to start PVE API Proxy Server.
Nov 9 06:37:05 ckc-b-p0003 systemd[1]: Unit pveproxy.service entered failed state.
Nov 9 06:37:50 ckc-b-p0003 kernel: [169516.698755] Tainted: P O 4.2.3-2-pve #1
Nov 9 06:37:50 ckc-b-p0003 kernel: [169516.698897] pveproxy D 000000000000000e 0 10150 1 0x00000004
Nov 9 06:37:50 ckc-b-p0003 kernel: [169516.698906] 00000000ffffffff ffff881041c306b0 ffff880507173de8 ffffffff818036a7
Nov 9 06:37:50 ckc-b-p0003 kernel: [169516.698921] [<ffffffff8180395e>] schedule_preempt_disabled+0xe/0x10
Nov 9 06:37:50 ckc-b-p0003 kernel: [169516.698929] [<ffffffff8120d19a>] filename_create+0x7a/0x160
Nov 9 06:37:50 ckc-b-p0003 kernel: [169516.698935] INFO: task pveproxy:14932 blocked for more than 120 seconds.
Nov 9 06:37:50 ckc-b-p0003 kernel: [169516.699112] pveproxy D 000000000000000c 0 14932 1 0x00000004
Nov 9 06:37:50 ckc-b-p0003 kernel: [169516.699118] 00000000ffffffff ffff881041c306b0 ffff880c465c3de8 ffffffff818036a7
Nov 9 06:37:50 ckc-b-p0003 kernel: [169516.699125] [<ffffffff8180395e>] schedule_preempt_disabled+0xe/0x10
Nov 9 06:37:50 ckc-b-p0003 kernel: [169516.699130] [<ffffffff8120d19a>] filename_create+0x7a/0x160
Nov 9 06:38:35 ckc-b-p0003 systemd[1]: spiceproxy.service stopping timed out. Terminating.
Nov 9 06:38:35 ckc-b-p0003 spiceproxy[27527]: received signal TERM
Nov 9 06:38:35 ckc-b-p0003 spiceproxy[27527]: server closing
Nov 9 06:38:35 ckc-b-p0003 spiceproxy[27527]: worker 27528 finished
Nov 9 06:38:35 ckc-b-p0003 spiceproxy[27527]: server stopped
Nov 9 06:39:50 ckc-b-p0003 kernel: [169636.582973] Tainted: P O 4.2.3-2-pve #1
Nov 9 06:39:50 ckc-b-p0003 kernel: [169636.583101] pveproxy D 000000000000000e 0 10150 1 0x00000004
Nov 9 06:39:50 ckc-b-p0003 kernel: [169636.583108] ffff881041c30600 ffff880507174000 ffff881041c306ac ffff881042378dc0
Nov 9 06:39:50 ckc-b-p0003 kernel: [169636.583122] [<ffffffff818036a7>] schedule+0x37/0x80
Nov 9 06:39:50 ckc-b-p0003 kernel: [169636.583130] [<ffffffff818056f3>] mutex_lock+0x23/0x40
Nov 9 06:39:50 ckc-b-p0003 kernel: [169636.583138] [<ffffffff818077b2>] entry_SYSCALL_64_fastpath+0x16/0x75
Nov 9 06:39:50 ckc-b-p0003 kernel: [169636.583237] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 9 06:39:50 ckc-b-p0003 kernel: [169636.583321] ffff881041c30600 ffff880c465c4000 ffff881041c306ac ffff881008fec4c0
Nov 9 06:39:50 ckc-b-p0003 kernel: [169636.583327] [<ffffffff818036a7>] schedule+0x37/0x80
Nov 9 06:39:50 ckc-b-p0003 kernel: [169636.583333] [<ffffffff818056f3>] mutex_lock+0x23/0x40
Nov 9 06:39:50 ckc-b-p0003 kernel: [169636.583339] [<ffffffff818077b2>] entry_SYSCALL_64_fastpath+0x16/0x75
Nov 9 06:40:06 ckc-b-p0003 systemd[1]: spiceproxy.service stop-sigterm timed out. Killing.
Nov 9 06:40:41 ckc-b-p0003 postfix/qmgr[1930]: 1E891F05E: from=<root@ckc-b-p0003.ckc.de>, size=1011, nrcpt=1 (queue active)
Nov 9 06:40:41 ckc-b-p0003 postfix/smtp[19769]: connect to ckc.de[192.168.13.0]:25: Network is unreachable
Nov 9 06:41:11 ckc-b-p0003 postfix/smtp[19769]: connect to ckc.de[192.168.115.10]:25: Connection timed out
Nov 9 06:41:11 ckc-b-p0003 postfix/smtp[19769]: connect to ckc.de[192.168.20.1]:25: Connection refused
Nov 9 06:41:36 ckc-b-p0003 systemd[1]: spiceproxy.service still around after SIGKILL. Ignoring.
Nov 9 06:41:41 ckc-b-p0003 postfix/smtp[19769]: connect to ckc.de[192.168.254.0]:25: Connection timed out
Nov 9 06:41:41 ckc-b-p0003 postfix/smtp[19769]: connect to ckc.de[192.168.10.1]:25: Connection refused
Nov 9 06:41:41 ckc-b-p0003 postfix/smtp[19769]: 1E891F05E: to=<nilius@ckc.de>, relay=none, delay=183499, delays=183439/0.02/60/0, dsn=4.4.1, status=deferred (connect to ckc.de[192.168.10.1]:25: Connection refused)
Nov 9 06:43:06 ckc-b-p0003 systemd[1]: spiceproxy.service stop-final-sigterm timed out. Killing.
Nov 9 06:44:36 ckc-b-p0003 systemd[1]: spiceproxy.service still around after final SIGKILL. Entering failed mode.
Nov 9 06:44:36 ckc-b-p0003 systemd[1]: Unit spiceproxy.service entered failed state.
Nov 9 06:46:07 ckc-b-p0003 systemd[1]: pveproxy.service start operation timed out. Terminating.
Nov 9 06:47:37 ckc-b-p0003 systemd[1]: pveproxy.service stop-final-sigterm timed out. Killing.
Nov 9 06:49:07 ckc-b-p0003 systemd[1]: pveproxy.service still around after final SIGKILL. Entering failed mode.
Nov 9 06:49:07 ckc-b-p0003 systemd[1]: Failed to start PVE API Proxy Server.
Nov 9 06:49:07 ckc-b-p0003 systemd[1]: Unit pveproxy.service entered failed state.
Nov 9 06:50:37 ckc-b-p0003 systemd[1]: spiceproxy.service start operation timed out. Terminating.
Nov 9 06:52:08 ckc-b-p0003 systemd[1]: spiceproxy.service stop-final-sigterm timed out. Killing.
Nov 9 06:53:38 ckc-b-p0003 systemd[1]: spiceproxy.service still around after final SIGKILL. Entering failed mode.
Nov 9 06:53:38 ckc-b-p0003 systemd[1]: Failed to start PVE SPICE Proxy Server.
Nov 9 06:53:38 ckc-b-p0003 systemd[1]: Unit spiceproxy.service entered failed state.
Nov 9 06:53:38 ckc-b-p0003 pvepw-logger[27552]: received terminate request (signal)
Nov 9 06:53:38 ckc-b-p0003 pvepw-logger[27552]: stopping pvefw logger
Nov 9 06:53:38 ckc-b-p0003 pvepw-logger[27227]: starting pvefw logger
 
Last edited:
pveproxy D 000000000000000e 0 10150 1 0x00000004

The D here means waiting for I/O.

Most probable causes would overruse of RAM, which could cause high swap activity, or storage accessibility problem.

If this happens again, please post the output of "uptime" and "free".
 
The RAM usage is max. at ~38GB of 64GB, Swap utilization is nealy zero. So I think that cannot be the problem. But I post the requestet values when it happend again.
 
Hi,

I found the problem. The clusterfs is blocking. Any process that is trying to read or write from /etc/pve become blocked forever. I saw this a few times. If you try ti `cd` into /etc/pve your shell is also blocked. Restarting pve-cluster from another shell is solving the problem. So this is apparently a bug in the clusterfs. Can I give you some more informations to help you to locate the problem?
 
Hi,

I think the log is not usefull at the moment. It is flooding the logs with hundereds of messages per minute. Now I have the problem that I cannot restart cluster:

-- Logs begin at Tue 2015-11-17 06:39:38 CET, end at Tue 2015-11-17 11:47:39 CET. --
Nov 17 11:47:36 ckc-b-p0005 pmxcfs[15123]: [main] notice: exit proxmox configuration filesystem (-1)
Nov 17 11:47:36 ckc-b-p0005 systemd[1]: pve-cluster.service: control process exited, code=exited status=255
Nov 17 11:47:36 ckc-b-p0005 systemd[1]: Failed to start The Proxmox VE cluster filesystem.
-- Subject: Unit pve-cluster.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit pve-cluster.service has failed.
--
-- The result is failed.
Nov 17 11:47:36 ckc-b-p0005 systemd[1]: Unit pve-cluster.service entered failed state.
Nov 17 11:47:38 ckc-b-p0005 pve-ha-lrm[15115]: ipcc_send_rec failed: Connection refused
Nov 17 11:47:38 ckc-b-p0005 pve-ha-lrm[15115]: ipcc_send_rec failed: Connection refused
Nov 17 11:47:38 ckc-b-p0005 pve-ha-lrm[15115]: ipcc_send_rec failed: Connection refused
Nov 17 11:47:39 ckc-b-p0005 pve-ha-crm[15109]: ipcc_send_rec failed: Connection refused
Nov 17 11:47:39 ckc-b-p0005 pve-ha-crm[15109]: ipcc_send_rec failed: Connection refused
Nov 17 11:47:39 ckc-b-p0005 pve-ha-crm[15109]: ipcc_send_rec failed: Connection refused
 
Is corosync running ?
What is the output of pvecm status ?
 
Yes corosync is running and pvecm status is all good:

Quorum information
------------------
Date: Thu Nov 19 10:32:26 2015
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 7940
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 192.168.13.12
0x00000003 1 192.168.13.13
0x00000001 1 192.168.13.14 (local)


I attached a log from a failing node. Perhaps you can see any useful information. View attachment journal.log.zip
 
I have the same problem.

After updating all my trouble started we already tried everything to identify the problem and I can not .

Dirk.Nilius could tell what was causing your problem ?
 
We've also started having this issue (pveproxy in blocked state). pveproxy was initially failed on one machine in a four machine cluster and then 'spread' to two other machines. I've had to perform hard restarts on the three machines where it was hung.

The only unusual behavior was that backups were hung on one of the machines.
 
I have the same problem. 3 node cluster working day after a reboot, and outputs the same mistakes.
INFO: task pveproxy:22867 blocked for more than 120 seconds.
[71520.042250] Tainted: G O 4.2.8-1-pve #1
[71520.042275] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[71520.042319] pveproxy D ffff88105f316a00 0 22867 1 0x00000004
[71520.042321] ffff880839eb7dc8 0000000000000082 ffff88085bd9b700 ffff8808560d0dc0
[71520.042323] ffff88085ab30300 ffff880839eb8000 ffff88085ab303ac ffff8808560d0dc0
[71520.042325] 00000000ffffffff ffff88085ab303b0 ffff880839eb7de8 ffffffff818069f7
[71520.042327] Call Trace:
[71520.042329] [<ffffffff818069f7>] schedule+0x37/0x80
[71520.042330] [<ffffffff81806cbe>] schedule_preempt_disabled+0xe/0x10
[71520.042332] [<ffffffff81808a15>] __mutex_lock_slowpath+0xb5/0x120
[71520.042334] [<ffffffff81808aa3>] mutex_lock+0x23/0x40
[71520.042336] [<ffffffff8120dc3a>] filename_create+0x7a/0x160
[71520.042338] [<ffffffff8120ebd3>] SyS_mkdir+0x53/0x100
[71520.042340] [<ffffffff8180ab72>] entry_SYSCALL_64_fastpath+0x16/0x75
 
Have you ZFS and the 4.2 Kernel? Does it happen if you do more backups? Then this could indeed be the same problem, it was a rare hard to reproducible Kernel regression. We fixed the issue temporary with downgrading to 3.19.

4.4 was not tested if this issue still persists, yet, AFAIK.

If you do not have ZFS (as storage for the node or as NFS background server) and 4.2 Kernel it may be another problem, then it would be helpful to open a new thread. :)
 
Is there any solution for this? I have a same symptom: web gui can't be connected (no process listen to port 8006), pveproxy become blocked state and cannot be killed.

Please advice.

Thanks.
 
Usually a 'blocked' problem in PVE comes from a process waiting for I/O ( disk or network server not responding)
Whenever your have a 'blocked, login via ssh to your server and see it the commands

pvesm status

executes or hangs.
 
Stuck with the exact same problem on 3 servers, however whenever it happens `pvesm status` seems to execute fine. This still is quite a problem. I also don't see any weird I/O wait times or whatever in the logs/nagios.

Any ideas how to proceed?
 
NB: a zombie process is something different than a process blocked for I/O.
please provide the following output:

if you have a pveproxy process blocked,
please provide the following output
ps faxl | grep pveproxy
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!