INFO: task pveproxy: blocked for more than 120 seconds

yaboc · Jun 7, 2016

i have a 4 node cluster hooked up to a SAN

today i got notification email from node1 (master node)

/etc/cron.daily/logrotate:

Job for pveproxy.service failed. See 'systemctl status pveproxy.service' and 'journalctl -xn' for details.

Job for spiceproxy.service failed. See 'systemctl status spiceproxy.service' and 'journalctl -xn' for details.

error: error running shared postrotate script for '/var/log/pveproxy/access.log '

run-parts: /etc/cron.daily/logrotate exited with return code 1

so when i got in i was unable to access the GUI through the master node, it's accessible from the other 3 however it's unresponsive. VMs are still running and accessible (ssh) and but I can't migrate them online do do a host reboot.

pveversion
pve-manager/4.2-4/2660193c (running kernel: 4.4.8-1-pve)

dmesg message
INFO: task pveproxy:9610 blocked for more than 120 seconds.
[1931640.280328] Tainted: P O 4.4.8-1-pve #1
[1931640.280359] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

i can't kill -9 tasks and any qm commands never finish (no error). what would be the safest way of fixing this ? log in to each vm and do a gracefull shutdown from within the vm and then reboot the host ?
Please advise

Thanks

yaboc · Jun 8, 2016

i had to shutdown the VMs on that one host and reboot it :/

plofkat · Apr 24, 2017

Same issue - pveproxy stops responding on some of my nodes, unable to stop or start it.
Anyone have any idea as to the cause of this?

dietmar · Apr 24, 2017

Maybe an unresponsive NFS server? Test with

# pvesm status

plofkat · Apr 24, 2017

I can ssh and run the command on all servers, no issues.

On the servers with the error i am however unable to access the /etc/pve/qemu-server folder.
The terminal stops responding completely.

All VMS still responsive fortunately.

I have shut down all the VMS on one of the hosts, and tried to restart the host - as of now, it has been shutting down for about 15 minutes.

dietmar · Apr 24, 2017

plofkat said:
and tried to restart the host - as of now, it has been shutting down for about 15 minutes.

This is another indication for some unresponsive storage (nfs mount?)

plofkat · Apr 24, 2017

Unless proxmox has some built in nfs mount that I am unaware of, there should not be any?

Code:

vwk-prox01:~# pvesm status
MFS          dir 1     57430241024     41122953984     16307287040 72.11%
local        dir 1        98952796        87066000         6837248 93.22%
local-lvm  lvmthin 1       361213952               0       361213952 0.50%

Code:

vwk-prox05:~# pvesm status
MFS          dir 1     57430241024     41122952768     16307288256 72.11%
local        dir 1        34703056         9697800        23219400 29.96%
local-lvm  lvmthin 1        85954560               0        85954560 0.50%

MFS is mounted with fuse, and I can access the share from all servers without any issues.

From fstab

Code:

mfsmount /mnt/mfs fuse mfsmaster=vwk-master01,_netdev 0 0

plofkat · Apr 24, 2017

The server just rebooted, however I still get the following error:

Code:

[  240.224061] INFO: task spiceproxy:1829 blocked for more than 120 seconds.
[  240.224116]       Tainted: P           O    4.4.49-1-pve #1
[  240.224160] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  240.224210] spiceproxy      D ffff880213cbbdf8     0  1829      1 0x00000004
[  240.224215]  ffff880213cbbdf8 ffff880215b1d3c0 ffff8802162d4c80 ffff880212610000
[  240.224219]  ffff880213cbc000 ffff8800d9ac43ac ffff880212610000 00000000ffffffff
[  240.224222]  ffff8800d9ac43b0 ffff880213cbbe10 ffffffff8185c215 ffff8800d9ac43a8
[  240.224225] Call Trace:
[  240.224235]  [<ffffffff8185c215>] schedule+0x35/0x80
[  240.224238]  [<ffffffff8185c4ce>] schedule_preempt_disabled+0xe/0x10
[  240.224242]  [<ffffffff8185e1c9>] __mutex_lock_slowpath+0xb9/0x130
[  240.224244]  [<ffffffff8185e25f>] mutex_lock+0x1f/0x30
[  240.224248]  [<ffffffff8121f9ea>] filename_create+0x7a/0x160
[  240.224251]  [<ffffffff81220983>] SyS_mkdir+0x53/0x100
[  240.224254]  [<ffffffff81860336>] entry_SYSCALL_64_fastpath+0x16/0x75
[  360.224061] INFO: task spiceproxy:1829 blocked for more than 120 seconds.
[  360.224115]       Tainted: P           O    4.4.49-1-pve #1
[  360.224159] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.224209] spiceproxy      D ffff880213cbbdf8     0  1829      1 0x00000004
[  360.224215]  ffff880213cbbdf8 ffff880215b1d3c0 ffff8802162d4c80 ffff880212610000
[  360.224218]  ffff880213cbc000 ffff8800d9ac43ac ffff880212610000 00000000ffffffff
[  360.224221]  ffff8800d9ac43b0 ffff880213cbbe10 ffffffff8185c215 ffff8800d9ac43a8

plofkat · Apr 24, 2017

Update on this - I can now log in to the web interface of the server I restarted, however, I am unable to start any of the VM's

plofkat · Apr 24, 2017

It seems some services are not starting

Code:

vwk-prox04:~# service pve-manager status
● pve-manager.service - PVE VM Manager
   Loaded: loaded (/lib/systemd/system/pve-manager.service; enabled)
   Active: activating (start) since Mon 2017-04-24 09:51:37 SAST; 1h 11min ago
 Main PID: 2275 (pvesh)
   CGroup: /system.slice/pve-manager.service
           └─2275 /usr/bin/perl /usr/bin/pvesh --nooutput create /nodes/localhost/startall

Apr 24 09:51:37 vwk-prox04 systemd[1]: Starting PVE VM Manager...
root@vwk-prox04:~# service pve-
pve-cluster   pve-firewall  pve-ha-crm    pve-ha-lrm    pve-manager
root@vwk-prox04:~# service pve-manager status
● pve-manager.service - PVE VM Manager
   Loaded: loaded (/lib/systemd/system/pve-manager.service; enabled)
   Active: activating (start) since Mon 2017-04-24 09:51:37 SAST; 1h 12min ago
 Main PID: 2275 (pvesh)
   CGroup: /system.slice/pve-manager.service
           └─2275 /usr/bin/perl /usr/bin/pvesh --nooutput create /nodes/localhost/startall

Apr 24 09:51:37 vwk-prox04 systemd[1]: Starting PVE VM Manager...

Code:

vwk-prox04:~# service pve-cluster status
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
   Active: active (running) since Mon 2017-04-24 09:47:02 SAST; 1h 17min ago
  Process: 1684 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
  Process: 1640 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
 Main PID: 1682 (pmxcfs)
   CGroup: /system.slice/pve-cluster.service
           └─1682 /usr/bin/pmxcfs

Apr 24 09:47:01 vwk-prox04 pmxcfs[1682]: [status] crit: can't initialize service
Apr 24 09:47:02 vwk-prox04 systemd[1]: Started The Proxmox VE cluster filesystem.
Apr 24 09:47:07 vwk-prox04 pmxcfs[1682]: [status] notice: update cluster info (cluster name  vwk00, version = 7)
Apr 24 09:47:07 vwk-prox04 pmxcfs[1682]: [status] notice: node has quorum
Apr 24 09:47:07 vwk-prox04 pmxcfs[1682]: [dcdb] notice: members: 1/1972, 2/1682, 3/1672, 4/1267, 5/1379, 6/1919, 7/2132
Apr 24 09:47:07 vwk-prox04 pmxcfs[1682]: [dcdb] notice: starting data syncronisation
Apr 24 09:47:07 vwk-prox04 pmxcfs[1682]: [dcdb] notice: received sync request (epoch 1/1972/0000002C)
Apr 24 09:47:07 vwk-prox04 pmxcfs[1682]: [status] notice: members: 1/1972, 2/1682, 3/1672, 4/1267, 5/1379, 6/1919, 7/2132
Apr 24 09:47:07 vwk-prox04 pmxcfs[1682]: [status] notice: starting data syncronisation
Apr 24 09:47:07 vwk-prox04 pmxcfs[1682]: [status] notice: received sync request (epoch 1/1972/00000028)

Code:

vwk-prox04:~# service pveproxy status
● pveproxy.service - PVE API Proxy Server
   Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled)
   Active: active (running) since Mon 2017-04-24 09:47:07 SAST; 1h 18min ago
  Process: 1798 ExecStart=/usr/bin/pveproxy start (code=exited, status=0/SUCCESS)
 Main PID: 1824 (pveproxy)
   CGroup: /system.slice/pveproxy.service
           ├─1824 pveproxy
           ├─5911 pveproxy worker
           ├─6153 pveproxy worker
           └─6645 pveproxy worker

Apr 24 10:55:15 vwk-prox04 pveproxy[1824]: starting 1 worker(s)
Apr 24 10:55:15 vwk-prox04 pveproxy[1824]: worker 6645 started
Apr 24 10:55:16 vwk-prox04 pveproxy[6153]: Clearing outdated entries from certificate cache
Apr 24 10:58:04 vwk-prox04 pveproxy[6645]: Clearing outdated entries from certificate cache
Apr 24 10:58:56 vwk-prox04 pveproxy[6645]: proxy detected vanished client connection
Apr 24 10:59:21 vwk-prox04 pveproxy[5911]: proxy detected vanished client connection
Apr 24 11:00:57 vwk-prox04 pveproxy[6153]: proxy detected vanished client connection
Apr 24 11:02:23 vwk-prox04 pveproxy[6153]: proxy detected vanished client connection
Apr 24 11:03:59 vwk-prox04 pveproxy[5911]: proxy detected vanished client connection
Apr 24 11:05:25 vwk-prox04 pveproxy[5911]: proxy detected vanished client connection

plofkat · Apr 24, 2017

Web interface entirely broken at the moment.

Random services crashing on both servers I restarted (no VM's running and unable to start any VM on them after the restart)

Code:

[  720.224373] Call Trace:

[  720.224376]  [<ffffffff8185c215>] schedule+0x35/0x80
[  720.224378]  [<ffffffff8185c4ce>] schedule_preempt_disabled+0xe/0x10
[  720.224381]  [<ffffffff8185e1c9>] __mutex_lock_slowpath+0xb9/0x130
[  720.224384]  [<ffffffff8185e25f>] mutex_lock+0x1f/0x30
[  720.224386]  [<ffffffff8121f9ea>] filename_create+0x7a/0x160
[  720.224388]  [<ffffffff81220983>] SyS_mkdir+0x53/0x100
[  720.224391]  [<ffffffff81860336>] entry_SYSCALL_64_fastpath+0x16/0x75
[  840.224071] INFO: task spiceproxy:1823 blocked for more than 120 seconds.
[  840.224142]       Tainted: P           O    4.4.49-1-pve #1
[  840.224203] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  840.224285] spiceproxy      D ffff880214863df8     0  1823      1 0x00000004
[  840.224290]  ffff880214863df8 ffff880216c51cc0 ffffffff81e12540 ffff880035f3f2c0
[  840.224294]  ffff880214864000 ffff880214ee83ac ffff880035f3f2c0 00000000ffffffff
[  840.224297]  ffff880214ee83b0 ffff880214863e10 ffffffff8185c215 ffff880214ee83a8
[  840.224300] Call Trace:
[  840.224310]  [<ffffffff8185c215>] schedule+0x35/0x80
[  840.224313]  [<ffffffff8185c4ce>] schedule_preempt_disabled+0xe/0x10
[  840.224316]  [<ffffffff8185e1c9>] __mutex_lock_slowpath+0xb9/0x130
[  840.224319]  [<ffffffff8185e25f>] mutex_lock+0x1f/0x30
[  840.224323]  [<ffffffff8121f9ea>] filename_create+0x7a/0x160
[  840.224325]  [<ffffffff81220983>] SyS_mkdir+0x53/0x100
[  840.224329]  [<ffffffff81860336>] entry_SYSCALL_64_fastpath+0x16/0x75
[  840.224333] INFO: task pvesh:2369 blocked for more than 120 seconds.
[  840.224395]       Tainted: P           O    4.4.49-1-pve #1
[  840.224456] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  840.224538] pvesh           D ffff8800daeefdf8     0  2369      1 0x00000000
[  840.224541]  ffff8800daeefdf8 ffff880216c51cc0 ffff8802162d0000 ffff8802121072c0
[  840.224544]  ffff8800daef0000 ffff880214ee83ac ffff8802121072c0 00000000ffffffff
[  840.224548]  ffff880214ee83b0 ffff8800daeefe10 ffffffff8185c215 ffff880214ee83a8
[  840.224551] Call Trace:
[  840.224553]  [<ffffffff8185c215>] schedule+0x35/0x80
[  840.224556]  [<ffffffff8185c4ce>] schedule_preempt_disabled+0xe/0x10
[  840.224559]  [<ffffffff8185e1c9>] __mutex_lock_slowpath+0xb9/0x130
[  840.224561]  [<ffffffff8185e25f>] mutex_lock+0x1f/0x30
[  840.224564]  [<ffffffff8121f9ea>] filename_create+0x7a/0x160
[  840.224566]  [<ffffffff81220983>] SyS_mkdir+0x53/0x100
[  840.224569]  [<ffffffff81860336>] entry_SYSCALL_64_fastpath+0x16/0x75

Fusl · Aug 6, 2017

I keep having this exact same issue:

Code:

[550669.885916] INFO: task pveproxy:27683 blocked for more than 120 seconds.
[550669.885964]       Tainted: P           O    4.4.67-1-pve #1
[550669.886004] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[550669.886074] pveproxy        D ffff880eb5393df8     0 27683      1 0x00000004
[550669.886079]  ffff880eb5393df8 ffff881030ca2540 ffff881038713800 ffff881037cc7000
[550669.886082]  ffff880eb5394000 ffff8810264d89ac ffff881037cc7000 00000000ffffffff
[550669.886085]  ffff8810264d89b0 ffff880eb5393e10 ffffffff818605d5 ffff8810264d89a8
[550669.886088] Call Trace:
[550669.886100]  [<ffffffff818605d5>] schedule+0x35/0x80
[550669.886104]  [<ffffffff8186088e>] schedule_preempt_disabled+0xe/0x10
[550669.886107]  [<ffffffff81862589>] __mutex_lock_slowpath+0xb9/0x130
[550669.886112]  [<ffffffff8186261f>] mutex_lock+0x1f/0x30
[550669.886117]  [<ffffffff8121fd0a>] filename_create+0x7a/0x160
[550669.886122]  [<ffffffff81220ca3>] SyS_mkdir+0x53/0x100
[550669.886128]  [<ffffffff818646f6>] entry_SYSCALL_64_fastpath+0x16/0x75

After some debugging I found out that the other nodes in the cluster have actually marked this node as offline. Running "pvesm status", "pvecm status" or "corosync-quorumtool" just freezes. Running "corosync-quorumtool" with strace actually shows that it tries to connect to the corosync/cmap node where it waits forever and eventually turns into yet another process stuck in D (uninterruptable) state:

Code:

connect(3, {sa_family=AF_LOCAL, sun_path=@"cmap"}, 110) = 0
setsockopt(3, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
sendto(3, "\377\377\377\377\0\0\0\0\30\0\0\0\0\0\0\0\0\0\20\0\0\0\0\0", 24, MSG_NOSIGNAL, NULL, 0) = 24
poll([{fd=3, events=POLLIN}], 1, 4294967295

I then straced the actual corosync/cmap process which spammed this onto my terminal (tens of thousands of lines per second):

Code:

[pid  3249] write(6, "\17\0\0\0", 4)    = -1 EAGAIN (Resource temporarily unavailable)
[pid  3249] write(6, "\17\0\0\0", 4)    = -1 EAGAIN (Resource temporarily unavailable)
[pid  3249] write(6, "\17\0\0\0", 4)    = -1 EAGAIN (Resource temporarily unavailable)
[pid  3249] write(6, "\17\0\0\0", 4)    = -1 EAGAIN (Resource temporarily unavailable)
[pid  3249] write(6, "\17\0\0\0", 4)    = -1 EAGAIN (Resource temporarily unavailable)
[...]

Killing corosync (pkill -9 corosync) and then restarting it with "systemctl restart corosync.service" immediately made the other terminal windows running "pvecm status", "pvesm status" and "corosync-quorumtool" return a text again. I then was able to issue "systemctl restart pveproxy.service" to restart the pveproxy service and voila, everything was back to normal.

plofkat · Aug 25, 2017

This issue seems to be caused by corosync using the same physical network as the shared storage.
If latency on the network exceeds 2ms corosync throws a fit.

Splitting the two networks over different switches and interfaces seems to resolve this completely.

Search

Search

INFO: task pveproxy: blocked for more than 120 seconds

yaboc

Renowned Member

yaboc

Renowned Member

plofkat

Active Member

dietmar

Proxmox Staff Member

plofkat

Active Member

dietmar

Proxmox Staff Member

plofkat

Active Member

Attachments

plofkat

Active Member

plofkat

Active Member

plofkat

Active Member

plofkat

Active Member

Fusl

Renowned Member

plofkat

Active Member

We value your privacy