task blocked for more than 120 seconds after upgrade to kernel 2.6.32.39

Fleg

New Member
Mar 9, 2012
Hello,
One of my Proxmox servers became unresponsive after a kernel update from 2.6.32.34 to 2.6.32.39.
Basically, it works correctly for about 24 hours after a reboot, then becomes excessively slow (~2-3 minutes for a simple ls), and
the logs contain entries like:

Code:
Sep  7 06:30:04 lpnhevictor kernel: INFO: task gpg:520457 blocked for more than 120 seconds.
Sep  7 06:30:04 lpnhevictor kernel:      Not tainted 2.6.32-40-pve #1
Sep  7 06:30:04 lpnhevictor kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  7 06:30:04 lpnhevictor kernel: gpg           D ffff880825a1c7c0     0 520457 520150    0 0x00000000
Sep  7 06:30:04 lpnhevictor kernel: ffff8808258e5d58 0000000000000086 0000000000000000 ffff8808609dde00
Sep  7 06:30:04 lpnhevictor kernel: 0000000000000003 ffff880049c5de00 ffff88083c7fc348 ffff88083c75e810
Sep  7 06:30:04 lpnhevictor kernel: 0001ca380b804814 ffffffff8105825f 000000011e0a7c58 0000000000001a26
Sep  7 06:30:04 lpnhevictor kernel: Call Trace:
Sep  7 06:30:04 lpnhevictor kernel: [<ffffffff8105825f>] ? __dequeue_entity+0x2f/0x50
Sep  7 06:30:04 lpnhevictor kernel: [<ffffffff815625f4>] schedule_timeout+0x204/0x300
Sep  7 06:30:04 lpnhevictor kernel: [<ffffffff81058153>] ? __wake_up+0x53/0x70
Sep  7 06:30:04 lpnhevictor kernel: [<ffffffff81561e37>] wait_for_completion+0xd7/0x110
Sep  7 06:30:04 lpnhevictor kernel: [<ffffffff8106da50>] ? default_wake_function+0x0/0x20
Sep  7 06:30:04 lpnhevictor kernel: [<ffffffff81152f20>] ? lru_add_drain_per_cpu+0x0/0x10
Sep  7 06:30:04 lpnhevictor kernel: [<ffffffff810a00d6>] flush_work+0x76/0xc0
Sep  7 06:30:04 lpnhevictor kernel: [<ffffffff8109e3f0>] ? wq_barrier_func+0x0/0x20
Sep  7 06:30:04 lpnhevictor kernel: [<ffffffff810a0303>] schedule_on_each_cpu+0x103/0x160
Sep  7 06:30:04 lpnhevictor kernel: [<ffffffff81152495>] lru_add_drain_all+0x15/0x20
Sep  7 06:30:04 lpnhevictor kernel: [<ffffffff81171391>] __mlock+0x41/0x110
Sep  7 06:30:04 lpnhevictor kernel: [<ffffffff81171473>] sys_mlock+0x13/0x20
Sep  7 06:30:04 lpnhevictor kernel: [<ffffffff8100b162>] system_call_fastpath+0x16/0x1b
Sep  7 06:31:40 lpnhevictor pmxcfs[3271]: [status] notice: received log
Sep  7 06:32:04 lpnhevictor kernel: INFO: task gpg:520457 blocked for more than 120 seconds.
Sep  7 06:32:04 lpnhevictor kernel:      Not tainted 2.6.32-40-pve #1
...
Sep  7 17:00:04 lpnhevictor kernel: INFO: task bash:691295 blocked for more than 120 seconds.
Sep  7 17:00:04 lpnhevictor kernel:      Not tainted 2.6.32-40-pve #1
Sep  7 17:00:04 lpnhevictor kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  7 17:00:04 lpnhevictor kernel: bash          D ffff880826198880     0 691295 691294    0 0x00000004
Sep  7 17:00:04 lpnhevictor kernel: ffff880838465bf8 0000000000000086 0000000000000000 ffff88083fc22800
Sep  7 17:00:04 lpnhevictor kernel: 0000000000000000 0000000000000010 ffff880838465c28 ffffffff8105ca23
Sep  7 17:00:04 lpnhevictor kernel: 0001ec935695e36d ffff88083c75ec00 00000001204b8046 0000000000000ea1
Sep  7 17:00:04 lpnhevictor kernel: Call Trace:
Sep  7 17:00:04 lpnhevictor kernel: [<ffffffff8105ca23>] ? perf_event_task_sched_out+0x33/0x70
Sep  7 17:00:04 lpnhevictor kernel: [<ffffffff815625f4>] schedule_timeout+0x204/0x300
Sep  7 17:00:04 lpnhevictor kernel: [<ffffffff81058153>] ? __wake_up+0x53/0x70
Sep  7 17:00:04 lpnhevictor kernel: [<ffffffff81561e37>] wait_for_completion+0xd7/0x110
Sep  7 17:00:04 lpnhevictor kernel: [<ffffffff8106da50>] ? default_wake_function+0x0/0x20
Sep  7 17:00:04 lpnhevictor kernel: [<ffffffff810a00d6>] flush_work+0x76/0xc0
Sep  7 17:00:04 lpnhevictor kernel: [<ffffffff8109e3f0>] ? wq_barrier_func+0x0/0x20
Sep  7 17:00:04 lpnhevictor kernel: [<ffffffff810a0172>] flush_delayed_work+0x52/0x70
Sep  7 17:00:04 lpnhevictor kernel: [<ffffffff81344775>] tty_flush_to_ldisc+0x15/0x20
Sep  7 17:00:04 lpnhevictor kernel: [<ffffffff81340e4c>] n_tty_read+0x1dc/0x960
Sep  7 17:00:04 lpnhevictor kernel: [<ffffffff810a6a3c>] ? remove_wait_queue+0x3c/0x50
Sep  7 17:00:04 lpnhevictor kernel: [<ffffffff8106da50>] ? default_wake_function+0x0/0x20
Sep  7 17:00:04 lpnhevictor kernel: [<ffffffff8133b692>] tty_read+0x92/0xf0
Sep  7 17:00:04 lpnhevictor kernel: [<ffffffff811ae18e>] vfs_read+0x9e/0x190
Sep  7 17:00:04 lpnhevictor kernel: [<ffffffff811ae2ca>] sys_read+0x4a/0x90
Sep  7 17:00:04 lpnhevictor kernel: [<ffffffff810957c4>] ? sys_rt_sigprocmask+0xa4/0x120
Sep  7 17:00:04 lpnhevictor kernel: [<ffffffff8100b162>] system_call_fastpath+0x16/0x1b
Sep  7 17:01:57 lpnhevictor pmxcfs[3271]: [status] notice: received log
Sep  7 17:02:04 lpnhevictor kernel: INFO: task bash:691295 blocked for more than 120 seconds.

I tried changing vm.dirty_ratio to 5 and disabled all BIOS energy-saving options, but nothing changed.
In the meantime the 2.6.32.40 kernel was released, so I tried upgrading, but that didn't fix the problem either.
I rebooted into the 2.6.32.34 kernel and it works.
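For anyone wanting to reproduce the tweaks mentioned above, here is a rough sketch (readable without root; the write commands are shown as comments since they need root, and the paths are the standard procfs knobs):

```shell
# Current values (readable without root)
cat /proc/sys/vm/dirty_ratio
cat /proc/sys/kernel/hung_task_timeout_secs 2>/dev/null || echo "hung_task knob not available on this kernel"

# To apply the tweak tried above (as root):
#   sysctl -w vm.dirty_ratio=5
#   echo 'vm.dirty_ratio = 5' >> /etc/sysctl.conf
#
# Note: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" only silences
# the 120-second warning; it does not fix the underlying hang.
```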

The server is an HP ProLiant BL460c Gen8 (Intel Xeon, HP Smart Array P220i controller) running pve-manager 3.4-8.

I have another HP server in the same enclosure (a ProLiant BL465c G7, but with AMD Opteron processors and a Smart Array P711m controller) running the same Proxmox versions without any problem.

Any suggestions?
Thanks.
 
An update to kernel 2.6.32.40 and pve-manager 3.4-9 did not solve the problem!
Has anyone had the same issue?
Does anyone use the same type of server without any problem?
 
I tried removing the node from the cluster and reinstalled it from scratch (from wheezy) as a standalone node (not integrated into the cluster)... and the problem is still the same!
Has nobody had this problem?
I am desperate!
 

What does the output of lsmod give you?
 
# lsmod
Module                  Size  Used by
ipmi_devintf 7749 0
ip_set 30985 0
vzethdev 8245 0
vznetdev 19296 0
pio_nfs 20010 0
pio_direct 30154 0
pfmt_raw 3205 0
pfmt_ploop1 6671 0
ploop 119941 4 pfmt_ploop1,pfmt_raw,pio_direct,pio_nfs
simfs 4964 0
vzrst 202733 0
nf_nat 23685 1 vzrst
nf_conntrack_ipv4 9970 2 nf_nat
nf_defrag_ipv4 1523 1 nf_conntrack_ipv4
vzcpt 154793 1 vzrst
nf_conntrack 82058 4 vzcpt,nf_conntrack_ipv4,nf_nat,vzrst
vzdquota 56321 0
vzmon 25647 3 vzcpt,vzrst,vznetdev
vzdev 2757 4 vzmon,vzdquota,vznetdev,vzethdev
ip6t_REJECT 4671 0
ip6table_mangle 3661 0
ip6table_filter 3025 0
ip6_tables 18984 2 ip6table_filter,ip6table_mangle
xt_length 1330 0
xt_hl 1539 0
xt_tcpmss 1583 0
xt_TCPMSS 3549 0
iptable_mangle 3485 0
iptable_filter 2929 1
xt_multiport 2676 1
xt_limit 2185 0
xt_dscp 2097 0
ipt_REJECT 2423 0
ip_tables 18124 2 iptable_filter,iptable_mangle
vhost_net 31189 0
tun 18110 1 vhost_net
macvtap 10322 1 vhost_net
macvlan 9758 1 macvtap
nfnetlink_log 8661 1
kvm_intel 54717 0
nfnetlink 4427 3 nfnetlink_log,ip_set
kvm 341966 1 kvm_intel
vzevent 2170 1
ib_iser 42209 0
rdma_cm 37116 1 ib_iser
iw_cm 8665 1 rdma_cm
ib_cm 37320 1 rdma_cm
ib_sa 24369 2 ib_cm,rdma_cm
ib_mad 39879 2 ib_sa,ib_cm
ib_core 81369 6 ib_mad,ib_sa,ib_cm,iw_cm,rdma_cm,ib_iser
ib_addr 8662 2 ib_core,rdma_cm
iscsi_tcp 9927 0
libiscsi_tcp 17212 1 iscsi_tcp
nfsd 319713 2
nfs 448218 3 vzcpt,vzrst,pio_nfs
nfs_acl 2655 2 nfs,nfsd
auth_rpcgss 46382 2 nfs,nfsd
fscache 55833 1 nfs
lockd 77273 3 nfs,nfsd,vzrst
sunrpc 272643 7 lockd,auth_rpcgss,nfs_acl,nfs,nfsd,pio_nfs
ipv6 341203 167 ib_addr,ip6table_mangle,ip6t_REJECT,vzcpt,vzrst
fuse 100558 3
power_meter 9054 0
acpi_ipmi 3746 1 power_meter
snd_pcsp 8742 0
iTCO_wdt 7003 0
iTCO_vendor_support 3064 1 iTCO_wdt
snd_pcm 88275 1 snd_pcsp
snd_page_alloc 8864 1 snd_pcm
snd_timer 22357 1 snd_pcm
hpilo 7535 0
snd 71725 3 snd_timer,snd_pcm,snd_pcsp
soundcore 8009 1 snd
serio_raw 4656 0
lpc_ich 13095 0
hpwdt 7009 0
ipmi_si 45406 2 acpi_ipmi
video 21007 0
ioatdma 54372 80
dca 7037 1 ioatdma
mfd_core 1871 1 lpc_ich
output 2417 1 video
ipmi_msghandler 37752 3 ipmi_si,acpi_ipmi,ipmi_devintf
shpchp 29162 0
ext4 426383 1
jbd2 93644 1 ext4
mbcache 8169 1 ext4
sg 29475 0
be2iscsi 103570 0
iscsi_boot_sysfs 9466 1 be2iscsi
libiscsi 49243 4 be2iscsi,libiscsi_tcp,iscsi_tcp,ib_iser
hpsa 80789 2
scsi_transport_iscsi 102000 5 libiscsi,be2iscsi,iscsi_tcp,ib_iser
be2net 103244 0
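(For what it's worth, the storage and NIC drivers in that list can be pinned down without HP tooling; hpsa is the Smart Array driver and be2net the Emulex 10 GbE NIC driver on this blade. A quick sketch, assuming standard sysfs paths:)

```shell
# Map each PCI device to the kernel driver bound to it (no extra tools needed)
for dev in /sys/bus/pci/devices/*; do
    drv=$(basename "$(readlink -f "$dev/driver" 2>/dev/null)")
    printf '%s %s\n' "$(basename "$dev")" "${drv:-none}"
done

# Driver versions, e.g. for the Smart Array and Emulex drivers seen above
modinfo -F version hpsa 2>/dev/null || true
modinfo -F version be2net 2>/dev/null || true
```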
 
My HP units had Broadcom NICs that gave me all kinds of grief for about eight months before I finally caved in and replaced them; I was looking for that driver.

With that said, all I can think of is firmware/BIOS updates.

I might be mistaken, as I only have a few years of Linux under my belt, but unless you have compiled and loaded your own drivers, you are using the drivers that ship with the Proxmox kernel.
 
Ronis, I didn't try the 3.10 kernel because all eight servers of my cluster (in production) run 2.6.32, and I want to keep them homogeneous.
 
I finally got the firmware updates from HP (such a pain in the ass to get!).
I upgraded, and it seems to have solved the problem (almost 48 hours up and no freeze so far with the 2.6.32-41-pve kernel).
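(Since the fix turned out to be firmware, a quick way to record the BIOS level before and after flashing is the DMI data the kernel exposes; this is just a sketch using the standard sysfs DMI entries, readable without root:)

```shell
# BIOS vendor/version/date as the kernel reads them from DMI
for f in bios_vendor bios_version bios_date; do
    cat "/sys/class/dmi/id/$f" 2>/dev/null || echo "$f: not exposed"
done
# "dmidecode -t bios" (as root) gives the full picture on HP kit
```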
 

Got the same problem:

Code:
Oct 7 12:27:39 vps2 kernel: [76440.224347] systemd D ffff88012fc96a00 0 1 0 0x00000000
Oct 7 12:27:39 vps2 kernel: [76440.224357] ffff88012fc16a00 ffff88012abbc000 0000000000000001 7fffffffffffffff
Oct 7 12:27:39 vps2 kernel: [76440.224364] Call Trace:
Oct 7 12:27:39 vps2 kernel: [76440.224378] [<ffffffff817cec21>] schedule_timeout+0x201/0x2a0
Oct 7 12:27:39 vps2 kernel: [76440.224387] [<ffffffff810a034c>] ? try_to_wake_up+0x20c/0x340
Oct 7 12:27:39 vps2 kernel: [76440.224394] [<ffffffff817cf414>] ldsem_down_write+0xd4/0x1af
Oct 7 12:27:39 vps2 kernel: [76440.224402] [<ffffffff814a280e>] tty_ldisc_hangup+0xbe/0x200
Oct 7 12:27:39 vps2 kernel: [76440.224405] [<ffffffff81499e65>] __tty_hangup+0x285/0x410
Oct 7 12:27:39 vps2 kernel: [76440.224408] [<ffffffff8149b071>] tty_ioctl+0x901/0xc10
Oct 7 12:27:39 vps2 kernel: [76440.224412] [<ffffffff811fd666>] ? getname_flags+0x56/0x1f0
Oct 7 12:27:39 vps2 kernel: [76440.224415] [<ffffffff8120171a>] do_vfs_ioctl+0x2ba/0x490
Oct 7 12:27:39 vps2 kernel: [76440.224417] [<ffffffff811fd60b>] ? putname+0x5b/0x60
Oct 7 12:27:39 vps2 kernel: [76440.224422] [<ffffffff811ed609>] ? do_sys_open+0x1b9/0x250
Oct 7 12:27:39 vps2 kernel: [76440.224424] [<ffffffff81201969>] SyS_ioctl+0x79/0x90
Oct 7 12:27:39 vps2 kernel: [76440.224427] [<ffffffff817cfaf2>] entry_SYSCALL_64_fastpath+0x16/0x75
Oct 7 12:27:39 vps2 kernel: [76440.224547] INFO: task getty:2570 blocked for more than 120 seconds.
Oct 7 12:27:39 vps2 kernel: [76440.224649] Tainted: P O 4.2.1-1-pve #1
Oct 7 12:27:39 vps2 kernel: [76440.224735] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 7 12:27:39 vps2 kernel: [76440.224842] getty D ffff88012fc16a00 0 2570 9015 0x20020106
Oct 7 12:27:39 vps2 kernel: [76440.224846] ffff880113d57ae8 0000000000000046 ffffffff81c14580 ffff88005f73d780
Oct 7 12:27:39 vps2 kernel: [76440.224849] ffff880100050008 ffff880113d58000 ffff8800c4ad6c8c ffff88005f73d780
Oct 7 12:27:39 vps2 kernel: [76440.224852] 00000000ffffffff ffff8800c4ad6c90 ffff880113d57b08 ffffffff817cbde7
Oct 7 12:27:39 vps2 kernel: [76440.224855] Call Trace:
Oct 7 12:27:39 vps2 kernel: [76440.224859] [<ffffffff817cbde7>] schedule+0x37/0x80
Oct 7 12:27:39 vps2 kernel: [76440.224865] [<ffffffff817cdb03>] __mutex_lock_slowpath+0x93/0x110
Oct 7 12:27:39 vps2 kernel: [76440.224870] [<ffffffff817cf95c>] tty_lock+0x3c/0x90
Oct 7 12:27:39 vps2 kernel: [76440.224876] [<ffffffff811f00d4>] __fput+0xe4/0x220
Oct 7 12:27:39 vps2 kernel: [76440.224882] [<ffffffff81093bfb>] task_work_run+0x9b/0xb0
Oct 7 12:27:39 vps2 kernel: [76440.224889] [<ffffffff8107a597>] do_group_exit+0x47/0xc0
Oct 7 12:27:39 vps2 kernel: [76440.224898] [<ffffffff81013448>] do_signal+0x28/0xb40
Oct 7 12:27:39 vps2 kernel: [76440.224905] [<ffffffff814a1f06>] ? tty_ldisc_deref+0x16/0x20
Oct 7 12:27:39 vps2 kernel: [76440.224911] [<ffffffff811eddb8>] ? __vfs_read+0x18/0x40
Oct 7 12:27:39 vps2 kernel: [76440.224916] [<ffffffff81013fc2>] do_notify_resume+0x62/0x70
Oct 7 12:29:39 vps2 kernel: [76560.224037] INFO: task systemd:1 blocked for more than 120 seconds.

What to check?
 
The only "change" I noticed when the problem occurred was that the load of the server went from nearly 0 to 1... but no process seemed to be eating the CPU or the memory (top showed nothing abnormal). Moreover, there was nothing in the logs indicating that a process had started just before.
As it seems to be solved by the firmware upgrade, my guess is that some instructions used by the newer kernels were not correctly handled by the old firmware and produced a kind of "hardware hang". My knowledge of the kernel changes between versions is too limited to have a clear idea of what was causing the problem, so I can only suggest that you upgrade all the firmware on your server and see if that fixes it.
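(The load jumping to 1 while top shows nothing is consistent with a task stuck in uninterruptible sleep, state D, which is exactly what the hung-task warnings report: D-state tasks count toward the load average without using any CPU. A quick way to spot them, sketched with standard procps ps:)

```shell
# List processes in uninterruptible sleep (state D); these inflate the
# load average without consuming CPU or memory
ps -eo state,pid,wchan:32,comm | awk '$1 == "D"'

# Their kernel stacks (as root) show what they are blocked on
for pid in $(ps -eo state,pid | awk '$1 == "D" {print $2}'); do
    echo "== $pid =="
    cat "/proc/$pid/stack" 2>/dev/null || true
done
```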
 
Got the same problem:
...

Oct 7 12:27:39 vps2 kernel: [76440.224649] Tainted: P O 4.2.1-1-pve #1
...

This thread is about kernel 2.6.32-39 (see topic), but you are running a 4.0 beta kernel.

Update to stable 4.0; if you have further issues, please open a new thread.
 
