VM blocked due to hung_task_timeout_secs

Discussion in 'Proxmox VE: Installation and configuration' started by Tom91, Jun 17, 2015.

  1. Tom91

    Tom91 New Member

    Joined:
    Jul 21, 2014
    Messages:
    9
    Likes Received:
    0
    Hi,

I need your advice because we have a problem for which we haven't found a solution.

For an unknown reason, some VMs become unavailable: the load on the VM grows to the point where it's impossible to do anything, and the only way to unblock the situation is to reboot the VM.
As all our VMs are monitored with Centreon, I have some graphs.

I already found some threads about a similar problem, but no solution:
    https://forum.proxmox.com/threads/12982-qemu-nbd-bug?highlight=kworker
    https://forum.proxmox.com/threads/21354-Why-are-my-VMs-dying-with-quot-hung_task_timeout_secs-quot

There are 4 servers configured as a cluster. Each server has the same configuration.

    Code:
    $ pveversion  -v
    proxmox-ve-2.6.32: 3.4-156 (running kernel: 3.10.0-5-pve)
    pve-manager: 3.4-6 (running version: 3.4-6/102d4547)
    pve-kernel-2.6.32-39-pve: 2.6.32-156
    pve-kernel-3.10.0-5-pve: 3.10.0-19
    pve-kernel-2.6.32-34-pve: 2.6.32-140
    lvm2: 2.02.98-pve4
    clvm: 2.02.98-pve4
    corosync-pve: 1.4.7-1
    openais-pve: 1.1.4-3
    libqb0: 0.11.1-2
    redhat-cluster-pve: 3.2.0-2
    resource-agents-pve: 3.9.2-4
    fence-agents-pve: 4.0.10-2
    pve-cluster: 3.0-17
    qemu-server: 3.4-6
    pve-firmware: 1.1-4
    libpve-common-perl: 3.0-24
    libpve-access-control: 3.0-16
    libpve-storage-perl: 3.0-33
    pve-libspice-server1: 0.12.4-3
    vncterm: 1.1-8
    vzctl: 4.0-1pve6
    vzprocps: 2.0.11-2
    vzquota: 3.1-2
    pve-qemu-kvm: 2.2-10
    ksm-control-daemon: 1.1-1
    glusterfs-client: 3.5.2-1
There is no load on the host when the problem occurs on a VM.

    Code:
    top - 10:58:04 up 25 days, 22:35,  1 user,  load average: 0.19, 0.21, 0.23
    Tasks: 616 total,   1 running, 615 sleeping,   0 stopped,   0 zombie
    %Cpu(s):  0.3 us,  0.1 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    MiB Mem:    257927 total,    31034 used,   226893 free,        0 buffers
    MiB Swap:      243 total,        0 used,      243 free,     3618 cached
    
      PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND                                                                                                                                                                          
     4846 root      20   0 32.5g 2.0g 4024 S     3  0.8 866:49.94 kvm                                                                                                                                                                              
     4557 root      20   0 8737m 2.8g 4008 S     3  1.1 827:31.35 kvm                                                                                                                                                                              
     4963 root      20   0 32.6g 1.6g 4024 S     2  0.6 951:55.92 kvm                                                                                                                                                                              
    53389 root      20   0 4645m 1.3g 4024 S     2  0.5 860:33.95 kvm                                                                                                                                                                              
     4897 root      20   0 32.5g 1.6g 4148 S     2  0.6 917:29.60 kvm                                                                                                                                                                              
    31560 root      20   0  287m  61m 6232 S     2  0.0   0:02.64 pvedaemon worke                                                                                                                                                                  
     4704 root      20   0 4616m 1.2g 3988 S     2  0.5 991:31.37 kvm                                                                                                                                                                              
     4313 root      20   0 4616m 1.5g 3984 S     1  0.6 534:31.11 kvm                                                                                                                                                                              
     4462 root      20   0 4616m 1.3g 4108 S     1  0.5 949:10.53 kvm                                                                                                                                                                              
     4607 root      20   0 4617m 1.4g 3984 S     1  0.5 500:35.04 kvm                                                                                                                                                                              
     4732 root      20   0 4637m 1.2g 4000 S     1  0.5 724:27.02 kvm                                                                                                                                                                              
    29185 www-data  20   0  285m  61m 5160 S     1  0.0   0:03.53 pveproxy worker                                                                                                                                                                  
     3103 root      20   0  353m  54m  33m S     1  0.0  24:57.06 pmxcfs                                                                                                                                                                           
     3905 root      20   0 4616m 1.8g 3988 S     1  0.7 737:33.64 kvm                                                                                                                                                                              
        3 root      20   0     0    0    0 S     0  0.0  17:21.56 ksoftirqd/0                                                                                                                                                                      
     2183 root      20   0     0    0    0 S     0  0.0  12:47.38 xfsaild/dm-7                                                                                                                                                                     
     2982 root      20   0  371m 2328  888 S     0  0.0  18:25.42 rrdcached                                                                                                                                                                        
     3308 root       0 -20  201m  66m  42m S     0  0.0  64:04.29 corosync                                                                                                                                                                         
     4594 root      20   0     0    0    0 S     0  0.0   6:12.00 vhost-4557                                                                                                                                                                       
     5264 root      20   0 4630m 1.4g 3996 S     0  0.6 959:41.76 kvm                                                                                                                                                                              
     9205 root      20   0 4546m 4.1g 4148 S     0  1.6  28:06.42 kvm                                                                                                                                                                              
    16144 root      20   0     0    0    0 S     0  0.0   0:44.51 kworker/6:1                                                                                                                                                                      
    20152 root      20   0 4617m 1.1g 3984 S     0  0.4 298:02.00 kvm
    Code:
    $ sudo pveperf 
    CPU BOGOMIPS:      175688.20
    REGEX/SECOND:      989918
    HD SIZE:           0.95 GB (/dev/mapper/vg01-root)
    BUFFERED READS:    325.33 MB/sec
    AVERAGE SEEK TIME: 0.04 ms
    FSYNCS/SECOND:     13047.64
    DNS EXT:           29.53 ms
    Code:
    $ sudo pveperf /srv/vms/
    CPU BOGOMIPS:      175688.20
    REGEX/SECOND:      1046214
    HD SIZE:           199.90 GB (/dev/mapper/vg01-vms)
    BUFFERED READS:    445.96 MB/sec
    AVERAGE SEEK TIME: 0.19 ms
    FSYNCS/SECOND:     103.74
    DNS EXT:           31.61 ms
    Code:
    $ sudo qm list | grep -v stopped
          VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID       
           102 server-102     running    4096              10.00 3905      
           103 server-103     running    4096              10.00 4313      
           104 server-104     running    4096              10.00 20152     
           105 server-105     running    4096              10.00 4462      
           106 server-106     running    4096              10.00 9205      
           107 server-107     running    8192              10.00 4557      
           108 server-108     running    4096              10.00 4607      
           109 server-109     running    4096              10.00 4704      
           110 server-110     running    4096              10.00 4732      
           111 server-111     running    4096              10.00 53389     
           112 server-112     running    32768             10.00 4846      
           113 server-113     running    32768             10.00 4897      
           114 server-114     running    32768             10.00 4963      
           137 server-137     running    4096              10.00 5264
    Code:
    $ cat /proc/net/bonding/bond0 
    Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
    
    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: None
    Currently Active Slave: eth0
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 0
    Down Delay (ms): 0
    
    Slave Interface: eth0
    MII Status: up
    Speed: 1000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: a0:d3:c1:fc:c3:50
    Slave queue ID: 0
    
    Slave Interface: eth1
    MII Status: up
    Speed: 1000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: a0:d3:c1:fc:c3:51
    Slave queue ID: 0
    Code:
    $ cat /etc/pve/qemu-server/106.conf 
    bootdisk: virtio0
    cores: 1
    ide2: none,media=cdrom
    memory: 4096
    name: server-106
    net0: virtio=72:33:AA:5F:BB:11,bridge=vmbr0,tag=226
    onboot: 1
    ostype: l26
    smbios1: uuid=071915f2-544c-49ba-a3c3-f3c52f5188d4
    sockets: 1
    virtio0: vms:106/vm-106-disk-1.qcow2,format=qcow2,size=10G
Some information about a VM just after the problem has occurred.

    Code:
    $ uname -a
    Linux ######### 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u1 x86_64 GNU/Linux
I found some kernel errors in the syslog of the VM:

    Code:
    Jun 11 21:51:49 ######### kernel: [1762080.352106] INFO: task kworker/0:3:6362 blocked for more than 120 seconds.
    Jun 11 21:51:49 ######### kernel: [1762080.352807] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Jun 11 21:51:49 ######### kernel: [1762080.353202] kworker/0:3     D ffff88013fc13780     0  6362      2 0x00000000
    Jun 11 21:51:49 ######### kernel: [1762080.353205]  ffff880137a147c0 0000000000000046 0000000100000008 ffff880136fd80c0
    Jun 11 21:51:49 ######### kernel: [1762080.353208]  0000000000013780 ffff880120f31fd8 ffff880120f31fd8 ffff880137a147c0
    Jun 11 21:51:49 ######### kernel: [1762080.353210]  ffff88013a340200 ffffffff8107116d 0000000000000202 ffff880139e0edc0
    Jun 11 21:51:49 ######### kernel: [1762080.353213] Call Trace:
    Jun 11 21:51:49 ######### kernel: [1762080.353219]  [<ffffffff8107116d>] ? arch_local_irq_save+0x11/0x17
    Jun 11 21:51:49 ######### kernel: [1762080.353255]  [<ffffffffa0150a8f>] ? xlog_wait+0x51/0x67 [xfs]
    Jun 11 21:51:49 ######### kernel: [1762080.353258]  [<ffffffff8103f6e2>] ? try_to_wake_up+0x197/0x197
    Jun 11 21:51:49 ######### kernel: [1762080.353268]  [<ffffffffa015322c>] ? _xfs_log_force_lsn+0x1cd/0x205 [xfs]
    Jun 11 21:51:49 ######### kernel: [1762080.353277]  [<ffffffffa0150502>] ? xfs_trans_commit+0x10a/0x205 [xfs]
    Jun 11 21:51:49 ######### kernel: [1762080.353285]  [<ffffffffa011d7d4>] ? xfs_sync_worker+0x3a/0x6a [xfs]
    Jun 11 21:51:49 ######### kernel: [1762080.353288]  [<ffffffff8105b5f7>] ? process_one_work+0x161/0x269
    Jun 11 21:51:49 ######### kernel: [1762080.353290]  [<ffffffff8105aba3>] ? cwq_activate_delayed_work+0x3c/0x48
    Jun 11 21:51:49 ######### kernel: [1762080.353292]  [<ffffffff8105c5c0>] ? worker_thread+0xc2/0x145
    Jun 11 21:51:49 ######### kernel: [1762080.353294]  [<ffffffff8105c4fe>] ? manage_workers.isra.25+0x15b/0x15b
    Jun 11 21:51:49 ######### kernel: [1762080.353296]  [<ffffffff8105f701>] ? kthread+0x76/0x7e
    Jun 11 21:51:49 ######### kernel: [1762080.353299]  [<ffffffff813575b4>] ? kernel_thread_helper+0x4/0x10
    Jun 11 21:51:49 ######### kernel: [1762080.353302]  [<ffffffff8105f68b>] ? kthread_worker_fn+0x139/0x139
    Jun 11 21:51:49 ######### kernel: [1762080.353304]  [<ffffffff813575b0>] ? gs_change+0x13/0x13
Some processes are blocked:

    Code:
    ps auxf | grep -E ' [DR]'
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    root 2787 0.0 0.0 0 0 ? D May22 0:07 \_ [flush-253:4]
    root 6362 0.0 0.0 0 0 ? D Jun11 0:00 \_ [kworker/0:3]
    root 7257 0.0 0.0 0 0 ? D Jun11 0:00 \_ [kworker/0:2]
    root 8036 0.0 0.0 0 0 ? D Jun11 0:00 \_ [kworker/0:1]
    root 8262 0.0 0.0 0 0 ? D Jun11 0:00 \_ [kworker/0:4]
    root 8263 0.0 0.0 0 0 ? D Jun11 0:00 \_ [kworker/0:5]
    root 32191 0.0 0.0 0 0 ? D Jun12 0:12 \_ [kworker/0:8]
    root 26983 0.0 0.0 0 0 ? D Jun12 1:24 \_ [kworker/0:9]
(… and some other processes …)
There is no I/O wait:

    Code:
    vmstat 1 10
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
    r b swpd free buff cache si so bi bo in cs us sy id wa
    0 0 0 3078348 2708 462720 0 0 1 89 15 18 0 0 99 0
    0 0 0 3078340 2708 462720 0 0 0 0 289 718 0 0 100 0
    0 0 0 3078340 2708 462720 0 0 0 0 289 722 0 0 100 0
    0 0 0 3078340 2708 462720 0 0 0 0 291 730 0 0 100 0
Thanks for your advice.
     
    #1 Tom91, Jun 17, 2015
    Last edited: Jun 17, 2015
  2. Tom91

    Tom91 New Member

    Joined:
    Jul 21, 2014
    Messages:
    9
    Likes Received:
    0
Re: The load on the VMs grows to the point where they become unavailable

My message became unreadable after I posted it. I don't know why?!

    [EDIT]

I changed the "Message Editor Interface" setting and it seems to work.
     
    #2 Tom91, Jun 17, 2015
    Last edited: Jun 17, 2015
  3. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,831
    Likes Received:
    158
Hi,
does the same happen if you change the disk format to raw?

What kind of storage / underlying filesystem are you using?

    Udo
     
  4. Tom91

    Tom91 New Member

    Joined:
    Jul 21, 2014
    Messages:
    9
    Likes Received:
    0
    Hi,

    Thanks for your reply.

The host server is an HP ProLiant with SSDs behind a P420i controller.
The host operating system is Debian 7.8.

The problem doesn't appear all the time, nor on all servers. I don't understand which parameters affect the storage of the VMs.

I will try a test with the disk format set to raw.
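
Roughly, I expect the conversion to look something like this (just a sketch; it assumes the "vms" directory storage is mounted under /srv/vms with the usual images/<vmid>/ layout, and that the VM is powered off first):

Code:
# sketch only: convert the qcow2 image of VM 106 to raw
qm stop 106
cd /srv/vms/images/106
qemu-img convert -p -f qcow2 -O raw vm-106-disk-1.qcow2 vm-106-disk-1.raw
# then point /etc/pve/qemu-server/106.conf at the raw file, e.g.:
#   virtio0: vms:106/vm-106-disk-1.raw,format=raw,size=10G
qm start 106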

    Thomas
     
  5. uFx

    uFx New Member

    Joined:
    Jun 19, 2015
    Messages:
    8
    Likes Received:
    0
We have the same problem with a few VMs on Proxmox 3.4. We are using local storage on an Areca RAID set. We also back up to NFS, and the "blocked for more than 120 seconds" problems occur mostly during a backup. However, we have also had problems with a few VMs when no backup was running and there was no high load/IO on the host. Although I'm not sure, I have the feeling that this problem occurs more often on Proxmox 3.4 than on 3.3.

All our VMs are using qcow2 as the disk image format.
     
  6. draga

    draga New Member

    Joined:
    May 17, 2015
    Messages:
    1
    Likes Received:
    0
We have had the same problem since Proxmox 3.3 (upgraded from 3.2). Local storage on a battery-backed RAID controller, cache=writeback. With 3.2 everything was working flawlessly; since we upgraded to 3.3 the problem started to occur. We tried many combinations.
Changing the image format to raw helped, but didn't solve it: instead of happening every day, it started to happen randomly, but still at least twice a week.

Changing the cache type to none solved the problem.
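
As an example, using the 106.conf that was posted earlier, the cache mode can be set explicitly per disk. This is only a sketch, and the volume string should of course be whatever the VM already uses:

Code:
# sketch: set cache=none explicitly on the virtio disk of VM 106
qm set 106 --virtio0 vms:106/vm-106-disk-1.qcow2,cache=none
# equivalently, append ",cache=none" to the virtio0 line in
# /etc/pve/qemu-server/106.conf and power-cycle the VM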
     
  7. Tom91

    Tom91 New Member

    Joined:
    Jul 21, 2014
    Messages:
    9
    Likes Received:
    0
Today I found 2 VMs with a high load. Typically, one VM gets blocked per day.

It's not the first time these VMs have had their filesystem blocked.
I already tried to migrate the VMs to another Proxmox host, and upgrading the OS of the VMs didn't help.

We have left the cache setting of the VMs at the default: Default (no cache), so our problem is not caused by this parameter.

Thank you all the same for the suggestion.
     
    #7 Tom91, Jun 22, 2015
    Last edited: Jun 22, 2015
  8. Tom91

    Tom91 New Member

    Joined:
    Jul 21, 2014
    Messages:
    9
    Likes Received:
    0
  9. mstrent

    mstrent New Member

    Joined:
    Mar 20, 2012
    Messages:
    21
    Likes Received:
    0
I am having this issue as well. I can't seem to narrow it down or resolve it no matter what options I change. Random VMs die with hung_task_timeout_secs. It "feels" I/O related. Argh!
     
  10. mstrent

    mstrent New Member

    Joined:
    Mar 20, 2012
    Messages:
    21
    Likes Received:
    0
    Update: I had two more VMs crash over the weekend in this way. It's as if the virtio disk is just yanked out every once in a while.

Interesting note: doing a "reset" doesn't resolve it; I get disk errors after GRUB. I have to do a full stop, then a start.
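
In other words, something like this on the host (the VMID is just a placeholder):

Code:
# a hard reset is not enough here; a full stop/start cycle is needed
qm stop <vmid>
qm start <vmid>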
     
  11. Tom91

    Tom91 New Member

    Joined:
    Jul 21, 2014
    Messages:
    9
    Likes Received:
    0
    Hi,

Just for information: since I installed the 2.6 kernel, we have had no more VM crashes.

    Tom91
     
  12. mstrent

    mstrent New Member

    Joined:
    Mar 20, 2012
    Messages:
    21
    Likes Received:
    0
    That's good info, thanks!

    You mean the Proxmox kernel, not the kernel inside the VMs, right? So you were on the 3.x Proxmox kernel before?
     
  13. mstrent

    mstrent New Member

    Joined:
    Mar 20, 2012
    Messages:
    21
    Likes Received:
    0
  14. warren

    warren New Member

    Joined:
    Jul 7, 2015
    Messages:
    3
    Likes Received:
    0
I was having this issue roughly every 4 days on 3 guest machines. I also applied the above, and I also remember applying changes to dirty_background_ratio,
referenced here: http://forum.proxmox.com/archive/index.php/t-15893.html

so my sysctl.conf now is as follows:

Code:
vm.dirty_background_bytes = 0
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_writeback_centisecs = 500
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
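
To make the new values take effect without a reboot, something like this should work (assuming the settings were added to /etc/sysctl.conf as above):

Code:
# reload /etc/sysctl.conf so the dirty-page settings apply immediately
sysctl -p
# verify the values that are now active
sysctl vm.dirty_ratio vm.dirty_background_ratio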


I have not seen the "blocked for more than 120 seconds" error since 11 May 2015.
     
  15. Tom91

    Tom91 New Member

    Joined:
    Jul 21, 2014
    Messages:
    9
    Likes Received:
    0
One week without a VM crash \o/

Yes, we changed the kernel of the hypervisor.
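
For anyone who wants to try the same thing, this is roughly what the switch looks like. It's only a sketch, and the kernel package version is taken from the pveversion output in the first post (yours may differ):

Code:
# sketch: install the 2.6.32 PVE kernel on the host and boot into it
apt-get install pve-kernel-2.6.32-39-pve
# make sure GRUB boots it by default (GRUB_DEFAULT in /etc/default/grub,
# or pick it manually from the boot menu), then:
update-grub
reboot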
     
  16. mstrent

    mstrent New Member

    Joined:
    Mar 20, 2012
    Messages:
    21
    Likes Received:
    0
Hmm... after we switched back to the 2.6 kernel, the problem actually got much worse: VMs are hanging every few hours.
     
  17. mstrent

    mstrent New Member

    Joined:
    Mar 20, 2012
    Messages:
    21
    Likes Received:
    0
    Worth a try... This change was on the host, not inside the VM, right?
     
  18. warren

    warren New Member

    Joined:
    Jul 7, 2015
    Messages:
    3
    Likes Received:
    0
Nope, I made the changes inside the running VMs (these VMs are running SME Server).
     
  19. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
Is this problem limited to Debian 7 guests, or has it been observed with other guest OSes?

I've only seen it with Debian guests, and the only fix I found is to use IDE instead of virtio.
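
For reference, that switch is mostly a config change on the host. Here is a sketch based on the 106.conf posted earlier (VM powered off; the former virtio0 line is replaced by an ide0 line, and since the disk no longer shows up as /dev/vda inside the guest, fstab and GRUB entries are safest with UUIDs):

Code:
$ cat /etc/pve/qemu-server/106.conf   # relevant lines after the change (sketch)
bootdisk: ide0
ide0: vms:106/vm-106-disk-1.qcow2,format=qcow2,size=10G
ide2: none,media=cdrom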
     
  20. mstrent

    mstrent New Member

    Joined:
    Mar 20, 2012
    Messages:
    21
    Likes Received:
    0
    I'm seeing it on guests with a variety of Ubuntu versions and Debian as well.

Hardware-wise, I have two Dell 2970s (AMD) and one 2950 (Intel), and all three boxes are exhibiting this hanging bug. Storage is local RAID6 on a Dell PERC 6/i with 10k SAS drives. All firmware is up to date.
     
    #20 mstrent, Jul 8, 2015
    Last edited: Jul 8, 2015