VM blocked due to hung_task_timeout_secs

Discussion in 'Proxmox VE: Installation and configuration' started by Tom91, Jun 17, 2015.

  1. Tom91

    Tom91 New Member

    Joined:
    Jul 21, 2014
    Messages:
    9
    Likes Received:
    0
    Hi,

I need your advice because we have a problem for which we haven't found a solution.

For an unknown reason, some VMs become unavailable: the load on the VM grows to the point where it's impossible to do anything, and the only way to unblock the situation is to reboot the VM.
As all our VMs are monitored with Centreon, I have some graphs.

I already found some threads about a similar problem, but no solution:
    https://forum.proxmox.com/threads/12982-qemu-nbd-bug?highlight=kworker
    https://forum.proxmox.com/threads/21354-Why-are-my-VMs-dying-with-quot-hung_task_timeout_secs-quot

There are 4 servers configured as a cluster. Each server has the same configuration.

    Code:
    $ pveversion  -v
    proxmox-ve-2.6.32: 3.4-156 (running kernel: 3.10.0-5-pve)
    pve-manager: 3.4-6 (running version: 3.4-6/102d4547)
    pve-kernel-2.6.32-39-pve: 2.6.32-156
    pve-kernel-3.10.0-5-pve: 3.10.0-19
    pve-kernel-2.6.32-34-pve: 2.6.32-140
    lvm2: 2.02.98-pve4
    clvm: 2.02.98-pve4
    corosync-pve: 1.4.7-1
    openais-pve: 1.1.4-3
    libqb0: 0.11.1-2
    redhat-cluster-pve: 3.2.0-2
    resource-agents-pve: 3.9.2-4
    fence-agents-pve: 4.0.10-2
    pve-cluster: 3.0-17
    qemu-server: 3.4-6
    pve-firmware: 1.1-4
    libpve-common-perl: 3.0-24
    libpve-access-control: 3.0-16
    libpve-storage-perl: 3.0-33
    pve-libspice-server1: 0.12.4-3
    vncterm: 1.1-8
    vzctl: 4.0-1pve6
    vzprocps: 2.0.11-2
    vzquota: 3.1-2
    pve-qemu-kvm: 2.2-10
    ksm-control-daemon: 1.1-1
    glusterfs-client: 3.5.2-1
There is no load on the host when the problem occurs on a VM.

    Code:
    top - 10:58:04 up 25 days, 22:35,  1 user,  load average: 0.19, 0.21, 0.23
    Tasks: 616 total,   1 running, 615 sleeping,   0 stopped,   0 zombie
    %Cpu(s):  0.3 us,  0.1 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    MiB Mem:    257927 total,    31034 used,   226893 free,        0 buffers
    MiB Swap:      243 total,        0 used,      243 free,     3618 cached
    
      PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND                                                                                                                                                                          
     4846 root      20   0 32.5g 2.0g 4024 S     3  0.8 866:49.94 kvm                                                                                                                                                                              
     4557 root      20   0 8737m 2.8g 4008 S     3  1.1 827:31.35 kvm                                                                                                                                                                              
     4963 root      20   0 32.6g 1.6g 4024 S     2  0.6 951:55.92 kvm                                                                                                                                                                              
    53389 root      20   0 4645m 1.3g 4024 S     2  0.5 860:33.95 kvm                                                                                                                                                                              
     4897 root      20   0 32.5g 1.6g 4148 S     2  0.6 917:29.60 kvm                                                                                                                                                                              
    31560 root      20   0  287m  61m 6232 S     2  0.0   0:02.64 pvedaemon worke                                                                                                                                                                  
     4704 root      20   0 4616m 1.2g 3988 S     2  0.5 991:31.37 kvm                                                                                                                                                                              
     4313 root      20   0 4616m 1.5g 3984 S     1  0.6 534:31.11 kvm                                                                                                                                                                              
     4462 root      20   0 4616m 1.3g 4108 S     1  0.5 949:10.53 kvm                                                                                                                                                                              
     4607 root      20   0 4617m 1.4g 3984 S     1  0.5 500:35.04 kvm                                                                                                                                                                              
     4732 root      20   0 4637m 1.2g 4000 S     1  0.5 724:27.02 kvm                                                                                                                                                                              
    29185 www-data  20   0  285m  61m 5160 S     1  0.0   0:03.53 pveproxy worker                                                                                                                                                                  
     3103 root      20   0  353m  54m  33m S     1  0.0  24:57.06 pmxcfs                                                                                                                                                                           
     3905 root      20   0 4616m 1.8g 3988 S     1  0.7 737:33.64 kvm                                                                                                                                                                              
        3 root      20   0     0    0    0 S     0  0.0  17:21.56 ksoftirqd/0                                                                                                                                                                      
     2183 root      20   0     0    0    0 S     0  0.0  12:47.38 xfsaild/dm-7                                                                                                                                                                     
     2982 root      20   0  371m 2328  888 S     0  0.0  18:25.42 rrdcached                                                                                                                                                                        
     3308 root       0 -20  201m  66m  42m S     0  0.0  64:04.29 corosync                                                                                                                                                                         
     4594 root      20   0     0    0    0 S     0  0.0   6:12.00 vhost-4557                                                                                                                                                                       
     5264 root      20   0 4630m 1.4g 3996 S     0  0.6 959:41.76 kvm                                                                                                                                                                              
     9205 root      20   0 4546m 4.1g 4148 S     0  1.6  28:06.42 kvm                                                                                                                                                                              
    16144 root      20   0     0    0    0 S     0  0.0   0:44.51 kworker/6:1                                                                                                                                                                      
    20152 root      20   0 4617m 1.1g 3984 S     0  0.4 298:02.00 kvm
    Code:
    $ sudo pveperf 
    CPU BOGOMIPS:      175688.20
    REGEX/SECOND:      989918
    HD SIZE:           0.95 GB (/dev/mapper/vg01-root)
    BUFFERED READS:    325.33 MB/sec
    AVERAGE SEEK TIME: 0.04 ms
    FSYNCS/SECOND:     13047.64
    DNS EXT:           29.53 ms
    Code:
    $ sudo pveperf /srv/vms/
    CPU BOGOMIPS:      175688.20
    REGEX/SECOND:      1046214
    HD SIZE:           199.90 GB (/dev/mapper/vg01-vms)
    BUFFERED READS:    445.96 MB/sec
    AVERAGE SEEK TIME: 0.19 ms
    FSYNCS/SECOND:     103.74
    DNS EXT:           31.61 ms
    Code:
    $ sudo qm list | grep -v stopped
          VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID       
           102 server-102     running    4096              10.00 3905      
           103 server-103     running    4096              10.00 4313      
           104 server-104     running    4096              10.00 20152     
           105 server-105     running    4096              10.00 4462      
           106 server-106     running    4096              10.00 9205      
           107 server-107     running    8192              10.00 4557      
           108 server-108     running    4096              10.00 4607      
           109 server-109     running    4096              10.00 4704      
           110 server-110     running    4096              10.00 4732      
           111 server-111     running    4096              10.00 53389     
           112 server-112     running    32768             10.00 4846      
           113 server-113     running    32768             10.00 4897      
           114 server-114     running    32768             10.00 4963      
           137 server-137     running    4096              10.00 5264
    Code:
    $ cat /proc/net/bonding/bond0 
    Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
    
    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: None
    Currently Active Slave: eth0
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 0
    Down Delay (ms): 0
    
    Slave Interface: eth0
    MII Status: up
    Speed: 1000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: a0:d3:c1:fc:c3:50
    Slave queue ID: 0
    
    Slave Interface: eth1
    MII Status: up
    Speed: 1000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: a0:d3:c1:fc:c3:51
    Slave queue ID: 0
    Code:
    $ cat /etc/pve/qemu-server/106.conf 
    bootdisk: virtio0
    cores: 1
    ide2: none,media=cdrom
    memory: 4096
    name: server-106
    net0: virtio=72:33:AA:5F:BB:11,bridge=vmbr0,tag=226
    onboot: 1
    ostype: l26
    smbios1: uuid=071915f2-544c-49ba-a3c3-f3c52f5188d4
    sockets: 1
    virtio0: vms:106/vm-106-disk-1.qcow2,format=qcow2,size=10G
Some information about a VM just after the problem has occurred.

    Code:
    $ uname -a
    Linux ######### 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u1 x86_64 GNU/Linux
I found some kernel errors in the syslog of the VM:

    Code:
    Jun 11 21:51:49 ######### kernel: [1762080.352106] INFO: task kworker/0:3:6362 blocked for more than 120 seconds.
    Jun 11 21:51:49 ######### kernel: [1762080.352807] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Jun 11 21:51:49 ######### kernel: [1762080.353202] kworker/0:3     D ffff88013fc13780     0  6362      2 0x00000000
    Jun 11 21:51:49 ######### kernel: [1762080.353205]  ffff880137a147c0 0000000000000046 0000000100000008 ffff880136fd80c0
    Jun 11 21:51:49 ######### kernel: [1762080.353208]  0000000000013780 ffff880120f31fd8 ffff880120f31fd8 ffff880137a147c0
    Jun 11 21:51:49 ######### kernel: [1762080.353210]  ffff88013a340200 ffffffff8107116d 0000000000000202 ffff880139e0edc0
    Jun 11 21:51:49 ######### kernel: [1762080.353213] Call Trace:
    Jun 11 21:51:49 ######### kernel: [1762080.353219]  [<ffffffff8107116d>] ? arch_local_irq_save+0x11/0x17
    Jun 11 21:51:49 ######### kernel: [1762080.353255]  [<ffffffffa0150a8f>] ? xlog_wait+0x51/0x67 [xfs]
    Jun 11 21:51:49 ######### kernel: [1762080.353258]  [<ffffffff8103f6e2>] ? try_to_wake_up+0x197/0x197
    Jun 11 21:51:49 ######### kernel: [1762080.353268]  [<ffffffffa015322c>] ? _xfs_log_force_lsn+0x1cd/0x205 [xfs]
    Jun 11 21:51:49 ######### kernel: [1762080.353277]  [<ffffffffa0150502>] ? xfs_trans_commit+0x10a/0x205 [xfs]
    Jun 11 21:51:49 ######### kernel: [1762080.353285]  [<ffffffffa011d7d4>] ? xfs_sync_worker+0x3a/0x6a [xfs]
    Jun 11 21:51:49 ######### kernel: [1762080.353288]  [<ffffffff8105b5f7>] ? process_one_work+0x161/0x269
    Jun 11 21:51:49 ######### kernel: [1762080.353290]  [<ffffffff8105aba3>] ? cwq_activate_delayed_work+0x3c/0x48
    Jun 11 21:51:49 ######### kernel: [1762080.353292]  [<ffffffff8105c5c0>] ? worker_thread+0xc2/0x145
    Jun 11 21:51:49 ######### kernel: [1762080.353294]  [<ffffffff8105c4fe>] ? manage_workers.isra.25+0x15b/0x15b
    Jun 11 21:51:49 ######### kernel: [1762080.353296]  [<ffffffff8105f701>] ? kthread+0x76/0x7e
    Jun 11 21:51:49 ######### kernel: [1762080.353299]  [<ffffffff813575b4>] ? kernel_thread_helper+0x4/0x10
    Jun 11 21:51:49 ######### kernel: [1762080.353302]  [<ffffffff8105f68b>] ? kthread_worker_fn+0x139/0x139
    Jun 11 21:51:49 ######### kernel: [1762080.353304]  [<ffffffff813575b0>] ? gs_change+0x13/0x13
Some processes are blocked:

    Code:
    ps auxf | grep -E ' [DR]'
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    root 2787 0.0 0.0 0 0 ? D May22 0:07 \_ [flush-253:4]
    root 6362 0.0 0.0 0 0 ? D Jun11 0:00 \_ [kworker/0:3]
    root 7257 0.0 0.0 0 0 ? D Jun11 0:00 \_ [kworker/0:2]
    root 8036 0.0 0.0 0 0 ? D Jun11 0:00 \_ [kworker/0:1]
    root 8262 0.0 0.0 0 0 ? D Jun11 0:00 \_ [kworker/0:4]
    root 8263 0.0 0.0 0 0 ? D Jun11 0:00 \_ [kworker/0:5]
    root 32191 0.0 0.0 0 0 ? D Jun12 0:12 \_ [kworker/0:8]
    root 26983 0.0 0.0 0 0 ? D Jun12 1:24 \_ [kworker/0:9]
(… and some other processes …)
There is no I/O wait:

    Code:
    vmstat 1 10
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
    r b swpd free buff cache si so bi bo in cs us sy id wa
    0 0 0 3078348 2708 462720 0 0 1 89 15 18 0 0 99 0
    0 0 0 3078340 2708 462720 0 0 0 0 289 718 0 0 100 0
    0 0 0 3078340 2708 462720 0 0 0 0 289 722 0 0 100 0
    0 0 0 3078340 2708 462720 0 0 0 0 291 730 0 0 100 0
Thanks for your advice.
     
    #1 Tom91, Jun 17, 2015
    Last edited: Jun 17, 2015
  2. Tom91

    Tom91 New Member

    Joined:
    Jul 21, 2014
    Messages:
    9
    Likes Received:
    0
Re: The load on the VMs grows to the point where they become unavailable

My message became unreadable after I posted it. I don't know why?!

    [EDIT]

I changed the "Message Editor Interface" setting and it seems to work.
     
    #2 Tom91, Jun 17, 2015
    Last edited: Jun 17, 2015
  3. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,831
    Likes Received:
    158
Hi,
does the same happen if you change the disk format to raw?

What kind of storage / underlying filesystem are you using?

    Udo
     
  4. Tom91

    Tom91 New Member

    Joined:
    Jul 21, 2014
    Messages:
    9
    Likes Received:
    0
    Hi,

    Thanks for your reply.

The host server is an HP ProLiant with SSDs behind a P420i controller.
The host operating system is Debian 7.8.

The problem doesn't appear all the time, nor on all servers. I don't understand which parameters affect the storage of the VMs.

I will try a test with the disk format set to raw.
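
Roughly, I expect the conversion to look something like this (just a sketch; it assumes the "vms" directory storage is mounted under /srv/vms with the usual images/<vmid>/ layout, and that the VM is powered off first):

Code:
# sketch only: convert the qcow2 image of VM 106 to raw
qm stop 106
cd /srv/vms/images/106
qemu-img convert -p -f qcow2 -O raw vm-106-disk-1.qcow2 vm-106-disk-1.raw
# then point /etc/pve/qemu-server/106.conf at the raw file, e.g.:
#   virtio0: vms:106/vm-106-disk-1.raw,format=raw,size=10G
qm start 106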

    Thomas
     
  5. uFx

    uFx New Member

    Joined:
    Jun 19, 2015
    Messages:
    8
    Likes Received:
    0
We have the same problem with a few VMs on Proxmox 3.4. We are using local storage on an Areca RAID set. We also back up to NFS, and the "blocked for more than 120 seconds" problems occur mostly during a backup. However, we have also had problems with a few VMs when no backup was running and there was no high load/IO on the host. Although I'm not sure, I have the feeling that this problem occurs more often on Proxmox 3.4 than on 3.3.

All our VMs are using qcow2 as the disk image format.
     
  6. draga

    draga New Member

    Joined:
    May 17, 2015
    Messages:
    1
    Likes Received:
    0
We have had the same problem since Proxmox 3.3 (upgraded from 3.2). Local storage on a battery-backed RAID controller, cache=writeback. With 3.2 everything was working flawlessly; since we upgraded to 3.3 the problem started to occur. We tried many combinations.
Changing the image format to raw helped, but didn't solve it: instead of happening every day, it started to happen randomly, but still at least twice a week.

Changing the cache type to none solved the problem.
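
As an example, using the 106.conf that was posted earlier, the cache mode can be set explicitly per disk. This is only a sketch, and the volume string should of course be whatever the VM already uses:

Code:
# sketch: set cache=none explicitly on the virtio disk of VM 106
qm set 106 --virtio0 vms:106/vm-106-disk-1.qcow2,cache=none
# equivalently, append ",cache=none" to the virtio0 line in
# /etc/pve/qemu-server/106.conf and power-cycle the VM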
     
  7. Tom91

    Tom91 New Member

    Joined:
    Jul 21, 2014
    Messages:
    9
    Likes Received:
    0
Today I found 2 VMs with a high load. Typically, one VM gets blocked per day.

It's not the first time these VMs have had their filesystem blocked.
I already tried to migrate the VMs to another Proxmox host, and upgrading the OS of the VMs didn't help.

We have left the cache setting of the VMs at the default: Default (no cache), so our problem is not caused by this parameter.

Thank you all the same for the suggestion.
     
    #7 Tom91, Jun 22, 2015
    Last edited: Jun 22, 2015
  8. Tom91

    Tom91 New Member

    Joined:
    Jul 21, 2014
    Messages:
    9
    Likes Received:
    0
  9. mstrent

    mstrent New Member

    Joined:
    Mar 20, 2012
    Messages:
    21
    Likes Received:
    0
I am having this issue as well. I can't seem to narrow it down or resolve it no matter what options I change. Random VMs die with hung_task_timeout_secs. It "feels" I/O related. Argh!
     
  10. mstrent

    mstrent New Member

    Joined:
    Mar 20, 2012
    Messages:
    21
    Likes Received:
    0
    Update: I had two more VMs crash over the weekend in this way. It's as if the virtio disk is just yanked out every once in a while.

Interesting note: doing a "reset" doesn't resolve it; I get disk errors after GRUB. I have to do a full stop, then a start.
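
In other words, something like this on the host (the VMID is just a placeholder):

Code:
# a hard reset is not enough here; a full stop/start cycle is needed
qm stop <vmid>
qm start <vmid>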
     
  11. Tom91

    Tom91 New Member

    Joined:
    Jul 21, 2014
    Messages:
    9
    Likes Received:
    0
    Hi,

Just for information: since I installed the 2.6 kernel, we have had no more VM crashes.

    Tom91
     
  12. mstrent

    mstrent New Member

    Joined:
    Mar 20, 2012
    Messages:
    21
    Likes Received:
    0
    That's good info, thanks!

    You mean the Proxmox kernel, not the kernel inside the VMs, right? So you were on the 3.x Proxmox kernel before?
     
  13. mstrent

    mstrent New Member

    Joined:
    Mar 20, 2012
    Messages:
    21
    Likes Received:
    0
  14. warren

    warren New Member

    Joined:
    Jul 7, 2015
    Messages:
    3
    Likes Received:
    0
I was having this issue roughly every 4 days on 3 guest machines. I also applied the above, and I also remember applying changes to dirty_background_ratio,
referenced here: http://forum.proxmox.com/archive/index.php/t-15893.html

so my sysctl.conf now is as follows:

Code:
vm.dirty_background_bytes = 0
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_writeback_centisecs = 500
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
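
To make the new values take effect without a reboot, something like this should work (assuming the settings were added to /etc/sysctl.conf as above):

Code:
# reload /etc/sysctl.conf so the dirty-page settings apply immediately
sysctl -p
# verify the values that are now active
sysctl vm.dirty_ratio vm.dirty_background_ratio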


I have not seen the "blocked for more than 120 seconds" error since 11 May 2015.
     
  15. Tom91

    Tom91 New Member

    Joined:
    Jul 21, 2014
    Messages:
    9
    Likes Received:
    0
One week without a VM crash \o/

Yes, we changed the kernel of the hypervisor.
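
For anyone who wants to try the same thing, this is roughly what the switch looks like. It's only a sketch, and the kernel package version is taken from the pveversion output in the first post (yours may differ):

Code:
# sketch: install the 2.6.32 PVE kernel on the host and boot into it
apt-get install pve-kernel-2.6.32-39-pve
# make sure GRUB boots it by default (GRUB_DEFAULT in /etc/default/grub,
# or pick it manually from the boot menu), then:
update-grub
reboot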
     
  16. mstrent

    mstrent New Member

    Joined:
    Mar 20, 2012
    Messages:
    21
    Likes Received:
    0
Hmm... after we switched back to the 2.6 kernel, the problem actually got much worse: VMs are hanging every few hours.
     
  17. mstrent

    mstrent New Member

    Joined:
    Mar 20, 2012
    Messages:
    21
    Likes Received:
    0
    Worth a try... This change was on the host, not inside the VM, right?
     
  18. warren

    warren New Member

    Joined:
    Jul 7, 2015
    Messages:
    3
    Likes Received:
    0
Nope, I made the changes inside the running VMs (these VMs are running SME Server).
     
  19. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
Is this problem limited to Debian 7 guests, or has it been observed with other guest OSes?

I've only seen it with Debian guests, and the only fix I found is to use IDE instead of virtio.
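
For reference, that switch is mostly a config change on the host. Here is a sketch based on the 106.conf posted earlier (VM powered off; the former virtio0 line is replaced by an ide0 line, and since the disk no longer shows up as /dev/vda inside the guest, fstab and GRUB entries are safest with UUIDs):

Code:
$ cat /etc/pve/qemu-server/106.conf   # relevant lines after the change (sketch)
bootdisk: ide0
ide0: vms:106/vm-106-disk-1.qcow2,format=qcow2,size=10G
ide2: none,media=cdrom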
     
  20. mstrent

    mstrent New Member

    Joined:
    Mar 20, 2012
    Messages:
    21
    Likes Received:
    0
    I'm seeing it on guests with a variety of Ubuntu versions and Debian as well.

Hardware-wise, I have two Dell 2970s (AMD) and one 2950 (Intel), and all three boxes are exhibiting this hanging bug. Storage is local RAID6 on a Dell PERC 6/i with 10k SAS drives. All firmware is up to date.
     
    #20 mstrent, Jul 8, 2015
    Last edited: Jul 8, 2015