Linux guest problems on new Haswell-EP processors

Discussion in 'Proxmox VE: Installation and configuration' started by e100, Nov 19, 2014.

  1. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    We recently upgraded two servers with dual socket Xeon boards.
    One server has two E5-2687W v3
    The other has two E5-2620 v3

    I have three Debian Wheezy guests, one on the 2687W and two on the 2620, that have had issues.
    These guests are currently kicked out of production so they just sit there idle all day.
    The only real load is a cron job that runs every few minutes; it makes some HTTP requests and reads/writes some tiny files.

    The only clue I have is some kernel message in the guest about jbd2/dm-0-8 being blocked for more than 120 seconds.
    I don't have the exact error but it was something like "INFO: task jbd2/dm-0-8 blocked for more than 120 seconds."
    IO becomes stalled and load keeps rising.
    Only way to recover is to stop/start the VM.
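
    To capture the exact messages the next time a guest stalls, the guest's kernel log can be filtered for the hung-task warning. A minimal sketch, assuming the log text is piped in on stdin (for example from dmesg on the guest console); hung_tasks is a hypothetical helper name, not an existing tool:

    ```shell
    # Hypothetical helper: list the tasks the kernel reported as hung.
    # Reads kernel log text on stdin, e.g.:  dmesg | hung_tasks
    hung_tasks() {
        # Matches lines such as:
        #   INFO: task jbd2/dm-0-8:250 blocked for more than 120 seconds.
        # and prints only the task identifier, de-duplicated and sorted.
        sed -n 's/.*INFO: task \(.*\) blocked for more than [0-9][0-9]* seconds.*/\1/p' | sort -u
    }
    ```

    Knowing which tasks stall (journal thread only, or everything that touches the disk) helps narrow down where the IO is getting stuck.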

    Guests worked fine before the upgrade.
    The only components changed were CPU/RAM/Motherboard.
    Still using same RAID card and disks.

    Storage is LVM over DRBD.

    Oddly no issues with Windows guests, so far.

    Any suggestions?

    VM config file:
    Code:
    # cat /etc/pve/qemu-server/107.conf 
    bootdisk: virtio0
    cores: 1
    ide2: none,media=cdrom
    memory: 1280
    name: XXXXXXXXXXX
    net0: virtio=XX:XX:XX:XX:XX:XX,bridge=vmbr10
    onboot: 1
    ostype: l26
    sockets: 1
    virtio0: vm9-vm10:vm-107-disk-1,cache=directsync,size=3G
    

    Code:
    # pveversion -v
    proxmox-ve-2.6.32: 3.3-139 (running kernel: 2.6.32-34-pve)
    pve-manager: 3.3-5 (running version: 3.3-5/bfebec03)
    pve-kernel-2.6.32-20-pve: 2.6.32-100
    pve-kernel-2.6.32-12-pve: 2.6.32-68
    pve-kernel-2.6.32-19-pve: 2.6.32-96
    pve-kernel-2.6.32-16-pve: 2.6.32-82
    pve-kernel-2.6.32-13-pve: 2.6.32-72
    pve-kernel-2.6.32-29-pve: 2.6.32-126
    pve-kernel-2.6.32-34-pve: 2.6.32-139
    pve-kernel-2.6.32-14-pve: 2.6.32-74
    pve-kernel-2.6.32-26-pve: 2.6.32-114
    pve-kernel-2.6.32-11-pve: 2.6.32-66
    pve-kernel-2.6.32-18-pve: 2.6.32-88
    pve-kernel-2.6.32-23-pve: 2.6.32-109
    lvm2: 2.02.98-pve4
    clvm: 2.02.98-pve4
    corosync-pve: 1.4.7-1
    openais-pve: 1.1.4-3
    libqb0: 0.11.1-2
    redhat-cluster-pve: 3.2.0-2
    resource-agents-pve: 3.9.2-4
    fence-agents-pve: 4.0.10-1
    pve-cluster: 3.0-15
    qemu-server: 3.3-3
    pve-firmware: 1.1-3
    libpve-common-perl: 3.0-19
    libpve-access-control: 3.0-15
    libpve-storage-perl: 3.0-25
    pve-libspice-server1: 0.12.4-3
    vncterm: 1.1-8
    vzctl: 4.0-1pve6
    vzprocps: 2.0.11-2
    vzquota: 3.1-2
    pve-qemu-kvm: 2.1-10
    ksm-control-daemon: 1.1-1
    glusterfs-client: 3.5.2-1
    
     
  2. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    Right after posting my message a colleague sent me this screenshot.
    As you can see, many tasks are stalled. There are no other messages before these; the guest goes from working fine to spitting out these errors with stalled IO.

    Screenshot from 2014-11-19 08:46:23.png
     
  3. term

    term Member

    Joined:
    Aug 29, 2013
    Messages:
    68
    Likes Received:
    1
    What do you have the processor type set to? Does changing it help?
     
  4. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    They were all set to Default.
    I had the same thought myself and have already changed one of the VMs to Haswell, then shut it down and restarted it.

    The VMs that have had this problem have run for up to two weeks without any problem so it will take some time to see if that helped.
     
  5. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
  6. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    More information...

    When the IO stalls it does not seem to be a problem with the guest OS; it seems to be an issue with KVM itself.

    I tried to reset the VM by pressing reset in the GUI.
    VM did not reset.

    Then I entered the monitor tab and entered 'help', the response is:
    Code:
    Type 'help' for help.
    # help
    ERROR: VM 107 qmp command 'human-monitor-command' failed - unable to connect to VM 107 socket - timeout after 31 retries
    
    Only way to recover is to stop/start the VM.
     
  7. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,361
    Likes Received:
    139
    Have you tried kernel 3.10?
     
  8. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    I have not yet tried the 3.10 kernel. If you think it might help, I can certainly give it a try.
     
  9. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,361
    Likes Received:
    139
    Yes, I think it could help. I know that the kvm module in 3.10 has some CPU filtering bugs fixed.
    (I have seen that mainly on live migration between old and new Xeons.)

    So, try it to compare, maybe it'll work. (I don't have Haswell-EP yet to test on my side.)
     
  10. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    So far using the 3.10 kernel seems to have resolved the problem.
     
  11. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    Spoke too soon.

    This problem still occurs on 3.10 but with much less frequency.

    KVM itself is hanging, not the guest.
    Monitor does not work, cannot perform a reset.
    To recover I have to stop, then start the VM.

    Is there anything I can do to help track down the source of this problem?
     
  12. RONIS

    RONIS New Member

    Joined:
    Mar 24, 2013
    Messages:
    9
    Likes Received:
    0
    With Intel Xeon E5 2620 v2 we got the same issues with Linux guests.
     
  13. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    Hmm, that is interesting.
    We have at least four servers running Xeon E5 v2 CPUs and have not seen this problem with those.

    I believe that whatever the problem is, it's in KVM itself. A race condition of some sort, so I am not shocked to see it happen on other CPUs.
     
  14. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,361
    Likes Received:
    139
    Hi,

    Can you try disabling APICv?

    I have seen bug reports about it recently (including with the RHEL 7 3.10 kernel) on the latest Xeon processors.

    # modprobe kvm_intel enable_apicv=N
    Then run cat /sys/module/kvm_intel/parameters/enable_apicv to verify.
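
    Note that modprobe only applies the option when it actually loads the module. If kvm_intel is already loaded, it has to be unloaded first (modprobe -r kvm_intel), which only works while no VMs are running on the node. To keep the setting across reboots, a line such as "options kvm_intel enable_apicv=N" in a file under /etc/modprobe.d/ should do it. A small sketch for checking the result; apicv_state is a hypothetical helper name, and the optional path argument exists only so it can be tried against a test file:

    ```shell
    # Hypothetical helper: report the APICv state from the module parameter file.
    # With no argument it reads the sysfs path mentioned above.
    apicv_state() {
        param_file="${1:-/sys/module/kvm_intel/parameters/enable_apicv}"
        if [ ! -r "$param_file" ]; then
            # Module not loaded, or the parameter does not exist on this kernel.
            echo "unknown"
            return 1
        fi
        case "$(cat "$param_file")" in
            Y|1) echo "enabled" ;;
            N|0) echo "disabled" ;;
            *)   echo "unknown" ;;
        esac
    }
    ```

    Running apicv_state on the host after reloading the module should print "disabled" if the option took effect.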
     
  15. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    Sure, I will try turning off apicv.

    I've been playing with various IO options with my new SSDs. When I was testing iothreads, I experienced IO stalls whenever I set cache=directsync. Nearly all of my VMs use directsync. It is most likely not related to the issue here, but I have set some of my VMs to writethrough to see if it makes a difference.

    I've also been having issues with DRBD on 3.10. It seems like the IO scheduler behaves very differently, resulting in timeouts that cause DRBD to disconnect.
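
    For reference, switching a disk from directsync to writethrough comes down to changing the cache= option on the disk line of the VM config. A minimal sketch, assuming the disk line format from the 107.conf shown earlier; set_cache_mode is a hypothetical helper that only rewrites text on stdin and does not touch anything under /etc/pve:

    ```shell
    # Hypothetical helper: rewrite the cache= option on virtio disk lines.
    # Usage: set_cache_mode writethrough < 107.conf
    set_cache_mode() {
        mode="$1"
        # Only lines starting with "virtioN:" that already carry cache= are changed;
        # every other config line passes through untouched.
        sed "s/^\(virtio[0-9]*:.*cache=\)[a-z]*/\1${mode}/"
    }
    ```

    Printing the rewritten config first makes it easy to review the change before applying it through the GUI or by hand.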
     
  16. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    spirit,

    Turning off APICv does not resolve the problem.
    Any other suggestions? I am completely out of ideas on what might resolve this.
     
  17. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,361
    Likes Received:
    139
    I have built a new kernel based on the upcoming RHEL 7.1 beta kernel.

    The debs are here:

    http://odisoweb1.odiso.net/kernel/

    Maybe it'll help you?

    Merry Xmas ;)
     
  18. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    Hi spirit,

    I installed the kernel you provided yesterday.
    So far no VM lock-ups, but it's not been long enough to conclude the issue is resolved.

    But I am concerned that the kernel is spitting out lots of warnings like this:
    Code:
    [   42.594700] ib0: can't use GFP_NOIO for QPs on device mthca0, using GFP_KERNEL
    [   42.597066] ib0: can't use GFP_NOIO for QPs on device mthca0, using GFP_KERNEL
    [   42.597528] ib1: can't use GFP_NOIO for QPs on device mthca0, using GFP_KERNEL
    [   42.599691] ib1: can't use GFP_NOIO for QPs on device mthca0, using GFP_KERNEL
    [   42.603214] ib1: can't use GFP_NOIO for QPs on device mthca0, using GFP_KERNEL
    [   42.604830] ib1: can't use GFP_NOIO for QPs on device mthca0, using GFP_KERNEL
    
    Those repeat every minute or so.

    Seems related to this patch:
    http://permalink.gmane.org/gmane.linux.drivers.rdma/20239

    I'm no kernel hacker so this is a bit above me, but it appears that hardware drivers also need to be patched to work with the above ipoib patch:
    http://lkml.org/lkml/2014/4/24/543

    My IB cards use the mthca driver
    Code:
    [    8.190786] ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
    
    The changes are to prevent a deadlock. I have been having some issues with DRBD timing out under load on machines running the 3.10 kernel; I wonder if this is related.
     
  19. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    Shortly after posting this a VM locked up, so this new kernel does not resolve the problem. :(
     
  20. symmcom

    symmcom Well-Known Member

    Joined:
    Oct 28, 2012
    Messages:
    1,075
    Likes Received:
    25
    I had hung_task errors frequently on one of my E5-2620v2 nodes. My problem was tinkering with Infiniband. In my case the entire node would lock up and the only way to clear it was a hard reboot. I removed the additional IB drivers I had installed and have not seen this error for about 2 months now. I am not on the new Kernel 3.
     