Linux guest problems on new Haswell-EP processors

Discussion in 'Proxmox VE: Installation and configuration' started by e100, Nov 19, 2014.

  1. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,302
    Likes Received:
    131
Local SSD, NetApp SAN through NFS, ZFS SAN and a Ceph cluster: I don't have any problems with any of them.

All my Linux guests are Debian with kernel 3.x, using virtio or virtio-scsi disks.
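    For reference, a quick way to check from inside a guest whether it is running on virtio-blk or virtio-scsi (a minimal sketch; device names like vda/sda depend on the configuration):

    Code:
    # list the virtio devices the guest sees
    lspci | grep -i virtio

    # virtio-blk disks typically show up as /dev/vda, virtio-scsi disks as /dev/sda
    lsblk -d -o NAME,SIZE,MODEL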


    One question: as your servers are pretty old, is the battery of the PERC 6/i OK? (If not, you won't have writeback cache, and with RAID 6 that will hurt a lot.)
     
  2. mstrent

    mstrent New Member

    Joined:
    Mar 20, 2012
    Messages:
    21
    Likes Received:
    0
    Good question! Megacli reports battery status optimal on all servers.
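    For anyone who wants to run the same check, a BBU status query with MegaCLI looks roughly like this (a sketch; the binary may be installed as megacli, MegaCli or MegaCli64 depending on the package):

    Code:
    # query battery backup unit status on all adapters
    megacli -AdpBbuCmd -GetBbuStatus -aALL

    # confirm the virtual drive cache policy is actually WriteBack
    megacli -LDGetProp -Cache -LAll -aAll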
     
  3. mstrent

    mstrent New Member

    Joined:
    Mar 20, 2012
    Messages:
    21
    Likes Received:
    0
    Code:
    root@proxmox4:~# pveperf /var/lib/vz
    CPU BOGOMIPS: 67029.36
    REGEX/SECOND: 992711
    HD SIZE: 1506.85 GB (/dev/mapper/pve-data)
    BUFFERED READS: 379.42 MB/sec
    AVERAGE SEEK TIME: 7.01 ms
    FSYNCS/SECOND: 3189.62
    DNS EXT: 218.45 ms
    DNS INT: 3.02 ms (lewis.local)
     
  4. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,302
    Likes Received:
    131
    Yes, that seems to be OK.

    I'm using a lot of Dell servers with different PERC controllers, and I'm sure the host kernel driver is pretty stable.

    Can you run some tests with something like a Debian Jessie VM, with a recent kernel and a virtio disk, and do some write benchmarks?
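    For a comparable write benchmark inside the guest, something like the following could be used (a minimal sketch; fio may need to be installed first, and /tmp/testfile is just a placeholder path):

    Code:
    # sequential writes, 1 GiB, bypassing the guest page cache
    fio --name=seqwrite --filename=/tmp/testfile --size=1G --rw=write \
        --bs=1M --direct=1 --ioengine=libaio --iodepth=16

    # fsync-heavy workload, closer to what pveperf's FSYNCS/SECOND measures
    dd if=/dev/zero of=/tmp/testfile bs=4k count=100000 conv=fdatasync

    rm -f /tmp/testfile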
     
  5. remark

    remark Member

    Joined:
    May 4, 2011
    Messages:
    91
    Likes Received:
    6
    I have the same problem on both Intel Xeon and AMD Opteron CPUs (Intel Xeon E5620, AMD Opteron 6128).
     
  6. mstrent

    mstrent New Member

    Joined:
    Mar 20, 2012
    Messages:
    21
    Likes Received:
    0
    Have you folks altered your guest VMs' I/O scheduler and/or the timing of cron.daily (mlocate/logrotate)?
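    For reference, checking and changing the scheduler inside a guest looks roughly like this (a sketch; vda assumes a virtio disk, use sda for virtio-scsi, and the runtime change below does not survive a reboot):

    Code:
    # show the current scheduler (the active one is shown in brackets)
    cat /sys/block/vda/queue/scheduler

    # switch to deadline at runtime
    echo deadline > /sys/block/vda/queue/scheduler

    # to make it persistent, add "elevator=deadline" to GRUB_CMDLINE_LINUX_DEFAULT
    # in /etc/default/grub and run update-grub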
     
  7. remark

    remark Member

    Joined:
    May 4, 2011
    Messages:
    91
    Likes Received:
    6
    No, I haven't. It's a default installation; only the daemon config files were changed (httpd, the DrWeb daemon, etc.).
     
  8. robhost

    robhost Member
    Proxmox Subscriber

    Joined:
    Jun 15, 2014
    Messages:
    185
    Likes Received:
    7
    We have the same issue on an HP DL180 Gen9 with an E5-2620 v3. Any news on that?
    We'll give kernel 3.10 a try now...
     
  9. robhost

    robhost Member
    Proxmox Subscriber

    Joined:
    Jun 15, 2014
    Messages:
    185
    Likes Received:
    7
    Does the Wheezy backports 3.16 kernel still work for you, spirit? Then we'll give it a try.

    We also see the same issue with the 3.10.0-11-pve kernel, which has only been booted for a few hours now :-(
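    For reference, installing the backports kernel on a Wheezy-based PVE 3.x host goes roughly like this (a sketch, not an official procedure; note that a non-pve kernel has no OpenVZ support, and the mirror URL is just an example):

    Code:
    # add the wheezy-backports repository
    echo "deb http://http.debian.net/debian wheezy-backports main" \
        > /etc/apt/sources.list.d/backports.list
    apt-get update

    # install the 3.16 kernel from backports and reboot into it
    apt-get -t wheezy-backports install linux-image-amd64
    reboot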
     
  10. nanonettr

    nanonettr New Member

    Joined:
    Jul 25, 2015
    Messages:
    17
    Likes Received:
    0
    I can't remember exactly when, but we had a similar problem.

    We disabled the Intel power management features in the BIOS and changed some lines in /etc/default/grub as follows:

    Code:
    GRUB_DEFAULT=0
    GRUB_TIMEOUT=5
    GRUB_DISTRIBUTOR="Proxmox Virtual Environment"
    GRUB_CMDLINE_LINUX_DEFAULT="quiet scsi_mod.scan=sync rootdelay=30 nodelayacct elevator=deadline idle=halt intel_idle.max_cstate=0 processor.max_cstate=1 panic=90"
    GRUB_CMDLINE_LINUX=""
    GRUB_RECORDFAIL_TIMEOUT="5"
    GRUB_TIMEOUT_STYLE="hidden"
    Since then we have never had a VM lockup. The relevant changes are "idle=halt intel_idle.max_cstate=0 processor.max_cstate=1".

    After updating the file you need to run 'update-grub' and reboot.
    Note that these changes also mean higher power consumption...
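    To verify after the reboot that the new options are actually in effect, something like this can be used (a sketch; the exact output depends on the kernel version):

    Code:
    # the new parameters should appear on the running kernel's command line
    cat /proc/cmdline

    # with intel_idle disabled, the cpuidle driver should no longer be "intel_idle"
    cat /sys/devices/system/cpu/cpuidle/current_driver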
     
  11. robhost

    robhost Member
    Proxmox Subscriber

    Joined:
    Jun 15, 2014
    Messages:
    185
    Likes Received:
    7
    Update:

    No more hangs for 10 days now with the Wheezy backports 3.16 kernel on PVE 3.4 with a Haswell CPU.
     
  12. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    I have a few Ubuntu 14.04 VMs running 3.16 and they have never had this problem.
    That seems to support the idea that the problem might be in the guest kernel.
     
  13. robhost

    robhost Member
    Proxmox Subscriber

    Joined:
    Jun 15, 2014
    Messages:
    185
    Likes Received:
    7
    Not really. We changed our host kernel to 3.16, so I think the *-pve kernel has some problems with Haswell.
    As we're running lots of CentOS 7 servers with the stock RHEL kernel without problems, it must be a PVE-specific problem, maybe in combination with their qemu packages or something.
     
  14. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    This seems like a significant clue.
     
  15. Juniorrrrr

    Juniorrrrr New Member

    Joined:
    Sep 29, 2015
    Messages:
    3
    Likes Received:
    0
    Same problem here: "Ivy Bridge" servers work fine, but on "Haswell-EP" we have constant freezing.

    [Attachment: ssd.jpg]
     
  16. mir

    mir Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 14, 2012
    Messages:
    3,480
    Likes Received:
    96
    Have you disabled all C-states in BIOS?
     
  17. Juniorrrrr

    Juniorrrrr New Member

    Joined:
    Sep 29, 2015
    Messages:
    3
    Likes Received:
    0
    I had disabled everything under Power Technology.
    Now I have set it to Custom and disabled everything under C-States.

    [Attachments: Screenshot_1.jpg, Screenshot_2.jpg]
     
  18. robhost

    robhost Member
    Proxmox Subscriber

    Joined:
    Jun 15, 2014
    Messages:
    185
    Likes Received:
    7

    Did you try kernel 3.16?
    We have been running this kernel for 3 weeks without any freeze!
     
  19. avladulescu

    avladulescu New Member

    Joined:
    Mar 3, 2015
    Messages:
    20
    Likes Received:
    0
    I can confirm this; it still happens with exactly the same symptoms that e100 described.

    I have 2 sites running Dell R730xd servers with 2 x E5-2630 v3 processors, and this issue still manifests on highly loaded VMs. NUMA is enabled, the disks and network are set to virtio, and the SCSI controller type is the default LSI.

    We are talking about version 3.4-11 with the 3.10.0-13-pve kernel installed. There is no pattern to this problem, but from what I can tell, on one site I have all VMs (KVM, not OpenVZ) running an updated Debian 7.9 (local SSD storage via a HW RAID controller), and on the second site CentOS 6.7 with 2.6.32-573.8.1.el6.x86_64 (running via NFS, also tested via iSCSI) on a dedicated 10G network to a central storage solution. So the argument about local vs. remote storage is pointless now.

    An interesting point that I see nobody has replied to is nanonettr's post; I will give that change a try.

    The issue and what I have tested are described in more detail here (problem #2): http://forum.proxmox.com/threads/24277-VM-high-vCPU-usage-issues

    On the other hand, here is an additional piece of information which I have tested on both setups:

    - when the VM is in the locked state and is printing the kernel hung-task timeout messages on the console, adding another disk (regardless of site or storage type) via the Proxmox GUI immediately pulls the locked CPU thread's I/O wait time from 100% down to 0 and everything comes back to normal.

    So we have 2 different setups, different network and storage designs, and different KVM guests with different kernels!

    Therefore adding another disk, just to remove it again after the VM calms down (no formatting or other operations on the drive are needed), somehow triggers a disk/configuration refresh in qemu that snaps the VM out of the locked state.

    I tested whether adding/removing any component has this effect on the affected VMs, by mounting/unmounting an ISO image and adding/removing a network card, but the VM only reacts to adding/removing a hard disk.

    Any other clues ?
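    For anyone who wants to try the same workaround without the GUI, a rough sketch with the qm CLI could look like this (VM ID 100, the storage 'local' and the disk slot are placeholders, and the exact options may differ between PVE versions):

    Code:
    # attach a small temporary 1 GB disk to the locked VM
    qm set 100 --virtio1 local:1

    # once the VM has calmed down, detach it again (it remains as an unused
    # volume in the config and can then be deleted from the GUI or the storage)
    qm set 100 --delete virtio1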
     
    #59 avladulescu, Nov 25, 2015
    Last edited: Nov 25, 2015
  20. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,302
    Likes Received:
    131
    Is that with disk hotplug enabled?
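    To check whether hotplug is enabled for a VM, something like this could be used (a sketch; VM ID 100 is a placeholder, and the available hotplug categories depend on the PVE version):

    Code:
    # show the VM configuration and look for a "hotplug:" line
    qm config 100 | grep hotplug

    # enable disk and network hotplug for the VM
    qm set 100 --hotplug disk,network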
     