Linux guest problems on new Haswell-EP processors

Discussion in 'Proxmox VE: Installation and configuration' started by e100, Nov 19, 2014.

  1. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,302
    Likes Received:
    131
Local SSD, NetApp SAN through NFS, ZFS SAN and a Ceph cluster: I don't have any problems with any of them.

All my Linux guests are Debian with kernel 3.x, using virtio or virtio-scsi disks.
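    For reference, a quick way to check from inside a guest whether it is running on virtio-blk or virtio-scsi (a minimal sketch; device names like vda/sda depend on the configuration):

    Code:
    # list the virtio devices the guest sees
    lspci | grep -i virtio

    # virtio-blk disks typically show up as /dev/vda, virtio-scsi disks as /dev/sda
    lsblk -d -o NAME,SIZE,MODEL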


    One question: as your servers are pretty old, is the battery of the PERC 6/i OK? (If not, you won't have writeback cache, and with RAID 6 that will hurt a lot.)
     
  2. mstrent

    mstrent New Member

    Joined:
    Mar 20, 2012
    Messages:
    21
    Likes Received:
    0
    Good question! Megacli reports battery status optimal on all servers.
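    For anyone who wants to run the same check, a BBU status query with MegaCLI looks roughly like this (a sketch; the binary may be installed as megacli, MegaCli or MegaCli64 depending on the package):

    Code:
    # query battery backup unit status on all adapters
    megacli -AdpBbuCmd -GetBbuStatus -aALL

    # confirm the virtual drive cache policy is actually WriteBack
    megacli -LDGetProp -Cache -LAll -aAll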
     
  3. mstrent

    mstrent New Member

    Joined:
    Mar 20, 2012
    Messages:
    21
    Likes Received:
    0
    Code:
    root@proxmox4:~# pveperf /var/lib/vz
    CPU BOGOMIPS: 67029.36
    REGEX/SECOND: 992711
    HD SIZE: 1506.85 GB (/dev/mapper/pve-data)
    BUFFERED READS: 379.42 MB/sec
    AVERAGE SEEK TIME: 7.01 ms
    FSYNCS/SECOND: 3189.62
    DNS EXT: 218.45 ms
    DNS INT: 3.02 ms (lewis.local)
     
  4. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,302
    Likes Received:
    131
    Yes, that seems to be OK.

    I'm using a lot of Dell servers with different PERC controllers, and I'm sure the host kernel driver is pretty stable.

    Can you run some tests with something like a Debian Jessie VM, with a recent kernel and a virtio disk, and do some write benchmarks?
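    For a comparable write benchmark inside the guest, something like the following could be used (a minimal sketch; fio may need to be installed first, and /tmp/testfile is just a placeholder path):

    Code:
    # sequential writes, 1 GiB, bypassing the guest page cache
    fio --name=seqwrite --filename=/tmp/testfile --size=1G --rw=write \
        --bs=1M --direct=1 --ioengine=libaio --iodepth=16

    # fsync-heavy workload, closer to what pveperf's FSYNCS/SECOND measures
    dd if=/dev/zero of=/tmp/testfile bs=4k count=100000 conv=fdatasync

    rm -f /tmp/testfile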
     
  5. remark

    remark Member

    Joined:
    May 4, 2011
    Messages:
    91
    Likes Received:
    6
    I have the same problem on both Intel Xeon and AMD Opteron CPUs (Intel Xeon E5620, AMD Opteron 6128).
     
  6. mstrent

    mstrent New Member

    Joined:
    Mar 20, 2012
    Messages:
    21
    Likes Received:
    0
    Have you folks altered your guest VMs' I/O scheduler and/or the timing of cron.daily (mlocate/logrotate)?
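    For reference, checking and changing the scheduler inside a guest looks roughly like this (a sketch; vda assumes a virtio disk, use sda for virtio-scsi, and the runtime change below does not survive a reboot):

    Code:
    # show the current scheduler (the active one is shown in brackets)
    cat /sys/block/vda/queue/scheduler

    # switch to deadline at runtime
    echo deadline > /sys/block/vda/queue/scheduler

    # to make it persistent, add "elevator=deadline" to GRUB_CMDLINE_LINUX_DEFAULT
    # in /etc/default/grub and run update-grub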
     
  7. remark

    remark Member

    Joined:
    May 4, 2011
    Messages:
    91
    Likes Received:
    6
    No, I haven't. It's a default installation; only the daemon config files were changed (httpd, the DrWeb daemon, etc.).
     
  8. robhost

    robhost Member
    Proxmox Subscriber

    Joined:
    Jun 15, 2014
    Messages:
    185
    Likes Received:
    7
    We have the same issue on an HP DL180 Gen9 with an E5-2620 v3. Any news on that?
    We'll give kernel 3.10 a try now...
     
  9. robhost

    robhost Member
    Proxmox Subscriber

    Joined:
    Jun 15, 2014
    Messages:
    185
    Likes Received:
    7
    Does the Wheezy backports 3.16 kernel still work for you, spirit? Then we'll give it a try.

    We also see the same issue with the 3.10.0-11-pve kernel, which has only been booted for a few hours now :-(
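    For reference, installing the backports kernel on a Wheezy-based PVE 3.x host goes roughly like this (a sketch, not an official procedure; note that a non-pve kernel has no OpenVZ support, and the mirror URL is just an example):

    Code:
    # add the wheezy-backports repository
    echo "deb http://http.debian.net/debian wheezy-backports main" \
        > /etc/apt/sources.list.d/backports.list
    apt-get update

    # install the 3.16 kernel from backports and reboot into it
    apt-get -t wheezy-backports install linux-image-amd64
    reboot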
     
  10. nanonettr

    nanonettr New Member

    Joined:
    Jul 25, 2015
    Messages:
    17
    Likes Received:
    0
    I can't remember exactly when, but we had a similar problem.

    We disabled the Intel power management features in the BIOS and changed some lines in /etc/default/grub as follows:

    Code:
    GRUB_DEFAULT=0
    GRUB_TIMEOUT=5
    GRUB_DISTRIBUTOR="Proxmox Virtual Environment"
    GRUB_CMDLINE_LINUX_DEFAULT="quiet scsi_mod.scan=sync rootdelay=30 nodelayacct elevator=deadline idle=halt intel_idle.max_cstate=0 processor.max_cstate=1 panic=90"
    GRUB_CMDLINE_LINUX=""
    GRUB_RECORDFAIL_TIMEOUT="5"
    GRUB_TIMEOUT_STYLE="hidden"
    Since then we have never had a VM lockup. The relevant changes are "idle=halt intel_idle.max_cstate=0 processor.max_cstate=1".

    After updating the file you need to run 'update-grub' and reboot.
    Note that these changes also mean higher power consumption...
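    To verify after the reboot that the new options are actually in effect, something like this can be used (a sketch; the exact output depends on the kernel version):

    Code:
    # the new parameters should appear on the running kernel's command line
    cat /proc/cmdline

    # with intel_idle disabled, the cpuidle driver should no longer be "intel_idle"
    cat /sys/devices/system/cpu/cpuidle/current_driver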
     
  11. robhost

    robhost Member
    Proxmox Subscriber

    Joined:
    Jun 15, 2014
    Messages:
    185
    Likes Received:
    7
    Update:

    No more hangs for 10 days now with the Wheezy backports 3.16 kernel on PVE 3.4 with a Haswell CPU.
     
  12. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    I have a few Ubuntu 14.04 VMs running 3.16 and they have never had this problem.
    That seems to support the idea that the problem might be in the guest kernel.
     
  13. robhost

    robhost Member
    Proxmox Subscriber

    Joined:
    Jun 15, 2014
    Messages:
    185
    Likes Received:
    7
    Not really. We changed our host kernel to 3.16, so I think the *-pve kernel has some problems with Haswell.
    As we're running lots of CentOS 7 servers with the stock RHEL kernel without problems, it must be a PVE-specific problem, maybe in combination with their qemu packages or something.
     
  14. e100

    e100 Active Member
    Proxmox Subscriber

    Joined:
    Nov 6, 2010
    Messages:
    1,235
    Likes Received:
    24
    This seems like a significant clue.
     
  15. Juniorrrrr

    Juniorrrrr New Member

    Joined:
    Sep 29, 2015
    Messages:
    3
    Likes Received:
    0
    Same problem here: "Ivy Bridge" servers work fine, but on "Haswell-EP" we have constant freezing.

    [Attachment: ssd.jpg]
     
  16. mir

    mir Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 14, 2012
    Messages:
    3,480
    Likes Received:
    96
    Have you disabled all C-states in BIOS?
     
  17. Juniorrrrr

    Juniorrrrr New Member

    Joined:
    Sep 29, 2015
    Messages:
    3
    Likes Received:
    0
    I had disabled everything under Power Technology.
    Now I have set it to Custom and disabled everything under C-States.

    [Attachments: Screenshot_1.jpg, Screenshot_2.jpg]
     
  18. robhost

    robhost Member
    Proxmox Subscriber

    Joined:
    Jun 15, 2014
    Messages:
    185
    Likes Received:
    7

    Did you try kernel 3.16?
    We have been running this kernel for 3 weeks without any freeze!
     
  19. avladulescu

    avladulescu New Member

    Joined:
    Mar 3, 2015
    Messages:
    20
    Likes Received:
    0
    I can confirm this; it still happens with exactly the same symptoms that e100 described.

    I have 2 sites running Dell R730xd servers with 2 x E5-2630 v3 processors, and this issue still manifests on highly loaded VMs. NUMA is enabled, the disks and network are set to virtio, and the SCSI controller type is the default LSI.

    We are talking about version 3.4-11 with the 3.10.0-13-pve kernel installed. There is no pattern to this problem, but from what I can tell, on one site I have all VMs (KVM, not OpenVZ) running an updated Debian 7.9 (local SSD storage via a HW RAID controller), and on the second site CentOS 6.7 with 2.6.32-573.8.1.el6.x86_64 (running via NFS, also tested via iSCSI) on a dedicated 10G network to a central storage solution. So the argument about local vs. remote storage is pointless now.

    An interesting point that I see nobody has replied to is nanonettr's post; I will give that change a try.

    The issue and what I have tested are described in more detail here (problem #2): http://forum.proxmox.com/threads/24277-VM-high-vCPU-usage-issues

    On the other hand, here is an additional piece of information which I have tested on both setups:

    - when the VM is in the locked state and is printing the kernel hung-task timeout messages on the console, adding another disk (regardless of site or storage type) via the Proxmox GUI immediately pulls the locked CPU thread's I/O wait time from 100% down to 0 and everything comes back to normal.

    So we have 2 different setups, different network and storage designs, and different KVM guests with different kernels!

    Therefore adding another disk, just to remove it again after the VM calms down (no formatting or other operations on the drive are needed), somehow triggers a disk/configuration refresh in qemu that snaps the VM out of the locked state.

    I tested whether adding/removing any component has this effect on the affected VMs, by mounting/unmounting an ISO image and adding/removing a network card, but the VM only reacts to adding/removing a hard disk.

    Any other clues ?
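    For anyone who wants to try the same workaround without the GUI, a rough sketch with the qm CLI could look like this (VM ID 100, the storage 'local' and the disk slot are placeholders, and the exact options may differ between PVE versions):

    Code:
    # attach a small temporary 1 GB disk to the locked VM
    qm set 100 --virtio1 local:1

    # once the VM has calmed down, detach it again (it remains as an unused
    # volume in the config and can then be deleted from the GUI or the storage)
    qm set 100 --delete virtio1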
     
    #59 avladulescu, Nov 25, 2015
    Last edited: Nov 25, 2015
  20. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,302
    Likes Received:
    131
    Is that with disk hotplug enabled?
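    To check whether hotplug is enabled for a VM, something like this could be used (a sketch; VM ID 100 is a placeholder, and the available hotplug categories depend on the PVE version):

    Code:
    # show the VM configuration and look for a "hotplug:" line
    qm config 100 | grep hotplug

    # enable disk and network hotplug for the VM
    qm set 100 --hotplug disk,network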
     