Linux guest problems on new Haswell-EP processors

robhost · Nov 25, 2015

Any other clues ?

Yes, use kernel 3.16 from Debian backports. We have been running a few Haswell nodes with this kernel for 2 months without any problems. Before that we had daily VM freezes.
This works only for KVM; OpenVZ is not included in this kernel. If you're on OpenVZ, there is no fix afaik.

avladulescu · Nov 25, 2015

@ spirit - Yes, it is set to the default value, hotplug enabled (Disk, Network and USB). Are you suggesting something?

@ robhost - I said I am not using OpenVZ, as pointed out in the separate link I provided, which fully describes the issue. On the other hand, I am not running only Debian, so backports are not an option everywhere: as I mentioned, I also use CentOS 6.7 with the 2.6.x branch, and changing that is out of the question because these are production environments.

I have checked the C-states configuration in the BIOS of the Dell servers. C-states were disabled, as I have set the performance profile to "Performance" on both servers.

Reference: page 21 of Dell's documentation,

"BIOS Performance and Power Tuning Guidelines for Dell PowerEdge 12th Generation Servers"

Therefore, nanonettr's post doesn't apply: a quick look through the boot-time dmesg output for C-state CPU settings finds no mention of them.
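Besides dmesg, the idle states the kernel actually registered can be read from sysfs. Below is a minimal sketch, assuming the standard Linux cpuidle layout under /sys/devices/system/cpu/cpu0/cpuidle; the function name list_cpuidle_states is only illustrative and not part of any tool mentioned in this thread.

```python
from pathlib import Path


def list_cpuidle_states(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return (name, disabled) pairs for each idle state exposed for one CPU."""
    base = Path(cpu_dir)
    states = []
    if not base.is_dir():
        # Directory absent: the kernel registered no cpuidle states for this CPU.
        return states
    # Entries are named state0, state1, ...; sort them numerically.
    for state in sorted(base.glob("state*"), key=lambda p: int(p.name[5:])):
        name = (state / "name").read_text().strip()
        disable_file = state / "disable"
        disabled = disable_file.exists() and disable_file.read_text().strip() == "1"
        states.append((name, disabled))
    return states


if __name__ == "__main__":
    for name, disabled in list_cpuidle_states():
        print(name, "disabled" if disabled else "enabled")
```

An empty list on the host would be consistent with C-states not being in use, but it is only a hint: the idle driver in use can differ between kernels.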

So, as stated before, adding/removing a drive snaps the VM out of the stalled I/O freeze without performing stop/start commands on the VM, so it must have something to do with qemu disk management or refresh.

I am still digging to see if I can force a stupid workaround via a cronjob and a set of qm monitor commands that query the disk state of the affected VMs, for example on a 5/10 minute cycle.

Any other suggestions are appreciated.

BR

robhost · Nov 25, 2015

@ robhost - I said I am not using OpenVZ, as pointed out in the separate link I provided, which fully describes the issue. On the other hand, I am not running only Debian, so backports are not an option everywhere: as I mentioned, I also use CentOS 6.7 with the 2.6.x branch, and changing that is out of the question because these are production environments.

You missed the point: use the backported kernel on the PVE host, NOT on your VMs.

avladulescu · Nov 25, 2015

Where in the world did you find a backports PVE kernel?

Besides the rest, I've got these 2 in sources.list:

deb http://download.proxmox.com/debian wheezy pve-no-subscription
deb http://http.debian.net/debian wheezy-backports main

Are you running the stock backports kernel from the Debian repo directly on the bare metal server, merged into a Proxmox 3.x install? At least this is what I understand from your statement.

Furthermore, I found some crucial information on this topic, which makes me ask myself why the Proxmox devs/staff are keeping low & silent about it.

Go over the following threads I found related to the current issue (and mind the dates too):

http://pve.proxmox.com/pipermail/pve-user/2015-May/008736.html

It still casts a shadow on the final answer, but brings another argument into account: the qcow2 compat version, compat: 0.10 versus compat: 1.1.
In my case, though, I installed the bare metal from the 3.4.x branch CD from the start and didn't upgrade from 3.1.
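To see which compat level an existing image has, the text output of `qemu-img info` can be parsed. This is a minimal sketch, assuming qemu-img is installed on the host and prints a "compat:" line under its format specific information for qcow2 images; the helper names and the sample text in the usage note are illustrative only.

```python
import re
import subprocess


def qcow2_compat(info_text):
    """Extract the 'compat' value from `qemu-img info` text output, or None."""
    match = re.search(r"^\s*compat:\s*(\S+)", info_text, re.MULTILINE)
    return match.group(1) if match else None


def image_compat(path):
    """Run `qemu-img info` on an image file and return its compat level."""
    result = subprocess.run(
        ["qemu-img", "info", path],
        capture_output=True, text=True, check=True,
    )
    return qcow2_compat(result.stdout)
```

For example, qcow2_compat("file format: qcow2\nFormat specific information:\n    compat: 1.1\n") gives "1.1", while text for a raw image has no compat line and gives None.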

http://pve.proxmox.com/pipermail/pve-devel/2014-October/012909.html

I am wondering why the iothread setting is still hidden, since we're on exactly matching qemu versions:

ii pve-qemu-kvm 2.2-13 amd64 Full virtualization on x86 hardware
ii qemu-server 3.4-6 amd64 Qemu Server Tools

Or might this be a fix intended for paid subscriptions only?

spirit · Nov 25, 2015

Hi, I'm running 15x Dell R630 with Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz processors, kernel 3.10, and I don't have any problem.
All on NFS or Ceph, no local disks.

(running around 1000 VMs, Debian and Windows guests)

spirit · Nov 25, 2015

avladulescu said:
@ spirit - Yes, it is set to the default value, hotplug enabled (Disk, Network and USB). Are you suggesting something?

BR

Can you try without hotplug? (I want to know whether it occurs when a new disk is created on the storage, or whether the action of plugging it into qemu does something in the qemu I/O thread.)

avladulescu · Nov 25, 2015

spirit said:
Hi, I'm running 15x Dell R630 with Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz processors, kernel 3.10, and I don't have any problem.
All on NFS or Ceph, no local disks.

(running around 1000 VMs, Debian and Windows guests)

Still, that doesn't mean the issue doesn't exist.

There is certainly a difference between my 730xd's (running on both local and remote storage) and your R630's in terms of chipset and CPUs, but I have a hard time believing that the instruction set of my 2630 v3 CPUs is so new that it causes trouble here.

There is definitely a catch on the software side, as the Dell BIOS is very straightforward.

robhost · Nov 25, 2015

Are you running the stock backports kernel from the Debian repo directly on the bare metal server, merged into a Proxmox 3.x install? At least this is what I understand from your statement.

Yes, "3.16.0-0.bpo.4-amd64" from wheezy-backports directly on the PVE host (HP DL180 Gen9). 2 months without any VM freeze.

spirit · Nov 25, 2015

Also, I'm not sure about your controller model, but LSI had a big firmware bug last year on the 9207 controller:

https://community.nexenta.com/thread/1053
https://ceph.com/planet/downgrade-lsi-9207-to-p19-firmware/

Also, as you have an xd server, I think you have a lot of disks, so double-check that the backplane firmware is updated to the latest version.

avladulescu · Nov 26, 2015

Thank you for sharing this very useful information !

My servers are running the H730 Mini (MegaRAID SAS-3 3108 [Invader] (rev 02)), which is based, if I'm not mistaken, on the LSI 93xx card family.

However, it doesn't apply. As I described, I am running 2 setups in different locations, one with local storage and one with remote storage. So the problem can't come only from the LSI firmware bug, because one of the setups runs on central storage and the problem still exists there (with the 2.6.x kernel, CentOS 6.7, that I described before).

So, while some of my servers are on the x-density framework, not all of them use local storage as their source storage.

spirit · Nov 26, 2015

Ok.

About the BIOS, have you done the latest upgrade? (There are some CPU microcode updates for Intel processors.)

avladulescu · Nov 26, 2015

Yes, all up to date: BIOS, firmware, HW RAID firmware.

How can I send you a PM on this forum? I can't locate the PM button to contact you privately for a quick chat/call.

I don't want to spam this forum with chat-like messages; I will return to the thread once I have some relevant information to share with others.

I might have some leads, but I won't post them until they are a confirmed fix.

spirit · Nov 26, 2015

feel free to contact me on my work email : aderumier@odiso.com

avladulescu · Nov 30, 2015

As promised, I am back with more info on the topic, to shed some light with the research I've managed to gather so far.

Since my last post, I have focused on how the guests' internal disk scheduler is set, changing it from the default to deadline on all VMs. This improved stability quite a bit, but was not enough to stop the bug from manifesting.
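To confirm which scheduler a guest is actually using, the active entry can be read from /sys/block/<device>/queue/scheduler, where the kernel marks it with square brackets. A minimal sketch, assuming that standard sysfs layout inside the guest; the function names are illustrative.

```python
import re
from pathlib import Path


def active_scheduler(text):
    """Return the scheduler marked with [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def schedulers(sys_block="/sys/block"):
    """Map each block device name to its active I/O scheduler."""
    result = {}
    for sched_file in Path(sys_block).glob("*/queue/scheduler"):
        device = sched_file.parent.parent.name
        result[device] = active_scheduler(sched_file.read_text())
    return result


if __name__ == "__main__":
    for device, sched in sorted(schedulers().items()):
        print(device, sched)
```

Running it in each VM after the change shows whether deadline really took effect on every disk.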

Since the last post, I have also been constantly checking the status of the VMs and their resource utilisation via my NMS. What I could observe is that when a lightly loaded VM gets stuck, its memory buffers and cached values start to rise steadily (even if the lock is cleared afterwards via the add/remove disk method described earlier). While the vCPUs are in a "locked state", the host's context switches, system interrupts and load average skyrocket on the graphs and the system's I/O activity freezes completely.
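For anyone without an NMS, the same host counters can be sampled by hand from /proc/stat, whose "ctxt" line holds the total context switches and whose "intr" line starts with the total interrupts serviced since boot. A minimal sketch under that assumption; the 5 second interval and the function names are arbitrary choices for illustration.

```python
import time


def parse_proc_stat(text):
    """Return the cumulative 'ctxt' and 'intr' totals from /proc/stat content."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] in ("ctxt", "intr"):
            # For 'intr' the first number is the total across all interrupts.
            counters[fields[0]] = int(fields[1])
    return counters


def rates(before, after, seconds):
    """Per-second rate of each counter between two samples."""
    return {key: (after[key] - before[key]) / seconds for key in before}


if __name__ == "__main__":
    interval = 5  # seconds between samples; arbitrary
    with open("/proc/stat") as fh:
        first = parse_proc_stat(fh.read())
    time.sleep(interval)
    with open("/proc/stat") as fh:
        second = parse_proc_stat(fh.read())
    print(rates(first, second, interval))
```

A sudden jump in both rates while a guest's I/O is frozen would match the pattern described above.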

This started to look more like a memory leak, so the first step was to check what is new/different in the v3 processors in comparison with older versions.
The following link provided a starting point: https://software.intel.com/en-us/bl...ies-on-the-latest-intel-xeon-are-you-ready-to

Of all the technologies described on that site, VMCS Shadowing turned up the most kernel error reports on current (in use) kernel branches. Further searching reveals the "kvm_vm_ioctl" KVM kernel function to be the central point of all sorts of misbehaviour.

Below I have added a few useful links I could find related to this:

https://www.kernel.org/pub/linux/kernel/v3.x/ChangeLog-3.10.47 - search for commit 264f8746aa6ebf1a62588c653a5e3c4891f69fee
http://www.gossamer-threads.com/lists/linux/kernel/2207193
http://stackoverflow.com/questions/33192729/vmwrite-error-when-updating-vmcs-from-kvm-vm-ioctl
https://bugzilla.kernel.org/show_bug.cgi?id=93251 - affecting 3.19 branch

So, as I understand it (I'm not a kernel developer), the "leak" comes from the following logic:

Running a virtual machine allocates the configured vCPUs to the KVM process (vCPU & ioctl) and sets up I/O scheduling for the KVM instance in qemu-kvm. The vCPUs are tied to the physical CPUs and RAM, and are bound through a KVM-specific instruction set to open ioctl system calls. These should then create a set of file descriptors in the current process that handle disk access.

The lock-up only occurs when kvm_vm_ioctl tries to free up some previously allocated memory resources.

What is most interesting is that there is no 3.16.x version in the pve repository to test with. This bug was supposedly fixed in 3.10, but we don't know how 3.10.47 relates to the pve-kernel-3.10.0-13-pve revision or whether the latter includes the fix. It might explain why robhost is running stable on the 3.16.x branch from backports, and the bug supposedly got fixed in the 4.1.x kernel branch.

Regarding Spirit's CPU model in comparison to mine: it is well known that CPUs of the same generation (mine entry-to-mid range, his high end) share a major architecture but have minor differences between the high and low end of the range, and the same goes for the CPU microcode support included in each BIOS update by the hardware/mobo vendors.

Currently I'm stability testing the 3.16.x kernel from backports, which forces me to drop the stable 3.10.x pve kernel release (maybe until a 3.16 pve kernel appears, although I don't believe it will, since 3.10 is dead next year and work on the 4.x branch for Proxmox 4 is too far along for somebody to reconsider bug fixing on a dead-end kernel version/product).

Hopefully my logic and explanations are close to right and will help others in the future.

pjkenned · Nov 30, 2015

No clues, but I have seen this on two different E5 v3 processors with Proxmox VE 4.0.
System 1: dual E5-2650L V3, Supermicro motherboard
System 2: dual E5-2683 V3, ASRock motherboard

spirit · Nov 30, 2015

avladulescu said:
As promised, I am back with more info on the topic, to shed some light with the research I've managed to gather so far.

Nice debugging. I was looking through the Red Hat 3.10 kernel update changelogs (because the current Proxmox 3.10 kernel has not been updated since May 2015), but nothing very new related to KVM has been backported by Red Hat.

If you need a more recent kernel than 3.16, you can try the Proxmox 4.0 kernel; it should work:

http://download.proxmox.com/debian/...d64/pve-kernel-4.2.3-2-pve_4.2.3-22_amd64.deb
http://download.proxmox.com/debian/...ption/binary-amd64/pve-firmware_1.1-7_all.deb

avladulescu · Nov 30, 2015

@ Spirit:

I can neither change nor test this on these systems, because they are production environments and my main concern is to avoid more sleepless nights due to this stupid bug; I have already had plenty.

pjkenned said:
No clues, but I have seen this on two different E5 v3 processors with Proxmox VE 4.0.
System 1: dual E5-2650L V3, Supermicro motherboard
System 2: dual E5-2683 V3, ASRock motherboard

Separately, I have a different environment running on Core i7 socket 2011 v1 CPUs and have never encountered this issue there.

If I am to take a wild general guess, considering what a Linux kernel is and how it works (it molds itself onto the hardware), I suppose Intel's v3 architecture is one step ahead of the kernel development schedule and can't yet be considered a fully stable & supported platform, whereas v1 poses no trouble.

robhost · Nov 30, 2015

If I am to take a wild general guess, considering what a Linux kernel is and how it works (it molds itself onto the hardware), I suppose Intel's v3 architecture is one step ahead of the kernel development schedule and can't yet be considered a fully stable & supported platform, whereas v1 poses no trouble.

It's not kernel development in general, but the RHEL kernels (which PVE uses in a patched version). This is why the newer 3.16 and 4.0 kernels do not have this problem (and neither does PVE 4.0, which relies on its 4.0 kernel).
Imho upgrading to PVE 4, or using the backported kernel 3.16 or even the 4.0 kernel from PVE 4 in PVE 3, is the only way to fix this issue.

e100 · Dec 3, 2015

For me this issue only happens with virtIO, IDE works fine.
Anyone tried SCSI virtio?

I've got more important fish to fry than this problem so I've not invested much time on this problem since the IDE workaround seems to be sufficient for now.
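To find which VMs on a node still use virtio disks (and so might need the IDE workaround), the VM configuration files can be scanned for their disk keys. This is a rough sketch, assuming the usual Proxmox layout where each VM has a text file under /etc/pve/qemu-server/ with lines such as "virtio0: ..." or "ide0: ..."; the function name and the sample lines in the usage note are illustrative, not taken from a real system.

```python
import re

# Disk keys look like ide0, sata1, scsi0 or virtio0, followed by the volume spec.
DISK_KEY = re.compile(r"^(ide|sata|scsi|virtio)(\d+):\s*(.+)$")


def disk_buses(config_text):
    """Map each disk key (e.g. 'virtio0') to its bus type, skipping CD-ROM drives."""
    buses = {}
    for line in config_text.splitlines():
        match = DISK_KEY.match(line.strip())
        if match and "media=cdrom" not in match.group(3):
            buses[match.group(1) + match.group(2)] = match.group(1)
    return buses
```

For a config containing "ide2: none,media=cdrom" and "virtio0: local:100/vm-100-disk-1.qcow2,size=32G", this returns only the virtio0 entry, since the CD-ROM drive is skipped.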

avladulescu · Dec 3, 2015

@ e100 - I tested every possible setting, even ISCSI and hotplug disabled, all with the same results. But while some are still frying fish, other stubborn people prefer to have a solid solution to this bug and use virtio (the best performance) without any issue.

As for tested solutions, I can confirm that the backports kernel 3.16 still caused issues on the virtual machines I was running. On one setup I was still experiencing lock-ups on the VMs running 2.6 kernels (CentOS 6.7), and vCPU & I/O wait on 3.16 guests (Debian 8.2) misbehaved when a VM was transferred to another host in the cluster. On another setup, which runs only Debian 7.9 VMs, the backports kernel did solve the issue.

I can now conclude that all these issues, which I posted and explained in another thread of mine ( http://forum.proxmox.com/threads/24277-VM-high-vCPU-usage-issues ), are totally gone.
Following Spirit's trial & error suggestion, namely to upgrade all cluster nodes to the 4.2.3-2-pve kernel even while running Proxmox 3.4 and see the outcome, has succeeded.

A rule of thumb I identified a couple of years ago, when I started finding my way around how Proxmox works and what its pros and cons are, is an important fact to keep in mind:

Run the hypervisor host with a kernel version at least equal to, or on the same branch as, that of the VMs you're planning to deploy on it.

My sanity testing covered the following OS types and kernel versions:

CentOS 6.7 - 2.6.32-573
Debian 7.9 - 3.2
Debian 8.2 - 3.16
Ubuntu 14.04.1 LTS - 3.19
Ubuntu 15.10 - 4.2.0-16
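The rule of thumb above can be checked mechanically by comparing the major.minor part of the host kernel release against each guest kernel. A minimal sketch, assuming plain numeric comparison of version strings is good enough; it knows nothing about distribution backports (a RHEL-based 3.10 carries many newer patches), and the function names are illustrative.

```python
import re


def kernel_tuple(release):
    """Turn a release string like '4.2.3-2-pve' into a tuple such as (4, 2, 3)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(part or 0) for part in match.groups())


def host_covers_guests(host_release, guest_releases):
    """For each guest, report whether the host's major.minor is at least the guest's."""
    host = kernel_tuple(host_release)[:2]
    return {
        name: host >= kernel_tuple(release)[:2]
        for name, release in guest_releases.items()
    }


if __name__ == "__main__":
    guests = {
        "CentOS 6.7": "2.6.32-573",
        "Debian 7.9": "3.2",
        "Debian 8.2": "3.16",
        "Ubuntu 14.04.1": "3.19",
        "Ubuntu 15.10": "4.2.0-16",
    }
    print(host_covers_guests("4.2.3-2-pve", guests))
```

With the 4.2.3-2-pve host kernel every guest in the list is covered, whereas a 3.10 host kernel would only cover the CentOS 6.7 and Debian 7.9 guests.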

The testing was done on a private test cluster as well as on a production cluster, and Ubuntu VMs also don't experience vCPU lock-ups after this.

So, as a final solution, go ahead and install the following on your Proxmox host (every cluster node), not inside the guests:

- pve-firmware_1.1-7_all.deb
- pve-headers-4.2.3-2-pve_4.2.3-22_amd64.deb
- pve-kernel-4.2.3-2-pve_4.2.3-22_amd64.deb

Cheers
