Linux guest problems on new Haswell-EP processors

spirit,

The beta kernel you provided causes more problems than it resolves.
After adding it to three nodes I now have a constant barrage of corosync errors:
Code:
Jan 13 03:19:28 vm5 corosync[3982]:   [TOTEM ] Retransmit List: 15a90b 15a90d 15a90f
Jan 13 03:19:28 vm5 corosync[3982]:   [TOTEM ] Retransmit List: 15a90d
Jan 13 03:19:28 vm5 corosync[3982]:   [TOTEM ] Retransmit List: 15a911 15a913
Jan 13 03:19:28 vm5 corosync[3982]:   [TOTEM ] Retransmit List: 15a911
Jan 13 03:19:28 vm5 corosync[3982]:   [TOTEM ] Retransmit List: 15a915

I went back to the 3.10 kernel in the Enterprise repo on all three machines.
After rebooting all three, the Retransmit List errors stopped.
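In case it helps anyone else chasing retransmit storms: an easy way to confirm the ring is healthy again after a reboot is to ask corosync directly (standard corosync tooling; the grep is just one way to watch for recurrences):
Code:
# totem ring status; a healthy ring reports "no faults"
corosync-cfgtool -s
# watch for new retransmit entries
tail -f /var/log/syslog | grep TOTEM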

I suspect this is related to the IPoIB kernel changes that I pointed out earlier.
DRBD gets timeouts over IPoIB, resulting in split brains, when using either 3.10 from the repo or the beta kernel you provided.
corosync has non-stop retransmits with the 3.10 beta and IPoIB.

At this point I think it is safe to say that 3.10 + IPoIB with an IB card that does not use the mlx4 driver = problems.
Note: I only assume the mlx4 driver works because it is the only one patched to resolve the race condition that was identified.
 
Nice to hear ;-)
Code:
lsmod | grep mlx4
mlx4_ib               126754  0
ib_sa                  24273  6 mlx4_ib,ib_cm,rdma_cm,rdma_ucm,ib_ipoib,ib_srp
ib_mad                 39118  4 ib_sa,mlx4_ib,ib_cm,ib_umad
ib_core                74491 12 ib_mad,ib_sa,mlx4_ib,ib_cm,ib_uverbs,iw_cm,rdma_cm,rdma_ucm,ib_umad,ib_ipoib,ib_srp,ib_iser
mlx4_core             212606  1 mlx4_ib
 
Interesting, I have not tried that kernel in the guests.

Currently I am testing whether changing from virtio-blk to IDE resolves the issue.
Looks promising so far, but it is not yet conclusive.
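For reference, the change is host-side, in the VM config, not inside the guest. A minimal sketch, assuming VMID 101 on local storage (the volume name is just an example):
Code:
# /etc/pve/qemu-server/101.conf -- before
virtio0: local:101/vm-101-disk-1.raw
bootdisk: virtio0
# after: same volume attached via IDE instead of virtio-blk
ide0: local:101/vm-101-disk-1.raw
bootdisk: ide0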


You changed it inside the guest VM?
 
...
I suspect this is related to the IPoIB kernel changes that I pointed out earlier.
DRBD gets timeouts over IPoIB, resulting in split brains, when using either 3.10 from the repo or the beta kernel you provided.
corosync has non-stop retransmits with the 3.10 beta and IPoIB.
...
Hi,
Strange! I also get DRBD errors with InfiniBand on 2.6.32 when running verify. With 10Gb Ethernet everything runs fine.

A test with 3.10 is on my todo list, but it looks like I can save myself the time...

Udo
 
I've not had DRBD issues with 2.6.32, even running verify.

But 3.10 and the beta 3.10 kernel spirit provided have been nothing but problems on some, but not all, servers.
I had started another thread about that but it never went anywhere.

I only mentioned it here to provide feedback to spirit on the beta kernel.

3.10 will be coming someday whether I want it or not, and for IPoIB it seems to be unusable.
From what I read all the IB drivers need to use GFP_NOIO when allocating memory to prevent a deadlock with IPoIB.
Only mlx4 has been updated so far and I've not seen any activity related to fixing the others.

http://lkml.org/lkml/2014/4/24/543
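The gist of that thread, as I read it: a GFP_KERNEL allocation may trigger memory reclaim, reclaim may issue block I/O, and if that I/O has to travel over IPoIB it needs the very driver that is blocked in the allocation. A purely illustrative sketch (none of these names come from the real drivers):
Code:
#include <linux/slab.h>

struct qp_buf { char data[256]; };      /* hypothetical structure */

static struct qp_buf *alloc_qp_buf(void)
{
        /* BAD on the IPoIB I/O path: reclaim may recurse into block I/O
         * that needs this driver to complete -> deadlock.
         * return kzalloc(sizeof(struct qp_buf), GFP_KERNEL); */

        /* Safe: the allocator may not start new I/O to satisfy this */
        return kzalloc(sizeof(struct qp_buf), GFP_NOIO);
}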
 
I've not had DRBD issues with 2.6.32, even running verify.

But 3.10 and the beta 3.10 kernel spirit provided have been nothing but problems on some, but not all, servers.
I had started another thread about that but it never went anywhere.

I only mentioned it here to provide feedback to spirit on the beta kernel.

3.10 will be coming someday whether I want it or not, and for IPoIB it seems to be unusable.
From what I read all the IB drivers need to use GFP_NOIO when allocating memory to prevent a deadlock with IPoIB.
Only mlx4 has been updated so far and I've not seen any activity related to fixing the others.

http://lkml.org/lkml/2014/4/24/543

Hi,

any info on the Mellanox website about RHEL7 support? Because the 3.10 kernel is the RHEL7 kernel.

(my beta kernel is based on RHEL 7.1 beta)
 
Hi,

any info on the Mellanox website about RHEL7 support? Because the 3.10 kernel is the RHEL7 kernel.

(my beta kernel is based on RHEL 7.1 beta)

I've not seen anything on the Mellanox site; their latest drivers there seem to be outdated too.
The only info I've seen is on the Linux kernel mailing list.

On the original topic: so far, not using virtio-blk has prevented any VM lockups.
Wonder if the wheezy kernel has a buggy virtio driver.
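If someone wants to compare guest driver builds, the virtio_blk module info from inside a guest shows which kernel it was built against (standard modinfo usage, nothing wheezy-specific):
Code:
# inside the guest
uname -r
modinfo virtio_blk | grep -i vermagic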
 
I don't know about others, but what I did to resolve this problem was a CPU downgrade from Xeon E5-2620 v2 to Xeon E5-2620.
Since then, we haven't had this error anymore. I would like to know if this problem exists with the latest Intel Xeon v3 CPUs.
 
The only solution I have found is to not use virtio for the disk. When using IDE this problem never happens.

Yeah. That's been my observation, too. Ugly workaround, though. :(

*Edit* I've tried SCSI, too. Only IDE works around the bug.
 
I've not tried SATA....

I find it most strange that I only see this on VMs with very little disk IO. VMs running nothing but memcached have the problem, whereas a busy web server constantly writing logs never had the problem.

Wish I had a way to trigger the problem on demand; that would help to identify the cause.
 
I've not tried SATA....

I find it most strange that I only see this on VMs with very little disk IO. VMs running nothing but memcached have the problem, whereas a busy web server constantly writing logs never had the problem.

Wish I had a way to trigger the problem on demand; that would help to identify the cause.

I've actually noticed the opposite: the hangs correlate with heavy I/O. ;-) Definitely happens more often during backups.
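If heavy I/O really is the trigger, something like this inside a test VM might reproduce it on demand. Just a sketch; fio must be installed, and the path, size and runtime are arbitrary:
Code:
# sustained sequential read/write load on the virtio disk for 10 minutes
fio --name=stress --filename=/tmp/fio.test --size=2G \
    --rw=readwrite --bs=1M --direct=1 --numjobs=4 \
    --time_based --runtime=600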
 
Since then, we haven't had this error anymore. I would like to know if this problem exists with the latest Intel Xeon v3 CPUs.

Hi, I'm running an E5-2687W v3 and an E5-2603 v2 without any problem here (pve-kernel 3.10).
This is with Dell PowerEdge R620/R630.

Also, I have done all BIOS updates (which include some processor microcode updates).
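If you want to compare notes, the microcode revision the host is actually running can be read from the live system (standard Linux, nothing Dell-specific):
Code:
grep microcode /proc/cpuinfo | sort -u
dmesg | grep -i microcode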


I have never seen this kind of error like you describe.
 
What kind of storage are you folks running? At the moment I'm using local storage: RAID6 on 10k SAS drives with the Dell Perc 6i.

Servers are two Dell 2970s (AMD) and one 2950 (Intel), and all three boxes are exhibiting this hanging bug. All firmware is up-to-date.
 