Proxmox VE 4.1 Infiniband problem

Hi.
I am sorry for my bad English :(

I have two clusters, one on Proxmox VE 3.4 and one on 4.1.
All equipment is absolutely identical in both clusters.
All nodes have 2× Xeon X5650, 96 GB RAM and an InfiniBand card: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0).
Both clusters use the same unmanaged Mellanox InfiniBand switches.
In version 3.4 InfiniBand works well, without any problems.
In version 4.1, under high network activity (e.g. during vzdump), the network becomes unavailable.
After that, the event log shows a message about loss of quorum and then the node reboots.
 
I can confirm that this is a real problem.

Here is what I know:
1. There are no kernel messages when the IPoIB network stops working.
2. If I ifdown and then ifup the IPoIB interface, the IB network starts working again (see the snippet after this list).
3. It only seems to happen if the server is under heavy load (lots of IO and/or lots of CPU usage).
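For what it's worth, this is the workaround from point 2 as I run it. A sketch only; it assumes the interface is named ib0, as in the log message quoted further down the thread.

Code:
# bounce the IPoIB interface; works around the hang but is not a fix
ifdown ib0 && ifup ib0
# or, if the interface is not managed in /etc/network/interfaces:
# ip link set ib0 down && ip link set ib0 up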

Some additional information.

I had this problem when doing an offline disk move in the Proxmox interface from CEPH to DRBD9.
My disks and network are capable of syncing at 100 MB/sec. With c-max-rate set to 102400 (100 MB/s) I experienced this network issue multiple times.
I changed c-max-rate to 10240 (10 MB/sec) and do not have this network issue.
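For reference, this is roughly how that resync rate limit is set. A sketch only: r0 is just an example resource name, the value is in KiB/s, and with drbdmanage the option may need to be set through its own tooling rather than by editing the file.

Code:
# in the resource file, e.g. /etc/drbd.d/r0.res
disk {
    c-max-rate 10240;    # cap resync at ~10 MB/s (value in KiB/s)
}
# re-apply the configuration without downtime
drbdadm adjust r0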

However, on some nodes the network will still drop during a disk move even with c-max-rate set to 10 MB/sec.
Most likely the additional CEPH traffic is triggering the problem.

Code:
# pveversion
pve-manager/4.1-22/aca130cf (running kernel: 4.2.8-1-pve)

The nodes where I have seen this issue have Dual Intel(R) Xeon(R) CPU E5420 @ 2.50GHz
 
I think I discovered my issue.

Some of the DRBD resources were listening on InfiniBand and others on Ethernet.
After getting everything listening on InfiniBand, all seems to work just fine.

Apparently I messed up when adding one of the nodes and forgot to specify the IP address of the IB network.
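For anyone hitting the same thing, this is roughly how to check which addresses the DRBD resources are actually using; the port is only the DRBD default.

Code:
# every peer address printed here should be on the IPoIB subnet
drbdadm dump all | grep address
# confirm the live replication connections as well (7788 is the default DRBD port)
ss -tn | grep 7788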
 
After two months of use I downgraded the cluster from version 4.1 to version 3.4.
3.4 works fine.
 
3.4 works fine
I'll not argue with that; as much as I hate to say it, 4.x is not production-ready yet.

I had another incident of the IB network interface failing. Again I found processes that were making a connection from the Ethernet IP to another server's IPoIB IP address. This time, instead of DRBD, it looked like a VNC connection from another Proxmox node to a KVM process.
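This is how I spotted it, more or less (a sketch; the subnet is a placeholder for my IPoIB network):

Code:
# list established TCP connections with the owning process, then look for
# anything whose peer is an IPoIB address but whose local side is the ethernet IP
ss -tnp state established | grep '10.10.10.'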

There is clearly something wrong with IPoIB on the newer 4.2 and 4.4 kernels.
 
I've seen those bugs; the problem I am having does not result in just poor performance but in a complete network outage.
It could be the same bug, but my symptoms seem to be a little different.

I downgraded the kernel to 4.2.8-37 that Andrey reported as not having problems, but it made no difference for me.
 
The kernel guys are looking into this bug. Expect a patch next week.
I think I might be suffering from a different known bug that has not been fixed for my IB driver.

Back in 2014 the mlx4 driver was updated to use GFP_NOIO for QP creation when using connected mode and the IPoIB driver was updated to request GFP_NOIO from the hardware drivers.
This was to prevent a deadlock when using NFS on IPoIB and it was speculated that other things like iSCSI could possibly trigger a deadlock too.
https://lkml.org/lkml/2014/5/11/50

In Jan 2016 the qib driver was also updated with this fix:
http://comments.gmane.org/gmane.linux.drivers.rdma/32914

The driver for my IB cards has only been updated to recognize that GFP_NOIO was requested and report the following message:
ib0: can't use GFP_NOIO for QPs on device mthca0, using GFP_KERNEL
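A quick way to check whether your own card is affected (nothing here is specific to my setup):

Code:
# did the HCA driver fall back to GFP_KERNEL? (the message quoted above)
dmesg | grep -i gfp_noio
# which low-level driver is the card using (mthca vs mlx4 etc.)?
lspci -k | grep -A 3 -i mellanox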

My IB network had been stable until I started using DRBD; I had been running CEPH over it for months without issue.
I have had stability issues using KRBD, but librbd works just fine.
KRBD involves kernel IO and networking and thus fits in with the GFP_NOIO problem.

DRBD performs IO over the network, so it's quite possible that DRBD9 combined with IPoIB and a driver that is not using GFP_NOIO is triggering deadlocks.
There is plenty of evidence that this might be happening:
1. DRBD commands often stall for a long time.
2. IB interfaces stop passing any traffic.
3. Network problems happen when there is little free RAM and high IO load on DRBD, such as doing a full sync at 100 MB/sec or restoring a VM from backup into DRBD storage.

I've been running DRBD 8.3 and 8.4 on IPoIB for years on this same hardware without problems, but DRBD9 has been nothing but problems.

Today I reconfigured DRBD to use an Ethernet IP address instead of the IB one.
No more DRBD stability issues, no more IB networking issues; it works as expected, just slower.
I had three failed attempts to restore a VM into DRBD when using IPoIB; using Ethernet it restored fine on the first try.

I doubt the Mellanox devs want to invest time fixing a driver for a legacy product; two years and no fix is a pretty good sign.
Any idea how I can get my driver patched? This is above my skill set.
 
Have you considered replacing them with ConnectX family NICs? Development of mthca has stopped, so you will likely never see the GFP_NOIO patch for it. On the other hand, support is still active for any ConnectX family NIC.
 
Attached is the GFP_NOIO patch I wrote for the mthca driver.

I wrote it against the latest pve-kernel source that uses the 4.4 kernel from ubuntu-xenial.
I'll let it bake in my test cluster for a few days and report back if it resolved the issues or not.
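Rough outline of how I build the patched kernel package, in case anyone wants to try it. Paths and the exact way the patch is hooked into the build may differ between pve-kernel versions, so treat this as a sketch.

Code:
git clone git://git.proxmox.com/git/pve-kernel.git
cd pve-kernel
# either wire mthca_noio.patch into the build like the existing patches,
# or apply it by hand to the unpacked kernel tree with: patch -p1 < mthca_noio.patch
make                        # builds the pve-kernel .deb packages
dpkg -i pve-kernel-*.deb    # install on a test node, then reboot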
 

Attachments

  • mthca_noio.patch.txt (3.1 KB)
Hi again.
I installed PVE 4.3 and the latest updates.
Ceph over InfiniBand is working fine.
NFS periodically stalls. If I connect to my NFS server over Ethernet, it works fine (example below).

root@Node-204:~# uname -a
Linux Node-204 4.4.21-1-pve #1 SMP Thu Oct 20 14:56:39 CEST 2016 x86_64 GNU/Linux
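What I mean by connecting over Ethernet, roughly (the addresses are placeholders for my IB and Ethernet server IPs):

Code:
# same export, only the server address changes
mount -t nfs 10.10.10.1:/export /mnt/nfs    # over IPoIB    - periodically stalls
mount -t nfs 192.168.1.1:/export /mnt/nfs   # over Ethernet - works fine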
 
I have also seen the same problems with NFS over InfiniBand.
 
I updated the nodes to the latest kernel version. This does not solve my problem.

root@Node-204:~# uname -a
Linux Node-204 4.4.24-1-pve #1 SMP Mon Nov 14 12:30:24 CET 2016 x86_64 GNU/Linux

After some time the NFS server stalls, usually after migrating a VM HDD of about 20-100 GB.
iperf works fine: I tested InfiniBand with iperf and downloaded and uploaded more than 10 TB without any problem (sketch below).
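The kind of test I ran over the IB link (iperf2 syntax; the address is a placeholder for the peer node's IPoIB IP):

Code:
iperf -s                             # on one node
iperf -c 10.10.10.204 -t 600 -P 4    # on the other node: 10 minutes, 4 parallel streams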

Any ideas?
 
I've had nothing but problems running DRBD over InfiniBand in Proxmox 4.x. CEPH works fine, though.

I got some ConnectX cards to see if that makes a difference but have not had time to test them.

I'm hoping my workload will lighten up in January so I can start looking into the problems again.
 
I got some ConnectX cards to see if that makes a difference but have not had time to test them.
I am using ConnectX cards without any luck. Reluctantly, I have come to the conclusion that Ubuntu's 4.4 kernel is broken with regard to InfiniBand and/or NFS.
 
