Anyone using NFS over Infiniband?

swhite

New Member
Mar 20, 2011
We came upon some Infiniband equipment for cheap and have started testing it on our servers. After fooling with iSCSI for a bit we decided to use NFS, since we mostly use OpenVZ in our setup and NFS is significantly easier and more flexible anyway. It's plenty fast, too, so we're happy about that. My question is whether anyone else out there is using NFS over Infiniband in Proxmox, and whether they have any advice or tips to share for optimizing it? As Infiniband gets cheaper I think the number of people using this setup will increase, and it would be nice to have a repository of advice. I'll be updating this as I work out configurations myself. Thanks!
 
A follow up: the biggest problem I'm seeing is that during heavy usage of the NFS volume, the host of that volume sees some pretty high CPU loads. I'm guessing this is because it is using IPoIB and that introduces CPU overhead. I know NFS over RDMA is possible (http://kernel.org/doc/Documentation/filesystems/nfs/nfs-rdma.txt, though I could not mount the NFS volume using RDMA on the command line), but it seems sharing NFS via Proxmox isn't using RDMA. Does anyone know a way to make NFS use RDMA in Proxmox? Any thoughts at all on the matter would be greatly appreciated, thanks.
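
For reference, the client-side mount described in that kernel doc looks roughly like the sketch below; the server address and paths are placeholders, and the xprtrdma module and port 20049 come straight from nfs-rdma.txt:

Code:
# load the client-side RDMA transport for NFS
modprobe xprtrdma
# mount over RDMA (20049 is the NFS/RDMA port from the kernel doc)
mount -t nfs -o rdma,port=20049 x.x.x.x:/NFS/path /mount/path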
 
In case anyone is interested, I got NFS over RDMA to work, apparently:

Code:
x.x.x.x:/NFS/path /mount/path rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=x.x.x.x,mountvers=3,mountproto=tcp,local_lock=none,addr=x.x.x.x 0 0

I had to use the nfs-utils from Debian backports. I am not sure if this will break anything else, but it seemed to work fine, and existing NFS mounts were unaffected. Unfortunately, RDMA did not perform as I hoped it would. Speeds were ~200 MB/s slower than IPoIB and CPU loads were still high. Something must be missing. I had to use NFS version 3 (same as when using TCP) to mount it, and the rsize and wsize were much lower than I had over IPoIB (compare the above to the rsize=1048576,wsize=1048576 I had with IPoIB).
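
For anyone wanting to reproduce the backports part: in Debian terms nfs-utils is packaged as nfs-common and nfs-kernel-server, so the upgrade looked roughly like this (assuming Squeeze; the exact mirror line may differ for your release):

Code:
# add squeeze-backports and pull the newer NFS userspace from it
echo "deb http://backports.debian.org/debian-backports squeeze-backports main" >> /etc/apt/sources.list
aptitude update
aptitude -t squeeze-backports install nfs-common nfs-kernel-server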

Any thoughts on how to make it faster?
 
I use IPoIB for the Proxmox 2.X Cluster communications and for DRBD replication.
No idea about NFS or NFS with RDMA but I can share what I do know.

We happened to get cheap dual port cards so we bonded them for redundancy.
You can ignore the bonding stuff if you are not using it, but I marked the lines that add performance with a comment:
Code:
auto bond3 
iface bond3 inet static
    address  x.x.x.x
    netmask  255.255.255.0
    pre-up modprobe ib_ipoib
    slaves ib0 ib1
    bond_miimon 100
    bond_mode active-backup
    # the next two lines and the mtu below are the ones that add performance
    pre-up echo connected > /sys/class/net/ib0/mode
    pre-up echo connected > /sys/class/net/ib1/mode
    pre-up modprobe bonding
    mtu 65520

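A quick way to verify those marked settings actually took effect after bringing the bond up (interface names from the example above):

Code:
cat /sys/class/net/ib0/mode   # should print "connected"
cat /sys/class/net/ib1/mode
ip link show bond3            # should show mtu 65520
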
These are the sysctl settings I use to get the best IPoIB performance.
We have used these settings for many months with no issues:
Code:
net.ipv4.tcp_mem=1280000 1280000 1280000
net.ipv4.tcp_wmem = 32768 131072 1280000
net.ipv4.tcp_rmem = 32768 131072 1280000
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.rmem_default=16777216
net.core.wmem_default=16777216
net.core.optmem_max=1524288
net.ipv4.tcp_sack=0
net.ipv4.tcp_timestamps=0

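These go in /etc/sysctl.conf (or a file under /etc/sysctl.d/); to apply them without rebooting:

Code:
sysctl -p                  # re-read /etc/sysctl.conf and apply it immediately
sysctl net.core.rmem_max   # spot-check that a value actually took effect
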
CPU and memory bandwidth have an impact on performance.
On our fastest machines, Xeon 3680s with 24GB of triple-channel RAM, iperf reports:
Code:
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  9.11 GBytes  7.83 Gbits/sec

On an AMD Phenom II X6 1100T with 16GB of dual-channel RAM, iperf reports:
Code:
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  6.60 GBytes  5.67 Gbits/sec

The max data rate of 10G Infiniband is only 8 Gbps, since the 8b/10b encoding uses two bits out of every ten for signaling.
So my Intel systems (I think due to the triple-channel RAM) are pushing 97% of max, while the AMD systems are getting about 70%.
Also expect full-duplex communication to impact performance by about 10%; iperf can test that too.
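
For reference, that full-duplex test is just iperf's bidirectional mode, something like this (addresses are placeholders):

Code:
# on one node
iperf -s
# on the other node, -d runs traffic in both directions at once
iperf -c x.x.x.x -d -t 10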
 
Thanks for the reply. Did you write the Proxmox wiki article on Infiniband? It was a huge help.

We're doing a bonded IPoIB setup like yours. Right now we're looking at just using NFS for faster backups and restores, maybe moving to a full HA cluster setup later in the year. The NFS volume will be replicated over Infiniband using DRBD. I've gotten iperf results of 7.86 Gbits/sec on our network now, but we're upgrading to DDR switches that we found for cheap, so we're hoping to increase our speeds. We're planning on using IPoIB, but I'm still curious about RDMA since that's the real advantage of Infiniband. If I have any breakthroughs I'll post here.
 
Yes I wrote some IB stuff in the wiki.

I tried getting RDMA NFS working: I could mount and do an ls, but I could not read or write any files; it would just hang.
I did not update my nfs-utils like you mentioned; will doing so fix this hanging issue?

Too bad you can only do failover bonding with Infiniband; it sure would be great to aggregate the bandwidth.
 
I only experienced hanging like that when using -o vers=4 in my mount command, and that was when I was not using RDMA. Did you check /proc/mounts to verify it was mounted as RDMA? If it is, try mounting with vers=3 and see if that clears up the hanging issue. Anyway, I couldn't get RDMA without updating my nfs-utils through Debian backports, so I'm curious if you got it working without that.
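
For reference, checking /proc/mounts for an RDMA mount is just:

Code:
grep rdma /proc/mounts
# a working RDMA mount shows proto=rdma,port=20049 in the options, as in my example above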

Are you using a switch in your setup, or OpenSM? We were using a Topspin 120 before that was pretty old, didn't support high MTUs, and maybe didn't have the best support for NFS-RDMA (which is strange to me, but I didn't see anything about NFS-RDMA in the Topspin documentation, and it is a relatively newer use of RDMA; also, this post suggests, without anything to back it up, that some switches do not support NFS-RDMA: http://www.spinics.net/lists/linux-nfs/msg20528.html). Also, did you use the OFED Debian repositories to get the userspace drivers and RDMA verbs for your particular HCA? I downloaded ibverbs-driver-mlx4, for example. This also installs libibverbs1, which is the library for RDMA.
 
We are using Topspin 120 switches and have not had issues with high MTU on them. We are running the last firmware that was released.

We have two racks with IB gear.
In each rack we have two switches and all switches are connected to each other with redundant cables.
Each server is connected to two switches using the dual-port HCA and active-backup bonding.
Subnet manager runs on all four switches.
Very reliable setup, perfect for corosync and storage traffic.

I did not install the userspace drivers.
Got some tips/clues on how to do that?
 
The MTU thing was that the Topspin switch was reporting each port at 2048 MTU, which didn't match the 65520 MTU we had set on our IPoIB interfaces. I didn't get why it was reporting that MTU; maybe it didn't matter? The Topspin 120 was running what appeared to be the last firmware (Release 2.9.0 update 2?), but we had QDR-capable HCAs and could only get SDR from the switch. Luckily we found a cheap vendor to upgrade us.
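
If it helps, the negotiated link rate and MTU can be read from the HCA side with the standard diagnostics (ibstat is in the infiniband-diags package and ibv_devinfo in ibverbs-utils, if I remember right):

Code:
ibstat                          # "Rate" shows the negotiated speed (10 = SDR 4x, 20 = DDR 4x)
ibv_devinfo -v | grep -i mtu    # active_mtu is what the port actually negotiated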

I forget where I found this, probably from scanning various mailing list archives, but put this in your sources:

Code:
deb http://realloc.spb.ru/ofed/ squeeze main
deb-src http://realloc.spb.ru/ofed/ squeeze main

That contains recent OFED packages. We needed some of them to get our Mellanox cards working. The package you want for RDMA verbs is specific to your HCA: once you aptitude update, aptitude search for ibverbs, find your HCA vendor, and install that package. It will also pull in the libibverbs1 package. My understanding is that these packages are necessary; at the very least the libibverbs1 package needs to be there to allow for RDMA use. Additionally, you will need to modprobe rdma_ucm and ib_uverbs if they are not loaded (source: http://lists.openfabrics.org/pipermail/general/2009-June/060333.html).
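
Put together, the steps were roughly this (ibverbs-driver-mlx4 is the one for our Mellanox ConnectX cards; substitute the driver package for your HCA vendor):

Code:
aptitude update
aptitude search ibverbs                            # find the driver package for your HCA
aptitude install ibverbs-driver-mlx4 libibverbs1   # mlx4 = Mellanox ConnectX in our case
modprobe rdma_ucm
modprobe ib_uverbs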

I was able to get NFS to mount over rdma using vers=4 in the options, but it still hung when I tried to do anything. Only when I upgraded nfs-utils could I get NFS RDMA working over vers=3, and that setup allowed for reads and writes. I wonder, though, since you were able to get an RDMA connection without installing any of this extra stuff, whether it is really necessary? Or did it report as rdma but somehow fall back to TCP? Your guess is as good as mine, honestly.
 
In datagram mode, MTU 2048 is the max; also, when using IPoIB, multicast data is limited to 2048.
In connected mode 65520 is the max. I wonder if what it was showing simply did not matter.

We used firmware from Cisco_SFS7000Series-2.9.0-170.iso (2.9.0 Build 170) in our Topspin 120 switches

When I mounted using rdma it was with vers=4 and I had the hanging issue; vers=3 would not mount rdma.
I bet the backported nfs-utils fixes the rdma mount with vers=3.

I did modprobe svcrdma on the server and modprobe xprtrdma on the client.
That should be all that is needed according to the kernel docs, so at this point I assume the other OFED stuff is not needed for rdma nfs.
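
For anyone following along, the kernel doc's recipe boils down to roughly this (it also lists echoing the rdma port into /proc/fs/nfsd/portlist on the server; addresses and paths are placeholders):

Code:
# server side
modprobe svcrdma
echo rdma 20049 > /proc/fs/nfsd/portlist   # per nfs-rdma.txt, after the NFS server is running
# client side
modprobe xprtrdma
mount -t nfs -o rdma,port=20049 x.x.x.x:/NFS/path /mount/path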
 
I have been successfully using NFS RDMA with nfs vers=4. The trick is to load xprtrdma on the client and put sunrpc.rdma_memreg_strategy=6 in /etc/sysctl.conf. I have done this on a standard installation, no external ofed repositories were added.
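
In concrete terms, that is roughly the following (module name and sysctl key exactly as above; making them persistent this way is just one option):

Code:
# load the RDMA transport on the client
modprobe xprtrdma
echo xprtrdma >> /etc/modules              # load it at boot as well
# memory registration strategy for the RPC/RDMA transport (value from the post above)
echo "sunrpc.rdma_memreg_strategy=6" >> /etc/sysctl.conf
sysctl -p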
 
That's pretty cool. I'll need to try this on my setup. What is the speed difference between RDMA and IPoIB for you? And what IB equipment are you using?
 
My results are mixed. Load on the storage server is lower and latency is usually better with RDMA, but besides that it's hard to tell whether you gain or lose with RDMA. Running Iometer on a Windows VM I get the following results (14 SATA disks, with two Intel DC3700 100GB drives in RAID1 as flashcache):
NO RDMA:

Test name                            Latency  Avg iops  Avg MBps  cpu load
Max Throughput-100%Read                 3.35      5104       159       95%
RealLife-60%Rand-65%Read                3.35      4621        36       84%
Max Throughput-50%Read                  3.77      3524       110       93%
Random-8k-70%Read                       2.09      4760        37       91%
4k-Max Throu-100%Read-100%Random        1.63      5203        20       89%
4k-Max Throu-100%Write-100%Random       2.04      4670        18       92%

RDMA:

Test name                            Latency  Avg iops  Avg MBps  cpu load
Max Throughput-100%Read                 2.75      4597       143       94%
RealLife-60%Rand-65%Read                2.06      4538        35       92%
Max Throughput-50%Read                  2.82      4260       133       92%
Random-8k-70%Read                       1.94      4606        35       90%
4k-Max Throu-100%Read-100%Random        1.95      4758        18       92%
4k-Max Throu-100%Write-100%Random       2.42      4415        17       91%

I use Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] dual port cards on Proxmox servers and Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) dual port cards on storage servers.
The results are based on the Iometer test from here: http://vmblog.pl/OpenPerformanceTest32-4k-Random.icf
 
