DRBD + InfiniBand

loisl

Why is InfiniBand (20 Gb/s) as slow as Ethernet?


RAID controllers are: 3ware 9650SE-16ML + 3ware 9750-8i

Code:
Every 2.0s: cat /proc/drbd                                                                 Sun May 27 22:18:47 2012

version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by phil@fat-tyre, 2011-01-28 12:17:35
 0: cs:SyncSource ro:Primary/Primary ds:UpToDate/Inconsistent C r-----
    ns:8100984 nr:0 dw:0 dr:8101184 al:0 bm:494 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:3487757196
        [>....................] sync'ed:  0.3% (3406012/3413920)M
        finish: 733:57:20 speed: 1,304 (1,280) K/sec

 2: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    ns:5719040 nr:0 dw:0 dr:5719240 al:0 bm:349 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:5220203740
        [>....................] sync'ed:  0.2% (5097852/5103440)M
        finish: 1115:25:41 speed: 1,296 (1,280) K/sec

Code:
:~# dmesg | grep ib
[..]
ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
ib_mthca: Initializing 0000:09:00.0 
[..]

# cat /sys/class/infiniband/mthca0/ports/1/rate
20 Gb/sec (4X DDR) 

root@x701:~# ibstat
CA 'mthca0'
      CA type: MT25208 (MT23108 compat mode)
      Number of ports: 2
      Firmware version: 4.7.600
      Hardware version: a0
      Node GUID: 0x0002c9020024e414
      System image GUID: 0x0002c9020024e417
      Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 20
            Base lid: 1
            LMC: 0
            SM lid: 1
            Capability mask: 0x02510a6a
            Port GUID: 0x0002c9020024e415
      Port 2:
            State: Down
            Physical state: Polling
            Rate: 10
            Base lid: 0
            LMC: 0
            SM lid: 0
            Capability mask: 0x02510a6a
            Port GUID: 0x0002c9020024e416

Code:
# cat /etc/network/interfaces

auto lo
iface lo inet loopback

auto vmbr0
iface vmbr0 inet static
        address 80.190.120.8
        netmask 255.255.255.0
        gateway 80.190.120.1
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0

auto ib0
iface ib0 inet static
         address  10.10.10.8
         netmask  255.255.255.0
         pre-up modprobe ib_ipoib
         pre-up echo connected > /sys/class/net/ib0/mode
         mtu 65520

auto eth2
iface eth2 inet static
        address  10.10.11.8
        netmask  255.255.255.0
 
You can use iperf to test the network speed; that will help rule out an IPoIB issue vs. a DRBD/disk IO issue.
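For the disk side of that question, a rough sequential read test of the array with dd gives a baseline; this is only a sketch, and the device path /dev/sdb below is a placeholder for your actual DRBD backing device:
Code:
# rough sequential read test of the backing device (path is an example, adjust it)
dd if=/dev/sdb of=/dev/null bs=1M count=4096 iflag=direct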

To get maximum throughput you need to tune the TCP/IP stack.
This works well for me on 10G InfiniBand:
Code:
#Infiniband Tuning
net.ipv4.tcp_mem=1280000 1280000 1280000
net.ipv4.tcp_wmem = 32768 131072 1280000
net.ipv4.tcp_rmem = 32768 131072 1280000
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.rmem_default=16777216
net.core.wmem_default=16777216
net.core.optmem_max=1524288
net.ipv4.tcp_sack=0
net.ipv4.tcp_timestamps=0
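
These are sysctl settings; to keep them across reboots, put them in /etc/sysctl.conf (or a file under /etc/sysctl.d/) and reload, for example:
Code:
# reload kernel parameters from /etc/sysctl.conf
sysctl -p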

Did you adjust the DRBD syncer rate?
Code:
drbdsetup /dev/drbd0 syncer -r 1000M
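drbdsetup only changes the rate at runtime. To make it permanent you can also set the rate in the resource configuration; a sketch for DRBD 8.3, where the resource name r0 is just an example:
Code:
# excerpt from /etc/drbd.conf or /etc/drbd.d/r0.res (resource name is an example)
resource r0 {
        syncer {
                rate 1000M;
        }
        # ... keep your existing disk/net/address sections unchanged ...
}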

Is your system writing lots of data preventing the syncer from reading data?
 
Many thanks to you, e100.

The InfiniBand tuning settings I had already found and enabled previously.

Adjusting the DRBD syncer rate was a great success, see:
Code:
Every 2.0s: cat /proc/drbd                                                                 Mon May 28 01:30:48 2012

version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by phil@fat-tyre, 2011-01-28 12:17:35
 0: cs:SyncSource ro:Primary/Primary ds:UpToDate/Inconsistent C r-----
    ns:31617556 nr:0 dw:0 dr:31626548 al:0 bm:1929 lo:2 pe:1 ua:64 ap:0 ep:1 wo:b oos:3464240780
        [>....................] sync'ed:  1.0% (3383044/3413920)M
        finish: 53:00:07 speed: 18,152 (1,768) K/sec

 2: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    ns:20466560 nr:0 dw:0 dr:20466760 al:0 bm:1249 lo:0 pe:1 ua:0 ap:0 ep:1 wo:b oos:5205456348
        [>....................] sync'ed:  0.4% (5083452/5103440)M
        finish: 1066:41:30 speed: 1,336 (1,280) K/sec
 
18MB/sec still seems slow.
Is this an issue with your disk/array?
With 20Gbps IB I would expect the disks to be the bottleneck on the speed and even a single disk should go faster than 18MB/sec.

If you want to test the IB network to ensure it is performing well:
On both nodes:
Code:
apt-get install iperf
On one node:
Code:
iperf -s
On the other node:
Code:
iperf -c IP_OF_OTHER_NODE_HERE
or to perform a full duplex test:
Code:
iperf -c IP_OF_OTHER_NODE_HERE -d

I am interested in seeing the performance of your 20Gbps IB cards and would like to know what CPU and IB cards you have.
Information about IB is hard to find; the more we can share, the better we can help others.

With 10G IB my fastest machines can do 7.8Gbps and the slower ones can do a little over 5Gbps.
5Gbps happens to be pretty close to the max write speed on my RAID arrays so I am more than happy with the performance even on the slower machines.
I typically get about 600MB/sec when doing DRBD sync.
I run dual DRBD volumes, if both are syncing I get about 300MB/sec on each.
My RAID cards are limited to about 700MB/sec max write speed.
Syncing two volumes @300MB/sec each involves reading 300MB/sec and writing 600MB/sec on each node, so really the RAID cards, not the IB, are my bottleneck. I have been thinking of getting a couple of Areca 1882 cards to see if dual-core RAID cards would double my throughput.

I noticed you are using the stock DRBD module.
There is a bug in 8.3.10 that can cause DRBD to produce a protocol error leading to a split brain; for me this only happens when doing snapshot backups.
Installing 8.3.13 (which fixes this and other bugs) is not difficult, I posted directions in the thread about the bug.
We have reported the bug upstream to openvz project but it has not been fixed yet.
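
After installing a newer module it is worth double-checking which DRBD version is actually loaded and which module file is installed, for example:
Code:
# version of the currently loaded DRBD module
head -n 2 /proc/drbd
# version of the installed module file
modinfo drbd | grep -i '^version'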
 
Hi e100,

Thanks for the info that those bugs are in DRBD 8.3.10; I will upgrade to 8.3.13.

The disk array is OK, not degraded.

Hardware:
SERVER1:
SuperMicro x7DB8
Ram 8x 4GB (32GB)
CPU (771) 2x L5420
Mellanox InfiniBand dual-port HCA PCI-E MHGA28-1TC 20 Gb/s
RAID controller 3ware 9650SE-16ML
Disk array 8x Seagate Barracuda 1.5 TB 300 MBps ST31500341AS
Volume 0 60.00 GB
Volume 1 8.13 TB


SERVER2:
SuperMicro x7DBE
Ram 8x 4GB (32GB)
Mellanox InfiniBand dual-port HCA PCI-E MHGA28-1TC 20 Gb/s
RAID controller 3ware 9750-8i
Disk array 8x Western Digital Caviar Black 1.5 TB 600 MBps WD1502FAEX
Volume 0 60.00 GB
Volume 1 8.13 TB

Code:
# iperf -c 10.10.10.9
------------------------------------------------------------
Client connecting to 10.10.10.9, TCP port 5001
TCP window size:   193 KByte (default)
------------------------------------------------------------
[  3] local 10.10.10.8 port 51189 connected with 10.10.10.9 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  8.12 GBytes  6.97 Gbits/sec

# iperf -c 10.10.10.9 -d
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:   128 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 10.10.10.9, TCP port 5001
TCP window size:   193 KByte (default)
------------------------------------------------------------
[  5] local 10.10.10.8 port 51190 connected with 10.10.10.9 port 5001
[  3] local 10.10.10.8 port 5001 connected with 10.10.10.9 port 36201
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec  6.55 GBytes  5.62 Gbits/sec
[  3]  0.0-10.0 sec  6.69 GBytes  5.74 Gbits/sec


Updated to DRBD 8.3.13rc1 successfully!

I believe the bottleneck is still the RAID controller.
With the controller's write cache enabled, the speed shoots up to 169,136 K/sec at first but quickly drops below 56,116 K/sec.

At some point I will go to the data center, connect the 2nd InfiniBand cable, and compare the 3ware 9650SE-16ML RAID controller against the 3ware 9750-8i.

[Attachment: 3ware-9650.jpg]
 
Further down in the DRBD bug thread I mention that the final 8.3.13 release is out, replacing 8.3.13rc1.
Just remove "rc1" from all the commands and you will get version "8.3.13" (the actual 8.3.13 release) instead of "8.3.13rc1" (8.3.13 release candidate 1).
Sorry that I did not make that clear.

Do you have that 8.13TB RAID array partitioned into two volumes for the two DRBD disks?
If so, that is why your performance is horrible.

The first volume is at the beginning of the disk.
The second volume is in the middle of the disk.
When syncing both (or performing normal IO to both) you are creating random disk IO which is always slower than sequential IO.
The first volume is read from the beginning of the disk on the UpToDate node and written to the beginning of the disk on the Inconsistent node.
The second volume is read from the middle of the disk on the UpToDate node and written to the middle of the disk on the Inconsistent node.
The seeking between the beginning and middle of the disk will drastically impact IO performance and this will be an issue when syncing AND with normal disk IO in day to day operations.

You would have better IO performance if you had two RAID arrays of 4 disks each.
Your peak IO to a SINGLE DRBD volume will be less, but the peak IO to BOTH DRBD volumes at the same time will be higher than your current setup.
If your only IO is the sync, pause one volume so you can see the impact of this inefficient RAID setup:
Code:
drbdsetup /dev/drbd1 pause-sync
Once /dev/drbd1 sync is paused you should see a huge increase in the sync speed of /dev/drbd0 because you are no longer doing unnecessary random disk IO.
I would expect to see at least 200MB/sec, with RAID write cache enabled (with battery backup) I would expect to see 300-400MB/sec.
Huge difference from the 18MB/sec you are getting now.
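
When you are done testing, the paused device can be resumed again. With the 8.3 tools that should look like this (verify against your drbdsetup man page; the resource name r1 in the drbdadm variant is just an example):
Code:
drbdsetup /dev/drbd1 resume-sync
# or, using the resource name from your drbd.conf:
drbdadm resume-sync r1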

One other thing that will help DRBD performance is having a battery backup for the write cache on your RAID controller.
DRBD makes updates to its metadata which also creates random IO.
see: http://www.drbd.org/users-guide-8.3/ch-latency.html
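
On the 3ware controllers you can check the write cache and BBU state from the OS with tw_cli; a hedged sketch, assuming the controller shows up as /c0 and the unit as /u0:
Code:
# show unit settings, including the write cache state (IDs /c0 and /u0 are examples)
tw_cli /c0/u0 show all
# show battery backup unit status, if one is installed
tw_cli /c0/bbu show all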

Your IB speed is about what I see on my 10Gbps IB gear.
Either your CPU or RAM cannot go any faster, or you do not actually have a 20Gbps connection.
That is not a huge deal since I doubt your RAID arrays will go faster than the IB speed you have anyway.
 
