Ceph performance: 4-node, all-NVMe, 56 GBit Ethernet

Has anyone gotten RDMA working with Luminous 12.2? I was working on it yesterday (https://community.mellanox.com/docs/DOC-2721), but either this method was for a different implementation or my 2nd-grade reading comprehension is too low to follow the directions.

Alex,
I have a ConnectX-3 Pro. Which card do you have?

81:00.0 Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]

I have no clue whether this card is RDMA capable ...

... and fetching the source from git and compiling it? Phew, that might well disturb the PVE modules interacting with the Ceph modules ...
 
it is.


It's supposed to be in the trunk now, or at least that's my understanding.

Hmmm... should it be sufficient to specify the following in the [global] section of ceph.conf?
Code:

ms_type=async+rdma
ms_async_rdma_device_name=mlx4_0  or mlx4_en ?
 
Well, your RDMA device would be mlx4_0 (not the _en variant, which stands for Ethernet), and yes, that's pretty much it, except it didn't work for me: once I tried to restart the mgr and the monitors there was no communication and they all failed.
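
If it helps to double-check the device name, the ibverbs tools (ibverbs-utils, mentioned further down in this thread) should be able to confirm it, assuming the mlx4_ib/ib_uverbs modules are loaded; a minimal sketch:
Code:
# list RDMA-capable devices seen by the verbs stack
ibv_devices

# per-port details; for RoCE on a ConnectX-3 Pro the link_layer should read "Ethernet"
ibv_devinfo -d mlx4_0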
 
Maybe these settings need to be applied as well?
Code:
https://community.mellanox.com/docs/DOC-2693

...
Open /etc/security/limits.conf and add the following lines to pin the memory. RDMA is tightly coupled to physical memory addresses.

* soft memlock unlimited

* hard memlock unlimited

root soft memlock unlimited

root hard memlock unlimited

....
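For what it's worth, those limits.conf entries only apply to sessions that go through pam_limits, so a quick sanity check after logging in again might look like this (a sketch; daemons started by systemd ignore limits.conf, which is why the LimitMEMLOCK= unit settings come up later in this thread):
Code:
# in a fresh login shell
ulimit -l
# expected output: unlimited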
 
Alex, and please someone @dietmar ... I have no clue how to get RDMA working on Stretch!

For example, rping does not work:
Code:
 rping -s -v 192.168.100.141
rdma_create_event_channel: No such device

 ifconfig ens1d1
ens1d1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.100.141  netmask 255.255.255.0  broadcast 192.168.100.255
        inet6 fe80::268a:7ff:fee2:6071  prefixlen 64  scopeid 0x20<link>
        ether 24:8a:07:e2:60:71  txqueuelen 1000  (Ethernet)
        RX packets 42407699  bytes 117272882136 (109.2 GiB)
        RX errors 0  dropped 1824  overruns 1824  frame 0
        TX packets 35992868  bytes 55776542133 (51.9 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

card:
mstflint -d 81:00.0 q
Image type:          FS2
FW Version:          2.40.7000
FW Release Date:     22.3.2017
Product Version:     02.40.70.00
Rom Info:            type=PXE version=3.4.746 devid=4103
Device ID:           4103
Description:         Node             Port1            Port2            Sys image
GUIDs:               ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
MACs:                                     248a07e26070     248a07e26071
VSD:
PSID:                MT_1090111023

 lsmod|grep mlx
mlx4_en               114688  0
mlx4_core             294912  1 mlx4_en
devlink                32768  2 mlx4_en,mlx4_core
ptp                    20480  2 ixgbe,mlx4_en

 dpkg --list ms*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                              Version               Architecture          Description
+++-=================================-=====================-=====================-========================================================================
ii  mstflint                          4.6.0-1               amd64                 Mellanox firmware burning application
 
I have no clue how to get RDMA working on Stretch!

1. You need the following packages:
apt-get install libibcm1 libibverbs1 ibverbs-utils librdmacm1 rdmacm-utils libdapl2 ibsim-utils ibutils libcxgb3-1 libibmad5 libibumad3 libmlx4-1 libmthca1 libnes1

Recommended: infiniband-diags perftest srptools

2. You should have the following modules loaded (a sketch for loading them persistently follows this list):
mlx4_ib
rdma_ucm
ib_umad
ib_uverbs

3. Unless you're using Gluster, there doesn't appear to be any reason to bother.
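
In case it is useful, a rough sketch for loading the modules from point 2 and making them stick across reboots; the file name under /etc/modules-load.d/ is just an example:
Code:
# load the RDMA modules for the current boot
modprobe mlx4_ib
modprobe rdma_ucm
modprobe ib_umad
modprobe ib_uverbs

# make them persistent (file name is an example)
cat > /etc/modules-load.d/rdma.conf <<EOF
mlx4_ib
rdma_ucm
ib_umad
ib_uverbs
EOF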
 

Thanks for this hint!

But I'm still stuck ...

I modified ceph.conf
Code:
ms_type=async+rdma
ms_cluster_type = async+rdma
ms_async_rdma_device_name=mlx4_0

and NO GID parameter, because ceph.conf is common to all nodes and each node has a distinct GID ... or am I wrong?

Code:
root@pve01:~# ./showgids
DEV     PORT    INDEX   GID                                     IPv4            VER     DEV
---     ----    -----   ---                                     ------------    ---     ---
mlx4_0  1       0       fe80:0000:0000:0000:268a:07ff:fee2:6070                 v1      ens1
mlx4_0  1       1       fe80:0000:0000:0000:268a:07ff:fee2:6070                 v2      ens1
mlx4_0  1       2       0000:0000:0000:0000:0000:ffff:c0a8:dd8d 192.168.221.141         v1      vmbr0
mlx4_0  1       3       0000:0000:0000:0000:0000:ffff:c0a8:dd8d 192.168.221.141         v2      vmbr0
mlx4_0  2       0       fe80:0000:0000:0000:268a:07ff:fee2:6071                 v1      ens1d1
mlx4_0  2       1       fe80:0000:0000:0000:268a:07ff:fee2:6071                 v2      ens1d1
mlx4_0  2       2       0000:0000:0000:0000:0000:ffff:c0a8:648d 192.168.100.141         v1      ens1d1
mlx4_0  2       3       0000:0000:0000:0000:0000:ffff:c0a8:648d 192.168.100.141         v2      ens1d1
n_gids_found=8
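
(Not verified on this exact build, but for what it's worth: the Mellanox guide linked above, DOC-2721, handles the per-node GID via the ms_async_rdma_local_gid option. Since ceph.conf is shared, one way to keep the node-specific value out of [global] would be per-daemon sections, roughly like this, using the GID for 192.168.100.141 from the showgids output above:)
Code:
[global]
ms_type = async+rdma
ms_async_rdma_device_name = mlx4_0

# node-specific GID per daemon (option name as used in DOC-2721; layout is only a sketch)
[mon.0]
ms_async_rdma_local_gid = 0000:0000:0000:0000:0000:ffff:c0a8:648d
[osd.4]
ms_async_rdma_local_gid = 0000:0000:0000:0000:0000:ffff:c0a8:648d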


so rping is now working ...
but Ceph won't start:

Code:
Sep 22 20:44:29 pve01 ceph-mgr[2393]: 2017-09-22 20:44:29.578317 7f7e94aa6540 -1 RDMAStack RDMAStack!!! WARNING !!! For RDMA to work properly user memlock (ulimit -l) must be big enough to allow large amount of registered memory. We recommend setting this parameter to infinity
Sep 22 20:44:29 pve01 systemd[1]: Started Ceph object storage daemon osd.4.
Sep 22 20:44:29 pve01 ceph-mgr[2393]: *** Caught signal (Aborted) **
Sep 22 20:44:29 pve01 ceph-mgr[2393]:  in thread 7f7e8d881700 thread_name:msgr-worker-1
Sep 22 20:44:29 pve01 ceph-mgr[2393]: 2017-09-22 20:44:29.581502 7f7e8d881700 -1 DeviceList failed to get rdma device list.  (38) Function not implemented

I have no clue how to set these parameters ...

Code:
according to:
https://community.mellanox.com/docs/DOC-2721

11. If you are using systemd services:

11.1     Validate that the following parameters are set in relevant systemd files in /usr/lib/systemd/system/:

     ceph-disk@.service

          LimitMEMLOCK=infinity

     ceph-mds@.service

          LimitMEMLOCK=infinity

          PrivateDevices=no

     ceph-mgr@.service

          LimitMEMLOCK=infinity

     ceph-mon@.service

          LimitMEMLOCK=infinity

          PrivateDevices=no

     ceph-osd@.service

          LimitMEMLOCK=infinity

     ceph-radosgw@.service

          LimitMEMLOCK=infinity

          PrivateDevices=no

No clue how to set these properties ...
Code:
 systemctl  set-property ceph-mon@0.service LimitMEMLOCK=infinity
Failed to set unit properties on ceph-mon@0.service: Cannot set property LimitMEMLOCK, or unknown property.
root@pve01:~# systemctl|grep ceph-mon
  ceph-mon@0.service                                                                                             loaded active     running   Ceph cluster monitor daemon                                         
  ceph-mon.target                                                                                                loaded active     active    ceph target allowing to start/stop all ceph-mon@.service instances at once
root@pve01:~# systemctl  set-property ceph-mon@0.service LimitMEMLOCK=infinity
Failed to set unit properties on ceph-mon@0.service: Cannot set property LimitMEMLOCK, or unknown property.

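systemctl set-property only covers properties the running systemd can change at runtime, and on Stretch's systemd the memlock limit apparently isn't one of them. A drop-in override should achieve the same effect; a sketch using the standard systemd override mechanism:
Code:
mkdir -p /etc/systemd/system/ceph-mon@.service.d
cat > /etc/systemd/system/ceph-mon@.service.d/override.conf <<EOF
[Service]
LimitMEMLOCK=infinity
EOF

# repeat for ceph-osd@.service, ceph-mgr@.service, etc., then:
systemctl daemon-reload
systemctl restart ceph-mon@0.service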
RDMA tests between the nodes work perfectly well!

Code:
root@pve01:~# udaddy -s 192.168.100.142
udaddy: starting client
udaddy: connecting
initiating data transfers
receiving data transfers
data transfers complete
test complete
return status 0
root@pve01:~# rdma_client -s 192.168.100.142
rdma_client: start
rdma_client: end 0
root@pve01:~# b_send_bw -d mlx4_0 -i 1 -F --report_gbits 192.168.100.142
-bash: b_send_bw: command not found
root@pve01:~# ib_send_bw -d mlx4_0 -i 1 -F --report_gbits 192.168.100.142
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx4_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 2048[B]
 Link type       : Ethernet
 GID index       : 2
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0244 PSN 0x81dc22
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:221:141
 remote address: LID 0000 QPN 0x0244 PSN 0xecc052
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:221:142
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      1000             48.82              43.55              0.083066
---------------------------------------------------------------------------------------
root@pve01:~# ucmatose
cmatose: starting server
initiating data transfers
completing sends
receiving data transfers
data transfers complete
cmatose: disconnecting
disconnected
test complete
return status 0
 
If you have some more time to spare, I would suggest creating multiple OSDs per NVMe disk (see the sketch after the links below). I doubt Ceph at this stage is able to fully utilize an NVMe disk hosting only one OSD; I hope I am wrong. I had very good results in the past with 0.8x Ceph versions running 2 or 3 OSDs per disk.

How is your CPU usage during write tests? I recall serious CPU demands with these drives; you may be reaching your limits.

I would also suggest starting some real tests (if you plan to use Ceph as block storage), with multiple VMs running fio or any other disk I/O benchmark you like.

Some links
http://tracker.ceph.com/projects/ceph/wiki/Benchmark_Ceph_Cluster_Performance
http://ceph.com/use-cases/
https://www.redhat.com/cms/managed-...ung-nvme-reference-architecture-201610-en.pdf
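
If you want to try the multiple-OSDs-per-NVMe idea, one way that should work on Luminous is to carve the disk up with LVM and hand the logical volumes to ceph-volume; a rough sketch, assuming ceph-volume is available in this 12.2 build (device name and split are placeholders):
Code:
# split one NVMe device into two logical volumes (device name is an example)
pvcreate /dev/nvme0n1
vgcreate ceph-nvme0 /dev/nvme0n1
lvcreate -l 50%VG -n osd0 ceph-nvme0
lvcreate -l 100%FREE -n osd1 ceph-nvme0

# one BlueStore OSD per LV
ceph-volume lvm create --data ceph-nvme0/osd0
ceph-volume lvm create --data ceph-nvme0/osd1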
 
Sakis, CPU is no issue with current Ceph BlueStore on my cluster.
 

Attachments

  • bluestore_no_rdma.PNG
  • bluestore_no_rdma_write.PNG
@fabian I managed to change the systemd files as mentioned earlier in this thread.

But Ceph won't start ... so I reverted my ceph.conf to not use RDMA :(
I'm totally stuck.

Code:
-- Reboot --
Sep 26 18:56:10 pve02 systemd[1]: Started Ceph cluster manager daemon.
Sep 26 18:56:10 pve02 systemd[1]: Reached target ceph target allowing to start/stop all ceph-mgr@.service instances at once.
Sep 26 18:56:10 pve02 ceph-mgr[2233]: 2017-09-26 18:56:10.427474 7f0e2137e700 -1 Infiniband binding_port  port not found
Sep 26 18:56:10 pve02 ceph-mgr[2233]: /home/builder/source/ceph-12.2.0/src/msg/async/rdma/Infiniband.cc: In function 'void Device::binding_port(CephContext*, int)' thread 7f0e2137e700 time 2017-09-26 18:56:10.427498
Sep 26 18:56:10 pve02 ceph-mgr[2233]: /home/builder/source/ceph-12.2.0/src/msg/async/rdma/Infiniband.cc: 144: FAILED assert(active_port)
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55e9dde4bd12]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  2: (Device::binding_port(CephContext*, int)+0x573) [0x55e9de1b2c33]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  3: (Infiniband::init()+0x15f) [0x55e9de1b8f1f]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  4: (RDMAWorker::connect(entity_addr_t const&, SocketOptions const&, ConnectedSocket*)+0x4c) [0x55e9ddf2329c]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  5: (AsyncConnection::_process_connection()+0x446) [0x55e9de1a6d86]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  6: (AsyncConnection::process()+0x7f8) [0x55e9de1ac328]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  7: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1125) [0x55e9ddf198a5]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  8: (()+0x4c9288) [0x55e9ddf1d288]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  9: (()+0xb9e6f) [0x7f0e259d4e6f]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  10: (()+0x7494) [0x7f0e260d1494]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  11: (clone()+0x3f) [0x7f0e25149aff]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sep 26 18:56:10 pve02 ceph-mgr[2233]: 2017-09-26 18:56:10.431055 7f0e2137e700 -1 /home/builder/source/ceph-12.2.0/src/msg/async/rdma/Infiniband.cc: In function 'void Device::binding_port(CephContext*, int)' thread 7f0e213
Sep 26 18:56:10 pve02 ceph-mgr[2233]: /home/builder/source/ceph-12.2.0/src/msg/async/rdma/Infiniband.cc: 144: FAILED assert(active_port)
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55e9dde4bd12]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  2: (Device::binding_port(CephContext*, int)+0x573) [0x55e9de1b2c33]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  3: (Infiniband::init()+0x15f) [0x55e9de1b8f1f]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  4: (RDMAWorker::connect(entity_addr_t const&, SocketOptions const&, ConnectedSocket*)+0x4c) [0x55e9ddf2329c]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  5: (AsyncConnection::_process_connection()+0x446) [0x55e9de1a6d86]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  6: (AsyncConnection::process()+0x7f8) [0x55e9de1ac328]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  7: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1125) [0x55e9ddf198a5]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  8: (()+0x4c9288) [0x55e9ddf1d288]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  9: (()+0xb9e6f) [0x7f0e259d4e6f]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  10: (()+0x7494) [0x7f0e260d1494]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  11: (clone()+0x3f) [0x7f0e25149aff]
Sep 26 18:56:10 pve02 ceph-mgr[2233]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
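
(One thing that might be worth ruling out, going by the showgids output earlier: the cluster network 192.168.100.x sits on port 2 of mlx4_0, while Ceph's RDMA messenger defaults to port 1, which would be consistent with "binding_port port not found" and the assert(active_port). If the option is honored by this build, which I have not verified, pointing it at port 2 might help:)
Code:
ms_async_rdma_device_name = mlx4_0
ms_async_rdma_port_num = 2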
 
Sorry, no InfiniBand hardware to test here ... maybe the people over in #ceph or on the ceph-users mailing list can provide further input?
 
Fabian, it's not IB but 56 Gbit/s Ethernet ...
May I donate a pair of NICs for your lab?
The Ceph mailing list is also not really helpful.
 
You can contact office@proxmox.com for hardware donations :) (that being said, having test hardware does not necessarily mean we can or will support it or get it to work ... RDMA in Ceph still seems to be experimental).

You might want to try the 4.13.1 kernel, which will be uploaded to pvetest today or tomorrow (you need to opt in with "apt install pve-kernel-4.13.1-1-pve"; it is not yet pulled in by proxmox-ve).
 
Fabian, may I switch to the Ceph packages from ceph.com before trying the new kernel from Proxmox? Perhaps the Proxmox Ceph build instructions are not complete for utilizing RDMA on the Mellanox ConnectX-3 Pro?

Regarding support for RDMA ... I think many customers out there are interested in this. Getting more CPU cycles back for KVM on the nodes is a valuable argument for a "hyper-converged" Proxmox!
 

Not sure what you mean with the first question? The packages from download.ceph.com are not built with different flags than ours, if that is what you mean ...
 
