Ceph performance: 4-node all-NVMe, 56GBit Ethernet

not sure what you mean by the first question? the packages from download.ceph.com are not built with different flags than ours, if that is what you mean..
fabian, yes, I meant this .... so both repos are identical?
Code:
deb https://download.ceph.com/debian-luminous stretch main
deb http://download.proxmox.com/debian/ceph-luminous stretch main

OK, so I'll wait for kernel 4.13.x ... how do I get notified when the kernel is pushed?
 

no, they are not identical (currently, the delta is very small, we just enable some systemd units by default which upstream does not). but they are built using the same build systems with the same compile flags, so I doubt that there is a difference regarding RDMA behaviour.
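
if you want to check which repository the installed packages actually came from, apt can tell you (just a sketch, any ceph package name works here):
Code:
# show installed/candidate versions and the repository each comes from
apt-cache policy ceph-common
# version and build commit of the running binaries
ceph -v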

I'll ping this thread once the kernel is available publicly.
 
Fabian,
I installed the new kernel, but RDMA for Ceph still does not work.
I opened a discussion on the ceph-users list and a case @ Mellanox.
Shall I start a new thread "rdma for ceph" or keep updating this one?

feel free to continue here.
 
@fabian Mellanox said I'm not using a Ceph build with the RDMA patch, but I'm convinced the versions are OK.
Hi Gerhard,


According to the logs you sent on 06-Oct, you were using a CEPH version that does not include the fix:


2017-10-06 13:03:29.212438 7fcf636fbf80 0 ceph version 12.2.1 (fc129ad90a65dc0b419412e77cb85ac230da42a6) luminous (stable), process (unknown), pid 2140


The commit below was on top when your CEPH was built; it was committed before the RDMA fix was added:


Code:
author     Fabian Grünbichler <f.gruenbichler@proxmox.com>  Fri, 29 Sep 2017 11:05:07 +0300 (10:05 +0200)
committer  Fabian Grünbichler <f.gruenbichler@proxmox.com>  Fri, 29 Sep 2017 11:05:07 +0300 (10:05 +0200)
commit     fc129ad90a65dc0b419412e77cb85ac230da42a6
tree       72881ace6b87170c6f6f7a17c8ac929e3ef4eb48
parent     181888fb293938ba79f4c96c14bf1459f38d18af

bump version to 12.2.1-pve1


RDMA cherry-pick:

Code:
author     Fabian Grünbichler <f.gruenbichler@proxmox.com>  Fri, 6 Oct 2017 09:43:23 +0300 (08:43 +0200)
committer  Fabian Grünbichler <f.gruenbichler@proxmox.com>  Fri, 6 Oct 2017 09:50:25 +0300 (08:50 +0200)
commit     b3556f0d898c1bf7e9a439fc710863f1e461beb4
tree       bac1bffcf9123410c639057cf6501dae48bc0f0f
parent     81bbc789e2c1d7a874e029ec8439052f4c3b8cb7

cherry-pick RDMA bug fix


Please check; maybe you are using an old version of CEPH for some reason.


The correct version should include at least:


Code:
author     Fabian Grünbichler <f.gruenbichler@proxmox.com>  Fri, 6 Oct 2017 09:49:01 +0300 (08:49 +0200)
committer  Fabian Grünbichler <f.gruenbichler@proxmox.com>  Fri, 6 Oct 2017 09:50:25 +0300 (08:50 +0200)
commit     750ddfb54a4cde63dcf3367b967a859d0b7070e2
tree       59985fd83589daa53c769eb6c86e0e8898830af5
parent     b3556f0d898c1bf7e9a439fc710863f1e461beb4

bump version to 12.2.1-pve2

Vladimir

Code:
 ceph -v
ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)

 ceph versions
{
    "mon": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 4
    },
    "mgr": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 4
    },
    "osd": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 28
    },
    "mds": {},
    "overall": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 36
    }
}
Code:
 dpkg --list *ceph*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                            Version                      Architecture                 Description
+++-===============================================-============================-============================-===================================================================================================
ii  ceph                                            12.2.1-pve3                  amd64                        distributed storage and file system
ii  ceph-base                                       12.2.1-pve3                  amd64                        common ceph daemon libraries and management tools
un  ceph-client-tools                               <none>                       <none>                       (no description available)
ii  ceph-common                                     12.2.1-pve3                  amd64                        common utilities to mount and interact with a ceph storage cluster
un  ceph-fs-common                                  <none>                       <none>                       (no description available)
un  ceph-mds                                        <none>                       <none>                       (no description available)
ii  ceph-mgr                                        12.2.1-pve3                  amd64                        manager for the ceph distributed storage system
ii  ceph-mon                                        12.2.1-pve3                  amd64                        monitor server for the ceph storage system
ii  ceph-osd                                        12.2.1-pve3                  amd64                        OSD server for the ceph storage system
un  ceph-test                                       <none>                       <none>                       (no description available)
un  libceph                                         <none>                       <none>                       (no description available)
un  libceph1                                        <none>                       <none>                       (no description available)
un  libcephfs                                       <none>                       <none>                       (no description available)
ii  libcephfs1                                      10.2.5-7.2                   amd64                        Ceph distributed file system client library
ii  libcephfs2                                      12.2.1-pve3                  amd64                        Ceph distributed file system client library
un  python-ceph                                     <none>                       <none>                       (no description available)
ii  python-cephfs                                   12.2.1-pve3                  amd64                        Python 2 libraries for the Ceph libcephfs library
I'm now lost in a maze; RDMA still does not work for me.
I opened a bug ticket, which is still not assigned: https://bugzilla.proxmox.com/show_bug.cgi?id=1521
I also posted on the dev list: https://pve.proxmox.com/pipermail/pve-devel/2017-October/029003.html

Any advice on how to get this running?
Code:
deb http://download.proxmox.com/debian/pve stretch pvetest
deb http://download.proxmox.com/debian/ceph-luminous stretch test
 
2017-10-06 13:03:29.212438 7fcf636fbf80 0 ceph version 12.2.1 (fc129ad90a65dc0b419412e77cb85ac230da42a6) luminous (stable), process (unknown), pid 2140

that was ten days ago - maybe you were running the old packages then?
 
Fabian, no, I retried on Friday the 13th :)
Code:
2017-10-13 15:19:45.542352 7ffbf73e4f80  0 ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable), process (unknown), pid 16829
2017-10-13 15:19:45.542384 7ffbf73e4f80  0 pidfile_write: ignore empty --pid-file
2017-10-13 15:19:45.548031 7ffbf73e4f80  0 load: jerasure load: lrc load: isa
2017-10-13 15:19:45.548325 7ffbf73e4f80  1 leveldb: Recovering log #51056
2017-10-13 15:19:45.550070 7ffbf73e4f80  1 leveldb: Delete type=0 #51056

2017-10-13 15:19:45.550108 7ffbf73e4f80  1 leveldb: Delete type=3 #51055

2017-10-13 15:19:45.550666 7ffbf73e4f80  1 RDMAStack RDMAStack ms_async_rdma_enable_hugepage value is: 0
2017-10-13 15:19:45.550682 7ffbf73e4f80 20 RDMAStack RDMAStack constructing RDMAStack...
2017-10-13 15:19:45.550693 7ffbf73e4f80 20 RDMAStack  creating RDMAStack:0x56307d9ade70 with dispatcher:0x56307d9f28c0
2017-10-13 15:19:45.550799 7ffbee572700  2 Event(0x56307dcd68c0 nevent=5000 time_id=1).set_owner idx=1 owner=140720012207872
2017-10-13 15:19:45.550796 7ffbeed73700  2 Event(0x56307dcd6600 nevent=5000 time_id=1).set_owner idx=0 owner=140720020600576
2017-10-13 15:19:45.550833 7ffbee572700 20 Event(0x56307dcd68c0 nevent=5000 time_id=1).create_file_event create event started fd=14 mask=1 original mask is 0
2017-10-13 15:19:45.550838 7ffbee572700 20 EpollDriver.add_event add event fd=14 cur_mask=0 add_mask=1 to 13
2017-10-13 15:19:45.550835 7ffbedd71700  2 Event(0x56307dcd6b80 nevent=5000 time_id=1).set_owner idx=2 owner=140720003815168
2017-10-13 15:19:45.550841 7ffbeed73700 20 Event(0x56307dcd6600 nevent=5000 time_id=1).create_file_event create event started fd=11 mask=1 original mask is 0
2017-10-13 15:19:45.550843 7ffbedd71700 20 Event(0x56307dcd6b80 nevent=5000 time_id=1).create_file_event create event started fd=17 mask=1 original mask is 0
2017-10-13 15:19:45.550844 7ffbeed73700 20 EpollDriver.add_event add event fd=11 cur_mask=0 add_mask=1 to 10
2017-10-13 15:19:45.550843 7ffbee572700 20 Event(0x56307dcd68c0 nevent=5000 time_id=1).create_file_event create event end fd=14 mask=1 original mask is 1
2017-10-13 15:19:45.550845 7ffbedd71700 20 EpollDriver.add_event add event fd=17 cur_mask=0 add_mask=1 to 16
2017-10-13 15:19:45.550847 7ffbee572700 10 stack operator() starting
2017-10-13 15:19:45.550850 7ffbeed73700 20 Event(0x56307dcd6600 nevent=5000 time_id=1).create_file_event create event end fd=11 mask=1 original mask is 1
2017-10-13 15:19:45.550852 7ffbeed73700 10 stack operator() starting
2017-10-13 15:19:45.550853 7ffbedd71700 20 Event(0x56307dcd6b80 nevent=5000 time_id=1).create_file_event create event end fd=17 mask=1 original mask is 1
2017-10-13 15:19:45.550856 7ffbedd71700 10 stack operator() starting
2017-10-13 15:19:45.550870 7ffbf73e4f80  0 starting mon.0 rank 0 at public addr 192.168.100.141:6789/0 at bind addr 192.168.100.141:6789/0 mon_data /var/lib/ceph/mon/ceph-0 fsid cb0aba69-bad9-4d30-b163-c19f0fd1ec53
2017-10-13 15:19:45.550883 7ffbf73e4f80 10 -- - bind bind 192.168.100.141:6789/0
2017-10-13 15:19:45.550885 7ffbf73e4f80 10 -- - bind Network Stack is not ready for bind yet - postponed
2017-10-13 15:19:45.550899 7ffbf73e4f80  0 starting mon.0 rank 0 at 192.168.100.141:6789/0 mon_data /var/lib/ceph/mon/ceph-0 fsid cb0aba69-bad9-4d30-b163-c19f0fd1ec53
2017-10-13 15:19:45.551555 7ffbf73e4f80  0 mon.0@-1(probing).mds e1 print_map
e1
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
legacy client fscid: -1

No filesystems configured

2017-10-13 15:19:45.551765 7ffbf73e4f80  0 mon.0@-1(probing).osd e12406 crush map has features 283675107524608, adjusting msgr requires
2017-10-13 15:19:45.551771 7ffbf73e4f80  0 mon.0@-1(probing).osd e12406 crush map has features 283675107524608, adjusting msgr requires
2017-10-13 15:19:45.551773 7ffbf73e4f80  0 mon.0@-1(probing).osd e12406 crush map has features 720859615486820352, adjusting msgr requires
2017-10-13 15:19:45.551775 7ffbf73e4f80  0 mon.0@-1(probing).osd e12406 crush map has features 283675107524608, adjusting msgr requires
2017-10-13 15:19:45.552656 7ffbf73e4f80 10 -- - create_connect 192.168.100.144:6808/2159, creating connection and registering
2017-10-13 15:19:45.552667 7ffbf73e4f80 10 -- - >> 192.168.100.144:6808/2159 conn(0x56307df50000 :-1 s=STATE_NONE pgs=0 cs=0 l=0)._connect csq=0
2017-10-13 15:19:45.552682 7ffbf73e4f80 20 Event(0x56307dcd6b80 nevent=5000 time_id=1).wakeup
2017-10-13 15:19:45.552689 7ffbf73e4f80 10 -- - get_connection mgr.18784131 192.168.100.144:6808/2159 new 0x56307df50000
2017-10-13 15:19:45.552694 7ffbf73e4f80  1 -- - --> 192.168.100.144:6808/2159 -- mgropen(unknown.0) v2 -- 0x56307dcd7080 con 0
2017-10-13 15:19:45.552712 7ffbf73e4f80 15 -- - >> 192.168.100.144:6808/2159 conn(0x56307df50000 :-1 s=STATE_CONNECTING pgs=0 cs=0 l=0).send_message inline write is denied, reschedule m=0x56307dcd7080
2017-10-13 15:19:45.552729 7ffbedd71700 20 -- - >> 192.168.100.144:6808/2159 conn(0x56307df50000 :-1 s=STATE_CONNECTING pgs=0 cs=0 l=0).process prev state is STATE_CONNECTING
2017-10-13 15:19:45.552984 7ffbf73e4f80  1 -- - start start
2017-10-13 15:19:45.552990 7ffbf73e4f80  1 -- - start start
:
 
2017-10-13 15:19:45.542352 7ffbf73e4f80 0 ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable), process (unknown), pid 16829

which shows the issue is still there with packages with the patch applied, and now you can show this log with the correct git commit hash to the Mellanox devs ;)
 
I dropped them a note, will come back when they answer ...
complicated stuff ......
 
RDMA is up and running; I uninstalled Ceph and re-installed it from the Proxmox test repository.
But performance is disappointing ...

rados bench -p rbd 300 write --no-cleanup -t 56

Without rdma:
Code:
Total time run:         300.065672
Total writes made:      277607
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     3700.62
Stddev Bandwidth:       55.0509
Max bandwidth (MB/sec): 3824
Min bandwidth (MB/sec): 3464
Average IOPS:           925
Stddev IOPS:            13
Max IOPS:               956
Min IOPS:               866
Average Latency(s):     0.0605242
Stddev Latency(s):      0.0151771
Max latency(s):         0.716172
Min latency(s):         0.0378002
With rdma:
Code:
Total time run:         301.923385
Total writes made:      26627
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     352.765
Stddev Bandwidth:       154.788
Max bandwidth (MB/sec): 980
Min bandwidth (MB/sec): 0
Average IOPS:           88
Stddev IOPS:            38
Max IOPS:               245
Min IOPS:               0
Average Latency(s):     0.634218
Stddev Latency(s):      0.620692
Max latency(s):         9.54576
Min latency(s):         0.0650959
 
Good news and bad news.

Ceph starts and is obviously OK ...
Read is pretty fast:
Code:
 rados bench -p rbd 60 seq --no-cleanup -t 56
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      55      1247      1192   4739.25      4768   0.0467244   0.0446195
    2      55      1923      1868   3720.75      2704    0.086846   0.0579308
    3      55      3077      3022   3994.43      4616   0.0508493   0.0543367
    4      55      4252      4197   4168.11      4700   0.0532006   0.0523014
    5      55      5427      5372   4272.47      4700   0.0420237   0.0512055
    6      55      6629      6574   4359.99      4808   0.0428992   0.0502176
    7      55      7862      7807   4441.25      4932   0.0546657   0.0492794
    8      55      9176      9121   4542.18      5256   0.0400031   0.0482799
    9      55     10486     10431   4618.73      5240   0.0399938   0.0475036
   10      55     11815     11760   4686.23      5316   0.0421387   0.0468279
   11      55     13161     13106   4748.79      5384   0.0413468   0.0462269
   12      55     14447     14392   4779.76      5144   0.0423618    0.045937
   13      56     15786     15730   4818.23      5352    0.042763   0.0455721
   14      55     17085     17030   4844.51      5200   0.0438458   0.0453346
   15      56     18433     18377   4878.34      5388   0.0405596    0.045031
   16      55     19773     19718   4908.41      5364   0.0410883   0.0447618
   17      56     21081     21025   4926.42      5228   0.0408308   0.0446013
   18      56     22457     22401   4955.19      5504   0.0434251   0.0443456
   19      55     23805     23750   4977.58      5396   0.0414481   0.0441513
2017-10-17 19:19:36.765416 min lat: 0.0359675 max lat: 0.386146 avg lat: 0.0440493
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   20      55     25112     25057   4989.71      5228   0.0405618   0.0440493
   21      55     26519     26464   5009.32      5628   0.0415511   0.0438786
   22      56     27866     27810    5023.2      5384    0.042053   0.0437585
   23      56     29203     29147   5032.24      5348   0.0422641   0.0436834
   24      55     30556     30501   5046.38      5416   0.0397609   0.0435653
   25      56     31901     31845   5058.95      5376   0.0416902   0.0434573
   26      56     33235     33179   5068.87      5336   0.0432662   0.0433711
   27      56     34581     34525   5079.26      5384   0.0404584   0.0432882
   28      56     35973     35917   5091.16      5568   0.0388247   0.0431899
   29      56     37334     37278    5099.8      5444   0.0421165   0.0431158
   30      55     38687     38632   5108.69      5416    0.041287   0.0430428
   31      55     40051     39996    5117.9      5456   0.0396263   0.0429687
Total time run:       31.765494
Total reads made:     40670
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   5121.28
Average IOPS:         1280
Stddev IOPS:          132
Max IOPS:             1407
Min IOPS:             676
Average Latency(s):   0.0429428
Max latency(s):       0.386146
Min latency(s):       0.0122565
Write, however, still lags behind TCP-only ...
Code:
 rados bench -p rbd 60 write --no-cleanup -t 56
hints = 1
Maintaining 56 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60 seconds or 0 objects
Object prefix: benchmark_data_pve04_7328
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      55       551       496    1983.9      1984   0.0797449    0.103507
    2      55      1202      1147   2293.77      2604   0.0816362   0.0935867
    3      55      1887      1832   2442.41      2740   0.0743531    0.089253
    4      55      2592      2537   2536.74      2820   0.0663337   0.0866659
    5      55      3261      3206   2564.54      2676   0.0653222    0.085569
    6      55      3928      3873   2581.74      2668   0.0857504    0.085483
    7      55      4594      4539   2593.45      2664   0.0804832   0.0852295
    8      55      5265      5210   2604.73      2684   0.0805655   0.0850718
    9      55      5883      5828   2589.95      2472   0.0773111   0.0856514
   10      55      6521      6466   2586.13      2552    0.080982   0.0859005
   11      55      7147      7092   2578.64      2504    0.078478   0.0862052
   12      55      7818      7763   2587.39      2684   0.0803282   0.0857433
   13      55      8467      8412   2588.04      2596   0.0757922   0.0860146
   14      55      9141      9086   2595.72      2696   0.0717423   0.0857552
   15      55      9796      9741   2597.32      2620   0.0687703   0.0857443
   16      55     10488     10433   2607.97      2768   0.0850134   0.0854111
   17      55     11189     11134   2619.49      2804   0.0749925   0.0851103
   18      55     11880     11825   2627.49      2764   0.0844743   0.0848459
   19      56     12571     12515   2634.45      2760   0.0781356   0.0846448
2017-10-17 19:17:24.676126 min lat: 0.0467328 max lat: 0.565147 avg lat: 0.0844825
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   20      55     13259     13204   2640.52      2756   0.0750785   0.0844825
   21      55     13958     13903   2647.89      2796   0.0764135   0.0842636
   22      55     14649     14594   2653.15      2764    0.071717   0.0841109
   23      55     15324     15269   2655.18      2700   0.0785988   0.0840613
   24      55     15989     15934   2655.37      2660   0.0817391   0.0840582
   25      55     16673     16618   2658.58      2736   0.0771598    0.083969
   26      55     17320     17265   2655.85      2588   0.0638046     0.08393
   27      55     18027     17972   2662.22      2828    0.087538    0.083882
   28      55     18730     18675   2667.55      2812   0.0791072   0.0837174
   29      55     19415     19360   2670.04      2740   0.0786147   0.0836577
   30      55     20079     20024   2669.56      2656   0.0745437   0.0835688
   31      55     20780     20725   2673.89      2804    0.086797   0.0835369
   32      55     21458     21403   2675.07      2712   0.0761673   0.0835176
   33      55     22156     22101   2678.61      2792   0.0756816   0.0834149
   34      55     22873     22818   2684.17      2868    0.080455   0.0832355
   35      55     23541     23486   2683.81      2672    0.154328   0.0832516
   36      55     24233     24178   2686.14      2768    0.084745     0.08319
   37      55     24938     24883   2689.75      2820   0.0753942   0.0830964
   38      55     25650     25595   2693.91      2848   0.0678812   0.0829804
   39      55     26328     26273   2694.36      2712   0.0841915   0.0829506
2017-10-17 19:17:44.678435 min lat: 0.042991 max lat: 0.565147 avg lat: 0.0829929
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   40      55     26990     26935    2693.2      2648   0.0843487   0.0829929
   41      55     27663     27608   2693.16      2692   0.0674891    0.083007
   42      55     28349     28294   2694.36      2744   0.0868215   0.0829636
   43      55     29012     28957   2693.37      2652   0.0832474   0.0829966
   44      55     29707     29652   2695.33      2780   0.0827921   0.0829493
   45      55     30403     30348   2697.29      2784   0.0795371   0.0828813
   46      55     31090     31035   2698.39      2748    0.156816   0.0828392
   47      55     31768     31713   2698.67      2712   0.0719161   0.0828623
   48      55     32441     32386   2698.53      2692   0.0882288   0.0828552
   49      55     33102     33047   2697.41      2644   0.0854443   0.0828841
   50      55     33775     33720    2697.3      2692   0.0744844   0.0829179
   51      55     34449     34394   2697.26      2696   0.0850129   0.0829116
   52      55     35128     35073   2697.62      2716   0.0739072   0.0829083
   53      55     35799     35744   2697.35      2684   0.0911926   0.0829044
   54      55     36484     36429   2698.14      2740   0.0745314   0.0828959
   55      55     37186     37131   2700.13      2808    0.083383   0.0828285
   56      55     37892     37837   2702.33      2824   0.0742088   0.0827714
   57      55     38572     38517   2702.64      2720   0.0916781    0.082751
   58      55     39274     39219   2704.45      2808    0.132171   0.0827166
   59      55     39979     39924   2706.41      2820     0.16289   0.0826394
2017-10-17 19:18:04.680764 min lat: 0.0319829 max lat: 0.565147 avg lat: 0.08258
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   60      55     40670     40615   2707.36      2764   0.0319829     0.08258
Total time run:         60.044364
Total writes made:      40670
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     2709.33
Stddev Bandwidth:       125.562
Max bandwidth (MB/sec): 2868
Min bandwidth (MB/sec): 1984
Average IOPS:           677
Stddev IOPS:            31
Max IOPS:               717
Min IOPS:               496
Average Latency(s):     0.0826086
Stddev Latency(s):      0.0228883
Max latency(s):         0.565147
Min latency(s):         0.0319829

Bad news: KVM won't start.
Code:
/home/builder/source/ceph-12.2.1/src/msg/async/rdma/Infiniband.cc: In function 'void Infiniband::set_dispatcher(RDMADispatcher*)' thread 7f396c0d5280 time 2017-10-17 19:26:48.498276
/home/builder/source/ceph-12.2.1/src/msg/async/rdma/Infiniband.cc: 779: FAILED assert(!d ^ !dispatcher)
ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f395c0aea72]
2: (()+0x4432fa) [0x7f395c2622fa]
3: (RDMAStack::RDMAStack(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x97e) [0x7f395c270d1e]
4: (NetworkStack::create(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x212) [0x7f395c2562d2]
5: (AsyncMessenger::AsyncMessenger(CephContext*, entity_name_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long)+0xeb5) [0x7f395c24a995]
6: (Messenger::create(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, entity_name_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, unsigned long)+0x10f) [0x7f395c1f7f2f]
7: (Messenger::create_client_messenger(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x14a) [0x7f395c1f85fa]
8: (librados::RadosClient::connect()+0x85) [0x7f396aee4ef5]
9: (rados_connect()+0x1f) [0x7f396ae9184f]
10: (()+0x56a691) [0x5568c87b4691]
11: (()+0x507464) [0x5568c8751464]
12: (()+0x50ae84) [0x5568c8754e84]
13: (()+0x50bb8b) [0x5568c8755b8b]
14: (()+0x50aa54) [0x5568c8754a54]
15: (()+0x50bcd1) [0x5568c8755cd1]
16: (()+0x545f06) [0x5568c878ff06]
17: (()+0x212b36) [0x5568c845cb36]
18: (()+0x3414bc) [0x5568c858b4bc]
19: (()+0x34c1c1) [0x5568c85961c1]
20: (()+0x5d536a) [0x5568c881f36a]
21: (main()+0x12a0) [0x5568c8460120]
22: (__libc_start_main()+0xf1) [0x7f39663a52b1]
23: (()+0x21cdda) [0x5568c8466dda]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
TASK ERROR: start failed: command '/usr/bin/kvm -id 120 -chardev 'socket,id=qmp,path=/var/run/qemu-server/120.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/120.pid -daemonize -smbios 'type=1,uuid=ac188274-8f76-4369-9ec1-4e98e18cfed5' -name wsus -smp '16,sockets=2,cores=8,maxcpus=16' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/120.vnc,x509,password -no-hpet -cpu 'host,+kvm_pv_unhalt,+kvm_pv_eoi,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed' -m 8096 -k de -object 'iothread,id=iothread-virtio1' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -chardev 'socket,path=/var/run/qemu-server/120.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:cda4de8d061' -drive 'if=none,id=drive-ide0,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=rbd:vmpool/vm-120-disk-3:mon_host=192.168.100.141;192.168.100.142;192.168.100.143;192.168.100.144:auth_supported=none,if=none,id=drive-scsi2,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=2,drive=drive-scsi2,id=scsi2' -drive 'file=rbd:vmpool/vm-120-disk-1:mon_host=192.168.100.141;192.168.100.142;192.168.100.143;192.168.100.144:auth_supported=none,if=none,id=drive-virtio0,cache=none,format=raw,aio=native,detect-zeroes=on' -device 'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=101' -drive 'file=rbd:vmpool/vm-120-disk-2:mon_host=192.168.100.141;192.168.100.142;192.168.100.143;192.168.100.144:auth_supported=none,if=none,id=drive-virtio1,format=raw,cache=none,aio=native,detect-zeroes=on' -device 'virtio-blk-pci,drive=drive-virtio1,id=virtio1,bus=pci.0,addr=0xb,iothread=iothread-virtio1' -netdev 'type=tap,id=net0,ifname=tap120i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=C6:17:91:A8:66:9C,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -rtc 'driftfix=slew,base=localtime' -global 'kvm-pit.lost_tick_policy=discard'' failed: exit code 1
 
does mapping an image via KRBD work?
 
@fabian , which steps are needed to map an image via KRBD?
Do I have to modify the VM configuration? I can't follow your suggestion ... I have no clue how to accomplish this :(

just got a result from Mellanox:
Hi,

1. It could be some performance limitation in ConnectX3-Pro with global pause when RDMA SEND operations are used.
I was advised to try PFC instead of global pause.

2. But before switching to the PFC, could you please run rados bench with smaller block sizes.
In our lab we saw RDMA outperforming TCP with block size= 256k

rados bench -p rbd 60 write --no-cleanup -t 56 -b 256K -o 1M

3. For the crash with KVM: unfortunately 12.2.1 does not include the major redesign that was done in the code where the crash occurs.
I guess we'll need to wait until 13.0.0 is ready for you in order to retest the scenario.

Regards,
Vladimir

Update 15:56: not really amusing news on topic #2 :(
1. We didn't test "rados bench" in our cluster (Spectrum switch with RoCEv2 PFC/ECN, ConnectX4-LX cards, SSD disks (not NVMe)).
Unfortunately it is not available now for more testing.

For block device (FIO WRITE) we got 4.1 GB/s randwrite performance with one client, io depth = 128 and BS = 256k.

2. The crash is in the Ceph RDMA code when KVM starts and invokes rados_client.
This Ceph RDMA code is totally different in 13.0.0.
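
Regarding Vladimir's point 1, switching from global pause to PFC would look roughly like this (untested sketch; the interface name eth2 and the choice of priority 3 are my assumptions, not Mellanox's instructions):
Code:
# turn off global pause on the storage interface
ethtool -A eth2 rx off tx off
# enable PFC for priority 3 only (mlnx_qos ships with Mellanox OFED)
mlnx_qos -i eth2 --pfc 0,0,0,1,0,0,0,0
# the switch ports need matching PFC settings, and the Ceph RDMA traffic
# must actually be tagged with that priority for this to take effect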
 
if you used a recent enough PVE to set up Ceph, you should already have a storage using KRBD (for containers). you can just create a container on it and attempt to start it. but if the Mellanox people are not sure whether it will work in Luminous, then it might not make much sense to spend more time on this (now).
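
if you want to test the kernel client by hand, mapping an image looks roughly like this (untested sketch; pool/image names are taken from your KVM command line above, adjust to your setup):
Code:
# map the image via the kernel RBD client; prints a device like /dev/rbd0
rbd map vmpool/vm-120-disk-1
# list currently mapped images
rbd showmapped
# quick read test through the kernel client
dd if=/dev/rbd0 of=/dev/null bs=4M count=256 iflag=direct
# unmap when done
rbd unmap /dev/rbd0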

the massive refactoring is unfortunate as it also makes it unlikely that we can backport fixes from Mimic to our Luminous, but on the other hand the RDMA part of messenger is marked experimental so stuff like this is to be expected.
 
hi
CEPH 12.2.2: 4 nodes x 4 OSDs (4x NVMe Intel P4500), net 10G
Code:
rados bench -p pool1 10 write
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_m204_54931
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       387       371   1483.92      1484   0.0353006   0.0421762
    2      16       759       743   1485.85      1488   0.0673896   0.0425258
    3      15      1130      1115   1486.52      1488   0.0282569   0.0427402
    4      16      1510      1494   1493.86      1516   0.0249644    0.042531
    5      16      1883      1867   1493.46      1492   0.0253364   0.0426022
    6      16      2261      2245   1496.53      1512   0.0464474   0.0426541
    7      16      2630      2614   1493.58      1476   0.0576214   0.0427081
    8      16      3002      2986   1492.86      1488   0.0320428   0.0427463
    9      16      3393      3377   1500.74      1564   0.0382185   0.0425315
   10      15      3768      3753   1501.05      1504   0.0212161   0.0425504
Total time run:         10.044017
Total writes made:      3769
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1500.99
Stddev Bandwidth:       25.4419
Max bandwidth (MB/sec): 1564
Min bandwidth (MB/sec): 1476
Average IOPS:           375
Stddev IOPS:            6
Max IOPS:               391
Min IOPS:               369
Average Latency(s):     0.0426199
Stddev Latency(s):      0.0165351
Max latency(s):         0.244824
Min latency(s):         0.0121168

in LXC:
Code:
dd if=/dev/zero of=here bs=1M count=10240 oflag=direct
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 41.3299 s, 260 MB/s
iotop on Proxmox shows ~400-430 MB/s

in LXC:
Code:
sysbench --threads=100 --db-driver=mysql --mysql-user=root --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-db=foo --range_size=1000 --table_size=45000000 --time=60 --rand-type=uniform /usr/share/sysbench/oltp_read_write.lua run
iotop on Proxmox shows ~200 MB/s

why?
 

do you have step-by-step instructions on how to enable Ceph over RDMA? I also followed your email thread
[ceph-users] RDMA with mellanox connect x3pro on debian stretch and proxmox v5.0 kernel 4.10.17-3

I am also stuck on
ms_async_rdma_local_gid= as I have multiple nodes.
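
For reference, the RDMA bits in ceph.conf from the guides I followed look roughly like this (sketch only; the device name and GID below are placeholders, and the GID differs on every node, which is exactly where I'm stuck):
Code:
[global]
    # switch the async messenger to RDMA
    ms_type = async+rdma
    # RDMA device as listed by ibv_devices (placeholder name)
    ms_async_rdma_device_name = mlx4_0
    # node-specific GID; read it on each node from sysfs, e.g.
    #   cat /sys/class/infiniband/mlx4_0/ports/1/gids/0
    ms_async_rdma_local_gid = fe80:0000:0000:0000:0202:c9ff:fe00:0001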
 
well... I figured it out as well.
[attached screenshot: 34278654_10215916886545588_4995711400983658496_n.jpg]
 
