Ceph SSDs are painfully slow at 33 MB/s

Discussion in 'Proxmox VE: Installation and configuration' started by FlyingTux, Apr 22, 2019.

  1. FlyingTux

    FlyingTux New Member

    Joined:
    Apr 13, 2019
    Messages:
    6
    Likes Received:
    0
    Hello

    I have some performance issues with my Proxmox cluster.
    I use 3 × Dell R620 servers with 2 × 2.6 GHz Xeons each, and every server has its own dual-port 10G Mellanox 3rd-gen NIC for Ceph with an MTU of 9000; the 3 servers are directly connected (no 10G switch for Ceph). The setup is simple (used as a test setup): each server has one Samsung PM883 480GB SSD as OSD, connected to a Dell H710P Mini RAID controller as a single-disk RAID-0. Proxmox itself runs the latest version, freshly installed, and the firmware of the NICs is the latest available. iperf3 delivers a nice throughput of around 9.9 Gbit/s.
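    (Measured with a plain iperf3 server/client run between the mesh addresses, roughly like this:)
    Code:
    # on server-2
    iperf3 -s
    # on server-1, against server-2's mesh address
    iperf3 -c 172.16.1.2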

    Problem:
    A simple dd or a copy from VM 1 to VM 2 struggles along at around 33 MB/s, which is not an acceptable speed. While I have read a lot about Ceph and also run it with Mimic on a different cluster, I'm clueless at the moment. Sure, I hit the obvious blog articles and tuning guides first. However, any input is highly appreciated.
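    For example, a plain single-threaded sequential write inside the VM, something like (file name is just an example):
    Code:
    # single-threaded sequential write, bypassing the page cache
    dd if=/dev/zero of=/root/ddtest bs=1M count=4096 oflag=direct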

    Code:
    [global]
         auth client required = cephx
         auth cluster required = cephx
         auth service required = cephx
         cluster network = 172.16.1.0/28
         fsid = d1222c67-c6e9-4b0b-b8b9-0abe53e5f590
         keyring = /etc/pve/priv/$cluster.$name.keyring
         mon allow pool delete = true
         osd journal size = 5120
         osd pool default min size = 2
         osd pool default size = 3
         public network = 10.0.10.0/24
    
    [osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring
    
    [mon.server-3]
         host = server-3
         mon addr = 10.0.10.3:6789
    
    [mon.server-1]
         host = server-1
         mon addr = 10.0.10.1:6789
    
    [mon.server-2]
         host = server-2
         mon addr = 10.0.10.2:6789
    
    
    Code:
    # begin crush map
    tunable choose_local_tries 0
    tunable choose_local_fallback_tries 0
    tunable choose_total_tries 50
    tunable chooseleaf_descend_once 1
    tunable chooseleaf_vary_r 1
    tunable chooseleaf_stable 1
    tunable straw_calc_version 1
    tunable allowed_bucket_algs 54
    
    # devices
    device 0 osd.0 class ssd
    device 1 osd.1 class ssd
    device 2 osd.2 class ssd
    
    # types
    type 0 osd
    type 1 host
    type 2 chassis
    type 3 rack
    type 4 row
    type 5 pdu
    type 6 pod
    type 7 room
    type 8 datacenter
    type 9 region
    type 10 root
    
    # buckets
    host server-1 {
        id -3        # do not change unnecessarily
        id -2 class ssd        # do not change unnecessarily
        # weight 0.436
        alg straw2
        hash 0    # rjenkins1
        item osd.0 weight 0.436
    }
    host server-2 {
        id -5        # do not change unnecessarily
        id -4 class ssd        # do not change unnecessarily
        # weight 0.436
        alg straw2
        hash 0    # rjenkins1
        item osd.1 weight 0.436
    }
    host server-3 {
        id -7        # do not change unnecessarily
        id -6 class ssd        # do not change unnecessarily
        # weight 0.436
        alg straw2
        hash 0    # rjenkins1
        item osd.2 weight 0.436
    }
    root default {
        id -1        # do not change unnecessarily
        id -8 class ssd        # do not change unnecessarily
        # weight 1.308
        alg straw2
        hash 0    # rjenkins1
        item server-1 weight 0.436
        item server-2 weight 0.436
        item server-3 weight 0.436
    }
    
    # rules
    rule replicated_rule {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
    }
    
    # end crush map
    
    Latency stays low, between 0 and 1 ms, while copying.
    Code:
    ceph osd perf
    osd commit_latency(ms) apply_latency(ms)
      0                  0                 0
      2                  0                 0
      1                  0                 0
     
    #1 FlyingTux, Apr 22, 2019
    Last edited: Apr 22, 2019
  2. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,093
    Likes Received:
    184
    A RAID controller is not really suited for software-defined storage technologies and is a possible point for improvement.
    Please read the preconditions in our docs.
    https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition

    How did you do the dd from one machine to the other? Better run tests with fio and rados bench, and please also share the results so we can see where it's going. For the commands, see our Ceph benchmark paper.
    https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

    I also recommend doing an fio test (see the benchmark paper) on the raw device, to have a comparison with the others in the above thread.

    In general, there are not enough OSDs in those servers to achieve good speeds. I recommend you look into the Ceph benchmark paper to get some numbers, so you can compare against where you want to go with it.
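    For instance, something along the lines of the 4k sync-write test from the benchmark paper (device and pool names are placeholders; note that the raw-device write test is destructive on that device):
    Code:
    # raw-device 4k sync write, queue depth 1 (destroys data on /dev/sdX!)
    fio --ioengine=libaio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
        --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=ssd-test

    # Ceph-level write benchmark against your pool
    rados bench -p <poolname> 60 write --no-cleanup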

    EDIT:
    Do these networks reside on the same link? And how did you configure your mesh setup?
     
  3. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,302
    Likes Received:
    131
    The problem with dd or a simple copy is that it only uses 1 thread with a low queue depth. So here, network latency plus CPU power can really hurt you (especially with small block sizes, like 4k).
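    You can see the difference with fio, for example (file path and size are just examples):
    Code:
    # queue depth 1, single job -- roughly what dd/cp behave like
    fio --name=qd1  --ioengine=libaio --direct=1 --rw=write --bs=4k --iodepth=1  --size=1G --filename=/mnt/test/fio.dat

    # same workload with a deeper queue -- Ceph scales much better with parallelism
    fio --name=qd32 --ioengine=libaio --direct=1 --rw=write --bs=4k --iodepth=32 --size=1G --filename=/mnt/test/fio.dat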



    ceph.conf tuning (reduce CPU usage/latency)
    -------------------------
    #disable cephx (needs a restart of the whole Ceph cluster + VMs, and it breaks CephFS if you need it)
    auth client required = none
    auth cluster required = none
    auth service required = none

    #disable debug
    [global]
    debug asok = 0/0
    debug auth = 0/0
    debug buffer = 0/0
    debug client = 0/0
    debug context = 0/0
    debug crush = 0/0
    debug filer = 0/0
    debug filestore = 0/0
    debug finisher = 0/0
    debug heartbeatmap = 0/0
    debug journal = 0/0
    debug journaler = 0/0
    debug lockdep = 0/0
    debug mds = 0/0
    debug mds balancer = 0/0
    debug mds locker = 0/0
    debug mds log = 0/0
    debug mds log expire = 0/0
    debug mds migrator = 0/0
    debug mon = 0/0
    debug monc = 0/0
    debug ms = 0/0
    debug objclass = 0/0
    debug objectcacher = 0/0
    debug objecter = 0/0
    debug optracker = 0/0
    debug osd = 0/0
    debug paxos = 0/0
    debug perfcounter = 0/0
    debug rados = 0/0
    debug rbd = 0/0
    debug rgw = 0/0
    debug throttle = 0/0
    debug timer = 0/0
    debug tp = 0/0
    rbd_skip_partial_discard = true
    perf = true
    mutex_perf_counter = false
    throttler_perf_counter = false

    The ceph.conf needs to be on the Ceph servers and on the Proxmox servers.



    Instead of cache=none in the VM disk options, you can try cache=writeback. It only helps with sequential writes with small blocks (which could work for your copy workload), but be careful: currently it slows down reads (that should be fixed soon in Ceph Nautilus).
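    Changing it for an existing disk could look roughly like this (VM ID, bus and storage/volume names are just examples; check the current disk line with 'qm config <vmid>' first):
    Code:
    # switch the scsi0 disk of VM 100 to writeback caching
    qm set 100 --scsi0 ceph-vm:vm-100-disk-0,cache=writeback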


    After that:
    - don't use RAID-0 but passthrough for your disks (with RAID-0, at least disable the cache in the controller)
    - try a CPU with a higher frequency
    - use switches with low latency

    Also, your 3 servers are directly connected? Can you send your /etc/network/interfaces configuration?
     
    #3 spirit, Apr 23, 2019
    Last edited: Apr 23, 2019
  4. FlyingTux

    FlyingTux New Member

    Joined:
    Apr 13, 2019
    Messages:
    6
    Likes Received:
    0
    I have to use RAID-0, but I disabled the cache. My CPU info in my first post is misleading... actually each server has 2 CPUs and each CPU has 8 cores, so with HT that's 32 threads per server... should be more than enough. The kernel's CPU governor is set to performance.
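    For reference, this is how the governor can be checked (standard sysfs path; cpupower only if the package is installed):
    Code:
    # show the active governor for every core
    cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c

    # or, if linux-cpupower is installed
    cpupower frequency-info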

    Code:
     rados bench -p CePH-VM 10 write
    hints = 1
    Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
    Object prefix: benchmark_data_server-1_17247
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
        0       0         0         0         0         0           -           0
        1      16        25         9   35.9979        36   0.0180919    0.265156
        2      16        31        15   29.9968        24     1.81558    0.743033
        3      16        37        21   27.9967        24     2.12503     1.18066
        4      16        44        28   27.9964        28     3.57194     1.49964
        5      16        49        33   26.3968        20      3.5692      1.6615
        6      16        57        41     27.33        32   0.0159856     1.75642
        7      16        63        47   26.8538        24     2.85781     1.88906
        8      16        69        53   26.4967        24     3.21376      1.9983
        9      16        76        60   26.6633        28     3.21524     2.03883
       10      16        80        64   25.5968        16     2.14982     2.07346
       11      15        81        66    23.997         8     3.57423     2.08645
       12      15        81        66   21.9972         0           -     2.08645
       13      15        81        66   20.3051         0           -     2.08645
       14       7        81        74   21.1402   10.6667     7.14804     2.42559
    Total time run:         14.679955
    Total writes made:      81
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     22.0709
    Stddev Bandwidth:       11.2218
    Max bandwidth (MB/sec): 36
    Min bandwidth (MB/sec): 0
    Average IOPS:           5
    Stddev IOPS:            2
    Max IOPS:               9
    Min IOPS:               0
    Average Latency(s):     2.70551
    Stddev Latency(s):      1.73505
    Max latency(s):         7.50364
    Min latency(s):         0.0159856
    Cleaning up (deleting benchmark objects)
    Removed 81 objects
    Clean up completed and total clean up time :0.012810
    
    
    
    rados bench -p CePH-VM 10 seq
    hints = 1
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
        0       0         0         0         0         0           -           0
        1      16        22         6   23.9954        24    0.783942    0.256662
        2      16        31        15   29.9952        36   0.0109593     0.42979
        3      16        34        18   23.9964        12     2.99816    0.807603
        4      16        40        24   23.9965        24   0.0104326    0.911911
        5      16        44        28   22.3968        16      4.7625     1.26101
        6      16        48        32   21.3303        16       5.946     1.60255
        7      16        51        35   19.9972        12     3.29575     1.84323
        8      16        54        38   18.9974        12     2.91921     1.95155
        9      16        58        42   18.6641        16      1.2321     2.02211
       10      16        62        46   18.3975        16     7.90569     2.20937
       11      16        63        47   17.0886         4     1.96798     2.20423
       12      16        63        47   15.6645         0           -     2.20423
       13      16        63        47   14.4596         0           -     2.20423
       14      15        63        48   13.7125   1.33333     7.97992     2.32456
       15      15        63        48   12.7983         0           -     2.32456
    Total time run:       15.737216
    Total reads made:     63
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   16.013
    Average IOPS:         4
    Stddev IOPS:          2
    Max IOPS:             9
    Min IOPS:             0
    Average Latency(s):   3.95356
    Max latency(s):       13.2759
    Min latency(s):       0.0101389
    
    
    
    rados bench -p CePH-VM 10 rand
    hints = 1
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
        0       0         0         0         0         0           -           0
        1      16        27        11   43.9911        44    0.010397     0.13392
        2      16        32        16   31.9946        20  0.00184208    0.394396
        3      16        35        19   25.3294        12     2.90997    0.756772
        4      16        37        21   20.9971         8     3.77022     1.02611
        5      16        42        26   20.7974        20     4.11215     1.31635
        6      16        45        29    19.331        12     5.98672     1.67422
        7      16        51        35   19.9976        24     3.47032     1.77268
        8      16        54        38   18.9977        12     7.26645     2.10099
        9      16        56        40   17.7756         8     7.43102     2.22619
       10      16        59        43    17.198        12     1.42732     2.33517
       11      16        60        44   15.9982         4     8.26329      2.4699
       12      16        60        44    14.665         0           -      2.4699
       13      16        60        44   13.5369         0           -      2.4699
       14      12        60        48   13.7127   5.33333     9.47551     2.85015
       15       7        60        53   14.1317        20     5.57713     3.33544
    Total time run:       15.716590
    Total reads made:     60
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   15.2705
    Average IOPS:         3
    Stddev IOPS:          2
    Max IOPS:             11
    Min IOPS:             0
    Average Latency(s):   3.9689
    Max latency(s):       11.9401
    Min latency(s):       0.00146251
    
    
    The performance is almost identical to an old ESXi box with 3 VMs sharing 1 old HDD for Ceph... so there must be some weird issue.

    Code:
    auto lo
    iface lo inet loopback
    
    iface eno1 inet manual
    #Member of bond0
    
    iface eno2 inet manual
    #Member of bond0
    
    iface eno3 inet manual
    
    iface eno4 inet manual
    
    auto enp65s0
    iface enp65s0 inet static
        address  172.16.1.1
        netmask  255.255.255.240
        mtu 9000
        up   ip route add 172.16.1.3/32 dev enp65s0
        down ip route del 172.16.1.3/32
    
    auto enp65s0d1
    iface enp65s0d1 inet static
        address  172.16.1.1
        netmask  255.255.255.240
        mtu 9000
        up   ip route add 172.16.1.2/32 dev enp65s0d1
        down ip route del 172.16.1.2/32
    
    auto bond0
    iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
    
    auto vmbr0
    iface vmbr0 inet static
        address 10.0.10.1
        netmask 255.255.255.0
        gateway 10.0.10.254
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0
    
    The cluster network is directly attached (no switch) with fiber (LWL) cables; the public network is bonded to a FastEthernet Cisco 3560 Layer 3 switch... that should be enough for the monitor traffic and some SSH sessions (those servers are for testing and practicing).

    After editing the ceph.conf as mentioned in the post above, the performance got worse...
     
    #4 FlyingTux, Apr 23, 2019
    Last edited: Apr 23, 2019
  5. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,829
    Likes Received:
    158
    Hi,
    which profile do you have selected in the BIOS? If it isn't Performance, you can get very weird I/O performance (I had this with normal RAID volumes on an R620).

    Udo
     
  6. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,093
    Likes Received:
    184
  7. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,302
    Likes Received:
    131
  8. FlyingTux

    FlyingTux New Member

    Joined:
    Apr 13, 2019
    Messages:
    6
    Likes Received:
    0
    @Alwin
    I know that diagram, but I actually don't know what you are trying to tell me?

    @udo
    Yes, I have enabled the performance mode.

    @all
    Here is the output from a 3-node Ceph cluster without a separate cluster network, only 1G NICs, consumer-grade HDDs (1 OSD per node), and each server in a different geographical location. OS is CentOS 7 with Ceph Mimic:
    Code:
    rados bench -p bkpub_data 10 rand --no-cleanup
    hints = 1
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
        0       0         0         0         0         0           -           0
        1      16        42        26   103.983       104   0.0510472    0.227717
        2      16        58        42   83.9866        64   0.0840142    0.251901
        3      16        75        59   78.6561        68     0.10658    0.445594
        4      16       105        89   88.9894       120     1.79773    0.605022
        5      16       137       121   96.7892       128     1.09862    0.586289
        6      16       169       153   101.989       128  0.00219087    0.565845
        7      16       196       180   102.846       108     1.21281    0.550116
        8      16       224       208   103.989       112   0.0633576    0.563047
        9      16       250       234    103.99       104  0.00401208    0.571408
       10      16       269       253    101.19        76  0.00151874    0.585309
       11       7       270       263   95.6267        40     1.80983    0.624011
    Total time run:       11.3504
    Total reads made:     270
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   95.1505
    Average IOPS:         23
    Stddev IOPS:          7
    Max IOPS:             32
    Min IOPS:             10
    Average Latency(s):   0.65186
    Max latency(s):       3.13723
    Min latency(s):       0.0013428
     
  9. FlyingTux

    FlyingTux New Member

    Joined:
    Apr 13, 2019
    Messages:
    6
    Likes Received:
    0
    Now I'm curious... I switched on all the caches on my RAID controller and plugged in a 1G switch for the public network... look what I got:
    Code:
    rados bench -p VM 10 rand --no-cleanup
    hints = 1
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
        0       0         0         0         0         0           -           0
        1      16        71        55    219.96       220   0.0107213    0.168748
        2      16       108        92   183.967       148    0.271629    0.270663
        3      16       149       133   177.303       164    0.529153    0.322423
        4      16       192       176   175.972       172    0.437054    0.336477
        5      16       230       214   171.174       152    0.286403    0.349977
        6      16       269       253   168.642       156    0.128445    0.350509
        7      16       315       299   170.833       184    0.231363    0.352929
        8      16       356       340   169.976       164  0.00288305    0.360454
        9      16       398       382   169.755       168     0.84199    0.359518
       10      16       438       422   168.778       160    0.397807    0.368608
    Total time run:       10.574067
    Total reads made:     438
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   165.688
    Average IOPS:         41
    Stddev IOPS:          5
    Max IOPS:             55
    Min IOPS:             37
    Average Latency(s):   0.384536
    Max latency(s):       1.10396
    Min latency(s):       0.00165442
    
    Question:
    Can I use the 10G network as public/cluster network just for Ceph, and use my other NIC ports as normal access ports to reach the servers?
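    A sketch of what I have in mind for the [global] section (subnet taken from my current mesh setup):
    Code:
    [global]
         # run both Ceph networks over the 10G mesh
         public network  = 172.16.1.0/28
         cluster network = 172.16.1.0/28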
     
  10. FlyingTux

    FlyingTux New Member

    Joined:
    Apr 13, 2019
    Messages:
    6
    Likes Received:
    0
    I replaced my switch and got much higher bandwidth:
    Code:
    Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
    Object prefix: benchmark_data_server-1_12493
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
        0       0         0         0         0         0           -           0
        1      16        91        75   299.983       300    0.320847    0.182094
        2      16       178       162   323.966       348   0.0085761    0.182038
        3      16       253       237   315.966       300   0.0427078    0.186121
        4      16       314       298   297.968       244  0.00814626    0.202689
        5      16       387       371   296.768       292    0.106296    0.206307
        6      16       450       434   289.301       252    0.106864    0.213456
        7      16       519       503   287.396       276  0.00793366    0.216395
        8      16       589       573   286.468       280    0.140945    0.217727
        9      16       665       649   288.411       304    0.490874    0.218006
       10      16       741       725   289.966       304    0.454621    0.217219
    Total time run:         10.353948
    Total writes made:      742
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     286.654
    Stddev Bandwidth:       29.4694
    Max bandwidth (MB/sec): 348
    Min bandwidth (MB/sec): 244
    Average IOPS:           71
    Stddev IOPS:            7
    Max IOPS:               87
    Min IOPS:               61
    Average Latency(s):     0.220838
    Stddev Latency(s):      0.191353
    Max latency(s):         0.667782
    Min latency(s):         0.00716044
    
    However, I noticed multiple effects... all kernel tuning parameters like read_ahead etc. slow Ceph down massively, and the hashing part of LACP is also picky.
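    For example, this is a quick way to see which hash policy and slave state the bond is actually using (bond0 as in my interfaces file):
    Code:
    # the bonding driver reports mode, transmit hash policy and per-slave state here
    cat /proc/net/bonding/bond0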

    I figured out via bmon that the heavy load is on my bond interface, where the public part of Ceph resides (3 × 1 Gig LACP), and not on the 10G full-mesh network. I don't understand why. Would it be wise to run Ceph completely on the 10G network, without a separate cluster network?
    When I try to set that up, an error message pops up:
    Code:
    Multiple IPs for ceph public network '172.16.1.0/28' detected on server-1: 172.16.1.1 172.16.1.1 use 'mon-address' to specify one of them. (500)
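    Following the hint in the message, pinning the monitor to one of the two mesh addresses should do it; roughly like this (the exact subcommand name depends on the PVE version):
    Code:
    # pin the new monitor to a single address on the 10G mesh
    pveceph createmon --mon-address 172.16.1.1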
     
    #10 FlyingTux, May 4, 2019
    Last edited: May 4, 2019
  11. FlyingTux

    FlyingTux New Member

    Joined:
    Apr 13, 2019
    Messages:
    6
    Likes Received:
    0
    After poking around with what are apparently some bugs (or at least a different workflow than plain Ceph) in Proxmox, like creating OSDs which don't appear in the GUI... I moved the whole Ceph setup to my 10G network, which means the public part as well.

    The various Ceph books I read and also the Proxmox wiki are kind of misleading here. To make the story short: you need 10G at minimum on each side!

    Code:
    rados bench -p VM 10 write --no-cleanup
    hints = 1
    Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
    Object prefix: benchmark_data_server-1_6571
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
        0       0         0         0         0         0           -           0
        1      16       245       229   915.936       916    0.115509   0.0669229
        2      16       364       348   695.919       476     0.10949   0.0894676
        3      16       480       464   618.588       464    0.141777    0.100721
        4      16       602       586   585.926       488    0.134688    0.107445
        5      16       721       705   563.929       476    0.133774    0.111867
        6      16       841       825   549.932       480    0.149869    0.114826
        7      16       962       946   540.506       484    0.151114    0.117325
        8      16      1081      1065   532.433       476    0.133783    0.119167
        9      16      1202      1186   527.041       484    0.120018    0.120681
       10      16      1321      1305   521.931       476    0.131636    0.121862
    Total time run:         10.092970
    Total writes made:      1322
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     523.929
    Stddev Bandwidth:       138.593
    Max bandwidth (MB/sec): 916
    Min bandwidth (MB/sec): 464
    Average IOPS:           130
    Stddev IOPS:            34
    Max IOPS:               229
    Min IOPS:               116
    Average Latency(s):     0.12212
    Stddev Latency(s):      0.0351463
    Max latency(s):         0.22965
    Min latency(s):         0.0140939
    
    Code:
    rados bench -p VM 10 rand
    hints = 1
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
        0       0         0         0         0         0           -           0
        1      15       336       321   1283.64      1284   0.0277775   0.0463885
        2      15       697       682    1363.6      1444   0.0447637   0.0441119
        3      15      1085      1070   1426.26      1552   0.0449257   0.0425291
        4      15      1501      1486   1485.62      1664   0.0514913     0.04079
        5      15      1920      1905   1523.64      1676   0.0209739   0.0397816
        6      15      2352      2337   1557.65      1728   0.0688246   0.0388397
        7      15      2780      2765   1579.66      1712   0.0233193   0.0383174
        8      15      3197      3182   1590.66      1668   0.0400136   0.0380396
        9      15      3645      3630      1613      1792    0.068332   0.0375842
       10      16      4090      4074   1629.26      1776   0.0993875   0.0372176
    Total time run:       10.042272
    Total reads made:     4090
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   1629.11
    Average IOPS:         407
    Stddev IOPS:          39
    Max IOPS:             448
    Min IOPS:             321
    Average Latency(s):   0.0373253
    Max latency(s):       0.139204
    Min latency(s):       0.00208962
    
     