Ceph SSDs are painfully slow at 33 MB/s

FlyingTux
Apr 13, 2019
Hello,

I have some performance issues with my Proxmox cluster.
I use 3x Dell R620 servers with 2x 2.6 GHz Xeons each, and each server has its own dual-port 10G Mellanox 3rd-gen NIC for Ceph with an MTU of 9000; the 3 servers are directly connected (no 10G switch for Ceph). The setup is simple (used as a test setup): each server has one Samsung PM883 480GB SSD as OSD, connected to a Dell H710P Mini RAID controller via RAID-0. Proxmox itself runs the latest version, freshly installed, and the firmware of the NICs is the latest available. iperf3 delivers a nice throughput of around 9.9 Gbit/s.

Problem:
A simple dd or a copy from VM 1 to VM 2 struggles along at around 33 MB/s, which is absolutely not an acceptable speed. I have read a lot about Ceph and also run it with Mimic on a different cluster, but I'm clueless at the moment. Of course I went through the obvious blog articles and tuning guides first. However, any input is highly appreciated.
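For reference, the kind of single-threaded test I mean is roughly the following, run inside one of the VMs (file path and size are just examples):
Code:
# single-threaded sequential write, bypassing the page cache
# (target path and size are placeholders)
dd if=/dev/zero of=/root/ddtest.img bs=4M count=1024 oflag=direct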

Code:
[global]
     auth client required = cephx
     auth cluster required = cephx
     auth service required = cephx
     cluster network = 172.16.1.0/28
     fsid = d1222c67-c6e9-4b0b-b8b9-0abe53e5f590
     keyring = /etc/pve/priv/$cluster.$name.keyring
     mon allow pool delete = true
     osd journal size = 5120
     osd pool default min size = 2
     osd pool default size = 3
     public network = 10.0.10.0/24

[osd]
     keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.server-3]
     host = server-3
     mon addr = 10.0.10.3:6789

[mon.server-1]
     host = server-1
     mon addr = 10.0.10.1:6789

[mon.server-2]
     host = server-2
     mon addr = 10.0.10.2:6789
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host server-1 {
    id -3        # do not change unnecessarily
    id -2 class ssd        # do not change unnecessarily
    # weight 0.436
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 0.436
}
host server-2 {
    id -5        # do not change unnecessarily
    id -4 class ssd        # do not change unnecessarily
    # weight 0.436
    alg straw2
    hash 0    # rjenkins1
    item osd.1 weight 0.436
}
host server-3 {
    id -7        # do not change unnecessarily
    id -6 class ssd        # do not change unnecessarily
    # weight 0.436
    alg straw2
    hash 0    # rjenkins1
    item osd.2 weight 0.436
}
root default {
    id -1        # do not change unnecessarily
    id -8 class ssd        # do not change unnecessarily
    # weight 1.308
    alg straw2
    hash 0    # rjenkins1
    item server-1 weight 0.436
    item server-2 weight 0.436
    item server-3 weight 0.436
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
Latency stays low, between 0 and 1 ms, while copying.
Code:
ceph osd perf
osd commit_latency(ms) apply_latency(ms)
  0                  0                 0
  2                  0                 0
  1                  0                 0
 
OSD connected to a DELL H710p mini RAID controller via RAID-0
A RAID controller is not really suited for software-defined storage technologies and is a possible point for improvement.
Please read the preconditions in our docs.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition

A simple dd or a copy from VM 1 to VM 2 struggles along at around 33 MB/s, which is absolutely not an acceptable speed.
How did you do the dd from one machine to the other? Better run tests with fio and rados bench, and please also share the results, so we can see where it's going. You can find the commands in our Ceph benchmark paper.
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

Samsung PM883 480GB SSD
I recommend also doing a fio test (see the benchmark paper) on the raw device to have a comparison with the others in the above thread.
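For example, a 4k synchronous write test against the raw device could look roughly like this (the device name is a placeholder; note that this is destructive, so only run it against a disk without data):
Code:
# destructive 4k sync write test on the raw SSD (device name is a placeholder)
fio --ioengine=libaio --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based --name=ssd-sync-write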

In general, there are not enough OSDs in those servers to achieve good speeds. I recommend you look into the Ceph benchmark paper to get some numbers, so you can compare against where you want to go with it.

EDIT:
[global]
cluster network = 172.16.1.0/28
public network = 10.0.10.0/24
Do these networks reside on the same link? And how did you configure your mesh setup?
 
The problem with dd or a simple copy is that it only uses one thread with a low queue depth. So here, the network latency plus CPU power can really hurt you (especially with small block sizes, like 4k).
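For illustration, you can see the effect by comparing a single-threaded 4k run against the default 16 threads with 4M objects (the pool name is a placeholder):
Code:
# 1 thread, 4k writes: dominated by per-request latency
rados bench -p <pool> 30 write -t 1 -b 4096
# 16 threads (default), 4M writes: closer to the sequential limit
rados bench -p <pool> 30 write -t 16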



ceph.conf tuning (reduce cpu usage/latency)
-------------------------
#disable cephx (needs a restart of the whole Ceph cluster + VMs, and this breaks CephFS if you need it)
auth client required = none
auth cluster required = none
auth service required = none

#disable debug
[global]
debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0
rbd_skip_partial_discard = true
perf = true
mutex_perf_counter = false
throttler_perf_counter = false

ceph.conf needs to be on the Ceph servers and on the Proxmox servers.



Instead of cache=none in the VM options, you can try cache=writeback. It only helps with sequential writes with small blocks (which could work for your copy workload), but be careful: currently it slows down reads (this should be fixed soon in Ceph Nautilus).
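For example, switching an existing disk could look like this (VMID, storage and disk name are placeholders):
Code:
# change the cache mode of an existing disk to writeback
# (VMID, storage name and disk name are placeholders)
qm set 100 --scsi0 ceph-vm:vm-100-disk-0,cache=writeback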


After that,
- don't use RAID-0 but passthrough for your disks (if RAID-0, at least disable the cache on your controller)
- try a CPU with a higher frequency
- use switches with low latency

Also, your 3 servers are directly connected? Can you send your /etc/network/interfaces configuration?
 
After that,
- don't use RAID-0 but passthrough for your disks (if RAID-0, at least disable the cache on your controller)
- try a CPU with a higher frequency
- use switches with low latency

Also, your 3 servers are directly connected? Can you send your /etc/network/interfaces configuration?
I have to use RAID-0, but I disabled the cache. My CPU info in my first post is misleading: actually each server has 2 CPUs and each CPU has 8 cores, which with HT is 32 threads per server. That should be more than enough. The kernel's CPU governor is set to performance.
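For reference, this is how the governor can be checked and pinned (standard sysfs paths):
Code:
# show the current governor of all cores
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
# pin all cores to the performance governor
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor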

Code:
 rados bench -p CePH-VM 10 write
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_server-1_17247
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        25         9   35.9979        36   0.0180919    0.265156
    2      16        31        15   29.9968        24     1.81558    0.743033
    3      16        37        21   27.9967        24     2.12503     1.18066
    4      16        44        28   27.9964        28     3.57194     1.49964
    5      16        49        33   26.3968        20      3.5692      1.6615
    6      16        57        41     27.33        32   0.0159856     1.75642
    7      16        63        47   26.8538        24     2.85781     1.88906
    8      16        69        53   26.4967        24     3.21376      1.9983
    9      16        76        60   26.6633        28     3.21524     2.03883
   10      16        80        64   25.5968        16     2.14982     2.07346
   11      15        81        66    23.997         8     3.57423     2.08645
   12      15        81        66   21.9972         0           -     2.08645
   13      15        81        66   20.3051         0           -     2.08645
   14       7        81        74   21.1402   10.6667     7.14804     2.42559
Total time run:         14.679955
Total writes made:      81
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     22.0709
Stddev Bandwidth:       11.2218
Max bandwidth (MB/sec): 36
Min bandwidth (MB/sec): 0
Average IOPS:           5
Stddev IOPS:            2
Max IOPS:               9
Min IOPS:               0
Average Latency(s):     2.70551
Stddev Latency(s):      1.73505
Max latency(s):         7.50364
Min latency(s):         0.0159856
Cleaning up (deleting benchmark objects)
Removed 81 objects
Clean up completed and total clean up time :0.012810



rados bench -p CePH-VM 10 seq
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        22         6   23.9954        24    0.783942    0.256662
    2      16        31        15   29.9952        36   0.0109593     0.42979
    3      16        34        18   23.9964        12     2.99816    0.807603
    4      16        40        24   23.9965        24   0.0104326    0.911911
    5      16        44        28   22.3968        16      4.7625     1.26101
    6      16        48        32   21.3303        16       5.946     1.60255
    7      16        51        35   19.9972        12     3.29575     1.84323
    8      16        54        38   18.9974        12     2.91921     1.95155
    9      16        58        42   18.6641        16      1.2321     2.02211
   10      16        62        46   18.3975        16     7.90569     2.20937
   11      16        63        47   17.0886         4     1.96798     2.20423
   12      16        63        47   15.6645         0           -     2.20423
   13      16        63        47   14.4596         0           -     2.20423
   14      15        63        48   13.7125   1.33333     7.97992     2.32456
   15      15        63        48   12.7983         0           -     2.32456
Total time run:       15.737216
Total reads made:     63
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   16.013
Average IOPS:         4
Stddev IOPS:          2
Max IOPS:             9
Min IOPS:             0
Average Latency(s):   3.95356
Max latency(s):       13.2759
Min latency(s):       0.0101389



rados bench -p CePH-VM 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        27        11   43.9911        44    0.010397     0.13392
    2      16        32        16   31.9946        20  0.00184208    0.394396
    3      16        35        19   25.3294        12     2.90997    0.756772
    4      16        37        21   20.9971         8     3.77022     1.02611
    5      16        42        26   20.7974        20     4.11215     1.31635
    6      16        45        29    19.331        12     5.98672     1.67422
    7      16        51        35   19.9976        24     3.47032     1.77268
    8      16        54        38   18.9977        12     7.26645     2.10099
    9      16        56        40   17.7756         8     7.43102     2.22619
   10      16        59        43    17.198        12     1.42732     2.33517
   11      16        60        44   15.9982         4     8.26329      2.4699
   12      16        60        44    14.665         0           -      2.4699
   13      16        60        44   13.5369         0           -      2.4699
   14      12        60        48   13.7127   5.33333     9.47551     2.85015
   15       7        60        53   14.1317        20     5.57713     3.33544
Total time run:       15.716590
Total reads made:     60
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   15.2705
Average IOPS:         3
Stddev IOPS:          2
Max IOPS:             11
Min IOPS:             0
Average Latency(s):   3.9689
Max latency(s):       11.9401
Min latency(s):       0.00146251

The performance is almost identical to an old ESXi box with 3 VMs that share one old HDD for Ceph, so there must be some weird issue.

Code:
auto lo
iface lo inet loopback

iface eno1 inet manual
#Member of bond0

iface eno2 inet manual
#Member of bond0

iface eno3 inet manual

iface eno4 inet manual

auto enp65s0
iface enp65s0 inet static
    address  172.16.1.1
    netmask  255.255.255.240
    mtu 9000
    up   ip route add 172.16.1.3/32 dev enp65s0
    down ip route del 172.16.1.3/32

auto enp65s0d1
iface enp65s0d1 inet static
    address  172.16.1.1
    netmask  255.255.255.240
    mtu 9000
    up   ip route add 172.16.1.2/32 dev enp65s0d1
    down ip route del 172.16.1.2/32

auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet static
    address 10.0.10.1
    netmask 255.255.255.0
    gateway 10.0.10.254
    bridge_ports bond0
    bridge_stp off
    bridge_fd 0

The cluster network is directly attached (no switch) with fiber (LWL) cables; the public network is bonded to a Fast Ethernet Cisco 3560 Layer 3 switch. That should be enough for the monitor traffic and some SSH sessions (those servers are for testing and practicing).

After editing the ceph.conf as mentioned in the post above, the performance got worse...
 
Hm... there seems to be some... see for yourself.
[diagram: Ceph public network vs. cluster network, from the Ceph network configuration reference]

http://docs.ceph.com/docs/mimic/rados/configuration/network-config-ref/

More to read on the architecture and hardware:
http://docs.ceph.com/docs/luminous/architecture/
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/
 
@Alvin
I know that diagram, but I actually don't know what you are trying to tell me?

@udo
Yes, I have enabled the performance mode

@all
Here is the output from a 3-node Ceph cluster without a separate cluster network, only 1G NICs, consumer-grade HDDs (1 OSD per node), and each server in a different geographical location. The OS is CentOS 7 with Ceph Mimic:
Code:
rados bench -p bkpub_data 10 rand --no-cleanup
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        42        26   103.983       104   0.0510472    0.227717
    2      16        58        42   83.9866        64   0.0840142    0.251901
    3      16        75        59   78.6561        68     0.10658    0.445594
    4      16       105        89   88.9894       120     1.79773    0.605022
    5      16       137       121   96.7892       128     1.09862    0.586289
    6      16       169       153   101.989       128  0.00219087    0.565845
    7      16       196       180   102.846       108     1.21281    0.550116
    8      16       224       208   103.989       112   0.0633576    0.563047
    9      16       250       234    103.99       104  0.00401208    0.571408
   10      16       269       253    101.19        76  0.00151874    0.585309
   11       7       270       263   95.6267        40     1.80983    0.624011
Total time run:       11.3504
Total reads made:     270
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   95.1505
Average IOPS:         23
Stddev IOPS:          7
Max IOPS:             32
Min IOPS:             10
Average Latency(s):   0.65186
Max latency(s):       3.13723
Min latency(s):       0.0013428
 
Now I'm curious... I switched on all the caches of my RAID controller and plugged in a 1G switch for the public network... look what I got:
Code:
rados bench -p VM 10 rand --no-cleanup
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        71        55    219.96       220   0.0107213    0.168748
    2      16       108        92   183.967       148    0.271629    0.270663
    3      16       149       133   177.303       164    0.529153    0.322423
    4      16       192       176   175.972       172    0.437054    0.336477
    5      16       230       214   171.174       152    0.286403    0.349977
    6      16       269       253   168.642       156    0.128445    0.350509
    7      16       315       299   170.833       184    0.231363    0.352929
    8      16       356       340   169.976       164  0.00288305    0.360454
    9      16       398       382   169.755       168     0.84199    0.359518
   10      16       438       422   168.778       160    0.397807    0.368608
Total time run:       10.574067
Total reads made:     438
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   165.688
Average IOPS:         41
Stddev IOPS:          5
Max IOPS:             55
Min IOPS:             37
Average Latency(s):   0.384536
Max latency(s):       1.10396
Min latency(s):       0.00165442

Question:
Can I use the 10G network as a combined public/cluster network just for Ceph and use my other NIC ports as normal access ports to reach the servers?
 
I replaced my switch and got much higher bandwidth:
Code:
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_server-1_12493
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        91        75   299.983       300    0.320847    0.182094
    2      16       178       162   323.966       348   0.0085761    0.182038
    3      16       253       237   315.966       300   0.0427078    0.186121
    4      16       314       298   297.968       244  0.00814626    0.202689
    5      16       387       371   296.768       292    0.106296    0.206307
    6      16       450       434   289.301       252    0.106864    0.213456
    7      16       519       503   287.396       276  0.00793366    0.216395
    8      16       589       573   286.468       280    0.140945    0.217727
    9      16       665       649   288.411       304    0.490874    0.218006
   10      16       741       725   289.966       304    0.454621    0.217219
Total time run:         10.353948
Total writes made:      742
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     286.654
Stddev Bandwidth:       29.4694
Max bandwidth (MB/sec): 348
Min bandwidth (MB/sec): 244
Average IOPS:           71
Stddev IOPS:            7
Max IOPS:               87
Min IOPS:               61
Average Latency(s):     0.220838
Stddev Latency(s):      0.191353
Max latency(s):         0.667782
Min latency(s):         0.00716044

However, I noticed multiple effects: all kernel tuning parameters like read_ahead etc. slow down Ceph massively, and the hashing policy of the LACP bond is picky.
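For reference, read_ahead can be checked and set back per block device like this (the device name is a placeholder; 128 KiB is the usual kernel default):
Code:
# show the current read-ahead of the OSD disk (placeholder device name)
cat /sys/block/sdX/queue/read_ahead_kb
# set it back to the common kernel default of 128 KiB
echo 128 > /sys/block/sdX/queue/read_ahead_kb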

I figured out via bmon that the heavy load is on my bond interface, where the public part of Ceph resides (3x 1 Gig LACP), and not on the 10G full-mesh network. I don't understand why. Would it be wise to run Ceph completely on the 10G network without a separate cluster network?
When I try to set it up, an error message pops up:
Code:
Multiple IPs for ceph public network '172.16.1.0/28' detected on server-1: 172.16.1.1 172.16.1.1 use 'mon-address' to specify one of them. (500)
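Since both mesh interfaces on server-1 carry 172.16.1.1, the tool sees the address twice. The hint in the message is to pass the monitor address explicitly, which (depending on the pveceph version, so treat this as a sketch) would be something like:
Code:
# tell pveceph explicitly which IP the monitor should bind to
# (subcommand name and option may differ between PVE releases)
pveceph createmon --mon-address 172.16.1.1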
 
After poking around with what look like some bugs (or at least a different workflow than plain Ceph) in Proxmox, like created OSDs that don't appear in the GUI, I moved the whole Ceph setup to my 10G network, which means the public part as well.

The various Ceph books I read and also the Proxmox wiki are kind of misleading on this point. To make a long story short: you need 10G at minimum on each side!
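For reference, the relevant [global] lines now simply point both networks at the 10G mesh subnet, roughly like this:
Code:
[global]
     cluster network = 172.16.1.0/28
     public network = 172.16.1.0/28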

Code:
rados bench -p VM 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_server-1_6571
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       245       229   915.936       916    0.115509   0.0669229
    2      16       364       348   695.919       476     0.10949   0.0894676
    3      16       480       464   618.588       464    0.141777    0.100721
    4      16       602       586   585.926       488    0.134688    0.107445
    5      16       721       705   563.929       476    0.133774    0.111867
    6      16       841       825   549.932       480    0.149869    0.114826
    7      16       962       946   540.506       484    0.151114    0.117325
    8      16      1081      1065   532.433       476    0.133783    0.119167
    9      16      1202      1186   527.041       484    0.120018    0.120681
   10      16      1321      1305   521.931       476    0.131636    0.121862
Total time run:         10.092970
Total writes made:      1322
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     523.929
Stddev Bandwidth:       138.593
Max bandwidth (MB/sec): 916
Min bandwidth (MB/sec): 464
Average IOPS:           130
Stddev IOPS:            34
Max IOPS:               229
Min IOPS:               116
Average Latency(s):     0.12212
Stddev Latency(s):      0.0351463
Max latency(s):         0.22965
Min latency(s):         0.0140939
Code:
rados bench -p VM 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      15       336       321   1283.64      1284   0.0277775   0.0463885
    2      15       697       682    1363.6      1444   0.0447637   0.0441119
    3      15      1085      1070   1426.26      1552   0.0449257   0.0425291
    4      15      1501      1486   1485.62      1664   0.0514913     0.04079
    5      15      1920      1905   1523.64      1676   0.0209739   0.0397816
    6      15      2352      2337   1557.65      1728   0.0688246   0.0388397
    7      15      2780      2765   1579.66      1712   0.0233193   0.0383174
    8      15      3197      3182   1590.66      1668   0.0400136   0.0380396
    9      15      3645      3630      1613      1792    0.068332   0.0375842
   10      16      4090      4074   1629.26      1776   0.0993875   0.0372176
Total time run:       10.042272
Total reads made:     4090
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1629.11
Average IOPS:         407
Stddev IOPS:          39
Max IOPS:             448
Min IOPS:             321
Average Latency(s):   0.0373253
Max latency(s):       0.139204
Min latency(s):       0.00208962
 
