Proxmox Ceph Cluster Performance

Yes, I meant the virtio-scsi driver.
After clearing the cache, the read drops from 1309 MB/sec to 1042 MB/sec.
Hi,
1 GB/s already looks too high to me for your setup... perhaps controller cache.

For comparison, you can clear the cache again, do the benchmarking inside the VM (so that the controller cache is filled with VM data), and then run the read bench (without a previous write bench) again.
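For reference, dropping the Linux page cache on a node is typically done like this (standard kernel interface; run as root):

sync
echo 3 > /proc/sys/vm/drop_caches   # drops page cache, dentries and inodes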

Udo
 
Udo,

Which one is limiting our write speed? The journal or the number of OSDs?
Hi Dave,
Ceph is complex. The write speed is (or can be) limited by multiple factors, mainly by the speed (and number) of the journal SSDs.
Not all SSDs are fast at journaling, because Ceph does a sync after every write.
See here: https://www.sebastien-han.fr/blog/2...-if-your-ssd-is-suitable-as-a-journal-device/
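That article's test boils down to a small synchronous fio write job, roughly like this (the device name is an example; writing to it destroys data):

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test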

The number of replicas also changes the write speed.
Other factors: network speed (latency!), enough RAM on the OSD nodes (if you see reads from the journal device, you don't have enough RAM!), the SAS/SATA controller, and CPU power.

Udo
 
I am quite curious: have you upgraded to Bluestore? And if so, what are your speeds like now?
 
Hi everybody,

I also have very poor performance with CEPH on my cluster :(

My cluster config:

Proxmox version:
5.1-35

Node-1: HP ProLiant ML350 G6
  • 32 GB RAM
  • 16 vCPUs (2 threads/core; 4 cores/socket; 2 sockets)
  • Disk used for Ceph: 2 × 1 TB SAS 15K HDD (RAID 1 on an HP RAID card)
    • Ceph type Bluestore, with the Ceph journal on the same disk
  • Ceph network port: 1 Gbit
Node-2: Lenovo D20
  • 32 GB RAM
  • 16 × Intel(R) Xeon(R) CPU E5620 @ 2.40GHz (2 sockets)
  • Disk used for Ceph: 1 × WDC WD20EFRX-68A Red, 2 TB SATA 6 Gb/s, 64 MB cache, 5400 RPM
    • Ceph type Bluestore, with the Ceph journal on an SSD (220 GB)
  • Ceph network port: 1 Gbit
Node-3: Lenovo D30
  • 32 GB RAM
  • 24 × Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (2 sockets)
  • Disk used for Ceph: 1 × WD20EARS Green, 2 TB SATA II, 64 MB cache, 5400 RPM
    • Ceph type Bluestore, with the Ceph journal on the same disk
  • Ceph network port: 1 Gbit

Node-4: HP ProLiant DL380 Gen9
  • 32 GB RAM
  • 16 vCPUs (2 threads/core; 8 cores/socket; 1 socket)
  • Disk used for Ceph: 2 × 500 GB SAS 15K HDD (RAID 1 on an HP RAID card)
    • Ceph type Bluestore, with the Ceph journal on a 20 GB partition on SSD (2 × 220 GB SSDs in RAID 1 on the HP controller)
  • Ceph network port: 1 Gbit

An HP 1 Gbit switch for the Ceph network

1 pool holding all the OSDs:

  • Size/min: 3/1 (because I want a single machine in the cluster to be able to run all VMs in case of a problem with the cluster)
  • pg_num: 64
4 OSDs (1 per node)
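For reference, these settings can be verified from any node with the standard Ceph CLI (pool name as above):

ceph osd pool get my-pool size
ceph osd pool get my-pool min_size
ceph osd pool get my-pool pg_num
ceph osd tree   # shows the 4 OSDs and their hosts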


My ceph.conf

[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.12.16.0/24
fsid = e41d45fc-30e3-43bf-a028-472a03f47i1f
keyring = /etc/pve/priv/$cluster.$name.keyring
mon allow pool delete = true
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 3
public network = 10.12.16.0/24

# Disable in-memory logs
debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0

[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.node-1]
host = node-1
mon addr = 10.12.16.1:6789

[mon.node-2]
host = node-2
mon addr = 10.12.16.2:6789

[mon.node-3]
host = node-3
mon addr = 10.12.16.3:6789

[mon.node-4]
host = node-4
mon addr = 10.12.16.4:6789


# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host node-1 {
id -3 # do not change unnecessarily
id -4 class hdd # do not change unnecessarily
# weight 0.909
alg straw2
hash 0 # rjenkins1
item osd.0 weight 0.909
}
host node-2 {
id -5 # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
# weight 1.940
alg straw2
hash 0 # rjenkins1
item osd.1 weight 1.940
}
host node-3 {
id -7 # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
# weight 1.819
alg straw2
hash 0 # rjenkins1
item osd.2 weight 1.819
}
host node-4 {
id -9 # do not change unnecessarily
id -10 class hdd # do not change unnecessarily
# weight 0.565
alg straw2
hash 0 # rjenkins1
item osd.3 weight 0.565
}
root default {
id -1 # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
# weight 5.234
alg straw2
hash 0 # rjenkins1
item node-1 weight 0.909
item node-2 weight 1.940
item node-3 weight 1.819
item node-4 weight 0.565
}
# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map





 
Benchmarks from the nodes
From node-1

root@node1:~# rados -p my-pool bench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_atlas_24983
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 20 4 15.9961 16 0.770399 0.565837
2 16 34 18 35.9909 56 0.613186 1.18047
3 16 44 28 37.3242 40 1.53114 1.26343
4 16 59 43 42.99 60 1.84357 1.24766
5 16 71 55 43.9898 48 1.7848 1.29295
6 16 78 62 41.324 28 1.34471 1.31874
7 16 90 74 42.2761 48 1.88063 1.35202
8 16 103 87 43.4907 52 1.22835 1.34052
9 16 109 93 41.3246 24 1.81452 1.37555
10 16 120 104 41.5912 44 2.06756 1.42633
Total time run: 10.800131
Total writes made: 121
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 44.8143
Stddev Bandwidth: 14.5082
Max bandwidth (MB/sec): 60
Min bandwidth (MB/sec): 16
Average IOPS: 11
Stddev IOPS: 3
Max IOPS: 15
Min IOPS: 4
Average Latency(s): 1.41582
Stddev Latency(s): 0.476247
Max latency(s): 2.30416
Min latency(s): 0.283322
root@node1:~#

root@node1:~# rados -p my-pool bench 60 seq --no-cleanup
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 45 29 115.967 116 0.997326 0.378364
2 16 79 63 125.968 136 0.956501 0.405377
3 16 109 93 123.971 120 0.787081 0.45363
4 14 121 107 106.977 56 1.05066 0.459556
Total time run: 4.470531
Total reads made: 121
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 108.265
Average IOPS: 27
Stddev IOPS: 8
Max IOPS: 34
Min IOPS: 14
Average Latency(s): 0.552363
Max latency(s): 1.72134
Min latency(s): 0.0343347
root@node1:~#

root@node-1:~# fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k --numjobs=6 --iodepth=2 --runtime=60 --time_based --group_reporting --name=journal-test
journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=2
...
fio-2.16
Starting 6 processes
Jobs: 6 (f=6): [W(6)] [100.0% done] [0KB/1708KB/0KB /s] [0/427/0 iops] [eta 00m:00s]
journal-test: (groupid=0, jobs=6): err= 0: pid=18260: Mon Jan 15 15:56:07 2018
write: io=90952KB, bw=1514.6KB/s, iops=378, runt= 60053msec
clat (msec): min=1, max=493, avg=15.83, stdev=22.63
lat (msec): min=1, max=493, avg=15.83, stdev=22.63
clat percentiles (msec):
| 1.00th=[ 9], 5.00th=[ 9], 10.00th=[ 9], 20.00th=[ 9],
| 30.00th=[ 9], 40.00th=[ 9], 50.00th=[ 9], 60.00th=[ 13],
| 70.00th=[ 17], 80.00th=[ 17], 90.00th=[ 25], 95.00th=[ 33],
| 99.00th=[ 117], 99.50th=[ 186], 99.90th=[ 293], 99.95th=[ 396],
| 99.99th=[ 482]
lat (msec) : 2=0.01%, 4=0.01%, 10=56.29%, 20=27.07%, 50=14.33%
lat (msec) : 100=1.12%, 250=1.03%, 500=0.15%
cpu : usr=0.04%, sys=0.16%, ctx=22763, majf=0, minf=59
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=22738/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
WRITE: io=90952KB, aggrb=1514KB/s, minb=1514KB/s, maxb=1514KB/s, mint=60053msec, maxt=60053msec

Disk stats (read/write):
sdb: ios=1024/24305, merge=0/351, ticks=5264/600420, in_queue=605720, util=100.00%


Nagios check for CEPH interface:
Traffic In : 3.36 Mb/s (0.3 %), Out : 4.48 Mb/s (0.4 %) - Link Speed : 1000000000

From node-2
root@node-2:~# rados -p my-pool bench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_cadcluster-2_23011
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 18 2 7.99957 8 0.980511 0.837696
2 16 30 14 27.9964 48 0.713963 1.32265
3 16 33 17 22.6635 12 2.72949 1.47817
4 16 48 32 31.9953 60 0.85023 1.7088
5 16 53 37 29.5955 20 1.7041 1.68138
6 16 68 52 34.6613 60 1.62137 1.66006
7 16 77 61 34.8517 36 1.66339 1.60751
8 16 85 69 34.4946 32 1.21054 1.62791
9 16 92 76 33.7724 28 3.02591 1.64775
10 16 109 93 37.1941 68 1.56255 1.67116
11 16 110 94 34.1765 4 0.673507 1.66054
12 14 110 96 31.9951 8 2.21228 1.67381
Total time run: 12.142081
Total writes made: 110
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 36.2376
Stddev Bandwidth: 22.6274
Max bandwidth (MB/sec): 68
Min bandwidth (MB/sec): 4
Average IOPS: 9
Stddev IOPS: 5
Max IOPS: 17
Min IOPS: 1
Average Latency(s): 1.76148
Stddev Latency(s): 0.60523
Max latency(s): 3.13594
Min latency(s): 0.38257

root@node-2:~# rados -p my-pool bench 60 seq --no-cleanup
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 57 41 163.959 164 0.192498 0.302417
2 16 101 85 169.954 176 0.289278 0.329004
Total time run: 2.792823
Total reads made: 110
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 157.547
Average IOPS: 39
Stddev IOPS: 2
Max IOPS: 44
Min IOPS: 41
Average Latency(s): 0.399894
Max latency(s): 1.57963
Min latency(s): 0.0412193

root@node-2:~# fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k --numjobs=6 --iodepth=2 --runtime=60 --time_based --group_reporting --name=journal-test
journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=2
...
fio-2.16
Starting 6 processes
Jobs: 6 (f=6): [W(6)] [100.0% done] [0KB/1464KB/0KB /s] [0/366/0 iops] [eta 00m:00s]
journal-test: (groupid=0, jobs=6): err= 0: pid=31340: Mon Jan 15 16:05:11 2018
write: io=60184KB, bw=1003.6KB/s, iops=250, runt= 60001msec
clat (msec): min=4, max=568, avg=23.92, stdev=33.54
lat (msec): min=4, max=568, avg=23.92, stdev=33.54
clat percentiles (msec):
| 1.00th=[ 11], 5.00th=[ 11], 10.00th=[ 12], 20.00th=[ 12],
| 30.00th=[ 12], 40.00th=[ 20], 50.00th=[ 23], 60.00th=[ 23],
| 70.00th=[ 23], 80.00th=[ 23], 90.00th=[ 33], 95.00th=[ 49],
| 99.00th=[ 188], 99.50th=[ 302], 99.90th=[ 400], 99.95th=[ 441],
| 99.99th=[ 570]
lat (msec) : 10=0.99%, 20=39.04%, 50=55.05%, 100=3.25%, 250=0.88%
lat (msec) : 500=0.75%, 750=0.04%
cpu : usr=0.03%, sys=0.27%, ctx=30138, majf=0, minf=65
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=15046/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
WRITE: io=60184KB, aggrb=1003KB/s, minb=1003KB/s, maxb=1003KB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
sdb: ios=318/31666, merge=0/209, ticks=6260/486752, in_queue=493228, util=99.87%


Nagios check for CEPH interface: Traffic In : 5.01 Mb/s (0.5 %), Out : 6.38 Mb/s (0.6 %) - Link Speed : 1000000000


From node-3


root@node-3:~# rados -p my-pool bench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_cadcluster-1_25994
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 23 7 27.9981 28 0.8353 0.701256
2 16 34 18 35.996 44 0.622092 0.987845
3 16 46 30 39.9958 48 0.789123 1.10658
4 16 63 47 46.9952 68 1.29141 1.20888
5 16 70 54 43.1953 28 1.32297 1.23323
6 16 84 68 45.3281 56 1.94003 1.26118
7 16 93 77 43.9946 36 0.414177 1.27202
8 16 99 83 41.4949 24 1.96712 1.32592
9 16 110 94 41.7725 44 2.0745 1.38354
10 16 119 103 41.1947 36 1.14513 1.38374
Total time run: 10.972736
Total writes made: 120
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 43.7448
Stddev Bandwidth: 13.734
Max bandwidth (MB/sec): 68
Min bandwidth (MB/sec): 24
Average IOPS: 10
Stddev IOPS: 3
Max IOPS: 17
Min IOPS: 6
Average Latency(s): 1.46007
Stddev Latency(s): 0.608293
Max latency(s): 2.70165
Min latency(s): 0.287395


root@node-3:~# rados -p my-pool bench 60 seq --no-cleanup
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 49 33 131.972 132 0.166881 0.288893
2 16 74 58 115.976 100 0.808538 0.426684
3 16 107 91 121.311 132 0.100169 0.444951
4 15 120 105 104.982 56 0.687433 0.460856
Total time run: 4.196130
Total reads made: 120
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 114.391
Average IOPS: 28
Stddev IOPS: 9
Max IOPS: 33
Min IOPS: 14
Average Latency(s): 0.545573
Max latency(s): 1.64453
Min latency(s): 0.0383955

root@node-3:~# fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k --numjobs=6 --iodepth=2 --runtime=60 --time_based --group_reporting --name=journal-test
journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=2
...
fio-2.16
Starting 6 processes
Jobs: 6 (f=6): [W(6)] [100.0% done] [0KB/1512KB/0KB /s] [0/378/0 iops] [eta 00m:00s]
journal-test: (groupid=0, jobs=6): err= 0: pid=2376: Mon Jan 15 16:09:29 2018
write: io=28728KB, bw=490225B/s, iops=119, runt= 60008msec
clat (msec): min=8, max=610, avg=50.13, stdev=60.64
lat (msec): min=8, max=610, avg=50.13, stdev=60.64
clat percentiles (msec):
| 1.00th=[ 12], 5.00th=[ 12], 10.00th=[ 13], 20.00th=[ 24],
| 30.00th=[ 24], 40.00th=[ 25], 50.00th=[ 25], 60.00th=[ 43],
| 70.00th=[ 49], 80.00th=[ 61], 90.00th=[ 85], 95.00th=[ 157],
| 99.00th=[ 334], 99.50th=[ 420], 99.90th=[ 537], 99.95th=[ 611],
| 99.99th=[ 611]
lat (msec) : 10=0.08%, 20=12.03%, 50=59.26%, 100=20.75%, 250=5.22%
lat (msec) : 500=2.45%, 750=0.21%
cpu : usr=0.01%, sys=0.09%, ctx=14402, majf=0, minf=61
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=7182/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
WRITE: io=28728KB, aggrb=478KB/s, minb=478KB/s, maxb=478KB/s, mint=60008msec, maxt=60008msec

Disk stats (read/write):
sdb: ios=302/16859, merge=0/261, ticks=10192/593900, in_queue=604192, util=99.92%


Nagios check for CEPH interface:
Traffic In : 5.12 Mb/s (0.5 %), Out : 4.17 Mb/s (0.4 %) - Link Speed : 1000000000



From node-4

root@node-4:~# rados -p my-pool bench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_zeus_9250
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 17 1 3.99954 4 0.547193 0.547193
2 16 24 8 15.9976 28 1.57422 1.39662
3 16 33 17 22.6634 36 0.256374 1.71357
4 16 45 29 28.996 48 0.456743 1.71227
5 16 57 41 32.7955 48 1.33359 1.63479
6 16 69 53 35.3286 48 1.49434 1.56238
7 16 80 64 36.5666 44 0.581883 1.52677
8 16 93 77 38.495 52 0.569441 1.52224
9 16 103 87 38.6617 40 0.295466 1.45903
10 16 113 97 38.795 40 1.82926 1.50819
11 16 114 98 35.6317 4 1.40668 1.50715
Total time run: 11.251195
Total writes made: 114
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 40.529
Stddev Bandwidth: 17.0134
Max bandwidth (MB/sec): 52
Min bandwidth (MB/sec): 4
Average IOPS: 10
Stddev IOPS: 4
Max IOPS: 13
Min IOPS: 1
Average Latency(s): 1.57879
Stddev Latency(s): 0.670161
Max latency(s): 3.88376
Min latency(s): 0.251962


root@node-4:~# rados -p my-pool bench 60 seq --no-cleanup
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 40 24 95.9845 96 0.966508 0.361017
2 16 70 54 107.983 120 0.217577 0.45263
3 16 103 87 115.984 132 1.15813 0.491828
4 16 114 98 97.9868 44 0.201338 0.49361
Total time run: 4.737685
Total reads made: 114
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 96.2495
Average IOPS: 24
Stddev IOPS: 9
Max IOPS: 33
Min IOPS: 11
Average Latency(s): 0.64647
Max latency(s): 2.04525
Min latency(s): 0.00323975
root@node-4:~#


root@node-4:~# fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k --numjobs=6 --iodepth=2 --runtime=60 --time_based --group_reporting --name=journal-test
journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=2
...
fio-2.16
Starting 6 processes
Jobs: 6 (f=6): [W(6)] [100.0% done] [0KB/589.8MB/0KB /s] [0/151K/0 iops] [eta 00m:00s]
journal-test: (groupid=0, jobs=6): err= 0: pid=30522: Mon Jan 15 16:21:25 2018
write: io=35203MB, bw=600786KB/s, iops=150196, runt= 60001msec
clat (usec): min=24, max=2722, avg=39.17, stdev= 9.17
lat (usec): min=24, max=2722, avg=39.28, stdev= 9.17
clat percentiles (usec):
| 1.00th=[ 30], 5.00th=[ 32], 10.00th=[ 34], 20.00th=[ 35],
| 30.00th=[ 36], 40.00th=[ 38], 50.00th=[ 39], 60.00th=[ 40],
| 70.00th=[ 41], 80.00th=[ 42], 90.00th=[ 44], 95.00th=[ 47],
| 99.00th=[ 58], 99.50th=[ 68], 99.90th=[ 76], 99.95th=[ 91],
| 99.99th=[ 185]
lat (usec) : 50=96.91%, 100=3.05%, 250=0.04%, 500=0.01%, 750=0.01%
lat (usec) : 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%
cpu : usr=7.21%, sys=17.14%, ctx=9014570, majf=0, minf=71
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=9011947/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
WRITE: io=35203MB, aggrb=600786KB/s, minb=600786KB/s, maxb=600786KB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
sdb: ios=332/8977011, merge=0/18282, ticks=392/283780, in_queue=283984, util=100.00%




Benchmarks from VM

From a VM on node-1 (CentOS Linux)

[root@vmcentos ~]# dd if=/dev/zero of=/tmp/test.data bs=1M count=1000 oflag=direct
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 225.527 s, 4.6 MB/s

[root@vmcentos ~]# dd if=/dev/zero of=/tmp/test.data bs=1M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 107.071 s, 10.0 MB/s
From another VM on node-1 (Debian Linux)

moi@vmdebian-1:~$ dd if=/dev/zero of=/tmp/test.data bs=1M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 26.5522 s, 40.4 MB/s



Questions
  1. What problems in my configuration could cause this poor performance?
  2. How can I fix this?


Thanks!


 
...
  1. What problems in my configuration could cause this poor performance?
  2. How can I fix this?

Your OSD hard disks are slow, and your 1 Gbit network is too slow.

Use at least a 10 Gbit network, and for a small Ceph cluster either use SSDs for your OSDs with Bluestore, or use filestore on normal HDDs with the journal on a fast SSD.
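For the filestore-with-journal-on-SSD variant, creating an OSD on PVE 5.x looks roughly like this (a hedged sketch; device names are examples, and zapping destroys all data on the disk):

ceph-disk zap /dev/sdX                               # wipe the data disk first
pveceph createosd /dev/sdX -journal_dev /dev/sdY     # /dev/sdY being the fast SSD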
 
In addition to @tom's post, don't use RAID, not even RAID 0; those controllers lie about the disks and use their own algorithms for reading/writing data to them. Not to mention that their cache is often the culprit and starves the OSDs (e.g. blocked IO).
 
Hi everybody,

Thank you very much @tom and @Alwin for your replies.
I'll reconfigure my servers and keep you informed.
Still a few more questions, please:
  1. If, for some reason, I want to temporarily run all my virtual machines on a single server, is it enough to have set "Size/min" to 3/1 in the configuration of my Ceph pool (for a 3-node cluster)?
  2. Can I keep 64 PGs for a config of 3 nodes - 3 mons - 3 OSDs (1 TB capacity per OSD)?
  3. Can I use 10GBase-T network cards (example: https://www.amazon.fr/Intel-X540-T2-Réseau-Adaptateurs-Ethernet/dp/B0077CS9UM) with a 10BaseT switch for the Ceph network? Cards with RJ45 ports are cheaper than cards with fiber ports!
Thank you very much!
 
If, for some reason, I want to temporarily run all my virtual machines on a single server, is it enough to have set "Size/min" to 3/1 in the configuration of my Ceph pool (for a 3-node cluster)?
Possibly, but don't risk it: if one drive fails, you lose data. As you have three servers, you could do them one by one, moving the VMs away from the node you are working on. AND always have a backup, just in case.

Can I keep 64 PGs for a config of 3 nodes - 3 mons - 3 OSDs (1 TB capacity per OSD)?
The target is always about 100 PGs per OSD; depending on your configuration you will need to adjust your PG count. http://ceph.com/pgcalc/
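As a rough worked example of that rule of thumb (assuming the ~100 PGs/OSD target and the 3-OSD layout from the question):

# (3 OSDs × 100 PGs/OSD) / 3 replicas = 100
# rounded up to the next power of two → 128 PGs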

Can I use 10GBase-T network cards (example: https://www.amazon.fr/Intel-X540-T2-Réseau-Adaptateurs-Ethernet/dp/B0077CS9UM) with a 10BaseT switch for the Ceph network? Cards with RJ45 ports are cheaper than cards with fiber ports!
The cards should work, but you need to test, of course. If you stay with three nodes, you could also go for a full mesh setup and skip the switch. And I hope you don't use a 10BaseT switch. ;)
https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server
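For illustration, the routed variant described on that page boils down to something like the following in /etc/network/interfaces on each node (a sketch only; interface names and addresses are examples, each node being cabled directly to the other two):

auto ens19
iface ens19 inet static
        address 10.15.15.1
        netmask 255.255.255.0
        up ip route add 10.15.15.2/32 dev ens19
        down ip route del 10.15.15.2/32

auto ens20
iface ens20 inet static
        address 10.15.15.1
        netmask 255.255.255.0
        up ip route add 10.15.15.3/32 dev ens20
        down ip route del 10.15.15.3/32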
 
Re,
Thanks @Alwin for your reply.
  • I used http://ceph.com/pgcalc/ and it suggests 128 PGs (you can see the value for the pool name "tkpool"). Does this seem right to you? Because right now, with 64 PGs (the default value), the cluster doesn't indicate that any PGs are missing. Can we increase the number of PGs in an existing pool?
  • Thanks for advising me on the network cards; I will actually try to configure them as a full mesh.
Thank you very much!
 
Re,

I'm trying to test HDD configurations on my HP ProLiant DL380 Gen9. This server has 8 bays and uses a Smart Array P440ar controller. The problem is that I can't bypass this controller when plugging in SAS HDDs or SSDs. When I run the Proxmox server without creating an array via the Smart Array P440ar, the attached SAS or SSD disks are not visible... an array (RAID 1, RAID 0, RAID 5, etc.) using those disks must be created first!
It looks like (or I didn't find a way) the Smart Array controller in the HP ProLiant DL380 Gen9 can't be bypassed.

A common "solution" to this problem i found is to create single-disk RAID-0 volumes at the controller level for each SAS or SSD disk use for my CEPH storage. Also into Smart array P440ar controller settings, i activated "Physical Drive Write Cache State" option.
I know @Alwin said "...don't use RAID, not even RAID 0"... but in my case, I don't know what the best approach is!

What is the risk if I configure my 3 nodes with this method?
Do you have any other ideas?

Thanks
 
I used http://ceph.com/pgcalc/ and it suggests 128 PGs (you can see the value for the pool name "tkpool"). Does this seem right to you? Because right now, with 64 PGs (the default value), the cluster doesn't indicate that any PGs are missing. Can we increase the number of PGs in an existing pool?
You can increase PGs on an existing cluster, but not decrease them. The lower limit is 30 PGs per OSD.
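For reference, on the CLI this is done with the standard Ceph pool commands (the pool name is an example); raise pg_num first, then pgp_num:

ceph osd pool set my-pool pg_num 128
ceph osd pool set my-pool pgp_num 128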

Smart Array P440ar controller
Set the controller to IT mode and delete any RAID config from the disks; otherwise they may not be visible.
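If the controller firmware supports it, this can reportedly be done from the CLI as well; a hedged sketch with ssacli (the slot number is an example, and all logical drives must be deleted first):

ssacli ctrl slot=0 modify hbamode=on   # reboot afterwards; disks then appear as plain /dev/sdX devices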
 
You can increase PGs on an existing cluster, but not decrease them. The lower limit is 30 PGs per OSD.

Thanks @Alwin, but... how can I increase PGs on an existing pool?
I haven't figured out how to do it from the GUI, and I don't know the command line to use :(

Can you give me the procedure to follow?

Thank you
 
Thanks @Andrew Hart for your link!
I increased the value for my PGs ;)

Set the controller to IT mode and delete any RAID config from the disks; otherwise they may not be visible.

Thanks @Alwin !

But... for my HP Smart Array P440ar controller, I didn't find how to disable RAID mode and use the disks directly in Proxmox for ZFS storage. When trying to set the controller to IT mode, there is just an option to put one disk into "Array RAID 0" mode, but I don't know what that actually means from the disk-access point of view. Also, I was able to enable an option that allows direct access to the disk cache (the "Physical Drive Write Cache State" option).

Also, even if by some miracle I managed to disable the RAID mode of the HP Smart Array P440ar controller and could then see the disks when booting the Proxmox DVD, I couldn't configure a Linux software RAID (mdraid) for "/" from the setup, because Proxmox doesn't allow it.

So I ended up using RAID 1 from the HP Smart Array P440ar controller for "/" on the Proxmox cluster, and for all the Ceph storage disks I use the "Array RAID 0 on one disk" mode... I don't think I have any other choice with this damn HP Smart Array P440ar controller!

How do you feel about that?

Thanks
 
Hi everybody !
Thanks @Alwin for your reply.

The controller supports HBA mode (also called IT mode).

I didn't realize that this HBA mode was the "IT mode"! Thank you so much for explaining it to me.
So if I understood the procedure correctly, I would need to:
  • Remove the server from my cluster
  • Remove all RAID arrays from the HP Smart Array P440ar
  • Switch the HP Smart Array P440ar controller to HBA mode
  • Redo the whole Proxmox installation on the server from the installation CD
  • Re-integrate the server into my cluster
  • Then redo my Ceph storage
However... I'm going to have to install Proxmox on a single disk and not in RAID, and that... that sucks!

Why is it not possible, in the Proxmox setup, to create a Linux software RAID (mdraid) and put "/" on it?

Thanks!
 
