Performance issue with Ceph under Proxmox 6

G0ldmember · Oct 2, 2019

Hi community,

we have a server cluster consisting of 3 nodes, each with an EPYC 7402P 24-core CPU, 6 Intel Enterprise SSDs (4620) and 256 GB RAM. We also have a 10 Gbit/s NIC for Ceph.

SSD performance alone is fine, jumbo frames are enabled, and iperf gives reasonable results on the 10 Gbit/s link.

We have read the Proxmox Benchmark document (https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark) and compared the results to our cluster.

Our SSD performance is better, and our CPUs are more powerful, than in the benchmark.

However, when we set up the Ceph cluster and run the benchmark test, we only get around 180 MB/s on average instead of the 800-1000 MB/s mentioned in the benchmark PDF, and we don't know why. I think the hardware is pretty powerful, so I guess it's a configuration problem.

Our Ceph config:

Code:

[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 192.168.10.0/24
     fsid = d05ccd07-d328-47a1-b39b-fa3c440aa859
     mon_allow_pool_delete = true
     mon_host = 91.199.162.40 91.199.162.41 91.199.162.42
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 91.199.162.0/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

Crush map:

Code:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host pve1 {
    id -3        # do not change unnecessarily
    id -4 class ssd        # do not change unnecessarily
    # weight 8.730
    alg straw2
    hash 0    # rjenkins1
    item osd.1 weight 1.746
    item osd.4 weight 1.746
    item osd.2 weight 1.746
    item osd.3 weight 1.746
    item osd.0 weight 1.746
}
host pve2 {
    id -5        # do not change unnecessarily
    id -6 class ssd        # do not change unnecessarily
    # weight 8.730
    alg straw2
    hash 0    # rjenkins1
    item osd.9 weight 1.746
    item osd.8 weight 1.746
    item osd.5 weight 1.746
    item osd.7 weight 1.746
    item osd.6 weight 1.746
}
host pve3 {
    id -7        # do not change unnecessarily
    id -8 class ssd        # do not change unnecessarily
    # weight 8.730
    alg straw2
    hash 0    # rjenkins1
    item osd.10 weight 1.746
    item osd.12 weight 1.746
    item osd.13 weight 1.746
    item osd.11 weight 1.746
    item osd.14 weight 1.746
}
root default {
    id -1        # do not change unnecessarily
    id -2 class ssd        # do not change unnecessarily
    # weight 26.191
    alg straw2
    hash 0    # rjenkins1
    item pve1 weight 8.730
    item pve2 weight 8.730
    item pve3 weight 8.730
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map

These are the default settings.

We are also using BlueStore as the OSD type, which was the default as well.

Any idea why the Ceph performance is so bad?

Code:

root@pve1:~# echo 3 | tee /proc/sys/vm/drop_caches && sync && rados -p bench bench 60 write --no-cleanup &&rados -p bench bench 60 seq && rados -p bench bench 60 rand && rados -p bench cleanup
3
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60 seconds or 0 objects
Object prefix: benchmark_data_pve1_644283
[...]
Total time run:       60.5082
Total reads made:     2993
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   197.857
Average IOPS:         49
Stddev IOPS:          6.97015
Max IOPS:             67
Min IOPS:             38
Average Latency(s):   0.322536
Max latency(s):       3.38795
Min latency(s):       0.00226084

We have also disabled swap with swapoff -a on all nodes.

Any help would be highly appreciated.

Alwin · Oct 3, 2019

wikon-it said:
Our SSD performance is better, and our CPUs are more powerful, than in the benchmark.

Could you please share the results of the benchmark?

wikon-it said:
cluster_network = 192.168.10.0/24
public_network = 91.199.162.0/24

The major traffic is going over the public network, only replication of the OSDs is put on the cluster network. See the first picture [0] in Ceph's network configuration reference. Also note that if you run Ceph on a public (internet) interface, you may think about network encryption as Ceph only encrypts the authentication, not the rest.

Have you seen the forum thread [1] on the Benchmark paper? There are also results of other users for comparison.

[0] https://docs.ceph.com/docs/mimic/rados/configuration/network-config-ref/
[1] https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/
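To make the public/cluster split concrete, here is a minimal sketch that checks which of the two configured networks the monitor addresses fall into. The networks and addresses are copied from the ceph.conf posted above; the helper name network_role is made up for this illustration. Since Ceph clients (including rados bench) talk to monitors and OSDs over the public network, this shows which link the benchmark traffic would be expected to use.

```python
import ipaddress

# Values copied from the ceph.conf posted earlier in the thread.
public_network = ipaddress.ip_network("91.199.162.0/24")
cluster_network = ipaddress.ip_network("192.168.10.0/24")
mon_hosts = ["91.199.162.40", "91.199.162.41", "91.199.162.42"]


def network_role(address: str) -> str:
    """Return which configured Ceph network an address belongs to.

    Hypothetical helper for illustration; not part of any Ceph tooling.
    """
    ip = ipaddress.ip_address(address)
    if ip in public_network:
        return "public"
    if ip in cluster_network:
        return "cluster"
    return "unknown"


for host in mon_hosts:
    print(host, "->", network_role(host))
```

All three monitor addresses fall into the public network, so client I/O depends on the speed of that link rather than on the 192.168.10.0/24 network.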

G0ldmember · Oct 7, 2019

Hey Alwin,

thanks for your feedback.

Alwin said:
Could you please share the results of the benchmark?

Code:

fio --ioengine=libaio --filename=/dev/sdf --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=fio --output-format=terse,json,normal --output=fio.log --bandwidth-log

[...]
{
[...]
},
"jobs" : [
[...]

"write" : {
"io_bytes" : 4951105536,
"io_kbytes" : 4835064,
"bw_bytes" : 82517050,
"bw" : 80583,
"iops" : 20145.764237,
"runtime" : 60001,
"total_ios" : 1208766,
"short_ios" : 0,
"drop_ios" : 0,
"slat_ns" : {
"min" : 2710,
"max" : 90334,
"mean" : 3244.294309,
"stddev" : 451.334024
},
"clat_ns" : {
"min" : 320,
"max" : 1366719,
"mean" : 45972.577949,
"stddev" : 8124.813535,
"percentile" : {
"1.000000" : 43264,
"5.000000" : 43264,
"10.000000" : 43264,
"20.000000" : 43776,
"30.000000" : 43776,
"40.000000" : 43776,
"50.000000" : 43776,
"60.000000" : 44288,
"70.000000" : 44288,
"80.000000" : 44800,
"90.000000" : 49408,
"95.000000" : 55552,
"99.000000" : 81408,
"99.500000" : 102912,
"99.900000" : 127488,
"99.950000" : 136192,
"99.990000" : 158720
}
},
"lat_ns" : {
"min" : 45917,
"max" : 1369989,
"mean" : 49291.875710,
"stddev" : 8141.767345
},
"bw_min" : 80000,
"bw_max" : 80840,
"bw_agg" : 100.000000,
"bw_mean" : 80584.394958,
"bw_dev" : 142.971801,
"bw_samples" : 119,
"iops_min" : 20000,
"iops_max" : 20210,
"iops_mean" : 20146.092437,
"iops_stddev" : 35.778507,
"iops_samples" : 119
},
"trim" : {
"io_bytes" : 0,
"io_kbytes" : 0,
"bw_bytes" : 0,
"bw" : 0,
"iops" : 0.000000,
"runtime" : 0,
"total_ios" : 0,
"short_ios" : 0,
"drop_ios" : 0,
"slat_ns" : {
"min" : 0,
"max" : 0,
"mean" : 0.000000,
"stddev" : 0.000000
},
"clat_ns" : {
"min" : 0,
"max" : 0,
"mean" : 0.000000,
"stddev" : 0.000000,
"percentile" : {
"1.000000" : 0,
"5.000000" : 0,
"10.000000" : 0,
"20.000000" : 0,
"30.000000" : 0,
"40.000000" : 0,
"50.000000" : 0,
"60.000000" : 0,
"70.000000" : 0,
"80.000000" : 0,
"90.000000" : 0,
"95.000000" : 0,
"99.000000" : 0,
"99.500000" : 0,
"99.900000" : 0,
"99.950000" : 0,
"99.990000" : 0
}
},
"lat_ns" : {
"min" : 0,
"max" : 0,
"mean" : 0.000000,
"stddev" : 0.000000
},
"bw_min" : 0,
"bw_max" : 0,
"bw_agg" : 0.000000,
"bw_mean" : 0.000000,
"bw_dev" : 0.000000,
"bw_samples" : 0,
"iops_min" : 0,
"iops_max" : 0,
"iops_mean" : 0.000000,
"iops_stddev" : 0.000000,
"iops_samples" : 0
},
"sync" : {
"lat_ns" : {
"min" : 0,
"max" : 0,
"mean" : 0.000000,
"stddev" : 0.000000,
"percentile" : {
"1.000000" : 0,
"5.000000" : 0,
"10.000000" : 0,
"20.000000" : 0,
"30.000000" : 0,
"40.000000" : 0,
"50.000000" : 0,
"60.000000" : 0,
"70.000000" : 0,
"80.000000" : 0,
"90.000000" : 0,
"95.000000" : 0,
"99.000000" : 0,
"99.500000" : 0,
"99.900000" : 0,
"99.950000" : 0,
"99.990000" : 0
}
},
"total_ios" : 0
},
"job_runtime" : 60000,
"usr_cpu" : 1.595000,
"sys_cpu" : 5.946667,
"ctx" : 2417538,
"majf" : 0,
"minf" : 13,
"iodepth_level" : {
"1" : 100.000000,
"2" : 0.000000,
"4" : 0.000000,
"8" : 0.000000,
"16" : 0.000000,
"32" : 0.000000,
">=64" : 0.000000
},
"iodepth_submit" : {
"0" : 0.000000,
"4" : 100.000000,
"8" : 0.000000,
"16" : 0.000000,
"32" : 0.000000,
"64" : 0.000000,
">=64" : 0.000000
},
"iodepth_complete" : {
"0" : 0.000000,
"4" : 100.000000,
"8" : 0.000000,
"16" : 0.000000,
"32" : 0.000000,
"64" : 0.000000,
">=64" : 0.000000
},
"latency_ns" : {
"2" : 0.000000,
"4" : 0.000000,
"10" : 0.000000,
"20" : 0.000000,
"50" : 0.000000,
"100" : 0.000000,
"250" : 0.000000,
"500" : 0.010000,
"750" : 0.000000,
"1000" : 0.000000
},
"latency_us" : {
"2" : 0.000000,
"4" : 0.000000,
"10" : 0.010000,
"20" : 0.010000,
"50" : 91.458314,
"100" : 8.007836,
"250" : 0.531947,
"500" : 0.010000,
"750" : 0.010000,
"1000" : 0.000000
},
"latency_ms" : {
"2" : 0.010000,
"4" : 0.000000,
"10" : 0.000000,
"20" : 0.000000,
"50" : 0.000000,
"100" : 0.000000,
"250" : 0.000000,
"500" : 0.000000,
"750" : 0.000000,
"1000" : 0.000000,
"2000" : 0.000000,
">=2000" : 0.000000
},
"latency_depth" : 1,
"latency_target" : 0,
"latency_percentile" : 100.000000,
"latency_window" : 0
}
],
"disk_util" : [
{
"name" : "sdf",
"read_ios" : 61,
"write_ios" : 1206557,
"read_merges" : 0,
"write_merges" : 0,
"read_ticks" : 19,
"write_ticks" : 57153,
"in_queue" : 0,
"util" : 99.921618
}
]
}

fio: (groupid=0, jobs=1): err= 0: pid=2569530: Mon Oct 7 07:44:26 2019
write: IOPS=20.1k, BW=78.7MiB/s (82.5MB/s)(4722MiB/60001msec); 0 zone resets
slat (nsec): min=2710, max=90334, avg=3244.29, stdev=451.33
clat (nsec): min=320, max=1366.7k, avg=45972.58, stdev=8124.81
lat (usec): min=45, max=1369, avg=49.29, stdev= 8.14
clat percentiles (usec):
| 1.00th=[ 44], 5.00th=[ 44], 10.00th=[ 44], 20.00th=[ 44],
| 30.00th=[ 44], 40.00th=[ 44], 50.00th=[ 44], 60.00th=[ 45],
| 70.00th=[ 45], 80.00th=[ 45], 90.00th=[ 50], 95.00th=[ 56],
| 99.00th=[ 82], 99.50th=[ 103], 99.90th=[ 128], 99.95th=[ 137],
| 99.99th=[ 159]
bw ( KiB/s): min=80000, max=80840, per=100.00%, avg=80584.39, stdev=142.97, samples=119
iops : min=20000, max=20210, avg=20146.09, stdev=35.78, samples=119
lat (nsec) : 500=0.01%
lat (usec) : 10=0.01%, 20=0.01%, 50=91.46%, 100=8.01%, 250=0.53%
lat (usec) : 500=0.01%, 750=0.01%
lat (msec) : 2=0.01%
cpu : usr=1.59%, sys=5.95%, ctx=2417538, majf=0, minf=13
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,1208766,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=78.7MiB/s (82.5MB/s), 78.7MiB/s-78.7MiB/s (82.5MB/s-82.5MB/s), io=4722MiB (4951MB), run=60001-60001msec

Disk stats (read/write):
sdf: ios=61/1206557, merge=0/0, ticks=19/57153, in_queue=0, util=99.92%
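As a sanity check on the fio output above, the following sketch recomputes the bandwidth from the reported IOPS and the 4K block size, and estimates the IOPS that the mean latency implies at iodepth=1. The input numbers are taken from the JSON "write" section; the variable and function names are only for this illustration.

```python
# Numbers copied from the fio JSON "write" section above.
iops = 20145.764237          # "iops"
block_size = 4 * 1024        # --bs=4K, in bytes
mean_lat_ns = 49291.875710   # "lat_ns" -> "mean"


def bandwidth_bytes_per_s(iops: float, block_size: int) -> float:
    """Bandwidth is simply operations per second times bytes per operation."""
    return iops * block_size


bw = bandwidth_bytes_per_s(iops, block_size)
print(f"bandwidth: {bw / 1e6:.1f} MB/s = {bw / 2**20:.1f} MiB/s")

# With one job and iodepth=1, only one write is in flight at a time,
# so the achievable IOPS is roughly 1 / (mean latency per write).
implied_iops = 1e9 / mean_lat_ns
print(f"IOPS implied by mean latency: {implied_iops:.0f}")
```

The recomputed bandwidth should agree with fio's "bw_bytes" field and summary line, and the latency-implied IOPS should land close to the reported ~20K, which suggests that the result is internally consistent and limited by per-write sync latency.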

Alwin said:
The major traffic is going over the public network, only replication of the OSDs is put on the cluster network.

Ok, 192.168.10.0/24 is our 10Gbit/s network. Isn't the OSD replication the biggest part of the traffic? For accessing the data only, the public 1Gbits link should be ok?

Or does that mean that, although we have a 10 Gbit/s link and run the test locally on the machine, the benchmark will still use the public 1 Gbit/s connection? But then 180 MB/s would be quite ambitious for that link.

Alwin said:
if you run Ceph on a public (internet) interface, you may think about network encryption as Ceph only encrypts the authentication, not the rest.

Our "public" network is still protected from the internet by a firewall and not directly exposed. Ceph is not accessible at all from "outside" although we're using public IP addresses at this point.

G0ldmember · Oct 7, 2019

One thing I have noticed is that the official ceph examples use spaces and not underscores in ceph.conf file:

https://github.com/ceph/ceph/blob/master/src/sample.ceph.conf

e.g.

Code:

public network = x.x.x.x
cluster network = y.y.y.y

whereas the autogenerated ceph.conf file in PVE looks like

Code:

public_network = x.x.x.x
cluster_network = y.y.y.y

Maybe both notations are valid?

At least some other options seem to be valid in both notations according to https://docs.ceph.com/docs/giant/rados/configuration/ceph-conf/

In your ceph.conf file, you may use spaces when specifying a setting name. When specifying a setting name on the command line, ensure that you use an underscore or hyphen (_ or -) between terms (e.g., debug osd becomes debug-osd).

Alwin · Oct 7, 2019

wikon-it said:
Maybe both notations are valid?

Both notations are valid. Ceph will complain if it can't parse something.
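The equivalence of the notations can be pictured as a normalization step on the option name. The sketch below is only an illustration of that idea, not Ceph's actual parser; the function name normalize_option is hypothetical.

```python
import re


def normalize_option(name: str) -> str:
    """Collapse spaces, hyphens and underscores between words into one '_'.

    Illustration only: this mimics the idea that 'public network',
    'public-network' and 'public_network' name the same setting.
    """
    return re.sub(r"[ _-]+", "_", name.strip())


for spelling in ("public network", "public_network", "public-network"):
    print(repr(spelling), "->", normalize_option(spelling))
```

With this view, the PVE-generated file and the sample ceph.conf from the Ceph repository describe the same settings, so the notation is unlikely to explain the performance difference.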

wikon-it said:
Ok, 192.168.10.0/24 is our 10Gbit/s network. Isn't the OSD replication the biggest part of the traffic? For accessing the data only, the public 1Gbits link should be ok?

As you can see, that OSD averages 78.7 MiB/s at ~20K IOPS, which is faster than the SM863 used in the benchmark paper. Since multiple nodes write to the cluster at the same time, they will max out the 1 GbE link.
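A rough back-of-the-envelope sketch of this point: it converts the raw line rate of a 1 GbE and a 10 GbE link into MB/s and compares it with the single-SSD fio result of 82.5 MB/s from above. Protocol overhead is ignored, and the function name link_capacity_mb_s is made up for this illustration.

```python
def link_capacity_mb_s(gbit_per_s: float) -> float:
    """Raw line rate in decimal MB/s, ignoring Ethernet/TCP overhead."""
    return gbit_per_s * 1e9 / 8 / 1e6


ssd_mb_s = 82.5  # single-SSD 4K sync write bandwidth reported by fio above

for gbit in (1, 10):
    capacity = link_capacity_mb_s(gbit)
    # How many SSDs writing at the fio rate would fill the link.
    ssds_to_fill = capacity / ssd_mb_s
    print(f"{gbit:>2} Gbit/s = {capacity:.0f} MB/s "
          f"(filled by about {ssds_to_fill:.1f} such SSD streams)")
```

Under these assumptions, fewer than two of the 15 OSDs writing at the measured rate are enough to fill a 1 GbE link. A client running on one of the storage nodes can still see somewhat more than the line rate, presumably because part of the data is served by OSDs on the same host; that is an assumption about this setup, not something the thread confirms.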

G0ldmember · Oct 7, 2019

Ok, thanks Alwin for clarifying this. I guess we'll have to go for 10Gbits on the "public" network as well to improve performance.

G0ldmember · Oct 16, 2019

After a reinstall and a proper network configuration, the results are much better:

Total time run: 39.2912
Total reads made: 13438
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1368.04
Average IOPS: 342
Stddev IOPS: 23.2001
Max IOPS: 381
Min IOPS: 265
Average Latency(s): 0.0457668
Max latency(s): 0.683604
Min latency(s): 0.0118792
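For comparison with the first run, the sketch below recomputes the rados bench bandwidth from "Total reads made", "Read size" and "Total time run" for both result blocks in this thread. It assumes that the "MB/sec" figure is derived with 2^20 bytes per MB, which is what the posted numbers suggest; the function name is hypothetical.

```python
OBJECT_SIZE = 4194304  # "Read size" / "Object size" in bytes (4 MiB)


def bench_bandwidth(reads: int, seconds: float,
                    object_size: int = OBJECT_SIZE) -> float:
    """Bandwidth = objects read * bytes per object / runtime, per 2**20 bytes."""
    return reads * object_size / seconds / 2**20


# Values copied from the two rados bench outputs posted in this thread.
before = bench_bandwidth(2993, 60.5082)    # first run, 1 Gbit/s public network
after = bench_bandwidth(13438, 39.2912)    # after the network was reconfigured

print(f"before: {before:.2f}  after: {after:.2f}  "
      f"speedup: {after / before:.1f}x")
```

Both recomputed values should match the "Bandwidth (MB/sec)" lines reported by rados bench up to rounding of the runtime.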
