Ceph performance problems

AlexLup

Well-Known Member
Mar 19, 2018
In short: I am transferring an 8 GB file at 107 MB/s for the first 2 GB, then it drops to 20 MB/s. Ceph recovery also sits at 25 MB/s on a 10 Gbit network.

Tried messing with the journal size, but no go.
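For the record, applying a new journal size to a filestore OSD goes roughly like this (a sketch; osd.2 is just an example):

Code:
# stop the OSD and flush its journal to the data disk
systemctl stop ceph-osd@2
ceph-osd -i 2 --flush-journal
# bump "osd journal size" in ceph.conf, then recreate the journal
ceph-osd -i 2 --mkjournal
systemctl start ceph-osd@2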

Hardware:
Code:
pve1
Disks -----------------------------------------------------
SATA1 'sda' - OS - 300 GB
SATA2 'sdb' - WD Blue - 4 TB - journal sdd
SATA3 'sdc' - Hitachi 7449 - 500 GB - journal sdd
SATA6 'sde' - Samsung 0311 - 500 GB - journal sdd
SATACARD1 'sdf' - Micron SSD 1240BB (180 MB/s read) - 120 GB - unused
SATACARD2 'sdd' - Intel SSD 1207GN (210 MB/s read) - 120 GB
--------------------------------------------------------------
Network -------------------
Front 1 Gb
192.168.1.x
SAN -
QLogic 5771x/578xx 10/20-Gigabit Ethernet Driver bnx2x 1.712.30-0 (2014/02/10)
MTU 9000
172.16.1.x
iperf tested to 9.8 Gbit/s
--------------------------------
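
For reference, the iperf figure comes from a plain run over the SAN interfaces (the server address is an example from the 172.16.1.x range):

Code:
# on the receiving node
iperf -s
# on the sending node, across the 10 Gb SAN
iperf -c 172.16.1.13 -t 30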


Code:
pve2
Disks ------------------------------------------------------
SATA1 'sda' - OS - 300 GB
SATA1 'sdb' - WD Blue - 4 TB - journal sdf
SATA4 'sdc' - Samsung - 500 GB - journal sdf
SATA5 'sdd' - Hitachi - 500 GB - journal sdf
SATA6 'sde' - Samsung - 500 GB - journal sdf
SATACARD1 'sdg' - Samsung EVO SSD (189 MB/s read) - 120 GB - unused
SATACARD2 'sdf' - Kingston V300 - 120 GB
--------------------------------------------------------------
Network -------------------
Front 1 Gb
192.168.1.x
SAN - Dual XGb SFP+ LP Board S/N ZQ38BK0221  Chip rev 0x42
MTU 9000
172.16.1.x
iperf tested to 9.8 Gbit/s
--------------------------------

Code:
pve3 - Witness for now
1 Gbit
No OSDs
Only mon

Software:
Code:
 ceph.conf

[global]
     auth client required = cephx
     auth cluster required = cephx
     auth service required = cephx
     cluster network = 172.16.1.0/24
     fsid = e44fbe1c-b1c7-481d-bd25-dc595eae2d13
     keyring = /etc/pve/priv/$cluster.$name.keyring
     mon allow pool delete = true
     osd journal size = 28120
     osd pool default min size = 2
     osd pool default size = 2
     public network = 192.168.1.0/24

debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0

[osd]
     keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.pve1]
     host = pve1
     mon addr = 192.168.1.12:6789

[mon.pve3]
     host = pve3
     mon addr = 192.168.1.14:6789

[mon.pve2]
     host = pve2
     mon addr = 192.168.1.13:6789

Code:
pveversion

pve-manager/5.1-46/ae8241d4 (running kernel: 4.13.13-6-pve)

Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host pve1 {
    id -3        # do not change unnecessarily
    id -2 class hdd        # do not change unnecessarily
    # weight 4.551
    alg straw2
    hash 0    # rjenkins1
    item osd.4 weight 0.456
    item osd.5 weight 0.456
    item osd.2 weight 3.640
}
host pve2 {
    id -7        # do not change unnecessarily
    id -8 class hdd        # do not change unnecessarily
    # weight 5.007
    alg straw2
    hash 0    # rjenkins1
    item osd.3 weight 3.640
    item osd.0 weight 0.456
    item osd.1 weight 0.456
    item osd.6 weight 0.456
}
root default {
    id -1        # do not change unnecessarily
    id -4 class hdd        # do not change unnecessarily
    # weight 9.558
    alg straw2
    hash 0    # rjenkins1
    item pve1 weight 4.551
    item pve2 weight 5.007
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
 
What is your question? And where do you want to go with this setup?
 
What is your question? And where do you want to go with this setup?
My question is simply: how can I improve my Ceph performance so that client traffic saturates the 1 Gb network and recovery runs at its maximum? I am guessing there is an issue in my config besides the underwhelming hardware, seeing as the speed is fine for the first 2 GB.

Thanks,
Alex
 
Well, at first glance, you have two OSD servers and three monitors. The third monitor is slow, as the latency of 1 Gb is roughly ten times higher than 10 Gb. With only two OSD servers, there is not much to gain through configuration, as you don't have enough hosts or drives to spread the load. In general, you need to scale the cluster up with more hardware to gain performance.

Check what your hardware can do with sync writes. Take the fio tests from the link below to compare your setup.
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/
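
A minimal version of that benchmark's fio write test would look roughly like this (destructive, so only point it at a disk with no data on it; /dev/sdf is just an example):

Code:
fio --ioengine=libaio --filename=/dev/sdf --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based --name=journal-test

This measures single-threaded sync write performance, which is the workload a filestore journal SSD sees.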

An alternative to ceph and maybe more beneficial for you, is the setup with zfs and storage replication (pvesr).
https://pve.proxmox.com/pve-docs/chapter-pvesr.html
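
For a single VM, such a replication job could look roughly like this (the VM ID, target node, and schedule are examples):

Code:
# replicate VM 100 to pve2 every 15 minutes
pvesr create-local-job 100-0 pve2 --schedule "*/15"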
 
Hi,
I am marking a lightly used drive "out" to run the fio test, but it's slow going, as I said: below 1 Gb Ethernet speed, and at 20% of the speed of a SATA mechanical drive.
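For reference, taking an OSD out for testing goes roughly like this (osd.5 is just an example):

Code:
# mark the OSD out so Ceph rebalances its data away
ceph osd out 5
# once backfill finishes, stop the daemon and test the raw disk
systemctl stop ceph-osd@5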

Code:
 data:
    pools:   1 pools, 128 pgs
    objects: 188k objects, 751 GB
    usage:   1511 GB used, 8275 GB / 9787 GB avail
    pgs:     10586/386084 objects misplaced (2.742%)
             121 active+clean
             6   active+remapped+backfill_wait
             1   active+remapped+backfilling
 
  io:
    client:   5117 B/s wr, 0 op/s rd, 1 op/s wr
    recovery: 20527 kB/s, 5 objects/s<----
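
If the recovery throttles are part of the limit, they can be raised at runtime with something like this (the values are examples, and higher throttles cut into client I/O):

Code:
ceph tell 'osd.*' injectargs '--osd-max-backfills 4 --osd-recovery-max-active 8'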

Here is the write test, at 64 MB/s min and 164 MB/s max, which is OK in my book.

Code:
rados bench 60 write -b 4M -t 16 -p ceph_pool

Total time run:         60.486178
Total writes made:      1780
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     117.713
Stddev Bandwidth:       14.694
Max bandwidth (MB/sec): 164 <---
Min bandwidth (MB/sec): 64
Average IOPS:           29
Stddev IOPS:            3
Max IOPS:               41
Min IOPS:               16
Average Latency(s):     0.543648
Stddev Latency(s):      0.24444
Max latency(s):         1.88796
Min latency(s):         0.0979623
Cleaning up (deleting benchmark objects)
Removed 1780 objects
Clean up completed and total clean up time :3.953643
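
For comparison, the matching read test would look roughly like this (sequential reads replay objects left behind by a write run with --no-cleanup):

Code:
# write without cleanup first, then read the same objects back
rados bench 60 write -b 4M -t 16 -p ceph_pool --no-cleanup
rados bench 60 seq -t 16 -p ceph_pool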