Ceph high latency

daros

Renowned Member
Jul 22, 2014
Hello,

We have a cluster of 6 servers.
One server is used as a backup server, so it runs no VMs and is not a member of the Ceph cluster.
The other 5 servers each have 2x E5-2620v4, 256 GB DDR4, 4x 1 TB SSD and 1x 4 TB SSD.

The latency of the 4 TB SSDs is very high, which slows down the whole cluster.
As I understand it, a mix of SSDs is possible with Ceph, but are the smallest and biggest sizes too far apart?

We ordered an extra server with 4x 4 TB SSD. Will this balance the Ceph storage better? Or what do you recommend?
We made some manual changes to the CRUSH map to balance the load better between the OSDs.

The 4 TB OSDs had a weight of 3.4 at first, but we lowered it first to 2.7 and now even to 2.2.
So all the SSDs with weight 0.873 are 1 TB (PM863a) and the ones with weight 2.200 are 4 TB (PM863a).
prox-s05 has the 1 TB Intel SSDs (S3520).
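
For other readers: such a weight change can also be applied live from the CLI instead of editing the decompiled map by hand. A minimal sketch (osd.24 is one of our 4 TB OSDs; the same would be repeated for each 4 TB OSD):
Code:
# lower the CRUSH weight of one 4 TB OSD to 2.2
ceph osd crush reweight osd.24 2.2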

Currently we are still on Proxmox 5.6; we don't dare to upgrade yet.

Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd
device 15 osd.15 class ssd
device 16 osd.16 class ssd
device 17 osd.17 class ssd
device 18 osd.18 class ssd
device 19 osd.19 class ssd
device 20 osd.20 class ssd
device 21 osd.21 class ssd
device 22 osd.22 class ssd
device 23 osd.23 class ssd
device 24 osd.24 class ssd
device 25 osd.25 class ssd
device 26 osd.26 class ssd
device 27 osd.27 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host prox-s01 {
    id -3        # do not change unnecessarily
    id -4 class ssd        # do not change unnecessarily
    # weight 6.566
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 0.873
    item osd.2 weight 0.873
    item osd.5 weight 0.873
    item osd.19 weight 0.873
    item osd.1 weight 0.873
    item osd.24 weight 2.200
}
host prox-s02 {
    id -5        # do not change unnecessarily
    id -6 class ssd        # do not change unnecessarily
    # weight 6.566
    alg straw2
    hash 0    # rjenkins1
    item osd.3 weight 0.873
    item osd.6 weight 0.873
    item osd.8 weight 0.873
    item osd.10 weight 0.873
    item osd.17 weight 0.873
    item osd.25 weight 2.200
}
host prox-s03 {
    id -7        # do not change unnecessarily
    id -8 class ssd        # do not change unnecessarily
    # weight 6.566
    alg straw2
    hash 0    # rjenkins1
    item osd.4 weight 0.873
    item osd.7 weight 0.873
    item osd.9 weight 0.873
    item osd.11 weight 0.873
    item osd.18 weight 0.873
    item osd.26 weight 2.200
}
host prox-s04 {
    id -9        # do not change unnecessarily
    id -10 class ssd        # do not change unnecessarily
    # weight 6.566
    alg straw2
    hash 0    # rjenkins1
    item osd.12 weight 0.873
    item osd.13 weight 0.873
    item osd.14 weight 0.873
    item osd.15 weight 0.873
    item osd.16 weight 0.873
    item osd.27 weight 2.200
}
host prox-s05 {
    id -11        # do not change unnecessarily
    id -12 class ssd        # do not change unnecessarily
    # weight 3.493
    alg straw2
    hash 0    # rjenkins1
    item osd.20 weight 0.873
    item osd.21 weight 0.873
    item osd.22 weight 0.873
    item osd.23 weight 0.873
}
root default {
    id -1        # do not change unnecessarily
    id -2 class ssd        # do not change unnecessarily
    # weight 29.757
    alg straw2
    hash 0    # rjenkins1
    item prox-s01 weight 6.566
    item prox-s02 weight 6.566
    item prox-s03 weight 6.566
    item prox-s04 weight 6.566
    item prox-s05 weight 3.493
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
 
root@prox-s01:~# ceph df
GLOBAL:
    SIZE        AVAIL       RAW USED     %RAW USED
    34.9TiB     17.4TiB      17.5TiB         50.21
POOLS:
    NAME       ID     USED        %USED     MAX AVAIL     OBJECTS
    ceph01     3      6.01TiB     77.18       1.78TiB     1593307
can you format that using code format?

what is the pool size? ceph osd pool get {pool-name} size
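
For example, with the pool name shown in your output above that would be:
Code:
# returns the replication factor (size) configured for the pool
ceph osd pool get ceph01 size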
 
can you format that using code format?

what is the pool size?
Code:
root@prox-s01:~# ceph df
GLOBAL:
    SIZE        AVAIL       RAW USED     %RAW USED
    34.9TiB     17.4TiB      17.5TiB         50.21
POOLS:
    NAME       ID     USED        %USED     MAX AVAIL     OBJECTS
    ceph01     3      6.01TiB     77.18       1.78TiB     1593307

(attached screenshots: Schermafbeelding 2019-10-31 om 13.17.40.png, Schermafbeelding 2019-10-31 om 13.17.35.png)
 
For further debugging I'd check the system logs.
We have a site-wide rsyslog logfile that is checked daily using cron and by hand as needed.

If you want, I could post the script and you could run it on each node against /var/log/syslog.
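
As a rough idea of what it does (a hypothetical sketch, not the actual script), it boils down to grepping for disk and I/O error patterns:
Code:
# hypothetical sketch: look for common disk/controller error messages in syslog
grep -iE 'I/O error|blk_update_request|medium error|ata[0-9]+.*(error|failed)' /var/log/syslog | tail -n 50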
 
Also - we had issues with latency and used Zabbix graphs to isolate which times to check in the logs.

We found bad disks, bad disk models, etc.

One bad disk caused huge spikes in latency [5k+].

If you will use Zabbix - until the issues are fixed it should be installed outside the Ceph storage, as there will be a lot of I/O [in my opinion, as I am not an expert].
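
Before graphing is in place, a quick built-in check can already point at a single slow disk (a simple spot check, not a replacement for graphing over time):
Code:
# per-OSD commit/apply latency in ms; one OSD standing out usually means a bad disk
ceph osd perf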
 
The high latency is on all the 4 TB disks.
An SSD mix is possible with Ceph, but maybe the combination of 20x 1 TB and 4x 4 TB, with 17.54 TB of the 34.93 TB in use, is too much I/O for the 4 TB disks.
Because when we lower the weight so there is less data on the 4 TB disks (and therefore more reads and writes on the 1 TB disks), the problem is gone.

So will it help to add some more 4 TB disks, so the reads and writes are spread across more disks?
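
To see how the data and PGs end up distributed over the OSDs, we check something like this (standard command, output omitted here):
Code:
# per-OSD size, utilisation and PG count grouped by host;
# at weight 2.200 vs 0.873 a 4 TB OSD carries roughly 2.5x the data of a 1 TB OSD
ceph osd df tree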
 
You may want to consider putting the 4 TB disks in a different Ceph class.
And in case you are not all-SSD - add a fast SSD or NVMe for the DB or WAL of the HDDs.

We use classes, so I can help with that setup. As we have 0 HDDs, we do not use a separate DB or WAL.
 
We use classes, so I can help with that setup. As we have 0 HDDs, we do not use a separate DB or WAL.

I'm curious how you use classes with Ceph. I have a mix of HDD, SSD and NVMe M.2 and want to use a possible class delineation.
 
First of all, I am not a Ceph expert.

When starting out it is good not to use the default classes ssd and hdd. That way, when an OSD is added, data redistribution is not started automatically.

This is an example from some 4 TB NVMe's we added last week:
Code:
0- After adding the OSD to Ceph, make note of the OSD NUMBER.
   You will use that number in place of 'X' in the two commands below.

1- Remove the default device class from the new OSD:
ceph osd crush rm-device-class osd.X

2- Assign the custom device class (this implicitly creates the class the first time):
ceph osd crush set-device-class nvme-4tb osd.X

3- This only gets done once per class. It creates a CRUSH rule for pool creation to use:
ceph osd crush rule create-replicated nvme-4tb default host nvme-4tb

After that, create a pool with the same name as the class [you can use a different name if you choose],
select the new crush rule for the pool,
and when creating the pool check the box to have PVE create the storage.
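
If you prefer the CLI over the GUI for that last step, roughly the same can be done like this (a sketch; the pg count of 128 is only an example and should be sized for your cluster):
Code:
# create a replicated pool that uses the nvme-4tb crush rule from step 3
ceph osd pool create nvme-4tb 128 128 replicated nvme-4tb
# tag the pool for rbd use (VM disks)
ceph osd pool application enable nvme-4tb rbd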
 
You may want to consider putting the 4 TB disks in a different Ceph class.
And in case you are not all-SSD - add a fast SSD or NVMe for the DB or WAL of the HDDs.

We use classes, so I can help with that setup. As we have 0 HDDs, we do not use a separate DB or WAL.

Could you explain the classes and what you recommend?
And we don't have HDDs, only SSDs.
 
We originally had a class for SSD and another for NVMe. Those were named after the model names, s3700 and p3520. The models were all the same capacity.

Then we had a class for Bluestore and another for Filestore [there was a Bluestore bug].

Now we have a class for two different NVMe models; the class names are nvme-2TB and nvme-4TB.
 
Useful classes/pools are:
- the fastest storage for data entry / databases etc.
- slower storage for backups and data archives, like HDDs with an SSD/NVMe write cache [zfs log/cache = ceph wal/db? - see the sketch after this list]
- some NVMe's are designed for very fast writes, others for reads, etc.
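
For the HDD-with-write-cache case, the Ceph equivalent is placing the Bluestore DB/WAL on the fast device when the OSD is created; a hypothetical sketch (device names are only examples):
Code:
# hdd as the data device, db (which also holds the wal unless split out) on an nvme partition
ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1p1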
 
Useful classes/pools are:
- the fastest storage for data entry / databases etc.
- slower storage for backups and data archives, like HDDs with an SSD/NVMe write cache [zfs log/cache = ceph wal/db?]
- some NVMe's are designed for very fast writes, others for reads, etc.
Hello,

But in my case all the SSDs are the same type, PM863a.
The only difference is that most of them are 1 TB and 5 of them are 4 TB.
The 4 TB ones sometimes have high latency, probably because they hold more data.
We fixed it manually by changing the weight so they hold less data, but what is the best next step for us?
 
Of course only you can decide what is best.

For us, for data entry the lowest possible latency along with reliability is the goal. Users measure this as keyboard lag.

In the past we'd put security video on HDD and data on SSD. Then the data went to NVMe and the video to SSD. After that we found a great deal on more NVMe's, so we are now all NVMe.

With the latency you are experiencing, it would be good to make it so users have zero keyboard lag and frequently accessed data is served fast. Multiple classes could help with tuning that.
 
Or create 2 OSDs on the 4T variant.
But you can't break the hardware limits of the disks.
Read: <97k IOPS per disk
Write: <16k IOPS for the 1T, <24k IOPS for the 4T

Depending on the workload, 4x 1T disks (= 4T capacity) can theoretically give 4x 16k write IOPS, but 1x 4T gives <24k only. You can't override that.
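
A rough sketch of how splitting one 4T disk into two OSDs could look (device name is just an example; check that your ceph-volume version supports the option):
Code:
# create two bluestore OSDs on a single device
ceph-volume lvm batch --osds-per-device 2 /dev/sdc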
 
I added some more 4 TB drives so the load is more balanced.

So for other readers: yes, it is possible to have a mix of SSDs, but don't make the 'mix' (the size difference) too big.
 
