Ceph - High apply latency on OSD causes poor performance on VM

fall

Hi,

Since installing our new Ceph cluster, we have frequently seen high apply latency on the OSDs (roughly 200 ms to 1500 ms), while the commit latency stays at 0 ms!

According to the Ceph documentation, when you run the command "ceph osd perf", fs_commit_latency is generally higher than fs_apply_latency. For us it is the opposite.
The problem has gotten worse since we upgraded Ceph (from Giant 0.87.1 to Hammer 0.94.1).
The consequence is that our Windows VMs are very slow.
Could anyone tell us whether our configuration is sound, and in what direction we should investigate?

Code:
# ceph osd perf
osd fs_commit_latency(ms) fs_apply_latency(ms)
  0                     0                   62
  1                     0                  193
  2                     0                   88
  3                     0                  269
  4                     0                 1055
  5                     0                  322
  6                     0                  272
  7                     0                  116
  8                     0                  653
  9                     0                    4
 10                     0                    1
 11                     0                    7
 12                     0                    4
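
In case it helps, this is how we have been trying to dig into a single slow OSD through its admin socket (a sketch of our approach; osd.4 and the socket path are just examples from our setup):
Code:
# list the slowest recent operations on a suspect OSD, with per-stage timings
ceph --admin-daemon /var/run/ceph/ceph-osd.4.asok dump_historic_ops

# dump the internal filestore/journal performance counters of the same OSD
ceph --admin-daemon /var/run/ceph/ceph-osd.4.asok perf dump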

Some information about our configuration:

- Proxmox 3.4-6
- kernel: 3.10.0-10-pve
- Ceph:
- Hammer 0.94.1
- 3 hosts with 3 OSDs of 4 TB each (9 OSDs) + 1 SSD of 500 GB per host for journals
- 1 host with 4 OSDs of 300 GB (4 OSDs) + 1 SSD of 500 GB for journals

- OSD tree:
Code:
# ceph osd tree
ID WEIGHT   TYPE NAME           UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 33.83995 root default
-6 22.91995     room salle-dr
-2 10.92000         host ceph01
 0  3.64000             osd.0        up  1.00000          1.00000
 2  3.64000             osd.2        up  1.00000          1.00000
 1  3.64000             osd.1        up  1.00000          1.00000
-3 10.92000         host ceph02
 3  3.64000             osd.3        up  1.00000          1.00000
 4  3.64000             osd.4        up  1.00000          1.00000
 5  3.64000             osd.5        up  1.00000          1.00000
-5  1.07996         host ceph06
 9  0.26999             osd.9        up  1.00000          1.00000
10  0.26999             osd.10       up  1.00000          1.00000
11  0.26999             osd.11       up  1.00000          1.00000
12  0.26999             osd.12       up  1.00000          1.00000
-7 10.92000     room salle-log
-4 10.92000         host ceph03
 6  3.64000             osd.6        up  1.00000          1.00000
 7  3.64000             osd.7        up  1.00000          1.00000
 8  3.64000             osd.8        up  1.00000          1.00000

- ceph.conf
Code:
[global]
         auth client required = cephx
         auth cluster required = cephx
         auth service required = cephx
         auth supported = cephx
         cluster network = 10.10.1.0/24
         filestore xattr use omap = true
         fsid = 2dbbec32-a464-4bc5-bb2b-983695d1d0c6
         keyring = /etc/pve/priv/$cluster.$name.keyring
         mon osd adjust heartbeat grace = true
         mon osd down out subtree limit = host
         osd disk threads = 24
         osd heartbeat grace = 10
         osd journal size = 5120
         osd max backfills = 1
         osd op threads = 24
         osd pool default min size = 1
         osd recovery max active = 1
         public network = 192.168.80.0/24


[osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring


[mon.0]
         host = ceph01
         mon addr = 192.168.80.41:6789


[mon.1]
         host = ceph02
         mon addr = 192.168.80.42:6789


[mon.2]
         host = ceph03
         mon addr = 192.168.80.43:6789

Thanks.
Best regards
 
Hi,
...

- ceph.conf
Code:
[global]
        osd disk threads = 24
        osd op threads = 24
Hi,
do you think these two options are a good choice?

osd disk threads controls how many threads are used for background things like scrubbing!! Scrubbing eats IO! I would use 1 thread here.
And osd op threads looks very high to me (do you have a real monster server??). I would do performance tests with 4/8 threads.
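
For example, a starting point for testing could look like this (just a sketch, you must benchmark it on your own hardware; the runtime injection avoids restarting the OSDs):
Code:
# ceph.conf
[osd]
        osd disk threads = 1
        osd op threads = 4

# or inject the values at runtime for a quick test
ceph tell osd.* injectargs '--osd_disk_threads 1 --osd_op_threads 4'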

How does your journaling look (partition?), and what kind of SSD do you use for the journal?

Udo
 
Hi,

Sorry for the error in ceph.conf: the value of those two parameters is actually 4.
Code:
# ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config show | grep threads
    "osd_op_threads": "4",
    "osd_disk_threads": "4",

We have 1 dedicated SSD of 240 GB per OSD node for journals (one 240 GB SSD for the three 4 TB OSDs).
The SSD is an "HP 240GB 6G SATA VE 3.5in SCC EV" (ref HP : 718177-B21)
The journal size is 5 GB. Is that enough?
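
To check whether the SSD itself could be the bottleneck, we are also thinking of measuring its synchronous write performance with fio (a sketch: /dev/sdb4 is meant to be a spare partition on the journal SSD in this example, and fio will overwrite it, so it must never point at a journal that is in use):
Code:
# 4k sequential writes with O_DIRECT + sync and queue depth 1, similar to journal traffic
fio --name=journal-test --filename=/dev/sdb4 --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based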

Partition table of the SSD:
Code:
# fdisk -l /dev/sdb*


Disk /dev/sdb: 240.0 GB, 240021504000 bytes
256 heads, 63 sectors/track, 29066 cylinders, total 468792000 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 262144 bytes
Disk identifier: 0x00000000


   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1   468791999   234395999+  ee  GPT
Partition 1 does not start on physical sector boundary.


Disk /dev/sdb1: 5367 MB, 5367677952 bytes
255 heads, 63 sectors/track, 652 cylinders, total 10483746 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 262144 bytes
Disk identifier: 0x00000000


Disk /dev/sdb1 doesn't contain a valid partition table


Disk /dev/sdb2: 5367 MB, 5367677952 bytes
255 heads, 63 sectors/track, 652 cylinders, total 10483746 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 262144 bytes
Disk identifier: 0x00000000


Disk /dev/sdb2 doesn't contain a valid partition table


Disk /dev/sdb3: 5367 MB, 5367677952 bytes
255 heads, 63 sectors/track, 652 cylinders, total 10483746 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 262144 bytes
Disk identifier: 0x00000000


Disk /dev/sdb3 doesn't contain a valid partition table

Our 3 OSD nodes hardware configuration:
- HP DL380p G8, 2 x Xeon E5-2630 2.6 GHz, 24 GB RAM
- RAID: Smart Array P420i
- 2 x 450 GB HP SAS 15000 rpm in RAID 1 for the OS
- 3 x 4 TB HP SAS 7200 rpm for OSDs
- 1 x 240 GB SSD for journals
- network: 4 x 1 GbE HP + 1 dual-port HP 530T 10 Gb card for replication

We are wondering whether this could come from a hardware incompatibility: RAID controller, ...
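
One thing we plan to check on the Smart Array controller is how the controller cache and the drive write cache are configured for the SSD and for the OSD disks (a sketch; hpssacli is the HP CLI tool installed on our hosts, older setups use hpacucli):
Code:
# controller status plus cache, logical drive and physical drive settings
hpssacli ctrl all show status
hpssacli ctrl all show config detail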

If you have any idea ...
Thanks a lot.

Fall
 
I can't really speak to the hardware compatibility, but I personally find larger journals to be better performers for me, so I changed my default journal size in the conf to 12 GB. My ceph osd perf looks like:
Code:
osd fs_commit_latency(ms) fs_apply_latency(ms) 
  0                     0                   10 
  1                     0                    7 
  2                     0                    3 
  3                     0                    2 
  4                     1                    5 
  5                     0                    1 
  6                     0                    1 
  7                     0                    7 
  8                     0                    5 
  9                     1                    3 
 10                     0                    1 
 11                     0                    3 
 12                     0                   12 
 13                     0                    1 
 14                     0                    0 
 15                     0                    1 
 16                     1                    6 
 17                     1                    5

This is with smaller (and probably slower, due to data density on the platters) drives than yours. My journal drives are Intel DC S3700s (this is currently with 18 VMs using it as back-end storage for Proxmox, 8 of which are Windows guests, Ceph 0.94.1, which was upgraded from Giant).
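
For reference, this is roughly the change and the procedure I used (a sketch from memory; osd.1 is just an example, and the journal partition has to be recreated at the larger size before rebuilding the journal):
Code:
# ceph.conf - the journal size is given in MB, so 12288 = 12 GB
[osd]
        osd journal size = 12288

# then, one OSD at a time:
ceph osd set noout
service ceph stop osd.1
ceph-osd -i 1 --flush-journal
# recreate the journal partition at the new size here, then:
ceph-osd -i 1 --mkjournal
service ceph start osd.1
ceph osd unset noout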

What's interesting from your numbers is that the best latency numbers are from your ceph06 box, so I'm curious what might be different about that box vs the others. Is there a hardware difference between this box that holds 4 OSDs and the other boxes that have 3? Also, it looks like you have 4 OSD host nodes, but in your first post you said you only had 3?
 
Hi nethfel,

My ceph06 box is a newly added OSD server that is actually older hardware than the others (HP G6, P410i RAID card, 8 GB RAM, 4 x 300 GB disks for OSDs), yet strangely it has the best latency numbers. That's why I think there is probably a hardware problem.

I have also noticed that the format and mount options of the OSDs are different.
On a server that has the latency problem, here are the format log (from Proxmox) and the disk mount:
Code:
create OSD on /dev/sdd (xfs)
using device '/dev/sdb' for journal
Creating new GPT entries.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
The operation has completed successfully.
WARNING:ceph-disk:OSD will not be hot-swappable if journal is not the same device as the osd data
Information: Moved requested sector from 52428834 to 52430848 in
order to align on 2048-sector boundaries.
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot.
The operation has completed successfully.
Information: Moved requested sector from 34 to 2048 in
order to align on 2048-sector boundaries.
The operation has completed successfully.
meta-data=/dev/sdd1 isize=2048 agcount=32, agsize=30523264 blks
= sectsz=512 attr=2, projid32bit=0
data = bsize=4096 blocks=976744448, imaxpct=5
= sunit=64 swidth=64 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=476928, version=2
= sectsz=512 sunit=64 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
The operation has completed successfully.
TASK OK

Code:
# mount
....
/dev/sdd1 on /var/lib/ceph/osd/ceph-4 type xfs (rw,noatime,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota)
...
The sunit and swidth values are 64 at format time and 512 at mount time (the stripe size of the RAID logical disk is 256 KB).
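
If I read the units correctly, mkfs.xfs reports sunit/swidth in 4 KB filesystem blocks while the mount options use 512-byte units, so 64 blocks and 512 sectors would both correspond to the same 256 KB stripe. xfs_info can be used to double-check (ceph-4 is one of our slow OSDs):
Code:
# sunit/swidth are reported here in filesystem blocks (bsize=4096)
xfs_info /var/lib/ceph/osd/ceph-4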

On a server that has no problem, here are the format log (from Proxmox) and the disk mount:
Code:
create OSD on /dev/sdc (xfs)
using device '/dev/sdg' for journal
Creating new GPT entries.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
The operation has completed successfully.
WARNING:ceph-disk:OSD will not be hot-swappable if journal is not the same device as the osd data
Information: Moved requested sector from 10485794 to 10487808 in
order to align on 2048-sector boundaries.
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot.
The operation has completed successfully.
Information: Moved requested sector from 34 to 2048 in
order to align on 2048-sector boundaries.
The operation has completed successfully.
meta-data=/dev/sdc1 isize=2048 agcount=4, agsize=18308434 blks
= sectsz=512 attr=2, projid32bit=0
data = bsize=4096 blocks=73233735, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=35758, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
The operation has completed successfully.
TASK OK

Code:
# mount
....
/dev/sdc1 on /var/lib/ceph/osd/ceph-9 type xfs (rw,noatime,attr2,inode64,noquota)
...
The sunit and swidth values are 0 at both format and mount time (the stripe size of the RAID logical disk is 128 KB).

Do you think this could be the reason for the latencies?
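
If we end up re-creating some OSDs, I understand the mkfs and mount options can be pinned in ceph.conf so that every OSD gets formatted and mounted the same way (a sketch only, the values below are examples to validate, not our current settings):
Code:
[osd]
        osd mkfs options xfs = -f -i size=2048
        osd mount options xfs = rw,noatime,attr2,inode64,noquota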
 
