Ceph OSD Performance Issue

troycarpenter

While investigating OSD performance issues on a new Ceph cluster, I did the same analysis on my "good" cluster. I discovered something interesting, and fixing it may be the solution to my new cluster's issue.

For the "good" cluster, I have three nearly identical servers. Each server has four OSDs, for 12 total. When I do a "ceph tell osd.x bench -f plain", I get the following:
Code:
osd.0:bench: wrote 1024 MB in blocks of 4096 kB in 12.109905 sec at 86588 kB/sec
osd.1:bench: wrote 1024 MB in blocks of 4096 kB in 8.501180 sec at 120 MB/sec
osd.2:bench: wrote 1024 MB in blocks of 4096 kB in 11.384842 sec at 92102 kB/sec
osd.3:bench: wrote 1024 MB in blocks of 4096 kB in 8.695865 sec at 117 MB/sec
osd.4:bench: wrote 1024 MB in blocks of 4096 kB in 0.753332 sec at 1359 MB/sec
osd.5:bench: wrote 1024 MB in blocks of 4096 kB in 1.712017 sec at 598 MB/sec
osd.6:bench: wrote 1024 MB in blocks of 4096 kB in 2.815910 sec at 363 MB/sec
osd.7:bench: wrote 1024 MB in blocks of 4096 kB in 1.698323 sec at 602 MB/sec
osd.8:bench: wrote 1024 MB in blocks of 4096 kB in 0.283092 sec at 3617 MB/sec
osd.9:bench: wrote 1024 MB in blocks of 4096 kB in 2.606005 sec at 392 MB/sec
osd.10:bench: wrote 1024 MB in blocks of 4096 kB in 2.652026 sec at 386 MB/sec
osd.11:bench: wrote 1024 MB in blocks of 4096 kB in 2.468191 sec at 414 MB/sec

Note that the first four OSDs, all in one server, are very slow. All nodes are connected via a 10GbE network and use the same model of HDD, with a single SSD for the Bluestore DB in each server. There are 13 Proxmox nodes using this Ceph cluster, and running the benchmark from any of them shows the same numbers as above.
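
For anyone who wants to compare, this is roughly how I checked that the slow node's OSDs are configured like the rest, using the hardware details Ceph records per OSD (treat the exact field names as approximate; they vary a bit between Ceph releases):
Code:
# Compare OSD configuration across the cluster.
# Field names such as "rotational" and "bluefs_db_devices" may differ by Ceph version.
for id in $(ceph osd ls); do
  echo "== osd.$id =="
  ceph osd metadata $id | grep -E '"(devices|rotational|bluefs_db_devices|osd_objectstore)"'
done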

I am looking for help to figure out why the OSDs in the one server are so slow. I tried deleting osd.0 and recreating it, but it still shows the slow speed after recreation. I've been reading through documentation, but I'm hoping that someone may provide a pointer to get me to a solution faster.
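
In case it helps, this is roughly what I plan to check next on the slow node: watching the disks while a bench runs, to see whether the bottleneck is the HDD itself, the SSD holding the DB, or something above them. The device names below are just examples for my layout, so treat this as a sketch rather than a recipe.
Code:
# In one shell: watch per-device utilisation and latency while a bench runs.
iostat -x 2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde

# In another shell: run the bench against one of the slow OSDs.
ceph tell osd.0 bench -f plain

# Afterwards: check the drive's own health and error counters.
smartctl -a /dev/sda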

Thanks in advance.
 
I am now getting back to this issue. I haven't found anything that would explain why the OSDs in the first server (OSDs 0, 1, 2, 3) show an average write time of about 9 seconds, while the other OSDs (4 through 11) all have write times averaging about 1.5 seconds.

I have a different cluster and ALL the OSDs have write times around 9 seconds:
Code:
0:bench: wrote 1024 MB in blocks of 4096 kB in 8.988710 sec at 113 MB/sec
1:bench: wrote 1024 MB in blocks of 4096 kB in 8.344043 sec at 122 MB/sec
2:bench: wrote 1024 MB in blocks of 4096 kB in 8.901831 sec at 115 MB/sec
3:bench: wrote 1024 MB in blocks of 4096 kB in 9.016917 sec at 113 MB/sec
4:bench: wrote 1024 MB in blocks of 4096 kB in 8.334694 sec at 122 MB/sec
5:bench: wrote 1024 MB in blocks of 4096 kB in 8.300840 sec at 123 MB/sec
6:bench: wrote 1024 MB in blocks of 4096 kB in 8.102995 sec at 126 MB/sec
7:bench: wrote 1024 MB in blocks of 4096 kB in 7.966184 sec at 128 MB/sec
8:bench: wrote 1024 MB in blocks of 4096 kB in 9.441145 sec at 108 MB/sec

On the second cluster above, the overall Ceph speed is so slow that people using the VM guests complain, and some guests have issues with timeouts.

What do I need to look at to figure out why this benchmark is so slow?
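
One thing I have started looking at is Ceph's own per-OSD latency counters, to see whether the slowness shows up outside of the artificial bench as well (a sketch of what I'm running, not a definitive procedure):
Code:
# Per-OSD commit/apply latency as seen by Ceph itself.
ceph osd perf

# Utilisation and placement-group counts per OSD, to spot an overloaded disk.
ceph osd df tree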
 
Hi there,

What kind of servers and HDDs do you have?
It seems I have a similar issue.

regards,
Alex
 
Code:
ceph tell osd.0 bench -f plain
bench: wrote 1024 MB in blocks of 4096 kB in 8.649105 sec at 118 MB/sec

This is with no SSD and no 10G network, so I would guess either your SSD is not actually being used or your 10G network is really running at 1G. Just a guess, mind.
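
If it helps, something like this would confirm both guesses; the interface name and devices below are just placeholders for whatever your node actually uses:
Code:
# Is the NIC really linked at 10G? (replace eno1 with your interface)
ethtool eno1 | grep -i speed

# Is there actually an SSD in the box, and which disks are rotational?
lsblk -o NAME,ROTA,SIZE,TYPE,MOUNTPOINT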
 
In the first cluster, there are three nodes with four OSDs each. All HDDs (2 TB SAS) and SSD units are identical.

I have looked back over the configuration for the first node in the first cluster (OSDs 0-3) and compared it to the configuration on the other nodes, and there doesn't appear to be any difference. On all nodes, the SSD is on /dev/sdb, and each OSD's block.db points to a partition on the SSD.
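
To double-check that, I verified where each OSD's block.db symlink actually points on the slow node (again just a sketch; the paths depend on how the OSDs were created):
Code:
# Show where each OSD's block.db symlink points on this node.
ls -l /var/lib/ceph/osd/ceph-*/block.db

# ceph-volume shows the same mapping in more detail for LVM-based OSDs.
ceph-volume lvm list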

The node that is slower is connected to the main switch over a 10 Gb fiber link (both the node and the switch report 10 Gb on that link). The "ceph tell" command is run on the same node that has the slower OSDs.
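
To rule out the network side entirely, I also measured raw throughput between the slow node and one of the others with iperf3 (the OSD bench itself writes locally, so this is mostly to eliminate the link as a factor; the address below is a placeholder):
Code:
# On one of the "fast" nodes:
iperf3 -s

# On the slow node, towards that node's Ceph network address:
iperf3 -c 10.0.0.11 -t 30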
 
Hi there,

Four HP G8 servers with a P420i controller in HBA mode
4 OSDs each - 1 TB 7.2k SAS, Bluestore

1 Gbit/s network for Proxmox
2x 10 Gbit/s for Ceph on a Netgear XS708T, completely separate

Code:
osd.0: bench: wrote 1024 MB in blocks of 4096 kB in 20.028508 sec at 52354 kB/sec
osd.1: bench: wrote 1024 MB in blocks of 4096 kB in 16.748930 sec at 62605 kB/sec
osd.2: bench: wrote 1024 MB in blocks of 4096 kB in 12.350212 sec at 84903 kB/sec
osd.4: bench: wrote 1024 MB in blocks of 4096 kB in 20.396930 sec at 51408 kB/sec
osd.5: bench: wrote 1024 MB in blocks of 4096 kB in 16.246150 sec at 64543 kB/sec
osd.6: bench: wrote 1024 MB in blocks of 4096 kB in 16.796991 sec at 62426 kB/sec
osd.7: bench: wrote 1024 MB in blocks of 4096 kB in 14.001216 sec at 74891 kB/sec
osd.8: bench: wrote 1024 MB in blocks of 4096 kB in 10.875927 sec at 96412 kB/sec

The OSDs are showing something like that.

What should I look at first?
Thanks

P.S. Before, this was version 4 with filestore OSDs, and the speed was about 130 MB/s for each disk.
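
One thing I still need to check is the controller and drive write cache, since the P420i behaves quite differently in HBA mode. Roughly something like this (the tool is ssacli, or hpssacli on older firmware, and the exact output differs per generation, so just a sketch):
Code:
# Controller and cache status on the Smart Array:
ssacli ctrl all show config detail | grep -i cache

# Drive-level write cache (WCE) on a SAS disk:
sdparm --get=WCE /dev/sda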
 
On the second cluster above, the overall Ceph speed is so slow that people using the VM guests complain, and some guests have issues with timeouts.

Just taking this bit first: the VMs access Ceph as a whole, so their data isn't tied to any particular node or OSD. This sounds like a separate problem.

So, back to the main issue: I take it your three-node "good" cluster is live with a lot of data on it, but could you delete all the OSDs on the bad node and add the SSD and HDDs back as separate OSDs to run tests? It might be better to do that on the "new" cluster instead, though.
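
Roughly, rebuilding one OSD for such a test would look like this on Proxmox (the flags vary a little between PVE/Ceph versions and the device names are placeholders, so check the docs before running it):
Code:
# Take the OSD out and let Ceph rebalance, then stop and destroy it.
ceph osd out 0
systemctl stop ceph-osd@0
pveceph osd destroy 0 --cleanup

# Recreate it, either standalone or with the DB on the SSD.
pveceph osd create /dev/sdc
# pveceph osd create /dev/sdc --db_dev /dev/sdb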

On the "other" two node cluster all the performance is low, but you were happy until now?
Perhaps your ssds are degrading (in some ssd magic way)?
Perhaps you could borrow an ssd from the "new" cluster for diagnosis?
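
Checking wear is quick, something like this (the attribute names vary by SSD vendor):
Code:
# Wear / endurance indicators; attribute names differ between vendors.
smartctl -a /dev/sdb | grep -i -E 'wear|percent|life'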
 