Ceph performance seems too slow

kentravis

Member
Feb 23, 2023
I am thinking our Ceph performance isn't what it should be for a three-node cluster, each node with three NVMe drives (9 OSDs total). Ceph runs on an independent 10 Gbps network. I am seeing Apply/Commit latency of about 3/3 ms with about a dozen VMs/containers sitting idle. I noticed the pool had only 33 PGs, but when I changed it to 128, auto-scaling scaled it back to 33. I know less than 10 ms is usually considered OK, but having a bit of experience with SSD-backed SANs years ago, I just think I should see sub-millisecond latency.
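For reference, this is roughly what I did to check and change the PG count (pool name is a placeholder):

ceph osd pool autoscale-status          # shows PG_NUM and the autoscaler's target per pool
ceph osd pool set <pool> pg_num 128     # my manual change, which got scaled back to 33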

Am I crazy?

Ken
 
Am I crazy?
You did not give us enough numbers to decide ;-)

What is the topology? Default "size=3/min=2"?
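You can read it off with something like:

ceph osd pool ls detail     # shows size/min_size and pg_num per pool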

What is the actual performance?
  • for the OSDs you could run: for OSD in {0..8}; do ceph tell osd.$OSD bench; done
  • for a pool "ceph0" you could run: rados bench -p ceph0 10 write -b 4K
  • ...and then there is "fio" for a mounted CephFS and/or in a VM using virtual disks - a minimal example below
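A minimal latency-oriented fio sketch, assuming it runs inside a VM against a test file (path, size and runtime are placeholders):

fio --name=lat-test --filename=/root/fio-test.bin --size=1G \
    --rw=randwrite --bs=4k --ioengine=libaio --direct=1 \
    --iodepth=1 --runtime=30 --time_based --group_reporting

With --iodepth=1 and --direct=1 the reported completion latency is close to the per-write round trip, which is the number being discussed here.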
Please post some results - in a CODE-block "</>".


Disclaimer: I am not using Ceph currently...
 
Ceph runs on an independent 10 Gbps network
You DO understand that your "speed" can't be faster than the transport. And if you are using the same interface for both public and private (cluster) traffic, that effectively caps your performance at 5 Gbit/s.
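If you want them split, ceph.conf takes separate subnets for the two kinds of traffic - something like this (subnets are placeholders):

public_network = 10.0.1.0/24       # client/monitor traffic
cluster_network = 10.0.2.0/24      # OSD replication traffic on its own interface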

I just think I should see sub-millisecond latency.
This is your drives' observed latency and has nothing to do with Ceph.
 
You did not give us enough numbers to decide ;-)
 
Hi UdoB,

I am not sure I understand about posting in a code block, but here are the results of the commands. Also, my VMs and the Proxmox hosts couldn't find the fio command, so I may need some more guidance for that.
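(I am guessing the install is something like this on the Proxmox hosts, since they are Debian-based, but I have not tried it yet:)

apt install fio

Anyway, here is our ceph.conf followed by the benchmark runs: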

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = fc00::/64
fsid = e5d1de59-d114-468c-ac1c-71022ca09761
mon_allow_pool_delete = true
mon_host = fc00::2 fc00::3 fc00::4
ms_bind_ipv4 = false
ms_bind_ipv6 = true
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = fc00::/64


root@pve2:~# ceph tell osd.1 bench

{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.61238325900000001,
"bytes_per_sec": 1753382066.246197,
"iops": 418.03886085658002
}
root@pve2:~# ceph tell osd.$osd bench
error handling command target: osd id not integer
root@pve2:~# ceph tell osd.1 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.53740845699999995,
"bytes_per_sec": 1997999491.8464785,
"iops": 476.3601998916813
}
root@pve2:~# ceph tell osd.2 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.52995045900000004,
"bytes_per_sec": 2026117358.2642372,
"iops": 483.06402165037088
}
root@pve2:~# ceph tell osd.3 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.60851644299999996,
"bytes_per_sec": 1764523927.5810335,
"iops": 420.69528760457837
}
root@pve2:~# ceph tell osd.4 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.62637273800000004,
"bytes_per_sec": 1714221834.4758165,
"iops": 408.70233404059803
}
root@pve2:~# ceph tell osd.5 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.59378917200000003,
"bytes_per_sec": 1808287982.7252896,
"iops": 431.12945144779434
}
root@pve2:~# ceph tell osd.6 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.85508710799999998,
"bytes_per_sec": 1255710457.9806154,
"iops": 299.38470315471062
}
root@pve2:~# ceph tell osd.7 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.52125375399999996,
"bytes_per_sec": 2059921517.6107874,
"iops": 491.12356128949818
}
root@pve2:~# ceph tell osd.8 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.53113702200000001,
"bytes_per_sec": 2021591001.0505726,
"iops": 481.98485399498287
}
root@pve2:~# ceph tell osd.0 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.53072796499999997,
"bytes_per_sec": 2023149136.2999876,
"iops": 482.35634238719643

root@pve2:~# rados bench -p whac-pool 10 write -b 4K
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_pve2_977252
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16      3107      3091   12.0729   12.0742    0.005117   0.0051595
    2      16      5327      5311   10.3721   8.67188  0.00524738   0.0060144
    3      16      8299      8283   10.7842   11.6094   0.0052023  0.00578668
    4      16     10565     10549   10.3008   8.85156  0.00487787  0.00605959
    5      16     13697     13681   10.6873   12.2344  0.00516581   0.0058415
    6      16     15812     15796    10.283   8.26172   0.0040372   0.0060716
    7      16     17930     17914   9.99581   8.27344  0.00511451  0.00624703
    8      16     19039     19023   9.28778   4.33203  0.00437441  0.00642582
    9      16     21166     21150    9.1789   8.30859  0.00491179  0.00680369
   10      15     24227     24212   9.45699   11.9609  0.00488335   0.0066035
Total time run: 10.0048
Total writes made: 24227
Write size: 4096
Object size: 4096
Bandwidth (MB/sec): 9.4591
Stddev Bandwidth: 2.51358
Max bandwidth (MB/sec): 12.2344
Min bandwidth (MB/sec): 4.33203
Average IOPS: 2421
Stddev IOPS: 643.476
Max IOPS: 3132
Min IOPS: 1109
Average Latency(s): 0.00660262
Stddev Latency(s): 0.0211973
Max latency(s): 0.399299
Min latency(s): 0.00318786
Cleaning up (deleting benchmark objects)
Removed 24227 objects
Clean up completed and total clean up time :6.0361
 
You DO understand that your "speed" can't be faster than the transport. And if you are using the same interface for both public and private (cluster) traffic, that effectively caps your performance at 5 Gbit/s.


This is your drives' observed latency and has nothing to do with Ceph.
For the three nodes, we put an extra two NICs in each box dedicated just to Ceph communication, not connected to a switch; each box is directly connected to the other two. How would I see the latency for the entire Ceph cluster? (Back in the day, SAN/Windows shops had applications that showed that.)
 
For the three nodes, we put an extra two NICs in each box dedicated just to Ceph communication, not connected to a switch
What you're describing is a full-mesh network. For the purposes of this conversation, that is the same as having a single active link on each node for both public and private networking, so you're sharing the I/O for any PG between the client and disk traffic on a single 10 Gbit link.
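Rough numbers, assuming the default size=3 pool and everything sharing that one 10 Gbit link:

10 Gbit/s ≈ 1.25 GB/s of raw transport per link
1 client write = 1 transfer to the primary OSD + 2 replica transfers to the other nodes
→ each byte a guest writes can cross the shared link up to 3 times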

How would I see the latency for the entire Ceph cluster?
Keep adding load until I/O wait starts to climb. In your configuration that's likely no more than 10 active VMs, but latency only matters to specific applications (e.g., databases), so just benchmark those.
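To actually watch it while loading the cluster, something along these lines (iostat comes from the sysstat package):

ceph osd perf     # per-OSD commit/apply latency in ms
iostat -x 1       # per-device utilization and await on each node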

Back in the day, SANs/Windows had applications that show that.
Whatever tools you used in the past likely still exist, although the applications used "back in the day" may have had different requirements, so keep that in mind; those benchmarks may not be of actual use here.

iops": 482.35634238719643
You have VERY SLOW SSDs or VERY FAST HDDs. With ~500 IOPS drives you're not likely to even hit your bandwidth limit.