Ceph performance seems too slow

kentravis

Member
Feb 23, 2023
I am thinking our Ceph performance isn't what it should be for a three-node cluster, each node with three NVMe drives (9 OSDs total). Ceph runs on an independent 10 Gbps network. I am seeing Apply/Commit latency of about 3/3 ms with about a dozen VMs/containers sitting idle. I noticed the pool had only 33 PGs, but when I changed it to 128, auto-scaling scaled it back to 33. I know less than 10 ms is usually considered OK, but having a bit of experience with SSD-backed SANs years ago, I just think I should see sub-millisecond latency.
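For reference, this is roughly what I did to check and change the PG count (pool name is a placeholder):

ceph osd pool autoscale-status          # shows PG_NUM and the autoscaler's target per pool
ceph osd pool set <pool> pg_num 128     # my manual change, which got scaled back to 33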

Am I crazy?

Ken
 
Am I crazy?
You did not give us enough numbers to decide ;-)

What is the topology? Default "size=3/min=2"?
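You can read it off with something like:

ceph osd pool ls detail     # shows size/min_size and pg_num per pool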

What is the actual performance?
  • for the OSDs you could run: for OSD in {0..8}; do ceph tell osd.$OSD bench; done
  • for a pool "ceph0" you could run: rados bench -p ceph0 10 write -b 4K
  • ...and then there is "fio" for a mounted CephFS and/or in a VM using virtual disks - a minimal example below
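A minimal latency-oriented fio sketch, assuming it runs inside a VM against a test file (path, size and runtime are placeholders):

fio --name=lat-test --filename=/root/fio-test.bin --size=1G \
    --rw=randwrite --bs=4k --ioengine=libaio --direct=1 \
    --iodepth=1 --runtime=30 --time_based --group_reporting

With --iodepth=1 and --direct=1 the reported completion latency is close to the per-write round trip, which is the number being discussed here.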
Please post some results - in a CODE-block "</>".


Disclaimer: I am not using Ceph currently...
 
Ceph runs on an independent 10 Gbps network
You DO understand that your "speed" can't be faster than the transport. And if you are using the same interface for both public and private (cluster) traffic, that effectively caps your performance at 5 Gbit/s.
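If you want them split, ceph.conf takes separate subnets for the two kinds of traffic - something like this (subnets are placeholders):

public_network = 10.0.1.0/24       # client/monitor traffic
cluster_network = 10.0.2.0/24      # OSD replication traffic on its own interface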

I just think I should see sub-millisecond latency.
This is your drives' observed latency and has nothing to do with Ceph.
 
You did not give us enough numbers to decide ;-)
 
Hi UdoB,

I am not sure I understand about posting in a code block, but here are the results of the commands. Also, my VMs and the Proxmox hosts couldn't find the fio command, so I may need some more guidance for that.
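(I am guessing the install is something like this on the Proxmox hosts, since they are Debian-based, but I have not tried it yet:)

apt install fio

Anyway, here is our ceph.conf followed by the benchmark runs: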

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = fc00::/64
fsid = e5d1de59-d114-468c-ac1c-71022ca09761
mon_allow_pool_delete = true
mon_host = fc00::2 fc00::3 fc00::4
ms_bind_ipv4 = false
ms_bind_ipv6 = true
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = fc00::/64


root@pve2:~# ceph tell osd.1 bench

{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.61238325900000001,
"bytes_per_sec": 1753382066.246197,
"iops": 418.03886085658002
}
root@pve2:~# ceph tell osd.$osd bench
error handling command target: osd id not integer
root@pve2:~# ceph tell osd.1 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.53740845699999995,
"bytes_per_sec": 1997999491.8464785,
"iops": 476.3601998916813
}
root@pve2:~# ceph tell osd.2 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.52995045900000004,
"bytes_per_sec": 2026117358.2642372,
"iops": 483.06402165037088
}
root@pve2:~# ceph tell osd.3 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.60851644299999996,
"bytes_per_sec": 1764523927.5810335,
"iops": 420.69528760457837
}
root@pve2:~# ceph tell osd.4 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.62637273800000004,
"bytes_per_sec": 1714221834.4758165,
"iops": 408.70233404059803
}
root@pve2:~# ceph tell osd.5 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.59378917200000003,
"bytes_per_sec": 1808287982.7252896,
"iops": 431.12945144779434
}
root@pve2:~# ceph tell osd.6 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.85508710799999998,
"bytes_per_sec": 1255710457.9806154,
"iops": 299.38470315471062
}
root@pve2:~# ceph tell osd.7 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.52125375399999996,
"bytes_per_sec": 2059921517.6107874,
"iops": 491.12356128949818
}
root@pve2:~# ceph tell osd.8 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.53113702200000001,
"bytes_per_sec": 2021591001.0505726,
"iops": 481.98485399498287
}
root@pve2:~# ceph tell osd.0 bench
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.53072796499999997,
"bytes_per_sec": 2023149136.2999876,
"iops": 482.35634238719643

root@pve2:~# rados bench -p whac-pool 10 write -b 4K
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_pve2_977252
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16      3107      3091   12.0729   12.0742    0.005117   0.0051595
    2      16      5327      5311   10.3721   8.67188  0.00524738   0.0060144
    3      16      8299      8283   10.7842   11.6094   0.0052023  0.00578668
    4      16     10565     10549   10.3008   8.85156  0.00487787  0.00605959
    5      16     13697     13681   10.6873   12.2344  0.00516581   0.0058415
    6      16     15812     15796    10.283   8.26172   0.0040372   0.0060716
    7      16     17930     17914   9.99581   8.27344  0.00511451  0.00624703
    8      16     19039     19023   9.28778   4.33203  0.00437441  0.00642582
    9      16     21166     21150    9.1789   8.30859  0.00491179  0.00680369
   10      15     24227     24212   9.45699   11.9609  0.00488335   0.0066035
Total time run: 10.0048
Total writes made: 24227
Write size: 4096
Object size: 4096
Bandwidth (MB/sec): 9.4591
Stddev Bandwidth: 2.51358
Max bandwidth (MB/sec): 12.2344
Min bandwidth (MB/sec): 4.33203
Average IOPS: 2421
Stddev IOPS: 643.476
Max IOPS: 3132
Min IOPS: 1109
Average Latency(s): 0.00660262
Stddev Latency(s): 0.0211973
Max latency(s): 0.399299
Min latency(s): 0.00318786
Cleaning up (deleting benchmark objects)
Removed 24227 objects
Clean up completed and total clean up time :6.0361
 
You DO understand that your "speed" can't be faster than the transport. And if you are using the same interface for both public and private (cluster) traffic, that effectively caps your performance at 5 Gbit/s.


This is your drives' observed latency and has nothing to do with Ceph.
For the three nodes, we put an extra two NICs in each box dedicated just to Ceph communication, not connected to a switch; each box is directly connected to the other two. How would I see the latency for the entire Ceph cluster? (Back in the day, SAN/Windows shops had applications that showed that.)
 
For the three nodes, we put an extra two NICs in each box dedicated just to Ceph communication, not connected to a switch
What you're describing is a full-mesh network. For the purposes of this conversation, that is the same as having a single active link on each node for both public and private networking, so you're sharing the I/O for any PG between the client and disk traffic on a single 10 Gbit link.
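Rough numbers, assuming the default size=3 pool and everything sharing that one 10 Gbit link:

10 Gbit/s ≈ 1.25 GB/s of raw transport per link
1 client write = 1 transfer to the primary OSD + 2 replica transfers to the other nodes
→ each byte a guest writes can cross the shared link up to 3 times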

How would I see the latency for the entire Ceph cluster?
Keep adding load until I/O wait starts to climb. In your configuration that's likely no more than 10 active VMs, but latency only matters to specific applications (e.g., databases), so just benchmark those.
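To actually watch it while loading the cluster, something along these lines (iostat comes from the sysstat package):

ceph osd perf     # per-OSD commit/apply latency in ms
iostat -x 1       # per-device utilization and await on each node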

Back in the day, SANs/Windows had applications that show that.
Whatever tools you used in the past likely still exist, although the applications used "back in the day" may have had different requirements, so keep that in mind; those benchmarks may not be of actual use here.

iops": 482.35634238719643
You have VERY SLOW SSDs or VERY FAST HDDs. With ~500 IOPS drives you're not likely to even hit your bandwidth limit.