Terrible Ceph IOPS performance

starkruzr

Well-Known Member
Hi folks,

I have a three-node cluster on a 10G network with very little traffic. I have a six-OSD flash-only pool with two devices — a 1TB NVMe drive and a 256GB SATA SSD — on each node, and here’s how it benchmarks:
[attached screenshot of rados bench results: bandwidth around 8Gbps, but only ~258 IOPS]
Those IOPS numbers. Oof. How can I troubleshoot this? Someone mentioned that I might be able to run more than one OSD on the NVMe — how is that done, and can I do it “on the fly” with the system already up and running like this? And, will more OSDs give me better IOPS? All three hosts are on 10GbE on a very quiet network. This is on 6.1-8. One host is a Ryzen 3600 with 32GB RAM, the other two are Skylake i5s with 32GB RAM.
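
(For reference, the "more than one OSD per NVMe" part can't be done in place on a live OSD; the usual route is to drain the device, destroy its OSD, and recreate it with ceph-volume's batch mode, one device at a time. A rough sketch, where osd.3 and /dev/nvme0n1 are placeholders for your own OSD id and device:)

Code:
# let Ceph move PGs off the OSD, then wait for HEALTH_OK (check with: ceph -s)
ceph osd out 3
# stop and remove the old OSD
systemctl stop ceph-osd@3
ceph osd purge 3 --yes-i-really-mean-it
# wipe the device and recreate it as two OSDs
ceph-volume lvm zap /dev/nvme0n1 --destroy
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1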
 
Hi folks,

I have a three-node cluster on a 10G network with very little traffic. I have a six-OSD flash-only pool with two devices — a 1TB NVMe drive and a 256GB SATA SSD

Found your problem. You do see that your bandwidth is maxed out in that benchmark, correct? It can't go any faster because of that.
 
Wait, how does my bandwidth being around 8Gb make my IOPS so low?

You have a 10Gbps network and your bandwidth is at 8Gbps. You do not get the full 10Gbps with Ethernet; there is overhead to account for, so roughly 8Gbps is your practical maximum.
 
I'm not worried about the throughput; the throughput is great. I'm talking about IOPS being around 258. Those different performance characteristics are unrelated to each other.

Your bandwidth is maxed. It's not possible to increase the iops.

You can use fio; if you use smaller block sizes you will see more IOPS.
 
Also, do not expect native speed with Ceph. It's going to be slower than a standard local-disk setup.

And by putting NVMe and SATA SSDs in the same pool, your max speed is whatever the weakest SSD can do; it drags down the entire pool. So unless you're using enterprise-grade SSDs, watch out: some consumer SSDs can be slower than hard drives once their cache is full.
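
(If you do want to keep the NVMe and SATA devices out of the same pool, CRUSH device classes are one way to split them; a rough sketch, where the OSD ids and the rule name nvme-only are placeholders, fastwrx stands for the pool in question, and changing a pool's rule triggers data movement:)

Code:
# tag the NVMe OSDs with their own device class (osd ids are placeholders)
ceph osd crush rm-device-class osd.0 osd.2 osd.4
ceph osd crush set-device-class nvme osd.0 osd.2 osd.4
# CRUSH rule that only places data on the nvme class, failure domain = host
ceph osd crush rule create-replicated nvme-only default host nvme
# point the pool at the new rule
ceph osd pool set fastwrx crush_rule nvme-only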
 
So, I first tested this after having left the cluster alone for several hours; their caches wouldn't be full.

After doing more testing with `rados bench` and `fio`, I am starting to think I must have overlooked the very large object size (4MB) that rados bench uses by default. I was tempted to believe the pool just couldn't get above 300 IOPS with any workload, because my VM performance, especially when installing a bunch of packages, felt very poor.
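
(That would explain the earlier numbers, assuming the default 4MB objects: 8Gbps is roughly 1GB/s, and 1GB/s divided by 4MB per object works out to roughly 250 IOPS, right around what the first benchmark reported.)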

But:

Code:
root@ibnmajid:~# rados bench -p fastwrx 10 write --no-cleanup -b 4096 -t 10
hints = 1
Maintaining 10 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_ibnmajid_3422290
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      10      2070      2060   8.04638   8.04688  0.00282386  0.00484976
    2      10      3896      3886   7.58931   7.13281  0.00291829  0.00514156
    3      10      5851      5841   7.60492   7.63672  0.00284875  0.00508713
    4      10      7612      7602   7.42328   6.87891  0.00258881  0.00526062
    5      10      9670      9660   7.54631   8.03906  0.00273798     0.00516
    6      10     11534     11524   7.50204   7.28125  0.00298023  0.00520527
    7      10     13651     13641   7.61158   8.26953  0.00246954  0.00510792
    8      10     15625     15615   7.62392   7.71094  0.00378808  0.00512255
    9      10     17603     17593   7.63526   7.72656  0.00372817  0.00507857
   10      10     19547     19537   7.63104   7.59375  0.00221603  0.00511757
Total time run:         10.0033
Total writes made:      19547
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     7.63302
Stddev Bandwidth:       0.43465
Max bandwidth (MB/sec): 8.26953
Min bandwidth (MB/sec): 6.87891
Average IOPS:           1954
Stddev IOPS:            111.27
Max IOPS:               2117
Min IOPS:               1761
Average Latency(s):     0.00511712
Stddev Latency(s):      0.0156618
Max latency(s):         0.145381
Min latency(s):         0.00119912

That's not GREAT, but it's also not nearly as bad as what I saw before. And:

Code:
jtd@tauron ~ % sudo fio --name=randread --ioengine=libaio --iodepth=16 --rw=randread --bs=4k --direct=0 --size=512M --numjobs=4 --runtime=240 --group_reporting
randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.1
Starting 4 processes
randread: Laying out IO file (1 file / 512MiB)
randread: Laying out IO file (1 file / 512MiB)
randread: Laying out IO file (1 file / 512MiB)
randread: Laying out IO file (1 file / 512MiB)
Jobs: 3 (f=3): [r(3),_(1)][100.0%][r=58.0MiB/s,w=0KiB/s][r=14.9k,w=0 IOPS][eta 00m:00s]
randread: (groupid=0, jobs=4): err= 0: pid=15224: Sat Mar 28 21:08:38 2020
   read: IOPS=14.3k, BW=55.0MiB/s (58.7MB/s)(2048MiB/36578msec)
    slat (usec): min=21, max=31377, avg=275.04, stdev=242.34
    clat (nsec): min=1472, max=42651k, avg=4155397.59, stdev=1118801.88
     lat (usec): min=172, max=42996, avg=4430.82, stdev=1167.15
    clat percentiles (usec):
     |  1.00th=[ 3392],  5.00th=[ 3589], 10.00th=[ 3687], 20.00th=[ 3785],
     | 30.00th=[ 3884], 40.00th=[ 3949], 50.00th=[ 4015], 60.00th=[ 4113],
     | 70.00th=[ 4228], 80.00th=[ 4293], 90.00th=[ 4490], 95.00th=[ 4752],
     | 99.00th=[ 7046], 99.50th=[ 9765], 99.90th=[21627], 99.95th=[27395],
     | 99.99th=[38536]
   bw (  KiB/s): min=10720, max=15712, per=25.15%, avg=14417.31, stdev=856.71, samples=287
   iops        : min= 2680, max= 3928, avg=3604.27, stdev=214.18, samples=287
  lat (usec)   : 2=0.01%, 4=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=46.07%, 10=53.46%, 20=0.35%, 50=0.12%
  cpu          : usr=1.17%, sys=2.14%, ctx=524869, majf=0, minf=92
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=524288,0,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=55.0MiB/s (58.7MB/s), 55.0MiB/s-55.0MiB/s (58.7MB/s-58.7MB/s), io=2048MiB (2147MB), run=36578-36578msec

Disk stats (read/write):
  vda: ios=523884/10, merge=0/7, ticks=138256/4, in_queue=106396, util=99.49%

jtd@tauron ~ % sudo fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k --direct=0 --size=1G --numjobs=2 --runtime=240 --group_reporting
randwrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.1
Starting 2 processes
randwrite: Laying out IO file (1 file / 1024MiB)
randwrite: Laying out IO file (1 file / 1024MiB)
Jobs: 2 (f=2): [w(2)][100.0%][r=0KiB/s,w=194MiB/s][r=0,w=49.6k IOPS][eta 00m:00s]
randwrite: (groupid=0, jobs=2): err= 0: pid=15259: Sat Mar 28 21:11:21 2020
  write: IOPS=60.9k, BW=238MiB/s (250MB/s)(2048MiB/8606msec)
    slat (nsec): min=1182, max=72997k, avg=30990.46, stdev=680805.77
    clat (nsec): min=240, max=2522.8k, avg=416.25, stdev=7170.00
     lat (nsec): min=1583, max=73001k, avg=31679.12, stdev=681194.83
    clat percentiles (nsec):
     |  1.00th=[   262],  5.00th=[   262], 10.00th=[   262], 20.00th=[   262],
     | 30.00th=[   262], 40.00th=[   262], 50.00th=[   270], 60.00th=[   270],
     | 70.00th=[   270], 80.00th=[   282], 90.00th=[   350], 95.00th=[   402],
     | 99.00th=[   548], 99.50th=[  1528], 99.90th=[ 33024], 99.95th=[ 75264],
     | 99.99th=[142336]
   bw (  KiB/s): min=62488, max=555169, per=50.48%, avg=123022.94, stdev=111263.33, samples=34
   iops        : min=15622, max=138792, avg=30755.56, stdev=27815.85, samples=34
  lat (nsec)   : 250=0.03%, 500=98.67%, 750=0.65%, 1000=0.08%
  lat (usec)   : 2=0.14%, 4=0.15%, 10=0.11%, 20=0.04%, 50=0.05%
  lat (usec)   : 100=0.06%, 250=0.03%, 500=0.01%
  lat (msec)   : 4=0.01%
  cpu          : usr=2.35%, sys=7.63%, ctx=10457, majf=0, minf=21
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,524288,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=238MiB/s (250MB/s), 238MiB/s-238MiB/s (250MB/s-250MB/s), io=2048MiB (2147MB), run=8606-8606msec

Disk stats (read/write):
  vda: ios=0/120932, merge=0/77210, ticks=0/58756, in_queue=50284, util=29.81%

^-- that's from a VM hosted on the flash pool. I kicked the size of the write up to 1GB and those are still pretty good numbers, though since I ran with --direct=0 I imagine some of that is being cached in RAM.

Now I don't know what to think, tbh.
 
Now I don't know what to think, tbh.

Yes, OSDs have their own cache, so you're probably seeing that. The problem is you have a 10Gbps network and your SSD/NVMe pool is maxing out the bandwidth.

If you had a 56Gbps FDR InfiniBand setup you would probably see it hitting 30Gbps+ with significantly higher IOPS. Depending on pool size you might even max out that setup.

For the record, is this a dedicated 10Gbps public network plus a 10Gbps private network, or are you sharing a single 10Gbps port? This will also affect your performance.
 
Sharing a single 10Gbps port. I thought about taking some dual-port 10G NICs and "ring" topology-ing them together as a private Ceph network in addition to the access network. There's barely any traffic on this network already, though. I guess it might still help because you're effectively doubling the bandwidth in and out of every node?
 
Sharing a single 10Gbps port. I thought about taking some dual-port 10G NICs and "ring" topology-ing them together as a private Ceph network in addition to the access network. There's barely any traffic on this network already, though. I guess it might still help because you're effectively doubling the bandwidth in and out of every node?

If you bond them you can double the bandwidth. But the point of dedicated public+private networking in Ceph is that the private network acts as the backend for data transfer between OSDs, while the public network is your CephFS/RadosGW client network.

I would test both out. Bonding would probably be more rewarding, but with the amount of IOPS you are trying to get it might be better to do a separate public+private network.
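
(For the bonding option, a minimal sketch of an LACP bond in /etc/network/interfaces on a Proxmox node; the NIC names are placeholders, the switch has to support 802.3ad, and a single TCP stream still tops out at one link's speed:)

Code:
auto bond0
iface bond0 inet manual
    # placeholder names for the two ports of the dual-port 10G card
    bond-slaves enp1s0f0 enp1s0f1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4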

You keep mentioning traffic, but that's not relevant. Ceph has to talk to every OSD on the network - it's not just the bandwidth you see on CephFS. You are using NVMe drives, which are fast enough to clog the pipes. Requests get slowed down and Ceph is unable to keep up.

I would not be surprised at all if an FDR 56Gbps InfiniBand setup maxed out on your hardware, or at least got to around the 35Gbps+ mark.
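
(Splitting the networks boils down to something like this in /etc/pve/ceph.conf; the subnets are made-up placeholders for the existing access network and a new OSD-only network, and OSDs need a restart to pick up the change:)

Code:
[global]
    # client / monitor traffic stays on the existing access network
    public_network = 10.10.10.0/24
    # OSD replication and recovery traffic moves to the dedicated link
    cluster_network = 10.10.20.0/24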
 
Try with a bigger iodepth, numjobs=1, and direct writes:

"fio --name=randwrite --ioengine=libaio --iodepth=64 --rw=randwrite --bs=4k --direct=1 --size=1G --numjobs=1 --runtime=240"

If you have good datacenter SSDs (not consumer SSDs), you should be able to do 10-20k IOPS (4k) per SSD, and the CPU generally limits IOPS.

(I'm able to reach 200-300k IOPS 4k write with 3 nodes (3x 24 cores @ 3GHz), and 700k IOPS 4k randread; the clients also have 3GHz CPUs.)
 
Try with a bigger iodepth, numjobs=1, and direct writes:

"fio --name=randwrite --ioengine=libaio --iodepth=64 --rw=randwrite --bs=4k --direct=1 --size=1G --numjobs=1 --runtime=240"

If you have good datacenter SSDs (not consumer SSDs), you should be able to do 10-20k IOPS (4k) per SSD, and the CPU generally limits IOPS.

(I'm able to reach 200-300k IOPS 4k write with 3 nodes (3x 24 cores @ 3GHz), and 700k IOPS 4k randread; the clients also have 3GHz CPUs.)
Thanks for this. I'm getting about 7100 IOPS write with that benchmark and 30MB/s throughput at that rate. That's really not bad at all, since these are definitely NOT good datacenter SSDs :) I think adding OSDs to my NVMe drives helped.
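
(A quick way to sanity-check how the new OSDs ended up laid out across hosts and devices, if useful:)

Code:
# shows each host's OSDs with size, utilisation and PG count
ceph osd df tree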
 
Thanks for this. I'm getting about 7100 IOPS write with that benchmark and 30MB/s throughput at that rate. That's really not bad at all, since these are definitely NOT good datacenter SSDs :) I think adding OSDs to my NVMe drives helped.
Just be careful with consumer SSDs (you really need to bench them), and try to avoid Samsung EVOs; they are pretty shitty for sync writes (something like 200 IOPS, like a 7.2k hard drive).
see
https://www.sebastien-han.fr/blog/2...-if-your-ssd-is-suitable-as-a-journal-device/
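
(The test in that article is essentially a single-job, queue-depth-1, synchronous 4k write straight at the device, roughly like the fio run below; /dev/sdX is a placeholder, and writing to the raw device destroys the data on it:)

Code:
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test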
 
If you bond them you can double the bandwidth. But the point of dedicated public+private networking in Ceph is that the private network acts as the backend for data transfer between OSDs, while the public network is your CephFS/RadosGW client network.

I would test both out. Bonding would probably be more rewarding, but with the amount of IOPS you are trying to get it might be better to do a separate public+private network.

You keep mentioning traffic, but that's not relevant. Ceph has to talk to every OSD on the network - it's not just the bandwidth you see on CephFS. You are using NVMe drives, which are fast enough to clog the pipes. Requests get slowed down and Ceph is unable to keep up.

I would not be surprised at all if an FDR 56Gbps InfiniBand setup maxed out on your hardware, or at least got to around the 35Gbps+ mark.
I have a 56Gbit InfiniBand setup, RoCEv2, three nodes, 3 SSD OSDs per node. I get the same IOPS. The links are fine; I have both throughput and latency as good as they can get. I think you're wrong, bud: it's not a network performance issue, it's a Ceph issue.
 
I have a 56Gbit InfiniBand setup, RoCEv2, three nodes, 3 SSD OSDs per node. I get the same IOPS. The links are fine; I have both throughput and latency as good as they can get. I think you're wrong, bud: it's not a network performance issue, it's a Ceph issue.
Optane cache can help, but not as much as you'd hope. Effectively, replication with ZFS gives _WAY_ better performance than Ceph could ever muster.
 
