Concern with Ceph IOPS despite having enterprise NVMe drives

bsinha

Hi,

We have a 3-node cluster where we have configured Proxmox with Ceph. The following are the specifications of each node:

Memory (RAM): 384 GB
vCPU: 80 (total)
NIC card: 2 x 25G (dedicated to Ceph traffic only) - however, the DAC cables are 10G
Other NIC cards: used for the management network, cluster traffic, etc.
OSD: 4 x Micron 7450 MAX drives participating as OSDs

We tested each drive with the following command and got around 55K to 60K IOPS at a 4K block size.

fio --ioengine=libaio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio
[attached screenshot: fio result for a single drive]

We ran fio simultaneously on 3 separate disks and got the above result.
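(Roughly, the parallel run was something along these lines - the exact device names below are just examples:)

# same 4k sync-write job against three drives at once (example device names)
for dev in /dev/sdb /dev/sdc /dev/sdd; do
    fio --ioengine=libaio --filename=$dev --direct=1 --sync=1 --rw=write \
        --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based \
        --name=fio-$(basename $dev) &
done
wait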

Now, we have configured the Ceph network (2 x 25G ports) as a full mesh, routed with fallback, by following this link - https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server

As mentioned earlier, these NIC ports are connected with 10G DACs. So effectively we see about 20 Gbit/s of aggregate traffic when we run an iperf test.
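(Something along these lines is how such an iperf3 check can be run; the mesh IP addresses below are only placeholders:)

# on node 2 and node 3 (listeners)
iperf3 -s
# on node 1, drive both mesh peers at once with parallel streams
iperf3 -c 10.15.15.2 -P 4 -t 30 &
iperf3 -c 10.15.15.3 -P 4 -t 30 &
wait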

After setting up Ceph with 4 NVMe drives on each of the 3 nodes, we ran the rados benchmark and got the following result, which is shocking:

rados bench -t 32 -b 4096 -p NVMepool1 60 write
[attached screenshot: rados bench result]

The IOPS we are getting is around 45K. However, we have a total of 12 NVMe drives; if each delivers 55K IOPS, the total is 55K x 12 = 660K. Should we not get at least 20% to 30% of that total, which is roughly 130K?

Why are we getting such low IOPS?

Let me share one more detail about the pool:
[attached screenshot: pool configuration]

I tried changing the pool size from 3 to 2, but did not see any significant change.
 
Have you tried running multiple rados bench instances in parallel, or increasing the -t value? (Maybe you are CPU-limited on the client?)
But 40,000 IOPS does seem quite low.
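For example, something like this (pool name from your post; --run-name just has to differ per instance):

# 4 rados bench writers in parallel against the same pool
for i in 1 2 3 4; do
    rados bench -t 32 -b 4096 -p NVMepool1 60 write --run-name bench$i --no-cleanup &
done
wait
# remove the benchmark objects afterwards
for i in 1 2 3 4; do rados -p NVMepool1 cleanup --run-name bench$i; done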

(You could increase pg_num to 1024; the new recommendation for NVMe is 200 PGs per OSD.)
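On an existing pool that would be something like (pool name taken from your rados bench command):

ceph osd pool set NVMepool1 pg_num 1024
ceph osd pool set NVMepool1 pgp_num 1024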

Maybe you can try running rados bench with:

TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=256MB rados bench ....

In the past, this helped a lot.
 
Yes, I tried increasing the -t value of rados bench to 1000 and got around 112K write IOPS. I had to stick with 512 PGs: I destroyed the old pool and tried to create a new pool with pg_num 1024, but it gave me the following error: error with 'osd pool create': mon_cmd failed - pg_num 1024 size 3 for this pool would result in 256 cumulative PGs per OSD (3072 total PG replicas on 12 'in' root OSDs by crush rule) which exceeds the mon_max_pg_per_osd value of 250

I tried with size 2, but that did not work either.
 
You can increase the mon_max_pg_per_osd value;
the Ceph devs are going to raise it in the next release anyway.

ceph config set mon mon_max_pg_per_osd 500

for example.

With a low number of OSDs it is really recommended to increase it, because of pg_lock contention. (Or you can create multiple OSDs per NVMe, 2 to 4; currently you can do this on the command line.)

#ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1

(I don't remember if the pveceph command supports it.)


Last thing: double-check that your server's CPU profile is configured for maximum performance; this can have a big impact on Ceph latency at small block sizes.
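On the Linux side you can check and force the frequency governor roughly like this (the BIOS power profile itself has to be checked in the BIOS or the vendor's management interface):

# show the governor currently used by each core
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
# switch to the performance governor (cpupower comes from the linux-cpupower package)
cpupower frequency-set -g performance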
 
We have a total of 12 NVMe drives.
You are running a test with a queue depth of one and no thread count (which means 1). It doesn't matter how many drives you have; you can only realize a fraction of a single disk's capability, since that is all you are testing.

The question I would be asking is: do you have an idea what workload you are trying to simulate? That would be the benchmark to run.
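As an illustration only: if the intended workload is database-style 4K random writes into RBD, a closer-to-reality benchmark could look something like this (pool name taken from your post; the test image name and size are made up, and it needs fio built with rbd support):

# create a throwaway RBD image to benchmark against
rbd create NVMepool1/fio-test --size 100G
# random 4k writes at a realistic queue depth, straight through librbd
fio --ioengine=rbd --pool=NVMepool1 --rbdname=fio-test --clientname=admin \
    --rw=randwrite --bs=4k --iodepth=64 --numjobs=1 \
    --runtime=60 --time_based --group_reporting --name=rbd-randwrite
# remove the test image afterwards
rbd rm NVMepool1/fio-test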
 
You can increase the mon_max_pg_per_osd value [...] ceph config set mon mon_max_pg_per_osd 500 [...] Last thing: double-check that your server's CPU profile is configured for maximum performance; this can have a big impact on Ceph latency at small block sizes.
Hi, I shall set "ceph config set mon mon_max_pg_per_osd 500". Also, can you tell me how I can check that my CPU profile is configured for maximum performance?
 
You are running a test with a queue depth of one and no thread count (which means 1). It doesn't matter how many drives you have; you can only realize a fraction of a single disk's capability, since that is all you are testing.

The question I would be asking is: do you have an idea what workload you are trying to simulate? That would be the benchmark to run.
We tried with an iodepth of 16 and a thread count of 4 and got around 300K write IOPS at a 4K block size.

The reason I mentioned the single-thread, single-iodepth test is that the IOPS from that test represent the worst-case scenario; even in that scenario we get at least around 50K write IOPS at a 4K block size.

My point is that the entire Ceph NVMe pool does not even come close to the IOPS that a single NVMe drive provides in that worst-case scenario.


To answer your question: we are trying to create a pool that provides an ample amount of 4K write IOPS (let us assume the entire pool provides 400K write IOPS at 4K). If we then create database servers, which are IO-intensive, we would cap each of them below 50K write IOPS, so we could host 8 such servers with ease.

The end goal is to get a reasonable amount of write IOPS from the Ceph pool built out of the 12 enterprise NVMe disks, so that the Windows Server 2016 and 2022 VMs running on Proxmox can get all of those IOPS with ease.
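(For the per-server cap we are thinking of the per-disk IO limits Proxmox exposes, roughly like this; the VM ID, storage name and disk name below are only examples:)

# cap a VM's virtual disk at 50K write / 50K read IOPS (example IDs)
qm set 101 --scsi0 NVMepool1:vm-101-disk-0,iops_wr=50000,iops_rd=50000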
 
We got around 300K write IOPS at a 4K block size.
Sounds about right.

(let us assume the entire pool provides 400K write IOPS at 4K). If we then create database servers, which are IO-intensive, we would cap each of them below 50K write IOPS, so we could host 8 such servers with ease.
So you're pretty close; does this mean you're happy with the result? I'm a bit lost. Depending on how many PGs were engaged in the above test, you probably have room for improvement, which would be realized if the initiators were fully separate.

The reason I mentioned the single-thread, single-iodepth test is that the IOPS from that test represent the worst-case scenario; even in that scenario we get at least around 50K write IOPS at a 4K block size.
Also sounds right :)

we are trying to create a pool that provides an ample amount of write IOPS
The end goal is to get a reasonable amount of write IOPS
"Ample" and "reasonable" aren't numbers. One could assume 300K is more than ample. Or not.
 
You have to account for the fact that Ceph makes 3 copies of the data and sends one to each of 3 drives. Each drive then has to respond over whatever network fabric you have; the slowest disk dictates how fast you can push your data.

You are limited by your latency and bandwidth, not necessarily by the individual NVMe drive. If you want better IOPS, you'll need to distribute your data over more, lower-latency links.

For even better 'stats' you'll also need to generate more traffic, because your benchmark relies on a single client generating enough requests; presumably that client is also one of the Ceph OSD hosts, and a single client may only use a single thread on a single CPU (depending on your benchmark). Hence why Ceph is usually benchmarked with a 'relevant' load, e.g. 400 hypervisors needing to access 100 OSDs, which is when you get to the gigantic IOPS and throughput rates that Ceph CAN do. A single client, as with any distributed file system, is limited to what it and a single request thread on each disk system can do.
 
So you're pretty close; does this mean you're happy with the result? I'm a bit lost. Depending on how many PGs were engaged in the above test, you probably have room for improvement, which would be realized if the initiators were fully separate.
No, I am not at all close. Each NVMe drive gives me 300K write IOPS, and I have 12 NVMe drives in total, so the total is 300K x 12 = 3600K.

Altogether I have 3600K write IOPS, yet with these same drives in a Ceph pool we are getting only 44K. I am concerned about the 4K write performance of the Ceph pool; I am fine with the individual performance of the NVMe drives.

We have 3 nodes, each with 384 GB RAM, 80 vCPUs and 4 NVMe drives. I am wondering what additional physical hardware I need, or what configuration changes I have to make, to get at least 400K IOPS out of this Ceph pool.

"Ample" and "reasonable" aren't numbers. One could assume 300K is more than ample. Or not.
What I meant is: if each NVMe gives me 300K IOPS, why is the entire Ceph pool not giving me at least 400K? What am I missing here?
 
With 1 client, you only need 1 drive. You can't sequentially get more IOPS than a single drive, because you only go 3x to 1 drive.

So right now your program needs to put a block on a disk: you write the block to the disk, you wait for it to return, you write the next block, and so on. It doesn't stripe that 4 KB across 12 disks, that is impossible; it pushes the data to exactly 3 disks, but that is the same data 3x. That's what it means to do a sync write benchmark.

There are some optimizations a client "can" do (e.g. go deeper with the queue size and coalesce many smaller 4 KB requests into one larger request), and your OS already does some of that if you're not asking for a sync write, but that has its own trade-offs.

Now if you say you're going to launch 16 clients against 16 servers, where each has 10G available, makes its own independent requests, and has a deep queue so it can schedule work in advance, then you're going to see faster throughput, although you're still limited to the aggregate of your 10G network divided by 3.

A single network link can only do ~300,000 4 kB packets per second (and that is fully optimized); there is just no more you can push through (300,000 x 4 kB/s ≈ 9.6 Gbps), provided there is no compression or deduplication. You have an aggregate of 20 Gbps in your network between 3 hosts; you can't expect magic.

Also, what is your minimum block size? If it's 1M, for example, each write results in a 1M write regardless of how many blocks were written/changed. Is the 300K IOPS figure from a real random benchmark, and are you using the same settings for that benchmark? Can your CPU and kernel handle ~1M network packets in flight?

If you really want to scrape the bottom of the can, look into RDMA; that way you avoid the kernel handling the packets through the CPU, and you basically move packets from NIC to NVMe instead of NIC to RAM to NVMe.
 
Each NVMe drive gives me 300K write IOPS
I'm curious: is that using the same benchmark?

I have 12 NVMe drives in total, so the total is 300K x 12 = 3600K.
If you were to sustain 300K IOPS per drive, the only way you'd be able to realize that aggregate is by using 12 initiators, each talking directly to the drive in question. There is no free lunch.

Altogether I have 3600K write IOPS, yet with these same drives in a Ceph pool we are getting only 44K
Marketing, meet reality. But I'm still confused: didn't you say you were getting 300K IOPS from the pool?
 
The benchmarks on that particular drive show approximately 250K IOPS at QD256 for pure 4K random writes. Note that this is a synthetic benchmark, so if you have the WAL on the same disk and you are making single requests at QD1 (a simple benchmark), you probably get ~8K IOPS per drive; and that is before we even consider network overhead and everything else that goes on.
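For reference, approximating that QD256 figure on a raw drive with fio would take something like the following (8 jobs x iodepth 32 = 256 outstanding requests; the device name is an example, and the test destroys data on it):

fio --ioengine=libaio --filename=/dev/nvme0n1 --direct=1 --rw=randwrite \
    --bs=4k --numjobs=8 --iodepth=32 --runtime=60 --time_based \
    --group_reporting --name=qd256-randwrite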
 
300K IOPS at 4K is around 10 Gbit/s. I don't know whether the full-mesh network is able to balance traffic correctly across both NICs?

But anyway, 10 Gbit/s is pretty low (one NVMe alone can reach 10 Gbit/s), so you need ~50 Gbit/s minimum for full speed.

Also note that reads use less CPU than writes, so they should be faster too. (And for writes, you can enable the writeback cache on the VM; it helps a lot too.)
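(For an existing disk that would be something like the following, with example VM ID / storage / disk names:)

qm set 101 --scsi0 NVMepool1:vm-101-disk-0,cache=writeback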
 
With 1 client, you only need 1 drive. You can't sequentially get more IOPS than a single drive, because you only go 3x to 1 drive. [...] If you really want to scrape the bottom of the can, look into RDMA; that way you avoid the kernel handling the packets through the CPU, and you basically move packets from NIC to NVMe instead of NIC to RAM to NVMe.
Thanks, these are valuable insights for me. So, in a nutshell, with all these 12 NVMes I can only expect 44K IOPS, right? Is there any way I can optimize it further?
 
I'm curious: is that using the same benchmark?
I used: fio --ioengine=libaio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k --numjobs=4 --iodepth=16 --runtime=60 --time_based --name=fio


If you were to sustain 300K IOPS per drive, the only way you'd be able to realize that aggregate is by using 12 initiators, each talking directly to the drive in question. There is no free lunch.
I absolutely do not expect to get 3600K IOPS from the Ceph cluster. What I meant to say is that I have IOPS-intensive drives, yet I cannot get even 300K IOPS out of the Ceph cluster. I expect 300K from these 12 NVMes; that is my end goal. What configuration am I missing here?
 
Marketing, meet reality. But I'm still confused: didn't you say you were getting 300K IOPS from the pool?
My expectation is to get 300K from the pool, whereas I am only getting 44K. The command I am using is below:
rados bench -t 32 -b 4096 -p NVMepool1 60 write
 
The benchmarks on that particular drive show approximately 250K IOPS at QD256 for pure 4K random writes. Note that this is a synthetic benchmark, so if you have the WAL on the same disk and you are making single requests at QD1 (a simple benchmark), you probably get ~8K IOPS per drive; and that is before we even consider network overhead and everything else that goes on.
OK. What is the rados command to check the IOPS at QD256? And one more thing: if I run a Windows VM on this Ceph pool, will that Windows VM be considered a single client, i.e. equivalent to QD1?