CEPH read performance

Hi Q-wulf,
Sure? I also started out with read performance this bad on Ceph, and doing a lot of work on the config / systems / upgrades helped me reach better (not perfect) values.
Well, kinda. I'm assuming e100 has less than 32 GB of RAM per node (those benchmark examples are from my 3-node cluster with <= 32 GB). On top of that, the read values are what I'd expect with 7 HDD-based OSDs across 3 nodes, where the single-OSD benchmarks are what he posted.


Sure, you can probably get some 2-3% out of the Ceph subsystem by fine-tuning your PGs per OSD with a read-speed/capacity formula via primary affinities (as he has different speeds and different capacities - much less so than on my cluster). That however does not change the large difference between the synthetic benchmark and the VM-based bench results; we are talking about a 2-3x discrepancy here.

I cannot speak to 3.x clients (I only have 4.x Hammer-based clients at work and at home).

To be 100% sure you can use bigger datasets,
e.g. 300-, 450-, 700- or 1500-second-long writes/reads. That should give you very realistic results.

Example:
On my 3-node cluster I have 16, 24 and 32 GB of RAM on the nodes. A 450-second read does not differ from a 900-second read, whereas a 300-second read gives me 1.6 GB/s reads.
I have a bench on a single stock Debian VM in KVM; using virtio with iothread=on and no cache I was able to do 147 MB/s of sustained reads (155 MB/s in the synthetic benchmark) on a 30 GB dataset, with only 2 GB of RAM assigned to the VM. That's using the 3-node cluster described above.

My experience so far is that you ought to be able to get around 90% of your synthetic benchmark with a virtio iothread=on VM.
Can the iothread option be enabled on 3.x by editing the config, or does it only work in 4.x?
Cannot help you there. Never used Ceph before Proxmox 4.x.

This whole cluster was built mostly from decommissioned production stuff so it's older.
The three ceph nodes are:

That might explain a lot.
  • I am assuming here that without your SSD journals, your writes would be in the ballpark of (read speed / 2).
  • I am also assuming that your journals are big enough to serve a couple of disks.
Let's assume you have 7 OSDs running per journal SSD and each journal has 5 GB capacity; you'd need to write at least 30 GB of data. On a 3-node setup with a size 2 pool you'd need to write 30 GB * 3 / 2 = 45 GB to produce writes outside of the SSDs.

The same goes for reading.
Let's assume you have 16 GB of RAM on that node; you probably want to do a read benchmark with 32 GB (twice the RAM) to ensure you are producing reads outside of your potential cache range.


This goes back to what udo asked about earlier.
You probably need to run a rados bench with a bigger set, and you will see your read/write numbers drop to what happens outside the cache ranges.

I'd try a rados bench with a 900-second write + read in that case.
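Something along these lines, as a rough sketch (the pool name "test" is just a placeholder, use whatever bench pool you have):
Code:
# 15 minutes of writes; --no-cleanup keeps the objects around for the read test
rados -p test bench 900 write --no-cleanup
# 15 minutes of sequential reads against the objects written above
rados -p test bench 900 seq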

dd if=/dev/vda bs=1M
3384803328 bytes (3.4 GB) copied, 46.031 s, 73.5 MB/s

Same thing really: try reading/writing data that is (VM RAM x 2), so you do reads/writes outside of the cache of your VM's OS.
 
The write speeds are greatly improved with the SSD journals but even without them the write speeds have always been acceptable.

I have two SSDs per CEPH node; they are a mirror for the OS and have partitions for the journals. Not all journals are moved to the SSDs yet; a couple of the disks suck at dsync and are awaiting replacements.

Each CEPH node has 16GB RAM.

Thank you both for your insights, this has been a great help.

When I upgrade production to 4.x I plan to do fresh installs, giving me the opportunity to redo my storage layout. Most of my prod servers have Areca 188x cards with 4 GB of cache RAM.

Do you think it would be best to run CEPH OSDs off the Areca using pass-through, or would I be better off using a non-RAID HBA?

We have been moving prod to SSD so I should have at least 100 SSDs by the time I start installing Proxmox 4.x.

I might ditch DRBD and go CEPH only but most likely do both.
 
When I upgrade production to 4.x I plan to do fresh installs, giving me the opportunity to redo my storage layout. Most of my prod servers have Areca 188x cards with 4 GB of cache RAM.

If you have battery backups for them, sure, go for it. If you have no battery backups, be sure to turn the cache off. That should be doable on every Areca RAID controller; at least I can do it on the old Areca ARC-1231ML 12-port I have at home, and on the 16-port ones we have at work.

We have been moving prod to SSD so I should have at least 100 SSDs by the time I start installing Proxmox 4.x.

Unless you have less than the bare minimum setup of a 3-node cluster and 1x 10G, sure, go for all-out SSDs. But be sure you know at what point you are overtaxing your network. I touched upon this in this post https://forum.proxmox.com/threads/ssd-ceph-and-network-planning.25687/#post-128696, but basically a single 10G network means 1.25 GB/s, or 4 SSDs that can do 312.5 MB/s in sustained reads/writes. For most people that is no good storage-capacity wise.


The more nodes you have, the more sense it makes, in my POV, to use cache tiering.
I am a big fan and proponent of using a custom crush location hook to split the SSDs from the HDDs in the CRUSH map. Then you basically place your HDDs without SSD journals into your normal pool layouts. Then you create a cache-tier pool that gets layered over said "normal pool". For replicated pools I'd use an SSD pool in cache mode "readforward"; for EC pools I'd use cache mode "writeback".
Let's say you have a 6-node cluster with 3x 512 GB SSDs and 3x 4 TB HDDs per node, and you're doing a size 3 replicated pool (from experience, I will personally never go below that). You end up with 3 TB of available "hot storage" (it's more like 3 TB * 0.8, as you want it to start evicting at 0.8 and to start flushing no earlier than 0.5) and 24 TB of "cold storage" (36 TB if you use an EC pool with a k=3, m=3 scheme).
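Setting up such a tier boils down to a handful of commands; a minimal sketch, assuming an SSD pool "cache-pool" layered over an HDD pool "cold-pool" (both names, the target size and the ratios are placeholders):
Code:
ceph osd tier add cold-pool cache-pool
ceph osd tier cache-mode cache-pool readforward     # or "writeback" in front of an EC pool
ceph osd tier set-overlay cold-pool cache-pool
# start flushing dirty objects at 50% and evicting at 80% of the target size
ceph osd pool set cache-pool cache_target_dirty_ratio 0.5
ceph osd pool set cache-pool cache_target_full_ratio 0.8
ceph osd pool set cache-pool target_max_bytes 3298534883328    # ~3 TB of hot storage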


I use it at home (a 3-node plus 2x 1-node Ceph setups) and we use it at work on 3 clusters with 26-32 nodes each (we just hit 32 nodes on one of those clusters): 8x NVMe (SSD 950 Pro 512 GB) + 48x HDD (4-8 TB disks, 10 TB in evaluation) per node, all on a 2x 10G + 2x 40G network. We are easily able to max out the 80G OVS bond (balance-tcp) for the ceph public and ceph cluster networks.
I would NEVER go back again, EVER. Not only does it save cash, it also utilizes your resources more efficiently and speeds stuff up by an order of magnitude.

We have been moving prod to SSD so I should have at least 100 SSDs by the time I start installing Proxmox 4.x.
How many nodes is that? And what type of network for ceph public + ceph cluster?
 
How many nodes is that? And what type of network for ceph public + ceph cluster?

We have 10G InfiniBand for that; each node has two ports for redundancy. I would likely add another dual-port card so I have redundant public and private networks.
Currently we only use DRBD, so that's sufficient; not so sure it would be with CEPH and SSDs.

We have 20 Proxmox nodes; I doubt I can dedicate any to only CEPH, so some nodes would run VMs and CEPH.
I am thinking of making six nodes CEPH servers; these nodes are dual-socket E5-2650 Sandy Bridge with 128 GB RAM each.
Without adding any expanders I have room for about 100 2.5" disks across those six nodes.

All of our Areca cards have BBU, never deploy without them.

Our plan was to go SSD for DRBD only.
Adding CEPH I need to reconsider everything.

Any experience running high IO databases on CEPH?
I don't have a lot of DB VMs but I do have a few.
 
...
All of our Areca cards have BBU, never deploy without them.
Hi,
Depends on your workload, but with the Areca + cache perhaps you don't need journal SSDs.
Our plan was to go SSD for DRBD only.
Adding CEPH I need to reconsider everything.

Any experience running high IO databases on CEPH?
I don't have a lot of DB VMs but I do have a few.
DRBD is much faster than Ceph for me - especially the latency is much lower with DRBD.

Udo
 
DRBD is much faster than Ceph for me - especially the latency is much lower with DRBD.
Latency is the largest problem CEPH has.

Back to my original issue.
The 70MB/sec I got was due to the cache on the CEPH server nodes.
I did some more testing where I dropped the cache on the CEPH servers, and performance drops down to 40MB/sec.

Do you think that maybe the issue is latency introduced by slower CPUs?
 
It seems to me, following this thread, that the problem is not disk-related at all. This might be a stupid question, but what is your cluster network set to in your ceph.conf? You mentioned that you have dual-port IB HBAs; how are your IPs configured? Are you using connected mode or datagram? As for your IB config, what's providing SM (subnet manager) functionality?
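For reference, a quick way to check a couple of those things from the shell (a sketch; "ib0" is a placeholder for your actual IPoIB interface name):
Code:
cat /sys/class/net/ib0/mode       # prints "connected" or "datagram"
ip -d link show ib0               # shows the MTU (up to 65520 in connected mode, 2044 in datagram)
grep -E 'public network|cluster network' /etc/ceph/ceph.conf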
 
Regarding your future plans:

We have 10G InfiniBand for that; each node has two ports for redundancy. I would likely add another dual-port card so I have redundant public and private networks.
Disclaimer: I've never worked with InfiniBand (only 10/40GBase-T).

Especially when using Ceph, consider defining both a public and a cluster Ceph network in your ceph.conf and then using an Open vSwitch based bond in balance-tcp over all your dedicated "ceph network links". Why? The separate private + public ceph networks increase the number of IP/TCP-port combinations, making it really easy for you to utilize your links 100%; the more bandwidth you get, the more you benefit from your SSD-only plan. And you can apply 'QoS' by tweaking the backfilling and rebalancing settings on the Ceph side.
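In ceph.conf that is just two lines, plus the backfill/recovery throttles if you want the 'QoS' part; a minimal sketch (subnets and values are placeholders):
Code:
[global]
    public network  = 10.10.10.0/24     # client / monitor traffic
    cluster network = 10.10.20.0/24     # OSD replication and backfill traffic

[osd]
    osd max backfills = 1               # throttle backfilling so it does not starve client IO
    osd recovery max active = 1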

That's how we do it with 56 disks per node (8x NVMe + 48x HDD); we dedicate 80G to "ceph" and 20G to "VMs". But we run Proxmox + Ceph on all nodes.



We have 20 Proxmox nodes; I doubt I can dedicate any to only CEPH, so some nodes would run VMs and CEPH.
I am thinking of making six nodes CEPH servers; these nodes are dual-socket E5-2650 Sandy Bridge with 128 GB RAM each.
Without adding any expanders I have room for about 100 2.5" disks across those six nodes.

If only these nodes are running Ceph, consider ramping them up with enough network bandwidth to support your 100-SSD scheme. If you do not have ANY AMD-based CPUs in your Ceph cluster you can also use the ISA plugin for your erasure-coded pools. It's more efficient on your CPU cycles, but works only with Intel-based CPUs.
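For illustration, creating an ISA-based EC profile and pool looks roughly like this (profile name, k/m values and PG count are placeholders, not a recommendation):
Code:
ceph osd erasure-code-profile set isa-k3m3 plugin=isa k=3 m=3
ceph osd pool create ecpool 256 256 erasure isa-k3m3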


I am also not sure whether your plan is to split your 20 nodes into 6 clusters, or just to create an SSD cluster of 6 "SSD-only nodes" and then a cluster of 14 "normal ceph nodes". The way you phrased it, it works both ways :)

You also never stated whether erasure coding is anything you need or would consider, nor whether you use Ceph for big datasets or just to keep your redundancy up.
 
And also what MTU for your IB?

You both bring up good points.
I agree, this is not a disk problem, the rados benchmark proves that.

The MTU is maxed out, connected mode is enabled.
The switches have built in subnet manager.
On this CEPH cluster one port is used for the public network, the other for the private.
The Proxmox client nodes are only connected to the public network.

If the network were the issue I would expect the rados benchmark to be low, but as already demonstrated it's not.

I've used iperf to test the network and all is well.
Do you have any suggestions on tests I could perform that might reveal the issue?
 
I've not looked into erasure coding, would like to get this test cluster working well first.
HA storage is the goal.
We don't have what I would consider big data sets.

Then cache tiering is probably not for you, and neither is erasure coding.

Cache tiering you use when you have lots of "slow" and a finite amount of "fast" storage media and you want to optimize the usage of said media.
Basically, data that is considered "hot" (has been used recently, where "recently" is a value you define) you store on your fast media (e.g. SSDs), and "cold" data you store on your slow media. That way you always get the best performance for the data you access often.

Erasure coding you use if you have large amounts of data that is important but hardly ever utilized. It stores data in a fault-tolerant manner without increasing the used space all that much. It does this by doing parity calculations, like e.g. RAID 5 or RAID 6 does, just with as much parity as you specify.

Example: a 120-disk system of 1 TB disks on an erasure-coded pool with k=100 and m=20 can take the loss of 20 disks. The space used is 120% of every byte written. To achieve the same fault tolerance with replication you'd need to use a size=20 pool, which increases the space used to 2000% of every byte written.
Or in other words, with the above example, you can store the following:
EC pool (k=100, m=20) = 100 TB
Replicated (size=20) = 6 TB

You'd typically use this for storing media, or backups in lieu of a NAS.
 
Can the iothread option be enabled on 3.x by editing the config, or does it only work in 4.x?

For Proxmox 3.x, it's possible to enable 1 iothread for all disks:
(iothread: 1) in the config. So it helps a little, but not too much.

In Proxmox 4.x, you can define 1 iothread per disk, so Ceph will scale with more disks.
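As far as I understand it, in the VM config that looks roughly like this (a sketch; the storage and disk names are placeholders, and the per-disk syntax for 4.x is my assumption):
Code:
# /etc/pve/qemu-server/<vmid>.conf on Proxmox 3.x -- one iothread shared by all disks
iothread: 1

# /etc/pve/qemu-server/<vmid>.conf on Proxmox 4.x -- iothread set per disk
virtio0: ceph_rbd:vm-101-disk-1,cache=none,iothread=1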


The QEMU plan is to be able to use multiple iothreads per disk (this should be available this year).
 
(iothread: 1) in the config. So it helps a little, but not too much.
This does help in 3.x and 4.x, but you are right, it's not a large difference. Seems like a 25-40% improvement.

Here is something that I discovered that might help pinpoint the issue.

I start a VM on Proxmox 3.x using iothreads; in it I read some large file using dd, outputting to /dev/null.
This gets 35-40MB/sec, pretty crappy.

I shut down the VM and start it back up, ensuring that the VM has no local cache of the data.
Then I run the exact same command and get 127MB/sec.
If I then drop the cache on all the CEPH nodes the IO immediately drops to 35-40MB/sec.
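In case anyone wants to reproduce this: dropping the page cache on the CEPH nodes (and in the VM) was done roughly like this - a sketch, since the exact commands were not posted:
Code:
sync
echo 3 > /proc/sys/vm/drop_caches    # run on each CEPH node, and inside the VM where noted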

I did the same test without stopping the VM.
Reading the file for the first time inside the VM:
Code:
1073741824 bytes (1.1 GB) copied, 35.1943 s, 30.5 MB/s
Dropped the cache in the VM and re-read the same file:
Code:
1073741824 bytes (1.1 GB) copied, 5.61629 s, 191 MB/s
Now I drop the cache in the VM and on the CEPH nodes, and re-read the same file:
Code:
1073741824 bytes (1.1 GB) copied, 27.2442 s, 39.4 MB/s

So if the data needed is in the cache on the CEPH node, I can read fast.
But if the data needs to be read from disk on the CEPH node, my reads are really slow.
 
There are, AFAIK, multiple caches involved.
  1. The cache on your VM's OS
  2. The cache on the Ceph client (in this case librbd - see http://docs.ceph.com/docs/hammer/rbd/rbd-config-ref/; a sample config snippet is sketched below this list)
  3. The Cache of your OSD-Daemon
  4. The Cache of Your Raid-Controller
  5. The Cache of your physical Drives
PS: there is a reason why most of them are there.
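To make point 2 concrete, the librbd cache is controlled from the client-side ceph.conf; a minimal sketch (the values are only examples, roughly the Hammer defaults - see the rbd-config-ref link above):
Code:
[client]
    rbd cache = true
    rbd cache size = 33554432                     # 32 MB per client
    rbd cache max dirty = 25165824                # 24 MB; 0 makes the cache write-through
    rbd cache writethrough until flush = true     # stay safe until the guest issues a flush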

That's why I said before that in order to gain true insight into your Ceph subsystem's speed you might need to run a rados bench with a longer run time (especially on reads), and the same goes for any benchmarks done in the VM. Simple rule of thumb: do benchmarks that are (system memory x 2).

But your rados bench looks like what I'd expect it to look like speed-wise, based on your number of disks, Ceph settings and replication size.

The question is why those speeds are not realized in your VMs, unless you read them the second time around.


Any chance you can recreate your test result from post #35 with a larger test file?
You say you have 16 GB of RAM per node; pick a file that is in the 20-32 GB range for read testing. That should tell you real quick where your true limits are.
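Roughly like this, as a sketch (file path and size are placeholders; pick something in the suggested 20-32 GB range):
Code:
# inside the VM: write a ~30 GB test file, bypassing the guest page cache for the write
dd if=/dev/zero of=/root/bigtest bs=1M count=30720 oflag=direct
# drop the caches in the VM and on all CEPH nodes, then read it back
dd if=/root/bigtest of=/dev/null bs=1M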
 
...
Any chance you can recreate your test result from post #35 with a larger test file?
You say you have 16 GB of RAM per node; pick a file that is in the 20-32 GB range for read testing. That should tell you real quick where your true limits are.
Hi,
but this isn't large enough. All primaries are spread over all the nodes, so you have approx. 11 GB of data on each node... which fits in RAM.

But dropping the cache on the OSD nodes right before the measurement should be the right approach.
BTW, I have also seen the effect that data cached on the OSD nodes is fast in the VM - so the network should not be the bottleneck (or not by much).

@e100: which I/O scheduler do you use on the OSD nodes? How powerful is your monitor host with the lowest IP (that is the one that gets used)?

Udo
 
Which I/O scheduler do you use on the OSD nodes? How powerful is your monitor host with the lowest IP (that is the one that gets used)?
I think it's set to deadline; I will try different schedulers and see if that helps.
The monitors are on the same nodes as the OSDs.
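For anyone following along, checking and switching the scheduler on an OSD disk looks roughly like this (sdb is a placeholder for the actual OSD device; the change does not persist across reboots):
Code:
cat /sys/block/sdb/queue/scheduler          # the active scheduler is shown in [brackets]
echo noop > /sys/block/sdb/queue/scheduler  # switch at runtime, e.g. to noop or cfq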

But your rados bench looks like what I'd expect it to look like speed-wise, based on your number of disks, Ceph settings and replication size.
While running the rados read benchmark I could see that the total read IO is really high on the CEPH nodes.
When reading from inside the VM, each CEPH node reads between 10-30MB/sec.
 
I think it's set to deadline; I will try different schedulers and see if that helps.
The monitors are on the same nodes as the OSDs.


While running the rados read benchmark I could see that the total read IO is really high on the CEPH nodes.
When reading from inside the VM, each CEPH node reads between 10-30MB/sec.
Hi e100,
this is because the VM uses one thread only - if you start rados bench with one thread only (the default is 16) the result will perhaps look similar.

Testing on my cluster:
Code:
# 1 thread 
rados -p test -t 1 bench 60 seq --no-cleanup
...
Total time run:  60.120947
Total reads made:  985
Read size:  4194304
Bandwidth (MB/sec):  65.535

Average Latency:  0.0610333
Max latency:  0.422892
Min latency:  0.0132725

# 16 threads
rados -p test  bench 60 seq --no-cleanup
...
Total time run:  55.967685
Total reads made:  12118
Read size:  4194304
Bandwidth (MB/sec):  866.071

Average Latency:  0.0738265
Max latency:  1.31581
Min latency:  0.0130502
Ceph isn't optimal for single-threaded IO...

Udo
 
