Ceph - High I/O wait on OSD add/remove

Marius Matei
Jun 23, 2014
Bucharest, Romania
Hello,

So the time has arrived to upgrade our ceph cluster because of degrading I/O performance.
I believe we've stretched our 6 OSDs quite enough :)

Huge problem!
When I added a new OSD into the mix, the cluster immediately started backfill and recovery on placement groups in order to populate the new OSD.
This caused disastrous I/O usage across the cluster and the VMs became unmanageable.

At this point I've decided to actually read the manual :)

Here are some of my best practice recommendations:
1. Always use a separate physical network for recovery.
In the ceph config file specify the two networks like so:

Code:
public network = 172.16.1.0/24
cluster network = 172.16.2.0/24
If you have a replication size of 3, the cluster network will use roughly 3x the bandwidth of the public network, so keep that in mind.

2. Don't spend money on SSDs for journaling. I've failed to see any improvements doing this.
If you have money to spend, I'd suggest investing in LSI CacheCade or something similar.

3. This is how I solved the I/O crisis when adding/removing OSDs:
In the [osd] section of the ceph config file add these parameters:
Code:
osd max backfills = 1
osd recovery max active = 1
The defaults are 10 for backfills and 15 I believe for recovery.

I've compared these settings against the defaults on various setups and saw no difference in recovery speed large enough to justify the huge I/O load the defaults cause.
The only time I've seen a benefit in raising these values was when I provisioned the new CacheCade storage nodes; in that case you only need to watch out for network bottlenecks.
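These values can also be injected at runtime, so you don't have to restart the OSDs right before adding a disk; a minimal sketch using the same injectargs mechanism mentioned later in this thread:
Code:
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'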

Also, a good practice when adding a new OSD is to start with a weight of 0.2 and increase it after each rebuild is done.
This also minimizes the impact if you accidentally introduce a faulty or very slow drive into the cluster and decide to remove it shortly afterwards.
It's also good to know this if you plan to slowly move data onto an OSD node and want to keep an eye on bandwidth consumption.

Here is an example:

When adding a 2TB hard drive, Proxmox will see it as a 1.8 TB OSD and assign a default weight of 1.8.
Immediately after activating the new OSD, do this in the terminal:

Code:
ceph osd crush reweight osd.7 0.2

After the rebuild is done you can increase the weight in steps until you reach 1.8.
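A hedged sketch of how the stepping could be scripted, assuming osd.7 from the example above, a final weight of 1.8, and that waiting for HEALTH_OK between steps is acceptable (the loop and step size are illustrative, not from this thread):

Code:
#!/bin/bash
# gradually raise the CRUSH weight of osd.7 (hypothetical OSD id and target weight)
for w in 0.2 0.6 1.0 1.4 1.8; do
    ceph osd crush reweight osd.7 $w
    # wait until the rebalance triggered by this step has finished
    until ceph health | grep -q HEALTH_OK; do
        sleep 60
    done
done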

It would be nice if Proxmox permitted setting a custom weight before starting the OSD, and also if the weight could be modified in the GUI.
Some drop-down or text-field tools for the ceph config would also help a lot of less experienced admins.

Also it would be nice to see a dedicated section for ceph in the forums.

Best Regards,
Marius
 
Regarding SSD-based journals, I think it depends on the use case. The journal is used for two purposes: 1) gathering small writes and flushing them in batches to improve seek behaviour and overall performance of the OSD behind it (which per the recommendations should be a single spinning disk); 2) acting as a commit log for that OSD, to be used in case of recovery.

Did you have SSD journals in place when you added your next OSD during the upgrade? Backfilling is a recovery process which uses the journal extensively, and that can kill the performance of a spinning disk.

I have a 3-node cluster with 18 OSDs in total, 6 in each node, and every group of 3 OSDs has its own SSD journal. The system is working quite nicely and the SSDs are well utilized. Also there's CacheCade, 2 RAID1 arrays in each node, so 6 SSDs in total. This setup is quite balanced budget- and performance-wise and has been running in production for over a year now.
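For reference, pointing an OSD's journal at a dedicated SSD partition is just a per-OSD path in ceph.conf (or a symlink set up when the OSD is created). A minimal sketch, assuming hypothetical device names and three OSDs sharing one journal SSD:
Code:
[osd.0]
    osd journal = /dev/sdf1   # first partition on the shared journal SSD (hypothetical device)
[osd.1]
    osd journal = /dev/sdf2
[osd.2]
    osd journal = /dev/sdf3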
 
Hello ScOut3R,

I've used SSD journals without CacheCade and it didn't help with recovery.
I think this is due to the baseline I/O load of the cluster even without recovery.

On the new nodes with CacheCade I see no performance increase using SSD for journal.
Inktank does not recommend using RAID1 SSD arrays for journaling since it could decrease performance.
So in my case using one SSD for 3 or more OSDs has the big downside of losing all the OSDs in case of SSD failure.
Also at the time I've decided to give up on SSD journaling I've read that journal drives do not offer a significant performance increase since firefly.

I plan to implement an SSD cache pool with a future release of ceph, if Proxmox supports it or if it turns out to be easy to implement and maintain.
I think this is the direction Inktank is going: an SSD cache pool that commits to a spinning-disk pool, without SSD journals.

Regarding CacheCade, when I previously used these nodes with Open-E and iSCSI I saw no significant degradation of I/O performance below SSD-level figures. Maybe sometimes on reads, but that is to be expected since the data has to be accessed at least once before it gets cached.
I think CacheCade treats all data passing through the controller as "hot data" as long as storage space in the array is not exhausted, and since commits are fast thanks to the controller's additional 1GB of memory, the array never fills up.
We are using 200GB Intel S3700 SSDs.

Regards,
Marius
 
Thanks Marius for the detailed response! Could you please point me to the article regarding Firefly and the SSD journals? That would be an interesting update for me. :) Getting rid of those 6 SSDs and putting spinning disks in would give us a small storage space boost which would be handy.

Also I'm using single SSDs for journals, the CacheCade drives are in RAID1. :)

Cache tiering is a nice new feature! I'm playing around with it and it looks promising. I'm thinking of putting SSDs in the Proxmox hosts and using them as the cache tier, but that would require extensive testing for which I don't have the equipment right now.

Best regards,
Mate
 
High I/O during ceph recovery or when adding/removing OSDs is very prominent in smaller-scale ceph clusters, especially with a small number of nodes. The issue largely goes away at the large scale ceph was initially designed for, but it can be mitigated with settings like the ones you used: max backfills, max active recovery, etc. Recovery also gets faster as you go higher in network bandwidth, e.g. gigabit vs. 10GbE.

Proxmox itself does not change or directly interact with ceph in any special way; the pveceph script is just a shortcut around common ceph commands, so Proxmox cannot directly assign different weights to different OSDs. In my humble opinion, starting with a smaller weight and then growing it while adding an OSD is not a good idea: you end up micro-managing too much. If it is purely home use, extremely small scale, and will stay that way for a long time, then the weight approach would work. Instead of weights, changing the configuration in ceph.conf is much more automated.

From Ceph Firefly onward, journaling is no longer strictly necessary, but I believe that mode is still in a testing phase. For now the option is to stay with SSD journals while this matures over the next couple of versions. I myself don't use SSDs for journaling; my journals are on the OSDs themselves. One of my clusters has 20 OSDs and the other 46. Once past about 8 OSDs per node, it is wise to put the journal on the OSD itself.
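For the colocated setup described above, no per-OSD journal path is needed; the journal simply lives inside the OSD's data directory. A minimal sketch of the relevant ceph.conf bit, with the 5 GB size being an illustrative value rather than anything from this thread:
Code:
[osd]
    osd journal size = 5120    # journal size in MB; the journal file sits in /var/lib/ceph/osd/ceph-$id/ by default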
 
Also at the time I've decided to give up on SSD journaling I've read that journal drives do not offer a significant performance increase since firefly.
Hi,
that's not right - journal-free writes are experimental only in Firefly (I just forgot the name of the backend). Another topic is that IOPS are limited by the journal, which will be much better in Giant.

For now, an SSD journal is recommended unless you have a lot of spindles.

Going up with the weight in 0.2 steps is the normal approach - I do the same when I expand our ceph cluster.

Udo
 
Going up with the weight in 0.2 steps is the normal approach - I do the same when I expand our ceph cluster.

Will it still apply for a ceph cluster with 10+ Gbps network bandwidth and a large number of OSDs, with max backfills and max active recovery configured? Are 0.2 steps normal practice only for an unconfigured ceph.conf that uses the defaults?
 
Will it still apply for a ceph cluster with 10+ Gbps network bandwidth and a large number of OSDs, with max backfills and max active recovery configured? Are 0.2 steps normal practice only for an unconfigured ceph.conf that uses the defaults?
Hi Wasim,
how do you define a "large number of OSDs"? Our cluster has 60 OSDs now (the next expansion will add another 24 HDDs soon, in two parts) and is connected with 10GbE (one ceph cluster network plus the PVE network).
But I have used the defaults for osd_max_backfills and osd_recovery_max_active until now. I will add the next disks with both values changed and will report whether the performance drops with the bigger steps.

Udo
 
how do you define a "large number of OSDs"? Our cluster has 60 OSDs now (the next expansion will add another 24 HDDs soon, in two parts) and is connected with 10GbE (one ceph cluster network plus the PVE network).
I personally consider 50+ OSDs the beginning of a large ceph cluster. When adding a new OSD the 0.2 step works fine, but how do you deal with OSD failure? The cluster will want to start rebalancing as soon as an OSD goes out. You can add a new OSD as a replacement and follow the 0.2 increments, but the cluster still has to deal with the objects from the dead OSD. No?

Off topic: what are the results of the following commands on your ceph cluster? How many replicas are you using, and how many PGs?
#rados -p <pool> bench -b 4096 100 write
#rados -p <pool> bench -b 131072 100 write
#rados -p <pool> bench -b 4194304 100 write

Just want to compare with my 20-OSD cluster.
 
I personally consider 50+ OSDs the beginning of a large ceph cluster. When adding a new OSD the 0.2 step works fine, but how do you deal with OSD failure? The cluster will want to start rebalancing as soon as an OSD goes out. You can add a new OSD as a replacement and follow the 0.2 increments, but the cluster still has to deal with the objects from the dead OSD. No?
right - if an OSD dies, there is no "soft" increment.
I still have to look into the best strategy for replacing a failed OSD (removing it from the crushmap and adding it again causes doubled traffic).
But the cluster doesn't "deal" with the objects of a dead OSD for long - its reweight (the second weight column) goes to 0, so all its data is moved to other OSDs on the same host (because the crush weight of the host stays the same).
Looks like this
Code:
ceph osd tree
# id    weight  type name       up/down reweight
-1      218.3   root default
-3      218.3           rack unknownrack
-2      43.68                   host ceph-01
52      3.64                            osd.52  up      1
53      3.64                            osd.53  down    0
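A minimal sketch of the usual removal sequence for a dead OSD, assuming osd.53 from the tree above is the one being replaced (standard ceph commands, nothing specific to this cluster):
Code:
ceph osd out osd.53           # mark it out (a down OSD with reweight 0 is effectively out already)
ceph osd crush remove osd.53  # remove it from the CRUSH map
ceph auth del osd.53          # delete its auth key
ceph osd rm osd.53            # remove the OSD id, then add the replacement disk as usual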
Off topic: what are the results of the following commands on your ceph cluster? How many replicas are you using, and how many PGs?
At this time I use a replica count of 2, but with the next expansion I will change one pool to 3 replicas.
The pgs at this time:
rbd: 1600
pve: 2048
test: 1700
#rados -p <pool> bench -b 4096 100 write
Code:
Total time run:         106.176028
Total writes made:      85102
Write size:             4096
Bandwidth (MB/sec):     3.131 

Stddev Bandwidth:       2.15169
Max bandwidth (MB/sec): 8.51953
Min bandwidth (MB/sec): 0
Average Latency:        0.0190996
Stddev Latency:         0.133197
Max latency:            7.17382
Min latency:            0.001238
#rados -p <pool> bench -b 131072 100 write
Code:
Total time run:         101.130716
Total writes made:      67905
Write size:             131072
Bandwidth (MB/sec):     83.932 

Stddev Bandwidth:       63.8228
Max bandwidth (MB/sec): 411.375
Min bandwidth (MB/sec): 0
Average Latency:        0.0238262
Stddev Latency:         0.146065
Max latency:            4.33875
Min latency:            0.002057
#rados -p <pool> bench -b 4194304 100 write
Code:
Total time run:         100.598768
Total writes made:      15101
Write size:             4194304
Bandwidth (MB/sec):     600.445 

Stddev Bandwidth:       110.825
Max bandwidth (MB/sec): 732
Min bandwidth (MB/sec): 0
Average Latency:        0.106506
Stddev Latency:         0.110887
Max latency:            2.41247
Min latency:            0.031203
How does your performance look? Especially the latency?

Udo
 
Replying to the opening post - use:
Code:
ceph tell osd.* injectargs '--osd_recovery_delay_start 10'
 
How does your performance look? Especially the latency?
Udo

Following are the benchmark results from my Ceph cluster. Compared to yours my results pretty much suck, even though I have more network bandwidth.
Ceph cluster: 4 Nodes, 20 OSDs, 3 Replicas, 20Gb Infiniband network
#rados -p test2 bench -b 4096 100 write
Code:
Total time run:         100.820150
Total writes made:      14869
Write size:             4096
Bandwidth (MB/sec):     0.576

Stddev Bandwidth:       0.571735
Max bandwidth (MB/sec): 3.64844
Min bandwidth (MB/sec): 0
Average Latency:        0.108246
Stddev Latency:         0.239024
Max latency:            2.58293
Min latency:            0.00136

#rados -p test2 bench -b 131072 100 write
Code:
Total time run:         100.533110
Total writes made:      7466
Write size:             131072
Bandwidth (MB/sec):     9.283

Stddev Bandwidth:       7.27833
Max bandwidth (MB/sec): 36.25
Min bandwidth (MB/sec): 0
Average Latency:        0.214715
Stddev Latency:         0.294463
Max latency:            2.1366
Min latency:            0.00243


#rados -p test2 bench -b 4194304 100 write
Code:
Total time run:         101.544540
Total writes made:      2797
Write size:             4194304
Bandwidth (MB/sec):     110.178

Stddev Bandwidth:       54.5212
Max bandwidth (MB/sec): 272
Min bandwidth (MB/sec): 0
Average Latency:        0.580118
Stddev Latency:         0.605534
Max latency:            4.22724
Min latency:            0.043024

Same test for Read speed
#rados -p test2 bench -b 4096 100 seq
Code:
Total time run:        1.090537
Total reads made:     14869
Read size:            4096
Bandwidth (MB/sec):    53.260

Average Latency:       0.00117111
Max latency:           0.009328
Min latency:           0.000425

#rados -p test2 bench -b 131072 100 seq
Code:
 Total time run:        1.185228
Total reads made:     7466
Read size:            131072
Bandwidth (MB/sec):    787.401

Average Latency:       0.00253544
Max latency:           0.251536
Min latency:           0.00068

#rados -p test2 bench -b 4194304 100 seq
Code:
Total time run:        15.641758
Total reads made:     2797
Read size:            4194304
Bandwidth (MB/sec):    715.265

Average Latency:       0.0891388
Max latency:           1.65734
Min latency:           0.004777
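(A hedged aside on reproducing these numbers: the seq runs only have objects to read if the preceding write bench kept its objects, e.g. via the --no-cleanup flag, so a write/read pair would look roughly like this.)
Code:
rados -p test2 bench 100 write -b 4096 --no-cleanup
rados -p test2 bench 100 seq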
 
Same test for Read speed
(read benchmark results quoted from the post above)
Hi Wasim,
it looks like you didn't flush the buffers on the host and the nodes! Otherwise you wouldn't be able to get such nice latencies.

If you run an "echo 3 > /proc/sys/vm/drop_caches" on all nodes (+ OSD hosts) before a rados bench read, you should get more realistic values.
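A hedged sketch of how that could look, assuming SSH access and hypothetical hostnames (the drop_caches command is exactly the one above):
Code:
# flush the page cache on every node before re-running the read benchmark (hypothetical hostnames)
for host in ceph-01 ceph-02 ceph-03 ceph-04; do
    ssh root@$host 'sync; echo 3 > /proc/sys/vm/drop_caches'
done
rados -p test2 bench 100 seq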

BTW, which journal do you use (it's important for write performance)? I use SSD journals (raw partitions!) on Intel DC S3700 (one SSD in each OSD node for its 12 OSDs).


Udo
 
Hi Wasim,
how do you define a "large number of OSDs"? Our cluster has 60 OSDs now (the next expansion will add another 24 HDDs soon, in two parts) and is connected with 10GbE (one ceph cluster network plus the PVE network).
But I have used the defaults for osd_max_backfills and osd_recovery_max_active until now. I will add the next disks with both values changed and will report whether the performance drops with the bigger steps.

Udo
Hi,
to warm up an older thread... I just reconfigured our ceph cluster and moved some OSDs from one host to another.
With "osd max backfills = 1" and "osd recovery max active = 1" the impact on the clients is much, much smaller, but still noticeable - so I still adjust the crush weights in steps.

Udo
 
Hi Wasim,
it looks like you didn't flush the buffers on the host and the nodes! Otherwise you wouldn't be able to get such nice latencies.


I made some changes to the Ceph cluster: I dropped the number of replicas from 3 to 2 and ran the benchmark with the caches flushed this time. Below is a comparison of the benchmarks before and after flushing the caches and changing the replica count. The move to 2 replicas definitely made a difference in performance.
[Attached image: ceph-bench-2-rep.PNG - benchmark comparison before/after the change]
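For reference, changing the replica count of a pool is a single setting per pool; a minimal sketch assuming the test2 pool used in the benchmarks above:
Code:
ceph osd pool set test2 size 2       # number of replicas
ceph osd pool set test2 min_size 1   # minimum replicas required to serve I/O (illustrative value)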
 
