ceph tuning

mpopgun · Jun 2, 2023

First, a disclaimer: this is a lab, definitely not a reference design. The point was to do weird things, learn how Ceph reacts, and then learn how to get myself out of whatever weird scenario I ended up with.

I've spent a few days on the forum. It seems many of the resolutions were people replacing bad hardware, bonding NICs, or finding they were already maxing out their system, so I've tried to rule those out with the methods below.

In trying to understand which settings throttle Ceph recovery/rebalance, my understanding is that the default profile is 50/50, the client profile gives 60% of the resources to client requests and 40% to recovery and rebalance, and the high recovery profile is 30/70. It's also my understanding that this is all based on IOPS. I know I can override the detected IOPS per OSD. My slowest HDD reports back around 400 IOPS, and as you'll see below, I don't think I'm getting anywhere near that, at least not with the monitoring commands I know about. Proxmox seems to report only client IOPS and not recovery/rebalance, and in one of my tests described below I was able to get 1000 IOPS. So I haven't tried overriding IOPS yet. If that setting were too high, I would expect to see high OSD latency, which would validate the IOPS limit, but we can do that test if somebody says it would be useful.
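To make those percentages concrete, here is a minimal sketch of the arithmetic, assuming the client/recovery splits are as described above. The splits are this post's understanding, not values checked against the Ceph documentation, and `ASSUMED_SHARES` and `split_iops` are made-up names for illustration.

```python
# Assumed (client %, recovery %) per profile, taken from the description above.
# These are NOT verified against the Ceph documentation.
ASSUMED_SHARES = {
    "default": (50, 50),
    "client": (60, 40),
    "high_recovery": (30, 70),
}


def split_iops(osd_iops, profile):
    """Return (client_iops, recovery_iops) for one OSD under an assumed split."""
    client_pct, recovery_pct = ASSUMED_SHARES[profile]
    return osd_iops * client_pct / 100, osd_iops * recovery_pct / 100


# 400 IOPS is the figure reported for the slowest HDD in this post.
for name in ASSUMED_SHARES:
    client, recovery = split_iops(400, name)
    print(f"{name}: client={client:.0f} IOPS, recovery={recovery:.0f} IOPS")
```

If the splits really work this way, even the 50/50 case would leave about 200 IOPS per HDD for recovery, which is why a rate of a few MB/s looks low.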

My current scenario is that I added a 4th node to the cluster, and then added 2 HDDs (one each on nodes 1 and 2) to go with my 4 SSDs (one in each node). The Proxmox OS is on a standalone SSD. Then I created a new pool, "slow_rpm", with a rule so that it uses the HDDs, and created a rule so that the default "pool1" would only be on SSD. So now I have about 5 TB of data and none of it is in the right place. That is perfectly fine and expected behavior, except that Proxmox says it's going to take 5 years. Clearly I've done something wrong. The rebalance is only reporting about 1 MB/s (8 Mbps).
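As a rough sanity check on those numbers, here is a small sketch of the arithmetic. It assumes all 5 TB has to move, that the rate stays constant, and that decimal units are meant (1 TB = 10^12 bytes); the helper names are made up. The estimate Proxmox shows is probably computed differently, so this is only a back-of-the-envelope figure.

```python
TB = 10**12  # decimal terabyte, an assumption about the units used above
MB = 10**6   # decimal megabyte


def eta_days(data_bytes, rate_bytes_per_s):
    """Days needed to move data_bytes at a constant rate."""
    return data_bytes / rate_bytes_per_s / 86400


def mbytes_to_mbits(mb_per_s):
    """Convert MB/s to Mbps (8 bits per byte)."""
    return mb_per_s * 8


for rate_mb in (1, 4):
    days = eta_days(5 * TB, rate_mb * MB)
    print(f"{rate_mb} MB/s ({mbytes_to_mbits(rate_mb)} Mbps): about {days:.0f} days")
```

Under those assumptions, 1 MB/s works out to roughly 58 days for 5 TB, which is still far too slow but a long way from 5 years.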

OSD latency
- SSD ~1/1 ms
- HDD ~10/10 ms
IOPS
- ceph iostat only seems to show client IOPS, not recovery
CPU
- all nodes <5% in Proxmox (I have seen a spike to 145% in top, so Ceph can use multiple cores; is the number of cores Ceph uses configurable?)
Network
- During recovery and rebalance, I ran iperf between each of the nodes and was able to fill the 2.5 Gb ports to about 2.3 Gbps (the back network is physically separated from the front network)
- using iftop to monitor each node's NIC assigned to Ceph

Because OSD latency was low, I am assuming the storage isn't the bottleneck.
Because CPU usage is low during recovery and rebalance, I don't believe the CPUs are the bottleneck.
And because I can nearly fill the network with iperf, I believe I have ruled that out as well.

Even with the default performance profile, I expected faster rebalancing, but I changed the profile to high_recovery_ops. No appreciable change.
Next I started modifying the osd max backfills and max active settings, and slowly worked up to this:


ceph tell 'osd.*' injectargs --osd-recovery-op-priority=2 --osd-max-backfills=2000 --osd-recovery-max-active=2000 --osd-recovery-max-active-hdd=2000 --osd_recovery_max_active_ssd=2000 --osd-mclock-override-recovery-settings=true --osd-mclock-profile=high_recovery_ops

With this, I now have a blistering fast recovery rate of ~4 MB/s (32 Mbps), and CPU stays below 6%, a slight increase. I also watched the counter of millions of objects count down by about 1000 per refresh. Not the 2000 limit I specified, but better.

I did notice that osd-recovery-op-priority won't take effect until I restart the OSD, and restarting the OSD causes it to read from the config file and go back to defaults. But surely, with almost zero load on the system, recovery should by default be able to do more than 30 Mbps. Before I start modifying the Ceph config files, I thought I would consult you guys.
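On the restart problem: values pushed with injectargs only live in the running daemon, whereas `ceph config set` stores them in the monitors' central config database, so they survive an OSD restart. Below is a minimal sketch that only builds and prints such commands without running anything. The values are placeholders for illustration, not recommendations, and `config_set_commands` is a made-up helper name.

```python
import shlex

# Placeholder values for illustration only, not tuning recommendations.
SETTINGS = {
    "osd_mclock_profile": "high_recovery_ops",
    "osd_mclock_override_recovery_settings": "true",
    "osd_max_backfills": 4,
    "osd_recovery_max_active_hdd": 4,
}


def config_set_commands(settings, who="osd"):
    """Build one `ceph config set <who> <option> <value>` argv list per setting."""
    return [
        ["ceph", "config", "set", who, option, str(value)]
        for option, value in settings.items()
    ]


# Print the commands for review instead of executing them.
for argv in config_set_commands(SETTINGS):
    print(shlex.join(argv))
```

Printing rather than executing keeps the sketch safe to run anywhere; the lines can be pasted into a shell on a node once the values have been reviewed.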

In another attempt to find a bottleneck, I moved some VHDs from one pool to the other. I was finally able to see about 1 Gbps of traffic, and the recovery still continued at 32 Mbps in the background. So if all the hardware (IOPS, CPU, RAM, NICs, switch) can support ~1032 Mbps, I believe I should be able to get recovery to be much faster.

My thinking is that if one of my HDDs or SSDs were problematic, I would see high latency on that OSD.
If I maxed out the CPU, that would be a limiter, but 6% in Proxmox isn't even half of one core.
The network was proven with iperf and by moving a VHD from one pool to the other, which in my mind proves the whole system, but please correct me if any of this isn't accurate.

All this to ask: is there another limiter I can change?
Other than IOPS, is there anything else that Ceph benchmarks to determine how "fast" the system potentially is, since it's using percentages of IOPS and not just going as fast as it can? I would prefer a behavior more like QoS on the network side, where recovery/rebalance uses all available resources until the client has a request, and then it throttles.
I'm assuming this benchmark is run on installation or OSD start?

Maybe a hidden command like ceph_unleash_all_your_potential=true?

I'm hoping it's something as simple as the cluster was built with 3 nodes and 3 SSDs, and this command will make it re-evaluate the 4 nodes and 6 HDDs.

Thanks in advance!
Mike

gurubert · Jun 6, 2023

Is the RocksDB of the HDD-OSDs located on the HDD or on faster SSDs?
If on HDD this drastically limits recovery operations as the RocksDB manages the objects on disk.

mpopgun · Jun 20, 2023

gurubert said:
Is the RocksDB of the HDD-OSDs located on the HDD or on faster SSDs?
If on HDD this drastically limits recovery operations as the RocksDB manages the objects on disk.

I haven't been able to find a command to confirm the actual location, but they are defined to be on the SSD pool.
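One possible way to check this: the RocksDB lives on a block device attached to each OSD rather than in a pool, and `ceph osd metadata <id>` prints JSON describing those devices. The sketch below reads a few of those fields. The field names (`bluefs_dedicated_db`, `bluefs_db_rotational`, `bluestore_bdev_rotational`) are what BlueStore OSDs report as far as I know, so compare them against your own output; the sample dictionary is hypothetical and `db_placement` is a made-up helper name.

```python
import json


def db_placement(metadata):
    """Summarise where an OSD's RocksDB sits, from `ceph osd metadata <id>` JSON.

    Assumes the metadata values are the strings "0" and "1".
    """
    dedicated = metadata.get("bluefs_dedicated_db") == "1"
    if dedicated:
        # A separate DB device exists; check whether that device is rotational.
        rotational = metadata.get("bluefs_db_rotational") == "1"
    else:
        # No separate DB device: the DB shares the main data device.
        rotational = metadata.get("bluestore_bdev_rotational") == "1"
    return {"dedicated_db": dedicated, "db_on_rotational": rotational}


# Hypothetical sample trimmed to the fields used here; real output has many more keys.
sample = json.loads(
    '{"id": 4, "bluefs_dedicated_db": "0", "bluestore_bdev_rotational": "1"}'
)
print(db_placement(sample))
```

For the hypothetical sample, the result says there is no dedicated DB device and the DB sits on a rotational disk, which is the case gurubert is asking about.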
