Ceph performance troubleshooting

troycarpenter

Hi all,

I've been running Proxmox for a number of years and now have a 13-node cluster; last year, after the 5.2 upgrade, I added Ceph to the mix using the empty drive bays in some of the Proxmox nodes. Last Friday I upgraded all nodes to version 5.3.

The Ceph system has always felt slow, and I've never really figured out what the read/write/IOPS numbers are supposed to be telling me about the performance of the cluster. Now I'm regularly seeing "slow requests are blocked" messages, and my concern is that when it happens, the blocked count sometimes climbs up to 2000 before dropping back to zero (meaning the warning goes away). Then earlier today it got way out of hand, up in the 7k range, and not just slow requests but [whatever the next step is above slow]. After about 30 minutes it all settled back down.

I have three servers each hosting five 5 TB OSDs, and three other servers each hosting three 4 TB OSDs. All nodes have a dedicated 10G network for Ceph traffic. Using iperf, the network appears to be running at full speed (with a 9000 MTU). Across all the nodes, I have about 60 guests, all using the Ceph system as their disk image storage. I don't want to get too specific about my setup, to avoid going down rabbit holes; I'd rather stick with good troubleshooting technique, but if there is something important to know, I can answer questions.
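
For reference, a minimal sketch of how the network check can be repeated between any two Ceph nodes (the peer address is a placeholder for the other node's IP on the dedicated Ceph network):
Code:
# on one node: start an iperf server
iperf -s

# on another node: run the bandwidth test against the first node's Ceph-network IP
iperf -c <peer-ceph-ip>

# confirm jumbo frames end to end: 8972 = 9000-byte MTU minus 28 bytes of IP/ICMP headers
ping -M do -s 8972 -c 4 <peer-ceph-ip>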

What can I monitor to troubleshoot this issue? What things do I need to consider? I have not really found any good documentation laying out solid troubleshooting steps, just lots of posts dealing with specific issues, usually followed up with "I fixed it with a server reboot".

Thanks.
Troy
 
This could have many causes, including a hardware problem.
Can you check the output of "ceph health detail" or post some of the error messages from the log?
How many PGs (placement groups) do you have? You can check that via the GUI if you prefer (Node -> Ceph -> Status section).
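
If the GUI isn't handy, roughly the same information is available from the CLI, for example:
Code:
# overall health, failing checks, and which OSDs are implicated
ceph health detail

# cluster status summary (mons, OSDs, PGs, client I/O)
ceph -s

# per-pool details, including pg_num
ceph osd pool ls detail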
 
There are two pools: one RBD pool with 512 PGs, and now a CephFS pool with 128 PGs.

The health reports as HEALTH_OK, but it changes as the problem emerges:
Code:
2018-12-11 10:34:02.014813 mon.belle mon.0 192.168.201.241:6789/0 114595 : cluster [WRN] Health check failed: 13 slow requests are blocked > 32 sec. Implicated osds 20 (REQUEST_SLOW)
2018-12-11 10:34:07.714649 mon.belle mon.0 192.168.201.241:6789/0 114600 : cluster [WRN] Health check update: 146 slow requests are blocked > 32 sec. Implicated osds 0,5,20 (REQUEST_SLOW)
2018-12-11 10:34:15.725495 mon.belle mon.0 192.168.201.241:6789/0 114601 : cluster [WRN] Health check update: 204 slow requests are blocked > 32 sec. Implicated osds 20 (REQUEST_SLOW)
2018-12-11 10:34:21.837139 mon.belle mon.0 192.168.201.241:6789/0 114607 : cluster [WRN] Health check update: 255 slow requests are blocked > 32 sec. Implicated osds 12,20,22 (REQUEST_SLOW)
2018-12-11 10:34:26.837391 mon.belle mon.0 192.168.201.241:6789/0 114609 : cluster [WRN] Health check update: 283 slow requests are blocked > 32 sec. Implicated osds 12,20,22 (REQUEST_SLOW)
2018-12-11 10:34:31.837645 mon.belle mon.0 192.168.201.241:6789/0 114613 : cluster [WRN] Health check update: 299 slow requests are blocked > 32 sec. Implicated osds 12,20,22 (REQUEST_SLOW)
2018-12-11 10:34:36.837888 mon.belle mon.0 192.168.201.241:6789/0 114614 : cluster [WRN] Health check update: 439 slow requests are blocked > 32 sec. Implicated osds 9,12,14,20,22 (REQUEST_SLOW)
2018-12-11 10:34:41.838109 mon.belle mon.0 192.168.201.241:6789/0 114617 : cluster [WRN] Health check update: 429 slow requests are blocked > 32 sec. Implicated osds 12,14,20,22 (REQUEST_SLOW)
2018-12-11 10:34:46.838361 mon.belle mon.0 192.168.201.241:6789/0 114621 : cluster [WRN] Health check update: 423 slow requests are blocked > 32 sec. Implicated osds 12,14,20,22 (REQUEST_SLOW)
2018-12-11 10:34:51.838613 mon.belle mon.0 192.168.201.241:6789/0 114624 : cluster [WRN] Health check update: 320 slow requests are blocked > 32 sec. Implicated osds 20 (REQUEST_SLOW)
2018-12-11 10:34:56.838897 mon.belle mon.0 192.168.201.241:6789/0 114625 : cluster [WRN] Health check update: 37 slow requests are blocked > 32 sec. Implicated osds 20 (REQUEST_SLOW)
2018-12-11 10:35:02.705705 mon.belle mon.0 192.168.201.241:6789/0 114630 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 37 slow requests are blocked > 32 sec. Implicated osds 20)
2018-12-11 10:35:02.705761 mon.belle mon.0 192.168.201.241:6789/0 114631 : cluster [INF] Cluster is now healthy

It does look like osd.20 is almost always implicated, but I can't tell from the log what's supposed to be happening. Here's the log output during the last episode:
Code:
2018-12-11 10:36:34.335485 7f0498a72700  4 rocksdb: [/mnt/npool/a.antreich/ceph/ceph-12.2.8/src/rocksdb/db/db_impl_write.cc:684] reusing log 64204 from recycle list

2018-12-11 10:36:34.335541 7f0498a72700  4 rocksdb: [/mnt/npool/a.antreich/ceph/ceph-12.2.8/src/rocksdb/db/db_impl_write.cc:725] [default] New memtable created with log file: #64219. Immutable memtables: 0.

2018-12-11 10:36:34.336038 7f0489253700  4 rocksdb: (Original Log Time 2018/12/11-10:36:34.335803) [/mnt/npool/a.antreich/ceph/ceph-12.2.8/src/rocksdb/db/db_impl_compaction_flush.cc:1158] Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots allowed 1, compaction slots scheduled 1
2018-12-11 10:36:34.336048 7f0489253700  4 rocksdb: [/mnt/npool/a.antreich/ceph/ceph-12.2.8/src/rocksdb/db/flush_job.cc:264] [default] [JOB 556] Flushing memtable with next log file: 64219

2018-12-11 10:36:34.336062 7f0489253700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1544542594336053, "job": 556, "event": "flush_started", "num_memtables": 1, "num_entries": 203188, "num_deletes": 45904, "memory_usage": 260057400}
2018-12-11 10:36:34.336067 7f0489253700  4 rocksdb: [/mnt/npool/a.antreich/ceph/ceph-12.2.8/src/rocksdb/db/flush_job.cc:293] [default] [JOB 556] Level-0 flush table #64220: started
2018-12-11 10:36:34.476576 7f0489253700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1544542594476566, "cf_name": "default", "job": 556, "event": "table_file_creation", "file_number": 64220, "file_size": 9774452, "table_properties": {"data_size": 9383149, "index_size": 140698, "filter_size": 249616, "raw_key_size": 4816862, "raw_average_key_size": 52, "raw_value_size": 7839733, "raw_average_value_size": 85, "num_data_blocks": 2150, "num_entries": 91361, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "44057", "kMergeOperands": "524"}}
2018-12-11 10:36:34.476589 7f0489253700  4 rocksdb: [/mnt/npool/a.antreich/ceph/ceph-12.2.8/src/rocksdb/db/flush_job.cc:319] [default] [JOB 556] Level-0 flush table #64220: 9774452 bytes OK
2018-12-11 10:36:34.477118 7f0489253700  4 rocksdb: [/mnt/npool/a.antreich/ceph/ceph-12.2.8/src/rocksdb/db/db_impl_files.cc:242] adding log 64217 to recycle list

2018-12-11 10:36:34.477126 7f0489253700  4 rocksdb: (Original Log Time 2018/12/11-10:36:34.476598) [/mnt/npool/a.antreich/ceph/ceph-12.2.8/src/rocksdb/db/memtable_list.cc:360] [default] Level-0 commit table #64220 started
2018-12-11 10:36:34.477130 7f0489253700  4 rocksdb: (Original Log Time 2018/12/11-10:36:34.477080) [/mnt/npool/a.antreich/ceph/ceph-12.2.8/src/rocksdb/db/memtable_list.cc:383] [default] Level-0 commit table #64220: memtable #1 done
2018-12-11 10:36:34.477131 7f0489253700  4 rocksdb: (Original Log Time 2018/12/11-10:36:34.477090) EVENT_LOG_v1 {"time_micros": 1544542594477086, "job": 556, "event": "flush_finished", "lsm_state": [2, 0, 10, 0, 0, 0, 0], "immutable_memtables": 0}
2018-12-11 10:36:34.477132 7f0489253700  4 rocksdb: (Original Log Time 2018/12/11-10:36:34.477107) [/mnt/npool/a.antreich/ceph/ceph-12.2.8/src/rocksdb/db/db_impl_compaction_flush.cc:132] [default] Level summary: base level 1 max bytes base 268435456 files[2 0 10 0 0 0 0] max score 0.50

2018-12-11 10:36:34.477138 7f0489253700  4 rocksdb: [/mnt/npool/a.antreich/ceph/ceph-12.2.8/src/rocksdb/db/db_impl_files.cc:388] [JOB 556] Try to delete WAL files size 255906228, prev total WAL file size 255955140, number of live WAL files 2.

I shut down osd.20 and things seem to have stabilized, but I don't know whether that's due to less traffic or whether taking the OSD down is what helped.
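
One way to see what a flagged OSD is actually stuck on is to query its admin socket on the node hosting it; a sketch using osd.20 from the warnings above (output format varies a bit by release):
Code:
# run on the node that hosts osd.20
# requests currently in flight on this OSD, with how long each has been waiting
ceph daemon osd.20 dump_ops_in_flight

# the slowest recently completed requests, with per-step timestamps
ceph daemon osd.20 dump_historic_ops

# commit/apply latency for every OSD, to check whether osd.20 stands out
ceph osd perf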
 
OK, to narrow this down you need to check whether the OSDs share a disk or any other piece of hardware (switch, rack, host, ...).
Once you have gathered that info, take further steps (a few example commands follow below):
- check the health of the drive and whether there is heavy load on it
- if they share a host, check the host (RAM, CPU, ...)
- same rack -> check the network (switch, NICs, ...)

If you take out an OSD, there is in fact more traffic for the duration of the rebalancing, but if there are no blocked requests anymore, then it's probably related to something that belongs to that OSD.
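
A rough sketch of those per-node checks (the device name is only an example; use the disk backing the implicated OSD):
Code:
# SMART health of a suspect OSD's disk
smartctl -a /dev/sdc

# per-device utilisation and latency every 2 seconds; watch %util and await
iostat -x 2

# which OSDs share which host, according to the CRUSH tree
ceph osd tree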
 
Thanks for the input; big message to follow.

It would be nice if there were a concise guide specifically for troubleshooting Ceph, especially what to check and in what order: how to verify Ceph network, OSD, and physical hardware performance, and how to read the dashboard (Reads/Writes/IOPS) so I know what the numbers mean and whether they tell me anything useful.

For this system, there are 6 total Ceph nodes in two distinct groups. The first group of three has hardware pretty much like the following and hosts OSDs 9 through 26:
Code:
CPU usage 2.39% of 24 CPU(s)
IO delay 20.83%
RAM usage 32.74% (41.15 GiB of 125.70 GiB)
CPU(s) 24 x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz (1 Socket)
Kernel Version Linux 4.15.18-9-pve #1 SMP PVE 4.15.18-30 (Thu, 15 Nov 2018 13:32:46 +0100)
PVE Manager Version pve-manager/5.3-5/97ae681d
System drive (/dev/sda) Intel S3510 SSD
Ceph DB (/dev/sdb) Intel S3510 SSD
Ceph drives, 5 total in chassis ST5000LM000-2AN170 (bad choice in hindsight)
Each node also runs various VMs.

The second group of three hosts OSDs 0 through 8:
Code:
CPU usage 2.39% of 12 CPU(s)
IO delay 12.67%
RAM usage 40.74% (25.31 GiB of 62.66 GiB)
CPU(s) 12 x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz (1 Socket)
Kernel Version Linux 4.15.18-9-pve #1 SMP PVE 4.15.18-30 (Thu, 15 Nov 2018 13:32:46 +0100)
PVE Manager Version pve-manager/5.3-5/97ae681d
System drive (/dev/sda) HP spinner
THERE IS NO Ceph journal drive!!!  (I know it's bad and I want to be sure you see that)
Ceph drives, 3 total in chassis WDC-WD40EFRX
Each node also runs various VMs.

My initial thought is to replace the WD Reds in the second group with SM863a MZ-7KM1T9N SSDs and journal directly on the SSDs. However, at over $6k for the drives, I want to be sure I've exhaustively tried to figure out the problem first, and that those drives will even solve anything. Those three servers only have 4 drive bays each, so I don't have room for a separate journal drive.
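
One cheap sanity check before spending on new drives is to benchmark what the current OSDs deliver and compare the two groups; a sketch, where "scratch" is a hypothetical test pool (rados bench writes real objects, so don't point it at a production pool):
Code:
# quick per-OSD write benchmark (1 GiB by default); compare one OSD from each group
ceph tell osd.20 bench
ceph tell osd.0 bench

# 60-second client-side write benchmark against a scratch pool, then clean up
rados bench -p scratch 60 write --no-cleanup
rados -p scratch cleanup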

I've also not found conclusive guidance with respect to the KRBD setting on the storage definition. When it was set, the VMs felt slower, but the slow request errors went down to almost nothing. Removing the setting made bootup on certain VMs feel faster, but the number of slow requests now fluctuates regularly, like the progression shown further below.
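
For reference, that setting corresponds to the krbd flag on the RBD storage definition in /etc/pve/storage.cfg: with krbd set, images are mapped through the kernel RBD driver instead of going through librbd inside QEMU. A minimal example, with the storage and pool names purely illustrative:
Code:
rbd: ceph-vm
        content images
        pool rbd
        krbd 1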

One question I have: is it possible that a performance issue on one OSD can cause requests to back up on other OSDs? For instance:
Code:
2019-01-18 12:38:01.742550 mon.gatimaan [INF] Cluster is now healthy
2019-01-18 12:38:12.810271 mon.gatimaan [WRN] Health check failed: 49 slow requests are blocked > 32 sec. Implicated osds 23 (REQUEST_SLOW)
2019-01-18 12:38:18.053586 mon.gatimaan [WRN] Health check update: 121 slow requests are blocked > 32 sec. Implicated osds 0,15,20,23 (REQUEST_SLOW)
2019-01-18 12:38:23.303980 mon.gatimaan [WRN] Health check update: 202 slow requests are blocked > 32 sec. Implicated osds 0,5,8,10,11,12,13,15,20,23 (REQUEST_SLOW)
2019-01-18 12:38:28.304355 mon.gatimaan [WRN] Health check update: 258 slow requests are blocked > 32 sec. Implicated osds 0,5,6,8,10,11,12,13,14,15,20,23 (REQUEST_SLOW)
2019-01-18 12:38:33.304687 mon.gatimaan [WRN] Health check update: 362 slow requests are blocked > 32 sec. Implicated osds 0,5,6,8,9,10,11,12,13,14,15,16,18,19,20,23 (REQUEST_SLOW)
2019-01-18 12:38:38.305054 mon.gatimaan [WRN] Health check update: 413 slow requests are blocked > 32 sec. Implicated osds 0,3,5,6,8,9,10,11,12,15,16,18,19,20,23 (REQUEST_SLOW)
2019-01-18 12:38:43.305326 mon.gatimaan [WRN] Health check update: 374 slow requests are blocked > 32 sec. Implicated osds 0,5,6,8,10,11,12,13,16,17,19,20,23 (REQUEST_SLOW)
2019-01-18 12:38:48.305643 mon.gatimaan [WRN] Health check update: 328 slow requests are blocked > 32 sec. Implicated osds 0,5,6,10,11,12,16,19,20,23 (REQUEST_SLOW)
2019-01-18 12:38:53.305995 mon.gatimaan [WRN] Health check update: 270 slow requests are blocked > 32 sec. Implicated osds 0,5,10,11,12,20,23 (REQUEST_SLOW)
2019-01-18 12:38:57.285793 mon.gatimaan [INF] Health check cleared: REQUEST_SLOW (was: 252 slow requests are blocked > 32 sec. Implicated osds 0,10,11,12,20,23)
2019-01-18 12:38:57.285814 mon.gatimaan [INF] Cluster is now healthy
OSDs 20 and 23 are common throughout this incident.
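
If it helps, the implicated OSD ids can be mapped back to their hosts to see whether 20 and 23 share anything physical:
Code:
# which host and CRUSH location each implicated OSD lives on
ceph osd find 20
ceph osd find 23

# per-OSD usage laid out along the CRUSH tree
ceph osd df tree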
 
One technique would be to stop OSDs 20 and 23 for a while and see if any issues remain after that, and then re-enable them one at a time until you know which OSD is the culprit.

I would also head over to the Ceph mailing list, armed with logs (they often need OSD logging set to 20), to get expert help!
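
A minimal sketch of both suggestions, assuming systemd-managed OSDs (as on a stock Proxmox install) and using osd.20 from the thread as the example:
Code:
# temporarily stop a suspect OSD on its host; "noout" prevents rebalancing during the test
ceph osd set noout
systemctl stop ceph-osd@20
# ...observe for a while, then bring it back and clear the flag
systemctl start ceph-osd@20
ceph osd unset noout

# raise that OSD's debug logging to 20 at runtime (expect very large logs), then restore the default
ceph tell osd.20 injectargs '--debug_osd 20'
ceph tell osd.20 injectargs '--debug_osd 1/5'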

I'm running Ceph on a bunch of old HDDs myself, with similar issues, but I've been fine with it since implementing cache tiering.
 
