Analyzing Ceph load

Paspao

Hello,

I want to understand if I am reaching a bottleneck in my hyperconverged Proxmox + Ceph cluster.

In moments of high load (multiple LXC containers doing heavy I/O on small files) I see one node with:

- I/O delay at 40%
- around 50% CPU usage
- load at 200 (40 threads total: 2 x Intel(R) Xeon CPU E5-2660 v3 @ 2.60GHz)
- CPU usage of the Ceph OSD daemons > 100%
- Ceph writing around 30 MB/s

The disks are Intel S3520 SSDs (12 OSDs across 6 nodes).

At times iostat shows devices at 100% utilization:

avg-cpu: %user %nice %system %iowait %steal %idle
         22.29  0.00   7.62   36.19   0.00  33.90

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc 0.00 27.50 21.00 54.50 40.50 9574.00 254.69 0.01 0.11 0.19 0.07 0.11 0.80
sda 0.50 48.00 34.50 952.50 204.00 10768.00 22.23 0.22 0.22 0.35 0.22 0.11 10.60
sdb 0.00 75.00 35.00 945.50 210.00 12916.00 26.77 0.34 0.34 0.23 0.35 0.11 10.60
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 50.00 0.00 9574.00 382.96 0.00 0.04 0.00 0.04 0.04 0.20
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
rbd0 0.00 111.00 1.50 32.00 18.00 876.00 53.37 5.84 210.33 1.33 220.12 29.85 100.00
rbd1 0.00 69.00 1.00 18.00 12.00 446.00 48.21 6.56 392.95 0.00 414.78 50.74 96.40
rbd2 0.00 87.00 0.00 64.50 0.00 702.00 21.77 13.81 209.05 0.00 209.05 15.47 99.80
rbd3 0.00 17.00 0.00 10.00 0.00 110.00 22.00 1.00 121.00 0.00 121.00 70.80 70.80
rbd4 0.00 48.00 0.00 8.50 0.00 332.00 78.12 3.86 620.00 0.00 620.00 117.65 100.00
rbd5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
rbd6 0.00 30.00 0.00 35.50 0.00 304.00 17.13 17.18 666.08 0.00 666.08 28.17 100.00
rbd7 0.00 86.50 0.00 63.00 0.00 870.00 27.62 20.20 475.68 0.00 475.68 15.87 100.00
rbd8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
rbd9 0.00 58.00 0.00 14.50 0.00 490.00 67.59 5.63 490.62 0.00 490.62 68.00 98.60
rbd10 0.00 3.50 0.00 4.50 0.00 42.00 18.67 1.62 963.11 0.00 963.11 131.11 59.00
rbd11 0.00 0.00 0.00 0.50 0.00 2.00 8.00 0.15 864.00 0.00 864.00 308.00 15.40
rbd12 0.00 81.50 0.00 7.50 0.00 356.00 94.93 1.43 194.40 0.00 194.40 124.27 93.20
rbd13 0.00 1.50 0.00 1.50 0.00 12.00 16.00 0.16 289.33 0.00 289.33 104.00 15.60
rbd14 0.00 0.00 0.00 0.50 0.00 2.00 8.00 0.00 4.00 0.00 4.00 4.00 0.20
rbd15 0.00 31.50 0.00 8.50 0.00 240.00 56.47 2.80 328.24 0.00 328.24 110.35 93.80
rbd16 0.00 55.00 2.00 22.00 14.00 402.00 34.67 5.08 227.17 18.00 246.18 41.67 100.00
rbd17 0.00 123.00 0.00 32.50 0.00 818.00 50.34 6.45 193.23 0.00 193.23 30.77 100.00
rbd18 0.00 72.50 0.00 25.50 0.00 432.00 33.88 9.48 329.02 0.00 329.02 39.22 100.00
rbd19 0.00 90.00 0.00 26.50 0.00 638.00 48.15 8.15 474.19 0.00 474.19 37.51 99.40
rbd20 0.00 87.00 0.50 19.00 6.00 438.00 45.54 7.65 483.59 0.00 496.32 48.92 95.40
rbd21 0.00 2.50 0.00 2.00 0.00 18.00 18.00 0.03 17.00 0.00 17.00 17.00 3.40
rbd22 0.00 28.00 0.50 13.50 2.00 310.00 44.57 8.24 322.57 4.00 334.37 71.43 100.00
rbd23 0.00 179.50 0.50 44.50 14.00 1194.00 53.69 2.58 64.18 0.00 64.90 17.91 80.60
rbd24 0.00 0.00 0.00 0.50 0.00 2.00 8.00 0.21 460.00 0.00 460.00 416.00 20.80
rbd25 0.00 19.00 0.00 2.50 0.00 76.00 60.80 1.65 456.00 0.00 456.00 400.00 100.00
rbd26 0.00 0.00 0.00 0.50 0.00 2.00 8.00 0.15 864.00 0.00 864.00 308.00 15.40
rbd27 0.00 1.50 0.00 1.00 0.00 10.00 20.00 0.45 446.00 0.00 446.00 446.00 44.60
rbd28 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
rbd29 0.00 0.50 0.00 1.00 0.00 6.00 12.00 0.01 10.00 0.00 10.00 10.00 1.00
rbd30 0.00 0.50 0.00 1.00 0.00 6.00 12.00 0.06 62.00 0.00 62.00 62.00 6.20
rbd31 0.00 71.00 0.00 24.00 0.00 406.00 33.83 3.85 179.08 0.00 179.08 41.67 100.00

What does 100% utilization on the rbdX devices mean, when sda/sdb show low values?

Am I hitting both the OSD CPU limit and the disk throughput limit?

I see that both ceph-osd daemons are running on the same core. Is it possible to assign the OSD daemons to separate cores?

Any suggestions for optimisations?
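
For reference, this is roughly how the current CPU affinity of the OSD daemons can be checked, and how a single OSD could be pinned to dedicated cores; a sketch only, assuming the systemd-managed ceph-osd@<id> units of a standard PVE/Ceph install (osd id 0 and the core list are placeholders):

# show the current CPU affinity of every running ceph-osd process
for pid in $(pidof ceph-osd); do taskset -cp "$pid"; done

# pin one OSD to dedicated cores via a systemd override (cores 2-3 as an example)
systemctl edit ceph-osd@0     # add: [Service] / CPUAffinity=2-3
systemctl restart ceph-osd@0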

Thank you,
P.
 
I want to understand if I am reaching a bottleneck in my hyperconverged Proxmox + Ceph cluster.
See our Ceph benchmark paper for reference.
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

What does 100% utilization on the rbdX devices mean, when sda/sdb show low values?
These are the mapped RBD images, and the %util column shows how heavily each of them is utilized. Since the writes are distributed across the cluster, the load is only partially reflected on each physical disk.
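
To see which rbdX device belongs to which pool and image (and how much each image uses), something along these lines can be used; <pool> is a placeholder:

# map rbdX block devices back to pool/image names
rbd showmapped

# provisioned vs. used size per image in a pool
rbd du -p <pool>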

Any suggestions for optimisations?
Please be more specific about your hardware (e.g. model/type).
 
Hello Alwin,

thanks for your reply. After a few days I am back to troubleshooting the I/O wait.

Cluster:
- 6 nodes with Ceph on PVE 5.4
- WAN network on NIC 1 (1 Gb)
- Cluster network on NIC 2 (1 Gb)

CEPH:
- 12 OSD
- 3 monitors
- 256 pgs
- data replication x3 (osd pool default size=3)
- osd journal size = 5120
- 10Gb separate network
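
For completeness, those values can also be read back from the running cluster; a sketch, with <pool> and osd.0 as placeholders:

# replication size and pg count of the pool
ceph osd pool get <pool> size
ceph osd pool get <pool> pg_num

# journal size a given OSD is actually using (run on the node hosting osd.0)
ceph daemon osd.0 config get osd_journal_size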

Servers Specs:
- Dell R630
- 64 GB RAM PC4-2133
- 2 x Intel(R) Xeon CPU E5-2660 v3 @ 2.60GHz (20 cores / 40 threads per node)
- 2 x Intel S3520 1.2 TB SSD (not in RAID)
- H730 controller
- Ethernet 10G 2P X520 Adapter

LXC specs:
- around 30 LXC containers per node, each with
- 1 core
- 1.7 GB RAM
- 12 GB Disk
- Debian OS

Load stats example:

GUI: https://www.dropbox.com/s/cdem23xbruqift2/GUI.png?dl=0

Ceph GUI: https://www.dropbox.com/s/etp4zsbd54bmfqf/ceph_stats.png?dl=0

Ceph net throughput: https://www.dropbox.com/s/2nzlmpmxit30qys/Ceph_net.png?dl=0

Cluster net throughput: https://www.dropbox.com/s/bufjlhth7kptdb4/cluster_net.png?dl=0

Htop: https://www.dropbox.com/s/ff87hefh82wta4d/htop.png?dl=0

Iostat: https://www.dropbox.com/s/wtj0d1grhzgaueq/iostat.png?dl=0

From what I see:
- it does not look like a network bottleneck
- it could be related to CPU, since the Ceph processes go over 100%
- the disks should support >50K IOPS, so they should not be the bottleneck

How can I troubleshoot further?
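
One way to sanity-check the disk assumption above is to measure the sync-write IOPS of a single S3520 directly, since Ceph journal writes behave roughly like low-queue-depth synchronous writes; a sketch with fio (writing to /dev/sdX destroys its data, so only use a disk that carries no OSD, or point --filename at a scratch file instead):

# 4k synchronous writes at queue depth 1 for 60 seconds
# WARNING: /dev/sdX is a placeholder and will be overwritten
fio --name=sync-write-test --filename=/dev/sdX \
    --ioengine=libaio --direct=1 --sync=1 \
    --rw=write --bs=4k --iodepth=1 --numjobs=1 \
    --runtime=60 --time_based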

Thank you.
P
 
- Disks should support >50K iops so they should not be the bottleneck
The specs say 17,500 IOPS random write and 67,500 IOPS random read, so they are read-intensive SSDs. IIRC, these are not the greatest fit for the Ceph workload.
https://ark.intel.com/content/www/d...20-series-1-2tb-2-5in-sata-6gb-s-3d1-mlc.html

CEPH:
- 12 OSD
- 3 monitors
- 256 pgs
- data replication x3 (osd pool default size=3)
- osd journal size = 5120
- 10Gb separate network
Do the Ceph public and cluster networks both live on the 10 Gb network?
Can you please post a rados bench? See the Ceph benchmark paper for the parameters.
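
For reference, the bench run in the paper looks roughly like this; 'test' is a placeholder pool name:

# 60 s of 4 MB writes with 16 threads; keep the objects for the read test
rados bench -p test 60 write -b 4M -t 16 --no-cleanup
# sequential reads of the objects written above
rados bench -p test 60 seq -t 16
# remove the benchmark objects afterwards
rados -p test cleanup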

- 2 x Intel S3520 1.2 TB SSD (not in RAID)
Are they connected via an HBA?
 
Thanks again Alwin.

The specs say 17,500 IOPS random write and 67,500 IOPS random read, so they are read-intensive SSDs. IIRC, these are not the greatest fit for the Ceph workload.
https://ark.intel.com/content/www/d...20-series-1-2tb-2-5in-sata-6gb-s-3d1-mlc.html

So this could be a reason.

How can I measure the total IOPS currently going to an OSD?

Do you think adding more OSDs of the same disk type could help?


Do the Ceph public and cluster networks both live on the 10 Gb network?

The Ceph network is on the 10 Gb NIC.
The cluster net and WAN net (bridge) are on two separate 1 Gb NICs.

Can you please post a rados bench? See the Ceph benchmark paper for the parameters.

Is rados bench destructive?

Do I need to stop all LXC containers before testing?


Are they connected via an HBA?

Not sure what you mean; in the Dell H730 setup I selected Non-RAID.

Thanks
P.
 
How can I measure the total IOPS currently going to an OSD?
You can get the perf counters from Ceph; some performance monitoring tools can do that out of the box.
https://access.redhat.com/documenta...tml/administration_guide/performance_counters
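
For example, on the node hosting the OSD (osd.0 is a placeholder; the exact counter names vary a bit between Ceph releases):

# dump all perf counters of one OSD via its admin socket
ceph daemon osd.0 perf dump

# quick overview of commit/apply latency for all OSDs
ceph osd perf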

Do you think adding more OSDs of the same disk type could help?
Yes, increasing the number of OSDs and/or nodes will increase the surface that writes can be spread across. But do check out the Ceph benchmark paper in the forum thread posted above.

The Ceph network is on the 10 Gb NIC.
The cluster net and WAN net (bridge) are on two separate 1 Gb NICs.
Which Ceph network is on the 10 Gb link? Ceph public, Ceph cluster, or both? I am not sure what you mean by the cluster net, as it could also be corosync.

Is rados bench destructive?

Do I need to stop all LXC containers before testing?
No, it writes objects to the pool with its own prefix and deletes them afterwards. If in doubt, create a new pool and use that one. ;) The test will of course add additional load to the cluster.
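
A throwaway pool for that could look roughly like this (pool name and pg count are only examples; deleting pools may require mon_allow_pool_delete to be enabled):

# create a dedicated benchmark pool, bench against it, then remove it again
ceph osd pool create bench-test 128 128
rados bench -p bench-test 60 write -b 4M -t 16 --no-cleanup
rados -p bench-test cleanup
ceph osd pool delete bench-test bench-test --yes-i-really-really-mean-it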

Not sure what you mean; in the Dell H730 setup I selected Non-RAID.
Well, non-RAID mode is not really an HBA. In many cases the RAID controller still influences read/write operations.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition
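
A quick way to see whether the disks are really passed through is to check if smartctl can talk to them directly; a sketch, with /dev/sda as a placeholder. With true passthrough the SSD model and SMART attributes show up as-is, while disks hidden behind a RAID virtual disk usually need controller-specific options:

# identity and SMART attributes of the disk as the OS sees it
smartctl -i /dev/sda
smartctl -A /dev/sda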
 
Which Ceph network is on the 10 Gb link? Ceph public, Ceph cluster, or both? I am not sure what you mean by the cluster net, as it could also be corosync.

- Corosync is on net 10.10.10.0/24 on 1Gb NIC

- Ceph public network is on net 10.10.20.0/24 on 10Gb NIC

Well, non-RAID mode is not really an HBA. In many cases the RAID controller still influences read/write operations.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition

I will take a closer look at the RAID controller options.

Thank you
P.
 
Does the fact that the ceph-osd daemon is at >100% CPU usage even with no LXC containers running give any hints?
 
- Ceph public network is on net 10.10.20.0/24 on 10Gb NIC
And Ceph's cluster network?

Does the fact that the ceph-osd daemon is at >100% CPU usage even with no LXC containers running give any hints?
Not really; as it is a distributed system, many factors, local or remote, can contribute to this. Check the Ceph and system logs.
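
For example (the OSD id is a placeholder):

# overall cluster state, including slow/blocked requests
ceph -s
ceph health detail

# log of a single OSD daemon since the last boot
journalctl -u ceph-osd@0 -b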
 
One strange update.

I had to restart all the nodes one by one, and after that the I/O delay is back to normal with the same VM usage.

(screenshot: newio.png)


Could it be a VM creating unusual load?
 
Could it be a VM creating unusual load?
That is also a possibility, but I would guess you would see more network bandwidth being used.

I had to restart all the nodes one by one, and after that the I/O delay is back to normal with the same VM usage.
It will very likely come back.
 
Yes, but this excludes both disk slowness and HBA/RAID-mode performance as causes.
Not necessarily, as caches are cleared and devices are re-initialized on a reboot. It could also be a firmware issue manifesting at run time.
 
