Ceph performance tuning in an all-NVMe cluster

np33kf

Jan 23, 2024
Hi, my setup:

Proxmox cluster with 3 nodes with this hardware:
  • EPYC 9124
  • 128 GB DDR5
  • 2x M.2 boot drives
  • 3x Gen5 NVMe drives (Kioxia CM7-R 1.9 TB)
  • 2x NIC Intel 710 with 2x 40GbE
  • 1x NIC Intel 710 with 4x 10GbE

Configuration:
  • 10GbE NIC for management and the client-facing network
  • 2x 40GbE NICs for the Ceph network in a full mesh: each NIC has two 40GbE ports, so I bonded the two ports of one NIC to link to one neighbour node, and the two ports of the other NIC to link to the other neighbour node. To make the mesh work, the two bonds are then enslaved to a broadcast bond (see the sketch after this list).
  • All physical and logical interfaces set to MTU 9000, with a layer3+4 transmit hash policy
  • Ceph running on these 3 nodes with 9 OSDs (3 Kioxia drives per node)
  • Ceph pool with size 2 and 16 PGs (autoscaler on)
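
For reference, the mesh bonding in /etc/network/interfaces looks roughly like the sketch below (interface names, addresses and the inner bond mode are illustrative placeholders, not a copy of the live config):

Code:
    # inner bond A: both 40GbE ports of NIC 1, direct link to node 2
    auto bond1
    iface bond1 inet manual
        bond-slaves enp65s0f0 enp65s0f1
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        mtu 9000

    # inner bond B: both 40GbE ports of NIC 2, direct link to node 3
    auto bond2
    iface bond2 inet manual
        bond-slaves enp66s0f0 enp66s0f1
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        mtu 9000

    # outer broadcast bond carrying the Ceph network address of this node
    auto bond0
    iface bond0 inet static
        address 10.10.10.1/24
        bond-slaves bond1 bond2
        bond-mode broadcast
        mtu 9000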

Everything runs without problems except for the performance.

Rados Bench (write):

Code:
    Total time run:         10.4534
    Total writes made:      427
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     163.392
    Stddev Bandwidth:       21.8642
    Max bandwidth (MB/sec): 200
    Min bandwidth (MB/sec): 136
    Average IOPS:           40
    Stddev IOPS:            5.46606
    Max IOPS:               50
    Min IOPS:               34
    Average Latency(s):     0.382183
    Stddev Latency(s):      0.507924
    Max latency(s):         1.85652
    Min latency(s):         0.00492415

Rados Bench (read seq):

Code:
    Total time run:       10.4583
    Total reads made:     427
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   163.315
    Average IOPS:         40
    Stddev IOPS:          5.54677
    Max IOPS:             49
    Min IOPS:             33
    Average Latency(s):   0.38316
    Max latency(s):       1.35302
    Min latency(s):       0.00270731
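
For completeness, both runs were done along these lines (pool name is a placeholder; with the defaults that means 16 concurrent 4 MiB operations):

Code:
    # write benchmark, keeping the objects so the sequential read test has data
    rados bench -p <pool> 10 write --no-cleanup
    # sequential read benchmark over the objects written above
    rados bench -p <pool> 10 seq
    # remove the benchmark objects afterwards
    rados -p <pool> cleanup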

Ceph tell osd bench (similar results on all drives):

Code:
osd.0: {
        "bytes_written": 1073741824,
        "blocksize": 4194304,
        "elapsed_sec": 0.306790426,
        "bytes_per_sec": 3499919596.5782843,
        "iops": 834.44585718590838
    }
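
(This is the built-in OSD bench, which by default writes 1 GiB in 4 MiB blocks directly to the OSD's object store, so it bypasses the network and replication:)

Code:
    # per-OSD backend benchmark: 1 GiB total, 4 MiB block size by default
    ceph tell osd.0 bench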

iperf3 (similar results on all nodes):

Code:
    [SUM]   0.00-10.00  sec  42.0 GBytes  36.0 Gbits/sec  78312             sender
    [SUM]   0.00-10.00  sec  41.9 GBytes  36.0 Gbits/sec                  receiver
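
Measured between the Ceph addresses of two nodes with parallel TCP streams, roughly like this (peer address and stream count are placeholders):

Code:
    # run "iperf3 -s" on the peer node first; -P sets the number of parallel streams
    iperf3 -c 10.10.10.2 -P 4 -t 10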

I can only achieve about 130 MB/s read/write in Ceph, while each disk is capable of more than 2 GB/s and the network of more than 4 GB/s.

I tried tweaking with:
  • PG number (more and less)
  • Ceph configuration options of all sorts
  • sysctl.conf kernel settings (along the lines of the sketch below)

without understanding what is capping the performance.
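
The sysctl experiments were along these lines (illustrative values only, not a recommendation; this is exactly the part I am unsure about):

Code:
    # /etc/sysctl.d/99-ceph-net.conf - larger socket buffers for the 40GbE links
    net.core.rmem_max = 67108864
    net.core.wmem_max = 67108864
    net.ipv4.tcp_rmem = 4096 87380 67108864
    net.ipv4.tcp_wmem = 4096 65536 67108864
    net.core.netdev_max_backlog = 250000
    # apply with: sysctl --system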

The fact that the read and write speeds are the same makes me think that the problem is in the network.
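
If it is the network, I guess the first sanity check is whether jumbo frames really pass end-to-end over the broadcast bond, e.g. (peer address is a placeholder):

Code:
    # 8972 = 9000 MTU - 20 byte IP header - 8 byte ICMP header; -M do forbids fragmentation
    ping -M do -s 8972 -c 3 10.10.10.2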

It must be some kind of configuration/setting that I am missing. Can you guys give me some help/pointers?
 
