Ceph configuration - best practices

Jul 4, 2022
Long story short... not possible.
I'm planning to install and use Ceph and HA in my working cluster environment.
I have 4 nodes with 256GB RAM each and 2x 10G NICs dedicated to Ceph cluster network traffic.
I already know this may not be enough for perfect performance, so I'm planning to swap the NICs for at least 40G cards, but that's not the point here.
I did some tests myself. The first thing I found out is that you cannot have Proxmox VE installed on USB drives; this causes a lot of issues, as the system will be terribly slow no matter how good the rest of your hardware is. I tried to install Ceph on my nodes equipped with NVMe M.2 drives and performance was terrible.
I already changed these 2TB NVMe drives from WD RED to Micron 7450 PRO MAX.
I decided to rebuild my cluster from scratch. Ceph is already installed, I got rid of USB for the main installation, and I didn't notice any bottlenecks like before. Everything works smoothly, no timeouts in the GUI; now it's time to go further with the configuration.
I'm stuck on OSD creation. Shall I divide my NVMe drives into partitions to create more OSDs?
What about the DB disk? By default Proxmox suggests using the OSD disk itself. Would you suggest using another physical drive? And if yes, how large should it be?
I think I know the rest, but I'll appreciate any advice. Thank you.
 
Does that mean you were unsuccessful in using Proxmox in this constellation?

Or do you still have some unanswered questions?
So finally I got the Ceph configuration up and running: 3 nodes, each with a 2TB Micron 7450 NVMe M.2 drive divided into 4 equal OSDs; that's 12 OSDs in total. Everything is running on separate 10Gb NIC cards (I'm thinking of using the 2nd port for balance-alb, but for now I don't have enough empty SFP+ ports in my Cisco CBS350 switches).
To be honest I'm kind of disappointed with the performance, I mean the speed, as Ceph itself is working fine with no errors.
All I can get from this config is 780 MB/s read and 240 MB/s write.
I checked network speed with iperf between all nodes and I'm getting 10.5 GBytes transferred and a bandwidth of 9.03 Gbits/sec.

Is there anything I can do to improve this? At this moment I don't have any important VMs moved to ceph so I can redo the config.
 
Is there anything I can do to improve this? At this moment I don't have any important VMs moved to ceph so I can redo the config.
Increase your network speed and bandwidth (and use LACP). See the benchmark [0] for comparison and the one [1] for the high performance cluster.

[0] https://proxmox.com/en/downloads/proxmox-virtual-environment/documentation/proxmox-ve-ceph-benchmark
[1] https://proxmox.com/en/downloads/pr...cumentation/proxmox-ve-ceph-benchmark-2020-09
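For the bond itself, the Ceph network part of /etc/network/interfaces could look roughly like this (just a sketch; the interface names and the address are examples, and the switch side needs a matching LAG with LACP enabled):

Code:
auto bond0
iface bond0 inet static
        address 10.10.10.1/24
        bond-slaves enp65s0f0 enp65s0f1
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        bond-miimon 100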
 
IMHO it doesn't make much sense to create 4 OSDs per drive, as bandwidth-wise you will be limited by the network, and you already have 9 drives contributing IOPS, which should be plenty.
There's information missing regarding the hardware used, the exact network configuration, the Ceph configuration, and how you are doing your bandwidth tests.
 
Increase your network speed and bandwidth (and use LACP). See the benchmark [0] for comparison and the one [1] for the high performance cluster.

[0] https://proxmox.com/en/downloads/proxmox-virtual-environment/documentation/proxmox-ve-ceph-benchmark
[1] https://proxmox.com/en/downloads/pr...cumentation/proxmox-ve-ceph-benchmark-2020-09
LACP on the Cisco CBS350 series is a bit strange; I tried this setup but couldn't really get it working. Anyway, if you take a look at the tech specs, this LAG on Cisco will still be only 10G, which is quite funny.

About the docs you've attached: yes, I've seen these, thank you. According to this documentation, with my setup I should get at least around 800 MB/s write.

Here's the output from rados bench (I'm quite happy with the read speed but the write is crap):

Code:
rados bench -p <mypool> 600 write -b 4M -t 16 --no-cleanup

Total writes made:      45150
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     300.937
Stddev Bandwidth:       29.3551
Max bandwidth (MB/sec): 404
Min bandwidth (MB/sec): 224
Average IOPS:           75
Stddev IOPS:            7.33876
Max IOPS:               101
Min IOPS:               56
Average Latency(s):     0.212653
Stddev Latency(s):      0.0774387
Max latency(s):         1.06968
Min latency(s):         0.0297046

rados -p <mypool> bench 600 seq -t 16

Total time run:       176.108
Total reads made:     45150
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1025.51
Average IOPS:         256
Stddev IOPS:          34.3833
Max IOPS:             333
Min IOPS:             170
Average Latency(s):   0.0616288
Max latency(s):       0.951984
Min latency(s):       0.00599401
 
About the docs you've attached: yes, I've seen these, thank you. According to this documentation, with my setup I should get at least around 800 MB/s write.
That depends. You've created 4x OSDs out of 1x NVMe; did you use namespaces or did you go with LVM? And try reducing the number to 2x, that usually yields better results.

How big is the osd_memory_target? I hope it is >= 8 GB. :)

What is the replica size of your pool? And what does the CRUSH map look like (it can be found in the UI)?

LACP on the Cisco CBS350 series is a bit strange; I tried this setup but couldn't really get it working. Anyway, if you take a look at the tech specs, this LAG on Cisco will still be only 10G, which is quite funny.
Which is still true: each port will have 10 G. Only the bandwidth will increase, not the latency. With 3x nodes and a rados bench you won't see much improvement, as there aren't enough sessions (layer 3+4) to balance across the two links.

If you don't intend to get more nodes, then cross-connect them into a full mesh. This way you'll eliminate the switch setup. ;)
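For the full mesh, the simple routed variant on one node could look roughly like this in /etc/network/interfaces (only a sketch; interface names and addresses are examples, and the other two nodes get the mirrored config with their own address and host routes):

Code:
# eno1 is cabled directly to the second node, eno2 directly to the third
auto eno1
iface eno1 inet static
        address 10.15.15.11/24
        up ip route add 10.15.15.12/32 dev eno1
        down ip route del 10.15.15.12/32

auto eno2
iface eno2 inet static
        address 10.15.15.11/24
        up ip route add 10.15.15.13/32 dev eno2
        down ip route del 10.15.15.13/32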
 
If you don't intend to get more nodes, then cross-connect them into a full mesh. This way you'll eliminate the switch setup. ;)
It's worth noting that a switched configuration has double the effective bandwidth of a mesh config for the same two ports. That said, it probably wouldn't matter unless you change the network config away from LACP. To wit:

Anyway, if you take a look at the tech specs, this LAG on Cisco will still be only 10G, which is quite funny.
LACP doesn't do what you think it does. It doesn't "double" your speed; it allows twice as many packets. While the difference might not be easy to comprehend, the consequence is that you can't just make a single stream run at twice the speed of a link. Since multithreaded IO isn't available in Ceph yet (we're all waiting for Crimson), you won't get THAT kind of benefit.

Instead, what you can do is create two active/passive bonds with alternate masters, and use them for Ceph public and private traffic respectively.
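A sketch of that idea, assuming four 10G ports in total, two per bond (interface names and addresses are examples):

Code:
# bond0 carries the Ceph public network, primary on the first NIC
auto bond0
iface bond0 inet static
        address 10.0.1.1/24
        bond-slaves enp65s0f0 enp66s0f0
        bond-mode active-backup
        bond-primary enp65s0f0

# bond1 carries the Ceph cluster (private) network, primary on the second NIC
auto bond1
iface bond1 inet static
        address 10.0.2.1/24
        bond-slaves enp65s0f1 enp66s0f1
        bond-mode active-backup
        bond-primary enp66s0f1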

Which leads to the next point: rados bench only tests the OSD subsystem, not the storage system as a whole. Since traffic from the guest has to travel to a monitor, and subsequently to your OSDs, the end result is that if you keep your Ceph public and private traffic commingled, your effective throughput halves, making the above comment more relevant.
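Splitting the two is just a matter of pointing public_network and cluster_network at different subnets in ceph.conf and restarting the Ceph daemons (the subnets below are only examples):

Code:
[global]
        public_network = 10.0.1.0/24
        cluster_network = 10.0.2.0/24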

Bandwidth (MB/sec): 1025.51
I don't know what you're complaining about; you're saturating your network.
 
That depends. You've created 4x OSDs out of 1x NVMe; did you use namespaces or did you go with LVM? And try reducing the number to 2x, that usually yields better results.

How big is the osd_memory_target? I hope it is >= 8 GB. :)

What is the replica size of your pool? And what does the CRUSH map look like (it can be found in the UI)?


Which is still true: each port will have 10 G. Only the bandwidth will increase, not the latency. With 3x nodes and a rados bench you won't see much improvement, as there aren't enough sessions (layer 3+4) to balance across the two links.

If you don't intend to get more nodes, then cross-connect them into a full mesh. This way you'll eliminate the switch setup. ;)

Well, the Proxmox GUI doesn't give you too many options to configure Ceph; apart from the 4x OSDs out of each NVMe, my setup is all on defaults.
I really used just one command to create them:

Code:
ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1

and because RADOS was giving me an error I also had to add a keyring with

Code:
/usr/bin/ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring

I cannot find the osd_memory_target setting... and this could be quite important. I've read somewhere that I should allow about 3-4 GB per OSD, and in that case it would probably be better to use fewer OSDs on a single NVMe to save some RAM.

My ceph config looks like this:
Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 192.168.222.11/24
     fsid = 24b39425-ce01-4166-8ae3-a6cc9043da2f
     mon_allow_pool_delete = true
     mon_host = 192.168.222.11 192.168.222.13 192.168.222.12
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 192.168.222.11/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.node01]
     public_addr = 192.168.222.11

[mon.node02]
     public_addr = 192.168.222.12

[mon.node03]
     public_addr = 192.168.222.13


And my CRUSH map looks like this:
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host node02 {
    id -3        # do not change unnecessarily
    id -4 class ssd        # do not change unnecessarily
    # weight 1.74640
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 0.43660
    item osd.1 weight 0.43660
    item osd.2 weight 0.43660
    item osd.3 weight 0.43660
}
host node01 {
    id -5        # do not change unnecessarily
    id -6 class ssd        # do not change unnecessarily
    # weight 1.74640
    alg straw2
    hash 0    # rjenkins1
    item osd.4 weight 0.43660
    item osd.5 weight 0.43660
    item osd.6 weight 0.43660
    item osd.7 weight 0.43660
}
host node03 {
    id -7        # do not change unnecessarily
    id -8 class ssd        # do not change unnecessarily
    # weight 1.74640
    alg straw2
    hash 0    # rjenkins1
    item osd.8 weight 0.43660
    item osd.9 weight 0.43660
    item osd.10 weight 0.43660
    item osd.11 weight 0.43660
}
root default {
    id -1        # do not change unnecessarily
    id -2 class ssd        # do not change unnecessarily
    # weight 5.23920
    alg straw2
    hash 0    # rjenkins1
    item node02 weight 1.74640
    item node01 weight 1.74640
    item node03 weight 1.74640
}

# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
 
Well, the Proxmox GUI doesn't give you too many options to configure Ceph; apart from the 4x OSDs out of each NVMe, my setup is all on defaults.
I really used just one command to create them:
This means it is LVM on top of one namespace. You can test reducing the number of OSDs on the single namespace, or use multiple namespaces (I suggest two).
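For reference, that would be the same batch command as before with a lower split; the existing OSDs on the device have to be destroyed/zapped first, so do this while nothing important is on the pool:

Code:
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1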

I cannot find the osd_memory_target setting... and this could be quite important. I've read somewhere that I should allow about 3-4 GB per OSD, and in that case it would probably be better to use fewer OSDs on a single NVMe to save some RAM.
It is 4 GiB by default, see [0]. I recommend going with 8 GiB (or more). You can set it in the Ceph config DB or in the config file. Either restart the OSDs or use injectargs to set it live.
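For example, for 8 GiB (the value is given in bytes):

Code:
# persist it in the config DB, picked up when the OSDs restart
ceph config set osd osd_memory_target 8589934592
# or push it to the running OSDs without a restart
ceph tell osd.* injectargs '--osd_memory_target=8589934592'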

I'm not complaining about READ speed, as I think it's all I can get from a 10G network. I'm complaining about WRITE, which seems to be less than half the speed of other benchmarks on a similar hardware configuration.
Well, in Ceph reads will (in most cases) always be better [1]. The reason for this is that Ceph will only ACK a write once all OSDs of a particular PG have written the object from the client. This adds latency on top and also makes the network the key factor. A read, on the other hand, is handled by the primary OSD directly, with no wait time.

[0] https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#confval-osd_memory_target
[1] https://docs.ceph.com/en/latest/architecture/#smart-daemons-enable-hyperscale
 
What do you mean by a namespace? A physical node?
The NVMe protocol [0] (since v1.3) allows you to namespace (slice) an NVMe drive into multiple devices. Some vendors have their own tooling; for others it can be done through the nvme CLI.

Whether this is an advantage greatly depends on the NVMe controller, as you are getting distinct devices in the kernel, and I am not sure how the controller handles these. This is still some magic to me. ;)

[0] https://nvmexpress.org/resource/nvme-namespaces/
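With the nvme CLI the rough sequence would look like this (only a sketch; the size and IDs are examples, it destroys all data on the drive, and not every model supports more than one namespace, so check id-ctrl first):

Code:
# how many namespaces the controller supports (nn) and its total capacity
nvme id-ctrl /dev/nvme0 | grep -E '^nn|tnvmcap'
# delete the existing namespace, then create and attach a smaller one
nvme delete-ns /dev/nvme0 -n 1
nvme create-ns /dev/nvme0 --nsze=1953125000 --ncap=1953125000 --flbas=0
nvme attach-ns /dev/nvme0 -n 1 -c 0
# repeat create-ns/attach-ns for the second namespace, then rescan
nvme reset /dev/nvme0
nvme list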
 
The NVMe protocol [0] (since v1.3) allows you to namespace (slice) an NVMe drive into multiple devices. Some vendors have their own tooling; for others it can be done through the nvme CLI.

Whether this is an advantage greatly depends on the NVMe controller, as you are getting distinct devices in the kernel, and I am not sure how the controller handles these. This is still some magic to me. ;)

[0] https://nvmexpress.org/resource/nvme-namespaces/
Ok, I don't think that on a 10G network I will be able to significantly increase performance by creating these additional namespaces.
I think I will consider buying extra 2x 40G NIC cards and connecting the nodes together without any switches; I hope this will solve my write speed issue. I also have a spare physical server, so maybe adding a 4th node to this cluster will speed it up a bit.
 
I think I will consider buying extra 2x 40G NIC cards and connecting the nodes together without any switches; I hope this will solve my write speed issue. I also have a spare physical server, so maybe adding a 4th node to this cluster will speed it up a bit.
I'd go for 2x 25 GbE; they usually have better latency compared to 40 GbE NICs. I also believe that the LACP bond isn't distributing the load evenly with a small number of destinations + sessions. But anyway, the NICs will help. :)
 
Ceph speaks IP, which IPoIB provides. Though it doesn't seem to be such a common setup.
Not for a while. This was an attractive setup some years ago, when 10 Gb+ Ethernet switch ports were really expensive but EDR/FDR switch ports were given away basically for free (this was when 25 Gbit ports were introduced). The biggest issue with IPoIB as an alternative to Ethernet is that IPoIB is strictly a layer 3 topology; IB offers pretty limited L2 functionality (e.g. no LACP, can't be used in bridges, etc.).

You can actually change the mode on those Mellanox InfiniBand cards to Ethernet. You'd be paying more to have InfiniBand support.
Not always. It depends on your HCA's part number; this only applies to VPI cards. But cost is not really the issue here; it's what type of switch ports you have available.
 
