CEPH performance and configuration

lstrunjak
Jul 6, 2020
Hello,

I have 6 hosts and every host has 4 NICs: 2x 10G and 2x 40G.

On the 10G NICs there is a bond with LACP and a bridge over bond0.
On the 40G NICs there is a bond with active/backup and a bridge over bond1; this network is for Ceph. On the 40G interfaces I set MTU 9000.

auto lo
iface lo inet loopback

auto ens6f0
iface ens6f0 inet manual

auto ens6f1
iface ens6f1 inet manual

auto ens1f0
iface ens1f0 inet manual
    mtu 9000

auto ens1f1
iface ens1f1 inet manual
    mtu 9000

auto bond0
iface bond0 inet manual
    bond-slaves ens6f0 ens6f1
    bond-miimon 100
    bond-mode 802.3ad

auto bond1
iface bond1 inet manual
    bond-slaves ens1f0 ens1f1
    bond-miimon 100
    bond-mode active-backup
    bond-primary ens1f0
    mtu 9000

auto vmbr0
iface vmbr0 inet static
    address 192.168.100.10/24
    gateway 192.168.100.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094

auto vmbr1
iface vmbr1 inet static
    address 10.15.10.10/24
    bridge-ports bond1
    bridge-stp off
    bridge-fd 0



When I created Ceph, I chose the 40G network for the public network (10.15.10.10, the address from the master node) and for the cluster network I chose 192.168.100.10/24 (the address from the master node).


[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 192.168.100.10/24
fsid = fcb7cb0b-0444-4006-b65e-3c1bc7910b68
mon_allow_pool_delete = true
mon_host = 10.15.10.10 10.15.10.11 10.15.10.12 10.15.10.13 10.15.10.14 10.15.10.15
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.15.10.10/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.master]
public_addr = 10.15.10.10

[mon.slave01]
public_addr = 10.15.10.11

[mon.slave02]
public_addr = 10.15.10.12

[mon.slave03]
public_addr = 10.15.10.13

[mon.slave04]
public_addr = 10.15.10.14

[mon.slave05]
public_addr = 10.15.10.15





My disks for Ceph are Intel SSD DC P4510 4.0TB, 2.5in PCIe 3.1 x4, 3D2, TLC (SSDPE2KX040T8), and every node has 2 of them.

I am just curious because today we tested this: we removed one disk and put in another one to see how long the rebuild of the data would take. It took more than 2 hours.

Do I need to change something? Every proposal is welcome.

Maybe I should remove the cluster network and leave only the public network, or should I keep this kind of configuration?
 
Hello,
you have a dedicated network for Ceph, and that's good.
Rebuild time depends on the usage % of your Ceph pool.

What is the read/write/IOPS performance of your OSDs?
You can test it with "ceph tell osd.X bench -f plain".
How many backfill slots?

You should monitor the IOPS on your OSD disks with 'iostat -xd 1' and see if the disks reach 100% utilization.
If not, increase backfills with:
ceph tell 'osd.*' injectargs '--osd_max_backfills N'
with "N" being 4, 8, 16, ...
 
On the 10G NICs there is a bond with LACP and a bridge over bond0.
On the 40G NICs there is a bond with active/backup and a bridge over bond1; this network is for Ceph. On the 40G interfaces I set MTU 9000.

So you say that the 40G network (10.15.10.x) should be for Ceph, but...

When I created Ceph, I chose the 40G network for the public network (10.15.10.10, the address from the master node) and for the cluster network I chose 192.168.100.10/24 (the address from the master node).

Here you say that the public network is on the 40G network (10.15.10.x). Normally one puts the private (= cluster) network on the faster one, as that is where the replication of objects, rebalancing and such happens. The public network is for the clients (for example VMs, or anything that mounts a CephFS).
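Concretely, swapping the two roles would look roughly like this in ceph.conf (only a sketch, reusing the addresses from your current config; note that the mon_host / public_addr entries would then also have to move to 192.168.100.x):

[global]
# 10G bond: clients and monitors
public_network = 192.168.100.10/24
# 40G bond: replication, rebalancing, backfill
cluster_network = 10.15.10.10/24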
 
So I should use the faster 40G network for the Ceph cluster network and the slower 10G network for the public network. What if I use only the 40G network for the public network, without a cluster network?

My setup now is:

40G public
10G cluster
 
So I should use the faster 40G network for the Ceph cluster network and the slower 10G network for the public network.

Faster for the cluster network. Think of it this way: a Ceph pool is normally set to 3 replicas per data object, so if a client writes one object through the public network, that object gets replicated two additional times by Ceph over the private network so that there are three copies in total. So you may see up to three times the traffic on the cluster network.
Further, as said, if an OSD fails or new ones get plugged in, Ceph rebalances the data so that it's spread optimally and as safely as possible; this rebalancing also happens over the cluster network.
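You can check what your pool is actually set to; a quick sketch, with <pool> standing in for your pool's name:

ceph osd pool get <pool> size       # copies per object, 3 with your osd_pool_default_size
ceph osd pool get <pool> min_size   # copies required for I/O, 2 with your osd_pool_default_min_size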

What if I use only the 40G network for the public network, without a cluster network?

There's always a cluster network; if it is not specified separately, Ceph re-uses one network for both public and cluster traffic.
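In ceph.conf that simply means keeping only the public_network line, e.g. (a sketch, using the address from your current config):

[global]
public_network = 10.15.10.10/24
# no cluster_network line: replication and rebalancing also use this (40G) network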
 
So is it better to use only the 40G network for the public network and not specify a cluster network at all? Will Ceph then use the 40G network for both the public and the cluster network? Or should I just switch the networks, like:

40G for the cluster network and 10G for the public network?
 
I've removed the cluster network line from the ceph.conf file so that only the public network is left. I got 4x the IOPS and the data rebuild is much faster. I will leave it like this, without a cluster network, only the public network, which is 40G.
 
40G for the cluster network and 10G for the public network?

That would normally be the fastest and the recommended setup. But then the clients get at most 10G (though guaranteed), whereas with a shared 40G network you may get better "synergy", but a rebalancing job can more easily throw off and block your clients (VMs) talking to the Ceph cluster.

I think that either of:
* Ceph public and cluster shared on one fast network
* Ceph private on the fastest network and public on another network
will work out OK performance-wise in many situations.
Just never put the private network explicitly on the slower network; that will only hurt performance (as you have seen).
 
OK, just one more thing... would it be OK to remove bond1 and vmbr1 (the 40G bond) and dedicate one 40G NIC per server to the public network and the second one to the cluster network? What if the public interface on node2 fails, will Ceph continue to work over the cluster network? Or should I leave it like this? I've removed the cluster network from the configuration file and now everything runs over the 40G public network. Like I said, before I had only 5000 IOPS with 4096-byte blocks; now:


root@master:~# rados -p CEPH bench -b 4096 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_master.horizon.loc_198081
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 2 2 0 0 0 - 0
1 16 23042 23026 89.931 89.9453 0.000584592 0.000694414
2 15 46639 46624 91.0512 92.1797 0.000518542 0.000685909
3 16 70930 70914 92.3255 94.8828 0.000468876 0.000676538
4 15 92225 92210 90.039 83.1875 0.000487598 0.000693725
5 16 113121 113105 88.354 81.6211 0.000585868 0.000707014
6 15 133227 133212 86.7178 78.543 0.000611348 0.000720365
7 16 152593 152577 85.1349 75.6445 0.000456684 0.000733783
8 16 173027 173011 84.469 79.8203 0.00061629 0.000736787
9 16 192657 192641 83.6024 76.6797 0.000455015 0.000747263
Total time run: 10.0008
Total writes made: 216359
Write size: 4096
Object size: 4096
Bandwidth (MB/sec): 84.5084
Stddev Bandwidth: 7.03829
Max bandwidth (MB/sec): 94.8828
Min bandwidth (MB/sec): 75.6445
Average IOPS: 21634
Stddev IOPS: 1801.8
Max IOPS: 24290
Min IOPS: 19365
Average Latency(s): 0.000739264
Stddev Latency(s): 0.00211287
Max latency(s): 0.0803144
Min latency(s): 0.000364863


I think that those values are good.
 
