Yet another Ceph tuning question (comparing to a Dell SAN)

MoreDakka

Heyo.
Let me know if I need to add more information/stats to this.
Here is my cluster, Proxmox fully updated:

4-node C6220
Each node:
128GB RAM @ 1333MHz
E5-2650 v0
Dual GigE - bonded (network access) - connected to two different 10G switches
Dual 10G - failover (HA and migration) - connected to two different 40G switches
Dual 40G InfiniBand - bonded (running at 20G...stupid InfiniBand - Ceph)
OS drive: 500GB SU800

Each node also has 4 x 1.92TB SM863a drives for OSDs, with journaling on the same drives.

I'm comparing this to an MD3220i with 23 x 500GB SAS drives (it has quad GigE to the storage network, but the ESX hosts only have dual GigE for storage).
I ran an ATTO test on a Windows VM on both Proxmox/Ceph and ESX/MD3220i:

[Attachment: CEPH-md3220-compare01.jpg (ATTO results, Ceph vs. MD3220i)]

Why are the write speeds between 512B and 128K so weak on Ceph and so strong on the MD3220i?

Thanks!
 
Is there any way to decrease that latency and increase the I/O rate at smaller block sizes to get better results?
 
Does Ceph do a lot over the HA network? We have Ceph on a 40G network.
As you describe it, your "HA" network is for corosync. In that environment, Ceph doesn't do anything over that interface.

We have Ceph on a 40G network.
It sounds counterintuitive, but 25Gbit Ethernet has 2.5x lower latency than 40Gbit. This has to do with the architecture: 40G is just 4x10G lanes, so your latency is that of a single 10Gb link.

That said, hardware RAID will often perform better with small blocks because of the effectiveness of its controller cache.
 
Ah right, 4x10G, forgot about that. And InfiniBand only runs at 20G because of overhead or something.
So the only way to increase the awesomeness of Ceph is a 25G/100G system? And/or NVMe drives as well?

The bond is also in failover mode, as it doesn't cooperate otherwise.
Ugh, new hardware :(
 
And InfiniBand only runs at 20G because of overhead or something.
No clue. Are you sure you're using 40Gbit parts? It could also be a switch setting or poor cabling.

The bond is also in failover mode, as it doesn't cooperate otherwise.
That's all you can do with IPoIB, and it wouldn't get you anywhere even if you could use LACP; you're not bandwidth constrained.
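If you want to confirm that under your real workload, watching per-interface throughput for a while is usually enough. A minimal sketch, assuming the sysstat package is installed; ibp5s0 is just an example device name to substitute with your Ceph interface:

# Sample RX/TX throughput once per second for 60 samples; watch rxkB/s and txkB/s
# on the Ceph interface while the cluster is under normal load.
sar -n DEV 1 60 | grep -E 'IFACE|ibp5s0'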
 
Hi @MoreDakka,
Before tearing your networking apart, I recommend looking into the latency difference between your SAN and Ceph:
Run a QD1 workload and look at I/O sizes that fit within your network MTU (likely 512 or 1024). The transfer latency differences at this scale are negligible.
This data will give us an accurate comparison of latency between Ceph and your SAN. Report back on that and we can discuss it further. If it's possible to report IOPS instead of bandwidth, that would be helpful.
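For example, a QD1 run at an MTU-sized block could look something like the sketch below. This is only an illustration for a Linux guest (on the Windows VM you would use the windowsaio ioengine and a Windows path); the filename, size, and runtime are placeholders to adjust:

# QD1 (iodepth=1) random write at 512-byte blocks, direct I/O, reporting IOPS and latency
fio --name=qd1-randwrite \
    --filename=/tmp/fio-testfile --size=4G \
    --rw=randwrite --bs=512 \
    --ioengine=libaio --direct=1 \
    --iodepth=1 --numjobs=1 \
    --time_based --runtime=60 \
    --group_reporting

Running the same job against a VM disk on the SAN and one on Ceph gives directly comparable IOPS and completion-latency numbers.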


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
[Attachment: 1677554671410.png (benchmark screenshot)]
Here is my Ceph over 10G with 5 nodes and 6 x 2TB EVO 870 SSDs on each node. Is this the expected performance for this setup? Thanks
 
@_o
Gah, why is my cluster maxing out at 2GB/s? I'm down by 14 drives, but when I was testing with EVOs I had terrible performance, so we went with enterprise SSDs instead, which cost a whole bunch more. Still, my 2GB/s versus rowell's 8-11GB/s is a massive difference, especially on a 10Gb network. Should I be using the 10Gb Ethernet instead of the 40Gb InfiniBand for Ceph?
 
Dual 40G InfiniBand - bonded (running at 20G...stupid InfiniBand - Ceph)
Gah, why is my cluster maxing out at 2GB/s?
Should I be using the 10Gb Ethernet instead of the 40Gb InfiniBand for Ceph?
There are a couple of things to consider:

1. If your interconnect bandwidth is 20Gbit, that translates to a MAXIMUM rate of 2.5GB/s (20/8).
2. InfiniBand is not limited to 20Gb. My guess is that your NICs are actually 10Gb IB. Are they REALLY REALLY OLD? Or maybe your switch is?
3. Stop obsessing over "the speed." The vast majority of use cases don't gain any benefit from those big numbers. What are you using this for? Why not benchmark THAT?
 
@alexskysilk
1 - Math is hard for me this early, ha. So we're almost hitting the speed of our half-speed 40Gb InfiniBand on the Ceph network. However, it's 20Gb bidirectional.
2 - Yes, this hardware is old: ConnectX-3 connected to SX6015 switches.
3 - You are correct, and I hate obsessing over this one stat, but I'm trying to tune it as best I can with the tools available before putting anything production on it. It's tough to do real-world testing without a client machine or something generating the I/Os you would normally see.

A weird question though: how does rowell get 8-11GB/s on a 10Gb network? That seems impossible.
 
Ceph network. However, it's 20Gb bidirectional.
Yes, this hardware is old: ConnectX-3 connected to SX6015 switches.
20Gbit is not a supported connection type/speed on a ConnectX-3. It's either 10, 40, or 56. Where are you seeing the 20 number?

trying to tune it as best I can with the tools available before putting anything production on it. It's tough to do real-world testing without a client machine or something generating the I/Os you would normally see.
Well, this is a paradox. You're saying that you don't have the ability to benchmark, so instead you are using a metric that doesn't have any apparent relation to your application. Why bother then? You're not actually establishing any relevant data...

Question: why do you not have a client machine? It's like you're trying to answer a question that hasn't been asked.
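For what it's worth, the rate the HCA actually negotiated with the switch can be read directly. A quick sketch, assuming the ConnectX-3 shows up as mlx4_0 (adjust device and port to match your hosts):

# Physical IB link rate as negotiated with the switch, e.g. "40 Gb/sec (4X QDR)"
cat /sys/class/infiniband/mlx4_0/ports/1/rate
# Or, if the infiniband-diags package is installed:
ibstat mlx4_0 1 | grep Rate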
 
20Gbit is not a supported connection type/speed on a ConnectX-3. It's either 10, 40, or 56. Where are you seeing the 20 number?
Sorry, I should have specified: they are connected, but I could never get these NICs to talk faster than 20Gb:

Both ports are:

root@pve1-cpu1:~# ethtool ibp5s0
Settings for ibp5s0:
Supported ports: [ ]
Supported link modes: Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes: Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 40000Mb/s
Duplex: Full
Auto-negotiation: on
Port: Other
PHYAD: 255
Transceiver: internal
Link detected: yes

root@pve1-cpu3:~# iperf3 -c 192.168.1.81
Connecting to host 192.168.1.81, port 5201
[ 5] local 192.168.1.83 port 56138 connected to 192.168.1.81 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 2.30 GBytes 19.8 Gbits/sec 0 1.37 MBytes
[ 5] 1.00-2.00 sec 2.29 GBytes 19.7 Gbits/sec 0 1.37 MBytes
[ 5] 2.00-3.00 sec 2.45 GBytes 21.0 Gbits/sec 0 1.37 MBytes
[ 5] 3.00-4.00 sec 2.40 GBytes 20.6 Gbits/sec 0 1.37 MBytes
[ 5] 4.00-5.00 sec 2.36 GBytes 20.2 Gbits/sec 0 1.37 MBytes
[ 5] 5.00-6.00 sec 2.45 GBytes 21.0 Gbits/sec 0 1.37 MBytes
[ 5] 6.00-7.00 sec 2.37 GBytes 20.4 Gbits/sec 0 1.37 MBytes
[ 5] 7.00-8.00 sec 2.41 GBytes 20.7 Gbits/sec 0 1.37 MBytes
[ 5] 8.00-9.00 sec 2.46 GBytes 21.1 Gbits/sec 0 1.37 MBytes
[ 5] 9.00-10.00 sec 2.38 GBytes 20.4 Gbits/sec 0 1.37 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 23.9 GBytes 20.5 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 23.9 GBytes 20.4 Gbits/sec receiver

While building the cluster I tried to get this working faster, and at the time I couldn't figure it out. After reading something about major overhead between Linux and InfiniBand, I left it at 20Gb. I don't remember the exact findings.
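For reference, the two usual suspects behind an IPoIB ceiling around 20Gb are datagram mode with a small MTU and a single-stream/CPU limit. Both are quick to check; this is only a sketch using the ibp5s0 name from the ethtool output above, and whether connected mode actually helps here is something to verify, not a given:

# Is the IPoIB interface in datagram or connected mode?
cat /sys/class/net/ibp5s0/mode
# Connected mode allows a much larger MTU (may require taking the interface down first)
echo connected > /sys/class/net/ibp5s0/mode
ip link set ibp5s0 mtu 65520
# Rule out a single-stream limit by running iperf3 with parallel streams
iperf3 -c 192.168.1.81 -P 4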

We don't specialize in one thing: VM hosting, web hosting, monitoring, desktops, etc. It's a broad net, and throughput covers a lot of it. I've put some of our own stuff on this cluster and it performs fast. However, if there is something else worth testing, I would definitely want to push the cluster through it.
 
Ahh, ok. The answer is probably pretty simple.

You say you're using both ports in an active/passive bond, correct? That likely means you're using the same bond for both the Ceph public and private (cluster) interfaces, which would explain how you are getting such a neatly HALVED number, since that traffic is flowing twice (once over the public interface, once over the private one).

There are a number of ways to overcome this: you can split the bond into two separate interfaces, add more NICs, or create multiple bonds with alternate masters.
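As an illustration of the first option, splitting public and cluster traffic mostly comes down to giving each its own subnet/interface in ceph.conf. The second subnet below is made up for the example (only 192.168.1.0/24 appears in this thread), and OSDs need a restart to pick up a changed cluster network:

# /etc/ceph/ceph.conf (a symlink to /etc/pve/ceph.conf on Proxmox)
[global]
    # client/monitor traffic
    public_network  = 192.168.1.0/24
    # OSD replication and heartbeat traffic, on its own interface (hypothetical subnet)
    cluster_network = 192.168.2.0/24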
However, if there is something else worth testing, I would definitely want to push the cluster through it.
Undoubtedly. But any tuning carries consequences: if you don't have a specific goal in mind, you are just as likely to end up harming your apparent performance. GENERALLY speaking, it is likely that you're not going to get anywhere near network saturation under your normal load even with your currently available bandwidth; the only place bandwidth really comes into play is during a rebalance. The real factors are the number of nodes, OSDs per node, and number of guest connections. There is little to be learned from a single guest connection; it's the aggregate that defines system performance. The more of the above, the greater the aggregate performance, as any one I/O is only dealing with 3 OSDs at any given time (for a replicated pool; EC is another can of worms altogether).
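One rough way to see that aggregate effect: run rados bench from several nodes at the same time against a scratch pool and compare the sum with a single node's result (the pool name here is a placeholder):

# Run on each node in parallel, then compare one node's number with the sum across nodes
rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup
rados bench -p testpool 60 rand -t 16
# Remove the benchmark objects afterwards
rados -p testpool cleanup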
 
@alexsky
I really appreciate this back and forth. Diving into lots of good stuff here :-D

I'm confused about your public/private network point.
The two 40Gb InfiniBand ports (going to two different switches that are cross-connected) are in an active/backup config, with the primary NIC being NIC 1 in all of the servers. This should mean that as long as NIC 1 is alive, all the network traffic should go through NIC 1.
Just for testing and fun times, on 2 of the nodes I got rid of the bond and configured the IP directly on the interface. There was a slight increase in throughput:

root@pve1-cpu4:~# iperf3 -c 192.168.1.83
Connecting to host 192.168.1.83, port 5201
[ 5] local 192.168.1.84 port 41264 connected to 192.168.1.83 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 2.33 GBytes 20.0 Gbits/sec 0 1.25 MBytes
[ 5] 1.00-2.00 sec 2.38 GBytes 20.5 Gbits/sec 0 1.25 MBytes
[ 5] 2.00-3.00 sec 2.40 GBytes 20.6 Gbits/sec 0 1.25 MBytes
[ 5] 3.00-4.00 sec 2.43 GBytes 20.8 Gbits/sec 0 1.25 MBytes
[ 5] 4.00-5.00 sec 2.35 GBytes 20.2 Gbits/sec 0 1.25 MBytes
[ 5] 5.00-6.00 sec 2.37 GBytes 20.3 Gbits/sec 0 1.25 MBytes
[ 5] 6.00-7.00 sec 2.43 GBytes 20.8 Gbits/sec 0 1.25 MBytes
[ 5] 7.00-8.00 sec 2.42 GBytes 20.8 Gbits/sec 0 1.25 MBytes
[ 5] 8.00-9.00 sec 2.41 GBytes 20.7 Gbits/sec 0 1.25 MBytes
[ 5] 9.00-10.00 sec 2.42 GBytes 20.8 Gbits/sec 0 1.25 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 24.0 GBytes 20.6 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 24.0 GBytes 20.5 Gbits/sec receiver

Maybe I'm misunderstanding your public/private point there?
Or is there something Ceph-specific that gives Ceph only 10Gb/s that I might have configured incorrectly?
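One way to stop guessing is to check what Ceph is actually using for its public and cluster sides; if both land on the same subnet and interface, public and cluster traffic are sharing the link as described above. A sketch (osd.0 is just an example ID):

# What the config file declares (an empty/missing cluster_network means it shares the public one)
grep network /etc/pve/ceph.conf
# Addresses and interfaces a specific OSD actually bound for its public (front) and cluster (back) side
ceph osd metadata 0 | grep -E '"(front|back)_(addr|iface)"'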
 
