Yet another CEPH tuning question (comparing to dell san)

MoreDakka · Feb 10, 2023

Heyo.
Let me know if I need to add more information/stats to this.
Here is my cluster, proxmox fully updated:

4 node c6220
each node:
128gb # 1333mhz
E5-2650v0
dual gige - bonded (network access) - connected to two different 10g switches
dual 10g - Failover (HA and migration) - connected to two different 40g switches
dual 40g infiniband - Bonded (running at 20g...stupid infiniband - CEPH)
OS drive 500gb SU800

Each node also has 4 - 1.92 Tb SM863a drives for OSDs, journaling on themselves.

I'm comparing this to an md3220i with 23 500gb SAS drives (it has quad gig to storage net but the ESX hosts only have dual gig for storage net).
I did an ATTO test on a windows VM both on Proxmox/CEPH and ESX/md3220i:

Why are the write speeds between 512B - 128K so weak on CEPH and so strong on the md3220i?

Thanks!

gurubert · Feb 11, 2023

Because smaller data packets mean more IO operations that introduce latency because Ceph is networked.
A RAID box like the Dell MD3220i does not have network latency between the disks.

MoreDakka · Feb 27, 2023

Is there any way to decrease that latency and increase the IO with smaller packets to get better results?

gurubert · Feb 27, 2023

Get 25G networking gear. It has lower latencies than the 10G technology.

MoreDakka · Feb 27, 2023

Does CEPH do lots over the HA network? We have Ceph on a 40g network.

alexskysilk · Feb 27, 2023

MoreDakka said:
Does CEPH do lots over the HA network? We have Ceph on a 40g network.

As you describe it, your "HA" network is for corosync. in that environment, it doesnt do anything over that interface.

MoreDakka said:
We have Ceph on a 40g network.

It sounds counterintuitive, but 25gbit EN has 2.5x lower latency then 40gbit- this has to do with the architecture- 40g is just 10x4, so your latency is of a single 10gb link.

That said, hardware raid will often perform better with small blocks because caching effectiveness.

VictorSTS · Feb 27, 2023

alexskysilk said:
It sounds counterintuitive, but 25gbit EN has 2.5x lower latency then 40gbit- this has to do with the architecture- 40g is just 10x4, so your latency is of a single 10gb link.

Is that also true for 40G infiniband? OP mentioned that he was using a dual bonded 40G infiniband interfaces instead of ethernet.

alexskysilk · Feb 28, 2023

Yes. also ipoib should be considered deprecated at this point.

MoreDakka · Feb 28, 2023

Ah right 4x10g, forgot about that. And InfiniBand only runs at 20g cause of overhead or something.
So the only way to increase the awesomeness of CEPH is a 25g/100g system? and or NVMe drives as well.

The bond is also in failover as it doesn't cooperate otherwise.
Ugh, new hardware

alexskysilk · Feb 28, 2023

MoreDakka said:
And InfiniBand only runs at 20g cause of overhead or something.

No clue. are you sure you're using 40gbit parts? could also be a switch setting or poor cabling?

MoreDakka said:
The bond is also in failover as it doesn't cooperate otherwise.

Thats all you can do with ipoib- and it wouldnt get you anywhere if you could lacp anyway, you're not bandwidth constrained.

bbgeek17 · Feb 28, 2023

Hi @MoreDakka ,
Before tearing your networking apart, I recommend looking into the latency difference between your SAN and Ceph:
Run a QD1 workload and look at the I/O sizes that fit within your network MTU (likely 512 or 1024). The transfer latency differences at this scale are negligible.
This data will give us an accurate comparison of latency between Ceph and your SAN. Report back on that, and we can discuss it further. If it's possible report IOPS instead of bandwidth, that would be helpful.

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox

rowell · Feb 28, 2023

Here is my CEPH over 10G with 5 Nodes, 6 x 2TB Evo 870 SSD on each node, is this the expected performance for this setup? Thanks

MoreDakka · Feb 28, 2023

@_o
Gah, why is my cluster maxing at 2GB/s, I'm down by 14 drives but when I was testing with EVOs I had terrible performance so we went with enterprise SSDs instead which cost a whole bunch more. But my 2GB/s to rowell's 8-11 GB/sec is a massive difference, specially with a 10Gb network. Should I be using the 10Gb ethernet instead of the 40Gb InfiniBand for CEPH?

alexskysilk · Feb 28, 2023

MoreDakka said:
dual 40g infiniband - Bonded (running at 20g...stupid infiniband - CEPH)

MoreDakka said:
Gah, why is my cluster maxing at 2GB/s,

MoreDakka said:
Should I be using the 10Gb ethernet instead of the 40Gb InfiniBand for CEPH?

There are a couple of things to consider-

1. if your interconnect bandwidth is 20gbit, that translates to a MAXIMUM rate of 2.5GB/s (20/8)
2. Infiniband is not limited to 20gb. My guess is that your nics are actually 10gb IB- are they REALLY REALLY OLD? or maybe your switch is?
3. stop obsessing with "teh speed." the vast majority of use cases dont gain any benefit from those big numbers. What are you using this for? why not benchmark THAT?

MoreDakka · Feb 28, 2023

@alexskysilk
1 - Math is hard for me this early. ha, so almost hitting the speeds of our 1/2 speed 40Gb Infinniband on the CEPH network However it's 20Gb bidirectional.
2 - Yes, this hardware is old ConnectX-3 connected to SX6015 switches
3 - You are correct, and I hate obsessing over this one stat but I'm trying to tune it to the best I'm able to with the tools available before putting anything production on it. It's tough to do real world testing without a client machine or something generating the IOs that you would normally see.

A weird question though. How does rowell get 8-11 GB/s on a 10Gb network, that seems impossible?

alexskysilk · Feb 28, 2023

MoreDakka said:
network However it's 20Gb bidirectional.

MoreDakka said:
Yes, this hardware is old ConnectX-3 connected to SX6015 switches

20gbit is not a supported connection type/speed on a connectx3. its either 10,40, or 56. Where are you seeing the 20 number?

MoreDakka said:
trying to tune it to the best I'm able to with the tools available before putting anything production on it. It's tough to do real world testing without a client machine or something generating the IOs that you would normally see.

Well this is a paradox. you're saying that you dont have the ability to benchmark, so instead you are using a metric that doesn't have any apparent relation to your application. Why bother then? you're not actually establishing any relevant data...

Question- why do you not have a client machine? its like you're trying to answer a question that hasn't been asked.

MoreDakka · Feb 28, 2023

alexskysilk said:
20gbit is not a supported connection type/speed on a connectx3. its either 10,40, or 56. Where are you seeing the 20 number?

Sorry I should have specified, they are connected but I could never get these NICs to talk faster than 20Gb:

Both ports are:

root@pve1-cpu1:~# ethtool ibp5s0
Settings for ibp5s0:
Supported ports: [ ]
Supported link modes: Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes: Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 40000Mb/s
Duplex: Full
Auto-negotiation: on
Port: Other
PHYAD: 255
Transceiver: internal
Link detected: yes

root@pve1-cpu3:~# iperf3 -c 192.168.1.81
Connecting to host 192.168.1.81, port 5201
[ 5] local 192.168.1.83 port 56138 connected to 192.168.1.81 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 2.30 GBytes 19.8 Gbits/sec 0 1.37 MBytes
[ 5] 1.00-2.00 sec 2.29 GBytes 19.7 Gbits/sec 0 1.37 MBytes
[ 5] 2.00-3.00 sec 2.45 GBytes 21.0 Gbits/sec 0 1.37 MBytes
[ 5] 3.00-4.00 sec 2.40 GBytes 20.6 Gbits/sec 0 1.37 MBytes
[ 5] 4.00-5.00 sec 2.36 GBytes 20.2 Gbits/sec 0 1.37 MBytes
[ 5] 5.00-6.00 sec 2.45 GBytes 21.0 Gbits/sec 0 1.37 MBytes
[ 5] 6.00-7.00 sec 2.37 GBytes 20.4 Gbits/sec 0 1.37 MBytes
[ 5] 7.00-8.00 sec 2.41 GBytes 20.7 Gbits/sec 0 1.37 MBytes
[ 5] 8.00-9.00 sec 2.46 GBytes 21.1 Gbits/sec 0 1.37 MBytes
[ 5] 9.00-10.00 sec 2.38 GBytes 20.4 Gbits/sec 0 1.37 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 23.9 GBytes 20.5 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 23.9 GBytes 20.4 Gbits/sec receiver

While building the cluster I tried to get this working faster and at the time I couldn't figure it out. Reading something about some major overhead between linux and InfiniBand I left it at 20Gb. I don't remember the exact findings.

We don't specialize in one thing but VM hosting, web hosting, monitoring, desktops, etc. It's a broad net and throughput covers lots of it. I've put some of our stuff on this cluster and it performs fast. However if there is something else I would definitely want to push the cluster through it.

alexskysilk · Feb 28, 2023

ahh ok. the answer is probably pretty simple.

You say your using both ports in an active/passive bond, correct? that likely means you're using the same bond for both public and private interfaces- which would explain how you are getting such a neatly HALF number, since that traffic is flowing twice (once over public interface, once over private.)

There are a number of ways to overcome this- you can separate the bond into two separate interfaces, add more NICs, or create multiple bonds with alternate masters.

MoreDakka said:
However if there is something else I would definitely want to push the cluster through it.

Undoubtedly. but any tuning carries consequences- if you dont have a specific goal in mind, you are just as likely to end up harming your apparent performance. GENERALLY speaking, It is likely that you're not going to get anywhere near network saturation under your normal load even with your currently available bandwidth- the only place bandwidth really comes into play is during a rebalance. The real factor is number of nodes, OSDs per node, and number of guest connections. There is little to be learned with a single guest connection- its the aggregate that define the system performance. the more of the above, the greater the aggregate performance as any one IO is only dealing with 3 OSDs at any given time (for a replication pool; EC is another can of worms alltogether.)

MoreDakka · Feb 28, 2023

@alexsky
I really appreciate this back and forth. Diving into lots of good stuff here :-D

I'm confused on your public/private network point
The two 40Gb infinniband ports (going to two different switches that are cross connected) are in an active/backup config with the primary NIC being NIC 1 in all of the servers. This should mean that as long as NIC1 is alive all the network traffic should go through NIC1.
Just for testing and fun times on 2 of the nodes I got rid of the bond and configured the IP directly on the interface. There was a slight increase of throughput:

root@pve1-cpu4:~# iperf3 -c 192.168.1.83
Connecting to host 192.168.1.83, port 5201
[ 5] local 192.168.1.84 port 41264 connected to 192.168.1.83 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 2.33 GBytes 20.0 Gbits/sec 0 1.25 MBytes
[ 5] 1.00-2.00 sec 2.38 GBytes 20.5 Gbits/sec 0 1.25 MBytes
[ 5] 2.00-3.00 sec 2.40 GBytes 20.6 Gbits/sec 0 1.25 MBytes
[ 5] 3.00-4.00 sec 2.43 GBytes 20.8 Gbits/sec 0 1.25 MBytes
[ 5] 4.00-5.00 sec 2.35 GBytes 20.2 Gbits/sec 0 1.25 MBytes
[ 5] 5.00-6.00 sec 2.37 GBytes 20.3 Gbits/sec 0 1.25 MBytes
[ 5] 6.00-7.00 sec 2.43 GBytes 20.8 Gbits/sec 0 1.25 MBytes
[ 5] 7.00-8.00 sec 2.42 GBytes 20.8 Gbits/sec 0 1.25 MBytes
[ 5] 8.00-9.00 sec 2.41 GBytes 20.7 Gbits/sec 0 1.25 MBytes
[ 5] 9.00-10.00 sec 2.42 GBytes 20.8 Gbits/sec 0 1.25 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 24.0 GBytes 20.6 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 24.0 GBytes 20.5 Gbits/sec receiver

Maybe I'm misunderstanding your public/private point there?
Or is there something CEPH specific that give CEPH only 10Gb/s that I might have configured incorrectly?

alexskysilk · Feb 28, 2023

please post
/etc/pve/ceph.conf

/etc/network/interfaces

Yet another CEPH tuning question (comparing to dell san)

Active Member

Famous Member

Active Member

Famous Member

Active Member

Distinguished Member

Renowned Member

Distinguished Member

Active Member

Distinguished Member

Distinguished Member

Member

Active Member

Distinguished Member

Active Member

Distinguished Member

Active Member

Distinguished Member

Active Member

Distinguished Member