2x 56GbE Optimization, slow ceph recovery, and MLNX-OS S.O.S.

jw6677

Active Member
Oct 19, 2019
I had this as a comment in another thread, but moved it here to its own thread.


I have a three-node Ceph cluster with what I presume is abnormally slow throughput, and I am trying to diagnose it.

Each node has a ConnectX-3 Pro dual-port QSFP+ card. I was originally running them via ib_ipoib, but in an attempt to get past the low-throughput issue I've moved away from IPoIB and am now using them in Ethernet mode, alongside an SX6036 switch that is also in Ethernet mode (with the Ethernet license key installed).


My primary cause for concern is low throughput across the ConnectX-3s: roughly 50-200 MB/s (not Mbit) of real-world throughput to and from ramdisks on both ends, depending on the test, and ~14 Gbit/s when using tools like iperf3 to avoid disk I/O completely.

While I understand that even with LACP I should not expect the full combined throughput of both links (112 Gbit/s), having seen closer to 50 Gbit/s with a single IPoIB connection, I was expecting a lot more from Ethernet across the board (which eliminates the translation to and from InfiniBand).
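
For reference, the iperf3 testing between nodes has been along these lines (192.168.2.2 is the bond0 address from the config further down; the -P run is there to see whether multiple streams actually spread across both bond members with the layer3+4 hash):
Code:
# on one node
iperf3 -s
# from a second node: one TCP stream, then 8 parallel streams
iperf3 -c 192.168.2.2 -t 30
iperf3 -c 192.168.2.2 -t 30 -P 8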



What prompted the whole situation was an upgrade to Ceph Pacific, where RDMA support (long only partially supported in Ceph) appears to be totally broken, forcing my hand.
I am currently backfilling a failed OSD, but at a rate of ~20 MB/s (137 days...), which does a good job of highlighting what I am dealing with, haha.

I am pretty confident the issue is not the hardware but the configuration, either on my Linux nodes or, equally likely, on the SX6036 switch itself.
--> I highlight the switch because it was a struggle getting LACP working. I note the nodes because I am deeply unfamiliar with this level of networking. This whole thing is a learning opportunity for me, and as I am sure others can relate, there are some really steep learning curves when self-teaching this stuff.


At the moment, I have two mlx4_en Ethernet ports per node, running as a 2x LACP (802.3ad) Linux bond; ethtool on bond0 reports:
Code:
Settings for bond0:
    Supported ports: [  ]
    Supported link modes:   Not reported
    Supported pause frame use: No
    Supports auto-negotiation: No
    Supported FEC modes: Not reported
    Advertised link modes:  Not reported
    Advertised pause frame use: No
    Advertised auto-negotiation: No
    Advertised FEC modes: Not reported
    Speed: 112000Mb/s <------------------------------------------- Pretty confident I should exceed current throughput.
    Duplex: Full
    Auto-negotiation: off
    Port: Other
    PHYAD: 0
    Transceiver: internal
    Link detected: yes
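
For completeness, the LACP negotiation can also be sanity-checked from the kernel's side; both slaves should show an MII status of up, 56000 Mb/s, and the same aggregator ID:
Code:
# bonding driver's own view of the 802.3ad state
cat /proc/net/bonding/bond0 | grep -E 'Bonding Mode|MII Status|Speed|Aggregator ID'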


The keen among us will want more details on the Ceph setup, so while I am fairly confident I have a networking issue, the high-level summary is:
Three nodes, each with 8x 8TB He8 SAS drives direct-attached via an LSI HBA in JBOD mode plus a backplane (man, the acronyms in this space are nuts, lol).
There is technically a fourth node, but it's not actually hosting anything Ceph-related.
Again, I am pretty sure it's a network-level issue, as testing is slow outside of Ceph too.




Typical bond offload features (ethtool -k bond0) look something like:
Code:
Features for bond0:
rx-checksumming: off [fixed]
tx-checksumming: on
    tx-checksum-ipv4: off [fixed]
    tx-checksum-ip-generic: on
    tx-checksum-ipv6: off [fixed]
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: off [fixed]
scatter-gather: on
    tx-scatter-gather: on
    tx-scatter-gather-fraglist: off [requested on]
tcp-segmentation-offload: on
    tx-tcp-segmentation: on
    tx-tcp-ecn-segmentation: on
    tx-tcp-mangleid-segmentation: on
    tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: on [fixed]
netns-local: on [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: off [fixed]
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [requested on]
tx-esp-segmentation: off
tx-udp-segmentation: off [requested on]
tx-gso-list: off [requested on]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: on [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off
esp-tx-csum-hw-offload: off
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: on [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]
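
(Most of the bond-level flags above are synthesized from the slaves by the bonding driver; the physical ports' own offloads can be checked the same way, e.g.:)
Code:
ethtool -k enp65s0 | grep -E 'large-receive-offload|tcp-segmentation-offload|scatter-gather:'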

The /etc/network/interfaces config, following random optimization articles online (so very likely a misconfig here):
Code:
auto enp65s0
iface enp65s0 inet manual
    mtu 9000
    post-up /sbin/ip link set dev enp65s0 txqueuelen 20000

auto enp65s0d1
iface enp65s0d1 inet manual
    mtu 9000
    post-up /sbin/ip link set dev enp65s0d1 txqueuelen 20000


auto bond0
iface bond0 inet static
    address 192.168.2.2/24
    mtu 9000
    post-up echo 1 > /proc/sys/net/ipv4/ip_forward
    post-up /sbin/ip link set dev bond0 txqueuelen 20000
    post-up /sbin/ip link set dev enp65s0 txqueuelen 20000
    post-up /sbin/ip link set dev enp65s0d1 txqueuelen 20000
    post-up /sbin/ip link set dev bond0 mtu 9000
    post-up /sbin/ip link set dev enp65s0 mtu 9000
    post-up /sbin/ip link set dev enp65s0d1 mtu 9000
    post-up   iptables -t nat -A POSTROUTING -s '192.168.2.0/24' -o bond0 -j MASQUERADE || true
    post-up ifconfig bond0 192.168.2.2
    post-up /usr/sbin/ethtool -K bond0 lro on
    post-down iptables -t nat -D POSTROUTING -s '192.168.2.0/24' -o bond0 -j MASQUERADE || true
    bond-slaves enp65s0 enp65s0d1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    bond-min-links 1
    bond-lacp-rate 1
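
A quick sanity check for the jumbo frames (the peer address is just a placeholder): a do-not-fragment ping with an 8972-byte payload (8972 + 28 bytes of IP/ICMP headers = 9000) should go through end to end, and one byte more should fail:
Code:
ping -M do -s 8972 -c 3 192.168.2.3
ping -M do -s 8973 -c 3 192.168.2.3   # should fail if the path MTU is exactly 9000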

The switch ports are also set to 9000 MTU, with LACP enabled. For completeness, the relevant sysctl.conf:

Code:
#
# /etc/sysctl.conf - Configuration file for setting system variables
# See /etc/sysctl.d/ for additional system variables.
# See sysctl.conf (5) for information.
#

#kernel.domainname = example.com

# Uncomment the following to stop low-level messages on console
#kernel.printk = 3 4 1 3

##############################################################3
# Functions previously found in netbase
#

# Uncomment the next two lines to enable Spoof protection (reverse-path filter)
# Turn on Source Address Verification in all interfaces to
# prevent some spoofing attacks
#net.ipv4.conf.default.rp_filter=1
#net.ipv4.conf.all.rp_filter=1

# Uncomment the next line to enable TCP/IP SYN cookies
# See http://lwn.net/Articles/277146/
# Note: This may impact IPv6 TCP sessions too
#net.ipv4.tcp_syncookies=1

# Uncomment the next line to enable packet forwarding for IPv4
net.ipv4.ip_forward=1

# Uncomment the next line to enable packet forwarding for IPv6
#  Enabling this option disables Stateless Address Autoconfiguration
#  based on Router Advertisements for this host
net.ipv6.conf.all.forwarding=1


###################################################################
# Additional settings - these settings can improve the network
# security of the host and prevent against some network attacks
# including spoofing attacks and man in the middle attacks through
# redirection. Some network environments, however, require that these
# settings are disabled so review and enable them as needed.
#
# Do not accept ICMP redirects (prevent MITM attacks)
#net.ipv4.conf.all.accept_redirects = 0
#net.ipv6.conf.all.accept_redirects = 0
# _or_
# Accept ICMP redirects only for gateways listed in our default
# gateway list (enabled by default)
# net.ipv4.conf.all.secure_redirects = 1
#
# Do not send ICMP redirects (we are not a router)
#net.ipv4.conf.all.send_redirects = 0
#
# Do not accept IP source route packets (we are not a router)
#net.ipv4.conf.all.accept_source_route = 0
#net.ipv6.conf.all.accept_source_route = 0
#
# Log Martian Packets
#net.ipv4.conf.all.log_martians = 1
#

###################################################################
# Magic system request Key
# 0=disable, 1=enable all, >1 bitmask of sysrq functions
# See https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html
# for what other values do
#kernel.sysrq=438
kernel.sysrq=1
#vm.swappiness=20



net.ipv4.tcp_window_scaling = 1


vm.overcommit_memory=1
#vm.swappiness=60
#vm.vfs_cache_pressure=200

net.ipv4.ip_forward=1
net.ipv4.conf.all.forwarding=1
net.ipv4.conf.default.forwarding=1
net.ipv6.conf.default.forwarding=1
#net.ipv4.conf.all.mc_forwarding=1
#net.ipv4.conf.default.mc_forwarding=1

#vm.max_map_count = 262144
#vm.dirty_writeback_centisecs = 1500
#vm.dirty_expire_centisecs = 1500

vm.overcommit_memory=1
vm.nr_hugepages = 1024


net.core.rmem_max=1677721600
net.core.rmem_default=167772160
net.core.wmem_max=1677721600
net.core.wmem_default=167772160

# set minimum size, initial size, and maximum size in bytes
net.ipv4.tcp_rmem="1024000 8738000 2147483647"
net.ipv4.tcp_wmem="1024000 8738000 2147483647"
net.ipv4.tcp_mem="1024000 8738000 2147483647"
net.ipv4.udp_mem="1024000 8738000 2147483647"

net.core.netdev_max_backlog=250000
net.ipv4.conf.all.forwarding=1
net.ipv4.conf.default.forwarding=1
net.ipv4.tcp_adv_win_scale=1
net.ipv4.tcp_low_latency=1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_window_scaling = 1
net.ipv6.conf.default.forwarding=1
net.ipv4.tcp_moderate_rcvbuf=1
net.ipv4.tcp_congestion_control=htcp
net.ipv4.tcp_mtu_probing=1
kernel.sysrq = 1
net.link.lagg.0.use_flowid=1
net.link.lagg.0.lacp.lacp_strict_mode=1
net.ipv4.tcp_sack=1
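
(I'm not sure every key above even exists on Linux; the net.link.lagg.* ones look like FreeBSD sysctls, so reading the values back shows which ones actually applied:)
Code:
# unknown keys are reported as errors when the file is re-applied
sysctl -p /etc/sysctl.conf
sysctl net.ipv4.tcp_rmem net.core.rmem_max net.ipv4.tcp_congestion_control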


Switch setup, where the issue is likely present:
Code:
Mellanox MLNX-OS Switch Management

Password:
Last login: Sun Jul 11 13:33:25 2021 from 192.168.10.132

Mellanox Switch

switch-625810 [standalone: master] > enable
switch-625810 [standalone: master] # show running-config
##
## Running database "initial"
## Generated at 2021/07/12 01:03:16 +0000
## Hostname: switch-625810
##

##
## Running-config temporary prefix mode setting
##
no cli default prefix-modes enable

##
## License keys
##
license install REDACTED (Ethernet and all the bells and whistles)

##
## Interface Ethernet configuration
##
interface port-channel 1-4
interface ethernet 1/1-1/2 speed 56000 force
interface ethernet 1/13-1/14 speed 56000 force
interface ethernet 1/27-1/30 speed 56000 force
interface ethernet 1/1-1/2 mtu 9000 force
interface ethernet 1/7-1/8 mtu 9000 force
interface ethernet 1/13-1/14 mtu 9000 force
interface ethernet 1/25-1/30 mtu 9000 force
interface port-channel 1-4 mtu 9000 force
interface ethernet 1/1-1/2 channel-group 1 mode active
interface ethernet 1/7-1/8 channel-group 4 mode active
interface ethernet 1/13-1/14 channel-group 2 mode active
interface ethernet 1/25-1/26 channel-group 3 mode active

##
## LAG configuration
##
lacp
interface port-channel 1-4 lacp-individual enable force
port-channel load-balance ethernet source-destination-port
interface ethernet 1/1-1/2 lacp rate fast
interface ethernet 1/7-1/8 lacp rate fast
interface ethernet 1/13-1/14 lacp rate fast
interface ethernet 1/25-1/26 lacp rate fast

##
## STP configuration
##
no spanning-tree

##
## L3 configuration
##
ip routing vrf default


##
## DCBX PFC configuration
##
dcb priority-flow-control priority 3 enable
dcb priority-flow-control priority 4 enable
interface ethernet 1/1-1/36 no dcb priority-flow-control mode on force


##
## LLDP configuration
##
lldp

##
## IGMP Snooping configuration
##
ip igmp snooping proxy reporting
ip igmp snooping
vlan 1 ip igmp snooping querier

##
## PIM configuration
##
protocol pim

##
## IP Multicast router configuration
##
ip multicast-routing

##
## Network interface configuration
##
interface mgmt0 ip address REDACTED /16

##
## Local user account configuration
##
username admin password REDACTED

##
## AAA remote server configuration
##
# ldap bind-password ********
# radius-server key ********
# tacacs-server key ********


##
## Network management configuration
##
# web proxy auth basic password ********

##
## X.509 certificates configuration
##
#
# Certificate name system-self-signed, ID REDACTED
# (public-cert config omitted since private-key config is hidden)

##
## Persistent prefix mode setting
##
cli default prefix-modes enable

Any chance some kind soul can point out my mistake, or point me in the right direction?
 
I don't know how RDMA works with an LACP bond, but with TCP/UDP, an LACP bond always uses one link for each TCP connection. Since with Ceph every OSD is connected to all the other OSDs with multiple connections, it scales well. (With RDMA in the middle, I don't know.)

Now, about your backfill speed.

It's possible to speed up backfilling with:

ceph config set osd osd_max_backfills X
ceph config set osd osd_recovery_max_active_hdd X
ceph config set osd osd_recovery_max_active_ssd X

Code:
osd_max_backfills
The maximum number of backfills allowed to or from a single OSD. Note that this is applied separately for read and write operations.

default: 1

osd_recovery_max_active_hdd
The number of active recovery requests per OSD at one time, if the primary device is rotational.

default: 3

osd_recovery_max_active_ssd
The number of active recovery requests per OSD at one time, if the primary device is non-rotational (i.e., an SSD).

default: 10
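
After changing them, you can check what the OSDs are actually running with, for example (osd.0 is just an example id):
Code:
ceph config get osd osd_max_backfills
ceph config show osd.0 osd_recovery_max_active_hdd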


Also, in Pacific there is a new QoS algorithm to avoid impacting VM workloads during backfilling/recovery. (I don't know if your current usage is already high even without backfilling?)
https://ceph.io/en/news/blog/2021/qos-study-with-mclock-and-wpq-schedulers/
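
If you end up on the new mclock scheduler, there is also a built-in profile that prioritizes recovery over client I/O; something like this should select it (not tested on my side):
Code:
ceph config set osd osd_mclock_profile high_recovery_ops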
 
Thanks Spirit, I've already been playing with Ceph's configuration, and while the new QoS is new to me, it doesn't appear to have affected recovery speed at all. This further leads me to believe the issue is likely in the network itself, not Ceph or I/O limitations.

Checking iostat for this node, the highest disk utilization is 10-20%, further backing this line of thinking. :S

Usage isn't particularly high outside of backfilling, especially right now when everything is so slow anyway, but even on a normal day it's mostly containers set up "just to try new software and things" which aren't actually running most of the time, and the couple of nodes that are always active tend to be only periodically loaded (Proxmox Backup Server, for example).
 
While I know this counts for very little, iperf3 running server and client on the same node looks awesome, and suggests that bonding is functioning properly:
Code:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   132 GBytes   113 Gbits/sec  3509             sender
[  5]   0.00-10.00  sec   132 GBytes   113 Gbits/sec                  receiver
 
Actually, that was a really useful exercise.

So, one node was showing the result above, super high throughput, but a second was not. On that second node, I brought down one of the two connections and saw no change in iperf3 speed.

That's odd.

Beginning to suspect internal routing issues.
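
Next step is confirming which path and source address the kernel actually picks towards the other nodes, something like (the peer IP is just an example):
Code:
ip route get 192.168.2.3
ip -d link show bond0   # bond mode and xmit hash policy as the kernel sees them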
 
Closing the loop here: watch closely for the bandwidth limits of PCIe Gen 3 x4 slots, which turned out to be the limiting factor in my system.
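
For anyone hitting the same wall, the rough ceiling of a Gen 3 x4 slot (8 GT/s per lane with 128b/130b encoding, before any PCIe protocol overhead) works out to:
Code:
# 8 GT/s * 128/130 * 4 lanes ~= 31.5 Gbit/s, i.e. roughly 3.9 GB/s,
# well under even a single 56 Gb port, let alone a 2x56 bond
echo "8 * (128/130) * 4" | bc -l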
 
