[SOLVED] Ceph & output discards (queue drops) on switchport

cipwurzel

Active Member
Sep 8, 2017
I have a fresh Proxmox installation on 5 servers (Xeon E5-1660 v4, 128 GB RAM), each with 8 Samsung SSD SM863 960GB connected to an LSI-9300-8i (SAS3008) controller and used as OSDs for Ceph.

The servers are connected to two Arista DCS-7060CX-32S switches. I'm using an MLAG bond (bond mode LACP, xmit_hash_policy layer3+4, MTU 9000):
  • backend network for Ceph: Mellanox ConnectX-4 Lx dual-port 25 GBit/s
  • frontend network: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ dual-port
Ceph is a mostly default installation with size=3.

My problem:
I'm running a dd (dd if=/dev/urandom of=urandom.0 bs=10M count=1024) in a test virtual machine (the only one running in the cluster), which writes at around 210 MB/s. I get output drops on all switch ports. The drop rate is between 0.1 % and 0.9 %; the 0.9 % is reached when writing at about 1300 MB/s into Ceph.

At first I suspected the Mellanox cards and moved the Ceph traffic to the Intel cards. The problem persists there as well.

I tried quite a lot and nothing helped:
  • changed the MTU from 9000 to 1500
  • changed bond_xmit_hash_policy from layer3+4 to layer2+3
  • deactivated the bond and just used a single link
  • disabled offloading
  • disabled power management in BIOS
  • perf-bias 0
I analyzed the traffic via tcpdump and saw some of these "errors":
  • TCP Previous segment not captured
  • TCP Out-of-Order
  • TCP Retransmission
  • TCP Fast Retransmission
  • TCP Dup ACK
  • TCP ACKed unseen segment
  • TCP Window Update
Is that behavior normal for Ceph, or does anyone have ideas how to solve the problem?

With iperf I can reach the full 50 GBit/s on the bond with zero output drops.


Ceph statistics:
Code:
  cluster:
    id:     bc8d51e7-e62d-44f0-91ee-90f0e1a784e5
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum nethcn-b1,nethcn-b3,nethcn-b5
    mgr: nethcn-b1(active), standbys: nethcn-b5, nethcn-b3
    osd: 40 osds: 40 up, 40 in
 
  data:
    pools:   1 pools, 2048 pgs
    objects: 14973 objects, 54911 MB
    usage:   199 GB used, 35566 GB / 35766 GB avail
    pgs:     2048 active+clean
 
  io:
    client:   6477 B/s rd, 192 MB/s wr, 1 op/s rd, 206 op/s wr

Output discards:
Code:
sw-nic-1.10#show interfaces Et17/1 | grep "output discards"
     0 late collision, 0 deferred, 6979 output discards
sw-nic-1.10#show interfaces Et17/2 | grep "output discards"
     0 late collision, 0 deferred, 1189 output discards
sw-nic-1.10#show interfaces Et18/1 | grep "output discards"
     0 late collision, 0 deferred, 4297 output discards
sw-nic-1.10#show interfaces Et18/2 | grep "output discards"
     0 late collision, 0 deferred, 2936 output discards
sw-nic-1.10#show interfaces Et17/3 | grep "output discards"
     0 late collision, 0 deferred, 17244 output discards
Code:
sw-nic-2.11#show interfaces Et17/1 | grep "output discards"
     0 late collision, 0 deferred, 2174 output discards
sw-nic-2.11#show interfaces Et17/2 | grep "output discards"
     0 late collision, 0 deferred, 1378 output discards
sw-nic-2.11#show interfaces Et18/1 | grep "output discards"
     0 late collision, 0 deferred, 1739 output discards
sw-nic-2.11#show interfaces Et18/2 | grep "output discards"
     0 late collision, 0 deferred, 3397 output discards
sw-nic-2.11#show interfaces Et17/3 | grep "output discards"
     0 late collision, 0 deferred, 18089 output discards

One interface with details:
Code:
sw-nic-1.10#show int Et18/1
Ethernet18/1 is up, line protocol is up (connected)
  Hardware is Ethernet, address is 444c.a8e9.3694 (bia 444c.a8e9.3694)
  Description: nethcn-b3 mlx0 trunk po1181
  Member of Port-Channel1181
  Ethernet MTU 9214 bytes , BW 25000000 kbit
  Full-duplex, 25Gb/s, auto negotiation: off, uni-link: n/a
  Up 48 minutes, 15 seconds
  Loopback Mode : None
  0 link status changes since last clear
  Last clearing of "show interface" counters 0:31:57 ago
  5 seconds input rate 2.30 Mbps (0.0% with framing overhead), 190 packets/sec
  5 seconds output rate 2.07 Mbps (0.0% with framing overhead), 187 packets/sec
     936771 packets input, 4962160459 bytes
     Received 0 broadcasts, 1847 multicast
     0 runts, 0 giants
     0 input errors, 0 CRC, 0 alignment, 0 symbol, 0 input discards
     0 PAUSE input
     1115445 packets output, 6727066114 bytes
     Sent 0 broadcasts, 2882 multicast
     0 output errors, 0 collisions
     0 late collision, 0 deferred, 4297 output discards
     0 PAUSE output
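
For anyone who wants to turn these counters into a drop rate, here is a rough Python sketch. It assumes the rate is simply "output discards" divided by "packets output" from the `show interfaces` text above; whether the switch counts discarded packets inside "packets output" is not something the output tells us, so treat the result as an approximation. The function name `drop_rate` is made up for this example. Also note that the counters cover the whole interval since the last clear, so this is an average and not a peak value.

```python
import re


def drop_rate(show_interface_text: str) -> float:
    """Return output discards as a fraction of packets output.

    Assumption: discards / packets output is a good enough estimate;
    the switch may or may not include discards in 'packets output'.
    """
    packets = re.search(r"(\d+) packets output", show_interface_text)
    discards = re.search(r"(\d+) output discards", show_interface_text)
    if packets is None or discards is None:
        raise ValueError("counters not found in interface output")
    sent = int(packets.group(1))
    dropped = int(discards.group(1))
    return dropped / sent if sent else 0.0


# Two lines copied from the Et18/1 output above.
sample = """\
     1115445 packets output, 6727066114 bytes
     0 late collision, 0 deferred, 4297 output discards
"""

rate = drop_rate(sample)
print(f"{rate:.2%}")  # prints 0.39%
```

For Et18/1 this lands inside the 0.1 % to 0.9 % range mentioned in the first post.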
 
Flow control was active on the NIC but not on the switch.

Enabling flow control in both directions on the switch ports solved the problem:
flowcontrol receive on
flowcontrol send on

Code:
Port        Send FlowControl  Receive FlowControl  RxPause       TxPause     
            admin    oper     admin    oper                                   
----------  -------- -------- -------- --------    ------------- -------------
Et17/1      on       on       on       on          0             64500       
Et17/2      on       on       on       on          0             33746       
Et17/3      on       on       on       on          0             17126       
Et18/1      on       on       on       on          0             36948       
Et18/2      on       on       on       on          0             39628
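
To check a table like this across many ports without reading it by eye, something along these lines could work. It is only a sketch: it assumes each port row has exactly the seven whitespace-separated columns shown above, and the helper name `parse_flowcontrol` and the dictionary keys are invented for this example.

```python
def parse_flowcontrol(table: str) -> dict:
    """Parse port rows of the flow control table into a dict per port.

    Assumption: rows look like
    'Et17/1  on  on  on  on  0  64500' (7 whitespace-separated fields).
    Header and separator lines are skipped.
    """
    ports = {}
    for line in table.splitlines():
        fields = line.split()
        if len(fields) == 7 and fields[0].startswith("Et"):
            port, _s_admin, s_oper, _r_admin, r_oper, rx, tx = fields
            ports[port] = {
                "send_oper": s_oper,
                "recv_oper": r_oper,
                "rx_pause": int(rx),
                "tx_pause": int(tx),
            }
    return ports


# Rows copied from the table above.
table = """\
Et17/1      on       on       on       on          0             64500
Et17/2      on       on       on       on          0             33746
Et17/3      on       on       on       on          0             17126
Et18/1      on       on       on       on          0             36948
Et18/2      on       on       on       on          0             39628
"""

ports = parse_flowcontrol(table)
for name, p in ports.items():
    ok = p["send_oper"] == "on" and p["recv_oper"] == "on"
    print(name, "flow control active" if ok else "CHECK", "TxPause:", p["tx_pause"])
```

A non-zero TxPause together with zero RxPause, as in the table, means the switch is the side sending pause frames to the hosts.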
 
thanks for sharing