[SOLVED] Ceph & output discards (queue drops) on switchport

cipwurzel

I have a fresh Proxmox installation on 5 servers (Xeon E5-1660 v4, 128 GB RAM), each with 8 Samsung SSD SM863 960 GB drives connected to an LSI 9300-8i (SAS3008) controller and used as OSDs for Ceph.

The servers are connected to two Arista DCS-7060CX-32S switches using an MLAG bond (bond mode LACP, xmit_hash_policy layer3+4, MTU 9000):
  • backend network for Ceph: Mellanox ConnectX-4 Lx dual-port 25 GBit/s
  • frontend network: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ dual-port
Ceph is a mostly default installation with size=3.

My problem:
I'm issuing a dd (dd if=/dev/urandom of=urandom.0 bs=10M count=1024) in a test virtual machine (the only one running in the cluster), which writes at around 210 MB/s. I get output drops on all switch ports. The drop rate is between 0.1 % and 0.9 %; the 0.9 % rate is reached when writing at about 1300 MB/s into Ceph.

At first I suspected a problem with the Mellanox cards and moved the Ceph traffic to the Intel cards, but the problem persisted.

I tried quite a lot, and nothing helped:
  • changed the MTU from 9000 to 1500
  • changed bond_xmit_hash_policy from layer3+4 to layer2+3
  • deactivated the bond and just used a single link
  • disabled offloading
  • disabled power management in BIOS
  • perf-bias 0
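For reference, the offloading step would typically be done with ethtool along these lines (the interface name eth0 is a placeholder; the original post doesn't show the exact commands):

```shell
# Inspect the current offload settings of the NIC
ethtool -k eth0

# Disable the common offloads (TSO/GSO/GRO) that can interact
# with bonding and segmentation behaviour on the wire
ethtool -K eth0 tso off gso off gro off
```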
I analyzed the traffic with tcpdump and saw errors such as:
  • TCP Previous segment not captured
  • TCP Out-of-Order
  • TCP Retransmission
  • TCP Fast Retransmission
  • TCP Dup ACK
  • TCP ACKed unseen segment
  • TCP Window Update
Is this behavior normal for Ceph, or does anyone have an idea how to solve the problem?

With iperf I can reach full 50 GBit/s on the bond with zero output drops.
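The post doesn't show the exact iperf invocation; a test that can saturate both 25 Gbit/s links of the bond would look roughly like this (hostname and stream count are assumptions):

```shell
# On the receiving node
iperf3 -s

# On the sending node: several parallel streams, so the LACP
# layer3+4 hash can spread the flows across both bond members
iperf3 -c nethcn-b3 -P 8 -t 30
```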


Ceph statistics:
Code:
  cluster:
    id:     bc8d51e7-e62d-44f0-91ee-90f0e1a784e5
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum nethcn-b1,nethcn-b3,nethcn-b5
    mgr: nethcn-b1(active), standbys: nethcn-b5, nethcn-b3
    osd: 40 osds: 40 up, 40 in
 
  data:
    pools:   1 pools, 2048 pgs
    objects: 14973 objects, 54911 MB
    usage:   199 GB used, 35566 GB / 35766 GB avail
    pgs:     2048 active+clean
 
  io:
    client:   6477 B/s rd, 192 MB/s wr, 1 op/s rd, 206 op/s wr

Output discards:
Code:
sw-nic-1.10#show interfaces Et17/1 | grep "output discards"
     0 late collision, 0 deferred, 6979 output discards
sw-nic-1.10#show interfaces Et17/2 | grep "output discards"
     0 late collision, 0 deferred, 1189 output discards
sw-nic-1.10#show interfaces Et18/1 | grep "output discards"
     0 late collision, 0 deferred, 4297 output discards
sw-nic-1.10#show interfaces Et18/2 | grep "output discards"
     0 late collision, 0 deferred, 2936 output discards
sw-nic-1.10#show interfaces Et17/3 | grep "output discards"
     0 late collision, 0 deferred, 17244 output discards
Code:
sw-nic-2.11#show interfaces Et17/1 | grep "output discards"
     0 late collision, 0 deferred, 2174 output discards
sw-nic-2.11#show interfaces Et17/2 | grep "output discards"
     0 late collision, 0 deferred, 1378 output discards
sw-nic-2.11#show interfaces Et18/1 | grep "output discards"
     0 late collision, 0 deferred, 1739 output discards
sw-nic-2.11#show interfaces Et18/2 | grep "output discards"
     0 late collision, 0 deferred, 3397 output discards
sw-nic-2.11#show interfaces Et17/3 | grep "output discards"
     0 late collision, 0 deferred, 18089 output discards

One interface with details:
Code:
sw-nic-1.10#show int Et18/1
Ethernet18/1 is up, line protocol is up (connected)
  Hardware is Ethernet, address is 444c.a8e9.3694 (bia 444c.a8e9.3694)
  Description: nethcn-b3 mlx0 trunk po1181
  Member of Port-Channel1181
  Ethernet MTU 9214 bytes , BW 25000000 kbit
  Full-duplex, 25Gb/s, auto negotiation: off, uni-link: n/a
  Up 48 minutes, 15 seconds
  Loopback Mode : None
  0 link status changes since last clear
  Last clearing of "show interface" counters 0:31:57 ago
  5 seconds input rate 2.30 Mbps (0.0% with framing overhead), 190 packets/sec
  5 seconds output rate 2.07 Mbps (0.0% with framing overhead), 187 packets/sec
     936771 packets input, 4962160459 bytes
     Received 0 broadcasts, 1847 multicast
     0 runts, 0 giants
     0 input errors, 0 CRC, 0 alignment, 0 symbol, 0 input discards
     0 PAUSE input
     1115445 packets output, 6727066114 bytes
     Sent 0 broadcasts, 2882 multicast
     0 output errors, 0 collisions
     0 late collision, 0 deferred, 4297 output discards
     0 PAUSE output
 
Flow control was active on the NIC but not on the switch. Enabling flow control in both directions on the switch ports solved the problem:

Code:
flowcontrol receive on
flowcontrol send on

Code:
Port        Send FlowControl  Receive FlowControl  RxPause       TxPause     
            admin    oper     admin    oper                                   
----------  -------- -------- -------- --------    ------------- -------------
Et17/1      on       on       on       on          0             64500       
Et17/2      on       on       on       on          0             33746       
Et17/3      on       on       on       on          0             17126       
Et18/1      on       on       on       on          0             36948       
Et18/2      on       on       on       on          0             39628
 
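On the Linux side, the corresponding NIC pause-frame settings can be checked and set with ethtool (interface name eth0 is a placeholder; this step isn't shown in the original post):

```shell
# Show the current pause-frame (flow control) settings of the NIC
ethtool -a eth0

# Enable RX and TX pause frames to match the switch configuration
ethtool -A eth0 rx on tx on
```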
thanks for sharing
 
