I have a fresh Proxmox installation on 5 servers (Xeon E5-1660 v4, 128 GB RAM), each with 8 Samsung SSD SM863 960GB drives connected to an LSI-9300-8i (SAS3008) controller and used as OSDs for Ceph.
The servers are connected to two Arista DCS-7060CX-32S switches. I'm using an MLAG bond (bond mode LACP, xmit_hash_policy layer3+4, MTU 9000).
- backend network for Ceph: Mellanox ConnectX-4 Lx dual-port 25 GBit/s
- frontend network: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ dual-port
My problem:
I'm running dd (dd if=/dev/urandom of=urandom.0 bs=10M count=1024) in a test virtual machine (the only one running in the cluster), which writes at around 210 MB/s. I get output drops on all switch ports. The drop rate is between 0.1 % and 0.9 %; it reaches 0.9 % when writing into Ceph at about 1300 MB/s.
At first I suspected the Mellanox cards and moved the Ceph traffic to the Intel cards. The problem persists.
I tried quite a lot, and nothing helped:
- changed the MTU from 9000 to 1500
- changed bond_xmit_hash_policy from layer3+4 to layer2+3
- deactivated the bond and just used a single link
- disabled offloading
- disabled power management in BIOS
- perf-bias 0
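For anyone who wants to repeat these steps, here is a minimal sketch of how three of them (MTU, offloading, perf-bias) could be expressed as commands. The interface name `bond0`, the helper name `tuning_commands`, and the particular offload features switched off are assumptions for illustration and not necessarily what was run here. The bond hash policy and the single-link test are changes to the network configuration and are left out. The script only prints the commands and executes nothing.

```python
import shlex


def tuning_commands(iface="bond0"):
    """Return argv lists for some of the changes listed above.

    Nothing is executed. `iface` and the chosen offload features
    are illustrative assumptions.
    """
    return [
        # Change the MTU from 9000 to 1500.
        ["ip", "link", "set", "dev", iface, "mtu", "1500"],
        # Disable common segmentation/receive offloads.
        ["ethtool", "-K", iface, "tso", "off", "gso", "off", "gro", "off"],
        # Set the CPU energy/performance bias to 0 (maximum performance).
        ["cpupower", "set", "--perf-bias", "0"],
    ]


if __name__ == "__main__":
    for argv in tuning_commands():
        print(shlex.join(argv))
```

Note that `ethtool -K` applied to a bond does not necessarily change the member ports, so the same commands may have to be applied per physical interface.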
A packet capture of the Ceph traffic shows the following TCP analysis flags:
- TCP Previous segment not captured
- TCP Out-of-Order
- TCP Retransmission
- TCP Fast Retransmission
- TCP Dup ACK
- TCP ACKed unseen segment
- TCP Window Update
With iperf I can reach full 50 GBit/s on the bond with zero output drops.
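To put the numbers side by side, the following sketch converts the two write rates mentioned above into a share of the 50 GBit/s that iperf reaches. It assumes 1 MB = 10^6 bytes and looks only at the client write rate; Ceph replication adds traffic on the backend network, so the real load on the wire is higher than these figures. The function name `link_utilisation` is made up for this example.

```python
BOND_GBIT = 50.0  # bond throughput measured with iperf (from the post)


def link_utilisation(mb_per_s, link_gbit=BOND_GBIT):
    """Share of the link (in percent) used by a payload rate in MB/s.

    Assumes 1 MB = 10**6 bytes and ignores protocol overhead.
    """
    gbit_per_s = mb_per_s * 8 / 1000
    return 100 * gbit_per_s / link_gbit


if __name__ == "__main__":
    for rate in (210, 1300):
        gbit = rate * 8 / 1000
        print(f"{rate} MB/s = {gbit:.2f} Gbit/s, "
              f"{link_utilisation(rate):.1f} % of the bond")
```

Under these assumptions 210 MB/s is about 1.68 Gbit/s (roughly 3.4 % of the bond) and 1300 MB/s is about 10.4 Gbit/s (roughly 20.8 %), so the average client rate is well below line rate in both cases.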
Ceph statistics:
Code:
  cluster:
    id:     bc8d51e7-e62d-44f0-91ee-90f0e1a784e5
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum nethcn-b1,nethcn-b3,nethcn-b5
    mgr: nethcn-b1(active), standbys: nethcn-b5, nethcn-b3
    osd: 40 osds: 40 up, 40 in
  data:
    pools:   1 pools, 2048 pgs
    objects: 14973 objects, 54911 MB
    usage:   199 GB used, 35566 GB / 35766 GB avail
    pgs:     2048 active+clean
  io:
    client:   6477 B/s rd, 192 MB/s wr, 1 op/s rd, 206 op/s wr
Output discards:
Code:
sw-nic-1.10#show interfaces Et17/1 | grep "output discards"
0 late collision, 0 deferred, 6979 output discards
sw-nic-1.10#show interfaces Et17/2 | grep "output discards"
0 late collision, 0 deferred, 1189 output discards
sw-nic-1.10#show interfaces Et18/1 | grep "output discards"
0 late collision, 0 deferred, 4297 output discards
sw-nic-1.10#show interfaces Et18/2 | grep "output discards"
0 late collision, 0 deferred, 2936 output discards
sw-nic-1.10#show interfaces Et17/3 | grep "output discards"
0 late collision, 0 deferred, 17244 output discards
Code:
sw-nic-2.11#show interfaces Et17/1 | grep "output discards"
0 late collision, 0 deferred, 2174 output discards
sw-nic-2.11#show interfaces Et17/2 | grep "output discards"
0 late collision, 0 deferred, 1378 output discards
sw-nic-2.11#show interfaces Et18/1 | grep "output discards"
0 late collision, 0 deferred, 1739 output discards
sw-nic-2.11#show interfaces Et18/2 | grep "output discards"
0 late collision, 0 deferred, 3397 output discards
sw-nic-2.11#show interfaces Et17/3 | grep "output discards"
0 late collision, 0 deferred, 18089 output discards
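The counters above are easier to compare once they are summed per switch. The following sketch parses the pasted CLI output and reports the total and the worst port for each switch. The regular expressions are written against exactly the text shown here and are an assumption about its shape, not a general parser for Arista output; `parse_discards` is a name chosen for this example.

```python
import re
from collections import defaultdict

CLI_OUTPUT = """\
sw-nic-1.10#show interfaces Et17/1 | grep "output discards"
0 late collision, 0 deferred, 6979 output discards
sw-nic-1.10#show interfaces Et17/2 | grep "output discards"
0 late collision, 0 deferred, 1189 output discards
sw-nic-1.10#show interfaces Et18/1 | grep "output discards"
0 late collision, 0 deferred, 4297 output discards
sw-nic-1.10#show interfaces Et18/2 | grep "output discards"
0 late collision, 0 deferred, 2936 output discards
sw-nic-1.10#show interfaces Et17/3 | grep "output discards"
0 late collision, 0 deferred, 17244 output discards
sw-nic-2.11#show interfaces Et17/1 | grep "output discards"
0 late collision, 0 deferred, 2174 output discards
sw-nic-2.11#show interfaces Et17/2 | grep "output discards"
0 late collision, 0 deferred, 1378 output discards
sw-nic-2.11#show interfaces Et18/1 | grep "output discards"
0 late collision, 0 deferred, 1739 output discards
sw-nic-2.11#show interfaces Et18/2 | grep "output discards"
0 late collision, 0 deferred, 3397 output discards
sw-nic-2.11#show interfaces Et17/3 | grep "output discards"
0 late collision, 0 deferred, 18089 output discards
"""

# Prompt line: "<switch>#show interfaces <port> ..."
PROMPT = re.compile(r"^(?P<switch>[\w.-]+)#show interfaces (?P<port>\S+)")
# Counter line: "... <n> output discards"
DISCARDS = re.compile(r"(\d+) output discards")


def parse_discards(text):
    """Map switch name -> {port: output discards} from pasted CLI text."""
    result = defaultdict(dict)
    switch = port = None
    for line in text.splitlines():
        prompt = PROMPT.match(line)
        if prompt:
            switch, port = prompt["switch"], prompt["port"]
            continue
        counter = DISCARDS.search(line)
        if counter and switch is not None:
            result[switch][port] = int(counter.group(1))
    return result


if __name__ == "__main__":
    for switch, ports in parse_discards(CLI_OUTPUT).items():
        worst = max(ports, key=ports.get)
        print(f"{switch}: total {sum(ports.values())}, "
              f"worst port {worst} ({ports[worst]})")
```

On both switches Et17/3 carries by far the highest count (17244 and 18089), which may be worth checking against the server attached to that port.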
One interface with details:
Code:
sw-nic-1.10#show int Et18/1
Ethernet18/1 is up, line protocol is up (connected)
Hardware is Ethernet, address is 444c.a8e9.3694 (bia 444c.a8e9.3694)
Description: nethcn-b3 mlx0 trunk po1181
Member of Port-Channel1181
Ethernet MTU 9214 bytes , BW 25000000 kbit
Full-duplex, 25Gb/s, auto negotiation: off, uni-link: n/a
Up 48 minutes, 15 seconds
Loopback Mode : None
0 link status changes since last clear
Last clearing of "show interface" counters 0:31:57 ago
5 seconds input rate 2.30 Mbps (0.0% with framing overhead), 190 packets/sec
5 seconds output rate 2.07 Mbps (0.0% with framing overhead), 187 packets/sec
936771 packets input, 4962160459 bytes
Received 0 broadcasts, 1847 multicast
0 runts, 0 giants
0 input errors, 0 CRC, 0 alignment, 0 symbol, 0 input discards
0 PAUSE input
1115445 packets output, 6727066114 bytes
Sent 0 broadcasts, 2882 multicast
0 output errors, 0 collisions
0 late collision, 0 deferred, 4297 output discards
0 PAUSE output
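As a cross-check of the drop rate quoted earlier, this sketch computes it for Et18/1 from the two counters in the output above. It assumes the rate is simply output discards divided by packets output; whether the switch counts discarded packets inside "packets output" is not known here, so treat the result as approximate. The name `discard_rate` is hypothetical.

```python
import re

# The two relevant lines from the "show int Et18/1" output above.
DETAIL = """\
1115445 packets output, 6727066114 bytes
0 late collision, 0 deferred, 4297 output discards
"""


def discard_rate(text):
    """Output discards as a percentage of packets output.

    Assumption: rate = discards / packets output.
    """
    sent = int(re.search(r"(\d+) packets output", text).group(1))
    dropped = int(re.search(r"(\d+) output discards", text).group(1))
    return 100 * dropped / sent


if __name__ == "__main__":
    print(f"Et18/1 discard rate: {discard_rate(DETAIL):.2f} %")
```

For this interface the result is roughly 0.39 %, which falls inside the 0.1 - 0.9 % range observed across the ports.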