Ceph slower after upgrading all nodes from 10 to 40 gig ethernet

Nathan Stratton

I have been upgrading my customer's cluster node by node from 10 gig to 40 gig; now that all nodes are on 40 gig I am seeing very, very slow Ceph. The setup is dual E5-2690v2 with a dual 40 gig bond into a Cisco 3132Q. The OSDs are Samsung 960 EVO 256G NVMe, 2 in each server.

Writes look terrible:

Total time run: 92.179118
Total writes made: 130
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 5.64119
Stddev Bandwidth: 32.0926
Max bandwidth (MB/sec): 304
Min bandwidth (MB/sec): 0
Average IOPS: 1
Stddev IOPS: 8
Max IOPS: 76
Min IOPS: 0
Average Latency(s): 11.3449
Stddev Latency(s): 29.3919
Max latency(s): 92.002
Min latency(s): 0.0157386

Reads look fine:

root@virt0:/home# rados bench -p fast 60 seq
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
Total time run: 0.152010
Total reads made: 130
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 3420.81
Average IOPS: 855
Stddev IOPS: 0
Max IOPS: 0
Min IOPS: 2147483647
Average Latency(s): 0.0175385
Max latency(s): 0.0434704
Min latency(s): 0.00729327

Network I/O looks fine:

root@virt0:/home# iperf -c virt4 -P 4
------------------------------------------------------------
Client connecting to virt4, TCP port 5001
TCP window size: 325 KByte (default)
------------------------------------------------------------
[ 6] local 10.88.64.120 port 38350 connected with 10.88.64.124 port 5001
[ 3] local 10.88.64.120 port 38344 connected with 10.88.64.124 port 5001
[ 5] local 10.88.64.120 port 38348 connected with 10.88.64.124 port 5001
[ 4] local 10.88.64.120 port 38346 connected with 10.88.64.124 port 5001
[ ID] Interval Transfer Bandwidth
[ 6] 0.0-10.0 sec 12.0 GBytes 10.3 Gbits/sec
[ 3] 0.0-10.0 sec 12.0 GBytes 10.3 Gbits/sec
[ 5] 0.0-10.0 sec 12.0 GBytes 10.3 Gbits/sec
[ 4] 0.0-10.0 sec 10.2 GBytes 8.73 Gbits/sec
[SUM] 0.0-10.0 sec 46.1 GBytes 39.6 Gbits/sec

Latency looks OK:

root@virt4:~# ceph osd perf | sort -n
osd commit_latency(ms) apply_latency(ms)
0 3 3
1 2 2
2 4 4
3 2 2
4 3 3
5 2 2
6 2 2
7 2 2
8 3 3
9 6 6
10 1 1
11 6 6
12 3 3
13 4 4
14 2 2
15 2 2
16 3 3
17 3 3
18 3 3
19 4 4
20 2 2
21 2 2
22 3 3
23 3 3
 
Technically, 40GbE is equal to 10GbE; the only difference is that with 40GbE you have 4x 10GbE lanes. 40GbE is generally older than SFP28 with 25GbE, which is one lane of a 100GbE port.

What hashing algorithm do you use? Do you use jumbo frames? Is everything configured correctly? Did you check the bond status? How do you know that the storage is slower than before?
 
allow-vmbr0 bond0
iface bond0 inet manual
    ovs_bonds enp129s0 enp129s0d1
    ovs_type OVSBond
    ovs_bridge vmbr0
    ovs_options lacp=active bond_mode=balance-tcp
    mtu 9000

Jumbo frames were tested with ping -s 9000 between all 12 nodes.
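(Note: ping -s 9000 without the don't-fragment flag can still succeed on a path that only passes 1500-byte frames, because the kernel simply fragments the packets. A stricter end-to-end jumbo-frame check, assuming the same host names, forces DF and subtracts the 28 bytes of IP and ICMP headers from the 9000-byte MTU:)

Code:
root@virt0:~# ping -M do -s 8972 -c 3 virt4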

root@virt11:~# ovs-appctl bond/show bond0
---- bond0 ----
bond_mode: balance-tcp
bond may use recirculation: yes, Recirc-ID : 1
bond-hash-basis: 0
updelay: 0 ms
downdelay: 0 ms
next rebalance: 8016 ms
lacp_status: negotiated
active slave mac: 24:be:05:ce:74:22(enp129s0d1)
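(The bond/show output above is cut off before the per-slave sections; as a quick sketch assuming the same bond name, the LACP state of both slaves can also be confirmed with:)

Code:
root@virt11:~# ovs-appctl lacp/show bond0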

Before the upgrade, fio to CephFS was:

WRITE: bw=514MiB/s (539MB/s), 514MiB/s-514MiB/s (539MB/s-539MB/s), io=4096MiB (4295MB), run=7974-7974msec

Now:

root@virt11:/home# fio --group_reporting --size=256M --bs=512k --rw=randwrite --numjobs=16 --ioengine=sync --direct=1 --name=fio.write.out
fio.write.out: (g=0): rw=randwrite, bs=512K-512K/512K-512K/512K-512K, ioengine=sync, iodepth=1
...
fio-2.16
Starting 16 processes
Jobs: 1 (f=1): [_(13),w(1),_(2)] [99.2% done] [0KB/5120KB/0KB /s] [0/10/0 iops] [eta 00m:02s]
fio.write.out: (groupid=0, jobs=16): err= 0: pid=119540: Wed Jun 12 11:04:45 2019
write: io=4096.0MB, bw=17886KB/s, iops=34, runt=234501msec
clat (msec): min=3, max=30300, avg=316.45, stdev=1615.35
lat (msec): min=3, max=30300, avg=316.46, stdev=1615.35
clat percentiles (msec):
| 1.00th=[ 5], 5.00th=[ 7], 10.00th=[ 8], 20.00th=[ 9],
| 30.00th=[ 10], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 13],
| 70.00th=[ 14], 80.00th=[ 18], 90.00th=[ 227], 95.00th=[ 1598],
| 99.00th=[ 6783], 99.50th=[10814], 99.90th=[16712], 99.95th=[16712],
| 99.99th=[16712]
lat (msec) : 4=0.40%, 10=37.32%, 20=46.29%, 50=5.03%, 100=0.29%
lat (msec) : 250=1.01%, 500=1.65%, 750=1.00%, 1000=0.84%, 2000=1.81%
lat (msec) : >=2000=4.36%
cpu : usr=0.01%, sys=0.01%, ctx=8877, majf=0, minf=121
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=8192/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: io=4096.0MB, aggrb=17886KB/s, minb=17886KB/s, maxb=17886KB/s, mint=234501msec, maxt=234501msec

My CephFS fio went from 539 MB/s to 17 MB/s. Something is wrong, I'm just not sure what...
 
switch# show port-channel load-balance

Port Channel Load-Balancing Configuration:
System: source-dest-ip

Port Channel Load-Balancing Addresses Used Per-Protocol:
Non-IP: source-dest-mac
IP: source-dest-ip

I think the biggest issue is the stretches of 0 cur MB/s:

Code:
root@virt0:~# rados -p fast bench 120 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 120 seconds or 0 objects
Object prefix: benchmark_data_virt0_1444211
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        70        54    215.99       216   0.0495505    0.134192
    2      16        75        59   117.989        20   0.0571337    0.128761
    3      16        82        66    87.992        28   0.0670793    0.225313
    4      16        95        79   78.9927        52     3.71087    0.245322
    5      16       105        89   71.1931        40   0.0549536    0.371978
    6      16       119       103   68.6599        56    0.044153    0.368798
    7      16       128       112   63.9935        36   0.0438053    0.481255
    8      16       138       122   60.9937        40     2.38462    0.510071
    9      16       141       125   55.5497        12   0.0561537    0.498959
   10      16       149       133   53.1945        32   0.0504102    0.631418
   11      16       150       134   48.7223         4     10.8043    0.707335
   12      16       166       150    49.995        64   0.0506009    0.709349
   13      16       173       157   48.3028        28   0.0516386    0.724443
   14      16       175       159   45.4239         8   0.0606331    0.716031
   15      16       175       159   42.3956         0           -    0.716031
   16      16       178       162   40.4959         6     3.12921    0.814432
   17      16       186       170   39.9959        32   0.0497566    0.894932
   18      16       204       188   41.7735        72   0.0684548    0.983689
   19      16       205       189   39.7854         4   0.0771596    0.978892
2019-06-12 11:37:16.133234 min lat: 0.0293968 max lat: 16.5006 avg lat: 0.943765
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   20      16       219       203   40.5959        56    0.057566    0.943765
   21      16       236       220   41.9005        68   0.0844543      1.0109
   22      16       245       229   41.6322        36   0.0565886    0.973191
   23      16       264       248   43.1261        76    0.147591      1.0454
   24      16       277       261   43.4957        52    0.066835     1.01317
   25      16       277       261   41.7559         0           -     1.01317
   26      16       282       266    40.919        10    0.049275     1.07292
   27      16       284       268   39.6998         8   0.0501575     1.06538
   28      16       291       275   39.2818        28   0.0595312     1.12438
   29      16       300       284   39.1685        36   0.0489746     1.17106
   30      16       308       292   38.9295        32   0.0482276     1.14452
   31      16       313       297   38.3188        20   0.0589515     1.15015
   32      16       313       297   37.1214         0           -     1.15015
   33      16       337       321   38.9053        48   0.0984226     1.15846
   34      16       346       330   38.8198        36   0.0982891     1.16036
   35      16       352       336   38.3963        24       24.67     1.22706
   36      16       357       341   37.8852        20     17.7778     1.26764
   37      16       360       344   37.1856        12   0.0555984     1.26496
   38      16       362       346   36.4175         8   0.0539548      1.2724
   39      16       380       364   37.3297        72    0.047894     1.28284
2019-06-12 11:37:36.135081 min lat: 0.0293968 max lat: 31.3002 avg lat: 1.34859
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   40      16       402       386   38.5963        88   0.0973106     1.34859
   41      16       410       394   38.4353        32   0.0609629       1.324
   42      16       416       400   38.0916        24    0.124389     1.32216
   43      16       416       400   37.2057         0           -     1.32216
   44      16       424       408   37.0873        16   0.0851649      1.4371
   45      16       426       410   36.4409         8   0.0439848     1.48494
   46      16       432       416   36.1704        24     2.41563     1.47008
   47      16       462       446   37.9538       120    0.149083     1.43306
   48      16       468       452    37.663        24      24.397     1.46919
   49      16       469       453    36.976         4     1.67013     1.46964
   50      16       475       459   36.7164        24   0.0802404      1.4974
   51      16       477       461   36.1533         8    0.135325     1.49155
   52      16       481       465   35.7657        16   0.0677755     1.53104
   53      16       481       465   35.0909         0           -     1.53104
   54      16       483       467   34.5892         4   0.0761406     1.55504
   55      16       483       467   33.9603         0           -     1.55504
   56      16       490       474   33.8539        14   0.0464463     1.54446
   57      16       501       485   34.0318        44   0.0715842     1.55072
   58      16       501       485    33.445         0           -     1.55072
   59      16       513       497   33.6917        24   0.0753383     1.61904
2019-06-12 11:37:56.137072 min lat: 0.0293968 max lat: 31.3002 avg lat: 1.5826
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   60      16       525       509     33.93        48   0.0579022      1.5826
   61      16       525       509   33.3738         0           -      1.5826
   62      16       526       510      32.9         2     21.4313     1.62152
   63      16       535       519   32.9492        36   0.0485865      1.6487
   64      16       540       524   32.7468        20   0.0720897     1.67152
   65      16       542       526   32.3661         8   0.0615156     1.66979
   66      16       570       554   33.5725       112   0.0589251     1.64112
   67      16       583       567   33.8475        52   0.0783381     1.60521
   68      16       592       576   33.8791        36   0.0813452     1.58869
   69      16       592       576   33.3881         0           -     1.58869
   70      16       593       577   32.9682         2     5.42268     1.59533
   71      16       595       579   32.6166         8      5.5527      1.6412
   72      16       596       580   32.2191         4    0.100335     1.63854
   73      16       597       581   31.8325         4     6.77777     1.64739
   74      16       602       586   31.6726        20   0.0444466     1.64551
   75      16       602       586   31.2503         0           -     1.64551
   76      16       611       595   31.3128        18   0.0526823     1.66883
   77      16       617       601   31.2178        24     5.83712      1.7149
   78      16       618       602   30.8688         4     8.48216     1.72615
   79      16       618       602   30.4781         0           -     1.72615
2019-06-12 11:38:16.138902 min lat: 0.0293968 max lat: 31.3002 avg lat: 1.75818
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   80      16       624       608   30.3971        12   0.0759222     1.75818
   81      16       642       626   30.9106        72   0.0567973     1.73147
   82      16       650       634   30.9238        32   0.0468982     1.71853
   83      16       658       642   30.9368        32     9.11917     1.71894
   84      16       665       649   30.9018        28   0.0713693     1.71008
   85      16       683       667   31.3852        72   0.0455275     1.76101
   86      16       694       678   31.5318        44     0.06314     1.78358
   87      16       694       678   31.1694         0           -     1.78358
   88      16       701       685   31.1334        14   0.0466101     1.77473
   89      16       705       689   30.9633        16   0.0743958     1.76987
   90      16       713       697   30.9748        32   0.0685679     1.75618
   91      16       717       701   30.8102        16     2.82181      1.7504
   92      16       717       701   30.4753         0           -      1.7504
   93      16       722       706   30.3626        10   0.0514611     1.78926
   94      16       722       706   30.0396         0           -     1.78926
   95      16       728       712    29.976        12   0.0440615     1.78679
   96      16       729       713   29.7055         4   0.0625844     1.78437
   97      16       749       733   30.2239        80   0.0702349      1.7507
   98      16       752       736   30.0379        12   0.0683926     1.79916
   99      16       752       736   29.7345         0           -     1.79916
2019-06-12 11:38:36.140983 min lat: 0.0293968 max lat: 40.884 avg lat: 1.79916
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
  100      16       752       736   29.4371         0           -     1.79916
  101      16       752       736   29.1457         0           -     1.79916
  102      16       764       748   29.3305        12   0.0443146     1.78948
  103      16       769       753   29.2399        20   0.0691472      1.7838
  104      16       785       769    29.574        64   0.0814865     1.81594
  105      16       791       775   29.5209        24   0.0494589     1.82355
  106      16       794       778   29.3556        12     8.69465     1.82787
  107      16       794       778   29.0813         0           -     1.82787
  108      16       797       781   28.9231         6   0.0462485     1.82458
  109      16       804       788   28.9146        28   0.0852259     1.81583
  110      16       805       789   28.6881         4     7.80635     1.82343
  111      16       806       790   28.4657         4     51.3146     1.88607
  112      16       812       796   28.4258        24   0.0435589     1.87241
  113      16       812       796   28.1742         0           -     1.87241
  114      16       814       798   27.9973         4     3.51155     1.87859
  115      16       825       809   28.1364        44    0.107093     1.91295
  116      16       832       816   28.1352        28   0.0606896     1.89709
  117      16       832       816   27.8947         0           -     1.89709
  118      16       832       816   27.6583         0           -     1.89709
  119      16       834       818   27.4931   2.66667   0.0584874     1.89892

Total time run:         174.206459
Total writes made:      837
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     19.2186
Stddev Bandwidth:       27.7114
Max bandwidth (MB/sec): 216
Min bandwidth (MB/sec): 0
Average IOPS:           4
Stddev IOPS:            6
Max IOPS:               54
Min IOPS:               0
Average Latency(s):     3.18114
Stddev Latency(s):      10.8591
Max latency(s):         97.9521
Min latency(s):         0.0293968
 
You rock my world!

I pulled one of the two bonded links on each node and everything is fast again! So, what should I change on the Cisco or Proxmox side to get both links working?
 
This may not be what you want to hear, but IMO LACP is great for fault tolerance and nothing else as far as Ceph is concerned. I could just say set the switch policy to 'src-dest-port' to match 'balance-tcp', but I don't think it will EVER be as fast as a single 40G NIC in any combination.
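(On NX-OS that would be set globally, roughly along these lines; the exact keywords vary by platform and release, and some releases use 'src-dst l4port' style arguments instead, so verify with 'port-channel load-balance ?' on the 3132Q before applying:)

Code:
switch(config)# port-channel load-balance ethernet source-dest-port
switch(config)# end
switch# show port-channel load-balance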

Again, IMO it would be better to just make the node bond active/passive if you need the fault tolerance, or remove the bond and run a single NIC if performance is your goal.
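(A minimal sketch of the active/passive variant of the bond stanza above, reusing the same NIC names; with active-backup OVS does not require LACP, so the switch-side port-channel config would need revisiting as well:)

Code:
allow-vmbr0 bond0
iface bond0 inet manual
    ovs_bonds enp129s0 enp129s0d1
    ovs_type OVSBond
    ovs_bridge vmbr0
    # active-backup: one NIC carries all traffic, the second only takes over on link failure
    ovs_options bond_mode=active-backup
    mtu 9000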

Does this make sense?
 