Hello,
Some information about the system:
It's a hyperconverged cluster of 5 Supermicro AS-1114S-WN10RT servers:
4 of the servers have:
CPU: 128 x AMD EPYC 7702P 64-Core Processor (1 Socket)
RAM: 512 GB
1 of the servers has:
CPU: 64 x AMD EPYC 7502P 32-Core Processor (1 Socket)
RAM: 256 GB
Network:
All servers have a Mellanox Technologies MT28800 Family [ConnectX-5 Ex] with 2x 100 GbE for Ceph.
I installed the newest OFED driver version, MLNX_OFED_LINUX-5.1-2.5.8.0-debian10.3-x86_64 (on the server I built a new repo for the new kernel (5.4.106-1-pve) and installed the new packages from the local repo).
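For reference, this is roughly how I would check that the OFED/mlx5 module was actually rebuilt for and loaded by the new kernel (the interface name is just an example):
Code:
# which mlx5_core module the running kernel picks up, and its version
modinfo mlx5_core | grep -E '^(filename|version)'
# driver/firmware as reported by the interface (replace with your interface name)
ethtool -i enp65s0f0
# if the OFED modules were installed via DKMS, this shows whether they were built for 5.4.106-1-pve
dkms status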
Storage:
Ceph storage: each server has 2 x Micron_9300_MTFDHAL3T2TDR NVMe drives with 3.2 TB.
Each drive has 4 OSDs, so the cluster has 40 OSDs in total, with 1025 PGs.
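A quick way to see the pool/PG layout and the per-OSD distribution described above (the pool used in the benchmarks below is CEPHStor):
Code:
ceph -s                   # overall health and PG states
ceph osd df tree          # per-OSD utilization and placement
ceph osd pool ls detail   # pg_num / pgp_num and replication settings per pool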
On Thursday, 08.04.2021, I upgraded from PVE version 6.3-3 to 6.3-6. I also upgraded the kernel from 5.4.78-2-pve to 5.4.106-1-pve. Since then the Ceph cluster is very slow: normally the I/O wait time inside the VMs is about 7 ms, now it has increased to about 750 ms and up to 2000 ms.
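For reference, this kind of I/O wait can be watched inside a VM with iostat from the sysstat package (assuming a Debian-based guest); the await columns are in milliseconds:
Code:
apt install sysstat
# extended per-device statistics, refreshed every 2 seconds
iostat -x 2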
First I did a network benchmark with iperf; with only one thread I got:
0.0- 6.7 sec 26.8 GBytes 34.5 Gbits/sec
Normally, when doing backups, the network reaches 2 Gbit/s on the 100 GbE interfaces.
So the network seems to be OK, I guess, and I assume that Ceph has a problem.
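The iperf test was nothing special, essentially a plain single-stream run between two nodes on the Ceph network, something like this (the IP is a placeholder):
Code:
# on the receiving node
iperf -s
# on the sending node; a single stream is the default
iperf -c 192.168.100.12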
We also ran some benchmarks on the system with rados bench. The two runs below were started within 10 minutes of each other:
1.
Code:
root@__:~# rados bench -p CEPHStor 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_vs5_83744
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 341 325 1299.9 1300 0.00991154 0.0424592
2 16 356 340 679.93 60 0.391409 0.0522247
3 16 363 347 462.616 28 0.0258595 0.0721364
4 16 387 371 370.958 96 1.57218 0.131835
5 16 402 386 308.765 60 0.0113969 0.16646
6 16 408 392 261.303 24 0.0111479 0.178742
7 16 435 419 239.401 108 0.529584 0.226812
8 16 452 436 217.974 68 0.0106614 0.266152
9 16 461 445 197.754 36 0.96406 0.279443
10 16 466 450 179.979 20 0.0126327 0.286548
11 16 466 450 163.617 0 - 0.286548
Total time run: 11.7593
Total writes made: 466
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 158.513
Stddev Bandwidth: 378.314
Max bandwidth (MB/sec): 1300
Min bandwidth (MB/sec): 0
Average IOPS: 39
Stddev IOPS: 94.5785
Max IOPS: 325
Min IOPS: 0
Average Latency(s): 0.401447
Stddev Latency(s): 0.917697
Max latency(s): 5.38532
Min latency(s): 0.00813231
root@___:~# rados bench -p CEPHStor 10 rand
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 105 89 355.93 356 0.00395163 0.0381291
2 16 134 118 235.962 116 0.00667274 0.143368
3 16 160 144 191.972 104 0.00561947 0.25516
4 16 203 187 186.974 172 0.00612447 0.224298
5 16 366 350 279.962 652 0.00742376 0.225647
6 16 386 370 246.633 80 0.00378491 0.217095
7 16 397 381 217.686 44 0.00523762 0.23806
8 16 399 383 191.475 8 0.00763485 0.242517
9 16 516 500 222.194 468 0.00886989 0.251759
10 16 516 500 199.974 0 - 0.251759
11 16 516 500 181.794 0 - 0.251759
12 15 516 501 166.979 1.33333 6.47211 0.264174
13 13 516 503 154.75 8 3.7695 0.278351
14 13 516 503 143.696 0 - 0.278351
15 12 516 504 134.383 2 6.6732 0.291039
16 11 516 505 126.234 4 9.95319 0.310172
Total time run: 16.0488
Total reads made: 516
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 128.607
Average IOPS: 32
Stddev IOPS: 49.1677
Max IOPS: 163
Min IOPS: 0
Average Latency(s): 0.470387
Max latency(s): 11.0187
Min latency(s): 0.00254663
2.
Code:
root@___:~# rados bench -p CEPHStor 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_vs5_87907
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 16 16 0 0 0 - 0
1 16 30 14 55.9724 56 0.492942 0.266121
2 16 36 20 39.9879 24 0.273857 0.340842
3 16 41 25 33.3256 20 2.16337 0.661045
4 16 55 39 38.9924 56 0.492101 0.779455
5 16 61 45 35.9937 24 0.0132808 0.991254
6 16 67 51 33.9946 24 2.99239 1.12742
7 16 69 53 30.2811 8 6.43563 1.22443
8 16 72 56 27.996 12 0.0119884 1.25493
9 16 72 56 24.8852 0 - 1.25493
10 16 72 56 22.3968 0 - 1.25493
11 14 72 58 21.0879 2.66667 7.47855 1.42587
12 10 72 62 20.6637 16 5.5864 1.6983
13 4 72 68 20.9201 24 6.16283 2.28117
14 4 72 68 19.4259 0 - 2.28117
15 4 72 68 18.1309 0 - 2.28117
16 1 72 71 17.7477 4 7.74045 2.66913
Total time run: 16.5578
Total writes made: 72
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 17.3936
Stddev Bandwidth: 18.045
Max bandwidth (MB/sec): 56
Min bandwidth (MB/sec): 0
Average IOPS: 4
Stddev IOPS: 4.54927
Max IOPS: 14
Min IOPS: 0
Average Latency(s): 2.80683
Stddev Latency(s): 3.53249
Max latency(s): 13.6394
Min latency(s): 0.00975463
In older benchmarks on an empty Ceph cluster I reached about 6 GB/s; now the bandwidth is unstable and slow. In addition, I noticed that one server has a high load, with up to 300% CPU usage on its OSD processes. That server also has a higher apply and commit latency. The server with the high load changed once the previous server crashed with a max latency of 88 ms on one OSD.
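For the per-OSD numbers, the commit/apply latency and the CPU usage of the OSD daemons can be checked with something like this (the OSD id is only an example):
Code:
ceph osd perf                       # commit/apply latency in ms per OSD
top -c -p "$(pgrep -d, ceph-osd)"   # CPU usage of the ceph-osd processes
ceph tell osd.12 bench              # write benchmark against a single OSD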
Maybe this output helps:
Code:
:~# pveversion -V
proxmox-ve: 6.3-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.3-6 (running version: 6.3-6/2184247e)
pve-kernel-5.4: 6.3-8
pve-kernel-helper: 6.3-8
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
ceph: 15.2.10-pve1
ceph-fuse: 15.2.10-pve1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.13-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-9
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-5
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-10
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1
I hope somebody can help me figure this out.
Is there a possibility to downgrade back to Ceph version 15.2.8?
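I am not sure whether a downgrade within Octopus is supported at all; if the old packages are still in the configured Ceph repository, I guess it would be an apt downgrade along these lines, where the exact package list and version string are assumptions on my part:
Code:
apt-cache policy ceph-osd    # check whether a 15.2.8 build is still offered
apt install ceph=15.2.8-pve1 ceph-base=15.2.8-pve1 ceph-common=15.2.8-pve1 \
    ceph-mon=15.2.8-pve1 ceph-mgr=15.2.8-pve1 ceph-osd=15.2.8-pve1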