Hello,
We have a 4-node Proxmox cluster with Ceph running on 3 of the nodes.
We ran Proxmox 5 for a long time and recently upgraded to PVE 7.2-3 via PVE 6, following the official instructions. There were no errors during the upgrade and everything went relatively smoothly. Ceph was upgraded to Pacific. All settings have remained at their defaults since version 5 was installed. The Ceph nodes are connected by dedicated 10-gigabit interfaces, and no driver changes were made. After the upgrade, guest machines began to show file system response times 2 to 10 times worse than before. Unfortunately, we did not run rados bench before the upgrade, so there is nothing to compare the current results against.
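(For anyone who wants to reproduce the in-guest numbers: something along the lines of the following fio run inside a VM is enough to see the higher write latency. The file path and parameters are only an illustration, not the exact test we used.)
Bash:
# simple 4k synchronous random-write latency test inside a guest
# (illustrative parameters; adjust the path and size to the VM's disk)
fio --name=guest-latency --filename=/root/fio-testfile --size=1G \
    --rw=randwrite --bs=4k --ioengine=libaio --direct=1 --iodepth=1 \
    --runtime=60 --time_based --group_reporting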
All nodes run PVE 7.2-3 with the same kernel and Mellanox cards using the mlx4_core driver.
Bash:
uname -a
Linux 5.15.35-1-pve #1 SMP PVE 5.15.35-2 (Thu, 05 May 2022 13:54:35 +0200) x86_64 GNU/Linux
Bash:
lspci | egrep -i --color 'network|ethernet'
b3:00.0 Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
Bash:
find /sys | grep b3:00 | grep drivers
/sys/devices/pci0000:b2/0000:b2:00.0/0000:b3:00.0/sriov_drivers_autoprobe
/sys/bus/pci/drivers/mlx4_core/0000:b3:00.0
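The link speed on the dedicated Ceph interfaces can be double-checked with ethtool (the interface name below is a placeholder for the actual Ceph interface):
Bash:
# should report "Speed: 10000Mb/s" and "Link detected: yes" on the Ceph links
ethtool <ceph-interface>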
The Ceph cluster has 16 SSD BlueStore OSDs.
rados bench results after the upgrade:
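(The output matches the rados bench defaults: a 120-second write test with 4 MB objects and 16 concurrent operations. The invocation was roughly the following; the pool name is a placeholder.)
Bash:
# 120 s write benchmark, 4 MB objects, 16 threads (defaults); <pool> stands for the benchmarked pool
rados bench -p <pool> 120 write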
Bash:
2022-05-17T16:10:34.856379+0300 min lat: 0.0205689 max lat: 1.42925 avg lat: 0.19756
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
100 16 8097 8081 323.197 256 0.135658 0.19756
101 16 8153 8137 322.214 224 0.0613339 0.197956
102 16 8211 8195 321.33 232 0.0221343 0.197982
103 16 8289 8273 321.239 312 0.103886 0.199017
104 16 8371 8355 321.303 328 0.0589578 0.198925
105 16 8430 8414 320.491 236 0.0501508 0.199559
106 16 8496 8480 319.958 264 0.172266 0.199542
107 16 8572 8556 319.808 304 0.0373553 0.199637
108 16 8662 8646 320.18 360 0.0648627 0.199379
109 16 8730 8714 319.738 272 0.171747 0.199782
110 16 8785 8769 318.831 220 0.221714 0.200526
111 16 8841 8825 317.976 224 0.299212 0.200962
112 16 8915 8899 317.78 296 0.0316448 0.201036
113 16 8974 8958 317.056 236 0.0772135 0.201596
114 16 9038 9022 316.52 256 0.768226 0.202169
115 16 9113 9097 316.376 300 0.036784 0.202029
116 16 9166 9150 315.476 212 0.0609076 0.202566
117 16 9248 9232 315.583 328 0.0493887 0.202486
118 16 9311 9295 315.044 252 0.0273586 0.202868
119 16 9386 9370 314.917 300 0.0279775 0.202631
2022-05-17T16:10:54.858486+0300 min lat: 0.0205689 max lat: 1.42925 avg lat: 0.202805
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
120 16 9454 9438 314.559 272 0.0650414 0.202805
Total time run: 120.183
Total writes made: 9454
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 314.653
Stddev Bandwidth: 75.5816
Max bandwidth (MB/sec): 536
Min bandwidth (MB/sec): 136
Average IOPS: 78
Stddev IOPS: 18.8954
Max IOPS: 134
Min IOPS: 34
Average Latency(s): 0.203341
Stddev Latency(s): 0.227894
Max latency(s): 1.42925
Min latency(s): 0.0205689
Here we can see that latency has increased 5-10 times.
View attachment 37027
Maybe we're missing something?
What do you advise?