Hello,
We have a 4-node Proxmox cluster with Ceph running on 3 of the nodes.
We ran Proxmox 5 for a long time and recently upgraded to PVE 7.2-3 via PVE 6, following the official instructions. There were no errors during the upgrade and everything went relatively smoothly. Ceph was upgraded to Pacific. All settings have remained at their defaults since version 5 was installed. The Ceph nodes are connected by dedicated 10-gigabit interfaces, and no driver changes were made. After the upgrade, guest machines began to show file system response times 2 to 10 times worse than before. Unfortunately, we did not run rados bench before the upgrade, so there is nothing to compare the current results against.
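(For anyone who wants to reproduce the in-guest numbers: something along the lines of the following fio run inside a VM is enough to see the higher write latency. The file path and parameters are only an illustration, not the exact test we used.)
Bash:
# simple 4k synchronous random-write latency test inside a guest
# (illustrative parameters; adjust the path and size to the VM's disk)
fio --name=guest-latency --filename=/root/fio-testfile --size=1G \
    --rw=randwrite --bs=4k --ioengine=libaio --direct=1 --iodepth=1 \
    --runtime=60 --time_based --group_reporting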
All nodes run PVE 7.2-3 with the same kernel and Mellanox cards using the mlx4_core driver.
Bash:
uname -a
Linux 5.15.35-1-pve #1 SMP PVE 5.15.35-2 (Thu, 05 May 2022 13:54:35 +0200) x86_64 GNU/Linux
Bash:
lspci | egrep -i --color 'network|ethernet'
b3:00.0 Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
Bash:
find /sys | grep b3:00 | grep drivers
/sys/devices/pci0000:b2/0000:b2:00.0/0000:b3:00.0/sriov_drivers_autoprobe
/sys/bus/pci/drivers/mlx4_core/0000:b3:00.0
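The link speed on the dedicated Ceph interfaces can be double-checked with ethtool (the interface name below is a placeholder for the actual Ceph interface):
Bash:
# should report "Speed: 10000Mb/s" and "Link detected: yes" on the Ceph links
ethtool <ceph-interface>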
The Ceph cluster has 16 SSD BlueStore OSDs.
rados bench results after the upgrade:
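(The output matches the rados bench defaults: a 120-second write test with 4 MB objects and 16 concurrent operations. The invocation was roughly the following; the pool name is a placeholder.)
Bash:
# 120 s write benchmark, 4 MB objects, 16 threads (defaults); <pool> stands for the benchmarked pool
rados bench -p <pool> 120 write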
Bash:
2022-05-17T16:10:34.856379+0300 min lat: 0.0205689 max lat: 1.42925 avg lat: 0.19756
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
100 16 8097 8081 323.197 256 0.135658 0.19756
101 16 8153 8137 322.214 224 0.0613339 0.197956
102 16 8211 8195 321.33 232 0.0221343 0.197982
103 16 8289 8273 321.239 312 0.103886 0.199017
104 16 8371 8355 321.303 328 0.0589578 0.198925
105 16 8430 8414 320.491 236 0.0501508 0.199559
106 16 8496 8480 319.958 264 0.172266 0.199542
107 16 8572 8556 319.808 304 0.0373553 0.199637
108 16 8662 8646 320.18 360 0.0648627 0.199379
109 16 8730 8714 319.738 272 0.171747 0.199782
110 16 8785 8769 318.831 220 0.221714 0.200526
111 16 8841 8825 317.976 224 0.299212 0.200962
112 16 8915 8899 317.78 296 0.0316448 0.201036
113 16 8974 8958 317.056 236 0.0772135 0.201596
114 16 9038 9022 316.52 256 0.768226 0.202169
115 16 9113 9097 316.376 300 0.036784 0.202029
116 16 9166 9150 315.476 212 0.0609076 0.202566
117 16 9248 9232 315.583 328 0.0493887 0.202486
118 16 9311 9295 315.044 252 0.0273586 0.202868
119 16 9386 9370 314.917 300 0.0279775 0.202631
2022-05-17T16:10:54.858486+0300 min lat: 0.0205689 max lat: 1.42925 avg lat: 0.202805
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
120 16 9454 9438 314.559 272 0.0650414 0.202805
Total time run: 120.183
Total writes made: 9454
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 314.653
Stddev Bandwidth: 75.5816
Max bandwidth (MB/sec): 536
Min bandwidth (MB/sec): 136
Average IOPS: 78
Stddev IOPS: 18.8954
Max IOPS: 134
Min IOPS: 34
Average Latency(s): 0.203341
Stddev Latency(s): 0.227894
Max latency(s): 1.42925
Min latency(s): 0.0205689
Here we can see that latency has increased 5-10 times.
View attachment 37027
Maybe we're missing something?
What do you advise?