Proxmox = 6.4-8
CEPH = 15.2.13
Nodes = 3
Network = 2x100G / node
Disk = nvme Samsung PM-1733 MZWLJ3T8HBLS 4TB
nvme Samsung PM-1733 MZWLJ1T9HBJR 2TB
CPU = EPYC 7252
CEPH pools = 2 separate pools, one per disk type, with each disk split into 2 OSDs
Replica = 3
The VMs don't do many writes. I migrated the main testing VMs to the 2TB pool, which in turn fragments faster.
Code:
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
3 nvme 1.74660 1.00000 1.7 TiB 432 GiB 431 GiB 4.3 MiB 1.3 GiB 1.3 TiB 24.18 0.90 186 up
10 nvme 1.74660 1.00000 1.7 TiB 382 GiB 381 GiB 599 KiB 1.4 GiB 1.4 TiB 21.38 0.79 151 up
7 ssd2n 0.87329 1.00000 894 GiB 279 GiB 278 GiB 2.0 MiB 1.2 GiB 615 GiB 31.19 1.16 113 up
8 ssd2n 0.87329 1.00000 894 GiB 351 GiB 349 GiB 5.8 MiB 1.2 GiB 544 GiB 39.22 1.46 143 up
4 nvme 1.74660 1.00000 1.7 TiB 427 GiB 425 GiB 9.6 MiB 1.4 GiB 1.3 TiB 23.85 0.89 180 up
11 nvme 1.74660 1.00000 1.7 TiB 388 GiB 387 GiB 3.5 MiB 1.5 GiB 1.4 TiB 21.72 0.81 157 up
2 ssd2n 0.87329 1.00000 894 GiB 297 GiB 296 GiB 4.1 MiB 1.1 GiB 598 GiB 33.18 1.23 121 up
6 ssd2n 0.87329 1.00000 894 GiB 333 GiB 332 GiB 8.6 MiB 1.2 GiB 561 GiB 37.23 1.38 135 up
5 nvme 1.74660 1.00000 1.7 TiB 415 GiB 413 GiB 5.9 MiB 1.3 GiB 1.3 TiB 23.18 0.86 176 up
9 nvme 1.74660 1.00000 1.7 TiB 400 GiB 399 GiB 4.3 MiB 1.7 GiB 1.4 TiB 22.38 0.83 161 up
0 ssd2n 0.87329 1.00000 894 GiB 332 GiB 330 GiB 4.3 MiB 1.3 GiB 563 GiB 37.07 1.38 135 up
1 ssd2n 0.87329 1.00000 894 GiB 298 GiB 297 GiB 1.7 MiB 1.3 GiB 596 GiB 33.35 1.24 121 up
TOTAL 16 TiB 4.2 TiB 4.2 TiB 55 MiB 16 GiB 11 TiB 26.92
MIN/MAX VAR: 0.79/1.46 STDDEV: 6.88
Code:
ID CLASS WEIGHT TYPE NAME
-12 ssd2n 5.23975 root default~ssd2n
-9 ssd2n 1.74658 host pmx-s01~ssd2n
7 ssd2n 0.87329 osd.7
8 ssd2n 0.87329 osd.8
-10 ssd2n 1.74658 host pmx-s02~ssd2n
2 ssd2n 0.87329 osd.2
6 ssd2n 0.87329 osd.6
-11 ssd2n 1.74658 host pmx-s03~ssd2n
0 ssd2n 0.87329 osd.0
1 ssd2n 0.87329 osd.1
-2 nvme 10.47958 root default~nvme
-4 nvme 3.49319 host pmx-s01~nvme
3 nvme 1.74660 osd.3
10 nvme 1.74660 osd.10
-6 nvme 3.49319 host pmx-s02~nvme
4 nvme 1.74660 osd.4
11 nvme 1.74660 osd.11
-8 nvme 3.49319 host pmx-s03~nvme
5 nvme 1.74660 osd.5
9 nvme 1.74660 osd.9
-1 15.71933 root default
-3 5.23978 host pmx-s01
3 nvme 1.74660 osd.3
10 nvme 1.74660 osd.10
7 ssd2n 0.87329 osd.7
8 ssd2n 0.87329 osd.8
-5 5.23978 host pmx-s02
4 nvme 1.74660 osd.4
11 nvme 1.74660 osd.11
2 ssd2n 0.87329 osd.2
6 ssd2n 0.87329 osd.6
-7 5.23978 host pmx-s03
5 nvme 1.74660 osd.5
9 nvme 1.74660 osd.9
0 ssd2n 0.87329 osd.0
1 ssd2n 0.87329 osd.1
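For context, the ssd2n device class and the two per-class pools shown in the tree above were set up roughly like this (reconstructed from memory; the rule names are just examples, the OSD IDs and pool names are from my cluster):
Code:
# move the 2TB-disk OSDs into a custom device class
ceph osd crush rm-device-class osd.0 osd.1 osd.2 osd.6 osd.7 osd.8
ceph osd crush set-device-class ssd2n osd.0 osd.1 osd.2 osd.6 osd.7 osd.8
# one replicated CRUSH rule per device class
ceph osd crush rule create-replicated rule-nvme default host nvme
ceph osd crush rule create-replicated rule-ssd2n default host ssd2n
# pin each pool (replica 3) to its rule
ceph osd pool set machines crush_rule rule-nvme
ceph osd pool set two_tb_pool crush_rule rule-ssd2n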
I did a lot of tests and recreated the pools and OSDs in many different ways, but every time, within a matter of days, each OSD gets severely fragmented and loses up to 80% of its write performance (tested with many fio runs, rados benches, osd benches and RBD benches).
If I delete the OSDs on a node and let them resync from the other 2 nodes, everything is perfect for a few days (BlueStore fragmentation around 0.1-0.2), but soon it is back in a 0.8+ state. We are only using block devices for the VMs, with ext4 filesystems and no swap on them.
Code:
osd.3 "fragmentation_rating": 0.090421032864104897
osd.10 "fragmentation_rating": 0.093359029842755931
osd.7 "fragmentation_rating": 0.083908842581664561
osd.8 "fragmentation_rating": 0.067356428512611116
after 5 days
osd.3 "fragmentation_rating": 0.2567613553223777
osd.10 "fragmentation_rating": 0.25025098722978778
osd.7 "fragmentation_rating": 0.77481281469969676
osd.8 "fragmentation_rating": 0.82260745733487917
after a few weeks
0.882571391878622
0.891192311159292
...etc
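The ratings above come from the BlueStore allocator score on each OSD's admin socket; I collect them roughly like this (a sketch, run on the node that hosts the OSDs in question):
Code:
# 0 = no fragmentation, 1 = fully fragmented
for id in 3 10 7 8; do
    echo -n "osd.$id "
    ceph daemon osd.$id bluestore allocator score block | grep fragmentation_rating
done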
Code:
after recreating OSD's and syncing data to them
osd.0: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.41652934400000002,
"bytes_per_sec": 2577829964.3638072,
"iops": 614.60255726905041
}
osd.1: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.42986965700000002,
"bytes_per_sec": 2497831160.0160232,
"iops": 595.52935600662784
}
osd.2: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.424486221,
"bytes_per_sec": 2529509253.4935308,
"iops": 603.08200204218167
}
osd.3: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.31504493500000003,
"bytes_per_sec": 3408218018.1693759,
"iops": 812.58249716028593
}
osd.4: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.26949361700000002,
"bytes_per_sec": 3984294084.412396,
"iops": 949.92973432836436
}
osd.5: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.278853238,
"bytes_per_sec": 3850562509.8748178,
"iops": 918.04564234610029
}
osd.6: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.41076984700000002,
"bytes_per_sec": 2613974301.7700129,
"iops": 623.22003883600541
}
osd.7: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.42715592699999999,
"bytes_per_sec": 2513699930.4705892,
"iops": 599.31276571049432
}
osd.8: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.42246709999999998,
"bytes_per_sec": 2541598680.7020001,
"iops": 605.96434609937671
}
osd.9: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.27906448499999997,
"bytes_per_sec": 3847647700.4947443,
"iops": 917.35069763535125
}
osd.10: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.29398438999999998,
"bytes_per_sec": 3652376998.6562896,
"iops": 870.79453436286201
}
osd.11: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.29044762800000001,
"bytes_per_sec": 3696851757.3846393,
"iops": 881.39814314475996
}
Code:
5 days later when 2TB pool fragmented
osd.0: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 1.2355760659999999,
"bytes_per_sec": 869021223.01226258,
"iops": 207.19080519968571
}
osd.1: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 1.2537920739999999,
"bytes_per_sec": 856395447.27254355,
"iops": 204.18058568776692
}
osd.2: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 1.109058316,
"bytes_per_sec": 968156325.51462686,
"iops": 230.82645547738716
}
osd.3: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.303978943,
"bytes_per_sec": 3532290142.8734818,
"iops": 842.16359683835071
}
osd.4: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.29256520600000002,
"bytes_per_sec": 3670094057.5961719,
"iops": 875.01861038116738
}
osd.5: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.34798205999999998,
"bytes_per_sec": 3085624080.7356563,
"iops": 735.67010897056014
}
osd.6: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 1.037829675,
"bytes_per_sec": 1034603124.0627226,
"iops": 246.6686067730719
}
osd.7: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 1.1761135300000001,
"bytes_per_sec": 912957632.58501065,
"iops": 217.66606154084459
}
osd.8: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 1.154277314,
"bytes_per_sec": 930228646.9436754,
"iops": 221.78379224388013
}
osd.9: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.27671432299999998,
"bytes_per_sec": 3880326151.3860998,
"iops": 925.14184746410842
}
osd.10: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.301649371,
"bytes_per_sec": 3559569245.7121019,
"iops": 848.66744177629994
}
osd.11: {
"bytes_written": 1073741824,
"blocksize": 4194304,
"elapsed_sec": 0.269951261,
"bytes_per_sec": 3977539575.1902046,
"iops": 948.3193338370811
}
Code:
Diff between them
4,6c4,6
< "elapsed_sec": 0.41652934400000002,
< "bytes_per_sec": 2577829964.3638072,
< "iops": 614.60255726905041
---
> "elapsed_sec": 1.2355760659999999,
> "bytes_per_sec": 869021223.01226258,
> "iops": 207.19080519968571
11,13c11,13
< "elapsed_sec": 0.42986965700000002,
< "bytes_per_sec": 2497831160.0160232,
< "iops": 595.52935600662784
---
> "elapsed_sec": 1.2537920739999999,
> "bytes_per_sec": 856395447.27254355,
> "iops": 204.18058568776692
18,20c18,20
< "elapsed_sec": 0.424486221,
< "bytes_per_sec": 2529509253.4935308,
< "iops": 603.08200204218167
---
> "elapsed_sec": 1.109058316,
> "bytes_per_sec": 968156325.51462686,
> "iops": 230.82645547738716
25,27c25,27
< "elapsed_sec": 0.31504493500000003,
< "bytes_per_sec": 3408218018.1693759,
< "iops": 812.58249716028593
---
> "elapsed_sec": 0.303978943,
> "bytes_per_sec": 3532290142.8734818,
> "iops": 842.16359683835071
32,34c32,34
< "elapsed_sec": 0.26949361700000002,
< "bytes_per_sec": 3984294084.412396,
< "iops": 949.92973432836436
---
> "elapsed_sec": 0.29256520600000002,
> "bytes_per_sec": 3670094057.5961719,
> "iops": 875.01861038116738
39,41c39,41
< "elapsed_sec": 0.278853238,
< "bytes_per_sec": 3850562509.8748178,
< "iops": 918.04564234610029
---
> "elapsed_sec": 0.34798205999999998,
> "bytes_per_sec": 3085624080.7356563,
> "iops": 735.67010897056014
46,48c46,48
< "elapsed_sec": 0.41076984700000002,
< "bytes_per_sec": 2613974301.7700129,
< "iops": 623.22003883600541
---
> "elapsed_sec": 1.037829675,
> "bytes_per_sec": 1034603124.0627226,
> "iops": 246.6686067730719
53,55c53,55
< "elapsed_sec": 0.42715592699999999,
< "bytes_per_sec": 2513699930.4705892,
< "iops": 599.31276571049432
---
> "elapsed_sec": 1.1761135300000001,
> "bytes_per_sec": 912957632.58501065,
> "iops": 217.66606154084459
60,62c60,62
< "elapsed_sec": 0.42246709999999998,
< "bytes_per_sec": 2541598680.7020001,
< "iops": 605.96434609937671
---
> "elapsed_sec": 1.154277314,
> "bytes_per_sec": 930228646.9436754,
> "iops": 221.78379224388013
67,69c67,69
< "elapsed_sec": 0.27906448499999997,
< "bytes_per_sec": 3847647700.4947443,
< "iops": 917.35069763535125
---
> "elapsed_sec": 0.27671432299999998,
> "bytes_per_sec": 3880326151.3860998,
> "iops": 925.14184746410842
74,76c74,76
< "elapsed_sec": 0.29398438999999998,
< "bytes_per_sec": 3652376998.6562896,
< "iops": 870.79453436286201
---
> "elapsed_sec": 0.301649371,
> "bytes_per_sec": 3559569245.7121019,
> "iops": 848.66744177629994
81,83c81,83
< "elapsed_sec": 0.29044762800000001,
< "bytes_per_sec": 3696851757.3846393,
< "iops": 881.39814314475996
---
> "elapsed_sec": 0.269951261,
> "bytes_per_sec": 3977539575.1902046,
> "iops": 948.3193338370811
After some time, IOPS in the osd bench drop as low as 108, at about 455 MB/s.
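For reference, the per-OSD numbers above come from the built-in OSD bench, run with parameters matching the JSON (1 GiB total in 4 MiB writes), roughly:
Code:
# write 1 GiB in 4 MiB blocks on every OSD and report bytes_per_sec + IOPS
ceph tell osd.* bench 1073741824 4194304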
I noticed posts on the internet asking how to prevent or fix fragmentation, but none of them got replies, and the Red Hat Ceph documentation basically says to "call Red Hat to assist with fragmentation."
Does anyone know what causes the fragmentation and how to solve it without deleting the OSDs on each node one by one and resyncing in between? (The operation takes about 90 minutes for all 3 nodes with 5TB of data; this is a testing cluster, so for production that would not be acceptable.)
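The only thing that reliably resets the performance is recreating the OSDs one node at a time and waiting for backfill, roughly like this (a sketch with osd.7 as an example; the device path is illustrative, and since each disk carries two OSDs both get purged before the disk is redeployed, e.g. with "ceph-volume lvm batch --osds-per-device 2"):
Code:
ceph osd out 7
systemctl stop ceph-osd@7
ceph osd purge 7 --yes-i-really-mean-it
ceph-volume lvm zap /dev/nvme1n1 --destroy
ceph-volume lvm create --data /dev/nvme1n1    # or: pveceph osd create /dev/nvme1n1
# wait for backfill / HEALTH_OK before moving on to the next OSD or node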
I tried changing these values:
ceph config set osd osd_memory_target 17179869184
ceph config set osd osd_memory_expected_fragmentation 0.800000
ceph config set osd osd_memory_base 2147483648
ceph config set osd osd_memory_cache_min 805306368
ceph config set osd bluestore_cache_size 17179869184
ceph config set osd bluestore_cache_size_ssd 17179869184
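The effective values can be checked, and any of these overrides dropped again, with the standard config commands, e.g.:
Code:
ceph config get osd osd_memory_target      # show the value an OSD will use
ceph config rm osd bluestore_cache_size    # drop an override back to the default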
The cluster is really not in use.
Code:
POOL_NAME USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD WR_OPS WR USED COMPR UNDER COMPR
cephfs_data 0 B 0 0 0 0 0 0 0 0 B 0 0 B 0 B 0 B
cephfs_metadata 15 MiB 22 0 66 0 0 0 176 716 KiB 206 195 KiB 0 B 0 B
containers 12 KiB 3 0 9 0 0 0 14890830 834 GiB 11371993 641 GiB 0 B 0 B
device_health_metrics 1.2 MiB 6 0 18 0 0 0 1389 3.6 MiB 1713 1.4 MiB 0 B 0 B
machines 2.4 TiB 221068 0 663204 0 0 0 35032709 3.3 TiB 433971410 7.3 TiB 0 B 0 B
two_tb_pool 1.8 TiB 186662 0 559986 0 0 0 12742384 864 GiB 217071088 5.0 TiB 0 B 0 B
total_objects 407761
total_used 4.2 TiB
total_avail 11 TiB
total_space 16 TiB
Code:
cluster:
id: REMOVED-for-privacy
health: HEALTH_OK
services:
mon: 3 daemons, quorum pmx-s01,pmx-s02,pmx-s03 (age 2w)
mgr: pmx-s03(active, since 4d), standbys: pmx-s01, pmx-s02
mds: cephfs:1 {0=pmx-s03=up:active} 2 up:standby
osd: 12 osds: 12 up (since 4h), 12 in (since 3d)
data:
pools: 6 pools, 593 pgs
objects: 407.76k objects, 1.5 TiB
usage: 4.2 TiB used, 11 TiB / 16 TiB avail
pgs: 593 active+clean
io:
client: 55 KiB/s rd, 11 MiB/s wr, 2 op/s rd, 257 op/s wr
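For completeness, the dumps in this post were gathered with the usual status commands (as far as I recall):
Code:
ceph osd df                          # per-OSD usage table
ceph osd crush tree --show-shadow    # CRUSH tree incl. the per-class shadow roots
rados df                             # per-pool usage
ceph -s                              # overall cluster status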