3-node CEPH hyperconverged cluster fragmentation

noob1

New Member
Feb 8, 2022
Proxmox = 6.4-8
CEPH = 15.2.13
Nodes = 3
Network = 2x100G / node
Disks = NVMe Samsung PM-1733 MZWLJ3T8HBLS 4TB
        NVMe Samsung PM-1733 MZWLJ1T9HBJR 2TB
CPU = EPYC 7252
CEPH pools = 2 separate pools, one per disk type; each disk is split into 2 OSDs
Replica = 3

The VMs don't do many writes, and I migrated the main testing VMs to the 2TB pool, which in turn fragments faster.

Code:
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 3   nvme  1.74660   1.00000  1.7 TiB  432 GiB  431 GiB  4.3 MiB  1.3 GiB  1.3 TiB  24.18  0.90  186      up
10   nvme  1.74660   1.00000  1.7 TiB  382 GiB  381 GiB  599 KiB  1.4 GiB  1.4 TiB  21.38  0.79  151      up
 7  ssd2n  0.87329   1.00000  894 GiB  279 GiB  278 GiB  2.0 MiB  1.2 GiB  615 GiB  31.19  1.16  113      up
 8  ssd2n  0.87329   1.00000  894 GiB  351 GiB  349 GiB  5.8 MiB  1.2 GiB  544 GiB  39.22  1.46  143      up
 4   nvme  1.74660   1.00000  1.7 TiB  427 GiB  425 GiB  9.6 MiB  1.4 GiB  1.3 TiB  23.85  0.89  180      up
11   nvme  1.74660   1.00000  1.7 TiB  388 GiB  387 GiB  3.5 MiB  1.5 GiB  1.4 TiB  21.72  0.81  157      up
 2  ssd2n  0.87329   1.00000  894 GiB  297 GiB  296 GiB  4.1 MiB  1.1 GiB  598 GiB  33.18  1.23  121      up
 6  ssd2n  0.87329   1.00000  894 GiB  333 GiB  332 GiB  8.6 MiB  1.2 GiB  561 GiB  37.23  1.38  135      up
 5   nvme  1.74660   1.00000  1.7 TiB  415 GiB  413 GiB  5.9 MiB  1.3 GiB  1.3 TiB  23.18  0.86  176      up
 9   nvme  1.74660   1.00000  1.7 TiB  400 GiB  399 GiB  4.3 MiB  1.7 GiB  1.4 TiB  22.38  0.83  161      up
 0  ssd2n  0.87329   1.00000  894 GiB  332 GiB  330 GiB  4.3 MiB  1.3 GiB  563 GiB  37.07  1.38  135      up
 1  ssd2n  0.87329   1.00000  894 GiB  298 GiB  297 GiB  1.7 MiB  1.3 GiB  596 GiB  33.35  1.24  121      up
                       TOTAL   16 TiB  4.2 TiB  4.2 TiB   55 MiB   16 GiB   11 TiB  26.92
MIN/MAX VAR: 0.79/1.46  STDDEV: 6.88
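
For reference, the per-OSD usage table above is the kind of output ceph osd df prints; presumably something like this produced it:

Code:
ceph osd df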

Code:
ID   CLASS  WEIGHT    TYPE NAME
-12  ssd2n   5.23975  root default~ssd2n
 -9  ssd2n   1.74658      host pmx-s01~ssd2n
  7  ssd2n   0.87329          osd.7
  8  ssd2n   0.87329          osd.8
-10  ssd2n   1.74658      host pmx-s02~ssd2n
  2  ssd2n   0.87329          osd.2
  6  ssd2n   0.87329          osd.6
-11  ssd2n   1.74658      host pmx-s03~ssd2n
  0  ssd2n   0.87329          osd.0
  1  ssd2n   0.87329          osd.1
 -2   nvme  10.47958  root default~nvme
 -4   nvme   3.49319      host pmx-s01~nvme
  3   nvme   1.74660          osd.3
 10   nvme   1.74660          osd.10
 -6   nvme   3.49319      host pmx-s02~nvme
  4   nvme   1.74660          osd.4
 11   nvme   1.74660          osd.11
 -8   nvme   3.49319      host pmx-s03~nvme
  5   nvme   1.74660          osd.5
  9   nvme   1.74660          osd.9
 -1         15.71933  root default
 -3          5.23978      host pmx-s01
  3   nvme   1.74660          osd.3
 10   nvme   1.74660          osd.10
  7  ssd2n   0.87329          osd.7
  8  ssd2n   0.87329          osd.8
 -5          5.23978      host pmx-s02
  4   nvme   1.74660          osd.4
 11   nvme   1.74660          osd.11
  2  ssd2n   0.87329          osd.2
  6  ssd2n   0.87329          osd.6
 -7          5.23978      host pmx-s03
  5   nvme   1.74660          osd.5
  9   nvme   1.74660          osd.9
  0  ssd2n   0.87329          osd.0
  1  ssd2n   0.87329          osd.1
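
The shadow roots (default~nvme, default~ssd2n) in the tree above are the per-device-class views of the CRUSH map; a listing like this is what the CRUSH tree command prints when shadow entries are included (my assumption on the exact invocation used):

Code:
ceph osd crush tree --show-shadow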

I did a lot of tests and recreated the pools and OSDs in many ways, but within a matter of days every OSD gets severely fragmented and loses up to 80% of its write performance (tested with many fio tests, rados benches, osd benches and RBD benches).
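
For reference, the per-OSD JSON blocks further down look like output of the built-in OSD write bench (1 GiB written in 4 MiB blocks, which are the defaults); a loop along these lines should collect them, though the exact invocation used here is my assumption:

Code:
# sketch: run the default 1 GiB / 4 MiB write bench on every OSD (IDs 0-11 in this cluster)
for i in $(seq 0 11); do
    echo -n "osd.$i: "
    ceph tell osd.$i bench
done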

If I delete the OSDs from a node and let them resync from the other 2 nodes, everything is fine for a few days (0.1 - 0.2 BlueStore fragmentation), but it is back in a 0.8+ state soon after. We are only using block devices (RBD) for the VMs, with ext4 filesystems and no swap on them.

Code:
osd.3  "fragmentation_rating": 0.090421032864104897
osd.10 "fragmentation_rating": 0.093359029842755931
osd.7  "fragmentation_rating": 0.083908842581664561
osd.8  "fragmentation_rating": 0.067356428512611116

after 5 days

osd.3  "fragmentation_rating": 0.2567613553223777
osd.10 "fragmentation_rating": 0.25025098722978778
osd.7  "fragmentation_rating": 0.77481281469969676
osd.8  "fragmentation_rating": 0.82260745733487917

after a few weeks

0.882571391878622
0.891192311159292
..etc
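
The fragmentation ratings above can be read from each OSD's admin socket on the node that hosts it; a minimal sketch, assuming the default admin socket setup:

Code:
# prints {"fragmentation_rating": ...} for the given OSD; run on the host carrying osd.3
ceph daemon osd.3 bluestore allocator score block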

Code:
after recreating the OSDs and syncing data to them

osd.0: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.41652934400000002,
    "bytes_per_sec": 2577829964.3638072,
    "iops": 614.60255726905041
}
osd.1: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.42986965700000002,
    "bytes_per_sec": 2497831160.0160232,
    "iops": 595.52935600662784
}
osd.2: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.424486221,
    "bytes_per_sec": 2529509253.4935308,
    "iops": 603.08200204218167
}
osd.3: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.31504493500000003,
    "bytes_per_sec": 3408218018.1693759,
    "iops": 812.58249716028593
}
osd.4: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.26949361700000002,
    "bytes_per_sec": 3984294084.412396,
    "iops": 949.92973432836436
}
osd.5: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.278853238,
    "bytes_per_sec": 3850562509.8748178,
    "iops": 918.04564234610029
}
osd.6: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.41076984700000002,
    "bytes_per_sec": 2613974301.7700129,
    "iops": 623.22003883600541
}
osd.7: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.42715592699999999,
    "bytes_per_sec": 2513699930.4705892,
    "iops": 599.31276571049432
}
osd.8: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.42246709999999998,
    "bytes_per_sec": 2541598680.7020001,
    "iops": 605.96434609937671
}
osd.9: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.27906448499999997,
    "bytes_per_sec": 3847647700.4947443,
    "iops": 917.35069763535125
}
osd.10: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.29398438999999998,
    "bytes_per_sec": 3652376998.6562896,
    "iops": 870.79453436286201
}
osd.11: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.29044762800000001,
    "bytes_per_sec": 3696851757.3846393,
    "iops": 881.39814314475996
}
Code:
5 days later, when the 2TB pool had fragmented

osd.0: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 1.2355760659999999,
    "bytes_per_sec": 869021223.01226258,
    "iops": 207.19080519968571
}
osd.1: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 1.2537920739999999,
    "bytes_per_sec": 856395447.27254355,
    "iops": 204.18058568776692
}
osd.2: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 1.109058316,
    "bytes_per_sec": 968156325.51462686,
    "iops": 230.82645547738716
}
osd.3: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.303978943,
    "bytes_per_sec": 3532290142.8734818,
    "iops": 842.16359683835071
}
osd.4: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.29256520600000002,
    "bytes_per_sec": 3670094057.5961719,
    "iops": 875.01861038116738
}
osd.5: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.34798205999999998,
    "bytes_per_sec": 3085624080.7356563,
    "iops": 735.67010897056014
}
osd.6: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 1.037829675,
    "bytes_per_sec": 1034603124.0627226,
    "iops": 246.6686067730719
}
osd.7: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 1.1761135300000001,
    "bytes_per_sec": 912957632.58501065,
    "iops": 217.66606154084459
}
osd.8: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 1.154277314,
    "bytes_per_sec": 930228646.9436754,
    "iops": 221.78379224388013
}
osd.9: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.27671432299999998,
    "bytes_per_sec": 3880326151.3860998,
    "iops": 925.14184746410842
}
osd.10: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.301649371,
    "bytes_per_sec": 3559569245.7121019,
    "iops": 848.66744177629994
}
osd.11: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.269951261,
    "bytes_per_sec": 3977539575.1902046,
    "iops": 948.3193338370811
}
Code:
Diff between them

4,6c4,6
<     "elapsed_sec": 0.41652934400000002,
<     "bytes_per_sec": 2577829964.3638072,
<     "iops": 614.60255726905041
---
>     "elapsed_sec": 1.2355760659999999,
>     "bytes_per_sec": 869021223.01226258,
>     "iops": 207.19080519968571
11,13c11,13
<     "elapsed_sec": 0.42986965700000002,
<     "bytes_per_sec": 2497831160.0160232,
<     "iops": 595.52935600662784
---
>     "elapsed_sec": 1.2537920739999999,
>     "bytes_per_sec": 856395447.27254355,
>     "iops": 204.18058568776692
18,20c18,20
<     "elapsed_sec": 0.424486221,
<     "bytes_per_sec": 2529509253.4935308,
<     "iops": 603.08200204218167
---
>     "elapsed_sec": 1.109058316,
>     "bytes_per_sec": 968156325.51462686,
>     "iops": 230.82645547738716
25,27c25,27
<     "elapsed_sec": 0.31504493500000003,
<     "bytes_per_sec": 3408218018.1693759,
<     "iops": 812.58249716028593
---
>     "elapsed_sec": 0.303978943,
>     "bytes_per_sec": 3532290142.8734818,
>     "iops": 842.16359683835071
32,34c32,34
<     "elapsed_sec": 0.26949361700000002,
<     "bytes_per_sec": 3984294084.412396,
<     "iops": 949.92973432836436
---
>     "elapsed_sec": 0.29256520600000002,
>     "bytes_per_sec": 3670094057.5961719,
>     "iops": 875.01861038116738
39,41c39,41
<     "elapsed_sec": 0.278853238,
<     "bytes_per_sec": 3850562509.8748178,
<     "iops": 918.04564234610029
---
>     "elapsed_sec": 0.34798205999999998,
>     "bytes_per_sec": 3085624080.7356563,
>     "iops": 735.67010897056014
46,48c46,48
<     "elapsed_sec": 0.41076984700000002,
<     "bytes_per_sec": 2613974301.7700129,
<     "iops": 623.22003883600541
---
>     "elapsed_sec": 1.037829675,
>     "bytes_per_sec": 1034603124.0627226,
>     "iops": 246.6686067730719
53,55c53,55
<     "elapsed_sec": 0.42715592699999999,
<     "bytes_per_sec": 2513699930.4705892,
<     "iops": 599.31276571049432
---
>     "elapsed_sec": 1.1761135300000001,
>     "bytes_per_sec": 912957632.58501065,
>     "iops": 217.66606154084459
60,62c60,62
<     "elapsed_sec": 0.42246709999999998,
<     "bytes_per_sec": 2541598680.7020001,
<     "iops": 605.96434609937671
---
>     "elapsed_sec": 1.154277314,
>     "bytes_per_sec": 930228646.9436754,
>     "iops": 221.78379224388013
67,69c67,69
<     "elapsed_sec": 0.27906448499999997,
<     "bytes_per_sec": 3847647700.4947443,
<     "iops": 917.35069763535125
---
>     "elapsed_sec": 0.27671432299999998,
>     "bytes_per_sec": 3880326151.3860998,
>     "iops": 925.14184746410842
74,76c74,76
<     "elapsed_sec": 0.29398438999999998,
<     "bytes_per_sec": 3652376998.6562896,
<     "iops": 870.79453436286201
---
>     "elapsed_sec": 0.301649371,
>     "bytes_per_sec": 3559569245.7121019,
>     "iops": 848.66744177629994
81,83c81,83
<     "elapsed_sec": 0.29044762800000001,
<     "bytes_per_sec": 3696851757.3846393,
<     "iops": 881.39814314475996
---
>     "elapsed_sec": 0.269951261,
>     "bytes_per_sec": 3977539575.1902046,
>     "iops": 948.3193338370811

After some time, the IOPS in the osd bench drop as low as 108 at 455 MB/s.

I noticed posts on the internet asking how to prevent or fix fragmentation, but they got no replies, and the Red Hat Ceph documentation just says "to call Red Hat to assist with fragmentation."

Does anyone know what causes the fragmentation and how to solve it without deleting the OSDs on each node one by one and resyncing in between? (The operation takes 90 minutes for all 3 nodes with 5 TB of data, but this is a testing cluster; for production that is not acceptable.)
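
For what it's worth, the node-by-node rebuild mentioned above corresponds roughly to a procedure like this sketch (the OSD ID, device path and the 2-OSDs-per-disk step are placeholders, not the exact commands used here):

Code:
# one node at a time, for each OSD on that node
ceph osd out 7                      # let data drain; wait until all PGs are active+clean
systemctl stop ceph-osd@7           # the OSD must be down before it can be destroyed
pveceph osd destroy 7 --cleanup     # remove the OSD and wipe its volume
# recreate the OSDs; splitting one disk into 2 OSDs needs pre-made LVs
# or ceph-volume lvm batch --osds-per-device 2
pveceph osd create /dev/nvme1n1
# wait for backfill/recovery to finish before moving on to the next node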

I tried changing these values:

Code:
ceph config set osd osd_memory_target 17179869184
ceph config set osd osd_memory_expected_fragmentation 0.800000
ceph config set osd osd_memory_base 2147483648
ceph config set osd osd_memory_cache_min 805306368
ceph config set osd bluestore_cache_size 17179869184
ceph config set osd bluestore_cache_size_ssd 17179869184
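
To double-check that those overrides actually took effect on the running OSDs, the values can be read back, for example:

Code:
ceph config get osd osd_memory_target
ceph config dump | grep -E 'osd_memory|bluestore_cache'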

The cluster is really not in use.

Code:
POOL_NAME                 USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED    RD_OPS       RD     WR_OPS       WR  USED COMPR  UNDER COMPR
cephfs_data                0 B        0       0       0                   0        0         0         0      0 B          0      0 B         0 B          0 B
cephfs_metadata         15 MiB       22       0      66                   0        0         0       176  716 KiB        206  195 KiB         0 B          0 B
containers              12 KiB        3       0       9                   0        0         0  14890830  834 GiB   11371993  641 GiB         0 B          0 B
device_health_metrics  1.2 MiB        6       0      18                   0        0         0      1389  3.6 MiB       1713  1.4 MiB         0 B          0 B
machines               2.4 TiB   221068       0  663204                   0        0         0  35032709  3.3 TiB  433971410  7.3 TiB         0 B          0 B
two_tb_pool            1.8 TiB   186662       0  559986                   0        0         0  12742384  864 GiB  217071088  5.0 TiB         0 B          0 B

total_objects    407761
total_used       4.2 TiB
total_avail      11 TiB
total_space      16 TiB

Code:
  cluster:
    id:     REMOVED-for-privacy
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pmx-s01,pmx-s02,pmx-s03 (age 2w)
    mgr: pmx-s03(active, since 4d), standbys: pmx-s01, pmx-s02
    mds: cephfs:1 {0=pmx-s03=up:active} 2 up:standby
    osd: 12 osds: 12 up (since 4h), 12 in (since 3d)

  data:
    pools:   6 pools, 593 pgs
    objects: 407.76k objects, 1.5 TiB
    usage:   4.2 TiB used, 11 TiB / 16 TiB avail
    pgs:     593 active+clean

  io:
    client:   55 KiB/s rd, 11 MiB/s wr, 2 op/s rd, 257 op/s wr
 