CEPH issue: continuous rebalancing

Senin

Hi!

I've got a 3-node PVE cluster with Ceph.
1 pool, 4 OSDs per node.

Yesterday Ceph started rebalancing and it is still going.
I think this is because I've added about 6 TB of data and the autoscaler changed the number of PGs from 32 to 128.
That's OK, but I'm a bit confused because recovery runs fine until it reaches 95% and then drops back to 94% again.

In the log I can see this (note how the misplaced ratio drops to about 5% and then jumps back up to 5.77%):

2023-05-03T21:29:08.670288+0300 mgr.pve1 (mgr.13426191) 551 : cluster [DBG] pgmap v474: 129 pgs: 18 active+remapped+backfilling, 111 active+clean; 15 TiB data, 48 TiB used, 170 TiB / 218 TiB avail; 2.7 MiB/s rd, 2.5 MiB/s wr, 167 op/s; 602677/12051759 objects misplaced (5.001%); 164 MiB/s, 41 objects/s recovering
2023-05-03T21:29:10.670835+0300 mgr.pve1 (mgr.13426191) 552 : cluster [DBG] pgmap v475: 129 pgs: 18 active+remapped+backfilling, 111 active+clean; 15 TiB data, 48 TiB used, 170 TiB / 218 TiB avail; 2.5 MiB/s rd, 2.5 MiB/s wr, 159 op/s; 602571/12051759 objects misplaced (5.000%); 146 MiB/s, 37 objects/s recovering
2023-05-03T21:29:10.867345+0300 mon.pve1 (mon.0) 1002 : cluster [DBG] osdmap e3661: 12 total, 12 up, 12 in
2023-05-03T21:29:11.878821+0300 mon.pve1 (mon.0) 1003 : cluster [DBG] osdmap e3662: 12 total, 12 up, 12 in
2023-05-03T21:29:12.671381+0300 mgr.pve1 (mgr.13426191) 553 : cluster [DBG] pgmap v478: 129 pgs: 1 remapped+peering, 18 active+remapped+backfilling, 110 active+clean; 15 TiB data, 48 TiB used, 170 TiB / 218 TiB avail; 3.4 MiB/s rd, 3.6 MiB/s wr, 262 op/s; 602225/12051759 objects misplaced (4.997%); 194 MiB/s, 49 objects/s recovering
2023-05-03T21:29:12.881694+0300 mon.pve1 (mon.0) 1008 : cluster [DBG] osdmap e3663: 12 total, 12 up, 12 in
2023-05-03T21:29:12.899231+0300 osd.1 (osd.1) 2611 : cluster [DBG] 2.51 starting backfill to osd.3 from (0'0,0'0] MAX to 3660'22103919
2023-05-03T21:29:12.925535+0300 osd.1 (osd.1) 2612 : cluster [DBG] 2.51 starting backfill to osd.5 from (0'0,0'0] MAX to 3660'22103919
2023-05-03T21:29:12.948445+0300 osd.1 (osd.1) 2613 : cluster [DBG] 2.51 starting backfill to osd.10 from (0'0,0'0] MAX to 3660'22103919
2023-05-03T21:29:14.671807+0300 mgr.pve1 (mgr.13426191) 554 : cluster [DBG] pgmap v480: 129 pgs: 1 remapped+peering, 18 active+remapped+backfilling, 110 active+clean; 15 TiB data, 48 TiB used, 170 TiB / 218 TiB avail; 2.3 MiB/s rd, 2.5 MiB/s wr, 164 op/s; 602225/12051759 objects misplaced (4.997%); 131 MiB/s, 32 objects/s recovering
2023-05-03T21:29:16.672609+0300 mgr.pve1 (mgr.13426191) 555 : cluster [DBG] pgmap v481: 129 pgs: 1 remapped+peering, 18 active+remapped+backfilling, 110 active+clean; 15 TiB data, 48 TiB used, 170 TiB / 218 TiB avail; 3.2 MiB/s rd, 3.0 MiB/s wr, 228 op/s; 601957/12051759 objects misplaced (4.995%); 181 MiB/s, 45 objects/s recovering
2023-05-03T21:29:18.673223+0300 mgr.pve1 (mgr.13426191) 556 : cluster [DBG] pgmap v482: 129 pgs: 19 active+remapped+backfilling, 110 active+clean; 15 TiB data, 48 TiB used, 170 TiB / 218 TiB avail; 3.4 MiB/s rd, 2.7 MiB/s wr, 209 op/s; 695930/12051759 objects misplaced (5.775%); 196 MiB/s, 49 objects/s recovering

Is there any way to fix this?

Best regards, Alex
 
OK, because of performance issues I just moved my large disk to "classic" shared storage, and after another 7 hours the rebalance was done.
I also tried increasing target_max_misplaced_ratio to 7%, but it didn't help (same behaviour, just at the 7% limit).
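In case anyone wants to try the same, the ratio should be adjustable with something along these lines (0.07 = 7%; the default is 0.05):

Code:
ceph config set mgr target_max_misplaced_ratio 0.07

As far as I understand, this only raises the ceiling on how much data may be misplaced at once; it doesn't stop the constant re-shuffling itself.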
 
I suffer from this exact issue!

My Ceph cluster has been bouncing between 94% and 95% for days now (at first I didn't notice it). Does anyone know a fix?

You mentioned going back to "classic" shared storage. What did you mean by that? Did you move to a non-Ceph solution?
 
Hello,

In order to better understand your setup, could you please post the output of the following:

- ceph config dump
- cat /etc/pve/ceph.conf
- ceph osd df tree
- pveceph pool ls

Please use code blocks so it is easier to follow.
 
The requested info is below. Do note I've been looking around and theorize it might be the autoscaler and the balancer playing havoc with each other, juggling PGs around without ever being able to finish rebalancing. The autoscaler put the PGs of my pool at 256.
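(Side note: if anyone wants to compare on their own cluster, the state of both components can be checked with something like the following; I'm only noting the commands here:)

Code:
ceph osd pool autoscale-status
ceph balancer status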

Code:
root@balaur:~# ceph config dump
WHO     MASK  LEVEL     OPTION                                 VALUE              RO
global        advanced  mon_allow_pool_size_one                true               
global        advanced  osd_scrub_load_threshold               10.000000         
mon           advanced  auth_allow_insecure_global_id_reclaim  false             
mon           advanced  mon_max_pg_per_osd                     500               
mgr           advanced  mgr/balancer/mode                      upmap             
mgr           advanced  mgr/dashboard/ssl                      false              *
mgr           advanced  mgr/telemetry/enabled                  true               *
osd           advanced  osd_mclock_profile                     high_recovery_ops 
osd           advanced  osd_scrub_auto_repair                  true               
osd.0         basic     osd_mclock_max_capacity_iops_hdd       335.393982         
osd.1         basic     osd_mclock_max_capacity_iops_hdd       350.584368         
osd.10        basic     osd_mclock_max_capacity_iops_ssd       42688.763129       
osd.11        basic     osd_mclock_max_capacity_iops_ssd       43070.621779       
osd.2         basic     osd_mclock_max_capacity_iops_hdd       476.610339         
osd.3         basic     osd_mclock_max_capacity_iops_hdd       450.379448         
osd.6         basic     osd_mclock_max_capacity_iops_hdd       486.571185         
osd.7         basic     osd_mclock_max_capacity_iops_hdd       461.046658         
osd.9         basic     osd_mclock_max_capacity_iops_ssd       42984.100697


Code:
[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 10.15.15.0/24
        fsid = d60c8e23-ff54-4552-9119-e7801e913dc2
        mon_allow_pool_delete = true
        mon_host = 10.15.15.50 10.15.15.51 10.15.15.52
        ms_bind_ipv4 = true
        ms_bind_ipv6 = false
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 10.15.15.0/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
        keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
        keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.balaur-0]
        host = balaur
        mds_standby_for_name = pve

[mds.balaur-1]
        host = balaur
        mds_standby_for_name = pve

[mds.cerberus-0]
        host = cerberus
        mds_standby_for_name = pve

[mds.cerberus-1]
        host = cerberus
        mds_standby_for_name = pve

[mds.chimera-0]
        host = chimera
        mds_standby_for_name = pve

[mds.chimera-1]
        host = chimera
        mds_standby_for_name = pve

[mon]
        mgr_initial_modules = dashboard

[mon.balaur]
        public_addr = 10.15.15.52

[mon.cerberus]
        public_addr = 10.15.15.50

[mon.chimera]
        public_addr = 10.15.15.51

Code:
ceph osd df tree
ID   CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME      
 -1         52.07874         -   52 TiB   31 TiB   31 TiB  2.4 GiB   86 GiB   21 TiB  59.20  1.00    -          root default    
-10         17.28278         -   17 TiB   10 TiB   10 TiB  812 MiB   28 GiB  7.1 TiB  58.95  1.00    -              host balaur
  6    hdd   9.09569   1.00000  9.1 TiB  5.7 TiB  5.7 TiB  7.6 MiB   13 GiB  3.4 TiB  62.92  1.06  118      up          osd.6  
  7    hdd   7.27739   1.00000  7.3 TiB  3.8 TiB  3.8 TiB  4.3 MiB  9.4 GiB  3.4 TiB  52.76  0.89   76      up          osd.7  
  9   nvme   0.90970   1.00000  932 GiB  640 GiB  634 GiB  800 MiB  5.4 GiB  291 GiB  68.75  1.16  177      up          osd.9  
 -7         17.39798         -   17 TiB   10 TiB   10 TiB  807 MiB   30 GiB  7.1 TiB  59.34  1.00    -              host cerberus
  1    hdd   7.33499   1.00000  7.3 TiB  4.2 TiB  4.1 TiB  1.4 MiB   11 GiB  3.1 TiB  57.13  0.97   86      up          osd.1  
  2    hdd   9.15329   1.00000  9.2 TiB  5.5 TiB  5.4 TiB  7.3 MiB   13 GiB  3.7 TiB  59.71  1.01  108      up          osd.2  
 11   nvme   0.90970   1.00000  932 GiB  684 GiB  677 GiB  799 MiB  6.2 GiB  248 GiB  73.43  1.24  178      up          osd.11  
 -3         17.39798         -   17 TiB   10 TiB   10 TiB  816 MiB   29 GiB  7.1 TiB  59.31  1.00    -              host chimera
  0    hdd   9.15329   1.00000  9.2 TiB  5.5 TiB  5.4 TiB   11 MiB   12 GiB  3.7 TiB  59.68  1.01  110      up          osd.0  
  3    hdd   7.33499   1.00000  7.3 TiB  4.2 TiB  4.1 TiB  4.9 MiB  9.7 GiB  3.1 TiB  57.17  0.97   84      up          osd.3  
 10   nvme   0.90970   1.00000  932 GiB  679 GiB  672 GiB  799 MiB  6.7 GiB  252 GiB  72.91  1.23  178      up          osd.10  
                         TOTAL   52 TiB   31 TiB   31 TiB  2.4 GiB   86 GiB   21 TiB  59.20                                    
MIN/MAX VAR: 0.89/1.24  STDDEV: 7.78

Code:
root@balaur:~# pveceph pool ls
┌─────────────────────┬──────┬──────────┬────────┬─────────────┬────────────────┬───────────────────┬──────────────────────────┬───────────────────────────
│ Name                │ Size │ Min Size │ PG Num │ min. PG Num │ Optimal PG Num │ PG Autoscale Mode │ PG Autoscale Target Size │ PG Autoscale Target Ratio
╞═════════════════════╪══════╪══════════╪════════╪═════════════╪════════════════╪═══════════════════╪══════════════════════════╪═══════════════════════════
│ .mgr                │    3 │        2 │      1 │           1 │                │ on                │                          │                          
├─────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────
│ ceph                │    3 │        2 │    128 │             │                │ on                │                          │                          
├─────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────
│ ceph-hdd            │    3 │        2 │     32 │             │             32 │ on                │                          │                          
├─────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────
│ cephfs.loot.data    │    3 │        2 │      1 │             │                │ on                │                          │                          
├─────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────
│ cephfs.loot.meta    │    3 │        2 │      1 │          16 │                │ on                │                          │                          
├─────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────
│ loot                │    2 │        1 │      1 │             │                │ on                │                          │                          
├─────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────
│ multimedia_data     │    3 │        2 │    256 │             │            128 │ on              │                          │                          
├─────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────
│ multimedia_metadata │    3 │        2 │     32 │          16 │                │ on                │                          │                          
├─────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────
│ vault_data          │    3 │        2 │     32 │             │             32 │ on                │                          │                          
├─────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────
│ vault_metadata      │    3 │        2 │     16 │          16 │                │ on                │                          │                          
└─────────────────────┴──────┴──────────┴────────┴─────────────┴────────────────┴───────────────────┴──────────────────────────┴───────────────────────────
 
In the meantime I've set the autoscaler to warn and manually reduced the number of PGs from 256 to 128 (the advised size), roughly with the commands shown below.
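(Written from memory, so treat these as a sketch rather than the exact history:)

Code:
ceph osd pool set multimedia_data pg_autoscale_mode warn
ceph osd pool set multimedia_data pg_num 128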

The problematic pool is multimedia_data:

Code:
root@balaur:~# pveceph pool ls
┌─────────────────────┬──────┬──────────┬────────┬─────────────┬────────────────┬───────────────────┬──────────────────────────┬───────────────────────────
│ Name                │ Size │ Min Size │ PG Num │ min. PG Num │ Optimal PG Num │ PG Autoscale Mode │ PG Autoscale Target Size │ PG Autoscale Target Ratio
╞═════════════════════╪══════╪══════════╪════════╪═════════════╪════════════════╪═══════════════════╪══════════════════════════╪═══════════════════════════
│ .mgr                │    3 │        2 │      1 │           1 │                │ on                │                          │                           
├─────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────
│ ceph                │    3 │        2 │    128 │             │                │ on                │                          │                           
├─────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────
│ ceph-hdd            │    3 │        2 │     32 │             │             32 │ on                │                          │                           
├─────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────
│ cephfs.loot.data    │    3 │        2 │      1 │             │                │ on                │                          │                           
├─────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────
│ cephfs.loot.meta    │    3 │        2 │      1 │          16 │                │ on                │                          │                           
├─────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────
│ loot                │    2 │        1 │      1 │             │                │ on                │                          │                           
├─────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────
│ multimedia_data     │    3 │        2 │    128 │             │            128 │ warn              │                          │                           
├─────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────
│ multimedia_metadata │    3 │        2 │     32 │          16 │                │ on                │                          │                           
├─────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────
│ vault_data          │    3 │        2 │     32 │             │             32 │ on                │                          │                           
├─────────────────────┼──────┼──────────┼────────┼─────────────┼────────────────┼───────────────────┼──────────────────────────┼───────────────────────────
│ vault_metadata      │    3 │        2 │     16 │          16 │                │ on                │                          │                           
└─────────────────────┴──────┴──────────┴────────┴─────────────┴────────────────┴───────────────────┴──────────────────────────┴───────────────────────────

It took a couple of days to reduce the number of PGs to 128, but now rebalancing is done and the issue seems resolved.

Also, I had around 300 unscrubbed PGs; after the change that number started dropping quite quickly and is now sitting around 72 and still falling.

I will try to re-enable the autoscaler to see if the issue comes back.
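(That should just be a matter of something like this, with my pool name:)

Code:
ceph osd pool set multimedia_data pg_autoscale_mode on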
 
The number of PGs of your pool won't change from 128 back to 256 just because the autoscaler thinks 256 would be better. Changes are only made when the optimal number of PGs differs from the current number by a factor of at least 3, i.e. when it is at least three times bigger or smaller than a third of the current value. Note that for the autoscaler to calculate a sensible number of PGs, you need to set the target ratio and/or target size on all pools (except .mgr). See the Ceph docs [1].
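For example, to give the autoscaler a hint about the expected size of a pool, something along these lines should work (pool name and values are only placeholders):

Code:
ceph osd pool set <pool-name> target_size_ratio 0.8
# or, alternatively, an absolute expected size:
ceph osd pool set <pool-name> target_size_bytes 20T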

Could you please run

Code:
ceph osd pool get <pool-name> crush_rule

for *all* pools? If you are using device-class-specific rules, then you cannot keep using the generic `replicated_rule` for the remaining pools.
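To get the rule for every pool in one go, a small shell loop like this should do (assuming a root shell on one of the nodes):

Code:
for p in $(ceph osd pool ls); do echo -n "$p: "; ceph osd pool get "$p" crush_rule; done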

[1] https://docs.ceph.com/en/latest/rad...nt-groups/#viewing-pg-scaling-recommendations
 
Have you created two different CRUSH rules for your HDDs and NVMe drives?
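(They can be listed and inspected with e.g. the following:)

Code:
ceph osd crush rule ls
ceph osd crush rule dump <rule-name>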
 
Yes, the problematic pool was an erasure-coded pool with the pool name as its CRUSH rule.

I do remember Ceph advised me to set the "bulk" flag when creating the pool, which I did. Does this affect things? Like I mentioned, manually lowering the number of PGs and switching back to autoscaling seems to have fixed it.
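(For anyone curious, the flag can be checked and changed with something like this; the pool name is mine:)

Code:
ceph osd pool get multimedia_data bulk
ceph osd pool set multimedia_data bulk false

My full CRUSH map, for reference: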

Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 9 osd.9 class nvme
device 10 osd.10 class nvme
device 11 osd.11 class nvme

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host chimera {
    id -3        # do not change unnecessarily
    id -4 class hdd        # do not change unnecessarily
    id -5 class ssd        # do not change unnecessarily
    id -13 class nvme        # do not change unnecessarily
    id -17 class cache        # do not change unnecessarily
    # weight 17.39798
    alg straw2
    hash 0    # rjenkins1
    item osd.10 weight 0.90970
    item osd.0 weight 9.15329
    item osd.3 weight 7.33499
}
host cerberus {
    id -7        # do not change unnecessarily
    id -8 class hdd        # do not change unnecessarily
    id -9 class ssd        # do not change unnecessarily
    id -14 class nvme        # do not change unnecessarily
    id -18 class cache        # do not change unnecessarily
    # weight 17.39798
    alg straw2
    hash 0    # rjenkins1
    item osd.11 weight 0.90970
    item osd.2 weight 9.15329
    item osd.1 weight 7.33499
}
host balaur {
    id -10        # do not change unnecessarily
    id -11 class hdd        # do not change unnecessarily
    id -12 class ssd        # do not change unnecessarily
    id -15 class nvme        # do not change unnecessarily
    id -19 class cache        # do not change unnecessarily
    # weight 17.28278
    alg straw2
    hash 0    # rjenkins1
    item osd.9 weight 0.90970
    item osd.7 weight 7.27739
    item osd.6 weight 9.09569
}
root default {
    id -1        # do not change unnecessarily
    id -2 class hdd        # do not change unnecessarily
    id -6 class ssd        # do not change unnecessarily
    id -16 class nvme        # do not change unnecessarily
    id -20 class cache        # do not change unnecessarily
    # weight 52.07874
    alg straw2
    hash 0    # rjenkins1
    item chimera weight 17.39798
    item cerberus weight 17.39798
    item balaur weight 17.28278
}

# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule ssd_only {
    id 1
    type replicated
    step take default class ssd
    step chooseleaf firstn 0 type host
    step emit
}
rule vm_storage {
    id 2
    type replicated
    step take default class nvme
    step chooseleaf firstn 0 type host
    step emit
}
rule erasure-code {
    id 3
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step chooseleaf indep 0 type host
    step emit
}
rule multimedia_data {
    id 4
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step chooseleaf indep 0 type host
    step emit
}
rule replicated_hdd {
    id 5
    type replicated
    step take default class hdd
    step chooseleaf firstn 0 type host
    step emit
}
rule cache {
    id 6
    type replicated
    step take default class cache
    step chooseleaf firstn 0 type host
    step emit
}
rule replicated_cache {
    id 7
    type replicated
    step take default class cache
    step chooseleaf firstn 0 type host
    step emit
}
rule replicated_nvme {
    id 8
    type replicated
    step take default class nvme
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map