[SOLVED] CEPH OSDs Full, Unbalanced PGs, and Rebalancing Issues in Proxmox VE 8

lfq_made4it


Scenario


I have a Proxmox VE 8 cluster with 6 nodes, using CEPH as distributed storage. The cluster consists of 48 OSDs, distributed across 4 servers with SSDs and 2 with HDDs.


Monday night, three OSDs reached 100% capacity and crashed:
  • osd.16 (pve118)
  • osd.23 (pve118)
  • osd.24 (pve119)
Logs indicate "ENOSPC" (No Space Left on Device), and these OSDs are unable to start.





Troubleshooting Steps Taken


  1. Removed and re-added the problematic OSDs.
  2. Enabled rebalancing (ceph osd unset norebalance and ceph osd unset norecover).
  3. Tried to force PG relocation with ceph osd pg-upmap-items, but PGs did not move.
  4. Executed ceph osd reweight on full OSDs, but the issue persists.
  5. Considered using ceph-bluestore-tool bluefs-bdev-migrate, but I’m unsure if it applies in this case.
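For reference, steps 2–4 were run roughly like this (the PG ID and OSD numbers below are placeholders, not the exact ones used):

Code:
ceph osd unset norebalance
ceph osd unset norecover
# attempt to move one replica of a PG off a full OSD (PG id and OSD ids are examples)
ceph osd pg-upmap-items 2.1a 17 16
# lower the reweight of a full OSD so CRUSH places less data on it
ceph osd reweight 17 0.85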



Current Cluster Status


Output of ceph -s:

Code:
root@pve118:~# ceph -s
  cluster:
    id:     52d10d07-2f32-41e7-b8cf-7d7282af69a2
    health: HEALTH_WARN
            2 nearfull osd(s)
            Degraded data redundancy: 1668878/14228676 objects degraded (11.729%), 45 pgs degraded, 45 pgs undersized
            23 pgs not deep-scrubbed in time
            23 pgs not scrubbed in time
            2 pool(s) nearfull

  services:
    mon: 6 daemons, quorum pve118,pve119,pve114,pve142,pve143,pve117 (age 5d)
    mgr: pve119(active, since 17h), standbys: pve118, pve117, pve143, pve142, pve114
    osd: 48 osds: 48 up (since 23h), 48 in (since 23h); 83 remapped pgs

  data:
    pools:   3 pools, 289 pgs
    objects: 4.74M objects, 17 TiB
    usage:   47 TiB used, 156 TiB / 203 TiB avail
    pgs:     1668878/14228676 objects degraded (11.729%)
             3073835/14228676 objects misplaced (21.603%)
             161 active+clean
             81  active+clean+remapped
             45  active+undersized+degraded
             2   active+clean+remapped+scrubbing+deep

  io:
    client:   85 KiB/s rd, 7.9 MiB/s wr, 6 op/s rd, 351 op/s wr

root@pve118:~#

Output of ceph osd df shows some OSDs at 90%+ usage, while others are below 20%.


Code:
root@pve118:~# ceph osd df tree
ID   CLASS  WEIGHT     REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
 -1         202.57080         -  203 TiB   47 TiB   47 TiB  858 KiB  134 GiB  156 TiB  23.07  1.00    -          root default
 -9           6.98633         -  7.0 TiB  1.4 TiB  1.4 TiB   77 KiB  5.4 GiB  5.5 TiB  20.65  0.90    -              host pve114
  0    ssd    0.87329   1.00000  894 GiB  193 GiB  192 GiB   20 KiB  683 MiB  701 GiB  21.56  0.93   25      up          osd.0
  1    ssd    0.87329   1.00000  894 GiB  194 GiB  193 GiB   12 KiB  853 MiB  701 GiB  21.64  0.94   21      up          osd.1
  2    ssd    0.87329   1.00000  894 GiB  128 GiB  128 GiB    3 KiB  505 MiB  766 GiB  14.37  0.62    5      up          osd.2
  3    ssd    0.87329   1.00000  894 GiB  193 GiB  192 GiB   10 KiB  745 MiB  701 GiB  21.56  0.93   35      up          osd.3
  4    ssd    0.87329   1.00000  894 GiB  129 GiB  128 GiB   12 KiB  576 MiB  766 GiB  14.39  0.62   10      up          osd.4
  5    ssd    0.87329   1.00000  894 GiB  193 GiB  192 GiB    9 KiB  576 MiB  701 GiB  21.58  0.94   10      up          osd.5
  6    ssd    0.87329   1.00000  894 GiB  320 GiB  319 GiB    6 KiB  736 MiB  574 GiB  35.80  1.55    5      up          osd.6
  7    ssd    0.87329   1.00000  894 GiB  128 GiB  127 GiB    5 KiB  821 MiB  766 GiB  14.31  0.62   20      up          osd.7
 -3           6.98633         -  7.0 TiB  4.3 TiB  4.3 TiB  166 KiB   15 GiB  2.7 TiB  61.28  2.66    -              host pve117
  8    ssd    0.87329   0.95000  894 GiB  686 GiB  683 GiB   28 KiB  2.5 GiB  209 GiB  76.67  3.32   20      up          osd.8
  9    ssd    0.87329   1.00000  894 GiB  137 GiB  136 GiB   24 KiB  939 MiB  757 GiB  15.36  0.67   11      up          osd.9
 10    ssd    0.87329   0.95000  894 GiB  684 GiB  682 GiB   15 KiB  2.3 GiB  210 GiB  76.52  3.32   15      up          osd.10
 11    ssd    0.87329   0.95001  894 GiB  687 GiB  685 GiB   27 KiB  1.9 GiB  207 GiB  76.87  3.33   15      up          osd.11
 12    ssd    0.87329   0.95000  894 GiB  681 GiB  680 GiB   26 KiB  1.4 GiB  213 GiB  76.20  3.30   10      up          osd.12
 13    ssd    0.87329   0.95001  894 GiB  685 GiB  683 GiB   19 KiB  2.1 GiB  209 GiB  76.64  3.32   30      up          osd.13
 14    ssd    0.87329   1.00000  894 GiB  138 GiB  136 GiB   11 KiB  1.8 GiB  756 GiB  15.42  0.67   26      up          osd.14
 15    ssd    0.87329   0.95000  894 GiB  685 GiB  683 GiB   16 KiB  2.0 GiB  209 GiB  76.59  3.32   15      up          osd.15
 -5           6.98633         -  7.0 TiB  2.7 TiB  2.7 TiB  105 KiB   10 GiB  4.3 TiB  38.25  1.66    -              host pve118
 16    ssd    0.87329   1.00000  894 GiB  217 MiB  169 MiB    1 KiB   48 MiB  894 GiB   0.02  0.00   25      up          osd.16
 17    ssd    0.87329   0.95001  894 GiB  818 GiB  816 GiB   16 KiB  1.9 GiB   77 GiB  91.44  3.96   11      up          osd.17
 18    ssd    0.87329   0.95001  894 GiB  819 GiB  817 GiB   18 KiB  2.1 GiB   75 GiB  91.61  3.97   16      up          osd.18
 19    ssd    0.87329   1.00000  894 GiB  274 GiB  272 GiB   12 KiB  1.6 GiB  620 GiB  30.64  1.33   12      up          osd.19
 20    ssd    0.87329   1.00000  894 GiB  139 GiB  137 GiB   14 KiB  1.6 GiB  755 GiB  15.54  0.67   46      up          osd.20
 21    ssd    0.87329   1.00000  894 GiB  138 GiB  137 GiB   16 KiB  677 MiB  757 GiB  15.38  0.67   16      up          osd.21
 22    ssd    0.87329   1.00000  894 GiB  549 GiB  546 GiB   27 KiB  2.5 GiB  346 GiB  61.34  2.66   14      up          osd.22
 23    ssd    0.87329   1.00000  894 GiB  174 MiB  130 MiB    1 KiB   44 MiB  894 GiB   0.02     0   20      up          osd.23
 -7           6.98633         -  7.0 TiB  3.1 TiB  3.1 TiB  135 KiB   14 GiB  3.9 TiB  44.04  1.91    -              host pve119
 24    ssd    0.87329   1.00000  894 GiB  125 MiB   98 MiB    1 KiB   26 MiB  894 GiB   0.01     0   10      up          osd.24
 25    ssd    0.87329   0.95000  894 GiB  686 GiB  684 GiB   20 KiB  2.5 GiB  208 GiB  76.77  3.33   10      up          osd.25
 26    ssd    0.87329   1.00000  894 GiB  409 GiB  407 GiB   10 KiB  2.2 GiB  485 GiB  45.75  1.98    8      up          osd.26
 27    ssd    0.87329   1.00000  894 GiB  408 GiB  406 GiB   13 KiB  2.2 GiB  486 GiB  45.67  1.98   23      up          osd.27
 28    ssd    0.87329   0.95000  894 GiB  684 GiB  682 GiB   32 KiB  2.1 GiB  210 GiB  76.52  3.32   20      up          osd.28
 29    ssd    0.87329   1.00000  894 GiB  413 GiB  411 GiB   20 KiB  1.9 GiB  481 GiB  46.17  2.00    8      up          osd.29
 30    ssd    0.87329   1.00000  894 GiB  412 GiB  410 GiB   23 KiB  2.1 GiB  482 GiB  46.10  2.00   33      up          osd.30
 31    ssd    0.87329   1.00000  894 GiB  137 GiB  136 GiB   16 KiB  895 MiB  757 GiB  15.36  0.67   11      up          osd.31
-16          87.31274         -   87 TiB   18 TiB   18 TiB  184 KiB   47 GiB   70 TiB  20.35  0.88    -              host pve142
 32    hdd   10.91409   1.00000   11 TiB  2.4 TiB  2.4 TiB   26 KiB  5.9 GiB  8.5 TiB  22.03  0.95   19      up          osd.32
 33    hdd   10.91409   1.00000   11 TiB  1.7 TiB  1.7 TiB   25 KiB  4.6 GiB  9.2 TiB  15.91  0.69   13      up          osd.33
 34    hdd   10.91409   1.00000   11 TiB  2.0 TiB  2.0 TiB   16 KiB  4.4 GiB  8.9 TiB  18.37  0.80   15      up          osd.34
 35    hdd   10.91409   1.00000   11 TiB  2.7 TiB  2.7 TiB   23 KiB  7.1 GiB  8.2 TiB  24.49  1.06   20      up          osd.35
 36    hdd   10.91409   1.00000   11 TiB  1.9 TiB  1.9 TiB   22 KiB  4.9 GiB  9.0 TiB  17.14  0.74   14      up          osd.36
 37    hdd   10.91409   1.00000   11 TiB  1.2 TiB  1.2 TiB   13 KiB  3.3 GiB  9.7 TiB  10.99  0.48    9      up          osd.37
 38    hdd   10.91409   1.00000   11 TiB  3.5 TiB  3.5 TiB   33 KiB   11 GiB  7.4 TiB  31.84  1.38   26      up          osd.38
 39    hdd   10.91409   1.00000   11 TiB  2.4 TiB  2.4 TiB   26 KiB  6.0 GiB  8.5 TiB  22.01  0.95   18      up          osd.39
-19          87.31274         -   87 TiB   17 TiB   17 TiB  191 KiB   43 GiB   70 TiB  20.04  0.87    -              host pve143
 40    hdd   10.91409   1.00000   11 TiB  2.7 TiB  2.7 TiB   20 KiB  7.1 GiB  8.2 TiB  24.46  1.06   20      up          osd.40
 41    hdd   10.91409   1.00000   11 TiB  1.9 TiB  1.9 TiB   12 KiB  4.4 GiB  9.0 TiB  17.18  0.74   15      up          osd.41
 42    hdd   10.91409   1.00000   11 TiB  1.9 TiB  1.9 TiB   36 KiB  4.7 GiB  9.0 TiB  17.09  0.74   14      up          osd.42
 43    hdd   10.91409   1.00000   11 TiB  3.2 TiB  3.2 TiB   27 KiB  8.2 GiB  7.7 TiB  29.38  1.27   24      up          osd.43
 44    hdd   10.91409   1.00000   11 TiB  2.1 TiB  2.1 TiB   20 KiB  4.8 GiB  8.8 TiB  19.57  0.85   16      up          osd.44
 45    hdd   10.91409   1.00000   11 TiB  1.6 TiB  1.6 TiB   27 KiB  4.1 GiB  9.3 TiB  14.71  0.64   12      up          osd.45
 46    hdd   10.91409   1.00000   11 TiB  2.1 TiB  2.1 TiB   20 KiB  4.7 GiB  8.8 TiB  19.58  0.85   16      up          osd.46
 47    hdd   10.91409   1.00000   11 TiB  2.0 TiB  2.0 TiB   29 KiB  4.7 GiB  8.9 TiB  18.33  0.79   15      up          osd.47
                          TOTAL  203 TiB   47 TiB   47 TiB  882 KiB  134 GiB  156 TiB  23.07
MIN/MAX VAR: 0/3.97  STDDEV: 27.96
root@pve118:~#



Questions and Help Request


  1. How can I force PG reallocation to OSDs with available space?
  2. Why is pg-upmap-items not working for PG redistribution?
  3. Is there any other method to free up space on these OSDs and rebalance the cluster?

Any help from the community would be greatly appreciated!



I can provide logs or additional command outputs if needed.


Thanks in advance for your support!
 
Checking this again with the already provided data, I'm pretty sure you are using crush rule(s) that do not use device class, and mixing 1T with 10T drives in the same pools with so few PGs will cause such an imbalance.
 
Checking this again with the already provided data, I'm pretty sure you are using crush rule(s) that do not use device class, and mixing 1T with 10T drives in the same pools with so few PGs will cause such an imbalance.
I think he does, because of:

data: pools: 3 pools, 289 pgs

1. .mgr
2. hdd
3. ssd

Correct me if I'm wrong.

This could help you, but should really only be the very last method with complete backups:

 
data: pools: 3 pools, 289 pgs
Need the requested output to be sure.

This could help you, but should really only be the very last method with complete backups:
That was helpful in that case because only 3 hosts were used. With 6 it will probably help, but it does not completely avoid the main issue; over time an unbalanced distribution will arise again, not to mention how unsafe it is to use just 2 replicas ;)
 
ceph osd pool ls detail
ceph osd crush rule dump
ceph pg dump (this last might get huge!)
The output of the commands is attached


Checking this again with the already provided data, I'm pretty sure you are using crush rule(s) that do not use device class, and mixing 1T with 10T drives in the same pools with so few PGs will cause such an imbalance.
We are using two pools: an ssd-pool with a specific crush rule for the SSD drives (1 TB), and an hdd-pool with its own crush rule for the HDD drives (10 TB).


That was helpful in that case because only 3 hosts were used. With 6 it will probably help, but it does not completely avoid the main issue; over time an unbalanced distribution will arise again, not to mention how unsafe it is to use just 2 replicas ;)
There's an important detail in our scenario: 4 of the 6 servers have SSDs and we created the ssd-pool for them, but only 2 of the 6 servers have HDDs; we created the hdd-pool for them anyway, with size 3 and min_size 2.
 


I think we have several problems here:

1. Uneven distribution: HDDs only on 2 nodes.
2. 83 remapped PGs (is that remapping still in progress right now?)

pool 4 'hdd-pool' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 128 pgp_num 120 pgp_num_target 128 autoscale_mode on last_change 4488 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 1.88

You have only 2 HDD nodes; if one of them crashes, the data on the hdd pool becomes unavailable.

On the other hand, some of your SSD OSDs crashed or got full, which is not related to the hdd pool.

What type of SSD are you using? Can you show us some SMART values?

Code:
'ssd-pool' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins
pg_num 160 pgp_num 32 pg_num_target 1024 pgp_num_target 1024
 
A lot of things here...
  • You are running 6 monitors, which is not recommended (an even number adds no extra fault tolerance). Use either 3 or 5 (preferred, as it allows 2 mons to fail and still keep quorum).
  • You have a 3/2 pool set to drive class "hdd", but only have 2 servers with "hdd" drives. This is the main origin of your problem. All the PGs in that "hdd" pool are using one OSD from the "ssd" class, hence the great imbalance on some OSDs. Check any PG in the ceph pg dump output whose "UP" column has only 2 OSDs, and note that there are 3 OSDs "ACTING". Not 100% sure why Ceph is using a different device class for those "third replicas"; those PGs should be "active+undersized" instead of "active+remapped". Maybe you changed the crush rule for that pool and, while rebalancing, some OSDs got too full?
  • Your "ssd-pool" uses crush rule 3, which is set to failure domain of "osd":
Code:
    {
        "rule_id": 3,
        "rule_name": "ssd-osd-replicated-rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -2,
                "item_name": "default~ssd"
            },
            {
                "op": "choose_firstn",
                "num": 0,
                "type": "osd"
            },
            {
                "op": "emit"
            }
        ]
    }

That means "create 3 copies of each bit on 3 OSD of class ssd". It does not force each copy to be created in a different host. You can check that in ceph pg dump output, where many PGs are located in two or even 3 OSD of the same host. If some host fails or is rebooted, PGs will become inactive. This essentially defeats the whole purpose of Ceph.
  • You have the .mgr pool still using the default "replicated_rule", which disables autoscaler ability to adapt the PGs of each pool.
  • The "ssd-pool" is set to 160PGs, which is not a valid value (should be multiple of 2). Probably due to the autoscaler being disabled as explained above.

Actions:
  • Triple check you have working backups.
  • Change the "hdd" pool to use 2/2 replicas, so it frees space from the "ssd" OSD.
  • Change crush rule for .mgr to use the ssd-replicated-rule rule.
  • Change the crush rule for the "ssd" pool to use the ssd-replicated-rule rule.
  • Then buy another server, fit it with HDDs of the same size, and change the "hdd" pool back to 3/2 replicas.
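The pool-level changes above boil down to a handful of commands; roughly (pool and rule names as they appear in this thread, double-check them before running anything):

Code:
# free space on the SSD OSDs by dropping the hdd pool to 2 replicas for now
ceph osd pool set hdd-pool size 2
ceph osd pool set hdd-pool min_size 2
# point .mgr and the ssd pool at the host-based, class-aware rule
ceph osd pool set .mgr crush_rule ssd-replicated-rule
ceph osd pool set ssd-pool crush_rule ssd-replicated-rule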
 
I think we have several problems here:
A lot of things here...

We figured as much... :/

2. 83 remapped PGs (is that remapping still in progress right now?)

The “hdd-pool” is functional with a few discrepancies, such as not being able to see its contents via the Proxmox GUI; and the ssd-pool, despite having all its PGs “active+clean”, shows a drastically reduced total capacity (it should be around 9.7T but shows just under 400G).


What type of SSD are you using? Can you show us some SMART values?

The SSDs we use are all the same model; I'll leave the smartctl output of one of them here as an example.

Code:
root@pve118:~# smartctl -a /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-5-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HPE
Product:              VO000960JWZJF
Revision:             HPD5
Compliance:           SPC-5
User Capacity:        960,197,124,096 bytes [960 GB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x50000f0b0183d050
Serial number:        S5KRNA0R806433
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Wed Feb 19 13:44:47 2025 -04
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 0%
Current Drive Temperature:     22 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 23375:35
Manufactured in week 32 of year 2021
Accumulated start-stop cycles:  84
Specified load-unload count over device lifetime:  0
Accumulated load-unload cycles:  0
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0          0.083           0
write:         0        0         0         0          0          0.000           0

Non-medium error count:     1915

  Pending defect count:0 Pending Defects
No Self-tests have been logged

root@pve118:~#


  • You are running 6 monitors, which is not recommended (an even number adds no extra fault tolerance). Use either 3 or 5 (preferred, as it allows 2 mons to fail and still keep quorum).

We hadn't paid attention to this; we'll adjust it.


  • You have a 3/2 pool set to drive class "hdd", but only have 2 servers with "hdd" drives. This is the main origin of your problem.

We created the “hdd-pool” with size 3/2 because the intention is to add a 7th server with HDD devices to the cluster to reach the ideal size, and we didn't know that a 2/2 size could be used in production in the meantime, until the other server arrives.


All the PGs in that "hdd" pool are using one OSD from the "ssd" class, hence the great imbalance on some OSDs. Check any PG in the ceph pg dump output whose "UP" column has only 2 OSDs, and note that there are 3 OSDs "ACTING". Not 100% sure why Ceph is using a different device class for those "third replicas"; those PGs should be "active+undersized" instead of "active+remapped". Maybe you changed the crush rule for that pool and, while rebalancing, some OSDs got too full?

We don't have enough experience with CEPH to troubleshoot this ourselves, which is why we came to the community, but looking at what you've said, it really is very strange behavior. Ever since the “hdd-pool” was created it has had the same crush rule; we've only changed the “ssd-pool” rule. Do you have any idea why this might have happened?


  • Your "ssd-pool" uses crush rule 3, which is set to failure domain of "osd":

Initially the pool used the crush rule “ssd-replicated-rule”, which uses “host” as the failure domain type; we changed its crush rule thinking that might trigger rebalancing, but it had no effect.


  • You have the .mgr pool still using the default "replicated_rule", which disables the autoscaler's ability to adapt the PGs of each pool.
  • The "ssd-pool" is set to 160 PGs, which is not an ideal value (it should be a power of 2). Probably due to the autoscaler being disabled, as explained above.

We have tried changing the crush rule of the .mgr pool to both the “ssd-replicated-rule” and the “hdd-replicated-rule” so the autoscaler would work, but we didn't keep either of them, because there are two rules and we're unsure how to make it work properly... The ssd-pool only has 160 PGs because of the autoscaler; it set this value.


Actions:
  • Triple check you have working backups.
  • Change the "hdd" pool to use 2/2 replicas, so it frees space from the "ssd" OSD.
  • Change crush rule for .mgr to use the ssd-replicated-rule rule.
  • Change the crush rule for the "ssd" pool to use the ssd-replicated-rule rule.
  • Then buy another server, fit it with HDDs of the same size, and change the "hdd" pool back to 3/2 replicas.

We will work on these actions and bring you news as soon as possible.


Thank you all very much for your help so far!
 
Hello, everyone!

I'm back to report that the CEPH environment is now 100% healthy after applying the actions you recommended!

Code:
root@pve114:~# ceph -s
  cluster:
    id:     52d10d07-2f32-41e7-b8cf-7d7282af69a2
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum pve118,pve119,pve114,pve142,pve117 (age 22m)
    mgr: pve119(active, since 2d), standbys: pve118, pve117, pve142, pve114
    osd: 48 osds: 48 up (since 3d), 48 in (since 3d)

  data:
    pools:   3 pools, 1153 pgs
    objects: 4.68M objects, 17 TiB
    usage:   34 TiB used, 169 TiB / 203 TiB avail
    pgs:     1152 active+clean
             1    active+clean+scrubbing+deep

  io:
    client:   48 MiB/s rd, 31 MiB/s wr, 498 op/s rd, 499 op/s wr

root@pve114:~#

Thank you so much for your help!

We changed the CRUSH rule of the "ssd-pool," adjusted the size of the "hdd-pool" to 2/2, and also removed one monitor to keep the total at five.
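For the record, the monitor was removed from the Proxmox side; judging by the quorum list it was the one on pve143, with something along the lines of:

Code:
pveceph mon destroy pve143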

I've attached the verification commands for you to check.

By the way, I have one last question... As I mentioned, the cluster has six servers, and four of them have SSDs, with eight SSDs per server. Considering that the "Failure domain type" we are using for the pools is "host" and that we reduced the "hdd-pool" size to 2/2 since we have two servers with HDDs, should we adjust the "ssd-pool" size to 4/3?
 


As I mentioned, the cluster has six servers, and four of them have SSDs, with eight SSDs per server. Considering that the "Failure domain type" we are using for the pools is "host" and that we reduced the "hdd-pool" size to 2/2 since we have two servers with HDDs, should we adjust the "ssd-pool" size to 4/3?
The number of hosts isn't directly related to your size rules. A typical replication group is set to 3:2, where 3 OSDs are required for healthy operation and 2 is the minimum OSD count required for write access. Such a pool needs a minimum of 3 hosts, but that doesn't change as the host count gets larger.
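In other words, the ssd pool can simply stay at 3/2 no matter how many SSD hosts you add; if you ever need to set it explicitly, it is just (pool name as used in this thread):

Code:
ceph osd pool set ssd-pool size 3
ceph osd pool set ssd-pool min_size 2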

Failure domain type" we are using for the pools is "host" and that we reduced the "hdd-pool" size to 2/2 since we have two servers with HDDs,
This is not a good idea at all. A 2:2 setup means you cannot sustain any fault, and it also means there is never a quorum on disagreement; any disagreement becomes a fault by definition and will shut down your file system.
 
The number of hosts isn't directly related to your size rules. A typical replication group is set to 3:2, where 3 OSDs are required for healthy operation and 2 is the minimum OSD count required for write access. Such a pool needs a minimum of 3 hosts, but that doesn't change as the host count gets larger.

Understood

This is not a good idea at all. A 2:2 setup means you cannot sustain any fault, and it also means there is never a quorum on disagreement; any disagreement becomes a fault by definition and will shut down your file system.

We are aware, and the idea is not to keep it this way. We are working on adding a third server with HDDs, and we'll adjust the size back to 3/2.
 
Glad to know its ok now!

adjusted the size of the "hdd-pool" to 2/2
Add that third host ASAP, like yesterday. While you wait for the third node with HDDs, if for some reason you have to take down one of the HDD hosts or it breaks, change the pool to 2/1. That will keep I/O to the pool going, at the (very high) cost of no redundancy and a risk of data loss if anything fails on that last HDD host.
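If it comes to that, the temporary 2/1 fallback is a single setting on the pool (name as used in this thread), to be reverted as soon as both HDD hosts are healthy again:

Code:
ceph osd pool set hdd-pool min_size 1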

should we adjust the "ssd-pool" size to 4/3?
alexskysilk gave a perfect explanation. I would like to add that with Ceph, 3 hosts is the minimum; the fourth essentially just adds redundancy: it covers one host failure while still being able to keep 3 copies, so the space it provides shouldn't really be treated as "available space". The fifth host is the first that really adds both redundancy and space to the cluster, as do additional hosts. All hosts add I/O and bandwidth capacity to the cluster as a whole.

You could have an 8/7 ssd pool, which would create 8 copies and require 7 hosts to be up for writes to work, but that doesn't really make much sense unless you can print your own money ;)

Once you reach at least 9 hosts, you may increase the failure domain, e.g. to rack: place 3 servers in each of 3 different racks and tell Ceph to put each copy in a server of a different rack. This increases availability in case a full rack goes down due to power or network failures. There are lots of possibilities!
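A rough sketch of what that could look like (the rack bucket names and the rule name here are made up; adjust them to your own layout):

Code:
# create rack buckets under the default root and move hosts into them
ceph osd crush add-bucket rack1 rack
ceph osd crush move rack1 root=default
ceph osd crush move pve114 rack=rack1
# ...repeat for the other racks and hosts...
# replicated rule that spreads copies across racks, restricted to the ssd class
ceph osd crush rule create-replicated ssd-rack-replicated-rule default rack ssd
ceph osd pool set ssd-pool crush_rule ssd-rack-replicated-rule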
 