[SOLVED] Ceph hang in Degraded data redundancy

ilia987
flow:
1. One server was rebooted due to power maintenance.
2. After the reboot I noticed that server had a bad clock sync; fixing the issue and another reboot solved it (see the check commands sketched below).
3. After the time sync was fixed, the cluster started to recover and rebalance.
4. It hung in an error state (the data looks OK and everything is stable and working), but I see an error in ceph health.

Any idea?
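
For reference, a minimal sketch of how the clock sync and the overall health state can be checked from a node (chrony is assumed as the time sync daemon, as on a default Proxmox VE install; adjust for your NTP client):

Code:
# check local clock synchronization (chrony assumed)
timedatectl status
chronyc tracking

# clock skew as seen by the Ceph monitors
ceph time-sync-status

# detailed health output, lists the affected PGs
ceph health detail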



Attachments

  • Screenshot from 2024-02-28 12-27-36.png
Please post the output of ceph -s and ceph osd df tree.
ceph -s

Code:
  cluster:
    id:     8ebca482-f985-4e74-9ff8-35e03a1af15e
    health: HEALTH_WARN
            Degraded data redundancy: 1608/62722158 objects degraded (0.003%), 28 pgs degraded, 22 pgs undersized
 
  services:
    mon: 3 daemons, quorum pve-srv2,pve-srv3,pve-srv4 (age 2d)
    mgr: pve-srv2(active, since 6h), standbys: pve-srv4, pve-srv3
    mds: 2/2 daemons up, 1 standby
    osd: 32 osds: 32 up (since 2d), 32 in (since 2d); 42 remapped pgs
 
  data:
    volumes: 2/2 healthy
    pools:   6 pools, 705 pgs
    objects: 20.91M objects, 41 TiB
    usage:   123 TiB used, 59 TiB / 182 TiB avail
    pgs:     1608/62722158 objects degraded (0.003%)
             401413/62722158 objects misplaced (0.640%)
             654 active+clean
             22  active+recovering+undersized+degraded+remapped
             19  active+remapped+backfilling
             5   active+recovering+degraded
             4   active+recovering
             1   active+recovering+degraded+remapped
 
  io:
    client:   37 KiB/s rd, 3.8 MiB/s wr, 2 op/s rd, 343 op/s wr


ceph osd df tree
Code:
ID  CLASS  WEIGHT     REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL     %USE   VAR   PGS  STATUS  TYPE NAME     
-1         181.63490         -  182 TiB  123 TiB  123 TiB  7.0 GiB  249 GiB    59 TiB  67.70  1.00    -          root default   
-3          41.91394         -   42 TiB   29 TiB   29 TiB  124 MiB   58 GiB    13 TiB  69.67  1.03    -              host pve-srv2
 0    ssd    3.49219   1.00000  3.5 TiB  2.5 TiB  2.5 TiB  1.1 MiB  4.5 GiB  1005 GiB  71.90  1.06   41      up          osd.0 
 1    ssd    3.49219   1.00000  3.5 TiB  2.6 TiB  2.6 TiB  4.5 MiB  5.6 GiB   929 GiB  74.01  1.09   45      up          osd.1 
 2    ssd    3.49219   1.00000  3.5 TiB  2.3 TiB  2.3 TiB  1.9 MiB  4.4 GiB   1.2 TiB  65.20  0.96   39      up          osd.2 
 3    ssd    3.49219   1.00000  3.5 TiB  1.9 TiB  1.9 TiB  9.4 MiB  4.7 GiB   1.6 TiB  54.76  0.81   32      up          osd.3 
12    ssd    6.98630   1.00000  7.0 TiB  4.9 TiB  4.9 TiB   13 MiB  9.7 GiB   2.1 TiB  70.50  1.04   82      up          osd.12 
13    ssd    6.98630   1.00000  7.0 TiB  5.2 TiB  5.2 TiB  9.6 MiB   10 GiB   1.7 TiB  74.98  1.11   86      up          osd.13 
14    ssd    6.98630   1.00000  7.0 TiB  4.9 TiB  4.8 TiB   80 MiB  9.3 GiB   2.1 TiB  69.50  1.03   76      up          osd.14 
15    ssd    6.98630   1.00000  7.0 TiB  4.9 TiB  4.9 TiB  4.8 MiB   10 GiB   2.1 TiB  70.14  1.04   78      up          osd.15 
-7          41.91574         -   42 TiB   31 TiB   31 TiB  1.9 GiB   61 GiB    11 TiB  72.99  1.08    -              host pve-srv3
 8    ssd    3.49219   1.00000  3.5 TiB  2.6 TiB  2.6 TiB  2.1 MiB  4.9 GiB   941 GiB  73.68  1.09   43      up          osd.8 
10    ssd    3.49309   1.00000  3.5 TiB  2.7 TiB  2.7 TiB  602 MiB  5.9 GiB   767 GiB  78.55  1.16   47      up          osd.10 
11    ssd    3.49219   1.00000  3.5 TiB  2.4 TiB  2.4 TiB  1.2 MiB  4.2 GiB   1.1 TiB  67.82  1.00   39      up          osd.11 
17    ssd    6.98630   1.00000  7.0 TiB  5.5 TiB  5.5 TiB  5.6 MiB   10 GiB   1.4 TiB  79.26  1.17   93      up          osd.17 
18    ssd    6.98630   1.00000  7.0 TiB  5.1 TiB  5.1 TiB  6.0 MiB   11 GiB   1.8 TiB  73.61  1.09   90      up          osd.18 
23    ssd    6.98630   1.00000  7.0 TiB  5.0 TiB  5.0 TiB  461 MiB  9.7 GiB   2.0 TiB  71.43  1.05   85      up          osd.23 
27    ssd    3.49309   1.00000  3.5 TiB  1.9 TiB  1.9 TiB  302 MiB  4.2 GiB   1.6 TiB  54.90  0.81   32      up          osd.27 
32    ssd    6.98630   1.00000  7.0 TiB  5.3 TiB  5.3 TiB  542 MiB   10 GiB   1.7 TiB  76.16  1.12   91      up          osd.32 
-5          41.91484         -   42 TiB   29 TiB   29 TiB  1.7 GiB   60 GiB    13 TiB  69.10  1.02    -              host pve-srv4
 4    ssd    3.49219   1.00000  3.5 TiB  2.1 TiB  2.1 TiB  1.1 MiB  4.0 GiB   1.4 TiB  60.55  0.89   35      up          osd.4 
 5    ssd    3.49219   1.00000  3.5 TiB  2.6 TiB  2.6 TiB  3.3 MiB  5.5 GiB   879 GiB  75.42  1.11   46      up          osd.5 
 6    ssd    3.49219   1.00000  3.5 TiB  2.9 TiB  2.9 TiB  1.1 MiB  5.1 GiB   583 GiB  83.70  1.24   49      up          osd.6 
 7    ssd    3.49309   1.00000  3.5 TiB  2.7 TiB  2.7 TiB  1.3 GiB  6.2 GiB   818 GiB  77.13  1.14   50      up          osd.7 
19    ssd    6.98630   1.00000  7.0 TiB  5.4 TiB  5.4 TiB  6.6 MiB   11 GiB   1.6 TiB  77.29  1.14   91      up          osd.19 
20    ssd    6.98630   1.00000  7.0 TiB  4.3 TiB  4.3 TiB  6.4 MiB  9.4 GiB   2.7 TiB  61.71  0.91   79      up          osd.20 
21    ssd    6.98630   1.00000  7.0 TiB  4.5 TiB  4.5 TiB  4.3 MiB   10 GiB   2.5 TiB  64.37  0.95   78      up          osd.21 
22    ssd    6.98630   1.00000  7.0 TiB  4.4 TiB  4.4 TiB  389 MiB  8.9 GiB   2.6 TiB  62.83  0.93   72      up          osd.22 
-9          55.89038         -   56 TiB   34 TiB   34 TiB  3.3 GiB   70 GiB    22 TiB  61.22  0.90    -              host pve-srv5
 9    ssd    6.98630   1.00000  7.0 TiB  4.0 TiB  4.0 TiB  233 MiB  7.6 GiB   3.0 TiB  56.86  0.84   67      up          osd.9 
16    ssd    6.98630   1.00000  7.0 TiB  4.6 TiB  4.6 TiB  909 MiB  8.8 GiB   2.4 TiB  65.38  0.97   78      up          osd.16 
24    ssd    6.98630   1.00000  7.0 TiB  4.1 TiB  4.0 TiB  7.6 MiB  9.1 GiB   2.9 TiB  58.01  0.86   74      up          osd.24 
25    ssd    6.98630   1.00000  7.0 TiB  3.9 TiB  3.9 TiB  3.0 MiB  8.3 GiB   3.0 TiB  56.48  0.83   67      up          osd.25 
26    ssd    6.98630   1.00000  7.0 TiB  3.9 TiB  3.9 TiB   14 MiB  8.7 GiB   3.1 TiB  55.53  0.82   67      up          osd.26 
29    ssd    6.98630   1.00000  7.0 TiB  5.0 TiB  4.9 TiB  688 MiB  9.8 GiB   2.0 TiB  70.95  1.05   85      up          osd.29 
30    ssd    6.98630   1.00000  7.0 TiB  4.4 TiB  4.3 TiB  1.0 GiB  8.9 GiB   2.6 TiB  62.27  0.92   77      up          osd.30 
31    ssd    6.98630   1.00000  7.0 TiB  4.5 TiB  4.5 TiB  515 MiB  9.0 GiB   2.5 TiB  64.26  0.95   79      up          osd.31 
                         TOTAL  182 TiB  123 TiB  123 TiB  7.0 GiB  249 GiB    59 TiB  67.70
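
From here, the usual next step is to find out which PGs are stuck and why they are not recovering. A minimal sketch (the PG id 8.1a is a made-up placeholder; take the real ids from the first command or from ceph health detail):

Code:
# list PGs stuck in degraded/undersized states
ceph pg dump_stuck degraded undersized

# query one of the listed PGs to see which OSDs it wants and where peering/recovery stalls
ceph pg 8.1a query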
 
Please show the output of ceph osd pool ls detail and ceph osd crush rule dump.
ceph osd pool ls detail
Code:
pool 2 'ceph-lxc' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 227165 lfor 0/136355/136651 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
    removed_snaps_queue [14f51~1,14f53~1,14f55~1,14f57~1,14f59~1,14f5b~1,14f5d~b,14f69~2,14f6f~1,14f74~1,14f76~1,14f78~1,14f7a~1,14f7c~1,14f7e~3,14f82~1,14f85~1,14f8a~2,14f8d~1,14f8f~1,14f91~1,14f93~1,14f95~1,14f99~1,14f9b~1,14f9d~1,14fa0~1,14fa2~1,14fa5~1,14fa7~1,14fa9~1,14fab~1,14fad~1,14faf~1,14fb1~1,14fb3~1,14fb5~6,14fbc~8,14fc7~1,14fcb~1,14fce~1,14fd0~1,14fd2~1,14fd5~1,14fd7~2,14fdb~1,14fde~1,14fe2~2,14fe5~1,14fe7~1,14fe9~1,14feb~1,14fed~1,14ff1~1,14ff3~1,14ff5~1,14ff7~1,14ffa~1,14ffd~1,14fff~1,15001~1,15003~1,15005~1,15007~1,15009~1,1500b~1,1500d~b,15019~3,15020~1,15023~1,15026~1,15029~2,1502d~1,1502f~2,15033~1,15035~1,15038~1,1503b~1,1503d~1,1503f~1,15041~1,15043~1,15045~1,15049~1,1504b~1,1504d~1,1504f~1,15052~1,15055~1,15057~1,15059~1,1505b~1,1505d~1,1505f~1,15061~1,15063~1]
pool 8 'cephfs-data_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode on last_change 137796 lfor 0/0/66741 flags hashpspool stripe_width 0 application cephfs
pool 9 'cephfs-data_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 159049 lfor 0/159049/159047 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 14 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 200366 lfor 0/157710/157708 flags hashpspool stripe_width 0 pg_num_min 1 application mgr,mgr_devicehealth
pool 18 'cephfs-shared_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 159227 lfor 0/159227/159225 flags hashpspool stripe_width 0 application cephfs
pool 19 'cephfs-shared_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 158692 lfor 0/158692/158690 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs

ceph osd crush rule dump
Code:
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
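
The rule is the stock replicated_rule (one copy per host), so a hedged sanity check is whether CRUSH can actually produce complete mappings for size 3 with the current tree, for example by testing the compiled map offline (the file name is an arbitrary placeholder):

Code:
# export the current CRUSH map and test rule 0 with 3 replicas
ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-bad-mappings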
 
There is another change I noticed today: a PG scrub issue.

*Until now the systems are running and responsive, but I don't think it is healthy.
Screenshot at 2024-03-06 12-28-30.png
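
Scrub warnings like this tend to pile up while PGs are degraded, because by default OSDs skip scrubbing during recovery. A sketch of how this is usually checked and, once recovery has finished, nudged along (the PG id 8.1a is again a placeholder):

Code:
# list the PGs that are behind on (deep) scrubbing
ceph health detail | grep -i scrub

# scrubs are skipped during recovery by default; confirm the setting
ceph config get osd osd_scrub_during_recovery

# after recovery completes, a deep scrub can be kicked off per PG
ceph pg deep-scrub 8.1a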
 
Update: a full shutdown of all Ceph nodes solved the issue (shutting them down one by one did not help).
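
For anyone who lands here with the same symptom: before resorting to a full cluster shutdown, it may be worth trying to re-peer the stuck PGs or bounce the involved OSDs individually. This is only a sketch of that alternative, not what was done here (PG and OSD ids are placeholders):

Code:
# ask a stuck PG to redo peering
ceph pg repeer 8.1a

# or briefly mark an involved OSD down so its PGs re-peer when it comes back
ceph osd down 17

# or restart the OSD service on the node that hosts it
systemctl restart ceph-osd@17.service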
 
