Ceph Issues After Chassis Failure

MrPaul

Had a chassis die and a couple of drives die in other chassis after a power event. I know there is going to be some data loss, but I'm trying to get the cluster into a healthy state. I've been working on this for about a week now and decided to ask for help.

I know there are PG issues with this cluster that I've inherited. That's on my to-do list as well.

Below is the information I've seen asked for.

Code:
root@proxmox-ceph-2:~# ceph status
  cluster:
    id:     f8d6430f-0df8-4ec5-b78a-d8956832b0de
    health: HEALTH_WARN
            2 pools have many more objects per pg than average
            Reduced data availability: 7 pgs inactive
            Degraded data redundancy: 124430/805041 objects degraded (15.456%), 1001 pgs degraded, 1001 pgs undersized
            1001 pgs not deep-scrubbed in time
            1001 pgs not scrubbed in time
            692 slow ops, oldest one blocked for 153766 sec, daemons [osd.26,osd.30,osd.35,osd.6] have slow ops.
 
  services:
    mon: 2 daemons, quorum proxmox-ceph-2,proxmox-ceph-3 (age 50m)
    mgr: proxmox-ceph-2(active, since 42h), standbys: proxmox-ceph-3
    mds: cephfs:1 {0=proxmox-ceph-2=up:active} 1 up:standby
    osd: 26 osds: 26 up (since 22h), 26 in (since 3d); 1201 remapped pgs
 
  data:
    pools:   4 pools, 2209 pgs
    objects: 268.35k objects, 1.0 TiB
    usage:   2.4 TiB used, 26 TiB / 28 TiB avail
    pgs:     0.317% pgs unknown
             124430/805041 objects degraded (15.456%)
             143917/805041 objects misplaced (17.877%)
             1201 active+clean+remapped
             1001 active+undersized+degraded
             7    unknown

Code:
root@proxmox-ceph-2:~# ceph health detail
HEALTH_WARN 2 pools have many more objects per pg than average; Reduced data availability: 7 pgs inactive; Degraded data redundancy: 124430/805041 objects degraded (15.456%), 1001 pgs degraded, 1001 pgs undersized; 1001 pgs not deep-scrubbed in time; 1001 pgs not scrubbed in time; 692 slow ops, oldest one blocked for 153926 sec, daemons [osd.26,osd.30,osd.35,osd.6] have slow ops.
[WRN] MANY_OBJECTS_PER_PG: 2 pools have many more objects per pg than average
    pool cephfs_data objects per pg (1747) is more than 14.438 times cluster average (121)
    pool ceph objects per pg (3273) is more than 27.0496 times cluster average (121)
[WRN] PG_AVAILABILITY: Reduced data availability: 7 pgs inactive
    pg 1.d3 is stuck inactive for 42h, current state unknown, last acting []
    pg 1.1c3 is stuck inactive for 42h, current state unknown, last acting []
    pg 1.249 is stuck inactive for 42h, current state unknown, last acting []
    pg 1.24e is stuck inactive for 42h, current state unknown, last acting []
    pg 1.2bc is stuck inactive for 42h, current state unknown, last acting []
    pg 1.5c1 is stuck inactive for 42h, current state unknown, last acting []
    pg 1.730 is stuck inactive for 42h, current state unknown, last acting []
[WRN] PG_DEGRADED: Degraded data redundancy: 124430/805041 objects degraded (15.456%), 1001 pgs degraded, 1001 pgs undersized
    pg 1.797 is active+undersized+degraded, acting [32,6]
    pg 1.798 is stuck undersized for 22h, current state active+undersized+degraded, last acting [25,35]
    pg 1.799 is stuck undersized for 22h, current state active+undersized+degraded, last acting [11,5]
    pg 1.79b is stuck undersized for 22h, current state active+undersized+degraded, last acting [10,28]
    pg 1.79c is stuck undersized for 22h, current state active+undersized+degraded, last acting [40,4]
    pg 1.79d is stuck undersized for 22h, current state active+undersized+degraded, last acting [27,10]
    pg 1.7a0 is stuck undersized for 22h, current state active+undersized+degraded, last acting [6,35]
    pg 1.7a2 is stuck undersized for 22h, current state active+undersized+degraded, last acting [36,7]
    pg 1.7a6 is stuck undersized for 22h, current state active+undersized+degraded, last acting [5,40]
    pg 1.7a8 is stuck undersized for 22h, current state active+undersized+degraded, last acting [22,37]
    pg 1.7aa is stuck undersized for 22h, current state active+undersized+degraded, last acting [7,10]
    pg 1.7ab is stuck undersized for 22h, current state active+undersized+degraded, last acting [34,6]
    pg 1.7ad is stuck undersized for 22h, current state active+undersized+degraded, last acting [26,11]
    pg 1.7b2 is stuck undersized for 22h, current state active+undersized+degraded, last acting [7,35]
    pg 1.7b4 is stuck undersized for 22h, current state active+undersized+degraded, last acting [34,25]
    pg 1.7b5 is stuck undersized for 22h, current state active+undersized+degraded, last acting [27,38]
    pg 1.7b6 is stuck undersized for 22h, current state active+undersized+degraded, last acting [39,6]
    pg 1.7b7 is stuck undersized for 22h, current state active+undersized+degraded, last acting [40,28]
    pg 1.7b8 is stuck undersized for 22h, current state active+undersized+degraded, last acting [24,32]
    pg 1.7b9 is stuck undersized for 22h, current state active+undersized+degraded, last acting [37,25]
    pg 1.7ba is stuck undersized for 22h, current state active+undersized+degraded, last acting [29,35]
    pg 1.7bb is stuck undersized for 22h, current state active+undersized+degraded, last acting [10,5]
    pg 1.7bc is stuck undersized for 22h, current state active+undersized+degraded, last acting [4,33]
    pg 1.7bd is stuck undersized for 22h, current state active+undersized+degraded, last acting [22,8]
    pg 1.7bf is stuck undersized for 22h, current state active+undersized+degraded, last acting [28,36]
    pg 1.7c1 is stuck undersized for 22h, current state active+undersized+degraded, last acting [4,34]
    pg 1.7c2 is stuck undersized for 22h, current state active+undersized+degraded, last acting [5,9]
    pg 1.7c4 is stuck undersized for 22h, current state active+undersized+degraded, last acting [24,35]
    pg 1.7c8 is stuck undersized for 22h, current state active+undersized+degraded, last acting [39,26]
    pg 1.7c9 is stuck undersized for 22h, current state active+undersized+degraded, last acting [30,32]
    pg 1.7cb is stuck undersized for 22h, current state active+undersized+degraded, last acting [23,38]
    pg 1.7cd is stuck undersized for 22h, current state active+undersized+degraded, last acting [5,35]
    pg 1.7d1 is stuck undersized for 22h, current state active+undersized+degraded, last acting [34,30]
    pg 1.7d2 is stuck undersized for 22h, current state active+undersized+degraded, last acting [11,27]
    pg 1.7d3 is stuck undersized for 22h, current state active+undersized+degraded, last acting [34,27]
    pg 1.7dc is stuck undersized for 22h, current state active+undersized+degraded, last acting [22,34]
    pg 1.7e2 is stuck undersized for 22h, current state active+undersized+degraded, last acting [27,35]
    pg 1.7e5 is stuck undersized for 22h, current state active+undersized+degraded, last acting [23,35]
    pg 1.7e7 is stuck undersized for 22h, current state active+undersized+degraded, last acting [30,8]
    pg 1.7e8 is stuck undersized for 22h, current state active+undersized+degraded, last acting [7,32]
    pg 1.7ea is stuck undersized for 22h, current state active+undersized+degraded, last acting [8,27]
    pg 1.7eb is stuck undersized for 22h, current state active+undersized+degraded, last acting [38,22]
    pg 1.7ee is stuck undersized for 22h, current state active+undersized+degraded, last acting [29,40]
    pg 1.7ef is stuck undersized for 22h, current state active+undersized+degraded, last acting [32,29]
    pg 1.7f1 is stuck undersized for 22h, current state active+undersized+degraded, last acting [6,37]
    pg 1.7f2 is stuck undersized for 22h, current state active+undersized+degraded, last acting [37,29]
    pg 1.7f3 is stuck undersized for 22h, current state active+undersized+degraded, last acting [10,30]
    pg 1.7f4 is stuck undersized for 22h, current state active+undersized+degraded, last acting [37,22]
    pg 1.7f6 is stuck undersized for 22h, current state active+undersized+degraded, last acting [39,27]
    pg 1.7f7 is stuck undersized for 22h, current state active+undersized+degraded, last acting [39,7]
    pg 1.7f8 is stuck undersized for 22h, current state active+undersized+degraded, last acting [40,26]
[WRN] PG_NOT_DEEP_SCRUBBED: 1001 pgs not deep-scrubbed in time
    pg 1.7f8 not deep-scrubbed since 2020-06-18T06:27:18.221763-0500
    pg 1.7f7 not deep-scrubbed since 2020-06-12T05:55:27.154339-0500
    pg 1.7f6 not deep-scrubbed since 2020-06-15T18:18:36.467503-0500
    pg 1.7f4 not deep-scrubbed since 2020-06-15T19:31:29.997456-0500
    pg 1.7f3 not deep-scrubbed since 2020-06-14T00:05:15.580003-0500
    pg 1.7f2 not deep-scrubbed since 2020-06-13T19:38:04.592250-0500
    pg 1.7f1 not deep-scrubbed since 2020-06-14T06:27:19.401836-0500
    pg 1.7ef not deep-scrubbed since 2020-06-15T11:56:13.523007-0500
    pg 1.7ee not deep-scrubbed since 2020-06-18T10:15:33.258917-0500
    pg 1.7eb not deep-scrubbed since 2020-06-14T04:56:18.927258-0500
    pg 1.7ea not deep-scrubbed since 2020-06-18T14:57:40.566479-0500
    pg 1.7e8 not deep-scrubbed since 2020-06-17T10:52:01.138073-0500
    pg 1.7e7 not deep-scrubbed since 2020-06-16T15:58:05.688546-0500
    pg 1.7e5 not deep-scrubbed since 2020-06-12T11:56:45.772138-0500
    pg 1.7e2 not deep-scrubbed since 2020-06-18T08:05:12.498183-0500
    pg 1.7dc not deep-scrubbed since 2020-06-12T09:16:45.867627-0500
    pg 1.7d3 not deep-scrubbed since 2020-06-17T21:41:50.255727-0500
    pg 1.7d2 not deep-scrubbed since 2020-06-18T07:22:52.704067-0500
    pg 1.7d1 not deep-scrubbed since 2020-06-15T12:32:07.612190-0500
    pg 1.7cd not deep-scrubbed since 2020-06-16T14:06:21.349030-0500
    pg 1.7cb not deep-scrubbed since 2020-06-12T17:09:57.005794-0500
    pg 1.7c9 not deep-scrubbed since 2020-06-17T09:42:26.713244-0500
    pg 1.7c8 not deep-scrubbed since 2020-06-17T15:32:23.314540-0500
    pg 1.7c4 not deep-scrubbed since 2020-06-14T22:53:29.435341-0500
    pg 1.7c2 not deep-scrubbed since 2020-06-16T23:33:29.212014-0500
    pg 1.7c1 not deep-scrubbed since 2020-06-13T23:20:02.232378-0500
    pg 1.7bf not deep-scrubbed since 2020-06-15T07:01:12.117779-0500
    pg 1.7bd not deep-scrubbed since 2020-06-17T12:54:49.424101-0500
    pg 1.7bc not deep-scrubbed since 2020-06-17T18:21:32.053083-0500
    pg 1.7bb not deep-scrubbed since 2020-06-17T00:30:37.529580-0500
    pg 1.7ba not deep-scrubbed since 2020-06-18T16:56:09.675439-0500
    pg 1.7b9 not deep-scrubbed since 2020-06-18T23:36:13.132979-0500
    pg 1.7b8 not deep-scrubbed since 2020-06-18T05:58:53.581638-0500
    pg 1.7b7 not deep-scrubbed since 2020-06-15T09:47:36.679832-0500
    pg 1.7b6 not deep-scrubbed since 2020-06-13T18:54:43.934220-0500
    pg 1.7b5 not deep-scrubbed since 2020-06-18T13:33:23.266822-0500
    pg 1.7b4 not deep-scrubbed since 2020-06-13T21:44:46.624773-0500
    pg 1.7b2 not deep-scrubbed since 2020-06-18T14:41:59.387378-0500
    pg 1.7ad not deep-scrubbed since 2020-06-18T08:24:10.388516-0500
    pg 1.7ab not deep-scrubbed since 2020-06-17T14:03:24.854422-0500
    pg 1.7aa not deep-scrubbed since 2020-06-13T09:22:39.382439-0500
    pg 1.7a8 not deep-scrubbed since 2020-06-13T07:51:28.900820-0500
    pg 1.7a6 not deep-scrubbed since 2020-06-13T15:11:47.365532-0500
    pg 1.7a2 not deep-scrubbed since 2020-06-14T03:24:47.873247-0500
    pg 1.7a0 not deep-scrubbed since 2020-06-15T20:47:16.885139-0500
    pg 1.79d not deep-scrubbed since 2020-06-18T06:30:04.176538-0500
    pg 1.79c not deep-scrubbed since 2020-06-13T19:40:57.498208-0500
    pg 1.79b not deep-scrubbed since 2020-06-13T01:18:38.103653-0500
    pg 1.799 not deep-scrubbed since 2020-06-17T23:59:10.550439-0500
    pg 1.798 not deep-scrubbed since 2020-06-15T01:09:53.154938-0500
    951 more pgs...
[WRN] PG_NOT_SCRUBBED: 1001 pgs not scrubbed in time
    pg 1.7f8 not scrubbed since 2020-06-18T06:27:18.221763-0500
    pg 1.7f7 not scrubbed since 2020-06-18T23:37:00.430759-0500
    pg 1.7f6 not scrubbed since 2020-06-18T01:02:50.801081-0500
    pg 1.7f4 not scrubbed since 2020-06-18T09:39:43.019677-0500
    pg 1.7f3 not scrubbed since 2020-06-18T19:48:43.141276-0500
    pg 1.7f2 not scrubbed since 2020-06-18T12:46:59.143593-0500
    pg 1.7f1 not scrubbed since 2020-06-18T09:08:01.812785-0500
    pg 1.7ef not scrubbed since 2020-06-18T08:16:33.615415-0500
    pg 1.7ee not scrubbed since 2020-06-18T10:15:33.258917-0500
    pg 1.7eb not scrubbed since 2020-06-18T03:09:01.301923-0500
    pg 1.7ea not scrubbed since 2020-06-18T14:57:40.566479-0500
    pg 1.7e8 not scrubbed since 2020-06-18T11:59:39.329315-0500
    pg 1.7e7 not scrubbed since 2020-06-18T20:03:33.459059-0500
    pg 1.7e5 not scrubbed since 2020-06-18T15:28:43.263333-0500
    pg 1.7e2 not scrubbed since 2020-06-18T08:05:12.498183-0500
    pg 1.7dc not scrubbed since 2020-06-18T10:26:21.759761-0500
    pg 1.7d3 not scrubbed since 2020-06-19T00:23:16.679908-0500
    pg 1.7d2 not scrubbed since 2020-06-18T07:22:52.704067-0500
    pg 1.7d1 not scrubbed since 2020-06-18T18:06:00.247136-0500
    pg 1.7cd not scrubbed since 2020-06-17T18:48:26.912212-0500
    pg 1.7cb not scrubbed since 2020-06-18T21:52:01.078062-0500
    pg 1.7c9 not scrubbed since 2020-06-18T11:41:02.271054-0500
    pg 1.7c8 not scrubbed since 2020-06-18T19:56:49.521473-0500
    pg 1.7c4 not scrubbed since 2020-06-18T14:54:36.343759-0500
    pg 1.7c2 not scrubbed since 2020-06-18T06:23:05.699025-0500
    pg 1.7c1 not scrubbed since 2020-06-18T20:55:25.407265-0500
    pg 1.7bf not scrubbed since 2020-06-17T21:07:41.596595-0500
    pg 1.7bd not scrubbed since 2020-06-18T14:06:23.355490-0500
    pg 1.7bc not scrubbed since 2020-06-17T18:21:32.053083-0500
    pg 1.7bb not scrubbed since 2020-06-18T09:09:54.970840-0500
    pg 1.7ba not scrubbed since 2020-06-18T16:56:09.675439-0500
    pg 1.7b9 not scrubbed since 2020-06-18T23:36:13.132979-0500
    pg 1.7b8 not scrubbed since 2020-06-18T05:58:53.581638-0500
    pg 1.7b7 not scrubbed since 2020-06-18T06:55:11.771207-0500
    pg 1.7b6 not scrubbed since 2020-06-18T19:48:35.862303-0500
    pg 1.7b5 not scrubbed since 2020-06-18T13:33:23.266822-0500
    pg 1.7b4 not scrubbed since 2020-06-19T00:28:27.512317-0500
    pg 1.7b2 not scrubbed since 2020-06-18T14:41:59.387378-0500
    pg 1.7ad not scrubbed since 2020-06-18T08:24:10.388516-0500
    pg 1.7ab not scrubbed since 2020-06-18T14:55:31.056745-0500
    pg 1.7aa not scrubbed since 2020-06-18T17:34:13.124195-0500
    pg 1.7a8 not scrubbed since 2020-06-18T08:50:54.375698-0500
    pg 1.7a6 not scrubbed since 2020-06-18T20:25:04.720733-0500
    pg 1.7a2 not scrubbed since 2020-06-18T16:36:02.051328-0500
    pg 1.7a0 not scrubbed since 2020-06-18T08:54:00.614194-0500
    pg 1.79d not scrubbed since 2020-06-18T06:30:04.176538-0500
    pg 1.79c not scrubbed since 2020-06-18T13:06:36.092813-0500
    pg 1.79b not scrubbed since 2020-06-17T22:35:03.257797-0500
    pg 1.799 not scrubbed since 2020-06-17T23:59:10.550439-0500
    pg 1.798 not scrubbed since 2020-06-18T20:07:32.701492-0500
    951 more pgs...
[WRN] SLOW_OPS: 692 slow ops, oldest one blocked for 153926 sec, daemons [osd.26,osd.30,osd.35,osd.6] have slow ops.

Oddly, when I try to delete some of the lost PGs, it tells me they don't exist.
Code:
root@proxmox-ceph-2:~# ceph pg dump_stuck | grep unknown
ok
1.d3                        unknown       []          -1       []              -1
1.249                       unknown       []          -1       []              -1
1.24e                       unknown       []          -1       []              -1
1.5c1                       unknown       []          -1       []              -1
1.730                       unknown       []          -1       []              -1
1.2bc                       unknown       []          -1       []              -1
1.1c3                       unknown       []          -1       []              -1

root@proxmox-ceph-2:~# ceph pg 1.d3 mark_unfound_lost delete
Error ENOENT: i don't have pgid 1.d3
 
PGs that are inactive are not good. Just marking them as unfound can result in actual data loss.

Do you know which pool is pool #1?
ceph osd pool ls detail will list the pools with their numerical IDs.
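
For reference, the number before the dot in a PG ID (the 1 in 1.d3) is the pool ID, so those inactive PGs should all belong to whichever pool shows up as pool 1. A quick way to check:

Code:
ceph osd lspools    # lists pool IDs and names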

Code:
            692 slow ops, oldest one blocked for 153766 sec, daemons [osd.26,osd.30,osd.35,osd.6] have slow ops.
Have you tried to restart these OSD services?
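
On a Proxmox node the OSDs are systemd units, so a restart would look roughly like this (using the OSD IDs from the slow-ops warning; a sketch, not output from this cluster):

Code:
systemctl restart ceph-osd@26.service
systemctl restart ceph-osd@30.service
systemctl restart ceph-osd@35.service
systemctl restart ceph-osd@6.service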

Did you remove the failed OSDs from the cluster already? Because it reports all 26 OSDs as up and in.
 
I'm not concerned with data loss at this point as I don't think the data exists to be recovered.

Code:
root@proxmox-ceph-2:~# ceph osd pool ls detail
pool 1 'ceph' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 2048 pgp_num 2048 pg_num_target 64 pgp_num_target 64 autoscale_mode on last_change 5981 lfor 0/0/2015 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
        removed_snaps_queue [11~2,14~2,18~3]
pool 2 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 pg_num_target 32 pgp_num_target 32 autoscale_mode on last_change 4708 flags hashpspool stripe_width 0 application cephfs
pool 3 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 4698 flags hashpspool stripe_width 0 application cephfs
pool 4 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 6000 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth

I know the PG settings are way off; once the cluster is in a healthy state I'll be turning on the autoscaler.
 
Okay. If the `mark_unfound_lost delete` fails, you can try to restart your MGRs and then try it again. That has seemed to help in the past.
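
If you'd rather do that from the shell than the web UI, something along these lines should work (the unit names assume the MGRs are named after the hosts, as in your ceph status output):

Code:
systemctl restart ceph-mgr@proxmox-ceph-2.service    # run on proxmox-ceph-2
systemctl restart ceph-mgr@proxmox-ceph-3.service    # run on proxmox-ceph-3
ceph pg 1.d3 mark_unfound_lost delete                # then retry the delete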
 
I've restarted both manager services via the WebUI but I'm still unable to mark these as deleted.
 
Hmm, have you tried to reboot the nodes at some point? Then the services would definitely get a fresh start.
 
These nodes have been rebooted multiple times.
I was afraid you would say that ;)

Can you post some more information about the cluster please?
  • ceph osd df tree
  • ceph pg 1.d3 query
 
Here is the output you requested.

Code:
root@proxmox-ceph-2:~# ceph osd df tree
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META      AVAIL     %USE  VAR   PGS  STATUS  TYPE NAME             
-1         28.36053         -   28 TiB  2.4 TiB  2.4 TiB  121 MiB    26 GiB    26 TiB  8.48  1.00    -          root default           
-5         14.18027         -   14 TiB  1.2 TiB  1.2 TiB   59 MiB    13 GiB    13 TiB  8.57  1.01    -              host proxmox-ceph-2
 4    hdd   1.09079   1.00000  1.1 TiB  105 GiB  104 GiB  5.7 MiB  1018 MiB  1012 GiB  9.40  1.11  242      up          osd.4         
 5    hdd   1.09079   1.00000  1.1 TiB   84 GiB   83 GiB  3.6 MiB  1020 MiB   1.0 TiB  7.48  0.88  200      up          osd.5         
 6    hdd   1.09079   1.00000  1.1 TiB   97 GiB   96 GiB  4.5 MiB  1020 MiB  1020 GiB  8.68  1.02  212      up          osd.6         
 7    hdd   1.09079   1.00000  1.1 TiB   98 GiB   97 GiB  4.8 MiB  1019 MiB  1019 GiB  8.77  1.03  226      up          osd.7         
22    hdd   1.09079   1.00000  1.1 TiB  109 GiB  108 GiB  3.8 MiB  1020 MiB  1008 GiB  9.80  1.16  244      up          osd.22         
23    hdd   1.09079   1.00000  1.1 TiB  104 GiB  103 GiB  6.0 MiB  1018 MiB  1013 GiB  9.32  1.10  231      up          osd.23         
24    hdd   1.09079   1.00000  1.1 TiB   93 GiB   92 GiB  4.4 MiB  1020 MiB   1.0 TiB  8.32  0.98  204      up          osd.24         
25    hdd   1.09079   1.00000  1.1 TiB   95 GiB   94 GiB  5.0 MiB  1019 MiB  1022 GiB  8.49  1.00  221      up          osd.25         
26    hdd   1.09079   1.00000  1.1 TiB   92 GiB   91 GiB  4.4 MiB  1020 MiB   1.0 TiB  8.21  0.97  213      up          osd.26         
27    hdd   1.09079   1.00000  1.1 TiB  102 GiB  101 GiB  5.1 MiB  1019 MiB  1015 GiB  9.16  1.08  233      up          osd.27         
28    hdd   1.09079   1.00000  1.1 TiB   84 GiB   83 GiB  3.7 MiB  1020 MiB   1.0 TiB  7.55  0.89  184      up          osd.28         
29    hdd   1.09079   1.00000  1.1 TiB   90 GiB   89 GiB  3.6 MiB  1020 MiB   1.0 TiB  8.10  0.95  191      up          osd.29         
30    hdd   1.09079   1.00000  1.1 TiB   91 GiB   90 GiB  4.6 MiB  1019 MiB   1.0 TiB  8.15  0.96  219      up          osd.30         
-7         14.18027         -   14 TiB  1.2 TiB  1.2 TiB   62 MiB    13 GiB    13 TiB  8.39  0.99    -              host proxmox-ceph-3
 8    hdd   1.09079   1.00000  1.1 TiB  101 GiB  100 GiB  5.0 MiB  1019 MiB  1016 GiB  9.07  1.07  224      up          osd.8         
 9    hdd   1.09079   1.00000  1.1 TiB   86 GiB   85 GiB  4.2 MiB  1020 MiB   1.0 TiB  7.73  0.91  214      up          osd.9         
10    hdd   1.09079   1.00000  1.1 TiB   98 GiB   97 GiB  5.0 MiB  1019 MiB  1019 GiB  8.77  1.03  229      up          osd.10         
11    hdd   1.09079   1.00000  1.1 TiB   96 GiB   95 GiB  5.0 MiB  1019 MiB  1021 GiB  8.59  1.01  225      up          osd.11         
32    hdd   1.09079   1.00000  1.1 TiB  102 GiB  101 GiB  5.1 MiB  1019 MiB  1015 GiB  9.12  1.07  214      up          osd.32         
33    hdd   1.09079   1.00000  1.1 TiB   86 GiB   85 GiB  4.2 MiB  1020 MiB   1.0 TiB  7.71  0.91  205      up          osd.33         
34    hdd   1.09079   1.00000  1.1 TiB   93 GiB   92 GiB  4.7 MiB  1019 MiB   1.0 TiB  8.32  0.98  199      up          osd.34         
35    hdd   1.09079   1.00000  1.1 TiB   87 GiB   86 GiB  4.2 MiB  1020 MiB   1.0 TiB  7.80  0.92  201      up          osd.35         
36    hdd   1.09079   1.00000  1.1 TiB   97 GiB   96 GiB  5.2 MiB  1019 MiB  1020 GiB  8.69  1.02  223      up          osd.36         
37    hdd   1.09079   1.00000  1.1 TiB  106 GiB  105 GiB  5.8 MiB  1018 MiB  1011 GiB  9.46  1.12  240      up          osd.37         
38    hdd   1.09079   1.00000  1.1 TiB   80 GiB   79 GiB  3.7 MiB  1020 MiB   1.0 TiB  7.13  0.84  183      up          osd.38         
39    hdd   1.09079   1.00000  1.1 TiB   96 GiB   95 GiB  4.9 MiB  1019 MiB  1021 GiB  8.63  1.02  213      up          osd.39         
40    hdd   1.09079   1.00000  1.1 TiB   91 GiB   90 GiB  4.9 MiB  1019 MiB   1.0 TiB  8.11  0.96  215      up          osd.40         
                        TOTAL   28 TiB  2.4 TiB  2.4 TiB  121 MiB    26 GiB    26 TiB  8.48

The query command seems to be timing out; it has been sitting at the prompt for a while without returning anything.

Since I don't care about data loss I may just wipe these and redeploy from scratch in the next day or so unless you see something you want me to try. At this point this is purely educational.
 
Hmm, okay, so just two nodes in the cluster have OSDs? That is not a recommended setup and the cluster might behave unexpectedly in some ways. Maybe this situation is part of it.

AFAICT the size parameter for the pools is 3. If you haven't touched the rule, then the cluster is one node short.

Try setting the "size" for the pools to 2 as well.
The OSDs also have a lot of PGs according to the ceph osd df tree output. Try setting the pg_num of the "ceph" pool lower, to 1024 or 512; the number of PGs per OSD should be around 100, not 200.

You can change both in the GUI when you edit the pool.
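
If you prefer the CLI over the GUI, the equivalent would be roughly the following (pool names taken from your ceph osd pool ls detail output; use whatever pg_num target you settle on):

Code:
# drop the replica count to match the two remaining OSD nodes
ceph osd pool set ceph size 2
ceph osd pool set cephfs_data size 2
ceph osd pool set cephfs_metadata size 2
ceph osd pool set device_health_metrics size 2
# shrink the big RBD pool; pgp_num should follow pg_num
ceph osd pool set ceph pg_num 1024
ceph osd pool set ceph pgp_num 1024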

Code:
pg 1.7f7 not scrubbed since 2020-06-18T23:37:00.430759-0500
Unless the date in the cluster is totally off, this is not good. The OSDs should be able to scrub the data regularly to make sure there was no bit flip / corruption :)
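
Once things are healthy again you can also kick scrubs off by hand to catch up, for example (any PG from the warning list; just a suggestion):

Code:
ceph pg scrub 1.7f7        # regular scrub of a single PG
ceph pg deep-scrub 1.7f7   # deep scrub, reads and checksums all objects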
 
Yeah, there is some history here. The ceph-1 node did fail (probably in 2020, as those scrub dates indicate) and I've removed it, with eventual plans to reinstall the OS and put it back into the cluster.

After reducing the pool size to match the number of nodes, things seem to be better.

Code:
root@proxmox-ceph-2:~# ceph status
  cluster:
    id:     f8d6430f-0df8-4ec5-b78a-d8956832b0de
    health: HEALTH_WARN
            2 pools have many more objects per pg than average
            1 MDSs report slow metadata IOs
            Reduced data availability: 7 pgs inactive
            770 slow ops, oldest one blocked for 266703 sec, daemons [osd.26,osd.30,osd.35,osd.6] have slow ops.
 
  services:
    mon: 2 daemons, quorum proxmox-ceph-2,proxmox-ceph-3 (age 109m)
    mgr: proxmox-ceph-2(active, since 109m), standbys: proxmox-ceph-3
    mds: cephfs:1 {0=proxmox-ceph-2=up:active} 1 up:standby
    osd: 26 osds: 26 up (since 2d), 26 in (since 5d)
 
  data:
    pools:   4 pools, 2209 pgs
    objects: 264.25k objects, 1.0 TiB
    usage:   1.9 TiB used, 26 TiB / 28 TiB avail
    pgs:     0.317% pgs unknown
             2202 active+clean
             7    unknown
 
  progress:
    PG autoscaler decreasing pool 2 PGs from 128 to 32 (107m)
      [............................]
    PG autoscaler decreasing pool 1 PGs from 2048 to 64 (107m)
      [............................]

Going to let the autoscaler run for a while to see how things work out.
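
To keep an eye on its progress I can check ceph osd pool autoscale-status, which should show the current and target PG counts per pool:

Code:
ceph osd pool autoscale-status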

Still cannot delete the 7 unknown PGs.
 
The ceph status hasn't changed since my last note. I think I'll just fully delete this and start over. I do appreciate your time.
 
Understandable. :)
 
