After a power event, one chassis died and a couple of drives died in other chassis. I know there is going to be some data loss, but I'm trying to get the cluster back into a healthy state. I've been working on this for about a week now and decided to ask for help.
I know there are PG issues with this cluster, which I inherited. Those are on my to-do list as well.
Below is the output I've seen people ask for.
Code:
root@proxmox-ceph-2:~# ceph status
  cluster:
    id:     f8d6430f-0df8-4ec5-b78a-d8956832b0de
    health: HEALTH_WARN
            2 pools have many more objects per pg than average
            Reduced data availability: 7 pgs inactive
            Degraded data redundancy: 124430/805041 objects degraded (15.456%), 1001 pgs degraded, 1001 pgs undersized
            1001 pgs not deep-scrubbed in time
            1001 pgs not scrubbed in time
            692 slow ops, oldest one blocked for 153766 sec, daemons [osd.26,osd.30,osd.35,osd.6] have slow ops.
  services:
    mon: 2 daemons, quorum proxmox-ceph-2,proxmox-ceph-3 (age 50m)
    mgr: proxmox-ceph-2(active, since 42h), standbys: proxmox-ceph-3
    mds: cephfs:1 {0=proxmox-ceph-2=up:active} 1 up:standby
    osd: 26 osds: 26 up (since 22h), 26 in (since 3d); 1201 remapped pgs
  data:
    pools:   4 pools, 2209 pgs
    objects: 268.35k objects, 1.0 TiB
    usage:   2.4 TiB used, 26 TiB / 28 TiB avail
    pgs:     0.317% pgs unknown
             124430/805041 objects degraded (15.456%)
             143917/805041 objects misplaced (17.877%)
             1201 active+clean+remapped
             1001 active+undersized+degraded
             7    unknown
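For anyone double-checking the numbers: the `ceph status` figures are internally consistent. The three PG state counts add up to the 2209 PGs, the percentages follow from the raw counts, and 805041 object copies over roughly 268.35k objects works out to about 3 copies each. I take that to mean the pools are size 3, but that is my inference, not something the output states. The oldest slow op, at roughly 42.7 hours, also lines up with the 42h the inactive PGs have been stuck. A quick Python sketch with the values copied by hand (the variable names are just mine):

```python
# Numbers copied by hand from the `ceph status` output above.
total_pgs = 2209
pg_states = {
    "active+clean+remapped": 1201,
    "active+undersized+degraded": 1001,
    "unknown": 7,
}
object_copies = 805041       # denominator of the degraded/misplaced ratios
degraded_copies = 124430
misplaced_copies = 143917
logical_objects_k = 268.35   # "objects: 268.35k objects" (already rounded)
oldest_slow_op_sec = 153766


def pct(part, whole):
    """Percentage rounded to three decimals, like ceph status prints it."""
    return round(100 * part / whole, 3)


# Every PG should be counted in exactly one of the listed states.
state_sum = sum(pg_states.values())

unknown_pct = pct(pg_states["unknown"], total_pgs)
degraded_pct = pct(degraded_copies, object_copies)
misplaced_pct = pct(misplaced_copies, object_copies)

# Copies per logical object; a value near 3 is consistent with size=3 pools
# (an inference, ceph status does not report the replica count here).
copies_per_object = object_copies / (logical_objects_k * 1000)

# Age of the oldest slow op in hours.
slow_op_hours = oldest_slow_op_sec / 3600

print(f"PG states sum to {state_sum} of {total_pgs}")
print(f"unknown {unknown_pct}%, degraded {degraded_pct}%, misplaced {misplaced_pct}%")
print(f"~{copies_per_object:.2f} copies per object")
print(f"oldest slow op ~{slow_op_hours:.1f} h")
```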
Code:
root@proxmox-ceph-2:~# ceph health detail
HEALTH_WARN 2 pools have many more objects per pg than average; Reduced data availability: 7 pgs inactive; Degraded data redundancy: 124430/805041 objects degraded (15.456%), 1001 pgs degraded, 1001 pgs undersized; 1001 pgs not deep-scrubbed in time; 1001 pgs not scrubbed in time; 692 slow ops, oldest one blocked for 153926 sec, daemons [osd.26,osd.30,osd.35,osd.6] have slow ops.
[WRN] MANY_OBJECTS_PER_PG: 2 pools have many more objects per pg than average
pool cephfs_data objects per pg (1747) is more than 14.438 times cluster average (121)
pool ceph objects per pg (3273) is more than 27.0496 times cluster average (121)
[WRN] PG_AVAILABILITY: Reduced data availability: 7 pgs inactive
pg 1.d3 is stuck inactive for 42h, current state unknown, last acting []
pg 1.1c3 is stuck inactive for 42h, current state unknown, last acting []
pg 1.249 is stuck inactive for 42h, current state unknown, last acting []
pg 1.24e is stuck inactive for 42h, current state unknown, last acting []
pg 1.2bc is stuck inactive for 42h, current state unknown, last acting []
pg 1.5c1 is stuck inactive for 42h, current state unknown, last acting []
pg 1.730 is stuck inactive for 42h, current state unknown, last acting []
[WRN] PG_DEGRADED: Degraded data redundancy: 124430/805041 objects degraded (15.456%), 1001 pgs degraded, 1001 pgs undersized
pg 1.797 is active+undersized+degraded, acting [32,6]
pg 1.798 is stuck undersized for 22h, current state active+undersized+degraded, last acting [25,35]
pg 1.799 is stuck undersized for 22h, current state active+undersized+degraded, last acting [11,5]
pg 1.79b is stuck undersized for 22h, current state active+undersized+degraded, last acting [10,28]
pg 1.79c is stuck undersized for 22h, current state active+undersized+degraded, last acting [40,4]
pg 1.79d is stuck undersized for 22h, current state active+undersized+degraded, last acting [27,10]
pg 1.7a0 is stuck undersized for 22h, current state active+undersized+degraded, last acting [6,35]
pg 1.7a2 is stuck undersized for 22h, current state active+undersized+degraded, last acting [36,7]
pg 1.7a6 is stuck undersized for 22h, current state active+undersized+degraded, last acting [5,40]
pg 1.7a8 is stuck undersized for 22h, current state active+undersized+degraded, last acting [22,37]
pg 1.7aa is stuck undersized for 22h, current state active+undersized+degraded, last acting [7,10]
pg 1.7ab is stuck undersized for 22h, current state active+undersized+degraded, last acting [34,6]
pg 1.7ad is stuck undersized for 22h, current state active+undersized+degraded, last acting [26,11]
pg 1.7b2 is stuck undersized for 22h, current state active+undersized+degraded, last acting [7,35]
pg 1.7b4 is stuck undersized for 22h, current state active+undersized+degraded, last acting [34,25]
pg 1.7b5 is stuck undersized for 22h, current state active+undersized+degraded, last acting [27,38]
pg 1.7b6 is stuck undersized for 22h, current state active+undersized+degraded, last acting [39,6]
pg 1.7b7 is stuck undersized for 22h, current state active+undersized+degraded, last acting [40,28]
pg 1.7b8 is stuck undersized for 22h, current state active+undersized+degraded, last acting [24,32]
pg 1.7b9 is stuck undersized for 22h, current state active+undersized+degraded, last acting [37,25]
pg 1.7ba is stuck undersized for 22h, current state active+undersized+degraded, last acting [29,35]
pg 1.7bb is stuck undersized for 22h, current state active+undersized+degraded, last acting [10,5]
pg 1.7bc is stuck undersized for 22h, current state active+undersized+degraded, last acting [4,33]
pg 1.7bd is stuck undersized for 22h, current state active+undersized+degraded, last acting [22,8]
pg 1.7bf is stuck undersized for 22h, current state active+undersized+degraded, last acting [28,36]
pg 1.7c1 is stuck undersized for 22h, current state active+undersized+degraded, last acting [4,34]
pg 1.7c2 is stuck undersized for 22h, current state active+undersized+degraded, last acting [5,9]
pg 1.7c4 is stuck undersized for 22h, current state active+undersized+degraded, last acting [24,35]
pg 1.7c8 is stuck undersized for 22h, current state active+undersized+degraded, last acting [39,26]
pg 1.7c9 is stuck undersized for 22h, current state active+undersized+degraded, last acting [30,32]
pg 1.7cb is stuck undersized for 22h, current state active+undersized+degraded, last acting [23,38]
pg 1.7cd is stuck undersized for 22h, current state active+undersized+degraded, last acting [5,35]
pg 1.7d1 is stuck undersized for 22h, current state active+undersized+degraded, last acting [34,30]
pg 1.7d2 is stuck undersized for 22h, current state active+undersized+degraded, last acting [11,27]
pg 1.7d3 is stuck undersized for 22h, current state active+undersized+degraded, last acting [34,27]
pg 1.7dc is stuck undersized for 22h, current state active+undersized+degraded, last acting [22,34]
pg 1.7e2 is stuck undersized for 22h, current state active+undersized+degraded, last acting [27,35]
pg 1.7e5 is stuck undersized for 22h, current state active+undersized+degraded, last acting [23,35]
pg 1.7e7 is stuck undersized for 22h, current state active+undersized+degraded, last acting [30,8]
pg 1.7e8 is stuck undersized for 22h, current state active+undersized+degraded, last acting [7,32]
pg 1.7ea is stuck undersized for 22h, current state active+undersized+degraded, last acting [8,27]
pg 1.7eb is stuck undersized for 22h, current state active+undersized+degraded, last acting [38,22]
pg 1.7ee is stuck undersized for 22h, current state active+undersized+degraded, last acting [29,40]
pg 1.7ef is stuck undersized for 22h, current state active+undersized+degraded, last acting [32,29]
pg 1.7f1 is stuck undersized for 22h, current state active+undersized+degraded, last acting [6,37]
pg 1.7f2 is stuck undersized for 22h, current state active+undersized+degraded, last acting [37,29]
pg 1.7f3 is stuck undersized for 22h, current state active+undersized+degraded, last acting [10,30]
pg 1.7f4 is stuck undersized for 22h, current state active+undersized+degraded, last acting [37,22]
pg 1.7f6 is stuck undersized for 22h, current state active+undersized+degraded, last acting [39,27]
pg 1.7f7 is stuck undersized for 22h, current state active+undersized+degraded, last acting [39,7]
pg 1.7f8 is stuck undersized for 22h, current state active+undersized+degraded, last acting [40,26]
[WRN] PG_NOT_DEEP_SCRUBBED: 1001 pgs not deep-scrubbed in time
pg 1.7f8 not deep-scrubbed since 2020-06-18T06:27:18.221763-0500
pg 1.7f7 not deep-scrubbed since 2020-06-12T05:55:27.154339-0500
pg 1.7f6 not deep-scrubbed since 2020-06-15T18:18:36.467503-0500
pg 1.7f4 not deep-scrubbed since 2020-06-15T19:31:29.997456-0500
pg 1.7f3 not deep-scrubbed since 2020-06-14T00:05:15.580003-0500
pg 1.7f2 not deep-scrubbed since 2020-06-13T19:38:04.592250-0500
pg 1.7f1 not deep-scrubbed since 2020-06-14T06:27:19.401836-0500
pg 1.7ef not deep-scrubbed since 2020-06-15T11:56:13.523007-0500
pg 1.7ee not deep-scrubbed since 2020-06-18T10:15:33.258917-0500
pg 1.7eb not deep-scrubbed since 2020-06-14T04:56:18.927258-0500
pg 1.7ea not deep-scrubbed since 2020-06-18T14:57:40.566479-0500
pg 1.7e8 not deep-scrubbed since 2020-06-17T10:52:01.138073-0500
pg 1.7e7 not deep-scrubbed since 2020-06-16T15:58:05.688546-0500
pg 1.7e5 not deep-scrubbed since 2020-06-12T11:56:45.772138-0500
pg 1.7e2 not deep-scrubbed since 2020-06-18T08:05:12.498183-0500
pg 1.7dc not deep-scrubbed since 2020-06-12T09:16:45.867627-0500
pg 1.7d3 not deep-scrubbed since 2020-06-17T21:41:50.255727-0500
pg 1.7d2 not deep-scrubbed since 2020-06-18T07:22:52.704067-0500
pg 1.7d1 not deep-scrubbed since 2020-06-15T12:32:07.612190-0500
pg 1.7cd not deep-scrubbed since 2020-06-16T14:06:21.349030-0500
pg 1.7cb not deep-scrubbed since 2020-06-12T17:09:57.005794-0500
pg 1.7c9 not deep-scrubbed since 2020-06-17T09:42:26.713244-0500
pg 1.7c8 not deep-scrubbed since 2020-06-17T15:32:23.314540-0500
pg 1.7c4 not deep-scrubbed since 2020-06-14T22:53:29.435341-0500
pg 1.7c2 not deep-scrubbed since 2020-06-16T23:33:29.212014-0500
pg 1.7c1 not deep-scrubbed since 2020-06-13T23:20:02.232378-0500
pg 1.7bf not deep-scrubbed since 2020-06-15T07:01:12.117779-0500
pg 1.7bd not deep-scrubbed since 2020-06-17T12:54:49.424101-0500
pg 1.7bc not deep-scrubbed since 2020-06-17T18:21:32.053083-0500
pg 1.7bb not deep-scrubbed since 2020-06-17T00:30:37.529580-0500
pg 1.7ba not deep-scrubbed since 2020-06-18T16:56:09.675439-0500
pg 1.7b9 not deep-scrubbed since 2020-06-18T23:36:13.132979-0500
pg 1.7b8 not deep-scrubbed since 2020-06-18T05:58:53.581638-0500
pg 1.7b7 not deep-scrubbed since 2020-06-15T09:47:36.679832-0500
pg 1.7b6 not deep-scrubbed since 2020-06-13T18:54:43.934220-0500
pg 1.7b5 not deep-scrubbed since 2020-06-18T13:33:23.266822-0500
pg 1.7b4 not deep-scrubbed since 2020-06-13T21:44:46.624773-0500
pg 1.7b2 not deep-scrubbed since 2020-06-18T14:41:59.387378-0500
pg 1.7ad not deep-scrubbed since 2020-06-18T08:24:10.388516-0500
pg 1.7ab not deep-scrubbed since 2020-06-17T14:03:24.854422-0500
pg 1.7aa not deep-scrubbed since 2020-06-13T09:22:39.382439-0500
pg 1.7a8 not deep-scrubbed since 2020-06-13T07:51:28.900820-0500
pg 1.7a6 not deep-scrubbed since 2020-06-13T15:11:47.365532-0500
pg 1.7a2 not deep-scrubbed since 2020-06-14T03:24:47.873247-0500
pg 1.7a0 not deep-scrubbed since 2020-06-15T20:47:16.885139-0500
pg 1.79d not deep-scrubbed since 2020-06-18T06:30:04.176538-0500
pg 1.79c not deep-scrubbed since 2020-06-13T19:40:57.498208-0500
pg 1.79b not deep-scrubbed since 2020-06-13T01:18:38.103653-0500
pg 1.799 not deep-scrubbed since 2020-06-17T23:59:10.550439-0500
pg 1.798 not deep-scrubbed since 2020-06-15T01:09:53.154938-0500
951 more pgs...
[WRN] PG_NOT_SCRUBBED: 1001 pgs not scrubbed in time
pg 1.7f8 not scrubbed since 2020-06-18T06:27:18.221763-0500
pg 1.7f7 not scrubbed since 2020-06-18T23:37:00.430759-0500
pg 1.7f6 not scrubbed since 2020-06-18T01:02:50.801081-0500
pg 1.7f4 not scrubbed since 2020-06-18T09:39:43.019677-0500
pg 1.7f3 not scrubbed since 2020-06-18T19:48:43.141276-0500
pg 1.7f2 not scrubbed since 2020-06-18T12:46:59.143593-0500
pg 1.7f1 not scrubbed since 2020-06-18T09:08:01.812785-0500
pg 1.7ef not scrubbed since 2020-06-18T08:16:33.615415-0500
pg 1.7ee not scrubbed since 2020-06-18T10:15:33.258917-0500
pg 1.7eb not scrubbed since 2020-06-18T03:09:01.301923-0500
pg 1.7ea not scrubbed since 2020-06-18T14:57:40.566479-0500
pg 1.7e8 not scrubbed since 2020-06-18T11:59:39.329315-0500
pg 1.7e7 not scrubbed since 2020-06-18T20:03:33.459059-0500
pg 1.7e5 not scrubbed since 2020-06-18T15:28:43.263333-0500
pg 1.7e2 not scrubbed since 2020-06-18T08:05:12.498183-0500
pg 1.7dc not scrubbed since 2020-06-18T10:26:21.759761-0500
pg 1.7d3 not scrubbed since 2020-06-19T00:23:16.679908-0500
pg 1.7d2 not scrubbed since 2020-06-18T07:22:52.704067-0500
pg 1.7d1 not scrubbed since 2020-06-18T18:06:00.247136-0500
pg 1.7cd not scrubbed since 2020-06-17T18:48:26.912212-0500
pg 1.7cb not scrubbed since 2020-06-18T21:52:01.078062-0500
pg 1.7c9 not scrubbed since 2020-06-18T11:41:02.271054-0500
pg 1.7c8 not scrubbed since 2020-06-18T19:56:49.521473-0500
pg 1.7c4 not scrubbed since 2020-06-18T14:54:36.343759-0500
pg 1.7c2 not scrubbed since 2020-06-18T06:23:05.699025-0500
pg 1.7c1 not scrubbed since 2020-06-18T20:55:25.407265-0500
pg 1.7bf not scrubbed since 2020-06-17T21:07:41.596595-0500
pg 1.7bd not scrubbed since 2020-06-18T14:06:23.355490-0500
pg 1.7bc not scrubbed since 2020-06-17T18:21:32.053083-0500
pg 1.7bb not scrubbed since 2020-06-18T09:09:54.970840-0500
pg 1.7ba not scrubbed since 2020-06-18T16:56:09.675439-0500
pg 1.7b9 not scrubbed since 2020-06-18T23:36:13.132979-0500
pg 1.7b8 not scrubbed since 2020-06-18T05:58:53.581638-0500
pg 1.7b7 not scrubbed since 2020-06-18T06:55:11.771207-0500
pg 1.7b6 not scrubbed since 2020-06-18T19:48:35.862303-0500
pg 1.7b5 not scrubbed since 2020-06-18T13:33:23.266822-0500
pg 1.7b4 not scrubbed since 2020-06-19T00:28:27.512317-0500
pg 1.7b2 not scrubbed since 2020-06-18T14:41:59.387378-0500
pg 1.7ad not scrubbed since 2020-06-18T08:24:10.388516-0500
pg 1.7ab not scrubbed since 2020-06-18T14:55:31.056745-0500
pg 1.7aa not scrubbed since 2020-06-18T17:34:13.124195-0500
pg 1.7a8 not scrubbed since 2020-06-18T08:50:54.375698-0500
pg 1.7a6 not scrubbed since 2020-06-18T20:25:04.720733-0500
pg 1.7a2 not scrubbed since 2020-06-18T16:36:02.051328-0500
pg 1.7a0 not scrubbed since 2020-06-18T08:54:00.614194-0500
pg 1.79d not scrubbed since 2020-06-18T06:30:04.176538-0500
pg 1.79c not scrubbed since 2020-06-18T13:06:36.092813-0500
pg 1.79b not scrubbed since 2020-06-17T22:35:03.257797-0500
pg 1.799 not scrubbed since 2020-06-17T23:59:10.550439-0500
pg 1.798 not scrubbed since 2020-06-18T20:07:32.701492-0500
951 more pgs...
[WRN] SLOW_OPS: 692 slow ops, oldest one blocked for 153926 sec, daemons [osd.26,osd.30,osd.35,osd.6] have slow ops.
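The MANY_OBJECTS_PER_PG figures can be reproduced too. Dividing total objects by total PGs gives the cluster average of about 121, and each pool's objects per PG is compared against that. A minimal sketch, assuming the threshold is `mon_pg_warn_max_object_skew` at what I understand to be its default of 10 (I haven't checked what this cluster has it set to), with the values copied from the output above:

```python
# Figures copied from `ceph status` / `ceph health detail` above.
total_objects = 268_350   # "268.35k objects" (rounded by ceph status)
total_pgs = 2209
pool_objects_per_pg = {"cephfs_data": 1747, "ceph": 3273}

# Assumed default of mon_pg_warn_max_object_skew; verify on the cluster.
MAX_OBJECT_SKEW = 10.0

# Cluster-wide average objects per PG, truncated to match the 121 in the warning.
cluster_avg = int(total_objects / total_pgs)


def skew_report(per_pg, avg, threshold):
    """Return {pool: ratio} for pools whose objects per PG exceed threshold * avg."""
    return {
        pool: round(n / avg, 3)
        for pool, n in per_pg.items()
        if n / avg > threshold
    }


flagged = skew_report(pool_objects_per_pg, cluster_avg, MAX_OBJECT_SKEW)
print(cluster_avg, flagged)
```

Since objects per PG is a pool's object count divided by its `pg_num`, both pools carry far more data per PG than the rest of the cluster, which fits with the inherited PG issues I mentioned.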
Oddly, when I try to delete some of the lost PGs, Ceph tells me they don't exist:
Code:
root@proxmox-ceph-2:~# ceph pg dump_stuck | grep unknown
ok
1.d3 unknown [] -1 [] -1
1.249 unknown [] -1 [] -1
1.24e unknown [] -1 [] -1
1.5c1 unknown [] -1 [] -1
1.730 unknown [] -1 [] -1
1.2bc unknown [] -1 [] -1
1.1c3 unknown [] -1 [] -1
root@proxmox-ceph-2:~# ceph pg 1.d3 mark_unfound_lost delete
Error ENOENT: i don't have pgid 1.d3
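What stands out to me in the `dump_stuck` output is that all seven unknown PGs show an empty up set and acting set, with -1 for both primaries. As I understand the `unknown` state, it means no OSD has reported the PG to the manager since the manager started, which would match the mgr being active "since 42h" and these PGs being stuck for 42h. My understanding (unconfirmed) is also that `mark_unfound_lost` acts on unfound objects inside a PG that an OSD is serving, rather than deleting the PG itself, so the ENOENT may just mean the OSD that got the request doesn't hold the PG at all. Here's a small Python sketch that extracts those PG ids from the pasted output so they can be checked one by one. It assumes the usual column order (PG_STAT, STATE, UP, UP_PRIMARY, ACTING, ACTING_PRIMARY), the helper names are hypothetical, and it only prints read-only `ceph pg map` commands rather than running anything:

```python
# Output of `ceph pg dump_stuck | grep unknown`, pasted from above.
DUMP_STUCK = """\
1.d3 unknown [] -1 [] -1
1.249 unknown [] -1 [] -1
1.24e unknown [] -1 [] -1
1.5c1 unknown [] -1 [] -1
1.730 unknown [] -1 [] -1
1.2bc unknown [] -1 [] -1
1.1c3 unknown [] -1 [] -1
"""


def orphaned_pgs(text):
    """Return PG ids whose up and acting sets are empty and whose primaries are -1.

    Assumes the column order PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY.
    """
    result = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank lines, the "ok" line, etc.
        pgid, _state, up, up_primary, acting, acting_primary = fields
        # A header line fails this check because its UP column reads "UP".
        if up == "[]" and acting == "[]" and up_primary == "-1" and acting_primary == "-1":
            result.append(pgid)
    return result


def sort_key(pgid):
    """Sort by pool id, then by PG number (the part after the dot is hex)."""
    pool, seed = pgid.split(".")
    return int(pool), int(seed, 16)


pgs = sorted(orphaned_pgs(DUMP_STUCK), key=sort_key)
for pgid in pgs:
    # Print a read-only follow-up command per PG; nothing is executed here.
    print(f"ceph pg map {pgid}")
```

Sorting on the hex PG number puts them back in the same order `ceph health detail` lists them.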