[SOLVED] Ceph health error: pg current state unknown, last acting [76]

cmonty14

Hi,

I had trouble with my Ceph cluster after rebooting the nodes sequentially.
This is fixed in the meantime; however, there is still an error when executing ceph health detail:
Code:
root@ld3955:~# ceph health detail
HEALTH_WARN 2 pools have many more objects per pg than average; Reduced data availability: 3 pgs inactive; clock skew detected on mon.ld5506
MANY_OBJECTS_PER_PG 2 pools have many more objects per pg than average
    pool hdd objects per pg (29) is more than 29 times cluster average (1)
    pool pve_cephfs_data objects per pg (41) is more than 41 times cluster average (1)
PG_AVAILABILITY Reduced data availability: 3 pgs inactive
    pg 4.4b is stuck inactive since forever, current state unknown, last acting [76]
    pg 4.28a is stuck inactive since forever, current state unknown, last acting [76]
    pg 4.30f is stuck inactive since forever, current state unknown, last acting [76]
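
A quick way to list all stuck PGs at once, assuming the standard Ceph CLI, would be:
Code:
# list every PG currently stuck in the inactive state
ceph pg dump_stuck inactive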


Checking one of the stuck PGs:
Code:
root@ld3955:~# ceph pg 4.4b query
{
    "state": "unknown",
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "epoch": 10106,
    "up": [
        104,
        148,
        178
    ],
    "acting": [
        76
    ],
[...]
    "peer_info": [],
    "recovery_state": [
        {
            "name": "Started/Primary/Peering/WaitActingChange",
            "enter_time": "2019-06-12 13:43:12.205232",
            "comment": "waiting for pg acting set to change"
        },
        {
            "name": "Started",
            "enter_time": "2019-06-12 13:43:12.195757"
        }
    ],
    "agent_state": {}
}
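
The "WaitActingChange" state together with up [104,148,178] versus acting [76] suggests the PG is waiting for the acting set to move off osd.76. To see which host osd.76 runs on, a standard command such as this could help (not part of the original post):
Code:
# show IP, host and CRUSH location of osd.76
ceph osd find 76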


Can you please advise how to fix this?

THX
 
Did you wait between the reboots until Ceph was okay again?
Generally, it seems you have a scaling problem in your cluster. Did you check the usage of the OSDs, the replicas, PGs, etc.?
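
For reference, the usual commands to check this would be something along these lines (a generic sketch using the standard Ceph CLI):
Code:
# per-OSD utilization and PG count, grouped by CRUSH hierarchy
ceph osd df tree
# replica size, min_size and pg_num of every pool
ceph osd pool ls detail
# overall and per-pool usage
ceph df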
 
Did you wait between the reboots until Ceph was okay again?
Generally, it seems you have a scaling problem in your cluster. Did you check the usage of the OSDs, the replicas, PGs, etc.?

Well, I did not wait until all OSDs were green in the WebUI before rebooting another node.
What do you mean by "scaling problem in cluster"?

I don't think there's an issue with the usage, though.
Code:
root@ld3955:~# ceph -s
  cluster:
    id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
    health: HEALTH_WARN
            2 pools have many more objects per pg than average
            Reduced data availability: 3 pgs inactive
            clock skew detected on mon.ld5506

  services:
    mon: 3 daemons, quorum ld5505,ld5506,ld5507
    mgr: ld5506(active), standbys: ld5507, ld5505
    mds: pve_cephfs-1/1/1 up {0=ld3955=up:active}
    osd: 268 osds: 268 up, 268 in; 3 remapped pgs

  data:
    pools:   7 pools, 10880 pgs
    objects: 17.91k objects, 69.6GiB
    usage:   585GiB used, 448TiB / 449TiB avail
    pgs:     0.028% pgs unknown
             10877 active+clean
             3     unknown
 
On the node with osd.76, try restarting the OSD as 'root' with:
Code:
systemctl restart ceph-osd@76.service
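
After the restart, it can be checked whether the previously unknown PGs peer and become active again, for example with (standard commands, added for completeness):
Code:
# overall cluster state; the 3 unknown PGs should disappear
ceph -s
# state of one of the affected PGs
ceph pg 4.4b query | grep '"state"'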

Restarting osd.76 fixed the issue.
Now ceph health detail no longer reports the inactive PGs.

Code:
root@ld3955:~# ceph health detail
HEALTH_WARN 2 pools have many more objects per pg than average; clock skew detected on mon.ld5506
MANY_OBJECTS_PER_PG 2 pools have many more objects per pg than average
    pool hdd objects per pg (29) is more than 29 times cluster average (1)
    pool pve_cephfs_data objects per pg (41) is more than 41 times cluster average (1)
MON_CLOCK_SKEW clock skew detected on mon.ld5506
    mon.ld5506 addr 10.97.206.94:6789/0 clock skew 0.0802101s > max 0.05s (latency 0.000937229s)
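
The remaining clock-skew warning points to a time-sync problem on mon.ld5506; checking and restarting the time service on that node would be the usual first step (a generic sketch, assuming systemd-timesyncd or chrony handles NTP there):
Code:
# on ld5506: check whether the clock is reported as synchronized
timedatectl
# restart the time-sync service that is actually in use, e.g.
systemctl restart systemd-timesyncd    # or: systemctl restart chrony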
 
pool hdd objects per pg (29) is more than 29 times cluster average (1)
pool pve_cephfs_data objects per pg (41) is more than 41 times cluster average (1)
Does one of these pools have the ID 4? If so, then it could be that there are simply too many PGs and Ceph may not be able to distribute them.
 
Does one of these pools have the ID 4? If so, then it could be that there are simply too many PGs and Ceph may not be able to distribute them.

Yes, there's a pool with ID 4.
Code:
root@ld3955:~# ceph osd pool ls detail
pool 4 'backup' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 7626 flags hashpspool stripe_width 0 application rbd
pool 6 'nvme' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 512 pgp_num 512 last_change 7665 flags hashpspool stripe_width 0 application rbd
pool 11 'hdb_backup' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 8335 flags hashpspool stripe_width 0 application rbd
pool 21 'pve_cephfs_data' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 256 pgp_num 256 last_change 10164 lfor 0/10162 flags hashpspool stripe_width 0 application cephfs,rbd
pool 22 'pve_cephfs_metadata' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 256 pgp_num 256 last_change 10168 lfor 0/10166 flags hashpspool stripe_width 0 application cephfs,rbd
pool 25 'hdd' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 10115 lfor 0/10113 flags hashpspool stripe_width 0 application rbd
        removed_snaps [1~3]
pool 26 'pve_default' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 512 pgp_num 512 last_change 9807 flags hashpspool stripe_width 0 application rbd
 
What does the below command show?
Code:
ceph pg map 4.4b

And I believe pool 4 doesn't have enough PGs, hence Ceph complaining about 2 pools having many more objects per PG than average. If I read the crushmap correctly, there should be 144 OSDs associated with strgbox.

Code:
( "Target PGs per OSD" x "OSD #" x "%Data" ) / Size
---------------------------------------------------------------------------
(100 x 144 x 100) / 3 = 4096 PGs
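
To double-check the OSD count under the 'strgbox' bucket and to apply a pg_num along these lines, commands like the following could be used (a sketch only; the pool name 'backup' and the value 4096 merely illustrate the calculation above):
Code:
# count the OSDs below the 'strgbox' CRUSH bucket
ceph osd ls-tree strgbox | wc -l
# raise pg_num/pgp_num of pool 4 ('backup') if the calculation calls for it
# note: before Ceph Nautilus, pg_num can only be increased, never decreased
ceph osd pool set backup pg_num 4096
ceph osd pool set backup pgp_num 4096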
 
