Ceph shows Health_WARN

Ayush

Member
Oct 27, 2023
Hello,

I have a 3-server cluster with Ceph as shared storage. We were trying to create a Ceph pool based on device class, but during that activity a server rebooted and Ceph now shows a health warning. I am new to Ceph pools.

The following output is included for reference.
ceph -s
  cluster:
    id:     47061c54-d430-47c6-afa6-952da8e88877
    health: HEALTH_WARN
            Reduced data availability: 143 pgs inactive, 15 pgs incomplete, 128 pgs stale
            Degraded data redundancy: 102410/425571 objects degraded (24.064%), 128 pgs degraded, 128 pgs undersized
            139 slow ops, oldest one blocked for 92053 sec, daemons [osd.3,osd.4,osd.5] have slow ops.

  services:
    mon: 3 daemons, quorum 172,171,173 (age 26h)
    mgr: 172 (active, since 7w), standbys: 173, 171
    mds: 1/1 daemons up
    osd: 6 osds: 6 up (since 25m), 6 in (since 25m)

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 465 pgs
    objects: 141.86k objects, 554 GiB
    usage:   1.1 TiB used, 6.9 TiB / 8.0 TiB avail
    pgs:     30.753% pgs not active
             102410/425571 objects degraded (24.064%)
             322 active+clean
             128 stale+undersized+degraded+peered
             15  incomplete

ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 26 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 3 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 54 flags hashpspool stripe_width 0 application cephfs
pool 4 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 55 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 5 'Storage2' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 796 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_bytes 322122547200 application rbd
pool 7 'SSD-POOL' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 1118 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_bytes 16106127360000 application rbd
pool 8 'HDD-1TB' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 1116 flags hashpspool stripe_width 0 target_size_bytes 751619276800 application rbd
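
For reference, the requests blocked on osd.3, osd.4 and osd.5 can be inspected through each OSD's admin socket. A rough sketch (the commands have to be run on the node that hosts the respective OSD):

ceph daemon osd.3 dump_ops_in_flight   # requests currently stuck on osd.3
ceph daemon osd.3 dump_historic_ops    # recently completed (slow) requests
# repeat for osd.4 and osd.5 on their respective nodes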

Please help me understand what is going wrong here and how to manage Ceph better.
 
ceph health detail | grep incomplete
HEALTH_WARN Reduced data availability: 143 pgs inactive, 15 pgs incomplete, 128 pgs stale; Degraded data redundancy: 102410/425571 objects degraded (24.064%), 128 pgs degraded, 128 pgs undersized; 10 slow ops, oldest one blocked for 2796 sec, daemons [osd.3,osd.4,osd.5] have slow ops.
[WRN] PG_AVAILABILITY: Reduced data availability: 143 pgs inactive, 15 pgs incomplete, 128 pgs stale
pg 5.20 is incomplete, acting [4,3,5] (reducing pool Storage2 min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 5.25 is incomplete, acting [5,3,4] (reducing pool Storage2 min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 5.27 is incomplete, acting [4,3,5] (reducing pool Storage2 min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 5.38 is incomplete, acting [3,5,4] (reducing pool Storage2 min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 5.3b is incomplete, acting [4,5,3] (reducing pool Storage2 min_size from 2 may help; search ceph.com/docs for 'incomplete')
pg 5.3c is incomplete, acting [5,3,4] (reducing pool Storage2 min_size from 2 may help; search ceph.com/docs for 'incomplete')
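
The hint in the output refers to the pool's min_size setting. As a sketch, and only as a temporary measure on a non-production cluster (it trades consistency for availability), it could be lowered and later restored:

ceph osd pool set Storage2 min_size 1   # temporarily accept I/O with a single complete copy
# ...once the PGs have recovered...
ceph osd pool set Storage2 min_size 2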


ceph pg map 5.20
osdmap e1500 pg 5.20 (5.20) -> up [4,3,5] acting [4,3,5]

1:~# ceph pg map 5.25
osdmap e1500 pg 5.25 (5.25) -> up [5,3,4] acting [5,3,4]
1:~# ceph pg map 5.27
osdmap e1500 pg 5.27 (5.27) -> up [4,3,5] acting [4,3,5]
1:~# ceph pg map 5.38
osdmap e1500 pg 5.38 (5.38) -> up [3,5,4] acting [3,5,4]
1:~# ceph pg map 5.3b
osdmap e1500 pg 5.3b (5.3b) -> up [4,5,3] acting [4,5,3]
1:~# ceph pg map 5.3c
osdmap e1500 pg 5.3c (5.3c) -> up [5,3,4] acting [5,3,4]
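
For a closer look at why a PG whose acting set is complete still reports incomplete, the full peering state can be dumped (sketch):

ceph pg 5.20 query   # inspect the recovery_state and past_intervals sections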
 
ceph health detail | grep incomplete
HEALTH_WARN Reduced data availability: 143 pgs inactive, 15 pgs incomplete, 128 pgs stale; Degraded data redundancy: 102410/425571 objects degraded (24.064%), 128 pgs degraded, 128 pgs undersized; 10 slow ops, oldest one blocked for 2835 sec, daemons [osd.3,osd.4,osd.5] have slow ops.

How can I fix this?
 
As this is a non-production cluster, I removed the following storage pools to get Ceph healthy again, and I will create new pools according to the disk class (a rough sketch of those commands follows after the list).

pool 5 'Storage2' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 796 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_bytes 322122547200 application rbd
pool 7 'SSD-POOL' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 1118 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_bytes 16106127360000 application rbd
pool 8 'HDD-1TB' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 1116 flags hashpspool stripe_width 0 target_size_bytes 751619276800 application rbd
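
The rough sketch mentioned above, assuming one CRUSH rule per device class; the rule names, pool names and pg_num values are only illustrative:

# pool deletion must be allowed first
ceph config set mon mon_allow_pool_delete true
ceph osd pool delete Storage2 Storage2 --yes-i-really-really-mean-it

# one replicated CRUSH rule per device class
ceph osd crush rule create-replicated replicated_ssd default host ssd
ceph osd crush rule create-replicated replicated_hdd default host hdd

# new pools bound to those rules
ceph osd pool create ssd-pool 128 128 replicated replicated_ssd
ceph osd pool create hdd-pool 128 128 replicated replicated_hdd
ceph osd pool application enable ssd-pool rbd
ceph osd pool application enable hdd-pool rbd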
 
Hello,

I am not sure what the issue is from the information provided alone. Could you please try restarting all Ceph services on one node, give Ceph some time to settle, and then proceed to the next one? "10 slow ops, oldest one blocked for 2835 sec" can in many cases be fixed by restarting the services; just make sure you don't restart them all at once.
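
On Proxmox VE the Ceph daemons are ordinary systemd units, so a per-node restart could look roughly like this (a sketch; setting noout is optional but avoids needless rebalancing while OSDs are down):

ceph osd set noout                                   # optional: pause rebalancing
systemctl restart ceph-osd.target                    # all OSDs on this node
systemctl restart ceph-mon.target ceph-mgr.target    # MON and MGR on this node
# wait until 'ceph -s' settles, then repeat on the next node
ceph osd unset noout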

Note that you can use the "Code" blocks in the editor to make command output more readable.
 
