Ceph HEALTH_WARN: Degraded data redundancy: 512 pgs undersized

cmonty14
Hi,
I have configured Ceph on a 3-node-cluster.
Then I created OSDs as follows:
Node 1: 3x 1TB HDD
Node 2: 3x 8TB HDD
Node 3: 4x 8TB HDD
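
For reference, on Proxmox each OSD would typically be created per disk with something like this (device paths are just examples):
Code:
pveceph createosd /dev/sdb
pveceph createosd /dev/sdc
pveceph createosd /dev/sdd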

This results in following OSD tree:
root@ld4257:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 54.20874 root default
-3 3.27118 host ld4257
0 hdd 1.09039 osd.0 up 1.00000 1.00000
1 hdd 1.09039 osd.1 up 1.00000 1.00000
2 hdd 1.09039 osd.2 up 1.00000 1.00000
-7 21.83038 host ld4464
3 hdd 7.27679 osd.3 up 1.00000 1.00000
4 hdd 7.27679 osd.4 up 1.00000 1.00000
5 hdd 7.27679 osd.5 up 1.00000 1.00000
-5 29.10718 host ld4465
6 hdd 7.27679 osd.6 up 1.00000 1.00000
7 hdd 7.27679 osd.7 up 1.00000 1.00000
8 hdd 7.27679 osd.8 up 1.00000 1.00000
9 hdd 7.27679 osd.9 up 1.00000 1.00000


After this I created a pool with 256 PGs in total, based on the calculation done here (size=3, osd=10, data=100%, target=100).
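
Presumably the calculator uses the usual formula; with these inputs the arithmetic works out roughly as follows:
Code:
# target_pgs_per_osd * num_osds * %data / size, taken to the nearest power of two
# 100 * 10 * 1.00 / 3 ≈ 333  ->  256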

Ceph health gives me a warning:
root@ld4257:~# ceph -s
cluster:
id: fda2f219-7355-4c46-b300-8a65b3834761
health: HEALTH_WARN
Degraded data redundancy: 12 pgs undersized
clock skew detected on mon.ld4464, mon.ld4465

services:
mon: 3 daemons, quorum ld4257,ld4464,ld4465
mgr: ld4257(active), standbys: ld4465, ld4464
osd: 10 osds: 10 up, 10 in

data:
pools: 1 pools, 256 pgs
objects: 0 objects, 0 bytes
usage: 10566 MB used, 55499 GB / 55509 GB avail
pgs: 244 active+clean
12 active+undersized


And this is the health detail:
root@ld4257:~# ceph health detail
HEALTH_WARN Degraded data redundancy: 12 pgs undersized; clock skew detected on mon.ld4464, mon.ld4465
PG_DEGRADED Degraded data redundancy: 12 pgs undersized
pg 2.1d is stuck undersized for 115.728186, current state active+undersized, last acting [3,7]
pg 2.22 is stuck undersized for 115.737825, current state active+undersized, last acting [6,3]
pg 2.29 is stuck undersized for 115.736686, current state active+undersized, last acting [6,5]
pg 2.31 is stuck undersized for 115.738920, current state active+undersized, last acting [9,5]
pg 2.38 is stuck undersized for 115.728054, current state active+undersized, last acting [3,6]
pg 2.57 is stuck undersized for 115.727351, current state active+undersized, last acting [4,6]
pg 2.65 is stuck undersized for 115.727032, current state active+undersized, last acting [3,6]
pg 2.76 is stuck undersized for 115.727156, current state active+undersized, last acting [4,6]
pg 2.90 is stuck undersized for 115.738454, current state active+undersized, last acting [7,3]
pg 2.cc is stuck undersized for 115.728976, current state active+undersized, last acting [3,6]
pg 2.df is stuck undersized for 115.741311, current state active+undersized, last acting [8,5]
pg 2.e2 is stuck undersized for 115.741280, current state active+undersized, last acting [7,3]
MON_CLOCK_SKEW clock skew detected on mon.ld4464, mon.ld4465
mon.ld4464 addr 192.168.100.12:6789/0 clock skew 0.0825909s > max 0.05s (latency 0.00569053s)
mon.ld4465 addr 192.168.100.13:6789/0 clock skew 0.0824161s > max 0.05s (latency 0.00140001s)


I'm wondering why there are fewer PGs on the larger disks 6-9 than on disks 0-3:
root@ld4257:~# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
0 hdd 1.09039 1.00000 1116G 1056M 1115G 0.09 4.97 78
1 hdd 1.09039 1.00000 1116G 1056M 1115G 0.09 4.97 73
2 hdd 1.09039 1.00000 1116G 1056M 1115G 0.09 4.97 93
3 hdd 7.27679 1.00000 7451G 1056M 7450G 0.01 0.74 76
4 hdd 7.27679 1.00000 7451G 1056M 7450G 0.01 0.74 102
5 hdd 7.27679 1.00000 7451G 1056M 7450G 0.01 0.74 78
6 hdd 7.27679 1.00000 7451G 1056M 7450G 0.01 0.74 65
7 hdd 7.27679 1.00000 7451G 1056M 7450G 0.01 0.74 60
8 hdd 7.27679 1.00000 7451G 1056M 7450G 0.01 0.74 67
9 hdd 7.27679 1.00000 7451G 1056M 7450G 0.01 0.74 64
TOTAL 55509G 10566M 55499G 0.02
MIN/MAX VAR: 0.74/4.97 STDDEV: 0.04



How can I fix this?
 
What is the data usage on your OSDs?
Code:
ceph osd df

Size=3 means every PG needs to be replicated 3 times across 3 nodes. But your node 1 has much less HDD capacity than the others.
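
You can also check the pool's replication size and which CRUSH rule it applies, e.g. (replace <pool> with your pool name):
Code:
ceph osd pool get <pool> size
ceph osd pool get <pool> crush_rule
ceph osd crush rule dump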

And first, fix the clock skew: check that all nodes use the same NTP server and that their clocks are synchronized.
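
Depending on whether the nodes run ntpd, chrony or systemd-timesyncd, you can check with something like:
Code:
timedatectl status     # is the system clock reported as NTP-synchronized?
ntpq -p                # if ntpd is used: list peers and offsets
chronyc sources -v     # if chrony is used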
 
I have modified the CRUSH map and created two different root buckets for the two different HDD types.
This means one bucket for all HDDs of size 1TB and one bucket for all HDDs of size 8TB.
root@ld4257:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-11 0 root ssd
-12 0 host ld4257-ssd
-13 0 host ld4464-ssd
-14 0 host ld4465-ssd
-10 43.66196 root hdd_strgbox
-27 0 host ld4257-hdd_strgbox
-28 21.83098 host ld4464-hdd_strgbox
3 hdd 7.27699 osd.3 up 1.00000 1.00000
4 hdd 7.27699 osd.4 up 1.00000 1.00000
5 hdd 7.27699 osd.5 up 1.00000 1.00000
-29 21.83098 host ld4465-hdd_strgbox
6 hdd 7.27699 osd.6 up 1.00000 1.00000
7 hdd 7.27699 osd.7 up 1.00000 1.00000
8 hdd 7.27699 osd.8 up 1.00000 1.00000
-9 3.26999 root hdd
-15 3.26999 host ld4257-hdd
0 hdd 1.09000 osd.0 up 1.00000 1.00000
1 hdd 1.09000 osd.1 up 1.00000 1.00000
2 hdd 1.09000 osd.2 up 1.00000 1.00000
-16 0 host ld4464-hdd
-17 0 host ld4465-hdd
-1 0 root default
-3 0 host ld4257
-7 0 host ld4464
-5 0 host ld4465
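
For reference, the dedicated roots above and the rules used below can be set up either by editing the decompiled CRUSH map (which is what I did) or via the CLI; roughly:
Code:
# export, edit and re-import the CRUSH map ...
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# ... edit crushmap.txt (add the roots/hosts), then:
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new

# ... and create replicated rules that point at the new roots (Luminous+)
ceph osd crush rule create-replicated hdd_rule hdd host
ceph osd crush rule create-replicated hddstrgbox_rule hdd_strgbox host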


Then I created relevant pools:
pveceph createpool hdd -crush_rule hdd_rule -pg_num 256 -size 2
pveceph createpool hddstrgbox -crush_rule hddstrgbox_rule -pg_num 512 -size 2
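
To double-check that both pools really use the intended rules, something like this can be used:
Code:
ceph osd pool ls detail
ceph osd pool get hdd crush_rule
ceph osd pool get hddstrgbox crush_rule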


If I use a larger pg_num, the error message is always:
mon_command failed - pg_num 1024 size 3 would mean 3584 total pgs, which exceeds max 2000 (mon_max_pg_per_osd 200 * num_in_osds 10)
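
If I understand the message correctly, the limit is mon_max_pg_per_osd * num_in_osds; assuming it counts the existing hdd pool (256 PGs, size 2) plus the requested pool, the numbers add up like this:
Code:
# existing:  256 PGs  * size 2 =  512 PG replicas
# requested: 1024 PGs * size 3 = 3072 PG replicas
# total:                         3584  >  200 * 10 = 2000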

And as a consequence, the health status reports this:
root@ld4257:~# ceph -s
cluster:
id: fda2f219-7355-4c46-b300-8a65b3834761
health: HEALTH_WARN
Reduced data availability: 512 pgs inactive
Degraded data redundancy: 512 pgs undersized

services:
mon: 3 daemons, quorum ld4257,ld4464,ld4465
mgr: ld4257(active), standbys: ld4465, ld4464
osd: 10 osds: 10 up, 10 in

data:
pools: 2 pools, 512 pgs
objects: 0 objects, 0 bytes
usage: 10765 MB used, 55499 GB / 55509 GB avail
pgs: 100.000% pgs not active
512 undersized+peered


What must be considered to overcome this warning?

THX