There seems to be a problem with pg 1.0, and with my understanding of placement groups, pools, and OSDs.
Yesterday I removed osd.0 in an attempt to get the contents of pg 1.0 moved to another OSD. But today the PG is still stuck inactive, so all my attempt achieved was resetting the reported "stuck inactive" time from 8d to 24h. I therefore moved osd.0 back into the cluster. Currently `ceph daemon osd.0 dump_ops_in_flight` reports 106 ops in the queue. Interestingly, yesterday I had more than 400 PGs, but today there are fewer than 300. I guess the autoscaler did some work.
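For reference, this is roughly what I ran; a sketch from memory, assuming "removed" is best described as marking the OSD out and back in (I did not purge it):
Code:
# mark osd.0 out so its PGs get remapped to other OSDs
ceph osd out 0
# after pg 1.0 stayed inactive, bring it back in
ceph osd in 0
# check the ops currently queued on osd.0
ceph daemon osd.0 dump_ops_in_flight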
As I am new to Ceph, I don't know what's causing these issues.
Code:
root@node1:~# ceph health detail
HEALTH_WARN Reduced data availability: 1 pg inactive; 17 slow ops, oldest one blocked for 87198 sec, osd.0 has slow ops
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
    pg 1.0 is stuck inactive for 24h, current state unknown, last acting []
[WRN] SLOW_OPS: 17 slow ops, oldest one blocked for 87198 sec, osd.0 has slow ops

root@node1:~# ceph status
  cluster:
    id:     8bc967d9-48a1-444c-af5a-3e875b867150
    health: HEALTH_WARN
            Reduced data availability: 1 pg inactive
            66 slow ops, oldest one blocked for 87209 sec, osd.0 has slow ops

  services:
    mon: 3 daemons, quorum node1,node3,node2 (age 24h)
    mgr: node1(active, since 24h), standbys: node3, node2
    mds: 1/1 daemons up, 2 standby
    osd: 24 osds: 24 up (since 2h), 23 in (since 2h); 21 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 289 pgs
    objects: 838.09k objects, 3.2 TiB
    usage:   9.8 TiB used, 82 TiB / 92 TiB avail
    pgs:     0.346% pgs unknown
             25477/2514282 objects misplaced (1.013%)
             268 active+clean
             20  active+remapped+backfilling
             1   unknown

  io:
    client:   43 KiB/s rd, 993 KiB/s wr, 2 op/s rd, 86 op/s wr
    recovery: 50 MiB/s, 12 objects/s

root@node1:~# ceph pg map 1.0
osdmap e1044 pg 1.0 (1.0) -> up [0,10,16] acting [0]
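If it helps with diagnosing pg 1.0, I can also post the output of these (just listing the commands here, no output yet):
Code:
# detailed state of the problem PG
ceph pg 1.0 query
# all PGs currently stuck inactive
ceph pg dump_stuck inactive
# per-OSD utilisation and placement in the CRUSH tree
ceph osd df tree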
Maybe someone can shed some light on the gaps in my understanding. Are PGs numbered globally or per pool? Can a PG with the ID 1.0 exist in several pools at once? When defining a pool, I also set its number of PGs; does adding another pool therefore increase the total number of PGs across all OSDs?
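To make the question concrete, this is how I have been looking at pool IDs and PG counts so far (the pool name below is only a placeholder, not one of my real pools):
Code:
# pools with their numeric IDs (the "1" in a PG id like 1.0, as I understand it)
ceph osd lspools
# PG count configured for a single pool (placeholder name)
ceph osd pool get mypool pg_num
# list the PGs that belong to that pool
ceph pg ls-by-pool mypool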
Can the autoscaler, while rescaling the PG count, interfere with a VM? Over the last few days, our Univention Corporate Server VM has shown outages that we didn't have before moving to Ceph.
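In case the autoscaler turns out to be related, I was planning to check it and, if necessary, pause it per pool roughly like this (again a placeholder pool name):
Code:
# current autoscaler state and target PG counts per pool
ceph osd pool autoscale-status
# disable autoscaling for one pool while debugging (placeholder name)
ceph osd pool set mypool pg_autoscale_mode off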