[PVE 5.4.15] Ceph 12.2 - Reduced Data Availability: 1 pg inactive, 1 pg down

gglambert

New Member
Dec 2, 2020
Hi,

After removing one OSD (on a cluster with 7 nodes and 80 OSDs), Ceph health is in warning.
1 PG is inactive and down.
What can we do to change this state?
Thanks to you.
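As a first step, something like the following usually shows which PG is affected and which OSDs are implicated:

# Show the unhealthy PG and the OSDs implicated by the blocked requests
ceph health detail

# List PGs stuck in an inactive state
ceph pg dump_stuck inactive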
 
Can you please elaborate more about your cluster? Hardware, configuration, states, etc...
 
In my cluster, I have 7 nodes running PVE version 5.4.15.
Each node has between 5 and 14 OSDs, 80 OSDs in total for the cluster.
The state of Ceph is currently HEALTH_ERR:
  • Reduced data availability: 1 pg inactive, 1 pg incomplete (this is the same PG)
  • 294 stuck requests are blocked > 4096 sec. Implicated OSD 28
The concerned pool is size 3 / min_size 1 with 2048 PGs. I know we need to increase the PG count to 4096.

The incomplete PG is stored on 3 OSDs.
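A rough sketch of how to confirm those pool settings and look at the blocked requests; <poolname> is a placeholder for the pool in question, and the pg_num change should only be done once the cluster is healthy again, since it triggers a rebalance:

# Check the pool's replication and PG settings
ceph osd pool get <poolname> size
ceph osd pool get <poolname> min_size
ceph osd pool get <poolname> pg_num

# Later, once the cluster is healthy: raise the PG count (and pgp_num to match)
ceph osd pool set <poolname> pg_num 4096
ceph osd pool set <poolname> pgp_num 4096

# On the node hosting OSD 28: look at the requests blocked on that OSD
ceph daemon osd.28 dump_blocked_ops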
 
Each node has approximately:
  • 256 GB RAM
  • 2 Xeon CPUs
  • All disks are attached as JBOD (no RAID configuration)
 
The incomplete PG is stored on 3 OSDs.
That's not certain, since min_size is 1. Never run with min_size=1: data might be in flight while no copy is left on disk, and that can lead to data loss.

What do the logs say? And what does ceph -s as well as ceph osd df tree say?
 
ceph -s says:

health: HEALTH_WARN
        Reduced data availability: 1 pg inactive, 1 pg incomplete

data:
    pools: 4 pools, 3200 pgs
    pgs:   0.031% pgs not active
           3199 active+clean
           1    incomplete


Everything looks normal in the ceph osd df tree output.
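To pin down the incomplete PG and see what it is waiting on, something along these lines should work (<pgid> is a placeholder for the PG id reported as incomplete):

# List every PG currently in the incomplete state, with its acting set
ceph pg ls incomplete

# The recovery_state section of the query output shows what the PG is blocked on
ceph pg <pgid> query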

Can I dump this PG from the 3 OSDs, compare the sizes of the 3 dumps, and then put the dump onto the two other OSDs?
The dump size of the PG on OSD 28 is 1.5 GB.
The dump size of the PG on OSD 22 is 1.5 GB.
The dump size of the PG on OSD 72 is 3.5 KB.

The primary is OSD 28.
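For reference, exporting and importing a PG between OSDs is normally done with ceph-objectstore-tool while the OSD daemon is stopped; this is a risky, last-resort operation, so treat the sketch below (with <pgid> as a placeholder and the default OSD data path assumed) with care and keep the export files as a backup:

# Stop the OSD before touching its object store
systemctl stop ceph-osd@28

# Export the PG from OSD 28 into a file
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-28 \
    --pgid <pgid> --op export --file /root/pg-export-osd28

# On another (stopped) OSD, the file can be imported with --op import
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-72 \
    --pgid <pgid> --op import --file /root/pg-export-osd28

# Restart the OSD afterwards and let the cluster peer again
systemctl start ceph-osd@28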
 
The situation has changed. I don't know if it is good or bad.
Ceph is still in HEALTH_WARN, but:
  • pg 2.5c1 is stuck peering since forever, current state remapped+peering, last acting [72]
 
  • pg 2.5c1 is stuck peering since forever, current state remapped+peering, last acting [72]
Does OSD 72 still exist? And is the PG still connected to an active OSD?
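Both questions can be checked quickly from the command line; pg 2.5c1 is taken from the message above:

# Confirm OSD 72 still exists, is up/in, and where it sits in the CRUSH tree
ceph osd tree
ceph osd find 72

# Show which OSDs are currently up and acting for the stuck PG
ceph pg map 2.5c1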
 
The situation has changed again.
I have restored a dump of the PG onto the 2 other OSDs that should contain it.
Now I have:
1/3125014 objects unfound
Do you think I can use the command:
  • ceph pg 1.5 mark_unfound_lost delete
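Before discarding anything it is worth listing the unfound object(s) first; revert is the gentler option if an older copy of the object exists, while delete gives it up entirely. A minimal sketch, with <pgid> as a placeholder for the affected PG:

# Show exactly which object(s) are unfound in the PG
ceph pg <pgid> list_missing

# Either roll the unfound object(s) back to an older version, if one exists...
ceph pg <pgid> mark_unfound_lost revert

# ...or give them up entirely (their data is lost)
ceph pg <pgid> mark_unfound_lost delete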
 
I have executed the command. The health is now HEALTH_OK.
Is it possible to know what we have lost?
We have 3 PGs in active+clean+scrubbing+deep. I think that is good (better than yesterday...). The next step, if everything is OK, is to set min_size to 2 on the pool.
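For the record, raising min_size back to 2 is a single pool setting (<poolname> is a placeholder):

# Require at least 2 replicas to be available before the pool accepts client I/O
ceph osd pool set <poolname> min_size 2

# Verify the new value
ceph osd pool get <poolname> min_size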