[PVE 5.4.15] Ceph 12.2 - Reduced Data Availability: 1 pg inactive, 1 pg down

gglambert

New Member
Dec 2, 2020
Hi,

After removing one OSD (on a cluster with 7 nodes and 80 OSDs), Ceph health is in warning.
1 PG is inactive and down.
What can we do to fix this state?
Thank you.
 
Can you please elaborate more about your cluster? Hardware, configuration, states, etc...
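For a first look, the output of something like this would help (a generic starting point, not specific to your setup; <pgid> is a placeholder for the inactive PG's ID):

ceph health detail
ceph pg dump_stuck inactive
ceph pg <pgid> query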
 
In my cluster, I have 7 nodes running PVE 5.4.15.
Each node has between 5 and 14 OSDs, 80 OSDs in total.
The Ceph state is currently HEALTH_ERR:
  • Reduced data availability: 1 pg inactive, 1 pg incomplete (this is the same PG)
  • 294 stuck requests are blocked > 4096 sec. Implicated OSD 28
The affected pool is size 3 / min_size 1 with 2048 PGs. I know we need to raise the PG number to 4096 (sketched below).

The incomplete PG is stored on 3 OSDs.
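For later reference, raising the PG count is roughly done like this, once the cluster is healthy again (a sketch; <pool> is a placeholder for the pool name, and pg_num cannot be decreased afterwards):

ceph osd pool set <pool> pg_num 4096
ceph osd pool set <pool> pgp_num 4096

Increasing pg_num triggers PG splitting and backfill, so it is best done only on a healthy cluster.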
 
Each node has approximately:
  • 256 GB RAM
  • 2 Xeon CPUs
  • Every disk connected as JBOD (no RAID configuration)
 
The incomplete PG is stored on 3 OSDs.
That's not certain, since min_size is 1. Never run with min_size=1: writes may be in flight while only a single copy is left on disk, and that can lead to data loss.

What do the logs say? And what do ceph -s and ceph osd df tree show?
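A generic way to collect that information (a sketch; <pgid> is a placeholder, and /var/log/ceph/ceph-osd.28.log is the default log path for OSD 28, which may differ on your setup):

ceph -s
ceph osd df tree
ceph pg <pgid> query
tail -n 100 /var/log/ceph/ceph-osd.28.log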
 
ceph -s says:

health: HEALTH_WARN
        Reduced data availability: 1 pg inactive, 1 pg incomplete

data:
  pools: 4 pools, 3200 pgs
  pgs:   0.031% pgs not active
         3199 active+clean
         1    incomplete


Everything looks normal in the ceph osd df tree output.

Can I dump this PG on these 3 OSDs, compare the sizes of the 3 dumps, and then copy the good dump onto the two other OSDs?
The dump size of the PG on OSD 28 is: 1.5 GB
The dump size of the PG on OSD 22 is: 1.5 GB
The dump size of the PG on OSD 72 is: 3.5 KB

The primary is OSD 28.
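For reference: exporting a PG copy is normally done with ceph-objectstore-tool while the OSD is stopped (a rough sketch assuming the default data path; <pgid> is a placeholder for the PG ID):

systemctl stop ceph-osd@28
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-28 --pgid <pgid> --op export --file /root/pg.export
systemctl start ceph-osd@28

Importing onto another OSD works the same way with --op import, again with that OSD stopped. Handle this with care and keep the export files as backups.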
 
The situation has changed. I don't know if it is good or bad.
Ceph is still in HEALTH_WARN, but:
  • pg 2.5c1 is stuck peering since forever, current state remapped+peering, last acting [72]
 
  • pg 2.5c1 is stuck peering since forever, current state remapped+peering, last acting [72]
Does OSD 72 still exist? And is the PG still connected to an active OSD?
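You can check both roughly like this (a sketch using standard Ceph commands; ceph pg map prints the up and acting OSD sets for the PG):

ceph osd find 72
ceph pg map 2.5c1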
 
The situation has changed again.
I have restored a dump of the PG onto the 2 other OSDs that should contain the PG.
Now I have:
1/3125014 objects unfound
Do you think I can use this command (checks sketched below)?
  • ceph pg 1.5 mark_unfound_lost delete
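Before marking anything lost, it is worth checking what exactly is unfound (a sketch; the pgid in the commands must match the affected PG, here presumably 2.5c1 rather than the 1.5 from the documentation example):

ceph health detail
ceph pg 2.5c1 list_missing

Note that mark_unfound_lost also accepts revert instead of delete: revert rolls an object back to a previous version when one exists, while delete removes it for good.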
 
I have executed the command. The health is now HEALTH_OK.
Is it possible to know what we have lost?
We have 3 PGs in active+clean+scrubbing+deep. I think that is good (better than yesterday). The next step, if everything is OK, is to set min_size to 2 on the pool (sketched below).
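Setting that is a one-liner (a sketch; <pool> is a placeholder for the pool name):

ceph osd pool set <pool> min_size 2

With size 3 / min_size 2, the pool keeps serving I/O with one copy missing but stops accepting writes once only a single copy is left.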
 
