[PVE 5.4.15] Ceph 12.2 - Reduced Data Availability: 1 pg inactive, 1 pg down

gglambert

New Member
Dec 2, 2020
Hi,

After removing one OSD (on a cluster with 7 nodes and 80 OSDs), Ceph health is in warning.
1 PG is inactive and down.
What can we do to fix this state?
Thank you.
 
Can you please elaborate more about your cluster? Hardware, configuration, states, etc...
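For a first look, the output of something like this would help (a generic starting point, not specific to your setup; <pgid> is a placeholder for the inactive PG's ID):

ceph health detail
ceph pg dump_stuck inactive
ceph pg <pgid> query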
 
In my cluster, I have 7 nodes running PVE 5.4.15.
Each node has between 5 and 14 OSDs, 80 OSDs in total.
The Ceph state is currently HEALTH_ERR:
  • Reduced data availability: 1 pg inactive, 1 pg incomplete (this is the same PG)
  • 294 stuck requests are blocked > 4096 sec. Implicated OSD 28
The affected pool is size 3 / min_size 1 with 2048 PGs. I know we need to raise the PG number to 4096 (sketched below).

The incomplete PG is stored on 3 OSDs.
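For later reference, raising the PG count is roughly done like this, once the cluster is healthy again (a sketch; <pool> is a placeholder for the pool name, and pg_num cannot be decreased afterwards):

ceph osd pool set <pool> pg_num 4096
ceph osd pool set <pool> pgp_num 4096

Increasing pg_num triggers PG splitting and backfill, so it is best done only on a healthy cluster.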
 
Each node has approximately:
  • 256 GB RAM
  • 2 Xeon CPUs
  • Every disk connected as JBOD (no RAID configuration)
 
The incomplete PG is stored on 3 OSDs.
That's not certain, since min_size is 1. Never run with min_size=1: writes may be in flight while only a single copy is left on disk, and that can lead to data loss.

What do the logs say? And what do ceph -s and ceph osd df tree show?
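A generic way to collect that information (a sketch; <pgid> is a placeholder, and /var/log/ceph/ceph-osd.28.log is the default log path for OSD 28, which may differ on your setup):

ceph -s
ceph osd df tree
ceph pg <pgid> query
tail -n 100 /var/log/ceph/ceph-osd.28.log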
 
ceph -s says:

health: HEALTH_WARN
        Reduced data availability: 1 pg inactive, 1 pg incomplete

data:
  pools: 4 pools, 3200 pgs
  pgs:   0.031% pgs not active
         3199 active+clean
         1    incomplete


Everything looks normal in the ceph osd df tree output.

Can I dump this PG on these 3 OSDs, compare the sizes of the 3 dumps, and then copy the good dump onto the two other OSDs?
The dump size of the PG on OSD 28 is: 1.5 GB
The dump size of the PG on OSD 22 is: 1.5 GB
The dump size of the PG on OSD 72 is: 3.5 KB

The primary is OSD 28.
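For reference: exporting a PG copy is normally done with ceph-objectstore-tool while the OSD is stopped (a rough sketch assuming the default data path; <pgid> is a placeholder for the PG ID):

systemctl stop ceph-osd@28
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-28 --pgid <pgid> --op export --file /root/pg.export
systemctl start ceph-osd@28

Importing onto another OSD works the same way with --op import, again with that OSD stopped. Handle this with care and keep the export files as backups.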
 
The situation has changed. I don't know if it is good or bad.
Ceph is still in HEALTH_WARN, but:
  • pg 2.5c1 is stuck peering since forever, current state remapped+peering, last acting [72]
 
  • pg 2.5c1 is stuck peering since forever, current state remapped+peering, last acting [72]
Does OSD 72 still exist? And is the PG still connected to an active OSD?
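You can check both roughly like this (a sketch using standard Ceph commands; ceph pg map prints the up and acting OSD sets for the PG):

ceph osd find 72
ceph pg map 2.5c1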
 
The situation has changed again.
I have restored a dump of the PG onto the 2 other OSDs that should contain the PG.
Now I have:
1/3125014 objects unfound
Do you think I can use this command (checks sketched below)?
  • ceph pg 1.5 mark_unfound_lost delete
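Before marking anything lost, it is worth checking what exactly is unfound (a sketch; the pgid in the commands must match the affected PG, here presumably 2.5c1 rather than the 1.5 from the documentation example):

ceph health detail
ceph pg 2.5c1 list_missing

Note that mark_unfound_lost also accepts revert instead of delete: revert rolls an object back to a previous version when one exists, while delete removes it for good.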
 
I have executed the command. The health is now HEALTH_OK.
Is it possible to know what we have lost?
We have 3 PGs in active+clean+scrubbing+deep. I think that is good (better than yesterday). The next step, if everything is OK, is to set min_size to 2 on the pool (sketched below).
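Setting that is a one-liner (a sketch; <pool> is a placeholder for the pool name):

ceph osd pool set <pool> min_size 2

With size 3 / min_size 2, the pool keeps serving I/O with one copy missing but stops accepting writes once only a single copy is left.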
 
