[SOLVED] Proxmox Ceph - After power failure

jorel83

Active Member
Dec 11, 2017
Hi,

Today there was an unexpected power outage where my servers are co-located; the entire datacenter went dark. Luckily I had fresh backups, so for the most part I could simply restore.

However, I have an issue with one OSD on one server: the OSD is down, and PGs are stuck in "active+recovery_wait+degraded". I have tried repair etc., but nothing is helping, and Google wasn't my friend in this case. Does anyone have a suggestion on how to proceed? The drive does not seem to be physically damaged by the power failure.


ceph health detail
HEALTH_ERR 1 osds down; 5681/115284 objects misplaced (4.928%); 61/38428 objects unfound (0.159%); 4 scrub errors; Possible data damage: 3 pgs inconsistent; Degraded data redundancy: 160/115284 objects degraded (0.139%), 53 pgs degraded; 3 stuck requests are blocked > 4096 sec. Implicated osds 1
OSD_DOWN 1 osds down
osd.11 (root=default,host=proxmox4) is down
OBJECT_MISPLACED 5681/115284 objects misplaced (4.928%)
OBJECT_UNFOUND 61/38428 objects unfound (0.159%)
pg 1.236 has 1 unfound objects
pg 1.232 has 1 unfound objects
pg 1.230 has 1 unfound objects
pg 1.223 has 1 unfound objects
pg 1.212 has 1 unfound objects
pg 1.20c has 1 unfound objects
pg 1.20b has 1 unfound objects
pg 1.1f7 has 1 unfound objects
pg 1.1f6 has 1 unfound objects
pg 1.1ef has 1 unfound objects
pg 1.1e0 has 1 unfound objects
pg 1.1d5 has 1 unfound objects
pg 1.1cb has 2 unfound objects
pg 1.1c0 has 1 unfound objects
pg 1.1b3 has 1 unfound objects
pg 1.1a7 has 2 unfound objects
pg 1.19a has 2 unfound objects
pg 1.c0 has 2 unfound objects
pg 1.b2 has 1 unfound objects
pg 1.ab has 3 unfound objects
pg 1.9c has 1 unfound objects
pg 1.9b has 1 unfound objects
pg 1.9a has 1 unfound objects
pg 1.90 has 1 unfound objects
pg 1.86 has 1 unfound objects
pg 1.84 has 1 unfound objects
pg 1.79 has 1 unfound objects
pg 1.76 has 1 unfound objects
pg 1.74 has 1 unfound objects
pg 1.6a has 1 unfound objects
pg 1.c has 1 unfound objects
pg 1.10 has 1 unfound objects
pg 1.4e has 1 unfound objects
pg 1.ca has 1 unfound objects
pg 1.dd has 1 unfound objects
pg 1.e5 has 1 unfound objects
pg 1.e9 has 1 unfound objects
pg 1.f8 has 1 unfound objects
pg 1.100 has 1 unfound objects
pg 1.10c has 1 unfound objects
pg 1.116 has 1 unfound objects
pg 1.11f has 1 unfound objects
pg 1.131 has 1 unfound objects
pg 1.14a has 1 unfound objects
pg 1.151 has 1 unfound objects
pg 1.15b has 1 unfound objects
pg 1.170 has 1 unfound objects
pg 1.17a has 2 unfound objects
pg 1.17c has 1 unfound objects
pg 1.17e has 1 unfound objects
pg 1.181 has 1 unfound objects
(additional pgs left out for brevity)
OSD_SCRUB_ERRORS 4 scrub errors
PG_DAMAGED Possible data damage: 3 pgs inconsistent
pg 1.dc is active+clean+inconsistent, acting [1,4,12]
pg 1.163 is active+clean+remapped+inconsistent, acting [3,14,8]
pg 1.1c2 is active+clean+remapped+inconsistent, acting [0,14,6]
PG_DEGRADED Degraded data redundancy: 160/115284 objects degraded (0.139%), 53 pgs degraded
pg 1.c is active+recovery_wait+degraded, acting [14,0,6], 1 unfound
pg 1.10 is active+recovery_wait+degraded, acting [4,9,14], 1 unfound
pg 1.4e is active+recovery_wait+degraded, acting [4,9,1], 1 unfound
pg 1.6a is active+recovery_wait+degraded, acting [3,8,1], 1 unfound
pg 1.74 is active+recovery_wait+degraded, acting [7,14,9], 1 unfound
pg 1.76 is active+recovery_wait+degraded, acting [3,1,14], 1 unfound
pg 1.79 is active+recovery_wait+degraded, acting [14,8,4], 1 unfound
pg 1.84 is active+recovery_wait+degraded, acting [8,5,1], 1 unfound
pg 1.86 is active+recovery_wait+degraded, acting [14,9,5], 1 unfound
pg 1.90 is active+recovery_wait+degraded, acting [8,1,5], 1 unfound
pg 1.9a is active+recovery_wait+degraded, acting [9,0,5], 1 unfound
pg 1.9b is active+recovery_wait+degraded, acting [8,4,1], 1 unfound
pg 1.9c is active+recovery_wait+degraded, acting [5,1,9], 1 unfound
pg 1.ab is active+recovery_wait+degraded, acting [5,9,14], 3 unfound
pg 1.b2 is active+recovery_wait+degraded, acting [8,4,14], 1 unfound
pg 1.c0 is active+recovery_wait+degraded, acting [9,3,14], 2 unfound
pg 1.ca is active+recovery_wait+degraded, acting [4,8,14], 1 unfound
pg 1.dd is active+recovery_wait+degraded, acting [3,8,1], 1 unfound
pg 1.e5 is active+recovery_wait+degraded, acting [8,3,1], 1 unfound
pg 1.e9 is active+recovery_wait+degraded, acting [14,7,9], 1 unfound
pg 1.f8 is active+recovery_wait+degraded, acting [9,3,0], 1 unfound
pg 1.100 is active+recovery_wait+degraded, acting [9,0,15], 1 unfound
pg 1.10c is active+recovery_wait+degraded+remapped, acting [1,4,9], 1 unfound
pg 1.116 is active+recovery_wait+degraded, acting [1,8,14], 1 unfound
pg 1.11f is active+recovery_wait+degraded, acting [5,9,1], 1 unfound
pg 1.131 is active+recovery_wait+degraded, acting [9,0,15], 1 unfound
pg 1.14a is active+recovery_wait+degraded, acting [1,7,8], 1 unfound
pg 1.151 is active+recovery_wait+degraded+remapped, acting [14,1,8], 1 unfound
pg 1.15b is active+recovery_wait+degraded, acting [14,1,4], 1 unfound
pg 1.170 is active+recovery_wait+degraded, acting [5,8,14], 1 unfound
pg 1.17a is active+recovery_wait+degraded+remapped, acting [14,4,1], 2 unfound
pg 1.17c is active+recovery_wait+degraded+remapped, acting [5,1,9], 1 unfound
pg 1.17e is active+recovery_wait+degraded+remapped, acting [7,0,9], 1 unfound
pg 1.181 is active+recovery_wait+degraded, acting [9,5,1], 1 unfound
pg 1.19a is active+recovery_wait+degraded+remapped, acting [4,14,8], 2 unfound
pg 1.1a7 is active+recovery_wait+degraded, acting [1,8,4], 2 unfound
pg 1.1b3 is active+recovery_wait+degraded, acting [9,7,14], 1 unfound
pg 1.1c0 is active+recovery_wait+degraded, acting [14,4,9], 1 unfound
pg 1.1cb is active+recovery_wait+degraded, acting [8,14,1], 2 unfound
pg 1.1d5 is active+recovery_wait+degraded, acting [9,0,15], 1 unfound
pg 1.1e0 is active+recovery_wait+degraded+remapped, acting [4,1,15], 1 unfound
pg 1.1ef is active+recovery_wait+degraded+remapped, acting [15,0,7], 1 unfound
pg 1.1f6 is active+recovery_wait+degraded, acting [8,4,14], 1 unfound
pg 1.1f7 is active+recovery_wait+degraded, acting [3,1,9], 1 unfound
pg 1.20b is active+recovery_wait+degraded+remapped, acting [3,14,1], 1 unfound
pg 1.20c is active+recovery_wait+degraded, acting [1,8,3], 1 unfound
pg 1.212 is active+recovery_wait+degraded, acting [1,9,3], 1 unfound
pg 1.223 is active+recovery_wait+degraded, acting [3,9,1], 1 unfound
pg 1.230 is active+recovery_wait+degraded, acting [4,9,1], 1 unfound
pg 1.232 is active+recovery_wait+degraded, acting [14,8,7], 1 unfound
pg 1.236 is active+recovery_wait+degraded, acting [1,3,14], 1 unfound
REQUEST_STUCK 3 stuck requests are blocked > 4096 sec. Implicated osds 1
3 ops are blocked > 33554.4 sec
osd.1 has stuck requests > 33554.4 sec


Br/Joel
 
pg 1.10c has 1 unfound objects
pg 1.10c is active+recovery_wait+degraded+remapped, acting [1,4,9], 1 unfound
The OSD needs to come back into the cluster, as there are objects on it that are not found anywhere else in the cluster. If that is not possible, then the unfound objects in those PGs would need to be marked as lost, and the data they contained would be gone too.
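
To see which objects a PG considers unfound and which OSDs it still hopes to find them on, something like the following may help. This is only a sketch: `inspect_pg` is a hypothetical helper name, and it assumes a release where `ceph pg <pgid> list_unfound` exists, falling back to the older `list_missing` spelling otherwise.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ceph): inspect one PG that reports
# unfound objects, e.g. pg 1.10c from the health output above.
inspect_pg() {
    pgid="$1"
    # Newer releases use list_unfound; older ones call it list_missing.
    ceph pg "$pgid" list_unfound 2>/dev/null || ceph pg "$pgid" list_missing
    # The query output includes the recovery state, including which OSDs
    # might still hold the unfound objects.
    ceph pg "$pgid" query
}

# Usage (on a node with an admin keyring):
#   inspect_pg 1.10c
```

If the query output points at the down OSD as the only place left to look, that matches the situation described here.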

What is in the logs for the down OSD? And can you restart it by hand?
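
A rough sketch of those two checks, assuming a package-based install where each OSD runs as a systemd unit named `ceph-osd@<id>.service`. The function names `osd_unit`, `osd_logs` and `osd_restart` are made up for illustration; the OSD id 11 comes from the health output above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Ceph or Proxmox).

# Build the systemd unit name for an OSD id (assumed naming scheme).
osd_unit() {
    printf 'ceph-osd@%s.service\n' "$1"
}

# Show the last N journal lines for that OSD (default: 200).
osd_logs() {
    journalctl -u "$(osd_unit "$1")" -n "${2:-200}" --no-pager
}

# Try to start the OSD again by hand, then report its unit status.
osd_restart() {
    systemctl restart "$(osd_unit "$1")"
    systemctl --no-pager status "$(osd_unit "$1")"
}

# Usage, run as root on the host that carries the OSD (proxmox4 here):
#   osd_logs 11
#   osd_restart 11
```

The same messages usually also end up in `/var/log/ceph/ceph-osd.11.log`, assuming the default log location has not been changed.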
 
Thanks for your post, Alwin. I've managed to fix the issues now. I needed to "destroy" the OSD in the GUI; then, since everything on it was lost anyway, I ran ceph pg 1.XY mark_unfound_lost delete on all PGs with unfound objects, and now the cluster reports OK again.

There were three VMs that didn't boot properly, but I had fresh backups of them, so I just did a restore to be up and running quickly again.
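
For anyone facing the same number of PGs, the per-PG command can be generated from the health output instead of typed by hand. The sketch below is a dry run only: it prints the commands for review and runs nothing. `unfound_pgs` and `print_mark_lost_cmds` are hypothetical names, and the parsing assumes the "pg <id> has N unfound objects" line format shown in the first post.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Ceph).

# Read `ceph health detail` text on stdin and print the id of every PG
# that has a "pg <id> has N unfound objects" line.
unfound_pgs() {
    awk '$1 == "pg" && $3 == "has" && $5 == "unfound" { print $2 }'
}

# Print (do not execute) one mark_unfound_lost command per such PG.
# WARNING: running the printed commands permanently discards the
# unfound objects, so only use them once the data is known to be gone.
print_mark_lost_cmds() {
    unfound_pgs | while read -r pgid; do
        printf 'ceph pg %s mark_unfound_lost delete\n' "$pgid"
    done
}

# Usage:
#   ceph health detail | print_mark_lost_cmds
```

Since the health output in the first post notes that additional PGs were left out, it may be necessary to repeat this until no unfound objects are reported.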

Just got to love the ease and redundancy of Proxmox!

Br/Joel
 