Proxmox 4 CEPH won't start after cold reboot

dgeist

I have a 3-node PVE 4.0 cluster running Ceph with 3 mons and 18 OSDs (6 per host). All three hosts power-cycled at the same time. When they came back, none of the Ceph processes would start correctly. Can someone point me to a procedure for getting the cluster started again? I don't think anything was actually corrupted.

Thanks
Dan
 
Hi Dan,
what is the status of the mons?
First, it's important to get at least 2 mons running so that they can form a quorum.
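
The "at least 2" figure comes from the monitors needing a strict majority to form a quorum. A minimal sketch of that arithmetic in Python (the function name is only for illustration, it is not part of any Ceph tooling):

```python
def mons_needed_for_quorum(total_mons: int) -> int:
    """Smallest strict majority of the monitors listed in the monmap."""
    if total_mons < 1:
        raise ValueError("a cluster needs at least one monitor")
    return total_mons // 2 + 1


# With the 3 mons in this cluster, 2 must be up and able to reach each other.
for n in (1, 2, 3, 5):
    print(n, "mons ->", mons_needed_for_quorum(n), "needed for quorum")
```

Note that 2 mons are no better than 1 in this respect: both must be up, so the loss of either one stops the quorum.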

Make sure the time is correct on all nodes (very, very important). Then restart the mon process.
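
As a rough illustration of why the clocks matter: the monitors complain about clock skew once they disagree by more than `mon clock drift allowed`, which defaults to 0.05 seconds unless it was changed in ceph.conf. The sketch below applies that limit to a set of offsets. The host names and offset values are made up for the example; on a real node the offsets would have to come from your NTP client.

```python
# Default of "mon clock drift allowed" in seconds (assumed not overridden in ceph.conf).
MON_CLOCK_DRIFT_ALLOWED = 0.05


def skewed_nodes(offsets, limit=MON_CLOCK_DRIFT_ALLOWED):
    """Return the nodes whose clock offset (in seconds) exceeds the limit, sorted by name."""
    return sorted(node for node, offset in offsets.items() if abs(offset) > limit)


# Hypothetical offsets relative to a common time source, in seconds.
example_offsets = {"pve1": 0.002, "pve2": -0.310, "pve3": 0.041}
print(skewed_nodes(example_offsets))
```

Any node this reports should have its time fixed before its mon is restarted.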

What is the output of
Code:
ceph health detail
Are your mons defined right in /etc/ceph/ceph.conf on all nodes?
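
One way to compare the mon definitions across nodes is to pull just the `[mon.X]` sections out of each copy of the file. The sketch below does this with Python's `configparser` on an inline sample. The sample contents (fsid, host names, addresses) are placeholders, and the assumption is that the file uses the plain `[mon.X]` / `host` / `mon addr` layout; a real file with unusual formatting may need more care.

```python
import configparser

# Hypothetical ceph.conf excerpt; host names and addresses are placeholders.
SAMPLE_CONF = """\
[global]
fsid = 00000000-0000-0000-0000-000000000000
auth client required = cephx

[mon.0]
host = pve1
mon addr = 10.0.0.1:6789

[mon.1]
host = pve2
mon addr = 10.0.0.2:6789

[mon.2]
host = pve3
"""


def mon_entries(conf_text):
    """Map each [mon.X] section to its (host, mon addr) pair; missing keys become None."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return {
        section: (parser[section].get("host"), parser[section].get("mon addr"))
        for section in parser.sections()
        if section.startswith("mon.")
    }


# Flag any monitor section that lacks a host or an address.
for name, (host, addr) in mon_entries(SAMPLE_CONF).items():
    flag = "" if host and addr else "  <-- incomplete"
    print(f"{name}: host={host} addr={addr}{flag}")
```

To use it on a node, pass in the text of /etc/ceph/ceph.conf instead of `SAMPLE_CONF` and check that every node yields the same set of entries.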

Udo
 
The monitors were so badly out of sync (not time-wise, but otherwise), and I had tried various things to fix them (exporting the maps, editing them, importing them, etc.). I finally had to call the cluster a loss. I had very little there that wasn't recoverable. I'll need to set up some real-world simulations once things are up and running again. A cluster that can't take a cold power-off is not something I'm comfortable with in a production environment...

Dan
 
The only time I have seen this on my Proxmox nodes was during our evaluation of Proxmox 4 with Ceph Hammer.
We used a 3-node test cluster (you need a minimum of 3 mons for the quorum to survive the loss of one).

During one of our tests a monitor node became "stuck", causing a loss of quorum. A simple restart of all mons via the CLI fixed that. I have not been able to reproduce it since; we now run 10+ Ceph nodes/mons per cluster (split over multiple datarooms), and our datarooms are backed by diesel generators. I was also unable to get a monitor "stuck" again, though not for lack of trying.

We did extensive cold power-cycling tests. The issue you described has not popped up.

Edit: did you try restarting the Ceph mons?
Did you take the steps Udo suggested?