Proxmox 4 CEPH won't start after cold reboot

dgeist

Member
Feb 26, 2015
I have a 3 node PVE4.0 cluster running ceph with 3 mons and 18 OSDs (6 per host). I had a situation where all three hosts power-cycled at the same time. When they came back, none of the ceph processes would start correctly. Can someone point me to a procedure for getting a cluster started again in a case like this, where I don't think anything was actually corrupted?

Thanks
Dan
 
Hi Dan,
what is the status of the mons?
First, it's important to get at least 2 mons running so that you have quorum.
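For reference, the "at least 2 of 3" figure is just the majority rule for monitors. A rough sketch of the arithmetic (the function name `quorum_size` is made up for illustration, it is not a Ceph command):

```shell
#!/bin/sh
# Smallest number of monitors that forms a strict majority of the monmap.
# quorum_size is a hypothetical helper, not part of Ceph.
quorum_size() {
    # $1 = total number of monitors defined in the monmap
    echo $(( $1 / 2 + 1 ))
}

quorum_size 3   # prints 2
quorum_size 5   # prints 3

# To see what a single monitor thinks of itself even without quorum,
# its admin socket can be queried on the node where it runs
# (replace <id> with your monitor's id):
#   ceph daemon mon.<id> mon_status
```

So with 3 mons the cluster can keep quorum with one mon down, but not with two.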

Be sure the time is right on all nodes (very, very important). Then restart the mon process.
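As a rough sketch of the time check: the monitors complain about clock skew above `mon_clock_drift_allowed`, which defaults to 0.05 s. The helper `skew_ok` and the 50 ms constant below are only for illustration, and the restart command in the comments assumes the sysvinit-style `ceph` service script, so adjust it to your init system and monitor id.

```shell
#!/bin/sh
# Default mon_clock_drift_allowed is 0.05 s, i.e. 50 ms.
MAX_SKEW_MS=50

# skew_ok is a hypothetical helper: it succeeds when the absolute
# offset (given in whole milliseconds) is within the limit.
skew_ok() {
    offset=${1#-}                      # drop a leading minus sign
    [ "$offset" -le "$MAX_SKEW_MS" ]
}

skew_ok 12  && echo "12 ms: ok"
skew_ok 120 || echo "120 ms: too much skew"

# On each node, the real offsets can be read from the "offset" column
# (in milliseconds) of:
#   ntpq -pn
# Once the clocks agree, restart the monitor on that node, e.g.
# (assumption: sysvinit-style service script, <id> is your mon id):
#   service ceph restart mon.<id>
# On systemd-based Ceph releases the unit is typically ceph-mon@<id>.
```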

What is the output of
Code:
ceph health detail
Are your mons defined right in /etc/ceph/ceph.conf on all nodes?
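One quick way to compare the monitor definitions between nodes is to print just the `[mon.*]` sections of the config and diff the result by eye. A minimal sketch, assuming the usual layout with one `[mon.<id>]` section per monitor containing `host` and `mon addr` lines (`list_mons` is a made-up helper name):

```shell
#!/bin/sh
# list_mons is a hypothetical helper: it prints the host and mon addr
# lines of every [mon.<id>] section in the given ceph.conf.
list_mons() {
    awk '
        /^\[mon\./ {                    # entering a [mon.<id>] section
            sec = $0
            sub(/^\[/, "", sec)
            sub(/\].*$/, "", sec)
            next
        }
        /^\[/ { sec = ""; next }        # any other section: stop printing
        sec != "" && /^[ \t]*(host|mon[ _]addr)[ \t]*=/ {
            line = $0
            sub(/^[ \t]*/, "", line)
            print sec ": " line
        }
    ' "$1"
}

# Run it against the local config if there is one.
if [ -r /etc/ceph/ceph.conf ]; then
    list_mons /etc/ceph/ceph.conf
fi
```

Run it on every node; the output should be identical everywhere and the addresses should match what each host actually has configured.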

Udo
 
The monitors were badly out of sync (not time-wise, but otherwise), and I'd tried various things to fix them (exporting maps, editing them, importing them, etc.). I finally had to call the cluster a loss. I had very little there that wasn't recoverable. I'll need to set up some real-world simulations once things are up and running. A cluster that can't take a cold power-off is not something I'm comfortable with in a production environment...

Dan
 
The only time I have seen this on my Proxmox nodes was during evaluation of Proxmox 4 with Ceph Hammer.
We used a 3-node test cluster (you need a minimum of 3 mons for a quorum that can tolerate one failure).

During one of our tests a monitor node became "stuck", causing a loss of quorum. A simple restart of all mons via CLI fixed that. I have not been able to reproduce it since, because we now run 10+ Ceph nodes/mons per cluster (split over multiple data rooms) and our data rooms are backed by diesel generators. I also was unable to get a monitor "stuck" again afterwards, though not for lack of trying.

We did extensive cold power-cycling tests. The issue you described has not popped up.

Edit: did you try a restart of the Ceph mons?
Did you take the steps Udo suggested?
 
