Proxmox 4 CEPH won't start after cold reboot

dgeist · Nov 20, 2015

I have a 3-node PVE 4.0 cluster running ceph with 3 mons and 18 OSDs (6 per host). I had a situation where all three hosts power-cycled at the same time. When they came back, none of the ceph processes would start correctly. Can someone point me to a procedure for getting a cluster started correctly when I don't think anything was actually corrupted?

Thanks
Dan

udo · Nov 20, 2015

Hi Dan,
What is the status of the mons?
First, it's important to get at least 2 mons running so they can form quorum.
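
As a rough sketch of the arithmetic behind that: the monitors need a strict majority to form quorum, so with 3 mons at least 2 must be up. The function names below are made up for illustration and are not part of Ceph.

```shell
#!/bin/sh
# Hypothetical helpers (not Ceph commands): plain majority arithmetic.

# Smallest number of monitors that forms a strict majority of N monitors.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# Number of monitors that can be down while a majority is still up.
mons_can_fail() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

quorum_needed 3   # prints 2
mons_can_fail 3   # prints 1
```

So a 3-mon cluster tolerates one mon being down, but not two.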

Be sure that the time is correct on all nodes (very, very important). Then restart the mon process.

What is the output of

Code:

ceph health detail

Are your mons defined correctly in /etc/ceph/ceph.conf on all nodes?

Udo

udo · Nov 20, 2015

Any hint in the mon-log?

Code:

tail /var/log/ceph/ceph-mon.*.log

Udo

dgeist · Nov 23, 2015

The monitors were badly out of sync (not time-wise, but otherwise), and I'd tried various things to fix them (exporting maps, editing them, importing them, etc.). I finally had to call the cluster a loss. I had very little there that wasn't recoverable. I'll need to set up some real-world simulations once things are up and running. A cluster that can't survive a cold power-off is not something I'm comfortable with in a production environment...

Dan

Q-wulf · Nov 23, 2015

The only time I have seen this on my Proxmox nodes was during our evaluation of Proxmox 4 with Ceph Hammer.
We used a 3-node test cluster (you need a minimum of 3 mons for cluster quorum).

During one of our tests, a monitor node became "stuck", causing a loss of quorum. A simple restart of all mons via the CLI fixed that. I have not been able to reproduce it since: we now use 10+ Ceph nodes/mons per cluster (split over multiple data rooms), and our data rooms are backed by diesel generators. I was also unable to get a monitor "stuck" again, though not for lack of trying.

We did extensive cold power-cycling tests. The issue you described has not come up.

Edit: did you try restarting the Ceph mons?
Did you follow the steps udo suggested?
