Ceph cluster - could it work with only one node alive?

fxandrei

So I have followed the Ceph config tutorial.
I have made a cluster of 3 servers (identical), with a full mesh network.
Everything is OK.
If one node/server goes down everything is still OK. I can still use the cluster, in a degraded state.

But if only one node is up I cannot access Ceph anymore.
For example, ceph status just hangs and does nothing.

Is there anything I can do so that Ceph would still work with only one node alive?
 
You should be able to get your data from Ceph, as there should be one copy left. But you need to specify the MON manually on the CLI, as the ceph tool tries to reach any MON in the config and could hit the dead ones first.
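For example, pointing the client directly at the surviving MON (the IP is just a placeholder for your remaining node) and giving it a timeout so it fails instead of hanging:
Code:
ceph -m 192.168.1.10:6789 --connect-timeout 10 status

Note that with 2 of 3 MONs dead the surviving MON has no quorum, so even this will not answer until the monmap has been shrunk down to the surviving MON (see further down in the thread).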
 
So how would I do that?
I modified /etc/ceph/ceph.conf so that only the one MON and host is defined.
I then tried "ceph status", but it does not do anything... it just hangs.

Could I maybe use rbd or rados directly?
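Both rados and rbd accept the same -m flag to target one MON explicitly, for example (the IP and pool name are placeholders):
Code:
rados -m 192.168.1.10:6789 lspools
rbd -m 192.168.1.10:6789 ls rbd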
 
So how would I do that?
...

If you have to ask, please do not consider going this way (experts only). Instead, make sure that you always have enough nodes/hosts/OSDs online.
 
I thought I would get this answer :).
So I guess I will go the way of making sure the nodes/hosts/OSDs are alive and well :).
 
So I have managed to recover a cluster that had one node dead. I mean, I reinstalled the dead node and managed to re-add it to the Proxmox cluster and the Ceph cluster, using that link.
But I cannot seem to manage it with only one node alive.

So I'm trying to export the mon map, and I get this:
Code:
ceph-mon -i hp1-s1 --extract-monmap /tmp/work/monmap
2019-04-02 13:17:48.749606 7fa19c0b9100 -1 rocksdb: IO error: lock /var/lib/ceph/mon/ceph-hp1-s1/store.db/LOCK: Resource temporarily unavailable
2019-04-02 13:17:48.749611 7fa19c0b9100 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-hp1-s1': (22) Invalid argument

So the directory exists and has files:
Code:
 ls /var/lib/ceph/mon/ceph-hp1-s1/
keyring  kv_backend  store.db

Anyway, I don't actually know what to do now.
It would be nice to be able to recover this cluster (for now I'm just testing).
 
So it seems that the mon was still running (which explains the lock error).
So I ran:
Code:
systemctl status ceph-mon@hp1-s1.service

Now I made the export of the monmap.
I'll continue and get back here :)
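For reference, shrinking the monmap down to the surviving MON follows the standard Ceph "removing monitors from an unhealthy cluster" procedure and looks roughly like this (hp2-s1 and hp3-s1 are placeholders for the two dead MONs):
Code:
systemctl stop ceph-mon@hp1-s1.service
ceph-mon -i hp1-s1 --extract-monmap /tmp/work/monmap
monmaptool --print /tmp/work/monmap
monmaptool /tmp/work/monmap --rm hp2-s1 --rm hp3-s1
ceph-mon -i hp1-s1 --inject-monmap /tmp/work/monmap
systemctl start ceph-mon@hp1-s1.service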
 
So I managed to get everything up from the only remaining node by using:
http://docs.ceph.com/docs/luminous/rados/operations/add-or-rm-mons/#removing-monitors
and this for the OSDs:
https://ceph.com/planet/recovering-from-a-complete-node-failure/
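The OSD side of that boils down to removing the dead node's OSDs from the cluster maps and recreating them on the reinstalled node; roughly (osd.3 and /dev/sdb are just examples, and the pveceph command name is from the 5.x tooling):
Code:
ceph osd out osd.3
ceph osd crush remove osd.3
ceph auth del osd.3
ceph osd rm osd.3
pveceph createosd /dev/sdb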

So everything was running: the Proxmox cluster and the Ceph cluster.
But after I rebooted all the nodes there seems to be a problem.

All the services seem to be up. I ran:
Code:
systemctl status ceph\*.service ceph\*.target

But if I run "ceph status" it just hangs.
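While ceph status hangs, the local MON can still be queried over its admin socket, which does not need quorum and at least shows which monmap it has (mon name as above):
Code:
ceph daemon mon.hp1-s1 mon_status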
 
So I have managed to get everything working again, in a sense (more about that in a bit).
My problem was that after the restart, the MONs on the 2 failed nodes (which I had reinstalled) had issues with their service.
So the mon service on node 2 and node 3 was not starting.
But if I ran "ceph-mon -i <mon_service_name> --public-addr <node_ip>:6789" the mon was seen in the cluster and was working.
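When a mon runs fine started by hand but not via systemd, the unit's journal usually shows why; after a manual rebuild the common causes are the unit not being enabled, or the mon data directory not being owned by ceph:ceph. Something along these lines (hp2-s1 being one of the reinstalled nodes):
Code:
journalctl -b -u ceph-mon@hp2-s1.service
chown -R ceph:ceph /var/lib/ceph/mon/ceph-hp2-s1
systemctl enable --now ceph-mon@hp2-s1.service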

So the last thing I tried was to recreate the MONs on the failed nodes.
What I did was remove anything related to the mon (service, files, etc.).
The problem is that I accidentally deleted the mon on the healthy node (node 1) as well.
So I ended up recreating the MONs on all the nodes.
And it seemed that everything was OK.
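On the Proxmox side, destroying and recreating a MON is done with pveceph; with the 5.x tooling current at the time the commands were roughly as below (newer releases spell them pveceph mon destroy / pveceph mon create):
Code:
pveceph destroymon hp2-s1
pveceph createmon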

But after I restarted all the nodes, I saw that the OSDs and the pool are gone.

Their definitions were on the monitors, right!?

Is there a way to recover this!?

Anyway, I'm going to try to redo the steps with the OSDs, and then see what I can do with the pool.
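If the OSD and pool definitions only lived in the (now recreated) MON store, the Ceph troubleshooting docs describe an experts-only way to rebuild the MON DB from the data still on the OSDs using ceph-objectstore-tool and ceph-monstore-tool. Very roughly, as a sketch only (the OSD id, paths and keyring location are examples, and the OSDs must be stopped while doing this):
Code:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op update-mon-db --mon-store-path /tmp/mon-store
ceph-monstore-tool /tmp/mon-store rebuild -- --keyring /etc/pve/priv/ceph.client.admin.keyring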
 
Is there anything I can do so that Ceph would still work with only one node alive?
...

I have the same error; a single node alone does not work. I set allow_one for the OSDs to true and min size to 1, but it still does not work.
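For reference, min_size is set per pool like this ('rbd' is just an example pool name), but it only controls OSD-side replication; ceph status hanging with a single node up is a MON quorum problem (2 of the 3 MONs must be alive), and no pool setting changes that:
Code:
ceph osd pool set rbd min_size 1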
 
