Ceph cluster - could it work with only one node alive?

fxandrei

So I have followed the Ceph config tutorial.
I have made a cluster of 3 servers (identical), with a full mesh network.
Everything is OK.
If one node/server goes down everything is still OK. I can still use the cluster, in a degraded state.

But if only one node is up I cannot access Ceph anymore.
For example, ceph status just hangs and does nothing.

Is there anything I can do so that Ceph would still work with only one node alive?
 
You should be able to get your data from Ceph, as there should be one copy left. But you need to specify the MON manually on the CLI, as the ceph tool tries to reach any MON in the config and could hit the dead ones first.
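For example, pointing the client directly at the surviving MON (the IP is just a placeholder for your remaining node) and giving it a timeout so it fails instead of hanging:
Code:
ceph -m 192.168.1.10:6789 --connect-timeout 10 status

Note that with 2 of 3 MONs dead the surviving MON has no quorum, so even this will not answer until the monmap has been shrunk down to the surviving MON (see further down in the thread).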
 
So how would I do that?
I modified /etc/ceph/ceph.conf so that only the one MON and host is defined.
I then tried "ceph status", but it does not do anything... it just hangs.

Could I maybe use rbd or rados directly?
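Both rados and rbd accept the same -m flag to target one MON explicitly, for example (the IP and pool name are placeholders):
Code:
rados -m 192.168.1.10:6789 lspools
rbd -m 192.168.1.10:6789 ls rbd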
 
So how would I do that?
...

If you have to ask, please do not consider going this way (experts only). Instead, make sure that you always have enough nodes/hosts/OSDs online.
 
I thought I would get this answer :).
So I guess I will go the way of making sure the nodes/hosts/OSDs are alive and well :).
 
So I have managed to recover a cluster that had one node dead. I mean, I reinstalled the dead node and managed to re-add it to the Proxmox cluster and the Ceph cluster, using that link.
But I cannot seem to manage it with only one node alive.

So I'm trying to export the mon map, and I get this:
Code:
ceph-mon -i hp1-s1 --extract-monmap /tmp/work/monmap
2019-04-02 13:17:48.749606 7fa19c0b9100 -1 rocksdb: IO error: lock /var/lib/ceph/mon/ceph-hp1-s1/store.db/LOCK: Resource temporarily unavailable
2019-04-02 13:17:48.749611 7fa19c0b9100 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-hp1-s1': (22) Invalid argument

So the directory exists and has files:
Code:
 ls /var/lib/ceph/mon/ceph-hp1-s1/
keyring  kv_backend  store.db

Anyway, I don't actually know what to do now.
It would be nice to be able to recover this cluster (for now I'm just testing).
 
So it seems that the mon was still running (which explains the lock error).
So I ran:
Code:
systemctl status ceph-mon@hp1-s1.service

Now I made the export of the monmap.
I'll continue and get back here :)
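For reference, shrinking the monmap down to the surviving MON follows the standard Ceph "removing monitors from an unhealthy cluster" procedure and looks roughly like this (hp2-s1 and hp3-s1 are placeholders for the two dead MONs):
Code:
systemctl stop ceph-mon@hp1-s1.service
ceph-mon -i hp1-s1 --extract-monmap /tmp/work/monmap
monmaptool --print /tmp/work/monmap
monmaptool /tmp/work/monmap --rm hp2-s1 --rm hp3-s1
ceph-mon -i hp1-s1 --inject-monmap /tmp/work/monmap
systemctl start ceph-mon@hp1-s1.service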
 
So I managed to get everything up from the only remaining node by using:
http://docs.ceph.com/docs/luminous/rados/operations/add-or-rm-mons/#removing-monitors
and this for the OSDs:
https://ceph.com/planet/recovering-from-a-complete-node-failure/
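The OSD side of that boils down to removing the dead node's OSDs from the cluster maps and recreating them on the reinstalled node; roughly (osd.3 and /dev/sdb are just examples, and the pveceph command name is from the 5.x tooling):
Code:
ceph osd out osd.3
ceph osd crush remove osd.3
ceph auth del osd.3
ceph osd rm osd.3
pveceph createosd /dev/sdb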

So everything was running: the Proxmox cluster and the Ceph cluster.
But after I rebooted all the nodes there seems to be a problem.

All the services seem to be up. I ran:
Code:
systemctl status ceph\*.service ceph\*.target

But if I run "ceph status" it just hangs.
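While ceph status hangs, the local MON can still be queried over its admin socket, which does not need quorum and at least shows which monmap it has (mon name as above):
Code:
ceph daemon mon.hp1-s1 mon_status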
 
So I have managed to get everything working again, in a sense (more about that in a bit).
My problem was that after the restart, the MONs on the 2 failed nodes (which I had reinstalled) had issues with their service.
So the mon service on node 2 and node 3 was not starting.
But if I ran "ceph-mon -i <mon_service_name> --public-addr <node_ip>:6789" the mon was seen in the cluster and was working.
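When a mon runs fine started by hand but not via systemd, the unit's journal usually shows why; after a manual rebuild the common causes are the unit not being enabled, or the mon data directory not being owned by ceph:ceph. Something along these lines (hp2-s1 being one of the reinstalled nodes):
Code:
journalctl -b -u ceph-mon@hp2-s1.service
chown -R ceph:ceph /var/lib/ceph/mon/ceph-hp2-s1
systemctl enable --now ceph-mon@hp2-s1.service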

So the last thing I tried was to recreate the MONs on the failed nodes.
What I did was remove anything related to the mon (service, files, etc.).
The problem is that I accidentally deleted the mon on the healthy node (node 1) as well.
So I ended up recreating the MONs on all the nodes.
And it seemed that everything was OK.
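On the Proxmox side, destroying and recreating a MON is done with pveceph; with the 5.x tooling current at the time the commands were roughly as below (newer releases spell them pveceph mon destroy / pveceph mon create):
Code:
pveceph destroymon hp2-s1
pveceph createmon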

But after I restarted all the nodes, I saw that the OSDs and the pool are gone.

Their definitions were on the monitors, right!?

Is there a way to recover this!?

Anyway, I'm going to try to redo the steps with the OSDs, and then see what I can do with the pool.
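If the OSD and pool definitions only lived in the (now recreated) MON store, the Ceph troubleshooting docs describe an experts-only way to rebuild the MON DB from the data still on the OSDs using ceph-objectstore-tool and ceph-monstore-tool. Very roughly, as a sketch only (the OSD id, paths and keyring location are examples, and the OSDs must be stopped while doing this):
Code:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op update-mon-db --mon-store-path /tmp/mon-store
ceph-monstore-tool /tmp/mon-store rebuild -- --keyring /etc/pve/priv/ceph.client.admin.keyring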
 
Is there anything I can do so that Ceph would still work with only one node alive?
...

I have the same error; a single node alone does not work. I set allow_one for the OSDs to true and min size to 1, but it still does not work.
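For reference, min_size is set per pool like this ('rbd' is just an example pool name), but it only controls OSD-side replication; ceph status hanging with a single node up is a MON quorum problem (2 of the 3 MONs must be alive), and no pool setting changes that:
Code:
ceph osd pool set rbd min_size 1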
 
