Deleted ceph on node, stupid!

Kaboom

Active Member
Mar 5, 2019
Dear Proxmox forum readers,

I did something stupid. I have a network running 4 nodes with Ceph. On node4 I wanted to remove Ceph with 'pveceph purge', but I did not know it would remove ceph.conf on all nodes. Yes, I know this is very STUPID, but maybe someone can help this fool.

If I go to node1 in Proxmox and click on Ceph I get 'rados_connect failed'. And if I click on ceph_ssd I get rbd errors.

Now everything is still up (Ceph is die hard), but how can I recover from this so everything will be working again?

Thanks a lot!
 
Is that the only file that is deleted on all other 3 nodes (node1, node2 and node3)?

I dunno. I've never run that command. Some quick research says it's a bad command to run.

If your Ceph cluster is still operational, I would back up all data and reinstall. If it's not operational, you might want to unplug all of the drives so they don't get overwritten, and send them to a data recovery center if you have no backups.
 
I don't have a ceph.conf backup, but have vzdumps locally (not on Ceph). But before I go the reinstall way I hope there is another faster way to get everything running. It looks like all other files are still in place.
 
You don't actually need a backup. You just need the FSID. The rest can be recreated by hand. I would scour your logs for the FSID. If you can't find it you're SOL.

If these are encrypted OSDs and you lost the keys on the monitor, you're SOL.

Edit: Actually you're probably also going to have to recreate the monmap/rbdmap/crushmap by hand. If the /var/lib/ceph directory is deleted you're in for a world of hurt.
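
As a rough sketch of where to look for the FSID: each OSD data directory normally holds a ceph_fsid file with the cluster fsid, and the logs may mention it too. The helper name find_fsid_candidates is made up for this example. It just prints every UUID-shaped string it finds in the files you pass in, most frequent first, so treat the result as candidates to check, not as proof.

```shell
# Hypothetical helper (name is an assumption, not a Ceph tool):
# print every UUID-shaped string found in the given files, most frequent first.
# The cluster fsid is a UUID, so it should be among the candidates; other
# UUIDs (for example per-OSD ids) can show up as well.
find_fsid_candidates() {
    grep -ohE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' "$@" 2>/dev/null \
        | sort | uniq -c | sort -rn | awk '{print $2}'
}

# Example calls (default locations, adjust to your setup):
#   find_fsid_candidates /var/lib/ceph/osd/ceph-*/ceph_fsid
#   find_fsid_candidates /var/log/ceph/*.log
```

If the same UUID comes back from the ceph_fsid file of every OSD, that is very likely the value to put in ceph.conf.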
 
Yes everything is still there on node1, node2 and node3.

It looks like 'only' ceph.conf was deleted when I ran 'pveceph purge' on node4 (there are no containers or VMs running on node4).
 
Then recreate the ceph.conf and restart the monitor and cross your fingers. :)
 
I found the fsid, recreated the ceph.conf file and restarted, but I think I am missing some configs in this file (everything is still up).

The default config doesn't have many settings whose absence would cause a huge issue; most of them are tweaks and performance options. The main ones are normally the list of mons, which commands use to find your cluster.

Does ceph -s run fine on each node and report healthy?

I don't use Proxmox Ceph, but someone else may be able to provide you with a sanitised default file.
 
ceph -s reports:

unable to get monitor info from DNS SRV with service name: ceph-mon
no monitors specified to connect to.
2019-09-23 19:13:05.896532 7f91e7c95500 -1 failed for service _ceph-mon._tcp
[errno 2] error connecting to the cluster
 
/etc/init.d/ceph status

● ceph.service - PVE activate Ceph OSD disks
Loaded: loaded (/etc/systemd/system/ceph.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Mon 2019-09-23 18:05:11 CEST; 1h 13min ago
Main PID: 2266631 (code=exited, status=0/SUCCESS)

Sep 23 18:05:08 node002 ceph-disk[2266631]: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@4.service → /lib/systemd/system/ceph-osd@.service.
Sep 23 18:05:08 node002 ceph-disk[2266631]: Removed /run/systemd/system/ceph-osd.target.wants/ceph-osd@5.service.
Sep 23 18:05:09 node002 ceph-disk[2266631]: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@5.service → /lib/systemd/system/ceph-osd@.service.
Sep 23 18:05:09 node002 ceph-disk[2266631]: Removed /run/systemd/system/ceph-osd.target.wants/ceph-osd@1.service.
Sep 23 18:05:09 node002 ceph-disk[2266631]: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@1.service → /lib/systemd/system/ceph-osd@.service.
Sep 23 18:05:10 node002 ceph-disk[2266631]: Removed /run/systemd/system/ceph-osd.target.wants/ceph-osd@2.service.
Sep 23 18:05:10 node002 ceph-disk[2266631]: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@2.service → /lib/systemd/system/ceph-osd@.service.
Sep 23 18:05:10 node002 ceph-disk[2266631]: Removed /run/systemd/system/ceph-osd.target.wants/ceph-osd@0.service.
Sep 23 18:05:11 node002 ceph-disk[2266631]: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@0.service → /lib/systemd/system/ceph-osd@.service.
Sep 23 18:05:11 node002 systemd[1]: Started PVE activate Ceph OSD disks.
 
So, as thought, you need to at least add your mon IPs to ceph.conf.

Any running VM will be fine, as it picked these up at boot / when the RBD was mapped.
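
For anyone landing here later, a bare-bones ceph.conf along these lines might be enough for the CLI to find the monitors again. This is only a sketch under assumptions: the helper name write_minimal_ceph_conf is made up, the 192.0.2.x addresses are placeholders, and the keyring line assumes the admin keyring lives in /etc/pve/priv as mentioned in this thread. It writes to a scratch file on purpose, so nothing under /etc/pve is touched until you have reviewed the result.

```shell
# Hypothetical helper: write a minimal ceph.conf to a scratch file.
# Usage: write_minimal_ceph_conf OUTFILE FSID MON_IPS
#   MON_IPS is a comma-separated list of monitor IP addresses.
write_minimal_ceph_conf() {
    out=$1
    fsid=$2
    mons=$3
    {
        printf '[global]\n'
        printf '\tfsid = %s\n' "$fsid"
        # Monitor addresses: this is what "no monitors specified" complains about.
        printf '\tmon host = %s\n' "$mons"
        printf '\tauth cluster required = cephx\n'
        printf '\tauth service required = cephx\n'
        printf '\tauth client required = cephx\n'
        # Assumption: keyrings are kept in /etc/pve/priv; $cluster and $name are
        # Ceph metavariables and must stay literal (hence the single quotes).
        printf '\tkeyring = /etc/pve/priv/$cluster.$name.keyring\n'
    } > "$out"
}

# Example (placeholder values):
#   write_minimal_ceph_conf /root/ceph.conf.new <your-fsid> 192.0.2.11,192.0.2.12,192.0.2.13
```

Settings such as the public/cluster network or pool defaults are left out here because their original values are not known; add them back if you have them.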
 
One big step further now, THANKS sg90 and paradox55! I had added the wrong name for the mon in ceph.conf. I have now added the IP address, and ceph_ssd is working again.

Ceph under the node still gives rados_connect failed.
 
ceph -s
2019-09-23 20:35:16.098552 7f5101076700 0 librados: client.admin authentication error (1) Operation not permitted
[errno 1] error connecting to the cluster
 
Some extra info:
root@node002:~# systemctl status ceph-mon.target
● ceph-mon.target - ceph target allowing to start/stop all ceph-mon@.service instances at once
Loaded: loaded (/lib/systemd/system/ceph-mon.target; enabled; vendor preset: enabled)
Active: active since Thu 2019-09-05 14:49:36 CEST; 2 weeks 4 days ago

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
root@node002:~# systemctl status ceph.target
● ceph.target - ceph target allowing to start/stop all ceph*@.service instances at once
Loaded: loaded (/lib/systemd/system/ceph.target; enabled; vendor preset: enabled)
Active: active since Thu 2019-09-05 14:49:36 CEST; 2 weeks 4 days ago

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
root@node002:~# systemctl status ceph-osd.target
● ceph-osd.target - ceph target allowing to start/stop all ceph-osd@.service instances at once
Loaded: loaded (/lib/systemd/system/ceph-osd.target; enabled; vendor preset: enabled)
Active: active since Thu 2019-09-05 14:49:20 CEST; 2 weeks 4 days ago
 
In /etc/pve/priv there is a file ceph.client.admin.keyring. I replaced the key in that file with the key from /etc/pve/priv/ceph/ceph_ssd.keyring, and Ceph under the node is working again.

ceph -s
cluster:
health: HEALTH_OK
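
A small sketch for checking whether two keyring files carry the same key before editing anything by hand. The helper names keyring_key and same_key are made up for this example, and it assumes the usual keyring layout with a "key = ..." line under the section header. Keep in mind that keyring_key prints a secret to the terminal.

```shell
# Hypothetical helper: print the value of the first "key = ..." line
# in a keyring file.
keyring_key() {
    sed -n 's/^[[:space:]]*key[[:space:]]*=[[:space:]]*//p' "$1" | head -n 1
}

# Hypothetical helper: succeed only if both files contain a key
# and the two keys are identical.
same_key() {
    a=$(keyring_key "$1")
    b=$(keyring_key "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example with the two files from this thread:
#   same_key /etc/pve/priv/ceph.client.admin.keyring \
#            /etc/pve/priv/ceph/ceph_ssd.keyring && echo "keys match"
```

If the keys differ, that would fit the "client.admin authentication error" shown earlier, though other causes are possible.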
 