Deleted ceph on node, stupid!

Kaboom · Sep 23, 2019

Dear Proxmox forum readers,

I did something stupid. I have a network of 4 nodes running Ceph. On node4 I wanted to remove Ceph with 'pveceph purge', but I did not know it would remove ceph.conf on all nodes. Yes, I know this is very STUPID, but maybe someone can help this fool.

In the Proxmox GUI, if I go to node1 and click on Ceph, I get 'rados_connect failed'. If I click on ceph_ssd, I get rbd errors.

Everything is still up for now (Ceph is die-hard), but how can I recover from this so everything works again?

Thanks a lot!

paradox55 · Sep 23, 2019

Restore the ceph.conf from a backup or recreate it.

Kaboom · Sep 23, 2019

Is that the only file that was deleted on the other 3 nodes (node1, node2 and node3)?

paradox55 · Sep 23, 2019

Kaboom said:
Is that the only file that was deleted on the other 3 nodes (node1, node2 and node3)?

I dunno, I've never run that command. Some quick research says it's a bad command to run.

If your Ceph cluster is still operational, I would back up all data and reinstall. If it's not operational and you have no backups, you might want to unplug all of the drives so nothing gets overwritten and send them to a data recovery center.

Kaboom · Sep 23, 2019

I don't have a ceph.conf backup, but I do have vzdumps stored locally (not on Ceph). Before I go the reinstall route, though, I hope there is a faster way to get everything running again. It looks like all the other files are still in place.

paradox55 · Sep 23, 2019

Kaboom said:
I don't have a ceph.conf backup, but I do have vzdumps stored locally (not on Ceph). Before I go the reinstall route, though, I hope there is a faster way to get everything running again. It looks like all the other files are still in place.

You don't actually need a backup; you just need the FSID. The rest can be recreated by hand. I would scour your logs for the FSID. If you can't find it, you're SOL.

If these are encrypted OSDs and you lost the keys on the monitor, you're SOL.

Edit: Actually, you're probably also going to have to recreate the monmap/rbdmap/crushmap by hand. If the /var/lib/ceph directory is deleted, you're in for a world of hurt.
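Scouring the logs for the FSID can be partly automated. The FSID is a UUID, so this minimal Python sketch counts UUIDs that appear right after the word "fsid" in whatever log lines you feed it. It assumes the logs write the ID in a form like `fsid = <uuid>` or `fsid: <uuid>` (log formats vary between Ceph versions), and the function names are hypothetical helpers, not part of any Ceph or Proxmox tool.

```python
import re
from collections import Counter

# A Ceph FSID is a UUID: 8-4-4-4-12 hexadecimal digits.
UUID_PATTERN = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
# "fsid", up to three separator characters (" = ", ": ", "="), then a UUID.
FSID_RE = re.compile(r"fsid\W{0,3}(" + UUID_PATTERN + ")", re.IGNORECASE)


def find_fsid_candidates(lines):
    """Count the UUIDs that appear right after the word 'fsid'."""
    counts = Counter()
    for line in lines:
        for match in FSID_RE.finditer(line):
            counts[match.group(1).lower()] += 1
    return counts


def scan_files(paths):
    """Merge FSID candidates from several log files (which paths is up to you)."""
    totals = Counter()
    for path in paths:
        with open(path, errors="replace") as handle:
            totals.update(find_fsid_candidates(handle))
    return totals
```

Point `scan_files` at whatever logs survive (for example the files under /var/log/ceph); a single UUID with far more hits than any other is the likely FSID. If /var/lib/ceph is intact, the OSD data directories usually also contain a `ceph_fsid` file, which is a quicker source.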

Kaboom · Sep 23, 2019

By FSID, do you mean the key files in /etc/pve/priv? Those are still there.

paradox55 · Sep 23, 2019

Kaboom said:
By FSID, do you mean the key files in /etc/pve/priv? Those are still there.

What files are missing? Is it just the ceph.conf?

Does /var/lib/ceph still exist? If so, does it have all of the files?

Kaboom · Sep 23, 2019

Yes, everything is still there on node1, node2 and node3.

It looks like 'only' ceph.conf was deleted when I ran 'pveceph purge' on node4 (no containers or VMs are running on node4).

paradox55 · Sep 23, 2019

Kaboom said:
Yes, everything is still there on node1, node2 and node3.

It looks like 'only' ceph.conf was deleted when I ran 'pveceph purge' on node4 (no containers or VMs are running on node4).

Then recreate ceph.conf, restart the monitor, and cross your fingers.
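Recreating ceph.conf by hand mostly means putting back the FSID and the monitor list. The Python sketch below renders a minimal file under stated assumptions: the option set is modelled on a typical Proxmox VE ceph.conf of that era (cephx auth, admin keyring under /etc/pve/priv), not recovered from this cluster, and the helper name, IPs and network are placeholders you must replace with your own values.

```python
import uuid


def render_ceph_conf(fsid, mon_ips, public_network=None):
    """Build a minimal ceph.conf that lets clients find and authenticate to the mons.

    fsid           -- the cluster UUID recovered from logs or OSD directories
    mon_ips        -- monitor IP addresses (placeholders in the test below)
    public_network -- optional CIDR of the Ceph public network
    """
    fsid = str(uuid.UUID(fsid))  # raises ValueError on a malformed FSID
    if not mon_ips:
        raise ValueError("at least one monitor address is required")
    lines = [
        "[global]",
        "\tfsid = " + fsid,
        # The CLI and librados find the monitors through this list.
        "\tmon_host = " + " ".join(mon_ips),
        "\tauth_cluster_required = cephx",
        "\tauth_service_required = cephx",
        "\tauth_client_required = cephx",
        # Assumed Proxmox VE layout: keyrings live on the cluster filesystem.
        "\tkeyring = /etc/pve/priv/$cluster.$name.keyring",
    ]
    if public_network:
        lines.append("\tpublic_network = " + public_network)
    lines += ["", "[osd]", "\tkeyring = /var/lib/ceph/osd/ceph-$id/keyring"]
    return "\n".join(lines) + "\n"
```

On Proxmox VE, /etc/ceph/ceph.conf is normally a symlink to /etc/pve/ceph.conf on the cluster filesystem (which is also why a purge on one node removed it everywhere), so the rendered text would go to /etc/pve/ceph.conf. Compare it against any per-node settings you remember before restarting anything.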

Kaboom · Sep 23, 2019

I found the FSID, recreated ceph.conf and restarted, but I think I am missing some settings in this file (everything is still up).

sg90 · Sep 23, 2019

Kaboom said:
I found the FSID, recreated ceph.conf and restarted, but I think I am missing some settings in this file (everything is still up).

The default config doesn't have many settings whose absence would cause a big problem; most of them are for tweaking and performance. The main one is normally the list of mons, which commands use to find your cluster.

Does ceph -s run fine on each node and report healthy?

I don't use Proxmox Ceph, but someone else may be able to provide you with a sanitised default file.
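Checking whether `ceph -s` runs fine on each node can be scripted. This Python sketch runs `ceph -s --format json` and pulls out the health string. The JSON layout is an assumption to verify on your release: Luminous and later report `health.status`, and the sketch falls back to the older `health.overall_status`. The timeout is there because `ceph -s` can hang when it cannot reach a monitor; the function names are hypothetical.

```python
import json
import subprocess


def cluster_health(status_json):
    """Extract the overall health string from `ceph -s --format json` output."""
    health = json.loads(status_json).get("health", {})
    # Prefer health.status (Luminous+); fall back to the older overall_status.
    return health.get("status") or health.get("overall_status", "UNKNOWN")


def check_local_node(timeout=30):
    """Run `ceph -s` on this node; return its health string or the error text."""
    try:
        result = subprocess.run(
            ["ceph", "-s", "--format", "json"],
            capture_output=True, text=True, timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # OSError: ceph binary missing; TimeoutExpired: no monitor answered.
        return "ERROR: " + str(exc)
    if result.returncode != 0:
        return "ERROR: " + result.stderr.strip()
    return cluster_health(result.stdout)
```

Running `check_local_node()` on node1, node2 and node3 and expecting `HEALTH_OK` from each mirrors the manual check; any `ERROR:` result carries the same message `ceph -s` would print.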

Kaboom · Sep 23, 2019

ceph -s reports:

unable to get monitor info from DNS SRV with service name: ceph-mon
no monitors specified to connect to.
2019-09-23 19:13:05.896532 7f91e7c95500 -1 failed for service _ceph-mon._tcp
[errno 2] error connecting to the cluster

Kaboom · Sep 23, 2019

/etc/init.d/ceph status

● ceph.service - PVE activate Ceph OSD disks
Loaded: loaded (/etc/systemd/system/ceph.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Mon 2019-09-23 18:05:11 CEST; 1h 13min ago
Main PID: 2266631 (code=exited, status=0/SUCCESS)

Sep 23 18:05:08 node002 ceph-disk[2266631]: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@4.service → /lib/systemd/system/ceph-osd@.service.
Sep 23 18:05:08 node002 ceph-disk[2266631]: Removed /run/systemd/system/ceph-osd.target.wants/ceph-osd@5.service.
Sep 23 18:05:09 node002 ceph-disk[2266631]: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@5.service → /lib/systemd/system/ceph-osd@.service.
Sep 23 18:05:09 node002 ceph-disk[2266631]: Removed /run/systemd/system/ceph-osd.target.wants/ceph-osd@1.service.
Sep 23 18:05:09 node002 ceph-disk[2266631]: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@1.service → /lib/systemd/system/ceph-osd@.service.
Sep 23 18:05:10 node002 ceph-disk[2266631]: Removed /run/systemd/system/ceph-osd.target.wants/ceph-osd@2.service.
Sep 23 18:05:10 node002 ceph-disk[2266631]: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@2.service → /lib/systemd/system/ceph-osd@.service.
Sep 23 18:05:10 node002 ceph-disk[2266631]: Removed /run/systemd/system/ceph-osd.target.wants/ceph-osd@0.service.
Sep 23 18:05:11 node002 ceph-disk[2266631]: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@0.service → /lib/systemd/system/ceph-osd@.service.
Sep 23 18:05:11 node002 systemd[1]: Started PVE activate Ceph OSD disks.

sg90 · Sep 23, 2019

Kaboom said:
ceph -s reports:

unable to get monitor info from DNS SRV with service name: ceph-mon
no monitors specified to connect to.
2019-09-23 19:13:05.896532 7f91e7c95500 -1 failed for service _ceph-mon._tcp
[errno 2] error connecting to the cluster

So, as I thought, you need to at least add your mon IPs to ceph.conf.

Any running VMs will be fine, as they picked these up at boot / when the RBD was mapped.
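The error above shows the client finding no monitors in ceph.conf and falling back to a DNS SRV lookup for `_ceph-mon._tcp`. A quick way to confirm the recreated file actually lists the mons is to parse it. This Python sketch treats ceph.conf as an INI file, reads `mon_host` in `[global]` and `mon addr` in `[mon.X]` sections, and treats spaces and underscores in option names as equivalent, as Ceph does. It is a hypothetical helper, and it does not handle the bracketed v2/v1 address syntax of newer releases.

```python
import configparser
import re


def _normalize(option):
    # Ceph treats spaces and underscores in option names as equivalent.
    return option.strip().lower().replace(" ", "_")


def find_mon_addresses(conf_text):
    """List the monitor addresses a ceph.conf gives to clients.

    Looks at mon_host in [global] and at 'mon addr' in [mon.X] sections.
    Bracketed v2/v1 address vectors (Nautilus and later) are not handled.
    """
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    addresses = []
    for section in parser.sections():
        options = {_normalize(k): v for k, v in parser.items(section)}
        if section == "global" and "mon_host" in options:
            # mon_host entries may be separated by commas, semicolons or spaces.
            addresses += [a for a in re.split(r"[,;\s]+", options["mon_host"]) if a]
        elif section.startswith("mon.") and "mon_addr" in options:
            addresses.append(options["mon_addr"].strip())
    return addresses
```

An empty list means the CLI has nothing to connect to and will fall back to DNS SRV, which reproduces the "no monitors specified to connect to" error.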

Kaboom · Sep 23, 2019

'ceph -s' now takes a very long time and reports nothing.

I can now see more Ceph graphs in Proxmox, but without any data.

Kaboom · Sep 23, 2019

One big step further now, THANKS sg90 and paradox55! I had put the wrong name for the mon in ceph.conf. I have now added the IP address, and ceph_ssd is working again.

Ceph under the node still gives 'rados_connect failed'.

Kaboom · Sep 23, 2019

ceph -s
2019-09-23 20:35:16.098552 7f5101076700 0 librados: client.admin authentication error (1) Operation not permitted
[errno 1] error connecting to the cluster

Kaboom · Sep 23, 2019

Some extra info:
root@node002:~# systemctl status ceph-mon.target
● ceph-mon.target - ceph target allowing to start/stop all ceph-mon@.service instances at once
Loaded: loaded (/lib/systemd/system/ceph-mon.target; enabled; vendor preset: enabled)
Active: active since Thu 2019-09-05 14:49:36 CEST; 2 weeks 4 days ago

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
root@node002:~# systemctl status ceph.target
● ceph.target - ceph target allowing to start/stop all ceph*@.service instances at once
Loaded: loaded (/lib/systemd/system/ceph.target; enabled; vendor preset: enabled)
Active: active since Thu 2019-09-05 14:49:36 CEST; 2 weeks 4 days ago

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
root@node002:~# systemctl status ceph-osd.target
● ceph-osd.target - ceph target allowing to start/stop all ceph-osd@.service instances at once
Loaded: loaded (/lib/systemd/system/ceph-osd.target; enabled; vendor preset: enabled)
Active: active since Thu 2019-09-05 14:49:20 CEST; 2 weeks 4 days ago

Kaboom · Sep 23, 2019

In /etc/pve/priv there is a file ceph.client.admin.keyring. I replaced the key in this file with the key from /etc/pve/priv/ceph/ceph_ssd.keyring, and Ceph under the node is working again.

ceph -s
cluster:
health: HEALTH_OK
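The "client.admin authentication error" was a key mismatch between two keyring files. Before copying keys around, a read-only comparison can confirm that is the problem. This Python sketch parses the Ceph keyring format (`[entity]` headers followed by `key = ...` lines) and compares the key for one entity. It assumes, as was the case here, that ceph_ssd.keyring holds a `[client.admin]` section; the function names are hypothetical, and reading /etc/pve/priv requires root.

```python
import re

SECTION_RE = re.compile(r"\[(.+)\]")


def keyring_keys(text):
    """Map each [entity] section of a Ceph keyring file to its 'key' value."""
    keys, entity = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        header = SECTION_RE.fullmatch(line)
        if header:
            entity = header.group(1)
        elif entity and re.match(r"key\s*=", line):
            # Split on the first '=' only: base64 keys end in '=' padding.
            keys[entity] = line.split("=", 1)[1].strip()
    return keys


def keys_match(keyring_a, keyring_b, entity="client.admin"):
    """True if both keyring texts hold the same key for the given entity."""
    key_a = keyring_keys(keyring_a).get(entity)
    return key_a is not None and key_a == keyring_keys(keyring_b).get(entity)
```

Once the cluster answers again, `ceph auth get client.admin` shows the key the monitors actually hold, which is the authoritative value both files should match.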
