Ceph reinstallation issues

oz1cw7yymn

I am having problems configuring ceph within pve. The blocker is that, having obviously made a mistake in the configuration somewhere, I am not able to purge it and start over.

In other words, I have a broken ceph installation on my newly installed 6.0 node. These are the things I have tried in order to reset the configuration and restart the ceph installation/configuration.

* pveceph purge - "unable to get monitor info from DNS SRV with service name: ceph-mon"
* rm -Rf /etc/ceph /etc/pve/ceph.conf /etc/pve/priv/ceph* /var/lib/ceph
* apt remove ceph ceph-base ceph-mon ceph-mgr ceph-osd ceph-mgr-dashboard ceph-mgr-diskprediction-local ceph-mgr-ssh (extra packages were installed during one of the partially successful attempts at ceph installation) - apt fails because the ceph*.prerm scripts in /var/lib/dpkg/info fail to stop the services
* rm ceph-{base,mds,mgr,mon,osd}.prerm in the dpkg folder
* retry of above apt remove - successful
* rm ceph-{base,mds,mgr,mon,osd}.* in the dpkg folder
* rm -Rf /etc/ceph /etc/pve/ceph.conf /etc/pve/priv/ceph* /var/lib/ceph
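
Collected into one place, the teardown above looks roughly like this. It is only a sketch of what I ran, not an official procedure - the package list matches what happened to be installed on my node, and deleting the prerm scripts is a blunt workaround:

Code:
# stop any ceph services that may still be running (ignore failures)
systemctl stop ceph.target ceph-mon.target ceph-mgr.target ceph-osd.target || true

# the broken prerm scripts block apt, so remove them first
rm -f /var/lib/dpkg/info/ceph-{base,mds,mgr,mon,osd}.prerm

# remove the packages (list matches my partially successful install)
apt remove ceph ceph-base ceph-mon ceph-mgr ceph-osd \
    ceph-mgr-dashboard ceph-mgr-diskprediction-local ceph-mgr-ssh

# clear the remaining dpkg scripts and ceph state
rm -f /var/lib/dpkg/info/ceph-{base,mds,mgr,mon,osd}.*
rm -rf /etc/ceph /etc/pve/ceph.conf /etc/pve/priv/ceph* /var/lib/ceph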

I've been using posts from https://forum.proxmox.com/threads/ceph-config-broken.54122/page-2 as inspiration.

After the above steps I try installing ceph cleanly, getting these results:

* pveceph install
- 122MB additional disk space etc etc.
- installed ceph nautilus successfully
* configure ceph in GUI
- public network set to default network of node
- cluster network set to default network of node (I have a separate network intended for cluster)
- monitor node = pve node
- error with cfs lock 'file-ceph_conf': command 'cp /etc/pve/priv/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring' failed: exit code 1 (500)
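
In hindsight (see the follow-up below), that cp most likely fails because the /etc/ceph directory was removed by the cleanup and nothing recreated it, so the keyring has nowhere to be copied to. A quick check/fix before configuring in the GUI, assuming that is also your problem:

Code:
# the keyring copy target must exist; recreate it if the cleanup removed it
[ -d /etc/ceph ] || mkdir -p /etc/ceph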


Suggestions? Or is reinstallation of the node the only solution when the ceph installation gets borked?
 
Redoing the above, but leaving the /etc/ceph folder in place (so the keyring can be copied there), I again reinstall the ceph packages and attempt the configuration in the GUI:
- Could not connect to ceph cluster despite configured monitors (500)

Seems the mon is not created/started.
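
For the record, these are the sort of checks that should show whether the monitor was actually created and started (a sketch; the unit name is assumed to match the node's hostname):

Code:
# any mon data directory created for this node?
ls -l /var/lib/ceph/mon/

# is the mon service defined and running?
systemctl status "ceph-mon@$(hostname).service"

# recent mon log messages, if any
journalctl -u "ceph-mon@$(hostname)" --no-pager -n 50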
 
Unfortunately not - I don't know how many times I've reinstalled pve due to wanting to tweak my ceph configuration.
 
Thanks for your reply. I found out that with a lower Debian version (9 instead of 10) I could purge ceph and then reconfigure it through Proxmox. I tried it several times in a row and it kept working. There might be something wrong with the combination of Debian Buster + Proxmox + Ceph Nautilus.
 
I remember having the same issue with 5.4 (on Debian 9) a few months ago when I was experimenting with pve/ceph for work. Also, Nautilus is not supported on Debian 9, which is why PVE had to wait for Buster/10 to upgrade ceph.
 
I've had the same issues upgrading from PVE 5 (using only local storage, no Ceph was involved at all yet) to 6 with the intention to finally move everything to Ceph. The problematic step was pveceph createmon throwing "Could not connect to ceph cluster despite configured monitors".

I solved it (or rather, messed my way around it) by commenting out line 202 in /usr/share/perl5/PVE/API2/Ceph/MON.pm. After that it still complained that "monitor <hostname> already exists", so I also commented out line 74 in that same file. You may have to run rm -rf /var/lib/ceph/mon/* before executing pveceph createmon again. In my case it completed fine and the monitor came up. The same needs to be done on every node in the cluster.

Hint: apt install libdevel-trace-perl and then executing pveceph as perl -T -d:Trace /usr/bin/pveceph mon create turned out to be very helpful for debugging this.
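
Put together, the workaround looks roughly like this (a sketch, not a recommendation - the two checks to comment out are the ones at the line numbers mentioned above, which will shift between PVE releases):

Code:
# trace what pveceph actually does (very verbose, but shows where it dies)
apt install libdevel-trace-perl
perl -T -d:Trace /usr/bin/pveceph mon create 2>&1 | tee /tmp/pveceph-trace.log

# after commenting out the two checks in /usr/share/perl5/PVE/API2/Ceph/MON.pm:
# clear stale mon data and retry - repeat on every node
rm -rf /var/lib/ceph/mon/*
pveceph mon create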
 
I think I'm struggling with the same issue now. I have torn down my Ceph storage configuration with a view to rebuilding it so I get to know the process. Everything looked to be removed ok. I then ran the ceph setup again and was able to configure two out of the three nodes. But the old master node will not allow me to add a monitor to it. I get the error below:

Code:
error during cfs-locked 'file-ceph_conf' operation: command 'chown ceph:ceph /var/lib/ceph/mon/ceph-nuc10i7-pve01' failed: exit code 1

Any ideas or pointers? I want to try and fix this rather than default to a re-install.
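
For anyone hitting the same error, the failing chown can be checked like this (a sketch; the path is the one from the error above, so substitute your own node name):

Code:
# does the mon parent directory even exist, and who owns it?
ls -ld /var/lib/ceph /var/lib/ceph/mon
ls -ld /var/lib/ceph/mon/ceph-nuc10i7-pve01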
 
ok - just to follow up on this. I have managed to bring ceph back to a fully working state without a re-install. As simple as it sounds, I just needed to re-create the two folders below on the node with the issue. Adding Manager and Monitors via the CLI or UI then created the sub-folders (ceph-mon.nuc10i7-pve01 and ceph-mgr.nuc10i7-pve01) for me. For some reason, I just needed to manually create the parent folders.

Code:
mkdir /var/lib/ceph/mgr
mkdir /var/lib/ceph/mon

That was it.
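
If you want to set the ownership at the same time (the original error was a failed chown to ceph:ceph), something like this should be equivalent, assuming the ceph user and group already exist from the package install:

Code:
# recreate the parent folders owned by ceph:ceph in one step
install -d -o ceph -g ceph /var/lib/ceph/mon /var/lib/ceph/mgr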
 
I take that back. Whilst it looks like everything is ok, it's still not. On the 'old' primary node I have an OSD which is orphaned, and I can't find a way to remove it. On the other nodes, within /var/lib/ceph/osd/ I see each of the nodes listed, whereas on the 'old' primary node it only shows itself.

[UPDATE]
I'm getting further. On the 'old' primary node I ran ceph osd tree. This showed me the orphaned OSD (its ID was 0). From there I ran pveceph osd destroy 0 to remove it.
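
For completeness, a more cautious sequence for the same removal might look like this (a sketch; OSD id 0 as above, and if the orphaned OSD's service no longer exists the stop step simply fails harmlessly):

Code:
# identify the orphaned OSD
ceph osd tree

# take it out and stop its service before destroying it
ceph osd out 0
systemctl stop ceph-osd@0.service || true

# remove it from the cluster
pveceph osd destroy 0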

Everything looks ok. But I cannot understand why cluster nodes 2 and 3 show all the OSDs in /var/lib/ceph/osd/, whereas on node 1 (the old master node) that same folder only has one OSD in it.
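
For comparison, the state of each node can be checked like this (a sketch; run on every node):

Code:
# OSD data directories present locally on this node
ls /var/lib/ceph/osd/

# OSD services active on this node
systemctl list-units 'ceph-osd@*'

# cluster-wide view of which host each OSD belongs to
ceph osd tree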

Would anyone be able to provide some insights?
 