Reinstall CEPH on Proxmox 6

There seem to be some leftovers. After you purged Ceph, is /var/lib/ceph/ empty? And is there no ceph.conf anymore?
I tried to purge, and there is no ceph.conf anymore, but it still didn't work!
I tried many ways and finally found out that without this parameter it works fine!
 
Yup, the ceph install on the latest 6.1.

I have some good news though. I re-installed the base 6.1 ISO (version 6.1-3) and followed along the simple installation steps for ceph:

1. Install PVE
2. Install Ceph from the UI (rough CLI equivalent below)
3. Configure with default network settings
4. Create PVE cluster
5. Install PVE on second node
6. Install Ceph from the UI
7. Add OSDs from all nodes
8. Create second monitor on second node
9. Create standby mgr on second node
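
(For the CLI-inclined, steps 2, 3, and 6 are roughly the same as the pveceph commands that come up later in this thread; just a sketch, with the network left as a placeholder:)

Code:
# rough CLI equivalent of the UI steps above (run per node as needed)
pveceph install                        # step 2 / 6: install the Ceph packages
pveceph init --network <public-cidr>   # step 3: write the initial ceph.conf (first node only)
pveceph mon create                     # create the first monitor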


The steps I dropped compared to my previous attempts were (adding these back would cause the steps above to fail):
1. Add the no-subscription repo (repo line shown below)
2. apt update && apt dist-upgrade
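
For reference, the no-sub repo on PVE 6 / Debian Buster is the usual pve-no-subscription line (the file name below is arbitrary):

Code:
# standard pve-no-subscription repository for Proxmox VE 6 (Debian Buster)
echo "deb http://download.proxmox.com/debian/pve buster pve-no-subscription" \
  > /etc/apt/sources.list.d/pve-no-subscription.list
apt update && apt dist-upgrade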

Something must be off with the latest 6.1-7 (?) (at least on my hardware)...

My next step will be to somehow change the Ceph cluster network over to my dual 10G Intel cards + 10G switch, but I'm still researching how to change the cluster network for Ceph safely on PVE. (If anyone has a guide or experience, please message me!)
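
From what I've read so far, the networks are just the cluster_network / public_network entries in /etc/pve/ceph.conf; a rough, untested sketch of the change (10.10.10.0/24 is only a placeholder for the 10G subnet):

Code:
# rough sketch only -- edit the [global] section of /etc/pve/ceph.conf, e.g.
#   cluster_network = 10.10.10.0/24   # placeholder for the 10G subnet (OSD replication traffic)
# then restart the OSDs one node at a time:
systemctl restart ceph-osd.target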
 
First:
All nodes have been reinstalled and upgraded to the latest! (pve-manager: 6.1-7)

Second:
I want to use the parameter '-disable_cephx 1', so I set up the Ceph cluster with the pveceph CLI, not in the UI (see the sketch after the steps below):

1. pveceph install
2. pveceph init --network 10.0.115.0/24 -disable_cephx 1
3. pveceph mon create
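
(For reference, as far as I can tell '-disable_cephx 1' just writes the auth settings as 'none' into the generated config, roughly:)

Code:
# roughly what --disable_cephx 1 puts into the [global] section of /etc/pve/ceph.conf
# (cephx authentication switched off cluster-wide)
auth_client_required = none
auth_cluster_required = none
auth_service_required = none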

After creating the first mon, the next steps are all done from the UI, like:

Create second monitor on second node
Create standby mgr on second node
Add OSDs from all nodes
...

Note: this was the first time creating a CEPH cluster on the freshly installed PVE 6, with all nodes at the latest version.

All of the above works fine.

---
But I wanted to recreate this CEPH cluster, so I deleted all the previous configuration and purged all the Ceph configs and directories,

and then went to create the new CEPH cluster again.

Still from the CLI:

Code:
pveceph init --network 10.0.115.0/24 -disable_cephx 1
pveceph mon create

And then it didn't work, as described in #16.

Before that, I tried the advice above, like #7 #8 #9 #10, but it still fails!

So I think this parameter is the key to the problem.
Of course, this is my environment, my test :)
 
Did someone manage to fix it? I have the same issue:
Code:
# pveceph init  --disable_cephx 1 --network 172.22.255.0/24
# pveceph mon create
unable to get monitor info from DNS SRV with service name: ceph-mon
Could not connect to ceph cluster despite configured monitors
# ceph -s
unable to get monitor info from DNS SRV with service name: ceph-mon
[errno 2] error connecting to the cluster
I just reinstalled the PVE nodes but cannot set up Ceph...

Can someone give a hint on how to debug this??
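
The error itself means no monitor address could be found in ceph.conf, so the client falls back to a DNS SRV lookup; a few generic things worth checking (<node> is just a placeholder for the node name):

Code:
cat /etc/pve/ceph.conf                     # any [mon.<node>] section or mon_host entry at all?
ls -l /etc/ceph/ceph.conf                  # should be a symlink to /etc/pve/ceph.conf
systemctl status ceph-mon@<node>.service   # did the monitor actually start?
journalctl -b -u ceph-mon@<node>.service   # monitor log since boot
ls /var/lib/ceph/mon/                      # leftover mon data dirs from an old cluster?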
 
I can confirm this problem. Everything was working, but after going from 5 to 6 and following the upgrade instructions to the latest Ceph, I can't get Ceph working; I hit the same problem that was described in the initial post here.

I deleted all config directories and services, checked the hosts file, ... but it still complains with the same message users report here.

Very sad, we have almost 10 licenses active and can't use Ceph.
 
To manually remove all Ceph services throughout the cluster (be aware, all data on Ceph will be lost):

Code:
pveceph mds destroy <mdsid>
pveceph osd destroy <osdid>
pveceph mgr destroy <mgrid>
pveceph mon destroy <monid>
rm /etc/ceph/ceph.conf #removes the link
Run the above commands (depending on service) on every node in the cluster that hosts Ceph services. The corresponding directories should then be gone as well.

Code:
systemctl stop ceph-mon@<monid>.service
systemctl disable ceph-mon@<monid>.service
rm /etc/pve/ceph.conf
rm -r /var/lib/ceph/mon/ceph-<monid>/
ATM, this needs to be done for the last MON.
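
To double-check that a node is really clean before reinstalling, the default locations can be inspected (quick sanity check, default paths assumed):

Code:
# these should be empty or gone after the removal above
ls -la /var/lib/ceph/mon/ /var/lib/ceph/mgr/ /var/lib/ceph/mds/ /var/lib/ceph/osd/ 2>/dev/null
ls -l /etc/ceph/ceph.conf /etc/pve/ceph.conf 2>/dev/null
systemctl list-units 'ceph*' --all    # no ceph units should be active anymore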
 
Unfortunately I did the upgrade to Ceph 14 and I couldn't roll back. Can I send some logs to you? Is there a way to contribute to debugging?
 
I fixed it for my situation.

My final solution to reinstall CEPH on all nodes:

Code:
rm -rf /etc/systemd/system/ceph*                 # drop leftover systemd unit links
killall -9 ceph-mon ceph-mgr ceph-mds            # stop any daemons still running
rm -rf /var/lib/ceph/mon/  /var/lib/ceph/mgr/  /var/lib/ceph/mds/   # wipe mon/mgr/mds state
pveceph purge
apt purge ceph-mon ceph-osd ceph-mgr ceph-mds
rm /etc/init.d/ceph
# re-download and reinstall every ceph package still marked as installed
for i in $(apt search ceph | grep installed | awk -F/ '{print $1}'); do apt reinstall $i; done
dpkg-reconfigure ceph-base
dpkg-reconfigure ceph-mds
dpkg-reconfigure ceph-common
dpkg-reconfigure ceph-fuse
for i in $(apt search ceph | grep installed | awk -F/ '{print $1}'); do apt reinstall $i; done
pveceph install
 
Thank you for that detailed script; you seem to be the first person to recover a failed Ceph installation.
I tried your recipe on a failed Ceph node in my otherwise smoothly running PVE cluster.

After running your script, my next step was "pveceph init" (no error messages), but then "pveceph mon create" produced the message
"Could not connect to ceph cluster despite configured monitors".

Any operations in the GUI end with timeouts.
A CLI check of the Ceph status with "ceph -s" just hangs with no output at all.
A CLI check with "systemctl status ceph-mon@pve54" does not find a running monitor:

Code:
root@pve54:~# systemctl status ceph-mon@pve54
ceph-mon@pve54.service - Ceph cluster monitor daemon
   Loaded: loaded (/lib/systemd/system/ceph-mon@.service; disabled; vendor preset: enabled)
  Drop-In: /lib/systemd/system/ceph-mon@.service.d
           └─ceph-after-pve-cluster.conf
   Active: inactive (dead)

Enabling and starting the ceph-mon ended in a restart loop "ceph-mon@pve54.service: Start request repeated too quickly."
Any idea what else might need to be removed before recovery of CEPH?

Thanks for your patience.
 
@mbstein, in Proxmox VE 6.2 there is pveceph purge available that helps in removing Ceph prior to re-installation.

After running your script, my next step was "pveceph init" (no error messages), but then "pveceph mon create" produced the message
"Could not connect to ceph cluster despite configured monitors".
The ceph.conf is still a leftover from the previous install.
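
Roughly, clearing that leftover (and the rate-limited unit) before retrying would look like this; a sketch only, with the mon id pve54 taken from the post above and default PVE paths assumed:

Code:
rm /etc/pve/ceph.conf                             # drop the old cluster definition
rm -r /var/lib/ceph/mon/ceph-pve54/ 2>/dev/null   # old monitor data, if any is left
systemctl reset-failed ceph-mon@pve54.service     # clear the "start request repeated too quickly" state
pveceph init --network <public-cidr>
pveceph mon create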
 
@Alwin, the pveceph purge command is part of the script provided by Younex.

Just "pveceph purge" applied to two nodes of my cluster (without any else removed) allowed for a fresh installation of Ceph,
the third node did need some directories to be (re)created, some ownerships to be restored, etc. It's running now o.k.
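
(For anyone hitting the same thing, the repair is presumably along these lines; ceph:ceph is the ownership the Ceph daemons expect under /var/lib/ceph, and the exact directories may differ:)

Code:
# recreate the default state directories and give them back to the ceph user
mkdir -p /var/lib/ceph/mon /var/lib/ceph/mgr /var/lib/ceph/mds /var/lib/ceph/osd
chown -R ceph:ceph /var/lib/ceph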

Thank you.
 
@Alwin, the pveceph purge command is part of the script provided by Younex.
True. I meant that an improved version of pveceph purge is available in Proxmox VE 6.2. ;)
 
I fixed it for my situation.

My final solution to reinstall CEPH on all nodes:
...
This helps me a lot!
 
