Not able to use pveceph purge to completely remove ceph

daniel1e6

New Member
Oct 31, 2019
I'm new to Proxmox... and Linux. I'm trying to completely remove Ceph and I can't. I tried pveceph purge after stopping all services and got the message below. I originally installed, uninstalled, and reinstalled Ceph because I wasn't able to add a second NVMe drive from each of the three servers currently in the same cluster. I had to install it the second time from the terminal. After the clean install, I discovered that the old monitors had not been deleted and didn't work, and I couldn't delete them. When I tried to purge, it wouldn't let me (see the message below).

[screenshot attachment: 1572511998930.png]

I also tried apt remove, apt autoremove, and various upgrades. Nothing seemed to work, and I couldn't remove ceph-mon. Any ideas on how to completely remove this would be greatly appreciated. My next step is a fresh install of Proxmox on all three servers, which I'm trying to avoid.
 
'pveceph purge' purges the packages, but not the monitors.

So, a bit of a raw and rough approach to get you out of this could be
Code:
## stop all remaining ceph-services
# systemctl stop ceph-mon.target
# systemctl stop ceph-mgr.target
# systemctl stop ceph-mds.target
# systemctl stop ceph-osd.target

## avoid them being restarted by systemd on the next boot (the low-level way)
# rm -rf /etc/systemd/system/ceph*

## be really sure they're stopped:
# killall -9 ceph-mon ceph-mgr ceph-mds

## then do
# rm -rf /var/lib/ceph/mon/  /var/lib/ceph/mgr/  /var/lib/ceph/mds/ 

## then retry purge
# pveceph purge
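Before retrying the purge it can help to verify nothing ceph-related actually survived the stop/kill steps above; a minimal check (the four daemon names are the standard ones, this is just a sketch):

```shell
# Check that no Ceph daemons survived the stop/kill steps above.
# pgrep -x matches exact process names.
leftover=""
for name in ceph-mon ceph-mgr ceph-mds ceph-osd; do
    pids=$(pgrep -x "$name") && leftover="$leftover $name($pids)"
done
if [ -z "$leftover" ]; then
    echo "no ceph daemons running"
else
    echo "still running:$leftover"
fi
```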
 
Good morning. This seemed promising and is good to know for future issues. However, it didn't work in this situation. Here's what I got. Thanks!

Capture.png
 
Hi Tom, any updates regarding this bug? These issues are definitely preventing us from meeting our production release date. More importantly, it seems this issue could result in a catastrophic failure if we were in production, which is concerning. I look forward to your help in figuring out a solution to this bug.
 
@daniel
Sorry, but if you're new to Linux, Proxmox, and Ceph, you should never go into production! Learning never ends for us either, but for you it's far too early to run production on Linux, any Linux!
apt and dpkg are Debian tools; sort this situation out first.
 
I appreciate the feedback. Thanks. Tom provided a solution (he seems to have lots of experience). The solution didn't work. I'm not sure this bug can be solved. It may require a new installation.
 
So, dpkg complains about a SysV init script error. That script is a fallback from systemd and is normally not present (though I did not check closely). I'd try removing it and retrying the apt remove/purge command:
Code:
rm /etc/init.d/ceph
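For reference, the remove/purge would then be retried roughly like this. The package set below is an assumption based on a typical Proxmox VE 6 / Ceph install; check `dpkg -l | grep ceph` for what is actually present:

```shell
# Usual Ceph package set on a PVE node (assumption; adjust to your dpkg -l output).
pkgs="ceph ceph-base ceph-common ceph-mon ceph-mgr ceph-mds ceph-osd"
# Build the command first so it can be reviewed before running it for real;
# --purge-style removal also drops the config files a plain remove leaves behind.
cmd="apt purge --yes $pkgs"
echo "$cmd"
```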

it seems like this issue would result in a catastrophic failure if we were in production
How so? Once you set up a Ceph service you normally want to keep it; purging it from a server is not a common thing to do, especially not in production.

Also, you can go to our enterprise support team to get more involved responses https://www.proxmox.com/en/proxmox-ve/pricing

I mean, we don't know how you got into this situation at all. I personally have purged and re-installed Ceph quite often in testing, and on a normal setup done with the Proxmox VE tooling the steps above work most of the time without big issues (I once deleted a bit too much, but that was my fault). So I really would not call this a bug.

On a more general note, if you really don't know what state this is all in, and you've messed around with something whose effects you weren't fully aware of, I'm not sure it's a good idea to go forward like this. It could be good to re-install the cluster to be sure you're in a clean state. Then, before creating any relevant VM or container, set everything up, test, and be sure to log your information somewhere.
Reading up on the Proxmox VE docs and some general Linux information could be good. As said, I don't know any background of this service; but once you said that you may miss "production deadlines" I got the feeling that this may be more serious than initially thought, i.e., not just a test+learn+evaluate setup but something much more serious, with possibly big implications from mis-setup or failures.
So, to be honest, I'd recommend enterprise support here. There are a lot of helpful people on the community forum, but for such things, and especially if you say you don't have much experience with Linux in general or Proxmox VE, it seems more appropriate; just my two cents.
 
Hi Tom,

I'm sure there are times a reinstall is required in production. If I were in production and decided Ceph wasn't working but chose to keep Proxmox, I would need a fresh install.

I followed the documentation to remove the monitors via the GUI. Then I used the documentation to attempt to remove Ceph. Like many others, I'm in the process of reformatting the drives so I can start over. Our goal has been to set up, test, and learn with the community level of support, then transition to a level of support more suitable for production. In parallel, we are interviewing people to manage Proxmox for our company as we scale up. The issue happened both times I attempted to remove and reinstall Ceph using the Proxmox documentation, followed by the recommendation you provided. I'll consider upgrading to the next level of support before the production release.

I may be a newbie, but I followed the documentation. Others have had similar issues, but I have not seen a solution. Here's what I've done:
1) Installed Proxmox.
2) Set up the network for private and public NICs.
3) Pinged private and public.
4) Double-checked the network interface settings on each node.
5) Rebooted and re-checked ping.
6) Added my subscription keys (FYI, pve-enterprise.list had to be set manually), updated each node through the GUI, rebooted, and checked again to be sure there weren't issues with the kernel.
7) Created the cluster and joined the nodes with the bridge and private network set properly (once, I did this after I installed Ceph on each node; it created 3 managers and resulted in issues connecting the default monitors, and I reinstalled PVE that time).
8) Installed a total of 3 monitors.
9) Installed a few OSDs.
I have 6 NVMe 1 TB drives, and Proxmox is installed on each node on a dedicated drive. I was never able to add 6 OSDs (the reason for my attempt to reinstall Ceph). It seems like maybe I need to create a volume for LVM or create a CephFS; I'll have to research OSDs once the first issues are resolved. During my preliminary search, I haven't found much documentation clearly explaining how to prepare a new drive for Ceph OSDs. Then I'll create my container and VM pools.
 
When I clicked on the link it took me to your last post, so the below may not be useful at all, given the above.


Hi Daniel,
I have a similar problem in that I cannot get Ceph back to square one, which led me to your post.
I am certainly no expert.

Ceph is its own file system, so you don't add it like a normal drive.
Once the OSDs are added, you create a CephFS and it appears like magic.

I found that sometimes drives need to be cleaned of previous partitions before they will appear in the OSD window.

cfdisk /dev/sdX or fdisk /dev/sdX usually does the job.
Remember to double-check it is the right disk.
Then write the changes and the disk should appear.
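A non-interactive alternative to cfdisk/fdisk is to wipe the old signatures directly. A sketch, echoed as a dry run since the device name is only a placeholder:

```shell
# Placeholder device; ALWAYS confirm the right disk with `lsblk` before wiping.
DISK=/dev/nvme1n1
# Printed instead of executed; drop the echos to actually run these.
echo "wipefs --all $DISK"      # clears filesystem/RAID signatures
echo "sgdisk --zap-all $DISK"  # clears GPT and MBR partition structures
```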


Use the information from this window to find your disk names. The top two are also NVMe drives:
[screenshot attachment: 1572844102496.png]

hope this helps

damon
 
Awesome, thanks for the information. I'll definitely try that when I start adding the OSDs. Thanks!
 
Thanks for reaching out. Yes, I did see that documentation, but it didn't seem to explain how a drive becomes recognized so an OSD can be created. I was able to see both drives under "Disks" but was not able to add an OSD for either of them. The drives were brand new when I first encountered the issue. Also, when I removed all partitions from a drive, I received a GRUB error. One of the three servers didn't produce an error after deleting the partitions, probably because the disk was GPT-initialized before the partitions were deleted. I haven't invested much time researching this area, since I reinstalled once I received the GRUB errors and haven't been able to get past the current issues.

After sending Proxmox my syslogs, I received an email explaining that my kernel is very outdated. Since this is a new install, this is concerning. Could some of my issues be the result of the install not using the latest kernel? I added my subscription keys, performed all updates, and rebooted between updates. I even updated pve-enterprise.list to be sure I receive enterprise repository updates/upgrades going forward. I suspect most paying subscribers don't receive enterprise updates automatically, since this seems to require a manual change. Is this something Proxmox will fix, and is there a temporary workaround so I can update the kernel? Thanks.
 
Maybe the outdated kernel is the root cause of some of my issues. This is a new Proxmox VE 6 install on a new server with new drives. Here's part of the email I received from Proxmox; a solution wasn't provided in it. If this is a bug, is it something Proxmox plans to fix soon? It seems like a major issue if it is.

"The only issue I see in your logs, that you run a quite outdated kernel => update to latest version."
 
After sending Proxmox my syslogs, I received an email explaining that my kernel is very outdated.
Can you PM me the ticket ID? I don't seem to see it on our ticket system.


The documentation didn't seem to explain how a drive can become recognized so an OSD can be created. I was able to see both drives under "Disk" but was not able to add the OSD for that disk.
In this section of the docs, you can see in the screenshot the OSD tab, where you can create new OSDs. This may need some further description.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#pve_ceph_osds
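For a disk that shows up clean, the CLI counterpart of the OSD tab's Create button is pveceph osd create. Sketched here with a placeholder device and printed as a dry run:

```shell
# Placeholder device; confirm the actual path with `lsblk` on the node first.
DISK=/dev/nvme1n1
# Printed instead of executed; run the printed command on the node itself.
cmd="pveceph osd create $DISK"
echo "$cmd"
```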


"The only issue I see in your logs, that you run a quite outdated kernel => update to latest version."
Can you please post a pveversion -v?
 
So here we go: my buddy and I worked through the issue, and I do believe this is a Ceph bug. We both work in high-level IT. In short, do all the stuff listed above; once done, run the commands below and you should have working packages again. It appears that after doing a purge or removing ceph ceph-mon ceph-osd, one of the shared libraries physically goes bye-bye, but the environment still thinks the library is present.

Run initial repair on all ceph packages:
Code:
for i in $(apt search ceph | grep installed | awk -F/ '{print $1}'); do apt reinstall $i; done

Reconfigure the deb packages:
Code:
dpkg-reconfigure ceph-base
dpkg-reconfigure ceph-mds
dpkg-reconfigure ceph-common
dpkg-reconfigure ceph-fuse

Rerun the same repair script:
Code:
for i in $(apt search ceph | grep installed | awk -F/ '{print $1}'); do apt reinstall $i; done

Run the installer:
Code:
pveceph install


Should do the job!
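One way to confirm the reinstall actually brought the shared library back (the library name is taken from the error shown later in this thread):

```shell
# ldconfig -p lists everything in the dynamic linker cache, so a match
# here means the library pveceph complained about is available again.
lib="libceph-common.so.0"
if ldconfig -p 2>/dev/null | grep -q "$lib"; then
    status="present"
else
    status="missing"
fi
echo "$lib is $status"
```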
 
It appears that after doing a purge or removing ceph ceph-mon ceph-osd, one of the shared libraries physically goes bye-bye, but the environment still thinks the library is present.
Do you recall, which one it was?
 
Code:
root@pve1:/usr/lib/x86_64-linux-gnu/perl5/5.28/auto/PVE/RADOS# pveceph init



Can't load '/usr/lib/x86_64-linux-gnu/perl5/5.28/auto/PVE/RADOS/RADOS.so' for module PVE::RADOS: libceph-common.so.0: cannot open shared object file: No such file or directory at /usr/lib/x86_64-linux-gnu/perl/5.28/DynaLoader.pm line 187, <DATA> line 755.
at /usr/share/perl5/PVE/Storage/RBDPlugin.pm line 13.
Compilation failed in require at /usr/share/perl5/PVE/Storage/RBDPlugin.pm line 13, <DATA> line 755.
BEGIN failed--compilation aborted at /usr/share/perl5/PVE/Storage/RBDPlugin.pm line 13, <DATA> line 755.
Compilation failed in require at /usr/share/perl5/PVE/Storage.pm line 32, <DATA> line 755.
BEGIN failed--compilation aborted at /usr/share/perl5/PVE/Storage.pm line 32, <DATA> line 755.
Compilation failed in require at /usr/share/perl5/PVE/CLI/pveceph.pm line 17, <DATA> line 755.
BEGIN failed--compilation aborted at /usr/share/perl5/PVE/CLI/pveceph.pm line 17, <DATA> line 755.
Compilation failed in require at /usr/bin/pveceph line 6, <DATA> line 755.
BEGIN failed--compilation aborted at /usr/bin/pveceph line 6, <DATA> line 755.
root@pve1:/usr/lib/x86_64-linux-gnu/perl5/5.28/auto/PVE/RADOS# ldconfig -v | grep libceph
ldconfig: Can't stat /usr/local/lib/x86_64-linux-gnu: No such file or directory
ldconfig: Path `/usr/lib/x86_64-linux-gnu' given more than once
ldconfig: Path `/lib/x86_64-linux-gnu' given more than once
ldconfig: Path `/usr/lib/x86_64-linux-gnu' given more than once
ldconfig: Path `/usr/lib' given more than once
ldconfig: /lib/x86_64-linux-gnu/ld-2.28.so is the dynamic linker, ignoring

        libcephfs.so.2 -> libcephfs.so.2.0.0

I believe it was: libceph-common.so.0
Which is part of: librados2_14.2.4.1-pve1_amd64.deb
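If anyone hits the same symptom, dpkg can map the missing file back to the package that ships it, so only that one package needs a reinstall. A small sketch:

```shell
# dpkg -S maps a file (or name fragment) back to its owning package.
# On a healthy PVE node this reports librados2; with no match it prints
# the fallback text instead.
owner=$(dpkg -S libceph-common.so.0 2>/dev/null | head -n 1)
result="${owner:-no package found}"
echo "$result"
```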
 
