[SOLVED] Help! Ceph access totally broken

We have a quite serious problem over here.
We added a new node to our cluster. However, when installing Ceph on the new node, we ran into an issue because the VLANs for the Ceph and OSD networks could not communicate correctly (a network problem).

As a result, we tried to uninstall Ceph again:


Obviously that was a bad idea, as /etc/ceph was deleted. Now we no longer have access to the monitors, the configuration, the GUI or the ceph CLI commands.
The OSDs are still working; it is also possible to migrate/create new machines and restart them.

When accessing the configuration via the GUI I get the error:

rados_connect failed - Permission denied (500)


And I get a similar error when checking the Ceph status via the CLI:

Code:
root@pvecloud01:/etc/ceph# ceph service status
2025-02-21T11:31:41.903+0100 7cad7fe006c0 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
2025-02-21T11:31:41.903+0100 7cad850006c0 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
[errno 13] RADOS permission denied (error connecting to the cluster)

Any idea how to fix this?
 
Reinstall Ceph. Not the whole Proxmox, just Ceph inside it.
And restore backups from PBS or whatever you have.

This is usually the procedure for it:
systemctl stop ceph-mon.target
systemctl stop ceph-mgr.target
systemctl stop ceph-mds.target
systemctl stop ceph-osd.target
rm -rf /etc/systemd/system/ceph*
killall -9 ceph-mon ceph-mgr ceph-mds
rm -rf /var/lib/ceph/mon/ /var/lib/ceph/mgr/ /var/lib/ceph/mds/
pveceph purge
apt purge ceph-mon ceph-osd ceph-mgr ceph-mds -y
apt purge ceph-base ceph-mgr-modules-core -y
rm -rf /etc/ceph/*
rm -rf /etc/pve/ceph.conf
rm -rf /etc/pve/priv/ceph.*


lvremove -y /dev/ceph*
vgremove -y ceph-<press-tab-for-bash-completion>
pvremove /dev/nvme1n1

mv /var/lib/ceph/bootstrap-osd /var/lib/ceph/bootstrap-osd.old
mkdir /var/lib/ceph/bootstrap-osd
chown ceph /var/lib/ceph/* -R
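
To get Ceph going again afterwards, the usual pveceph setup steps would apply; a rough sketch (network CIDR and device name are placeholders, not values from this cluster):

Code:
# reinstall the packages and re-initialise Ceph on the node
pveceph install
pveceph init --network 10.10.10.0/24
pveceph mon create
pveceph mgr create
# recreate the OSDs on the (now wiped) disks, then restore the VMs from backup
pveceph osd create /dev/nvme1n1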


And get a Ceph consultant in the future.
 
Did you run those commands on all nodes, or "just" the problematic one? If the latter, then you might just need to recreate /etc/pve/ceph.conf and the keys with the correct information. Do you have backups of /etc/pve somewhere?
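
For reference, a recreated /etc/pve/ceph.conf does not need much; a minimal sketch (fsid, networks and monitor addresses below are placeholders, not this cluster's values):

Code:
# /etc/pve/ceph.conf -- minimal example, all values are placeholders
[global]
    auth_client_required = cephx
    auth_cluster_required = cephx
    auth_service_required = cephx
    cluster_network = 10.10.20.0/24
    public_network = 10.10.10.0/24
    fsid = 00000000-0000-0000-0000-000000000000
    mon_host = 10.10.10.1 10.10.10.2 10.10.10.3

[client]
    keyring = /etc/pve/priv/$cluster.$name.keyring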
 
Reinstall Ceph. Not the whole Proxmox, just Ceph inside it.
And restore backups from PBS or whatever you have.

This is usually the procedure for it:
systemctl stop ceph-mon.target
systemctl stop ceph-mgr.target
systemctl stop ceph-mds.target
systemctl stop ceph-osd.target
rm -rf /etc/systemd/system/ceph*
killall -9 ceph-mon ceph-mgr ceph-mds
rm -rf /var/lib/ceph/mon/ /var/lib/ceph/mgr/ /var/lib/ceph/mds/
pveceph purge
apt purge ceph-mon ceph-osd ceph-mgr ceph-mds -y
apt purge ceph-base ceph-mgr-modules-core -y
rm -rf /etc/ceph/*
rm -rf /etc/pve/ceph.conf
rm -rf /etc/pve/priv/ceph.*


lvremove -y /dev/ceph*
vgremove -y ceph-<press-tab-for-bash-completion>
pvremove /dev/nvme1n1

mv /var/lib/ceph/bootstrap-osd /var/lib/ceph/bootstrap-osd.old
mkdir /var/lib/ceph/bootstrap-osd
chown ceph /var/lib/ceph/* -R


And get a Ceph consultant in the future.

Thank you for your fast response. For now the datastores and OSDs are all working, but the OSDs are not shown in Ceph.

If we use your procedure, will the OSDs remain in the config, or do we face a total data loss? We have backups, but that cluster holds about 40 TB of VMs. The loss of working time would be enormous.
 
Did you run those commands on all nodes, or "just" the problematic one? If the latter, then you might just need to recreate /etc/pve/ceph.conf and the keys with the correct information. Do you have backups of /etc/pve somewhere?
Sadly I don't have a backup of the host config. I ran the commands on my first node, but I get the error on every node.
I was able to write a new ceph.conf (I still had the fsid).

The only thing that seems to be missing at the moment is authorisation. Is it possible to create new keys and apply them?

I still had the /etc/pve/priv/ceph.client.admin.keyring and copied it to /etc/ceph/. But I still can't execute any ceph commands because of the error.
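
For anyone following along: the ceph CLI looks for the admin keyring under a fixed name in /etc/ceph, so it is worth double-checking the copied file, its permissions, and that /etc/ceph/ceph.conf exists (on PVE it is normally a symlink to the clustered config). A rough sketch:

Code:
# the CLI expects this exact name for the default cluster "ceph"
cp /etc/pve/priv/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring
chmod 600 /etc/ceph/ceph.client.admin.keyring
# /etc/ceph/ceph.conf is usually just a symlink to /etc/pve/ceph.conf
ln -sf /etc/pve/ceph.conf /etc/ceph/ceph.conf
# show the key so it can later be compared against what the monitors hold
cat /etc/ceph/ceph.client.admin.keyring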
 
Sadly I don't have a backup of the host config. I ran the commands on my first node, but I get the error on every node.
I was able to write a new ceph.conf (I still had the fsid).

The only thing that seems to be missing at the moment is authorisation. Is it possible to create new keys and apply them?

I still had the /etc/pve/priv/ceph.client.admin.keyring and copied it to /etc/ceph/. But I still can't execute any ceph commands because of the error.

Are the monitors running again now? You should be able to recreate the keys with the admin keyring.
 
Are the monitors running again now? You should be able to recreate the keys with the admin keyring.
The monitor is up and running.

Code:
systemctl status ceph-mon@pvecloud01.service
● ceph-mon@pvecloud01.service - Ceph cluster monitor daemon
     Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Fri 2025-02-21 10:34:33 CET; 2h 7min ago
   Main PID: 1138526 (ceph-mon)
      Tasks: 25
     Memory: 347.3M
        CPU: 4min 11.604s
     CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@pvecloud01.service
             └─1138526 /usr/bin/ceph-mon -f --cluster ceph --id pvecloud01 --setuser ceph --setgroup ceph

How do I recreate the monitor keys with the admin keyring?
 
With ceph auth list you should be able to query the mons for existing keys, and then with ceph auth get[-key] you can retrieve them.
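
Concretely, that would look something like this (client.admin used as an example; it does require a keyring that the monitors still accept):

Code:
# list all entities and their caps
ceph auth list
# write a specific keyring back to a file
ceph auth get client.admin -o /etc/ceph/ceph.client.admin.keyring
# or print only the secret
ceph auth get-key client.admin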
 
Same here:

Code:
# ceph auth list
2025-02-21T12:51:32.234+0100 74b2750006c0 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
[errno 13] RADOS permission denied (error connecting to the cluster)
 
Is your /var/lib/ceph/mon/* empty?
No, it still contains the following data:


Code:
/var/lib/ceph/mon/ceph-pvecloud01#

-rw------- 1 ceph ceph  9 Feb 21 12:56 external_log_to
-rw------- 1 ceph ceph 77 Aug 11  2022 keyring
-rw------- 1 ceph ceph  8 Aug 11  2022 kv_backend
-rw------- 1 ceph ceph  6 Feb 20 20:43 min_mon_release
drwxr-xr-x 2 ceph ceph 11 Feb 21 12:56 store.db
 
Go into store.db. You will find .sst files. Copy one to your PC, open it with Notepad++ or another editor, and search for "key =". That should be your admin key.
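
Instead of copying the .sst files to a PC, something like this on the node itself may also turn up the key (a rough sketch; whether the entry is readable depends on how the store is compacted):

Code:
# look for keyring entries embedded in the monitor store
strings /var/lib/ceph/mon/ceph-pvecloud01/store.db/*.sst | grep 'key = '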
 
Does it give a good result? # ceph -n mon. --keyring /var/lib/ceph/mon/ceph-pvecloud01/keyring -s

This seems to work, sort of.


Code:
# ceph -n mon. --keyring /var/lib/ceph/mon/ceph-pvecloud01/keyring -s
2025-02-21T17:45:17.161+0100 76a6658006c0 -1 auth: unable to find a keyring on /etc/ceph/ceph.mon..keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
2025-02-21T17:45:17.161+0100 76a6658006c0 -1 AuthRegistry(0x76a660065920) no keyring found at /etc/ceph/ceph.mon..keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
  cluster:
    id:     de7290f6-faac-4cfa-8569-502fae22c3ca
    health: HEALTH_WARN
            all OSDs are running squid or later but require_osd_release < squid
            1 subtrees have overcommitted pool target_size_bytes
            6 daemons have recently crashed
            too many PGs per OSD (320 > max 250)

  services:
    mon: 3 daemons, quorum pvecloud01,pvecloud02,pvecloud03 (age 7h)
    mgr: pvecloud01(active, since 7h), standbys: pvecloud02, pvecloud03
    osd: 15 osds: 15 up (since 7h), 15 in (since 9d)

  data:
    pools:   44 pools, 1601 pgs
    objects: 5.43M objects, 21 TiB
    usage:   60 TiB used, 44 TiB / 105 TiB avail
    pgs:     1601 active+clean

  io:
    client:   920 KiB/s rd, 41 MiB/s wr, 74 op/s rd, 659 op/s wr

The warning messages were already known before.
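
Since authenticating as mon. works, a likely next step (not shown in the thread) is to re-export the lost keyrings with that identity, roughly:

Code:
# pull the admin keyring back out of the cluster using the monitor's own key
ceph -n mon. --keyring /var/lib/ceph/mon/ceph-pvecloud01/keyring \
    auth get client.admin -o /etc/ceph/ceph.client.admin.keyring
# and restore the copy that Proxmox keeps in the cluster filesystem
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph.client.admin.keyring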