Cannot re-add OSD

Fred Saunier

Hello Forum,

I cannot re-add an OSD to Ceph; I get the following error message:
Code:
auth: unable to find a keyring on /etc/pve/priv/ceph.client.bootstrap-osd.keyring: (2) No such file or directory

I have found and read a number of posts related to this issue and tried the solutions they offer, but none of them has worked.
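
If I read the error correctly, pveceph fails because it cannot authenticate as client.bootstrap-osd: the keyring it expects at /etc/pve/priv/ceph.client.bootstrap-osd.keyring simply isn't there. I assume (but am not certain) that the first things to check are whether that identity still exists on the monitors at all, and whether the file really is missing:
Bash:
# does the cluster still know the bootstrap-osd identity?
sudo ceph auth get client.bootstrap-osd

# is the keyring file really absent from the shared Proxmox config filesystem?
sudo ls -l /etc/pve/priv/ | grep bootstrap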

My environment:
proxmox 6.4-13
ceph 15.2.14

[Edit on: Adding some circumstantial information]
Soon after upgrading Ceph from 14.2.22 to 15.2.14, I unfortunately suffered a complete loss of all 3 monitors. I had to recover store.db from the OSDs, as explained in this post. OSD.7 was found but was oddly reported as using filestore instead of bluestore (which it didn't), and was ejected by Ceph. Once the data rebalance was over, I decided to destroy OSD.7 and re-add it properly to Ceph.
[/Edit off]

The disk is /dev/sdb, which was previously used as osd.7.

What I have tried so far:
Bash:
# wipe the LVM volumes that ceph-volume created on the disk and destroy the device mapping
sudo ceph-volume lvm zap /dev/sdb --destroy
# delete the old OSD's cephx key from the cluster
sudo ceph auth del osd.7
# zap the GPT and MBR data structures on the disk
sudo sgdisk -Z /dev/sdb
# remove osd.7 from the CRUSH map and the Proxmox configuration, cleaning up the disk
sudo pveceph osd destroy 7 --cleanup

I have also created a partition with gdisk, then removed it.

The ceph-7 directory is indeed gone from /var/lib/ceph/osd:
Code:
sudo ls -l /var/lib/ceph/osd
total 0
drwxrwxrwt 2 ceph ceph 200 sept.  2 09:59 ceph-15
drwxrwxrwt 2 ceph ceph 200 sept.  2 09:59 ceph-16
drwxrwxrwt 2 ceph ceph 200 sept.  2 09:59 ceph-17
drwxrwxrwt 2 ceph ceph 200 sept.  2 09:59 ceph-18
drwxrwxrwt 2 ceph ceph 200 sept.  2 09:59 ceph-19

I have also rebooted the node, but that didn't help either.

What did I miss?

Code:
ceph -s
  cluster:
    id:     734b0a7e-2a96-4ea0-8a12-86d5bf965ca0
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum prox4,prox5,prox3 (age 17h)
    mgr: prox4(active, since 17h), standbys: prox5, prox3
    osd: 27 osds: 26 up (since 97m), 26 in (since 3h); 25 remapped pgs
 
  data:
    pools:   3 pools, 545 pgs
    objects: 5.73M objects, 21 TiB
    usage:   64 TiB used, 78 TiB / 142 TiB avail
    pgs:     220876/17187911 objects misplaced (1.285%)
             520 active+clean
             21  active+remapped+backfill_wait
             4   active+remapped+backfilling
 
  io:
    client:   6.3 KiB/s wr, 0 op/s rd, 0 op/s wr
    recovery: 86 MiB/s, 21 objects/s

Code:
ceph osd df tree
ID   CLASS  WEIGHT     REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME     
 -1         143.72264         -  142 TiB   64 TiB   64 TiB  767 MiB  107 GiB   78 TiB  44.84  1.00    -          root default 
-13          27.29024         -   27 TiB   13 TiB   13 TiB  175 MiB   21 GiB   15 TiB  46.18  1.03    -              host prox1
 12    hdd    3.63869   1.00000  3.6 TiB  1.6 TiB  1.6 TiB   11 MiB  2.3 GiB  2.1 TiB  43.12  0.96   43      up          osd.12
 13    hdd    3.63869   1.00000  3.6 TiB  1.6 TiB  1.6 TiB   12 MiB  2.4 GiB  2.0 TiB  45.15  1.01   44      up          osd.13
 14    hdd    3.63869   1.00000  3.6 TiB  1.6 TiB  1.6 TiB   17 MiB  2.7 GiB  2.0 TiB  45.23  1.01   44      up          osd.14
 25    hdd    7.27739   1.00000  7.3 TiB  3.1 TiB  3.1 TiB   27 MiB  4.5 GiB  4.1 TiB  43.16  0.96   84      up          osd.25
 26    hdd    7.27739   1.00000  7.3 TiB  3.2 TiB  3.2 TiB   24 MiB  5.0 GiB  4.1 TiB  43.73  0.98   85      up          osd.26
 11    ssd    1.81940   1.00000  1.8 TiB  1.4 TiB  1.4 TiB   84 MiB  3.8 GiB  407 GiB  78.14  1.74   26      up          osd.11
-16          23.65155         -   24 TiB   11 TiB   11 TiB  168 MiB   19 GiB   12 TiB  48.49  1.08    -              host prox2
 16    hdd    3.63869   1.00000  3.6 TiB  1.7 TiB  1.7 TiB   16 MiB  2.8 GiB  1.9 TiB  47.79  1.07   46      up          osd.16
 17    hdd    3.63869   1.00000  3.6 TiB  1.8 TiB  1.8 TiB   19 MiB  2.9 GiB  1.9 TiB  48.87  1.09   47      up          osd.17
 18    hdd    7.27739   1.00000  7.3 TiB  3.4 TiB  3.4 TiB   27 MiB  5.1 GiB  3.9 TiB  46.78  1.04   90      up          osd.18
 19    hdd    7.27739   1.00000  7.3 TiB  3.4 TiB  3.4 TiB   29 MiB  5.4 GiB  3.9 TiB  46.17  1.03   89      up          osd.19
 15    ssd    1.81940   1.00000  1.8 TiB  1.2 TiB  1.2 TiB   76 MiB  3.2 GiB  647 GiB  65.28  1.46   21      up          osd.15
 -3          23.64957         -   22 TiB  9.6 TiB  9.6 TiB   72 MiB   15 GiB   12 TiB  44.01  0.98    -              host prox3
  9    hdd   10.91409   1.00000   11 TiB  4.8 TiB  4.8 TiB   35 MiB  7.3 GiB  6.1 TiB  43.76  0.98  127      up          osd.9
 10    hdd    7.27739   1.00000  7.3 TiB  3.2 TiB  3.2 TiB   25 MiB  4.7 GiB  4.1 TiB  43.60  0.97   84      up          osd.10
 21    hdd    3.63869   1.00000  3.6 TiB  1.7 TiB  1.7 TiB   13 MiB  2.7 GiB  2.0 TiB  45.58  1.02   44      up          osd.21
  2    ssd    1.81940         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down          osd.2
 -5          27.29024         -   27 TiB   12 TiB   12 TiB  129 MiB   23 GiB   15 TiB  44.98  1.00    -              host prox4
  5    hdd    3.63869   1.00000  3.6 TiB  1.5 TiB  1.5 TiB   15 MiB  3.1 GiB  2.1 TiB  41.92  0.93   40      up          osd.5
  6    hdd    3.63869   1.00000  3.6 TiB  1.4 TiB  1.4 TiB  8.9 MiB  2.4 GiB  2.2 TiB  39.16  0.87   39      up          osd.6
  8    hdd    3.63869   1.00000  3.6 TiB  1.7 TiB  1.7 TiB   15 MiB  3.1 GiB  2.0 TiB  46.33  1.03   43      up          osd.8
 23    hdd    7.27739   1.00000  7.3 TiB  3.2 TiB  3.1 TiB   23 MiB  5.3 GiB  4.1 TiB  43.29  0.97   83      up          osd.23
 27    hdd    7.27739   1.00000  7.3 TiB  3.1 TiB  3.1 TiB   26 MiB  5.8 GiB  4.1 TiB  43.14  0.96   82      up          osd.27
  1    ssd    1.81940   1.00000  1.8 TiB  1.3 TiB  1.3 TiB   40 MiB  3.8 GiB  481 GiB  74.19  1.65   24      up          osd.1
 -7          41.84105         -   42 TiB   18 TiB   18 TiB  223 MiB   28 GiB   24 TiB  42.25  0.94    -              host prox5
  3    hdd   10.91409   1.00000   11 TiB  4.5 TiB  4.5 TiB   35 MiB  6.6 GiB  6.4 TiB  41.03  0.91  117      up          osd.3
  4    hdd    3.63869   1.00000  3.6 TiB  1.5 TiB  1.5 TiB   16 MiB  2.4 GiB  2.2 TiB  40.41  0.90   38      up          osd.4
 20    hdd    7.27739   1.00000  7.3 TiB  3.0 TiB  3.0 TiB   21 MiB  4.5 GiB  4.3 TiB  41.02  0.91   79      up          osd.20
 22    hdd    7.27739   1.00000  7.3 TiB  2.9 TiB  2.9 TiB   24 MiB  4.2 GiB  4.4 TiB  39.44  0.88   75      up          osd.22
 24    hdd   10.91409   1.00000   11 TiB  4.4 TiB  4.4 TiB   32 MiB  6.4 GiB  6.6 TiB  39.97  0.89  114      up          osd.24
  0    ssd    1.81940   1.00000  1.8 TiB  1.5 TiB  1.5 TiB   96 MiB  4.2 GiB  314 GiB  83.14  1.85   26      up          osd.0
                          TOTAL  142 TiB   64 TiB   64 TiB  767 MiB  107 GiB   78 TiB  44.84                                   
MIN/MAX VAR: 0.87/1.85  STDDEV: 12.47

Code:
pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.128-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-5
pve-kernel-helper: 6.4-5
pve-kernel-5.4.128-1-pve: 5.4.128-2
pve-kernel-5.4.114-1-pve: 5.4.114-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 15.2.14-pve1~bpo10
ceph-fuse: 15.2.14-pve1~bpo10
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.5-pve1~bpo10+1
 
I tried adding another OSD on another host, and it also fails, returning this error:
Code:
stderr: [errno 13] RADOS permission denied (error connecting to the cluster)
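
In case it helps with the diagnosis: a "permission denied" from RADOS presumably means a key was found but rejected by the monitors, so my assumption is that the key in the keyring file no longer matches what the cluster has on record for client.bootstrap-osd. That could be compared like this (the path is the one from my first error message):
Bash:
# key the monitors currently hold for the bootstrap-osd identity
sudo ceph auth print-key client.bootstrap-osd; echo

# key stored in the keyring file that pveceph uses
sudo grep key /etc/pve/priv/ceph.client.bootstrap-osd.keyring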

How can this be fixed?
 
Is the cluster up and running to the point that the pools are readable and writeable? If so, I recommend either moving your VMs to some other temporary storage or creating backups of them, and then recreating the Ceph cluster from scratch: removing all Ceph services (MON, MGR, OSD, MDS), wiping the OSD disks, and running pveceph purge.
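
Roughly, and only once the data is safely backed up elsewhere, the teardown would look something like this on each node (the OSD IDs, node names, and device paths below are placeholders you need to adapt):
Bash:
# destroy every OSD on the node (removes it from CRUSH; --cleanup also wipes the disk)
pveceph osd destroy <osd-id> --cleanup

# remove the remaining services on the node
pveceph mds destroy <nodename>   # only if you run CephFS metadata servers
pveceph mon destroy <nodename>
pveceph mgr destroy <nodename>

# make sure the disks are really clean before reusing them
ceph-volume lvm zap /dev/sdX --destroy

# finally, remove the Ceph configuration from the Proxmox cluster
pveceph purge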

As you noticed, after losing all the monitors the cluster is in a somewhat odd state regarding the default (auth) settings. You could try to restore everything manually to the way it should be, but I personally would not trust myself to get it all right.
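
If you do want to attempt the manual route anyway, the piece your error message points at is the client.bootstrap-osd key. Assuming your admin keyring still works and that the default capabilities are the ones shown below, something along these lines should recreate the key at the path pveceph expects; treat it as a sketch rather than a tested procedure:
Bash:
# check which auth entities survived the monitor recovery
ceph auth ls | grep -A 3 bootstrap-osd

# recreate the bootstrap-osd identity with (what I believe are) its default capabilities, if it is gone
ceph auth get-or-create client.bootstrap-osd mon 'profile bootstrap-osd' mgr 'allow r'

# export it to the shared path that pveceph reads when creating OSDs
ceph auth get client.bootstrap-osd -o /etc/pve/priv/ceph.client.bootstrap-osd.keyring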
 
The cluster is indeed up and running, and the pools are writable.

How can I try to straighten out the auth settings manually so that OSDs can be created again?