PVE 8.1 / Reef: Unable to destroy OSD - "internal error: duplicate hostname found"

pwizard

New Member
Jan 9, 2023
Hello,

I want to remove an old, worn-out SSD, osd.48, from a 3-node Ceph Reef cluster of ours.
Node hostnames: proxstore11 / proxstore12 / proxstore13

Via the CLI I've reweighted osd.48 to 0.0.
Once its PG count reached 0, I used the Proxmox GUI to mark the OSD out and stop the ceph-osd@48 instance.
osd.48 is now shown as down/out in the GUI (with the overall Ceph health status still OK) and the "Destroy" button is no longer greyed out. I use it and keep the "Cleanup Disks" option enabled, but clicking "Remove" leads to an immediate error 500: "internal error: duplicate hostname found: proxstore11"
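
For reference, the drain step was roughly the following (a sketch from memory - the exact commands may have differed slightly):

Code:
# drop the CRUSH weight so Ceph migrates all data off the OSD
ceph osd crush reweight osd.48 0
# wait until the PGS column for osd.48 reaches 0
ceph osd df tree
# optional extra check before destroying it
ceph osd safe-to-destroy osd.48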

Why does pveceph think so? Which file should I check for duplicates?

Thanks,

Patrick
 
I've tried to take a look at pveceph.pm (or, in this case, PVE/API2/Ceph/OSD.pm, where the die() call originates) - am I correct that the output below is the issue, with all 3 nodes appearing twice because they are listed once under the "function pluto" bucket and once under "root default"?

What is this sanity check in the API for, and what kind of issue does it prevent? Is it correct for it to fire in our scenario, and how else would we destroy osd.38/osd.43/osd.48?

Code:
# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME                STATUS  REWEIGHT  PRI-AFF
-28         68.26454  function pluto                                  
-25         25.18105      host proxstore11                          
  3    hdd   3.63869          osd.3                up   1.00000  1.00000
 44    hdd   3.63899          osd.44               up   1.00000  1.00000
 45    hdd   3.63899          osd.45               up   1.00000  1.00000
 46    hdd   3.63899          osd.46               up   1.00000  1.00000
 47    hdd   3.63899          osd.47               up   1.00000  1.00000
  0    ssd   1.74660          osd.0                up   1.00000  1.00000
  8    ssd   1.74660          osd.8                up   1.00000  1.00000
  9    ssd   1.74660          osd.9                up   1.00000  1.00000
 14    ssd   1.74660          osd.14               up   1.00000  1.00000
 48    ssd         0          osd.48             down         0  1.00000
-22         21.54175      host proxstore12                          
  5    hdd   3.63869          osd.5                up   1.00000  1.00000
  6    hdd   3.63869          osd.6                up   1.00000  1.00000
 40    hdd   3.63899          osd.40               up   1.00000  1.00000
 41    hdd   3.63899          osd.41               up   1.00000  1.00000
  1    ssd   1.74660          osd.1                up   1.00000  1.00000
 10    ssd   1.74660          osd.10               up   1.00000  1.00000
 11    ssd   1.74660          osd.11               up   1.00000  1.00000
 15    ssd   1.74660          osd.15               up   1.00000  1.00000
 43    ssd         0          osd.43             down         0  1.00000
-13         21.54175      host proxstore13                          
  4    hdd   3.63869          osd.4                up   1.00000  1.00000
  7    hdd   3.63869          osd.7                up   1.00000  1.00000
 35    hdd   3.63899          osd.35               up   1.00000  1.00000
 36    hdd   3.63899          osd.36               up   1.00000  1.00000
  2    ssd   1.74660          osd.2                up   1.00000  1.00000
 12    ssd   1.74660          osd.12               up   1.00000  1.00000
 13    ssd   1.74660          osd.13               up   1.00000  1.00000
 16    ssd   1.74660          osd.16               up   1.00000  1.00000
 38    ssd         0          osd.38             down         0  1.00000
 -1         68.26454  root default                                    
-25         25.18105      host proxstore11                          
  3    hdd   3.63869          osd.3                up   1.00000  1.00000
 44    hdd   3.63899          osd.44               up   1.00000  1.00000
 45    hdd   3.63899          osd.45               up   1.00000  1.00000
 46    hdd   3.63899          osd.46               up   1.00000  1.00000
 47    hdd   3.63899          osd.47               up   1.00000  1.00000
  0    ssd   1.74660          osd.0                up   1.00000  1.00000
  8    ssd   1.74660          osd.8                up   1.00000  1.00000
  9    ssd   1.74660          osd.9                up   1.00000  1.00000
 14    ssd   1.74660          osd.14               up   1.00000  1.00000
 48    ssd         0          osd.48             down         0  1.00000
-22         21.54175      host proxstore12                          
  5    hdd   3.63869          osd.5                up   1.00000  1.00000
  6    hdd   3.63869          osd.6                up   1.00000  1.00000
 40    hdd   3.63899          osd.40               up   1.00000  1.00000
 41    hdd   3.63899          osd.41               up   1.00000  1.00000
  1    ssd   1.74660          osd.1                up   1.00000  1.00000
 10    ssd   1.74660          osd.10               up   1.00000  1.00000
 11    ssd   1.74660          osd.11               up   1.00000  1.00000
 15    ssd   1.74660          osd.15               up   1.00000  1.00000
 43    ssd         0          osd.43             down         0  1.00000
-13         21.54175      host proxstore13                          
  4    hdd   3.63869          osd.4                up   1.00000  1.00000
  7    hdd   3.63869          osd.7                up   1.00000  1.00000
 35    hdd   3.63899          osd.35               up   1.00000  1.00000
 36    hdd   3.63899          osd.36               up   1.00000  1.00000
  2    ssd   1.74660          osd.2                up   1.00000  1.00000
 12    ssd   1.74660          osd.12               up   1.00000  1.00000
 13    ssd   1.74660          osd.13               up   1.00000  1.00000
 16    ssd   1.74660          osd.16               up   1.00000  1.00000
 38    ssd         0          osd.38             down         0  1.00000
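
To see the duplication at a glance rather than scanning the whole tree, something like this over the plain-text tree output lists each host bucket and how often it occurs (here proxstore11/12/13 each show up twice):

Code:
# buckets have an empty CLASS column, so "host" is field 3 and the name is field 4
ceph osd tree | awk '$3 == "host" {print $4}' | sort | uniq -c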
 
I've now done the steps manually that pveceph / API2 would have taken, with no negative effects:
osd.38 / osd.43 / osd.48 have been purged, the ceph config parameters removed and the tmpfs unmounted.
Given that Luminous introduced "ceph osd purge", I used that instead of 3 separate commands to achieve the same thing - maybe OSD.pm can / should be improved to make use of "purge"?
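
For context, my understanding is that "purge" rolls up the three separate commands that older removal guides list (shown here for osd.48 as an example):

Code:
# pre-Luminous this took three separate commands:
#   ceph osd crush remove osd.48
#   ceph auth del osd.48
#   ceph osd rm 48
# since Luminous, one command covers all three:
ceph osd purge 48 --yes-i-really-mean-it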

Anyway, the issue is "solved" for us. I'm not sure whether the die() that prevented us from using the pveceph tooling is overly paranoid, or whether it's right to bail out, since "pveceph create" commands would never lead to duplicate hostname entries in the CRUSH tree and you don't want to execute destructive pveceph commands on a setup that wasn't built purely with pveceph. But our OSDs are gone and that's what counts.

Best regards,

Patrick
 
Here's the complete list:
Code:
ceph osd purge 48 --yes-i-really-mean-it
ceph config rm osd.48 osd_mclock_max_capacity_iops_ssd
systemctl disable ceph-osd@48.service
umount /var/lib/ceph/osd/ceph-48/

The last 2 commands need to be executed on the Ceph node that hosts the OSD, while the first 2 can be executed on any Ceph node in the cluster.

Also, you might need to change "ssd" to "hdd" at the very end of command #2 - in our case it was an SSD, which is why it reads "ssd".
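
To double-check afterwards that nothing was left behind, something along these lines works (again with osd.48 as the example):

Code:
# all three should come back empty once the OSD is completely gone
ceph osd tree    | grep 'osd\.48'
ceph config dump | grep 'osd\.48'
ceph auth ls     | grep 'osd\.48'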