[SOLVED] OSD does not exists on host (500)

Hi,

We have several Proxmox hosts that are clustered together and share a Ceph storage.
Technically Ceph works, but for one specific host this error message always appears on all of its OSDs (no matter which host's web UI we use):

OSD '29' does not exist on host 'SRV-Host' (500)
[Screenshot: the error as shown in the web UI]

The hostname configuration differs between the hosts, and I suspect that this is what causes the problem:

The host that doesn't display its OSDs correctly:

Code:
ceph osd metadata 29
{
    "id": 29,
...
    "hostname": "SRV-Host.DOMAIN.com",
...
}

root@SRV-Host:~# hostname
SRV-Host.DOMAIN.com

A host that does display its OSDs correctly:

Code:
ceph osd metadata 1
{
    "id": 1,
...
    "hostname": "HW-PX03",
...
}

root@HW-PX02:~# hostname
HW-PX02

Here I saw that the message comes from comparing the hostnames:

Code:
+ die "OSD '${osdid}' does not exists on host '${nodename}'\n"
+     if $nodename ne $metadata->{hostname};

https://lists.proxmox.com/pipermail/pve-devel/2022-December/055118.html
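
If anyone wants to check those two values on their own node, something like this should do it (just a sketch; assumes jq is installed, and OSD 29 is simply the one from above):

Code:
# the short hostname -- this is what PVE uses as the node name
hostname -s
# the hostname Ceph recorded in the OSD metadata
ceph osd metadata 29 | jq -r '.hostname'
# the web UI only lists the OSD if the two values match exactly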


Can I somehow fix this problem without reinstalling or is there no way around it?

Versions: Ceph 17.2.6 (currently updating to 17.2.7), Proxmox VE kernel 5.15.131-3 / pve-manager 7.4-17 (currently updating to kernel 5.15.143-1)


Best regards,
Yannik
 
Can you post the output of ceph osd df tree and maybe also the Crushmap? The Crushmap is the right half in the Ceph -> Configuration panel.
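
If the panel is awkward to copy from, the CLI should give the same information (a sketch; crushtool should come with the Ceph packages):

Code:
ceph osd df tree
# dump and decompile the CRUSH map, same content as the right half of the panel
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
cat crushmap.txt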
 
Sure.

“SRV-Host” is always used there, not the hostname with the domain, which is why I suspect this is what leads to the problem.

Code:
root@SRV-Host:~# ceph osd df tree
ID   CLASS  WEIGHT     REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
 -1         534.79193         -  535 TiB  330 TiB  329 TiB  1.1 GiB  740 GiB  205 TiB  61.62  1.00    -          root default
 -3          98.22656         -   98 TiB   55 TiB   55 TiB  224 MiB  123 GiB   43 TiB  56.24  0.91    -              host HW-PX01
  0    hdd   16.37109   1.00000   16 TiB  9.7 TiB  9.7 TiB   15 MiB   22 GiB  6.6 TiB  59.54  0.97  158      up          osd.0
  5    hdd   16.37109   1.00000   16 TiB  8.9 TiB  8.9 TiB   48 MiB   20 GiB  7.4 TiB  54.54  0.89  150      up          osd.5
  7    hdd   16.37109   1.00000   16 TiB  8.9 TiB  8.9 TiB   33 MiB   20 GiB  7.5 TiB  54.37  0.88  151      up          osd.7
  9    hdd   16.37109   1.00000   16 TiB  9.6 TiB  9.6 TiB   28 MiB   22 GiB  6.8 TiB  58.70  0.95  155      up          osd.9
 12    hdd   16.37109   1.00000   16 TiB  8.3 TiB  8.3 TiB   44 MiB   19 GiB  8.0 TiB  50.83  0.82  151      up          osd.12
 16    hdd   16.37109   1.00000   16 TiB  9.7 TiB  9.7 TiB   56 MiB   22 GiB  6.6 TiB  59.47  0.97  164      up          osd.16
 -9          98.22656         -   98 TiB   55 TiB   55 TiB  171 MiB  123 GiB   43 TiB  56.07  0.91    -              host HW-PX02
  4    hdd   16.37109   1.00000   16 TiB  9.3 TiB  9.3 TiB   51 MiB   21 GiB  7.1 TiB  56.69  0.92  160      up          osd.4
  6    hdd   16.37109   1.00000   16 TiB  9.3 TiB  9.3 TiB   21 MiB   21 GiB  7.0 TiB  57.01  0.93  156      up          osd.6
  8    hdd   16.37109   1.00000   16 TiB  8.8 TiB  8.7 TiB   71 MiB   19 GiB  7.6 TiB  53.53  0.87  159      up          osd.8
 10    hdd   16.37109   1.00000   16 TiB  9.3 TiB  9.3 TiB   21 MiB   21 GiB  7.1 TiB  56.83  0.92  147      up          osd.10
 13    hdd   16.37109   1.00000   16 TiB  9.4 TiB  9.4 TiB  6.7 MiB   21 GiB  7.0 TiB  57.43  0.93  152      up          osd.13
 15    hdd   16.37109   1.00000   16 TiB  9.0 TiB  9.0 TiB    7 KiB   20 GiB  7.4 TiB  54.92  0.89  154      up          osd.15
 -5          81.85547         -   82 TiB   55 TiB   54 TiB  159 MiB  123 GiB   27 TiB  66.68  1.08    -              host HW-PX03
  1    hdd   16.37109   1.00000   16 TiB   12 TiB   12 TiB   49 MiB   27 GiB  4.3 TiB  73.95  1.20  184      up          osd.1
  2    hdd   16.37109   1.00000   16 TiB   11 TiB   11 TiB   17 MiB   24 GiB  5.6 TiB  66.09  1.07  174      up          osd.2
  3    hdd   16.37109   1.00000   16 TiB   10 TiB   10 TiB  6.1 MiB   23 GiB  6.1 TiB  62.88  1.02  170      up          osd.3
 11    hdd   16.37109   1.00000   16 TiB   11 TiB   11 TiB   53 MiB   24 GiB  5.6 TiB  66.02  1.07  173      up          osd.11
 14    hdd   16.37109   1.00000   16 TiB   11 TiB   11 TiB   35 MiB   24 GiB  5.8 TiB  64.46  1.05  163      up          osd.14
-11          83.67477         -   84 TiB   55 TiB   55 TiB  152 MiB  121 GiB   29 TiB  65.60  1.06    -              host HW-PX04
 17    hdd   16.37109   1.00000   16 TiB   10 TiB   10 TiB   29 MiB   23 GiB  6.2 TiB  62.05  1.01  165      up          osd.17
 19    hdd   16.37109   1.00000   16 TiB   10 TiB   10 TiB   12 MiB   23 GiB  6.1 TiB  62.49  1.01  168      up          osd.19
 21    hdd   16.37109   1.00000   16 TiB   12 TiB   12 TiB   25 MiB   26 GiB  4.6 TiB  71.98  1.17  179      up          osd.21
 23    hdd   16.37109   1.00000   16 TiB   10 TiB   10 TiB   44 MiB   23 GiB  6.1 TiB  62.50  1.01  168      up          osd.23
 32    hdd   18.19040   1.00000   18 TiB   12 TiB   12 TiB   42 MiB   26 GiB  5.7 TiB  68.62  1.11  193      up          osd.32
 -7          83.67477         -   84 TiB   55 TiB   55 TiB  207 MiB  120 GiB   29 TiB  65.46  1.06    -              host HW-PX05
 18    hdd   16.37109   1.00000   16 TiB   10 TiB   10 TiB   36 MiB   23 GiB  6.2 TiB  62.09  1.01  169      up          osd.18
 20    hdd   16.37109   1.00000   16 TiB   11 TiB   11 TiB   28 MiB   24 GiB  5.5 TiB  66.22  1.07  172      up          osd.20
 22    hdd   16.37109   1.00000   16 TiB   11 TiB   11 TiB   35 MiB   25 GiB  5.1 TiB  68.93  1.12  184      up          osd.22
 24    hdd   16.37109   1.00000   16 TiB   10 TiB   10 TiB   86 MiB   23 GiB  6.3 TiB  61.69  1.00  169      up          osd.24
 33    hdd   18.19040   1.00000   18 TiB   12 TiB   12 TiB   22 MiB   24 GiB  5.8 TiB  68.08  1.10  192      up          osd.33
-13          89.13379         -   89 TiB   55 TiB   55 TiB  189 MiB  130 GiB   34 TiB  61.70  1.00    -              host SRV-Host
 25    hdd   12.73340   1.00000   13 TiB  7.3 TiB  7.3 TiB   35 MiB   18 GiB  5.4 TiB  57.49  0.93  121      up          osd.25
 26    hdd   12.73340   1.00000   13 TiB  7.7 TiB  7.7 TiB   22 MiB   17 GiB  5.0 TiB  60.59  0.98  127      up          osd.26
 27    hdd   12.73340   1.00000   13 TiB  8.0 TiB  7.9 TiB   22 MiB   19 GiB  4.8 TiB  62.57  1.02  132      up          osd.27
 28    hdd   12.73340   1.00000   13 TiB  8.2 TiB  8.2 TiB   19 MiB   19 GiB  4.6 TiB  64.18  1.04  133      up          osd.28
 29    hdd   12.73340   1.00000   13 TiB  7.2 TiB  7.2 TiB   20 MiB   17 GiB  5.5 TiB  56.45  0.92  122      up          osd.29
 30    hdd   12.73340   1.00000   13 TiB  8.3 TiB  8.3 TiB    6 KiB   20 GiB  4.4 TiB  65.51  1.06  130      up          osd.30
 31    hdd   12.73340   1.00000   13 TiB  8.3 TiB  8.3 TiB   71 MiB   20 GiB  4.4 TiB  65.14  1.06  134      up          osd.31
                          TOTAL  535 TiB  330 TiB  329 TiB  1.1 GiB  740 GiB  205 TiB  61.62
MIN/MAX VAR: 0.82/1.20  STDDEV: 5.39

Crush Map:

Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd
device 24 osd.24 class hdd
device 25 osd.25 class hdd
device 26 osd.26 class hdd
device 27 osd.27 class hdd
device 28 osd.28 class hdd
device 29 osd.29 class hdd
device 30 osd.30 class hdd
device 31 osd.31 class hdd
device 32 osd.32 class hdd
device 33 osd.33 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host HW-PX01 {
    id -3        # do not change unnecessarily
    id -4 class hdd        # do not change unnecessarily
    # weight 98.22656
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 16.37109
    item osd.5 weight 16.37109
    item osd.7 weight 16.37109
    item osd.9 weight 16.37109
    item osd.12 weight 16.37109
    item osd.16 weight 16.37109
}
host HW-PX03 {
    id -5        # do not change unnecessarily
    id -6 class hdd        # do not change unnecessarily
    # weight 81.85547
    alg straw2
    hash 0    # rjenkins1
    item osd.1 weight 16.37109
    item osd.3 weight 16.37109
    item osd.2 weight 16.37109
    item osd.11 weight 16.37109
    item osd.14 weight 16.37109
}
host HW-PX05 {
    id -7        # do not change unnecessarily
    id -8 class hdd        # do not change unnecessarily
    # weight 83.67477
    alg straw2
    hash 0    # rjenkins1
    item osd.18 weight 16.37109
    item osd.20 weight 16.37109
    item osd.22 weight 16.37109
    item osd.24 weight 16.37109
    item osd.33 weight 18.19040
}
host HW-PX02 {
    id -9        # do not change unnecessarily
    id -10 class hdd        # do not change unnecessarily
    # weight 98.22656
    alg straw2
    hash 0    # rjenkins1
    item osd.4 weight 16.37109
    item osd.6 weight 16.37109
    item osd.8 weight 16.37109
    item osd.10 weight 16.37109
    item osd.13 weight 16.37109
    item osd.15 weight 16.37109
}
host HW-PX04 {
    id -11        # do not change unnecessarily
    id -12 class hdd        # do not change unnecessarily
    # weight 83.67477
    alg straw2
    hash 0    # rjenkins1
    item osd.17 weight 16.37109
    item osd.19 weight 16.37109
    item osd.21 weight 16.37109
    item osd.23 weight 16.37109
    item osd.32 weight 18.19040
}
host SRV-Host {
    id -13        # do not change unnecessarily
    id -14 class hdd        # do not change unnecessarily
    # weight 89.13379
    alg straw2
    hash 0    # rjenkins1
    item osd.25 weight 12.73340
    item osd.26 weight 12.73340
    item osd.27 weight 12.73340
    item osd.28 weight 12.73340
    item osd.29 weight 12.73340
    item osd.30 weight 12.73340
    item osd.31 weight 12.73340
}
root default {
    id -1        # do not change unnecessarily
    id -2 class hdd        # do not change unnecessarily
    # weight 534.79193
    alg straw2
    hash 0    # rjenkins1
    item HW-PX01 weight 98.22656
    item HW-PX03 weight 81.85547
    item HW-PX05 weight 83.67477
    item HW-PX02 weight 98.22656
    item HW-PX04 weight 83.67477
    item SRV-Host weight 89.13379
}

# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule erasure-code {
    id 1
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step chooseleaf indep 0 type host
    step emit
}
rule ecrulek2m1 {
    id 2
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step chooseleaf indep 0 type host
    step emit
}
rule ecrulek4m2 {
    id 3
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step chooseleaf indep 0 type host
    step emit
}

# end crush map
 
Hmm, are all OSDs on host SRV-Host showing the issue?
Yes, all of them.

The current situation is that SRV-Host is the only host configured with an FQDN; the others only have a short hostname, yet the FQDN is entered everywhere for SRV-Host.
This one server was installed a year before all the others and uses btrfs instead of ZFS for the boot partition, but I don't think that should be a problem.


So both the short hostname and the FQDN apparently end up in that code block, even though that is not expected:
$nodename is 'SRV-Host' (as shown in the error message) and corresponds to the short hostname, while $metadata->{hostname} probably corresponds to the FQDN, which is also what "ceph osd metadata" outputs.


One thing I also noticed: if I create a monitor (freshly created) on that host, it always has an unknown status, even though it is running. This may be a similar problem:

[Screenshot: the new monitor is running but shows an unknown status in the web UI]
 
Can you compare the /etc/hosts and /etc/hostname files between SRV-Host and another node? How are the hostname and the FQDN defined?
 
As I said, five of them have no FQDN, only a hostname; the one that doesn't work correctly has an FQDN, defined as follows in the hostname/hosts files:

Code:
root@SRV-Host:~# cat /etc/hostname
SRV-Host.DOMAIN.com
All others look like this:
Code:
root@HW-PX01:~# cat /etc/hostname
HW-PX01

And /etc/hosts on all hosts:
Code:
...
10.1.1.24 HW-PX05.local HW-PX05
10.1.1.25 SRV-Host.local SRV-Host SRV-Host.DOMAIN.com
 
Ah okay, I assume that the FQDN in the /etc/hostname file is the cause. Having the FQDN in the /etc/hosts file should be fine.

Can you change the /etc/hostname file to just contain the hostname part? Then you could try to destroy and recreate the MON on that host to see if it shows up correctly when you hover over the service like in the screenshot.

I am looking into how to change the metadata manually for the OSDs, as recreating them will cause quite some load.

The question is how the full FQDN ended up in /etc/hostname. The installer should not do that, so my guess is that it happened (semi-)manually, maybe through some deployment script or Ansible playbook?
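
Roughly, the steps would look like this (only a sketch, please double-check before running it; the MON id is assumed to be the node name, and the short node name itself does not change, only the domain part is dropped):

Code:
# write only the short name into /etc/hostname (and set it for the running system)
hostnamectl set-hostname SRV-Host
cat /etc/hostname
# recreate the monitor on that node so it registers with the short name
pveceph mon destroy SRV-Host
pveceph mon create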
 
Wow, very good, the monitor is now recognized correctly.

Yes, that's a very good question. Unfortunately this is the only server I didn't install myself, because it was set up a year before all the others.
There were similar problems when it was added to the cluster, so either the FQDN was entered back then because someone was trying something, or it was added directly after installation.
But I definitely think it was entered manually.

I'm glad I don't have to reinstall it.
If necessary, I'll recreate the OSDs one after the other, which will definitely take some time with the 80 TB of data, but it would also be great if there were another way.

As a test, I restarted one OSD and it now works correctly; next I'll restart all the OSDs, and then the problem should be solved.
Thank you very much for the great help.
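
For the rolling restart I will probably do something like this per OSD (a sketch; noout keeps Ceph from rebalancing while an OSD is briefly down):

Code:
ceph osd set noout
# one OSD at a time, wait for HEALTH_OK before the next one
systemctl restart ceph-osd@25.service
ceph -s
# the metadata should now show the short hostname
ceph osd metadata 25 | grep hostname
ceph osd unset noout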


Edit: ceph osd metadata now looks like this:
Code:
ceph osd metadata 29
{
    "id": 29,
...
    "hostname": "SRV-Host",
...
}

root@SRV-Host:~# hostname
SRV-Host.DOMAIN.com

Perfect
 
