[SOLVED] Directory /var/lib/ceph/osd/ceph-<id>/ is empty

cmonty14

Hi,

I finished the upgrade to Proxmox 6 + Ceph Nautilus on a 4-node cluster.

On 2 nodes I have found that all /var/lib/ceph/osd/ceph-<id>/ directories are empty after rebooting.

Typically the content of such a directory looks like this:
root@ld5508:~# ls -l /var/lib/ceph/osd/ceph-70/
total 60
-rw-r--r-- 1 root root 402 Jun 7 15:49 activate.monmap
-rw-r--r-- 1 ceph ceph 3 Jun 7 15:49 active
lrwxrwxrwx 1 ceph ceph 58 Jun 7 15:49 block -> /dev/disk/by-partuuid/d9c2755f-0542-4772-af7d-0942cf75be76
lrwxrwxrwx 1 ceph ceph 58 Jun 7 15:49 block.db -> /dev/disk/by-partuuid/e2fcac2c-d3c7-4672-84ec-20c0187d7d2a
-rw-r--r-- 1 ceph ceph 37 Jun 7 15:49 block.db_uuid
-rw-r--r-- 1 ceph ceph 37 Jun 7 15:49 block_uuid
-rw-r--r-- 1 ceph ceph 2 Jun 7 15:49 bluefs
-rw-r--r-- 1 ceph ceph 37 Jun 7 15:49 ceph_fsid
-rw-r--r-- 1 ceph ceph 37 Jun 7 15:49 fsid
-rw------- 1 ceph ceph 57 Jun 7 15:49 keyring
-rw-r--r-- 1 ceph ceph 8 Jun 7 15:49 kv_backend
-rw-r--r-- 1 ceph ceph 21 Jun 7 15:49 magic
-rw-r--r-- 1 ceph ceph 4 Jun 7 15:49 mkfs_done
-rw-r--r-- 1 ceph ceph 6 Jun 7 15:49 ready
-rw-r--r-- 1 ceph ceph 3 Aug 23 09:57 require_osd_release
-rw-r--r-- 1 ceph ceph 0 Aug 21 11:35 systemd
-rw-r--r-- 1 ceph ceph 10 Jun 7 15:49 type
-rw-r--r-- 1 ceph ceph 3 Jun 7 15:49 whoami
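For reference, a minimal sketch of how the affected directories can be found on a node (the loop is only an example and assumes the default OSD path):
# Print every OSD data directory that is unexpectedly empty
for d in /var/lib/ceph/osd/ceph-*/; do
    [ -z "$(ls -A "$d")" ] && echo "empty: $d"
done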


I'm concerned that I'll lose the content on the other nodes, too.

In addition to this issue, I have noticed that no OSDs are displayed in the WebUI -> Ceph -> OSD view.
The screen is completely empty.

I assume that this issue could be related to ceph.conf; here's my current configuration file:
root@ld3955:/etc/ceph# more /etc/pve/ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 192.168.1.0/27
debug ms = 0/0
fsid = 6b1b5117-6e08-4843-93d6-2da3cf8a6bae
mon allow pool delete = true
mon osd full ratio = .85
mon osd nearfull ratio = .75
osd crush update on start = false
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 3
public network = 10.97.206.0/24
mon_host = 10.97.206.93,10.97.206.94,10.97.206.95

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[osd]
osd journal size = 100

[mds.ld3955]
host = ld3955
mds standby for name = pve
keyring = /etc/pve/priv/ceph.mds.ld3955.keyring

[mds.ld3976]
host = ld3976
mds standby for name = pve
keyring = /etc/pve/priv/ceph.mds.ld3976.keyring
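(Before digging into the individual OSDs, a quick sanity check that the monitors and the cluster itself are fine; just a sketch using standard commands:)
# Confirm the monitors listed in mon_host are reachable and show cluster health
ceph -s
# Check how the cluster sees the OSDs (down vs. missing entirely)
ceph osd tree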


All services are running, with the exception of the OSDs whose /var/lib/ceph/osd/ceph-<id>/ directories are empty.
I'm re-creating the affected OSDs manually in order to fix this issue.
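(For reference, a minimal sketch of how to check why the OSD services did not start on a node; OSD id 70 is only an example taken from the listing above:)
# List all ceph-osd units on this node and their state
systemctl list-units 'ceph-osd@*' --all
# Inspect a single failed OSD
systemctl status ceph-osd@70
journalctl -u ceph-osd@70 -b --no-pager | tail -n 50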

Can you please help identify the root cause of these two issues?

THX
 

Attachments

  • 2019-08-27_10-34-57.png (screenshot of the empty WebUI -> Ceph -> OSD view)
Yes, I followed the upgrade guide and executed every single step.
And actually everything was fine.

However, the issue only started yesterday.
I cleaned up some packages left over from Debian 9, upgraded the PVE kernel and rebooted 2 of the 4 nodes.
I also modified /etc/pve/ceph.conf in order to troubleshoot issues with 2 MDS services that did not start.
 
Did it create the /etc/ceph/osd/{OSDID}-GUID.json file(s)? If not, try running ceph-volume simple scan /dev/sdX1 and see if they get created.
 
Did it create the /etc/ceph/osd/{OSDID}-GUID.json file(s)? If not, try running ceph-volume simple scan /dev/sdX1 and see if they get created.
No, the json files were not created.
And the directory /etc/ceph/osd/ does not exist.
root@ld5508:~# ls -l /etc/ceph/
total 16
-rw------- 1 ceph ceph 161 May 28 14:33 ceph.client.admin.keyring
lrwxrwxrwx 1 root root 18 May 28 14:33 ceph.conf -> /etc/pve/ceph.conf
-rw-r----- 1 root root 704 Aug 23 10:55 ceph.conf.backup
-rw-r--r-- 1 root root 92 Nov 19 2018 rbdmap
 
Try to run ceph-volume simple scan /dev/sdX1 and see if it gets created.
 
root@ld5507:~# ceph-volume simple scan /dev/sda1
Running command: /sbin/cryptsetup status /dev/sda1
--> OSD 172 got scanned and metadata persisted to file: /etc/ceph/osd/172-a7de0317-05da-4df5-be08-8b4401d76f10.json
--> To take over management of this scanned OSD, and disable ceph-disk and udev, run:
--> ceph-volume simple activate 172 a7de0317-05da-4df5-be08-8b4401d76f10


What should the next steps be?
And what should be done on servers where all OSDs have already been recreated from scratch, i.e. where LVM is used?
 
What should the next steps be?
--> ceph-volume simple activate 172 a7de0317-05da-4df5-be08-8b4401d76f10

And what should be done on servers where all OSDs have already been recreated from scratch, i.e. where LVM is used?
They are fine. Since there is no ceph-disk anymore, each OSD that gets replaced will be created with LVM anyway.
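For nodes that still have several ceph-disk based OSDs, the whole takeover can be sketched roughly like this (device names are examples, not taken from the thread; it assumes the scan wrote a json file per OSD to /etc/ceph/osd/):
# Scan every ceph-disk data partition on the node (example devices)
ceph-volume simple scan /dev/sda1
ceph-volume simple scan /dev/sdb1
# Activate every OSD for which a json file now exists in /etc/ceph/osd/
ceph-volume simple activate --all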
 
All right.
I've completed all the steps on the servers with "simple" ceph-disk OSDs.
The json files in /etc/ceph/osd/ are complete now.
I understand this as a precautionary measure in case the files in /var/lib/ceph/osd/ceph-<id>/ are lost (again).

However, I don't understand how to fix a comparable issue with LVM OSDs when the files in /var/lib/ceph/osd/ceph-<id>/ get lost.
Can you please explain?

And how can I fix the issue with the empty WebUI (see the screenshot attached to the initial posting)?
 
I understand this as a precautionary measure in case the files in /var/lib/ceph/osd/ceph-<id>/ are lost (again).

However, I don't understand how to fix a comparable issue with LVM OSDs when the files in /var/lib/ceph/osd/ceph-<id>/ get lost.
Can you please explain?
ceph-volume needs this information (the json files) to activate the ceph-disk based OSDs on startup, as for those it can't get the information through LVM tags. For LVM OSDs, the mounted directory is created on startup and filled with the information from the LV; it is temporary by nature.
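(As a quick illustration of the "temporary by nature" part, assuming a BlueStore OSD created with ceph-volume lvm; OSD id 70 is only an example:)
# The data directory of an LVM/BlueStore OSD is a tmpfs mount that gets
# recreated and repopulated from the LV on every activation
findmnt /var/lib/ceph/osd/ceph-70
# the FSTYPE column should show tmpfs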

And how can I fix the issue with the empty WebUI (see the screenshot attached to the initial posting)?
See the bug report: https://bugzilla.proxmox.com/show_bug.cgi?id=2340
 
ceph-volume needs this information (the json files) to activate the ceph-disk based OSDs on startup, as for those it can't get the information through LVM tags. For LVM OSDs, the mounted directory is created on startup and filled with the information from the LV; it is temporary by nature.

Does this mean I need to create the json files for LVM OSDs, too?
If yes, how should I do this?
If not, how can I ensure that OSD activation on startup works for LVM OSDs in case the files in /var/lib/ceph/osd/ceph-<id>/ are lost?
 
Does this mean I need to create the json files for LVM OSDs, too?
No.

If not, how can I ensure that OSD activation on startup works for LVM OSDs in case the files in /var/lib/ceph/osd/ceph-<id>/ are lost?
They can't be lost; they are stored in the LV's metadata (see lvs -o +tags).
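(For completeness, a small sketch of how this can be checked and how LVM OSDs can be re-activated by hand if ever needed; these are standard ceph-volume and LVM commands:)
# Show the Ceph metadata that ceph-volume stores as LVM tags
lvs -o lv_name,vg_name,lv_tags
# Or let ceph-volume print it in a more readable form
ceph-volume lvm list
# Re-activate all LVM OSDs on this node from that metadata
ceph-volume lvm activate --all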
 
Hi all!
I have a similar problem: I had to add the OSDs manually after the config was lost, and after rebooting again the config was lost again.
How can I check and fix this problem?
 
I logged in to say hi and thank you to Alwin.
The info in this thread should be in the troubleshooting section of the Luminous to Nautilus upgrade guide.
My upgrade went very well, except that my fabric adapter changed names for some reason and I had to create the json files for my existing OSDs by running ceph-volume simple scan /dev/diskpart1 on every OSD.
I was wondering why all my OSDs wouldn't start after the upgrade, and my anxiety almost got me.

THANKS A MILLION, Alwin!
 
