[SOLVED] Directory /var/lib/ceph/osd/ceph-<id>/ is empty

cmonty14

Hi,

I finished the upgrade to Proxmox 6 + Ceph Nautilus on a 4-node cluster.

On 2 nodes I have noticed that all directories /var/lib/ceph/osd/ceph-<id>/ are empty after rebooting.

Typically the content of this directory looks like this:
root@ld5508:~# ls -l /var/lib/ceph/osd/ceph-70/
total 60
-rw-r--r-- 1 root root 402 Jun 7 15:49 activate.monmap
-rw-r--r-- 1 ceph ceph 3 Jun 7 15:49 active
lrwxrwxrwx 1 ceph ceph 58 Jun 7 15:49 block -> /dev/disk/by-partuuid/d9c2755f-0542-4772-af7d-0942cf75be76
lrwxrwxrwx 1 ceph ceph 58 Jun 7 15:49 block.db -> /dev/disk/by-partuuid/e2fcac2c-d3c7-4672-84ec-20c0187d7d2a
-rw-r--r-- 1 ceph ceph 37 Jun 7 15:49 block.db_uuid
-rw-r--r-- 1 ceph ceph 37 Jun 7 15:49 block_uuid
-rw-r--r-- 1 ceph ceph 2 Jun 7 15:49 bluefs
-rw-r--r-- 1 ceph ceph 37 Jun 7 15:49 ceph_fsid
-rw-r--r-- 1 ceph ceph 37 Jun 7 15:49 fsid
-rw------- 1 ceph ceph 57 Jun 7 15:49 keyring
-rw-r--r-- 1 ceph ceph 8 Jun 7 15:49 kv_backend
-rw-r--r-- 1 ceph ceph 21 Jun 7 15:49 magic
-rw-r--r-- 1 ceph ceph 4 Jun 7 15:49 mkfs_done
-rw-r--r-- 1 ceph ceph 6 Jun 7 15:49 ready
-rw-r--r-- 1 ceph ceph 3 Aug 23 09:57 require_osd_release
-rw-r--r-- 1 ceph ceph 0 Aug 21 11:35 systemd
-rw-r--r-- 1 ceph ceph 10 Jun 7 15:49 type
-rw-r--r-- 1 ceph ceph 3 Jun 7 15:49 whoami
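
To check whether anything is mounted at that path at all and whether the OSD service started (a quick sketch, using OSD 70 from the listing above as the example id):

# is a data partition or tmpfs mounted for the OSD?
findmnt /var/lib/ceph/osd/ceph-70
# did the OSD service come up at all?
systemctl status ceph-osd@70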


I'm concerned that I'll lose the content on the other nodes, too.

In addition to this issue, I have noticed that no OSDs are displayed in the WebUI -> Ceph -> OSD panel.
The screen is completely empty.

I assume that this issue could be related to ceph.conf; here's my current configuration file:
root@ld3955:/etc/ceph# more /etc/pve/ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 192.168.1.0/27
debug ms = 0/0
fsid = 6b1b5117-6e08-4843-93d6-2da3cf8a6bae
mon allow pool delete = true
mon osd full ratio = .85
mon osd nearfull ratio = .75
osd crush update on start = false
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 3
public network = 10.97.206.0/24
mon_host = 10.97.206.93,10.97.206.94,10.97.206.95

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[osd]
osd journal size = 100

[mds.ld3955]
host = ld3955
mds standby for name = pve
keyring = /etc/pve/priv/ceph.mds.ld3955.keyring

[mds.ld3976]
host = ld3976
mds standby for name = pve
keyring = /etc/pve/priv/ceph.mds.ld3976.keyring


All services are running, with the exception of the OSDs whose /var/lib/ceph/osd/ceph-<id>/ directories are empty.
I'm re-creating the affected OSDs manually in order to fix this issue.

Can you please help identify the root cause of these 2 issues?

THX
 

Attachments

  • 2019-08-27_10-34-57.png (121.8 KB)
Yes. I followed the upgrade guide and executed every single step.
And actually everything was fine.

However, since yesterday the issue started.
I cleaned up some packages left over from Debian 9, upgraded the PVE kernel and rebooted 2 of the 4 nodes.
I also modified /etc/pve/ceph.conf in order to troubleshoot issues with 2 MDS services that did not start.
 
Did it create the /etc/ceph/osd/{OSDID}-GUID.json file(s)? If not then try to run the ceph-volume simple scan /dev/sdX1 and see if it gets created.
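
If there are several ceph-disk OSDs on the node, a rough sketch like the following should cover them all (the glob is only an assumption, adjust it to whatever your OSD data partitions actually are):

# scan every suspected ceph-disk data partition on this node
for part in /dev/sd[a-z]1; do
    ceph-volume simple scan "$part"
done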
 
Did it create the /etc/ceph/osd/{OSDID}-GUID.json file(s)? If not then try to run the ceph-volume simple scan /dev/sdX1 and see if it gets created.
No, the JSON files were not created.
And the directory /etc/ceph/osd/ does not exist.
root@ld5508:~# ls -l /etc/ceph/
total 16
-rw------- 1 ceph ceph 161 May 28 14:33 ceph.client.admin.keyring
lrwxrwxrwx 1 root root 18 May 28 14:33 ceph.conf -> /etc/pve/ceph.conf
-rw-r----- 1 root root 704 Aug 23 10:55 ceph.conf.backup
-rw-r--r-- 1 root root 92 Nov 19 2018 rbdmap
 
Try to run the ceph-volume simple scan /dev/sdX1 and see if it gets created.
 
root@ld5507:~# ceph-volume simple scan /dev/sda1
Running command: /sbin/cryptsetup status /dev/sda1
--> OSD 172 got scanned and metadata persisted to file: /etc/ceph/osd/172-a7de0317-05da-4df5-be08-8b4401d76f10.json
--> To take over management of this scanned OSD, and disable ceph-disk and udev, run:
--> ceph-volume simple activate 172 a7de0317-05da-4df5-be08-8b4401d76f10


What should be the next steps?
And what should be done on servers where all OSDs have been recreated from scratch, meaning LVM is used?
 
What should be the next steps?
--> ceph-volume simple activate 172 a7de0317-05da-4df5-be08-8b4401d76f10
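
If all OSDs on the node have been scanned and have their JSON files in /etc/ceph/osd/, they can presumably also be brought up in one go (check that your ceph-volume version supports the flag):

ceph-volume simple activate --all    # activates every OSD that has a JSON file in /etc/ceph/osd/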

And what should be done on servers where all OSDs have been recreated from scratch, meaning LVM is used?
They are fine. Since there is no ceph-disk anymore, each OSD that gets replaced will be created with LVM anyway.
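
To double-check which OSDs on a node are already LVM based, ceph-volume can list them together with the metadata it keeps in the LVM tags:

ceph-volume lvm list    # prints OSD id, OSD fsid and devices for every LVM-based OSD it finds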
 
All right.
I've completed all activities on the servers with "simple" ceph-disk OSDs.
The JSON files in /etc/ceph/osd/ are complete now.
I understand this as a precautionary measure in case the files in /var/lib/ceph/osd/ceph-<id>/ are lost (again).

However, I don't understand how to fix a comparable issue with LVM OSDs when the files in /var/lib/ceph/osd/ceph-<id>/ get lost.
Can you please explain?

And how can I fix the issue with empty WebUI (see screenshot attached in initial posting)?
 
I understand this as a precautionary measure in case the files in /var/lib/ceph/osd/ceph-<id>/ are lost (again).

However, I don't understand how to fix a comparable issue with LVM OSDs when the files in /var/lib/ceph/osd/ceph-<id>/ get lost.
Can you please explain?
ceph-volume needs this information to activate the OSDs on startup, as it can't get it through LVM tags for these ceph-disk OSDs. For LVM OSDs the mounted directory is created on startup and filled with the information from the LV. It is temporary by nature.
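
In other words, for an LVM/BlueStore OSD the directory is just a tmpfs that gets recreated and repopulated on every activation; roughly, you can see it like this:

findmnt /var/lib/ceph/osd/ceph-<id>    # shows tmpfs as the source for an LVM/BlueStore OSD
ceph-volume lvm activate --all         # recreates and repopulates these directories from the LV metadata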

And how can I fix the issue with empty WebUI (see screenshot attached in initial posting)?
See the bug report: https://bugzilla.proxmox.com/show_bug.cgi?id=2340
 
ceph-volume needs this information to activate the OSDs on startup, as it can't get it through LVM tags for these ceph-disk OSDs. For LVM OSDs the mounted directory is created on startup and filled with the information from the LV. It is temporary by nature.

Does this mean I need to create the JSON files for the LVM OSDs, too?
If yes, how should I do this?
If not, how can I ensure that OSD activation on startup for LVM OSDs is working in case the files in /var/lib/ceph/osd/ceph-<id>/ are lost?
 
Does this mean I need to create the JSON files for the LVM OSDs, too?
No.

If not, how can I ensure that OSD activation on startup for LVM OSDs is working in case the files in /var/lib/ceph/osd/ceph-<id>/ are lost?
They can't be lost; they are stored in the LV's metadata (lvs -o +tags).
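
For illustration, the tags can be inspected with the standard LVM report fields; expect ceph.* tags such as ceph.osd_id and ceph.osd_fsid on each OSD's LV:

lvs -o lv_name,vg_name,lv_tags    # shows the ceph.osd_id / ceph.osd_fsid / ... tags per logical volume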
 
Hi all!
I had a similar problem. I had to add the OSD manually after the config was lost, and after another reboot the config was lost again.
How can I check and fix this problem?
 
I logged in to say hi and thank you to Alwin.
The info in this thread should be in the troubleshooting section of the Luminous to Nautilus upgrade guide.
My upgrade went very well, except my fabric adapter changed names for some reason and I had to create the JSON files for my existing OSDs by running ceph-volume simple scan /dev/diskpart1 on every OSD.
I was wondering why all my OSDs wouldn't start after the upgrade, and my anxiety almost got me.

THANKS A MILLION Alwin!
 
