[SOLVED] Ceph OSD activation fail

cmonty14

Well-Known Member
Mar 4, 2014
Hi,

I have created 3 OSDs on the same node.
root@ld4257:~# pveceph createosd /dev/sda -bluestore -journal_dev /dev/sdc4 -wal_dev /dev/sdc4
create OSD on /dev/sda (bluestore)
using device '/dev/sdc4' for block.db
Caution: invalid backup GPT header, but valid main header; regenerating
backup header from main header.

****************************************************************************
Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
verification and recovery are STRONGLY recommended.
****************************************************************************
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
Creating new GPT entries.
The operation has completed successfully.
Setting name!
partNum is 0
REALLY setting name!
The operation has completed successfully.
prepare_device: OSD will not be hot-swappable if block.db is not the same device as the osd data
prepare_device: Block.db /dev/sdc4 was not prepared with ceph-disk. Symlinking directly.
Setting name!
partNum is 1
REALLY setting name!
The operation has completed successfully.
The operation has completed successfully.
meta-data=/dev/sda1 isize=2048 agcount=4, agsize=6336 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=0, rmapbt=0, reflink=0
data = bsize=4096 blocks=25344, imaxpct=25
= sunit=64 swidth=64 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=1728, version=2
= sectsz=512 sunit=64 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.
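
Note the warning above that the kernel is still using the old partition table. A minimal sketch of forcing a re-read without a reboot (assuming the OSD data disk is /dev/sda, as above), so the new partitions become visible to the kernel and to ceph-disk:

root@ld4257:~# partprobe /dev/sda
root@ld4257:~# ceph-disk list

ceph-disk list should then show the freshly prepared data partition and the block.db partition on /dev/sdc4.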


However, when I check the OSD tree, all OSDs are down:
root@ld4257:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0 root default
-3 0 host ld4257
-7 0 host ld4464
-5 0 host ld4465
0 0 osd.0 down 0 1.00000
1 0 osd.1 down 1.00000 1.00000
2 0 osd.2 down 1.00000 1.00000


Then I tried to start the OSD manually:
root@ld4257:~# systemctl start ceph-osd@0

But the error log indicates a severe error:
2018-07-24 11:00:50.136610 7fa5fd82ee40 0 set uid:gid to 64045:64045 (ceph:ceph)
2018-07-24 11:00:50.136629 7fa5fd82ee40 0 ceph version 12.2.5 (dfcb7b53b2e4fcd2a5af0240d4975adc711ab96e) luminous (stable), process (unknown), pid 25037
2018-07-24 11:00:50.136843 7fa5fd82ee40 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory
2018-07-24 11:01:10.389479 7fd8cdfa4e40 0 set uid:gid to 64045:64045 (ceph:ceph)
2018-07-24 11:01:10.389498 7fd8cdfa4e40 0 ceph version 12.2.5 (dfcb7b53b2e4fcd2a5af0240d4975adc711ab96e) luminous (stable), process (unknown), pid 25087
2018-07-24 11:01:10.389708 7fd8cdfa4e40 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory
2018-07-24 11:01:30.592560 7f61f2f67e40 0 set uid:gid to 64045:64045 (ceph:ceph)
2018-07-24 11:01:30.592574 7f61f2f67e40 0 ceph version 12.2.5 (dfcb7b53b2e4fcd2a5af0240d4975adc711ab96e) luminous (stable), process (unknown), pid 25124
2018-07-24 11:01:30.592733 7f61f2f67e40 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory


Directory /var/lib/ceph/osd/ceph-0/ is empty:
root@ld4257:~# ls -l /var/lib/ceph/osd/
total 12
drwxr-xr-x 2 root root 4096 Jul 23 15:26 ceph-0
drwxr-xr-x 2 root root 4096 Jul 23 15:33 ceph-1
drwxr-xr-x 2 root root 4096 Jul 23 15:34 ceph-2
root@ld4257:~# ls -l /var/lib/ceph/osd/ceph-0/
total 0


How can I fix this error?
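
A minimal sketch of the kind of manual activation I could try next, assuming ceph-disk (which pveceph uses on Luminous) and that /dev/sda1 is the OSD data partition created above:

root@ld4257:~# ceph-disk list
root@ld4257:~# ceph-disk activate /dev/sda1

ceph-disk activate should mount the data partition to /var/lib/ceph/osd/ceph-0 and start the ceph-osd@0 unit.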

THX
 

I have deleted the 3 OSDs created with this command:
root@ld4257:~# pveceph createosd /dev/sda -bluestore -journal_dev /dev/sdc4 -wal_dev /dev/sdc4
create OSD on /dev/sda (bluestore)
using device '/dev/sdc4' for block.db
[... output identical to the first post above: only block.db was prepared, there is no block.wal line ...]


Then I re-created the 3 OSDs with a different command:
root@ld4257:~# pveceph createosd /dev/sda -bluestore -journal_dev /dev/sdc4 -wal_dev /dev/sdc5
create OSD on /dev/sda (bluestore)
using device '/dev/sdc4' for block.db
using device '/dev/sdc5' for block.wal
Creating new GPT entries.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
Creating new GPT entries.
The operation has completed successfully.
Setting name!
partNum is 0
REALLY setting name!
The operation has completed successfully.
prepare_device: OSD will not be hot-swappable if block.db is not the same device as the osd data
prepare_device: Block.db /dev/sdc4 was not prepared with ceph-disk. Symlinking directly.
prepare_device: OSD will not be hot-swappable if block.wal is not the same device as the osd data
prepare_device: Block.wal /dev/sdc5 was not prepared with ceph-disk. Symlinking directly.
Setting name!
partNum is 1
REALLY setting name!
The operation has completed successfully.
The operation has completed successfully.
meta-data=/dev/sda1 isize=2048 agcount=4, agsize=6336 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=0, rmapbt=0, reflink=0
data = bsize=4096 blocks=25344, imaxpct=25
= sunit=64 swidth=64 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=1728, version=2
= sectsz=512 sunit=64 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.


Please compare the prepare_device lines in both outputs: in the first attempt there is no block.wal line at all, while in the second attempt both block.db and block.wal are prepared.

My conclusion is:
If I want to put the WAL and the journal (block.db) on an SSD, they cannot reside on the same partition!
In the first attempt block.wal was not created.

When I put the WAL and the journal on different drives / partitions, the OSD is activated correctly.

Question:
Is this a feature or a bug?
 
If I want to put the WAL and the journal (block.db) on an SSD, they cannot reside on the same partition!
In the first attempt block.wal was not created.
If you put the DB on its own partition, the WAL will be placed there too (it is just not visible as a separate partition). The WAL always goes on the fastest device of the OSD. And you do not need to specify '-wal_dev' if you don't want a separate partition/device for it.
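
For reference, once an OSD is activated you can check where its DB and WAL actually live by looking at the symlinks in the OSD directory; a sketch using osd.0 as an example:

root@ld4257:~# ls -l /var/lib/ceph/osd/ceph-0/

If there is a block.db symlink but no block.wal symlink, the WAL is stored together with the DB (or with the data device, if there is no separate DB partition).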
 
If you put the DB on its own partition, the WAL will be placed there too (it is just not visible as a separate partition). The WAL always goes on the fastest device of the OSD. And you do not need to specify '-wal_dev' if you don't want a separate partition/device for it.

Understood.
That means I should only put the WAL on a dedicated partition / drive if that drive is faster than the journal (DB) drive, i.e.:
  • data on HDD
  • journal (= block.db) on SSD
  • WAL (= block.wal) on NVMe
If I have only 2 different device types (in my case HDD + SSD), I must not use the -wal_dev option when creating an OSD.
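
A short sketch of the two layouts with the pveceph syntax used above; the NVMe partition name is only an assumption for illustration:

# 2 device types: data on HDD, DB on an SSD partition (WAL is placed there implicitly)
root@ld4257:~# pveceph createosd /dev/sda -bluestore -journal_dev /dev/sdc4

# 3 device types: data on HDD, DB on SSD, WAL on an even faster NVMe partition
root@ld4257:~# pveceph createosd /dev/sda -bluestore -journal_dev /dev/sdc4 -wal_dev /dev/nvme0n1p1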
 
Correct. ;)
 
