Ceph OSD failure after host reboot

cmonty14

Hi,
I have configured a 3-node Ceph cluster.
Each node has 2 RAID controllers, 4 SSDs and 48 HDDs.

I used this syntax to create an OSD:
pveceph osd create /dev/sdd -bluestore -journal_dev /dev/sdv1
pveceph osd create /dev/sde -bluestore -journal_dev /dev/sdw1
pveceph osd create /dev/sdf -bluestore -journal_dev /dev/sdx1
pveceph osd create /dev/sdg -bluestore -journal_dev /dev/sdy1
[...]


This means that I created 12 partitions on each SSD; the SSD devices have these device names:
/dev/sdv
/dev/sdw
/dev/sdx
/dev/sdy

The HDDs have these device names:
/dev/sdd
/dev/sde
/dev/sdf
/dev/sdg
[...]

The issue is this:
If I add new disks to the controller and make them available to Linux, the device names can change after a reboot. This actually happened in my case to the SSDs, with the result that no OSD was up anymore.

Would it be possible to use a unique device name like UUID instead of the simple device name?

THX
 
Ceph uses udev (or something similar) to identify its disks, which should make the point moot when drives change designation.

Having said that, I have had this problem when the network stopped working, as I have a separate migration and public network set up.
 
I'm wondering if anybody else is affected by this issue and, if so, why there's no solution provided.
 
This is one more reason why you should not use a RAID controller for Ceph or ZFS.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition

I fully understand that using a RAID controller is not recommended and that an HBA / JBOD should be used.
However, this does not solve the issue.
Let's assume I have a server that provides 20 slots for SAS devices, but I only have 10 disks available.
When I finish the Ceph setup with these 10 disks and add 10 more disks later, there's a chance that the device names /dev/sd<a-t> get mixed up after a server reboot.
And a UUID would help with unique drive identification.
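
For what it's worth, whole disks (which have no filesystem UUID) do get stable names under /dev/disk/by-id/, derived from the drive serial number or WWN; a minimal check could look like this (the IDs shown are placeholders, not from my system):
Code:
# persistent identifiers that do not change when the /dev/sdX letters shuffle
ls -l /dev/disk/by-id/ | grep -v -- -part
# e.g.: wwn-0x5000c500a1b2c3d4  -> ../../sdd
#       scsi-35000c500a1b2c3d4  -> ../../sdd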
 
Code:
ls -lah /dev/disk/by-uuid/
You will see them here.

Code:
ls -lah /var/lib/ceph/osd/ceph-3/
lrwxrwxrwx 1 ceph ceph   58 Jan  8 16:50 block -> /dev/disk/by-partuuid/c94adcc4-f51b-4ce0-bf88-da935445d8f3
-rw-r--r-- 1 ceph ceph   37 Jan  8 16:50 block_uuid
-rw-r--r-- 1 ceph ceph   37 Jan  8 16:50 fsid
Or here.

Code:
parted /dev/sdd print
Model: QEMU QEMU HARDDISK (scsi)
Disk /dev/sdd: 34.4GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags: 

Number  Start   End     Size    File system  Name        Flags
 1      1049kB  106MB   105MB   xfs          ceph data
 2      106MB   34.4GB  34.3GB               ceph block
On start (or restart) of the ceph-osd.target service, ceph-disk looks through all the disks and checks whether it can find a partition named 'ceph data'. Once found, it mounts the partition temporarily, checks which OSD ID it belongs to, and re-mounts it under the appropriate '/var/lib/ceph/osd/ceph-ID/' path. Through the different files in the mounted directory, Ceph knows where to find the other needed partitions.
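
As an illustration (device and OSD numbers are placeholders), the result of that scan can be checked with ceph-disk itself, or by reading the GPT partition name directly:
Code:
# which partitions ceph-disk recognises and which OSD they belong to
ceph-disk list
# e.g.: /dev/sdd1 ceph data, active, cluster ceph, osd.3, block /dev/sdd2

# the GPT partition name that ceph-disk searches for
sgdisk -i 1 /dev/sdd
# e.g.: Partition name: 'ceph data'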
 

What you say is 100% correct.

However, you did not consider a setup where block.db resides on a faster device (SSD) than the main device (HDD).
In that case block.db is a symlink to the plain device name and not to a UUID:
Code:
root@ld4257:/etc/ceph# ls -lah /var/lib/ceph/osd/ceph-0/
total 60K
drwxr-xr-x 2 ceph ceph  271 Oct 30 13:58 .
drwxr-xr-x 4 ceph ceph 4.0K Oct 30 13:49 ..
-rw-r--r-- 1 root root  402 Jul 30  2018 activate.monmap
-rw-r--r-- 1 ceph ceph    3 Jul 30  2018 active
lrwxrwxrwx 1 ceph ceph   58 Jul 30  2018 block -> /dev/disk/by-partuuid/5126bf90-dbcb-4aa9-b10b-2304e9f66ff6
lrwxrwxrwx 1 ceph ceph    9 Oct 30 13:58 block.db -> /dev/sdd1
-rw-r--r-- 1 ceph ceph   37 Jul 30  2018 block.db_uuid
-rw-r--r-- 1 ceph ceph   37 Jul 30  2018 block_uuid
-rw-r--r-- 1 ceph ceph    2 Jul 30  2018 bluefs
-rw-r--r-- 1 ceph ceph   37 Jul 30  2018 ceph_fsid
-rw-r--r-- 1 ceph ceph   37 Jul 30  2018 fsid
-rw------- 1 ceph ceph   56 Jul 30  2018 keyring
-rw-r--r-- 1 ceph ceph    8 Jul 30  2018 kv_backend
-rw-r--r-- 1 ceph ceph   21 Jul 30  2018 magic
-rw-r--r-- 1 ceph ceph    4 Jul 30  2018 mkfs_done
-rw-r--r-- 1 ceph ceph    6 Jul 30  2018 ready
-rw-r--r-- 1 ceph ceph    0 Jan  7 17:26 systemd
-rw-r--r-- 1 ceph ceph   10 Jul 30  2018 type
-rw-r--r-- 1 ceph ceph    2 Jul 30  2018 whoami


The separation of BlueStore into separate devices
  • main device (the block symlink)
  • DB device (the block.db symlink)
  • optional WAL device (the block.wal symlink)
is what Ceph recommends when faster storage is available for DB/WAL (see the check below).
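
As a side note, a quick way to double-check which physical device backs each of these roles is ceph-bluestore-tool; this is only a sketch, and the exact output fields vary by release:
Code:
# print the BlueStore labels for this OSD's block / block.db (/ block.wal) devices
ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0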

Interestingly enough, the content of the file block.db_uuid is:
Code:
root@ld4257:/etc/ceph# more /var/lib/ceph/osd/ceph-0/block.db_uuid
714bae53-978e-4b6b-b780-4253c25bcb54


This means the UUID is already known, but the symlink is not pointing to it.

The bottom line is that the symlink to the main device (using a UUID) and the symlink to the DB device (using a plain device name) are inconsistent.
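
For the record, a possible manual workaround (my own assumption, not an official procedure) would be to re-point the symlink at the partuuid already recorded in block.db_uuid while the OSD is stopped:
Code:
# hypothetical fix-up: recreate block.db as a by-partuuid symlink
systemctl stop ceph-osd@0
OSD_DIR=/var/lib/ceph/osd/ceph-0
DB_UUID=$(cat "$OSD_DIR/block.db_uuid")
ln -sf "/dev/disk/by-partuuid/$DB_UUID" "$OSD_DIR/block.db"
chown -h ceph:ceph "$OSD_DIR/block.db"
systemctl start ceph-osd@0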

Therefore I would open a feature request to create the block.db symlink using the partition UUID when creating an OSD with the command pveceph osd create <main device name> -bluestore -journal_dev <block device partition>.
 
pveceph osd create <main device name> -bluestore -journal_dev <block device partition>.
This command gives me the following output:
Code:
prepare_device: OSD will not be hot-swappable if block.db is not the same device as the osd data
prepare_device: Block.db /dev/sdc1 was not prepared with ceph-disk. Symlinking directly.

You can set the default partition sizes for DB/WAL in ceph.conf, and ceph-disk will create the partitions with the corresponding size.
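For illustration, a minimal ceph.conf sketch for that (the sizes are assumptions, given in bytes; adjust to your SSDs):
Code:
[global]
    # ~30 GiB block.db partition per OSD
    bluestore_block_db_size = 32212254720
    # ~2 GiB block.wal partition (optional; without it the WAL stays on the DB device)
    bluestore_block_wal_size = 2147483648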
Code:
# pveceph osd create /dev/sdb -bluestore -journal_dev /dev/sdc
Code:
root@p5c02:~# ls -lah /var/lib/ceph/osd/ceph-0/

lrwxrwxrwx 1 ceph ceph   58 Jan 30 11:22 block -> /dev/disk/by-partuuid/51b435dc-57f8-4fce-92b7-b0dde22e8ec3
lrwxrwxrwx 1 ceph ceph   58 Jan 30 11:22 block.db -> /dev/disk/by-partuuid/97453d64-a2cc-4d57-b099-e1aac336be99
-rw-r--r-- 1 ceph ceph   37 Jan 30 11:22 block.db_uuid
-rw-r--r-- 1 ceph ceph   37 Jan 30 11:22 block_uuid
 
Does this mean the command pveceph osd create /dev/sdb -bluestore -journal_dev /dev/sdc will create multiple partitions on the block device /dev/sdc if it is used as the DB device for several different main devices?
 
Yes, ceph-disk uses the next free space on the device to create the partition.
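
For illustration, after creating several OSDs against the same SSD the partition table could end up looking roughly like this (mock-up output; sizes depend on the ceph.conf settings above):
Code:
parted /dev/sdc print
Number  Start   End     Size    File system  Name           Flags
 1      1049kB  1075MB  1074MB               ceph block.db
 2      1075MB  2149MB  1074MB               ceph block.db
 3      2149MB  3223MB  1074MB               ceph block.db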
 
