Ceph OSD failure after host reboot

cmonty14

Hi,
I have configured a 3-node Ceph cluster.
Each node has 2 RAID controllers, 4 SSDs and 48 HDDs.

I used this syntax to create an OSD:
pveceph osd create /dev/sdd -bluestore -journal_dev /dev/sdv1
pveceph osd create /dev/sde -bluestore -journal_dev /dev/sdw1
pveceph osd create /dev/sdf -bluestore -journal_dev /dev/sdx1
pveceph osd create /dev/sdg -bluestore -journal_dev /dev/sdy1
[...]


This means that I created 12 partitions on each SSD; the SSD devices have these device names:
/dev/sdv
/dev/sdw
/dev/sdx
/dev/sdy

The HDDs have these device names:
/dev/sdd
/dev/sde
/dev/sdf
/dev/sdg
[...]

The issue is this:
If I add new disks to the controller and make them available to Linux, the device names can change after a reboot. This actually happened in my case to the SSDs, with the result that no OSD was up anymore.

Would it be possible to use a unique device name like UUID instead of the simple device name?

THX
 
Ceph uses udev (or something similar) to identify its disks, which should make the point moot when drives change designation.

Having said that, I have had this problem when the network stopped working, as I have a separate migration and public network set up.
 
I'm wondering if anybody else is affected by this issue and, if so, why there's no solution provided.
 
This is one more reason why you should not use a RAID controller for Ceph or ZFS.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition

I fully understand that using a RAID controller is not recommended and that an HBA / JBOD should be used.
However, this does not solve the issue.
Let's assume I have a server that provides 20 slots for SAS devices, but I only have 10 disks available.
When I finish the Ceph setup with these 10 disks and add 10 more disks later, there's a chance that the device names /dev/sd<a-t> get mixed up after a server reboot.
And a UUID would help with unique drive identification.
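
For what it's worth, whole disks (which have no filesystem UUID) do get stable names under /dev/disk/by-id/, derived from the drive serial number or WWN; a minimal check could look like this (the IDs shown are placeholders, not from my system):
Code:
# persistent identifiers that do not change when the /dev/sdX letters shuffle
ls -l /dev/disk/by-id/ | grep -v -- -part
# e.g.: wwn-0x5000c500a1b2c3d4  -> ../../sdd
#       scsi-35000c500a1b2c3d4  -> ../../sdd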
 
Code:
ls -lah /dev/disk/by-uuid/
You will see them here.

Code:
ls -lah /var/lib/ceph/osd/ceph-3/
lrwxrwxrwx 1 ceph ceph   58 Jan  8 16:50 block -> /dev/disk/by-partuuid/c94adcc4-f51b-4ce0-bf88-da935445d8f3
-rw-r--r-- 1 ceph ceph   37 Jan  8 16:50 block_uuid
-rw-r--r-- 1 ceph ceph   37 Jan  8 16:50 fsid
Or here.

Code:
parted /dev/sdd print
Model: QEMU QEMU HARDDISK (scsi)
Disk /dev/sdd: 34.4GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags: 

Number  Start   End     Size    File system  Name        Flags
 1      1049kB  106MB   105MB   xfs          ceph data
 2      106MB   34.4GB  34.3GB               ceph block
On start (or restart) of the ceph-osd.target service, ceph-disk looks through all the disks and checks whether it can find a partition named 'ceph data'. Once found, it mounts the partition temporarily, checks which OSD ID it belongs to, and re-mounts it under the appropriate '/var/lib/ceph/osd/ceph-ID/' path. Through the different files in the mounted directory, Ceph knows where to find the other needed partitions.
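
As an illustration (device and OSD numbers are placeholders), the result of that scan can be checked with ceph-disk itself, or by reading the GPT partition name directly:
Code:
# which partitions ceph-disk recognises and which OSD they belong to
ceph-disk list
# e.g.: /dev/sdd1 ceph data, active, cluster ceph, osd.3, block /dev/sdd2

# the GPT partition name that ceph-disk searches for
sgdisk -i 1 /dev/sdd
# e.g.: Partition name: 'ceph data'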
 

What you say is 100% correct.

However, you did not consider a setup where block.db resides on a faster device (SSD) than the main device (HDD).
In that case block.db is a symlink to the plain device name and not to a UUID:
Code:
root@ld4257:/etc/ceph# ls -lah /var/lib/ceph/osd/ceph-0/
total 60K
drwxr-xr-x 2 ceph ceph  271 Oct 30 13:58 .
drwxr-xr-x 4 ceph ceph 4.0K Oct 30 13:49 ..
-rw-r--r-- 1 root root  402 Jul 30  2018 activate.monmap
-rw-r--r-- 1 ceph ceph    3 Jul 30  2018 active
lrwxrwxrwx 1 ceph ceph   58 Jul 30  2018 block -> /dev/disk/by-partuuid/5126bf90-dbcb-4aa9-b10b-2304e9f66ff6
lrwxrwxrwx 1 ceph ceph    9 Oct 30 13:58 block.db -> /dev/sdd1
-rw-r--r-- 1 ceph ceph   37 Jul 30  2018 block.db_uuid
-rw-r--r-- 1 ceph ceph   37 Jul 30  2018 block_uuid
-rw-r--r-- 1 ceph ceph    2 Jul 30  2018 bluefs
-rw-r--r-- 1 ceph ceph   37 Jul 30  2018 ceph_fsid
-rw-r--r-- 1 ceph ceph   37 Jul 30  2018 fsid
-rw------- 1 ceph ceph   56 Jul 30  2018 keyring
-rw-r--r-- 1 ceph ceph    8 Jul 30  2018 kv_backend
-rw-r--r-- 1 ceph ceph   21 Jul 30  2018 magic
-rw-r--r-- 1 ceph ceph    4 Jul 30  2018 mkfs_done
-rw-r--r-- 1 ceph ceph    6 Jul 30  2018 ready
-rw-r--r-- 1 ceph ceph    0 Jan  7 17:26 systemd
-rw-r--r-- 1 ceph ceph   10 Jul 30  2018 type
-rw-r--r-- 1 ceph ceph    2 Jul 30  2018 whoami


The separation of BlueStore into separate devices
  • main device (the block symlink)
  • DB device (the block.db symlink)
  • optional WAL device (the block.wal symlink)
is what Ceph recommends when faster storage is available for DB/WAL (see the check below).
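
As a side note, a quick way to double-check which physical device backs each of these roles is ceph-bluestore-tool; this is only a sketch, and the exact output fields vary by release:
Code:
# print the BlueStore labels for this OSD's block / block.db (/ block.wal) devices
ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0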

Interestingly enough, the content of the file block.db_uuid is:
Code:
root@ld4257:/etc/ceph# more /var/lib/ceph/osd/ceph-0/block.db_uuid
714bae53-978e-4b6b-b780-4253c25bcb54


This means the UUID is already known, but the symlink is not pointing to it.

The bottom line is that the symlink to the main device (using a UUID) and the symlink to the DB device (using a plain device name) are inconsistent.
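
For the record, a possible manual workaround (my own assumption, not an official procedure) would be to re-point the symlink at the partuuid already recorded in block.db_uuid while the OSD is stopped:
Code:
# hypothetical fix-up: recreate block.db as a by-partuuid symlink
systemctl stop ceph-osd@0
OSD_DIR=/var/lib/ceph/osd/ceph-0
DB_UUID=$(cat "$OSD_DIR/block.db_uuid")
ln -sf "/dev/disk/by-partuuid/$DB_UUID" "$OSD_DIR/block.db"
chown -h ceph:ceph "$OSD_DIR/block.db"
systemctl start ceph-osd@0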

Therefore I would open a feature request to create the block.db symlink using the partition UUID when creating an OSD with the command pveceph osd create <main device name> -bluestore -journal_dev <block device partition>.
 
pveceph osd create <main device name> -bluestore -journal_dev <block device partition>.
This command gives me the following output:
Code:
prepare_device: OSD will not be hot-swappable if block.db is not the same device as the osd data
prepare_device: Block.db /dev/sdc1 was not prepared with ceph-disk. Symlinking directly.

You can set the default partition sizes for DB/WAL in ceph.conf, and ceph-disk will create the partitions with the corresponding size.
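For illustration, a minimal ceph.conf sketch for that (the sizes are assumptions, given in bytes; adjust to your SSDs):
Code:
[global]
    # ~30 GiB block.db partition per OSD
    bluestore_block_db_size = 32212254720
    # ~2 GiB block.wal partition (optional; without it the WAL stays on the DB device)
    bluestore_block_wal_size = 2147483648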
Code:
# pveceph osd create /dev/sdb -bluestore -journal_dev /dev/sdc
Code:
root@p5c02:~# ls -lah /var/lib/ceph/osd/ceph-0/

lrwxrwxrwx 1 ceph ceph   58 Jan 30 11:22 block -> /dev/disk/by-partuuid/51b435dc-57f8-4fce-92b7-b0dde22e8ec3
lrwxrwxrwx 1 ceph ceph   58 Jan 30 11:22 block.db -> /dev/disk/by-partuuid/97453d64-a2cc-4d57-b099-e1aac336be99
-rw-r--r-- 1 ceph ceph   37 Jan 30 11:22 block.db_uuid
-rw-r--r-- 1 ceph ceph   37 Jan 30 11:22 block_uuid
 
Does this mean the command pveceph osd create /dev/sdb -bluestore -journal_dev /dev/sdc will create multiple partitions on the block device /dev/sdc if it is used as the DB device for several different main devices?
 
Yes, ceph-disk uses the next free space on the device to create the partition.
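
For illustration, after creating several OSDs against the same SSD the partition table could end up looking roughly like this (mock-up output; sizes depend on the ceph.conf settings above):
Code:
parted /dev/sdc print
Number  Start   End     Size    File system  Name           Flags
 1      1049kB  1075MB  1074MB               ceph block.db
 2      1075MB  2149MB  1074MB               ceph block.db
 3      2149MB  3223MB  1074MB               ceph block.db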
 
