Ceph Server: why block devices and not partitions ?

Florent · Feb 19, 2014

Hi all,

I'm trying Ceph Server with Proxmox, but I have a little problem.

My servers both have 2 disks, partitioned with RAID (sda1 & sdb1 mirrored).

And I would like to use sda2 & sdb2 for my OSDs in Ceph.

But it seems that all scripts from PVE are using block devices directly and not partitions.

Is there any workaround to use partitions instead of device directly ?

I was looking in loopback devices, but I was not able to create a virtual "sdz" listed in /sys/block (PVE lists this to get all disks).

Anyone got an idea ?

Thank you;

dietmar · Feb 19, 2014

It makes absolutely no sense to use software RAID for Ceph.
You need to use whole disks instead.

Florent · Feb 19, 2014

Yes sorry I forgot to say that my Raid device is used to store system files.
I want to use sda2 & sdb2 for Ceph, which are not Raid-ed.

dietmar · Feb 19, 2014

Our ceph implementation only work with whole disks.

Florent · Feb 19, 2014

Ok, is there any reason for that ?

dietmar · Feb 19, 2014

Florent said:
Ok, is there any reason for that ?

Management (replacing damaged OSDs) is much easier.

zystem · Mar 13, 2014

But you can add partitions support for cli utils. I do not need full support of partitions in gui (existing is suitable). Can you add partitions support to "pveceph createosd /dev/sd[X]" like "pveceph createosd /dev/sdd4"?

Florent · Mar 13, 2014

zystem said:
But you can add partitions support for cli utils. I do not need full support of partitions in gui (existing is suitable). Can you add partitions support to "pveceph createosd /dev/sd[X]" like "pveceph createosd /dev/sdd4"?

Look at sources... the way they do it does not permit to do what you want.

But have a look to my post (http://forum.proxmox.com/threads/17909-Ceph-server-feedback), I give the solution to do it.

zystem · Mar 14, 2014

Yes it sounds reasonable. But why do not add partitions support for journal disk? Or add ability to enter path to journals in gui instead of disk? I want store journals on first disk there proxmox root partition installed. I have no infinite space for additional disks in server case. Why do not use fast disk for journals AND proxmox installation?

dietmar · Mar 14, 2014

zystem said:
I want store journals on first disk there proxmox root partition installed.?

I think that this makes no sense, because of performance issues.

avladulescu · Dec 21, 2015

dietmar said:
I think that this makes no sense, because of performance issues.

dietmar, let me give you an example of use case for such requirements.

Consider the following case - 5 proxmox nodes HA cluster setup, each with 1 x SSD Kingston V300 @ 120G for journal and 1 x Intel 750 400G NVMe drives.

The cluster is still under performance benchmark and testing and testing the NVMes like direct local storage provides with fio iops ranging (inside VM) between 101k read, 64k writes and 64k read/22k write in combined.

Now, with ceph cluster setup for Kingstone drives for journal and the Intels as OSDs, the performance drops down to 25354 for iops read (inside the same VM -- after storage migration to new setup), so no need to tell the rest as you might already figured it out.

Ceph configuration is the basic one provided in the pveceph deployment procedure and the network connection between the nodes is not done via 1 Gbps links.

I suspect that the slowdown part here comes out of the poor performance levels of the Kingstone drives, but in order to confirm this I will manually setup the journal and OSD into the Intel 750's on different partitions.

The key and final expected result here, would be to achieve storage redundancy (and not high storage volumes) and high IOPS count for database application.

So what do you think, would it worth implementing the partition layout in this case ? because the NVMe's are capable of sustaining journal and data rw at once with no bargue (considering one card tech specs).

David Herselman · Mar 13, 2019

The key objectives around Ceph is for it to be an easily managed, reliable and scalable storage architecture. Replacing an OSD should be as simple as replacing the old drive and running a single command which then brings it in to service.

Typical Ceph deployments have OSD counts in the hundreds or thousands, not the quantities a typical Proxmox cluster operates with.

To answer your question though, yes you most definitely can, but it creates a nuancial setup which would be much harder to troubleshoot.

We have a cluster with 6 servers where each has two SSDs partitioned to provide a raid 1 Operating System volume and non-raid journal partitions for 4 x FileStore spinners. We wanted to re-utilise all these drives and replace 6 discs in each server with only 2 much larger and higher performance SSDs.

We subsequently partitioned the new drives to provide:

FakeMBR boot partition (1 MiB)
raid1 parition for Operating System (10GiB)
raid1 partition for Swap (1 GiB)
small BlueStore OSD metadata partition (100 MiB)
BlueStore OSD data partition (balance)

Code:

  DEV=sdb
  parted --script /dev/$DEV mklabel gpt;
  parted --script /dev/$DEV mkpart bbp 2048s 4095s;
  parted --script /dev/$DEV mkpart non-fs 4096s 20975615s;
  parted --script /dev/$DEV mkpart non-fs 20975616s 23072767s;
  parted --script /dev/$DEV mkpart '"ceph data"' 23072768s 23277567s;
  parted --script /dev/$DEV mkpart '"ceph block"' 23277568s -- -1;
  parted --script /dev/$DEV set 1 bios_grub on;
  parted --script /dev/$DEV set 2 raid on;
  parted --script /dev/$DEV set 3 raid on;
  sgdisk -t 4:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/$DEV;
  sgdisk -t 5:cafecafe-9b03-4f30-b4c6-b4b80ceff106 /dev/$DEV;

  mdadm -a /dev/md0 /dev/"$DEV"2;
  mdadm -a /dev/md1 /dev/"$DEV"3;
  update-grub;
  grub-install /dev/$DEV;

PS: Those sgdisk UUIDs are type codes that identify Ceph GPT partitions.
PS: The partition sector sizes are based on 512b sectors. Divide these by 8 if you have 4Kn AF drives.

I'm pedantic and like having predictable OSD numbering, herewith the process to destroy and free up the existing OSDs:

Code:

  OSD=8;
  ceph osd out $OSD;
  systemctl stop ceph-osd@$OSD;
  umount /var/lib/ceph/osd/ceph-$OSD;
  rmdir /var/lib/ceph/osd/ceph-$OSD;
  ceph osd destroy $OSD --yes-i-really-mean-it;
  ceph osd crush remove osd.$OSD;
  ceph osd rm $OSD;

To create a BlueStore OSD using custom partitions:

Code:

  DEV=sdb;
  OSD=8;
  mkfs -t xfs -f -i size=2048 -- /dev/"$DEV"4;
  mkdir /var/lib/ceph/osd/ceph-$OSD;
  mount -o noatime /dev/"$DEV"4 /var/lib/ceph/osd/ceph-$OSD;
  cd /var/lib/ceph/osd/ceph-$OSD;
  echo bluestore > type;
  blkid -o udev -p /dev/"$DEV"5 | grep UUID;
    # Copy & paste the ID_PART_ENTRY_UUID line in to your shell prompt
  ln -s /dev/disk/by-partuuid/$ID_PART_ENTRY_UUID block;
  echo "$ID_PART_ENTRY_UUID" > block_uuid;
  chown ceph.ceph . -R;
  ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-$OSD/activate.monmap;
  ceph-osd --setuser ceph -i $OSD --mkkey --mkfs;
  ceph osd new `cat /var/lib/ceph/osd/ceph-$OSD/fsid` $OSD;
  ceph auth add osd.$OSD osd 'allow *' mon 'allow profile osd' mgr 'allow profile osd' -i /var/lib/ceph/osd/ceph-$OSD/keyring;

Final step is to test self-activation. First unmount and then trigger udev:

Code:

cd /root;
umount /var/lib/ceph/osd/ceph-$OSD;
echo add > /sys/class/block/"$DEV"4/uevent;

noahmehl · Sep 15, 2019

I have created a script for Proxmox v6.0.7 that takes a 2TB Intel P4600 HHHL nvme, and creates 4 OSDs with about ~500GB size each. This assumes that your OSD numbers start at 0, and that matches the partition count. This could be further updated to take the next OSD #.

Code:

#!/bin/bash
set -x

OSD_UUID=$(uuidgen -r)
VG_UUID=$(uuidgen -r)
PTYPE_UUID=4fbd7e29-9d25-41b8-afd0-062c0ceff05d
OSD_KEY=$(ceph-authtool --gen-print-key)
OSD_START_PARTITION="$((($OSD_NUMBER - $OSD_OFFSET) * 500))GB"
OSD_END_PARTITION="$((($OSD_NUMBER - $OSD_OFFSET + 1) * 500))GB"
PARTITION_NUMBER=$(($OSD_NUMBER - $OSD_OFFSET + 1))

if [ -z $OSD_OFFSET ]
    then OSD_OFFSET=0
fi

if [ $(($OSD_NUMBER - $OSD_OFFSET)) == 0 ]
    then
        OSD_START_PARTITION=1
fi

parted --script $DEVICE mkpart '"temp"' $OSD_START_PARTITION $OSD_END_PARTITION
sgdisk --change-name="$PARTITION_NUMBER:ceph data" --partition-guid="$PARTITION_NUMBER:$OSD_UUID" --typecode="$PARTITION_NUMBER:$PTYPE_UUID" -- $DEVICE
partprobe
vgcreate -s 1G --force --yes ceph-$VG_UUID "$DEVICE"p"$PARTITION_NUMBER"
lvcreate --yes -l 100%FREE -n osd-block-$OSD_UUID ceph-$VG_UUID
ceph-volume lvm prepare --bluestore --data /dev/ceph-$VG_UUID/osd-block-$OSD_UUID --osd-fsid $OSD_UUID --crush-device-class nvme
ceph-volume lvm activate --bluestore $OSD_NUMBER $OSD_UUID

You would run this like:

# for i in {0..3}; do DEVICE=/dev/nvme0n1 OSD_NUMBER=$i OSD_OFFSET=0 ./nvme-create.sh; done

Search

Search

Ceph Server: why block devices and not partitions ?

Florent

Member

dietmar

Proxmox Staff Member

Florent

Member

dietmar

Proxmox Staff Member

Florent

Member

dietmar

Proxmox Staff Member

zystem

New Member

Florent

Member

zystem

New Member

dietmar

Proxmox Staff Member

avladulescu

Renowned Member

David Herselman

Renowned Member

noahmehl

Active Member

We value your privacy