Ceph Server: why block devices and not partitions?

Hi all,

I'm trying Ceph Server with Proxmox, but I have a little problem.

My servers both have 2 disks, partitioned with RAID (sda1 & sdb1 mirrored).

And I would like to use sda2 & sdb2 for my OSDs in Ceph.

But it seems that all scripts from PVE are using block devices directly and not partitions.

Is there any workaround to use partitions instead of whole devices?

I was looking into loopback devices, but I was not able to create a virtual "sdz" listed in /sys/block (PVE reads this to enumerate all disks).
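
For example, a file-backed loop device (assuming losetup and a throwaway image; the names here are only illustrative) never shows up as an extra "sdX" entry, only as loopN:

Code:
  # Hypothetical backing file; loop devices register as /sys/block/loopN,
  # so a scan of /sys/block for sd* will never see them.
  dd if=/dev/zero of=/tmp/fake-osd.img bs=1M count=1024
  losetup /dev/loop0 /tmp/fake-osd.img
  ls /sys/block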

Anyone got an idea?

Thank you.
 
It makes absolutely no sense to use software RAID for Ceph.
You need to use whole disks instead.
 
Yes, sorry, I forgot to say that my RAID device is used to store system files.
I want to use sda2 & sdb2 for Ceph, which are not RAID-ed.
 
But you could add partition support to the CLI utilities. I do not need full partition support in the GUI (the existing one is suitable). Can you add partition support to "pveceph createosd /dev/sd[X]", e.g. "pveceph createosd /dev/sdd4"?
 
Look at the sources... the way they do it does not allow what you want.

But have a look at my post (http://forum.proxmox.com/threads/17909-Ceph-server-feedback), where I give a solution for doing it.
 
Yes, that sounds reasonable. But why not add partition support for the journal disk? Or add the ability to enter a path to the journals in the GUI instead of a disk? I want to store the journals on the first disk, where the Proxmox root partition is installed. I do not have unlimited space for additional disks in the server case. Why not use a fast disk for the journals AND the Proxmox installation?
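
For illustration (device names are just an example, not a tested recipe): the underlying ceph-disk tool already accepts a separate journal device or partition as its second positional argument, so something along these lines would put the journal on a spare partition of the OS disk:

Code:
  # Sketch only: /dev/sdb becomes the OSD, /dev/sda5 (a spare partition on
  # the OS/boot disk) holds its journal. ceph-disk is deprecated in newer releases.
  ceph-disk prepare --fs-type xfs /dev/sdb /dev/sda5
  ceph-disk activate /dev/sdb1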
 
I think that this makes no sense, because of performance issues.

dietmar, let me give you an example of a use case for such requirements.

Consider the following case: a 5-node Proxmox HA cluster, each node with 1 x Kingston V300 120 GB SSD for the journal and 1 x Intel 750 400 GB NVMe drive.

The cluster is still being benchmarked; testing the NVMes as direct local storage gives fio IOPS (inside a VM) of roughly 101k read, 64k write, and 64k read / 22k write combined.
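
For reference, numbers in that range typically come from a 4k random-access fio run along these lines (the exact parameters and target device here are assumptions, not the command actually used):

Code:
  # Hypothetical 4k random read test inside the VM; switch --rw to randwrite
  # or randrw for the write and combined figures.
  fio --name=randread --filename=/dev/vdb --ioengine=libaio --direct=1 \
      --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=60 \
      --time_based --group_reporting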

Now, with the Ceph cluster set up using the Kingston drives for journals and the Intels as OSDs, performance drops to 25354 read IOPS (inside the same VM, after storage migration to the new setup), so there is no need to spell out the rest, as you have probably already figured it out.

The Ceph configuration is the basic one from the pveceph deployment procedure, and the network connection between the nodes is not done over 1 Gbps links.

I suspect that the slowdown comes from the poor performance of the Kingston drives, but to confirm this I will manually set up the journal and OSD on different partitions of the Intel 750s.

The key and final expected result here would be to achieve storage redundancy (not high storage volumes) and a high IOPS count for a database application.

So what do you think, would it be worth implementing the partition layout in this case? Because the NVMes are capable of sustaining journal and data reads/writes at once without complaint (going by one card's tech specs).
 
The key objectives around Ceph are for it to be an easily managed, reliable and scalable storage architecture. Replacing an OSD should be as simple as replacing the old drive and running a single command which then brings it into service.

Typical Ceph deployments have OSD counts in the hundreds or thousands, not the quantities a typical Proxmox cluster operates with.

To answer your question though: yes, you most definitely can, but it creates a non-standard setup which would be much harder to troubleshoot.


We have a cluster with 6 servers where each has two SSDs, partitioned to provide a RAID 1 operating system volume and non-RAID journal partitions for 4 x FileStore spinners. We wanted to re-utilise all these drives and replace 6 discs in each server with only 2 much larger and higher-performance SSDs.

We subsequently partitioned the new drives to provide:
  • FakeMBR boot partition (1 MiB)
  • RAID 1 partition for Operating System (10 GiB)
  • RAID 1 partition for Swap (1 GiB)
  • small BlueStore OSD metadata partition (100 MiB)
  • BlueStore OSD data partition (balance)

Code:
  DEV=sdb
  # Create a new GPT label, then the five partitions described above
  parted --script /dev/$DEV mklabel gpt;
  parted --script /dev/$DEV mkpart bbp 2048s 4095s;                    # 1: FakeMBR/BIOS boot, 1 MiB
  parted --script /dev/$DEV mkpart non-fs 4096s 20975615s;             # 2: RAID 1 OS, 10 GiB
  parted --script /dev/$DEV mkpart non-fs 20975616s 23072767s;         # 3: RAID 1 swap, 1 GiB
  parted --script /dev/$DEV mkpart '"ceph data"' 23072768s 23277567s;  # 4: BlueStore metadata, 100 MiB
  parted --script /dev/$DEV mkpart '"ceph block"' 23277568s -- -1;     # 5: BlueStore data, rest of disk
  parted --script /dev/$DEV set 1 bios_grub on;
  parted --script /dev/$DEV set 2 raid on;
  parted --script /dev/$DEV set 3 raid on;
  # Tag partitions 4 and 5 with the Ceph "data" and "block" GPT type codes
  sgdisk -t 4:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/$DEV;
  sgdisk -t 5:cafecafe-9b03-4f30-b4c6-b4b80ceff106 /dev/$DEV;

  # Re-add the OS and swap partitions to the existing RAID 1 arrays, then reinstall the bootloader
  mdadm -a /dev/md0 /dev/"$DEV"2;
  mdadm -a /dev/md1 /dev/"$DEV"3;
  update-grub;
  grub-install /dev/$DEV;

PS: Those sgdisk UUIDs are type codes that identify Ceph GPT partitions.
PS: The partition sector offsets above assume 512-byte sectors. Divide them by 8 if you have 4Kn AF drives.
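
If you are not sure what a drive reports, a quick check (plain blockdev, nothing Ceph-specific) is:

Code:
  # Logical and physical sector sizes as seen by the kernel; a 4Kn drive
  # reports 4096 for the logical size, in which case divide the offsets by 8.
  blockdev --getss --getpbsz /dev/sdb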


I'm pedantic and like having predictable OSD numbering; herewith the process to destroy and free up the existing OSDs:
Code:
  OSD=8;
  ceph osd out $OSD;                               # stop mapping new data to this OSD
  systemctl stop ceph-osd@$OSD;                    # stop the daemon
  umount /var/lib/ceph/osd/ceph-$OSD;
  rmdir /var/lib/ceph/osd/ceph-$OSD;
  ceph osd destroy $OSD --yes-i-really-mean-it;    # remove its auth keys and mark it destroyed
  ceph osd crush remove osd.$OSD;                  # remove it from the CRUSH map
  ceph osd rm $OSD;                                # remove it from the cluster
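
A quick sanity check before reusing the disk (generic status commands, not part of the procedure above) is to confirm the OSD is gone from the CRUSH tree and watch the rebalance:

Code:
  ceph osd tree | grep -w "osd.$OSD"   # should return nothing once removed
  ceph -s                              # watch recovery/rebalance progress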


To create a BlueStore OSD using custom partitions:
Code:
  DEV=sdb;
  OSD=8;
  # Create and mount the small metadata partition that holds the OSD's working directory
  mkfs -t xfs -f -i size=2048 -- /dev/"$DEV"4;
  mkdir /var/lib/ceph/osd/ceph-$OSD;
  mount -o noatime /dev/"$DEV"4 /var/lib/ceph/osd/ceph-$OSD;
  cd /var/lib/ceph/osd/ceph-$OSD;
  echo bluestore > type;
  # Point the OSD at the large "ceph block" partition via its GPT partition UUID
  blkid -o udev -p /dev/"$DEV"5 | grep UUID;
    # Copy & paste the ID_PART_ENTRY_UUID line into your shell prompt
  ln -s /dev/disk/by-partuuid/$ID_PART_ENTRY_UUID block;
  echo "$ID_PART_ENTRY_UUID" > block_uuid;
  chown ceph.ceph . -R;
  # Fetch the monmap with the bootstrap-osd key, then create, register and authorise the OSD
  ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-$OSD/activate.monmap;
  ceph-osd --setuser ceph -i $OSD --mkkey --mkfs;
  ceph osd new `cat /var/lib/ceph/osd/ceph-$OSD/fsid` $OSD;
  ceph auth add osd.$OSD osd 'allow *' mon 'allow profile osd' mgr 'allow profile osd' -i /var/lib/ceph/osd/ceph-$OSD/keyring;


The final step is to test self-activation: first unmount, then trigger udev:
Code:
  cd /root;
  umount /var/lib/ceph/osd/ceph-$OSD;
  echo add > /sys/class/block/"$DEV"4/uevent;
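
To confirm the self-activation actually worked (generic checks, not from the procedure above), the metadata partition should be mounted again and the daemon back up:

Code:
  mount | grep "ceph-$OSD"          # partition 4 remounted by the udev trigger
  systemctl status ceph-osd@$OSD    # OSD daemon active again
  ceph osd tree                     # osd.$OSD shown as up/in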
 
I have created a script for Proxmox v6.0.7 that takes a 2 TB Intel P4600 HHHL NVMe and creates 4 OSDs of roughly 500 GB each. It assumes that your OSD numbers start at 0 and match the partition count. This could be further updated to take the next OSD number.

Code:
#!/bin/bash
# Expects DEVICE, OSD_NUMBER and (optionally) OSD_OFFSET in the environment.
set -x

# Default the offset before it is used in the arithmetic below
if [ -z "$OSD_OFFSET" ]
    then OSD_OFFSET=0
fi

OSD_UUID=$(uuidgen -r)
VG_UUID=$(uuidgen -r)
PTYPE_UUID=4fbd7e29-9d25-41b8-afd0-062c0ceff05d   # GPT type code for "ceph data"
OSD_KEY=$(ceph-authtool --gen-print-key)          # generated but not used further below
OSD_START_PARTITION="$((($OSD_NUMBER - $OSD_OFFSET) * 500))GB"
OSD_END_PARTITION="$((($OSD_NUMBER - $OSD_OFFSET + 1) * 500))GB"
PARTITION_NUMBER=$(($OSD_NUMBER - $OSD_OFFSET + 1))

# The first partition starts at 1 rather than 0GB so it does not collide with the GPT header
if [ $(($OSD_NUMBER - $OSD_OFFSET)) == 0 ]
    then
        OSD_START_PARTITION=1
fi

# Carve out a ~500 GB partition and tag it with the Ceph data type code
parted --script $DEVICE mkpart '"temp"' $OSD_START_PARTITION $OSD_END_PARTITION
sgdisk --change-name="$PARTITION_NUMBER:ceph data" --partition-guid="$PARTITION_NUMBER:$OSD_UUID" --typecode="$PARTITION_NUMBER:$PTYPE_UUID" -- $DEVICE
partprobe

# Wrap the partition in LVM and hand it to ceph-volume as a BlueStore OSD
vgcreate -s 1G --force --yes ceph-$VG_UUID "$DEVICE"p"$PARTITION_NUMBER"
lvcreate --yes -l 100%FREE -n osd-block-$OSD_UUID ceph-$VG_UUID
ceph-volume lvm prepare --bluestore --data /dev/ceph-$VG_UUID/osd-block-$OSD_UUID --osd-fsid $OSD_UUID --crush-device-class nvme
ceph-volume lvm activate --bluestore $OSD_NUMBER $OSD_UUID

You would run this like:

# for i in {0..3}; do DEVICE=/dev/nvme0n1 OSD_NUMBER=$i OSD_OFFSET=0 ./nvme-create.sh; done
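
To verify the result (generic checks, not part of the script), and as a possible starting point for the "next OSD #" improvement mentioned above:

Code:
  ceph-volume lvm list    # lists each LV with its osd id and osd fsid
  ceph osd df tree        # the new nvme-class OSDs should be up and in
  # Possible starting point for picking the next free OSD number:
  NEXT_OSD=$(($(ceph osd ls | sort -n | tail -n 1) + 1))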
 
