[SOLVED] CEPH: FileStore to BlueStore

Belokan

Hello,

I'd like to move my CEPH environment from Filestore to Bluestore.
Could you confirm that this is not feasible "online", and that I have to destroy and then re-create my OSDs?

In that case, does the following look correct?

1.- Move disk(s) from RBD to NFS (for instance).
2.- Destroy CEPH pool(s)
3.- Stop/Out/Destroy each OSD
4.- Recreate OSDs with Bluestore option ticked
5.- Recreate pool(s)
6.- Move the disk(s) back from NFS to RBD

Thanks in advance !
 
It can be done as described in your listing, but why not just migrate one OSD at a time? That spares you the move to the NFS server and back.

I would go with this:
  1. mark OSD out
  2. stop OSD
  3. let recovery happen
  4. (after health is back to OK)
  5. remove OSD
  6. create OSD as bluestore OSD
  7. let recovery happen again
  8. (after health is back to OK)
  9. take the next OSD out
In any case, it is always good to have a backup!

For reference: https://pve.proxmox.com/pve-docs/chapter-pveceph.html
http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/
 

I have done this, for each OSD on each cluster member:

Code:
# Usage: pass the OSD id as the first argument.
ID=$1
echo "wait for cluster ok"
while ! ceph health | grep HEALTH_OK ; do echo -n "."; sleep 10 ; done
echo "ceph osd out $ID"
ceph osd out $ID
sleep 10
while ! ceph health | grep HEALTH_OK ; do sleep 10 ; done
echo "systemctl stop ceph-osd@$ID.service"
systemctl stop ceph-osd@$ID.service
sleep 60
# Derive the whole-disk device from the mounted OSD partition.
# Note: cutting at the first "p" assumes partition names like /dev/nvme0n1p1;
# adjust for /dev/sdX1-style devices.
DEVICE=`mount | grep /var/lib/ceph/osd/ceph-$ID | cut -f1 -d"p"`

umount /var/lib/ceph/osd/ceph-$ID
echo "ceph-disk zap $DEVICE"
ceph-disk zap $DEVICE
ceph osd destroy $ID --yes-i-really-mean-it
echo "ceph-disk prepare --bluestore $DEVICE --osd-id $ID"
ceph-disk prepare --bluestore $DEVICE --osd-id $ID
sleep 10
ceph osd metadata $ID
ceph -s
echo "wait for cluster ok"
while ! ceph health | grep HEALTH_OK ; do echo -n "."; sleep 10 ; done
ceph -s
echo "proceed with next"
 
Many thanks, this was extremely useful and we're on our way migrating our OSDs over to bluestore.

If it helps others: I would recommend referencing the DB partition by its partuuid instead of e.g. /dev/sda4, which requires some slight changes.
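
To find the partuuid of a given partition, something along these lines should work (a sketch only; /dev/sda4 is just an example partition):

Code:
# Resolve a partition to its PARTUUID (example partition: /dev/sda4)
blkid -s PARTUUID -o value /dev/sda4
# or list the by-partuuid symlinks and match the device name:
ls -l /dev/disk/by-partuuid/ | grep sda4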

We have existing FileStore OSDs whose journals point at partitions on SSDs. We typically run the OS as software RAID1 on the same SSDs. What follows is an overview of a small cluster where each host has 2 x SSDs and 2 x spinners.

Status:
Code:
[root@kvm1a ~]# parted /dev/sda p
Model: ATA INTEL SSDSC2BB48 (scsi)
Disk /dev/sda: 480GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name    Flags
 1      1049kB  2097kB  1049kB               bbp     bios_grub   # Fake MBR
 2      2097kB  8391MB  8389MB               non-fs  raid        # root file system (/) with discard enabled
 3      8590MB  51.5GB  42.9GB               non-fs              # 40GB journal for /dev/sdc spinner
 7      343GB   480GB   137GB                non-fs  raid        # swap with discard enabled


[root@kvm1a ~]# grep 'sda' /proc/partitions
   8        0  468851544 sda
   8        1       1024 sda1    # Fake MBR
   8        2    8192000 sda2    # raid1 member for root file system (/)
   8        3   41943040 sda3    # journal partition for /dev/sdc
   8        7  134217728 sda7    # raid1 member for swap


[root@kvm1a ~]# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda2[1] sdb2[2]
      8187904 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md1 : active raid1 sda7[1] sdb7[2]
      134086656 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>


[root@kvm1a ~]# cat /etc/fstab
# <file system> <mount point> <type> <options> <dump> <pass>
/dev/md0 / ext4 errors=remount-ro,discard 0 1
/dev/md1 none swap sw,discard 0 0
proc /proc proc defaults 0 0


[root@kvm1a ~]# ps auxfww | grep osd
root      7319  0.0  0.0  12788   944 pts/1    S+   10:00   0:00          \_ grep osd
ceph      2462  3.2  1.2 1381880 625156 ?      Ssl  Nov15 207:57 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
ceph      2726  3.3  1.1 1331580 579068 ?      Ssl  Nov15 219:49 /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph


[root@kvm1a ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       5.45517 root default
-2       1.81839     host kvm1a
 0   hdd 0.90919         osd.0      up  1.00000 1.00000
 1   hdd 0.90919         osd.1      up  1.00000 1.00000
-3       1.81839     host kvm1b
 2   hdd 0.90919         osd.2      up  1.00000 1.00000
 3   hdd 0.90919         osd.3      up  1.00000 1.00000
-4       1.81839     host kvm1c
 4   hdd 0.90919         osd.4      up  1.00000 1.00000
 5   hdd 0.90919         osd.5      up  1.00000 1.00000


[root@kvm1a ~]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
 0   hdd 0.90919  1.00000  931G  388G  542G 41.74 1.01 141
 1   hdd 0.90919  1.00000  931G  384G  546G 41.32 1.00 139
 2   hdd 0.90919  1.00000  931G  402G  528G 43.18 1.04 143
 3   hdd 0.90919  1.00000  931G  369G  561G 39.64 0.96 137
 4   hdd 0.90919  1.00000  931G  339G  591G 36.43 0.88 124
 5   hdd 0.90919  1.00000  931G  433G  497G 46.55 1.12 156
                    TOTAL 5586G 2317G 3269G 41.48
MIN/MAX VAR: 0.88/1.12  STDDEV: 3.10


We run our storage pool with 3-way replication and didn't want to wait for everything originally stored on osd.0 to replicate to osd.1 before bringing osd.0 back online. The cluster starts replicating data to osd.1 the moment you 'out' osd.0, but the replacement BlueStore OSD should come online about a minute later, which avoids too much unnecessary data movement.

Define variables for the exercise:
Code:
ID='0';
DEVICE=`mount | grep /var/lib/ceph/osd/ceph-$ID | perl -pe 's/^(.*?)1 .*/\1/g'`;
echo $DEVICE
  # NB: Should be eg '/dev/sdc'
JOURNAL=
  # Lookup original with: dir /var/lib/ceph/osd/ceph-$ID/journal
  # Reference with:       dir /dev/disk/by-partuuid
  #   eg: /dev/disk/by-partuuid/0b733181-05fa-41ea-9272-822b595bdd47
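
As a convenience, the journal's by-partuuid path can also be derived from the existing FileStore symlink while the OSD is still mounted; a sketch, assuming the usual ceph-disk layout where /var/lib/ceph/osd/ceph-$ID/journal points at the journal partition:

Code:
# Resolve the journal symlink to its partition, then build the by-partuuid path
JPART=$(readlink -f /var/lib/ceph/osd/ceph-$ID/journal);
JOURNAL=/dev/disk/by-partuuid/$(blkid -s PARTUUID -o value $JPART);
echo $JOURNAL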

Destroy the OSD and re-create using bluestore:
Code:
ceph osd out $ID;
sleep 5;
systemctl stop ceph-osd@$ID.service;
sleep 5;
umount /var/lib/ceph/osd/ceph-$ID;
ceph-disk zap $DEVICE;
ceph osd destroy $ID --yes-i-really-mean-it;
# without a separate DB device:
#ceph-disk prepare --bluestore $DEVICE --osd-id $ID;
ceph-disk prepare --bluestore $DEVICE --block.db $JOURNAL --osd-id $ID;
ceph osd metadata $ID;

NB: Wait for Ceph to completely heal before moving on to the next OSD!
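
For that wait, the same health-check loop used in the script earlier in this thread can be reused, e.g.:

Code:
# Block until the cluster reports HEALTH_OK again
while ! ceph health | grep -q HEALTH_OK; do echo -n "."; sleep 10; done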
PS: Advanced users may consider doing this on all OSDs on a chosen host but beware of the overhead this creates...
 
It should also be noted, as a word of caution, that with the default min_size of 2, no other disk in the cluster may fail during the migration. All pools will go into read-only mode once they hit min_size, leaving you with an interruption of your VMs/CTs on writes.
 
Alwin, correct... We run with size 3 (replicas) and min_size 1; we have 5 hosts, of which 3 are monitors, and only 20 OSDs...

PS: I glanced over a thread where someone stated that there was a concern with running 3/1 pools, citing a discussion on the Ceph mailing list, but I couldn't find a reference to it and don't quite understand the concern. Anyone care to elaborate?
 
To anyone doing this: wait a period of time after marking the OSDs out and again after stopping the services. Regarding min_size 1: data can be acknowledged as written once it lands on only the primary placement group, so you should change min_size back to 2 afterwards to ensure any OSD failure scenarios are covered. You would, however, prefer to run with min_size 1 while you are busy with maintenance and already running in a degraded state.

In other words:
Code:
OSDs='8 9 10 11';
DISKs='sdc sdd sde sdf';
SSDs='sda3 sda4 sdb3 sdb4';
# Mark all OSDs on this host out and let the cluster start rebalancing:
for ID in $OSDs; do
  ceph osd out $ID;
done
# Watch until recovery has settled (Ctrl+C to exit watch):
watch ceph -s;
# Stop, unmount and destroy the OSDs:
for ID in $OSDs; do
  systemctl stop ceph-osd@$ID.service;
  umount /var/lib/ceph/osd/ceph-$ID;
  ceph osd destroy $ID --yes-i-really-mean-it;
done
# Wipe the spinners and discard the old SSD journal partitions:
for DEV in $DISKs; do
  ceph-disk zap /dev/$DEV;
done
for DEV in $DISKs $SSDs; do
  blkdiscard /dev/$DEV;
done
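
The block above only destroys and wipes; re-creating the OSDs as BlueStore with their DB on the SSD partitions could then look roughly like this (a sketch only; the pairing of OSD ids, spinners and SSD partitions is specific to the host, and the array names are assumptions):

Code:
# Hypothetical pairing for one host: osd.8 -> sdc + sda3, osd.9 -> sdd + sda4, ...
IDs=(8 9 10 11);
DISKs=(sdc sdd sde sdf);
DBs=(sda3 sda4 sdb3 sdb4);
for i in 0 1 2 3; do
  DB=/dev/disk/by-partuuid/$(blkid -s PARTUUID -o value /dev/${DBs[$i]});
  ceph-disk prepare --bluestore /dev/${DISKs[$i]} --block.db $DB --osd-id ${IDs[$i]};
done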


Quick correlation commands when working with FileStore OSDs that have SSD journals:
Code:
df -h | grep ceph | sort;
for f in /var/lib/ceph/osd/ceph-*/journal; do echo -ne "$f:\t"; a=`ls -l $f`; dir ${a#* -> }; done;
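
After the migration, the analogous check for BlueStore OSDs with a separate DB device might look like this (assuming the ceph-disk layout where block.db is a symlink inside the OSD directory):

Code:
df -h | grep ceph | sort;
for f in /var/lib/ceph/osd/ceph-*/block.db; do echo -ne "$f:\t"; readlink -f $f; done;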


Reference commands to change RBD pool settings:
Code:
[admin@kvm5f ~]# ceph osd pool ls
rbd
cephfs_data
cephfs_metadata
[admin@kvm5f ~]# ceph osd pool get cephfs_data min_size
min_size: 1
[admin@kvm5f ~]# ceph osd pool set cephfs_data min_size 2
set pool 2 min_size to 2
[admin@kvm5f ~]# ceph osd pool get cephfs_data min_size
min_size: 2
 
