Move Ceph journal to SSD?

I would have already done this if I knew how big those partitions should be. Also, can I use one extra partition for both the WAL and the DB, or do I need separate partitions (i.e. two partitions on the NVMe disk per OSD)? Thanks for clarifying this.


By the way, I don't think Ceph is suitable for running VMs and containers without SSD acceleration. Even with my 8 servers and 88 OSDs it is very slow. After moving to bluestore I get fewer timeouts, but it is still not really usable.

Edit: I am now following the instructions from "Mastering Ceph". It looks like preparing the disks also takes care of creating the right partitions in the right sizes on its own. I am testing it now and will post the results here to help other users who might have the same questions.
 
Here is my final script, which seems to work well and keeps the old OSD IDs intact:

Code:
#!/usr/bin/env bash
# the quoted list of OSD ids passed as $1 ("1 2 3 ...") is split into an array
ids=($1)

function migrate {
    echo migrating OSD $1;
    ID=$1
    re='^[0-9]+$'
    if ! [[ $ID =~ $re ]] ; then
       echo "error: OSD Id is needed" >&2; exit 1
    fi

    # get the device name from the OSD id
    # (note: "ceph-$ID" also matches ceph-${ID}0, ceph-${ID}1, ... if such OSDs are mounted on this host)
    DEVICE=$(mount | grep /var/lib/ceph/osd/ceph-$ID | grep -o '\/dev\/[a-z.-]*')
    echo Device: $DEVICE
    # check if the drive still needs to be converted
    if ceph osd metadata $ID | grep osd_objectstore | grep 'filestore' ; then
        echo filestore found - converting
    else
        echo bluestore found... exiting
        # uncomment the next line to actually skip OSDs that are already bluestore
#       return
    fi

    # show how many OSDs are currently filestore vs bluestore
    ceph osd count-metadata osd_objectstore
    ceph osd out $ID
    while ! ceph osd safe-to-destroy $ID; do sleep 10; done
    echo Destroying OSD $ID $DEVICE in 5 seconds... hit ctrl-c to abort
    sleep 5
    echo stopping..
    systemctl stop ceph-osd@$ID
    echo destroying...
    systemctl kill ceph-osd@$ID
    sleep 5
    umount /var/lib/ceph/osd/ceph-$ID
    sleep 5
    echo zap disk
    ceph-disk zap $DEVICE
    sleep 5
    echo osd remove
    ceph osd destroy $ID --yes-i-really-mean-it
    echo prepare bluestore on $DEVICE with id $ID
    sleep 5
    ceph-disk prepare --bluestore $DEVICE --osd-id $ID  --block.wal /dev/nvme0n1 --block.db /dev/nvme0n1
    echo finished converting OSD.$ID
}
# pass the OSD ids as a single quoted argument, e.g. ./blue "1 2 3 4 5"
for i in "${ids[@]}"
do
   migrate $i
done
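
For example, assuming the script is saved as ./blue as in the comment above, converting OSDs 10, 11 and 12 in one run looks like this (the quoted list becomes $1 and is split into the ids array):

Code:
./blue "10 11 12"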

This gives me the following partition table on my /dev/nvme0n1:

Code:
parted /dev/nvme0n1
GNU Parted 3.2
Using /dev/nvme0n1
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: Unknown (unknown)
Disk /dev/nvme0n1: 400GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name            Flags
 1      1049kB  1075MB  1074MB               ceph block.db
 2      1075MB  1679MB  604MB                ceph block.wal
 3      1679MB  2753MB  1074MB               ceph block.db
 4      2753MB  3356MB  604MB                ceph block.wal
 5      3356MB  4430MB  1074MB               ceph block.db
 6      4430MB  5034MB  604MB                ceph block.wal
 7      5034MB  6108MB  1074MB               ceph block.db
 8      6108MB  6712MB  604MB                ceph block.wal
 9      6712MB  7786MB  1074MB               ceph block.db
10      7786MB  8390MB  604MB                ceph block.wal
11      8390MB  9463MB  1074MB               ceph block.db
12      9463MB  10.1GB  604MB                ceph block.wal
13      10.1GB  11.1GB  1074MB               ceph block.db
14      11.1GB  11.7GB  604MB                ceph block.wal
15      11.7GB  12.8GB  1074MB               ceph block.db
16      12.8GB  13.4GB  604MB                ceph block.wal
17      13.4GB  14.5GB  1074MB               ceph block.db
18      14.5GB  15.1GB  604MB                ceph block.wal
19      15.1GB  16.2GB  1074MB               ceph block.db
20      16.2GB  16.8GB  604MB                ceph block.wal
21      16.8GB  17.9GB  1074MB               ceph block.db
22      17.9GB  18.5GB  604MB                ceph block.wal

Do you think those automatically created partitions are big enough? According to the documentation those defaults should be fine. Somehow (compared to the filestore setup) it uses much less space on the drive than expected, which leaves plenty of room for my new OSD pool; that will become partition 23 then...
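
For reference, ceph-disk appears to take these sizes from ceph.conf at prepare time (the ~1 GB db / ~604 MB wal partitions above are the ceph-disk defaults), so they can be overridden before running the script. A sketch only; the values are illustrative, not a sizing recommendation:

Code:
# /etc/ceph/ceph.conf - set before ceph-disk prepare creates the partitions
[osd]
# size of the block.db partition per OSD, in bytes (example: 10 GiB)
bluestore_block_db_size = 10737418240
# size of the block.wal partition per OSD, in bytes (example: 1 GiB)
bluestore_block_wal_size = 1073741824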

As soon as I find out how to do this, I will update this thread. Any recommendations are welcome! I hope this helps other people who run into the same situation when adding NVMes to an existing setup.

Tip: if you first mark all the OSDs on one host out (ceph osd out x y z) and wait until they are backfilled to the other hosts before starting this script, the whole run only takes a few minutes. Otherwise the script marks out each disk and waits until no PGs are left on it before destroying it; that is safe when running disk by disk, but it takes ages!
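
That pre-step would look roughly like this, assuming OSDs 1-5 are the ones on the host being converted (a sketch; the ids are placeholders):

Code:
# mark all OSDs of this host out so their data is backfilled to the other hosts
ceph osd out 1 2 3 4 5
# wait until the PGs have moved away and the OSDs can be destroyed safely
while ! ceph osd safe-to-destroy 1 2 3 4 5; do sleep 60; done
# now run the conversion script - every OSD is already empty
./blue "1 2 3 4 5"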
 
(quoting the partition table from the post above)
Hi,
this recommends putting only the DB on the NVMe: http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices
Code:
If there is only a small amount of fast storage available (e.g., less than a gigabyte), we recommend using it as a WAL device.
If there is more, provisioning a DB device makes more sense.
The BlueStore journal will always be placed on the fastest device available, so using a DB device will provide the same benefit that the WAL device would while also allowing additional metadata to be stored there (if it will fit).
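
Applied to the script above, that would mean dropping the --block.wal argument, for example (a sketch only; /dev/sdX is a placeholder for the data disk, and the WAL then lives inside the DB partition on the NVMe):

Code:
# DB (and therefore also the WAL) on the NVMe, data on the spinner
ceph-disk prepare --bluestore /dev/sdX --osd-id $ID --block.db /dev/nvme0n1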
Udo
 
Hey Tommytom,

Sorry for digging up an old thread. Can you please explain how this script works? I recently installed an SSD and want to move my journal to it.

What does the $1 indicate? Can I run this script on all OSDs?


(quoting the script and partition table from the post above)
 
OK, I figured out how to run it, but here is the error I am getting:

Code:
./ceph_migration.sh "4"
migrating OSD 4
Device: /dev/sdn /dev/sdp /dev/sdo
bluestore found... exiting
{
"bluestore": 62
}
osd.4 is already out.
OSD(s) 4 are safe to destroy without reducing data durability.
Destroying OSD 4 /dev/sdn /dev/sdp /dev/sdo in 5 seconds... hit ctrl-c to abort
stopping..
destroying...
umount: /var/lib/ceph/osd/ceph-4: not mounted
zap disk
wipefs: error: /dev/sdn1: probing initialization failed: Device or resource busy
ceph-disk: Error: Command '['/sbin/wipefs', '--all', '/dev/sdn1']' returned non-zero exit status 1
osd remove
destroyed osd.4
prepare bluestore on /dev/sdn /dev/sdp /dev/sdo with id 4
ceph-disk: Error: Device is mounted: /dev/sdn1
finished converting OSD.4
root@hosting:~#
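
For what it is worth, two things in the script look like the cause here: the grep for /var/lib/ceph/osd/ceph-4 also matches ceph-40, ceph-41, ... which is most likely where the three devices come from, and the return after "bluestore found... exiting" is commented out, so the script carries on with an OSD that is already bluestore. A possible fix for the device lookup (a sketch, untested):

Code:
# match the exact mount point; the trailing space stops ceph-4 from also matching ceph-40 etc.
DEVICE=$(mount | grep "/var/lib/ceph/osd/ceph-$ID " | grep -o '/dev/[a-z]*')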
 
