Multiple OSDs for NVMe devices

Hi,

yes, you can do that, but you have to partition your NVMe manually beforehand.
Then you can pass the partition instead of the whole block device to the command.

Code:
pveceph createosd /dev/nvme0n1p1 ....
 
I've been progressing down this route. I put together a quick and dirty python script to carve up a device:

Code:
#!/usr/bin/python3

# Generate the parted/pveceph commands needed to split one NVMe
# device into several (metadata partition + data partition) pairs.

disk = 'nvme0n1'
disk_size = 1591          # usable size in GB
meta_part_size = 1        # size of each OSD metadata (DB/journal) partition in GB
partitions = 5            # number of OSDs to create on this disk
counter = 0

# calculate the data partition size
data_part_size = int((disk_size - (partitions * meta_part_size)) / partitions)

# zap the disk
cmd = 'ceph-disk zap /dev/{}\n'.format(disk)
cmd += '# create partitions for OSD metadata and storage\n'

for i in range(partitions):
    # metadata partition (start the first one at 1049kB for alignment)
    if counter == 0:
        cmd += 'parted /dev/{} -a optimal mkpart primary 1049kB 1GB\n'.format(disk)
    else:
        cmd += 'parted /dev/{} -a optimal mkpart primary {}GB {}GB\n'.format(
            disk, counter, counter + meta_part_size)
    counter += meta_part_size
    # data partition
    cmd += 'parted /dev/{} -a optimal mkpart primary {}GB {}GB\n'.format(
        disk, counter, counter + data_part_size)
    counter += data_part_size

cmd += '\n\n# proxmox osd creation\n'

# odd partition numbers hold the metadata, even ones hold the data
for i in range(1, (2 * partitions) + 1, 2):
    cmd += 'pveceph createosd /dev/{0}p{1} --journal_dev /dev/{0}p{2}\n'.format(
        disk, i + 1, i)

print(cmd)

This outputs the following:

Code:
ceph-disk zap /dev/nvme0n1
# create partitions for OSD metadata and storage
parted /dev/nvme0n1 -a optimal mkpart primary 1049kB 1GB
parted /dev/nvme0n1 -a optimal mkpart primary 1GB 318GB
parted /dev/nvme0n1 -a optimal mkpart primary 318GB 319GB
parted /dev/nvme0n1 -a optimal mkpart primary 319GB 636GB
parted /dev/nvme0n1 -a optimal mkpart primary 636GB 637GB
parted /dev/nvme0n1 -a optimal mkpart primary 637GB 954GB
parted /dev/nvme0n1 -a optimal mkpart primary 954GB 955GB
parted /dev/nvme0n1 -a optimal mkpart primary 955GB 1272GB
parted /dev/nvme0n1 -a optimal mkpart primary 1272GB 1273GB
parted /dev/nvme0n1 -a optimal mkpart primary 1273GB 1590GB


# proxmox osd creation
pveceph createosd /dev/nvme0n1p2 --journal_dev /dev/nvme0n1p1
pveceph createosd /dev/nvme0n1p4 --journal_dev /dev/nvme0n1p3
pveceph createosd /dev/nvme0n1p6 --journal_dev /dev/nvme0n1p5
pveceph createosd /dev/nvme0n1p8 --journal_dev /dev/nvme0n1p7
pveceph createosd /dev/nvme0n1p10 --journal_dev /dev/nvme0n1p9

When I run this, the partitions get set correctly:

Code:
Disk /dev/nvme0n1: 1.5 TiB, 1600321314816 bytes, 3125627568 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 85C3157D-C4E7-4ECF-8504-486EB2291929

Device               Start        End   Sectors   Size Type
/dev/nvme0n1p1        2048    1953791   1951744   953M Linux filesystem
/dev/nvme0n1p2     1953792  621092863 619139072 295.2G Linux filesystem
/dev/nvme0n1p3   621092864  623046655   1953792   954M Linux filesystem
/dev/nvme0n1p4   623046656 1242187775 619141120 295.2G Linux filesystem
/dev/nvme0n1p5  1242187776 1244141567   1953792   954M Linux filesystem
/dev/nvme0n1p6  1244141568 1863280639 619139072 295.2G Linux filesystem
/dev/nvme0n1p7  1863280640 1865234431   1953792   954M Linux filesystem
/dev/nvme0n1p8  1865234432 2484375551 619141120 295.2G Linux filesystem
/dev/nvme0n1p9  2484375552 2486327295   1951744   953M Linux filesystem
/dev/nvme0n1p10 2486327296 3105468415 619141120 295.2G Linux filesystem

However I hit the following issue with the pveceph createosd command:

Code:
pveceph createosd /dev/nvme0n1p2 --journal_dev /dev/nvme0n1p1
unable to get device info for 'nvme0n1p2'

Any ideas?
 
It makes no sense to put the journal_dev on the same disk.
 
Journal devices only make sense if they are faster than the data device.
In your case it isn't, so you just have to write more data for the journal without any speedup.
Any ideas on why I can't create the OSD with the partition (as opposed to the device) reference?
I guess it is because you use BlueStore, and BlueStore's block.db needs the data device to calculate the minimum partition size.
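If you want to size the DB partitions yourself instead of the fixed 1 GB in the script above, here is a rough sketch (it assumes the commonly cited rule of thumb of giving block.db a few percent of the data partition; check the Ceph documentation for your release):

Code:
# rough sizing sketch, not an exact formula - assumes the commonly cited
# "a few percent of the data device" rule of thumb for block.db
data_part_size_gb = 318                 # size of one data partition in GB
db_fraction = 0.04                      # assumed 4%; adjust per the Ceph docs

db_part_size_gb = max(1, int(data_part_size_gb * db_fraction))
print('block.db partition: {} GB'.format(db_part_size_gb))    # -> 12 GB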
 
The best practice to achieve multiple OSDs per NVMe seems to be "ceph-volume", which creates LVM volumes for BlueStore etc. - would you recommend that? Or is there any plan to support multiple OSDs from the PVE GUI? Thanks for any answers.
 
the best practice to achieve multiple OSDs per NVMe
This is not best practice. What would this be good for?
You can increase the number of workers for a more parallel workload.
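For example (assuming the OSD op shards/threads are what is meant by workers here; the defaults depend on your Ceph release and device class, so treat this only as a starting point):

Code:
# /etc/pve/ceph.conf - example values only, verify the defaults of your release first
[osd]
osd_op_num_shards = 8
osd_op_num_threads_per_shard = 2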

The only thing that makes sense, and has worked ever since PVE implemented the Ceph server, is multiple WAL/DB partitions on an SSD/NVMe.
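For example (the device names below are just placeholders): data OSDs on slower disks with their DB/journal partitions on the NVMe:

Code:
# placeholder device names - slow data disks, fast NVMe partitions for the DB/journal
pveceph createosd /dev/sdb --journal_dev /dev/nvme0n1p1
pveceph createosd /dev/sdc --journal_dev /dev/nvme0n1p2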
 
I think the problem is coming from the single-threaded finisher.
It has been fixed recently in Nautilus/master, but for Mimic there is an option:

bluestore_shard_finishers=true
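A sketch of how this could be applied (assuming it goes in the [osd] section of ceph.conf and that the OSDs are restarted afterwards; adjust to your setup):

Code:
# /etc/pve/ceph.conf - assumed placement of the option
[osd]
bluestore_shard_finishers = true

# then restart the OSDs on each node
systemctl restart ceph-osd.target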

https://www.spinics.net/lists/ceph-devel/msg39009.html
"
Currently the option bluestore_shard_finishers is set as false. As a
result there is one finisher to handle bluestore IO completion.

In NVMe scenario, it becomes the bottleneck.
In my cluster with one OSD on NVMe disk,
With finisher number is 1:
write: IOPS=20.6k, BW=80.5MiB/s (84.4MB/s)(70.8GiB/899908msec)
And when set finisher numbers to 8:
write: IOPS=41.1k, BW=161MiB/s (168MB/s)(141GiB/899916msec)

"
 
@spirit: thanks for that info!

@wolfgang: in our case we have 4TB NVMe drives and additional nodes with just 1TB NVMe - so we wanted to create same-size OSDs to avoid having to adjust any weights for distribution. Is multiple OSDs per NVMe a bad idea? As the links from user "WSL" show, there seems to be a speed improvement when using multiple OSDs on an NVMe.

The other thing is scalability, in our case 1TB OSDs would be perfect for scaling the cluster - am I wrong?
 
- so we wanted to create same-size OSDs to avoid having to adjust any weights for distribution
You don't have to adjust the weights. If you make 4 separate OSDs on one NVMe, you will have in total the same weight as you would with a single OSD, because the weight depends on the OSD size, and that is the same whether it is a single OSD or multiple OSDs on one disk.
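A quick illustration with the 1.6 TB disk from the earlier post (CRUSH weight defaults to roughly the OSD size in TiB, numbers rounded):

Code:
# illustration only: CRUSH weight is roughly the OSD size in TiB
single_osd_tib = 1600321314816 / 2**40      # whole 1.6 TB disk as one OSD  -> ~1.46
split_osds_tib = 5 * (295.2 / 1024)         # five ~295 GiB data partitions -> ~1.44
print(round(single_osd_tib, 2), round(split_osds_tib, 2))   # host weight is nearly identical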

We have run benchmarks to see whether multiple OSDs are faster on high-end enterprise SSDs, but there is no difference.

The other thing is scalability, in our case 1TB OSDs would be perfect for scaling the cluster - am I wrong?
I do not understand. You should have nearly the same amount of OSD space on each node. The distribution is normally (with the default PVE CRUSH map) done at the node level, not at the OSD level.
 
