Multiple OSDs for NVMe devices

Hi,

yes, you can do that, but you have to partition your NVMe manually beforehand.
Then you can pass the partition instead of the whole block device to the command.

Code:
pveceph createosd /dev/nvme0n1p1 ....
 
I've been progressing down this route. I put together a quick and dirty python script to carve up a device:

Code:
#!/usr/bin/python3

# Generate the parted/pveceph commands needed to split one NVMe
# device into several (metadata partition + data partition) pairs.

disk = 'nvme0n1'
disk_size = 1591          # usable size in GB
meta_part_size = 1        # size of each OSD metadata (DB/journal) partition in GB
partitions = 5            # number of OSDs to create on this disk
counter = 0

# calculate the data partition size
data_part_size = int((disk_size - (partitions * meta_part_size)) / partitions)

# zap the disk
cmd = 'ceph-disk zap /dev/{}\n'.format(disk)
cmd += '# create partitions for OSD metadata and storage\n'

for i in range(partitions):
    # metadata partition (start the first one at 1049kB for alignment)
    if counter == 0:
        cmd += 'parted /dev/{} -a optimal mkpart primary 1049kB 1GB\n'.format(disk)
    else:
        cmd += 'parted /dev/{} -a optimal mkpart primary {}GB {}GB\n'.format(
            disk, counter, counter + meta_part_size)
    counter += meta_part_size
    # data partition
    cmd += 'parted /dev/{} -a optimal mkpart primary {}GB {}GB\n'.format(
        disk, counter, counter + data_part_size)
    counter += data_part_size

cmd += '\n\n# proxmox osd creation\n'

# odd partition numbers hold the metadata, even ones hold the data
for i in range(1, (2 * partitions) + 1, 2):
    cmd += 'pveceph createosd /dev/{0}p{1} --journal_dev /dev/{0}p{2}\n'.format(
        disk, i + 1, i)

print(cmd)

This outputs the following:

Code:
ceph-disk zap /dev/nvme0n1
# create partitions for OSD metadata and storage
parted /dev/nvme0n1 -a optimal mkpart primary 1049kB 1GB
parted /dev/nvme0n1 -a optimal mkpart primary 1GB 318GB
parted /dev/nvme0n1 -a optimal mkpart primary 318GB 319GB
parted /dev/nvme0n1 -a optimal mkpart primary 319GB 636GB
parted /dev/nvme0n1 -a optimal mkpart primary 636GB 637GB
parted /dev/nvme0n1 -a optimal mkpart primary 637GB 954GB
parted /dev/nvme0n1 -a optimal mkpart primary 954GB 955GB
parted /dev/nvme0n1 -a optimal mkpart primary 955GB 1272GB
parted /dev/nvme0n1 -a optimal mkpart primary 1272GB 1273GB
parted /dev/nvme0n1 -a optimal mkpart primary 1273GB 1590GB


# proxmox osd creation
pveceph createosd /dev/nvme0n1p2 --journal_dev /dev/nvme0n1p1
pveceph createosd /dev/nvme0n1p4 --journal_dev /dev/nvme0n1p3
pveceph createosd /dev/nvme0n1p6 --journal_dev /dev/nvme0n1p5
pveceph createosd /dev/nvme0n1p8 --journal_dev /dev/nvme0n1p7
pveceph createosd /dev/nvme0n1p10 --journal_dev /dev/nvme0n1p9

When I run this, the partitions get set correctly:

Code:
Disk /dev/nvme0n1: 1.5 TiB, 1600321314816 bytes, 3125627568 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 85C3157D-C4E7-4ECF-8504-486EB2291929

Device               Start        End   Sectors   Size Type
/dev/nvme0n1p1        2048    1953791   1951744   953M Linux filesystem
/dev/nvme0n1p2     1953792  621092863 619139072 295.2G Linux filesystem
/dev/nvme0n1p3   621092864  623046655   1953792   954M Linux filesystem
/dev/nvme0n1p4   623046656 1242187775 619141120 295.2G Linux filesystem
/dev/nvme0n1p5  1242187776 1244141567   1953792   954M Linux filesystem
/dev/nvme0n1p6  1244141568 1863280639 619139072 295.2G Linux filesystem
/dev/nvme0n1p7  1863280640 1865234431   1953792   954M Linux filesystem
/dev/nvme0n1p8  1865234432 2484375551 619141120 295.2G Linux filesystem
/dev/nvme0n1p9  2484375552 2486327295   1951744   953M Linux filesystem
/dev/nvme0n1p10 2486327296 3105468415 619141120 295.2G Linux filesystem

However I hit the following issue with the pveceph createosd command:

Code:
pveceph createosd /dev/nvme0n1p2 --journal_dev /dev/nvme0n1p1
unable to get device info for 'nvme0n1p2'

Any ideas?
 
It makes no sense to put the journal_dev on the same disk.
 
Journal devices only make sense if they are faster than the data device.
In your case it isn't, so you just have to write more data for the journal without any speedup.
Any ideas on why I can't create the OSD with the partition (as opposed to the device) reference?
I guess it is because you use BlueStore, and BlueStore's block.db needs the data device to calculate the minimum partition size.
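If you want to size the DB partitions yourself instead of the fixed 1 GB in the script above, here is a rough sketch (it assumes the commonly cited rule of thumb of giving block.db a few percent of the data partition; check the Ceph documentation for your release):

Code:
# rough sizing sketch, not an exact formula - assumes the commonly cited
# "a few percent of the data device" rule of thumb for block.db
data_part_size_gb = 318                 # size of one data partition in GB
db_fraction = 0.04                      # assumed 4%; adjust per the Ceph docs

db_part_size_gb = max(1, int(data_part_size_gb * db_fraction))
print('block.db partition: {} GB'.format(db_part_size_gb))    # -> 12 GB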
 
The best practice to achieve multiple OSDs per NVMe seems to be "ceph-volume", which creates LVM volumes for BlueStore etc. - would you recommend that? Or is there any plan to support multiple OSDs from the PVE GUI? Thanks for any answers.
 
the best practice to achieve multiple OSDs per NVMe
This is not best practice. What would this be good for?
You can increase the number of workers for a more parallel workload.
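For example (assuming the OSD op shards/threads are what is meant by workers here; the defaults depend on your Ceph release and device class, so treat this only as a starting point):

Code:
# /etc/pve/ceph.conf - example values only, verify the defaults of your release first
[osd]
osd_op_num_shards = 8
osd_op_num_threads_per_shard = 2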

The only thing that makes sense, and has worked ever since PVE implemented the Ceph server, is multiple WAL/DB partitions on an SSD/NVMe.
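For example (the device names below are just placeholders): data OSDs on slower disks with their DB/journal partitions on the NVMe:

Code:
# placeholder device names - slow data disks, fast NVMe partitions for the DB/journal
pveceph createosd /dev/sdb --journal_dev /dev/nvme0n1p1
pveceph createosd /dev/sdc --journal_dev /dev/nvme0n1p2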
 
I think the problem is coming from the single-threaded finisher.
It has been fixed recently in Nautilus/master, but for Mimic there is an option:

bluestore_shard_finishers=true
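A sketch of how this could be applied (assuming it goes in the [osd] section of ceph.conf and that the OSDs are restarted afterwards; adjust to your setup):

Code:
# /etc/pve/ceph.conf - assumed placement of the option
[osd]
bluestore_shard_finishers = true

# then restart the OSDs on each node
systemctl restart ceph-osd.target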

https://www.spinics.net/lists/ceph-devel/msg39009.html
"
Currently the option bluestore_shard_finishers is set as false. As a
result there is one finisher to handle bluestore IO completion.

In NVMe scenario, it becomes the bottleneck.
In my cluster with one OSD on NVMe disk,
With finisher number is 1:
write: IOPS=20.6k, BW=80.5MiB/s (84.4MB/s)(70.8GiB/899908msec)
And when set finisher numbers to 8:
write: IOPS=41.1k, BW=161MiB/s (168MB/s)(141GiB/899916msec)

"
 
@spirit: thanks for that info!

@wolfgang: in our case we have 4TB NVMe drives and additional nodes with just 1TB NVMe - so we wanted to create same-size OSDs to avoid having to adjust any weights for distribution. Is multiple OSDs per NVMe a bad idea? As the links from user "WSL" show, there seems to be a speed improvement when using multiple OSDs on an NVMe.

The other thing is scalability, in our case 1TB OSDs would be perfect for scaling the cluster - am I wrong?
 
- so we wanted to create same-size OSDs to avoid having to adjust any weights for distribution
You don't have to adjust the weights. If you make 4 separate OSDs on one NVMe, you will have in total the same weight as you would with a single OSD, because the weight depends on the OSD size, and that is the same whether it is a single OSD or multiple OSDs on one disk.
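A quick illustration with the 1.6 TB disk from the earlier post (CRUSH weight defaults to roughly the OSD size in TiB, numbers rounded):

Code:
# illustration only: CRUSH weight is roughly the OSD size in TiB
single_osd_tib = 1600321314816 / 2**40      # whole 1.6 TB disk as one OSD  -> ~1.46
split_osds_tib = 5 * (295.2 / 1024)         # five ~295 GiB data partitions -> ~1.44
print(round(single_osd_tib, 2), round(split_osds_tib, 2))   # host weight is nearly identical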

We have run benchmarks to see whether multiple OSDs are faster on high-end enterprise SSDs, but there is no difference.

The other thing is scalability, in our case 1TB OSDs would be perfect for scaling the cluster - am I wrong?
I do not understand. You should have nearly the same amount of OSD space on each node. The distribution is normally (with the default PVE CRUSH map) done at the node level, not at the OSD level.
 
