PVE6 pveceph create osd: unable to get device info

Ingo S

We have a 6 node cluster, 4 of which are Ceph nodes, each with 8 HDDs and one enterprise NVMe SSD. In the last few days several HDDs died and have to be replaced.
Back when I set up the Ceph storage, I created a partition on the SSD for every OSD to serve as its WAL device.

When I try to create a new OSD, I get an error message.
Bash:
root@vm-2:~# pveceph createosd /dev/sde -wal_dev /dev/nvme0n1p5
unable to get device info for '/dev/nvme0n1p5' for type wal_dev
root@vm-2:~#

Since the partitions have a naming scheme of /dev/nvmeXnYpZ, I believe pveceph does not accept this as a valid device path, so I am unable to create OSDs with the NVMe SSD as WAL device.
Bash:
root@vm-2:~# lsblk
NAME               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
[...]
sdd                  8:48   0   3.7T  0 disk 
├─sdd1               8:49   0   100M  0 part /var/lib/ceph/osd/ceph-14
└─sdd2               8:50   0   3.7T  0 part 
sde                  8:64   0   3.7T  0 disk 
sdf                  8:80   0   3.7T  0 disk 
├─sdf1               8:81   0   100M  0 part /var/lib/ceph/osd/ceph-16
└─sdf2               8:82   0   3.7T  0 part 
[...]
nvme0n1            259:0    0 349.3G  0 disk 
├─nvme0n1p1        259:1    0  43.7G  0 part 
├─nvme0n1p2        259:2    0  43.7G  0 part 
├─nvme0n1p3        259:3    0  43.7G  0 part 
├─nvme0n1p4        259:4    0  43.7G  0 part 
├─nvme0n1p5        259:5    0  43.7G  0 part 
├─nvme0n1p6        259:6    0  43.7G  0 part 
├─nvme0n1p7        259:7    0  43.7G  0 part 
└─nvme0n1p8        259:8    0  43.7G  0 part

How can I handle this?

greetings from the North Sea
 
root@vm-2:~# pveceph createosd /dev/sde -wal_dev /dev/nvme0n1p5
unable to get device info for '/dev/nvme0n1p5' for type wal_dev
This message seems to come from ceph-volume directly. Can you try to create an OSD with ceph-volume directly (see ceph-volume lvm create -h for the options)? That way we can rule out one end.
 
I issued the following command:
Bash:
root@vm-2:~# ceph-volume lvm create --bluestore --data /dev/sde --block.wal /dev/nvme0n1p5

And it succeeded, with no error messages. OSD.15 has been created successfully.

Bash:
sde                                                                                                     8:64   0   3.7T  0 disk
└─ceph--bc8e2c75--1e88--4bc4--a35a--3e5dc4513daf-osd--block--87b92612--855a--4b7a--b816--7638cb0448eb 253:5    0   3.7T  0 lvm
 
I took a look at what this LVM-based OSD is all about and ran into some problematic things:

Imagine a softly defective HDD, with occasional read errors and reallocated sectors, in a somewhat big server with about 32 drives.
These soft failures are not really recognised by the HBA. S.M.A.R.T. or the kernel reports these errors, but you do not always get a conclusive device name. On some controllers it is something like /dev/bus/0.

If such an error occurs, usually the OSD will shut down and be down and out. With "normal" device-based OSDs, I can then look up the corresponding device with mount | grep "ceph-<osd-num>".
I have written a script that maps this device name to a slot/port on the various LSI HBAs in the server. This makes it really easy and, above all, safe to replace the drive.

With LVM-based OSDs you just get "tmpfs" as the device name. Using lvs, pvs or vgs you get a UUID-like identifier which has no connection to the OSD name. This makes it really tricky, if not impossible.
Is there an easy way to get the OSDs device name?
 
Is there an easy way to get the OSDs device name?
ceph-volume lvm list or ceph-volume inventory --format json-pretty; the latter needs the JSON formatting to get the OSD ID and the disk.
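For example, a hedged sketch (the device path is just a placeholder from this thread):
Bash:
# List all LVM-based OSDs with their OSD ID and the underlying block/DB/WAL devices
ceph-volume lvm list

# Or restrict the output to a single data disk
ceph-volume lvm list /dev/sde

# Machine-readable inventory, useful for scripting an OSD-ID-to-disk mapping
ceph-volume inventory --format json-pretty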
 
Thank you, this helps.

Will it be possible to create new OSDs with pveceph createosd, or should I stick to the lvm method?

Maybe it's just the wrong way of doing it, creating a separate partition for each WAL? I'm not sure, but can I put multiple WALs onto the same single device? Like:
Bash:
pveceph createosd /dev/sd[a-h] -wal_dev /dev/nvme0n1
 
The option -wal_dev only separates the WAL (write-ahead log) from the data device, but not the DB. If you use -db_dev instead, it will move the DB + WAL onto the new device. In the majority of cases this is what you want to do, so small writes will land on the faster device.
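For example, a hedged sketch reusing the device names from earlier in this thread:
Bash:
# Data on the HDD, DB + WAL carved out on the NVMe device
pveceph createosd /dev/sde -db_dev /dev/nvme0n1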

Our tooling usually handles partitions or whole devices and passes them on to ceph-volume underneath (if you are using PVE 6 with Ceph, of course). In PVE 5 the tool ceph-disk was responsible for the OSDs and used plain partitions instead.

EDIT: what version of Ceph do you run?
 
We are on PVE 6 with Ceph Nautilus (14.2.4)

Well, somehow it wasn't clear to me that -db_dev moves DB + WAL to the device. I thought it just moves the DB, while -wal_dev moves just the WAL. The ceph-volume tool complains if you want to use the same device for WAL and DB.

In my question I was really aiming at this: does using a whole device for DB and/or WAL of multiple OSDs work, or does the creation of the second OSD overwrite the DB/WAL of the first OSD?
 
Well, somehow it wasn't clear to me that -db_dev moves DB + WAL to the device. I thought it just moves the DB, while -wal_dev moves just the WAL. The ceph-volume tool complains if you want to use the same device for WAL and DB.
The WAL lives alongside the DB if only the DB is placed on a separate device; these two parts are necessary for a working RocksDB.

In my question I was really aiming at this: does using a whole device for DB and/or WAL of multiple OSDs work, or does the creation of the second OSD overwrite the DB/WAL of the first OSD?
Our tooling, or in turn ceph-volume, will handle the partition/LV creation. In the command just use the full device and a separate partition will be created. You can easily verify this by creating two OSDs with the same -db_dev (e.g. /dev/sdX).
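A hedged sketch of what such a verification could look like (the device names are only examples):
Bash:
# After creating two OSDs with the same -db_dev, each one should reference its own DB partition/LV on the NVMe device
lsblk /dev/nvme0n1
ceph-volume lvm list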

While a little older, this blog entry gives an overview of BlueStore.
https://ceph.com/community/new-luminous-bluestore/
 
Is there any way to change the location of the WAL after OSD creation? Would it be smart to put the WAL for all OSDs onto a single NVMe drive or should these be mirrored?
 
@vRod, you can do that offline with ceph-bluestore-tool, but why not re-create the OSDs? As said above, moving the WAL alone will not gain you much.
 
@Alwin Thanks for clearing that up. I think I will just leave it as it is, since the OSDs themselves are also SSDs.
 
So, just for clarification:
The WAL lives alongside the DB if only the DB is placed on a separate device; these two parts are necessary for a working RocksDB.
[...]
You can easily verify this by creating two OSDs with the same -db_dev (e.g. /dev/sdX).
Right now I cannot use pveceph createosd, because it does not accept the partition I give as an argument; it expects an entire disk. That disk is completely used by the partitions I prepared manually before OSD creation.
If I use ceph-volume lvm create with the --block.db argument, will this put the DB + WAL on the specified device, or just the DB?
I am asking because you cannot use --block.wal <dev> and --block.db <dev> at the same time with the same target WAL/DB device, and this confuses me.
 
If I use ceph-volume lvm create with the --block.db argument, will this put the DB + WAL on the specified device, or just the DB?
This will place the DB + WAL on the device. The size of the partition needs to be 3, 30, or 300 GiB; this is how RocksDB merges its data files.

I am asking because you cannot use --block.wal <dev> and --block.db <dev> at the same time with the same target WAL/DB device, and this confuses me.
That's because of the above; the WAL already lives together with the DB.
 
Thanks very much for clearing this up.
I had a misunderstanding of DB and WAL devices. Since our cluster is a bit below our expectations regarding performance, I will rebuild all OSDs to use our NVMe SSD for the DB too, one by one.

Thanks, this can be considered closed...
 
Update:
I still cannot create OSDs with pveceph osd create. Our NVMe cache disk is 375 GB, but on creation of an OSD pveceph complains that the disk is too small:
Bash:
root@vm-3:~# pveceph osd create /dev/sda -db_dev /dev/nvme0n1
create OSD on /dev/sda (bluestore)
creating block.db on '/dev/nvme0n1'
'/dev/nvme0n1' is smaller than requested size '400022516531' bytes
The OSD disk size is 4 TB per disk.
So anyway, I decided to stick with ceph-volume lvm to create the OSDs.
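For reference, the ceph-volume form with the DB on one of the pre-created NVMe partitions would look roughly like this (a hedged sketch; the partition number is only an example):
Bash:
# Data on the HDD, DB (and with it the WAL) on the NVMe partition
ceph-volume lvm create --bluestore --data /dev/sda --block.db /dev/nvme0n1p5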

Sidenote:
I think this might clear something up for others who are puzzled by the poor IO performance of Ceph with hard disks:
If you have a number of HDDs in your cluster and use NVMe disks to speed it up with caching etc.:
-> ALWAYS put DB AND WAL on the SSD. Never ever put the WAL alone on the SSD. Doing so will greatly decrease performance.

A node with HDDs and just the WAL on SSD looks like this during rebalance:
Bash:
root@vm-1:~# iostat
Linux 5.0.21-2-pve (vm-1)       10/11/2019      _x86_64_        (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.01    0.00    1.65    3.21    0.00   92.13

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme0n1         724.67       503.95      4570.31  663746270 6019460756
sda               3.17        82.51        51.83  108673475   68265121
sdb              36.20      5363.48      1308.66 7064128826 1723611816
sdc              39.29      5015.09      1446.57 6605272811 1905254548
sdi              44.63      5338.25      1763.93 7030900434 2323238196
sdd              43.23      5689.22      1712.40 7493154025 2255365288
sdf              39.17      5035.98      1478.80 6632785114 1947696716
sdg              36.24      4779.53      1498.51 6295022326 1973661100
sdh              39.45      4786.72      1261.61 6304488269 1661636224
While a node with DB AND WAL on SSD looks like this:
Code:
Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme0n1         448.39       708.55      5429.24    1487468   11397664
sda             311.70       311.64     69841.68     654220  146619331
sdb             330.86       308.85     74452.06     648368  156297959
sdc             290.79       305.45     64997.29     641232  136449459
sdd             238.35       303.87     52840.50     637912  110928588
sde             220.87       301.83     48735.68     633644  102311296
sdf             224.75       299.21     49540.29     628136  104000427
sdg             131.18       296.42     27698.22     622284   58147148
sdh              20.19       294.59      1666.33     618432    3498144
sdi              26.76       390.38       946.91     819519    1987865
Notice the great difference in IO performance on the hard disks! Performance increases about fivefold.
I never really thought about it the right way until now, but I am pretty sure this is because every object that is written causes a write to the RocksDB. If this lives on your spinner, it has to move its heads to a completely different place, update a few bytes in the DB, then write the next object... This is awfully SLOW!
Just tell PVE to put your DB on the SSD and DB + WAL will live there very happily, and your spinner will give MUCH better performance.

I will now recreate every single OSD and rearrange them with the DB on the SSD over the weekend...
 
I am glad that it worked.

'/dev/nvme0n1' is smaller than requested size '400022516531' bytes
If no size is specified, pveceph uses 10% of the data disk size for the DB + WAL.
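With 4 TB data disks that default is roughly 400 GB, which no longer fits on the 375 GB NVMe. A hedged workaround sketch, assuming the -db_size option of pveceph osd create (value in GiB):
Bash:
# Request an explicit DB size so that eight OSDs fit on the 375 GB NVMe device
pveceph osd create /dev/sda -db_dev /dev/nvme0n1 -db_size 40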

If you have a number of HDDs in your cluster and use NVMe disks to speed it up with caching etc.
In the OSD context this is not caching, it is merely putting the DB (+ WAL) onto a faster device. For Ceph, caching happens on other levels. ;)
 
Our tooling, or in turn ceph-volume, will handle the partition/LV creation. In the command just use the full device and a separate partition will be created.
Related to this, would you please clarify for me: we have an SSD for the root drive, and I intend to use free, unpartitioned space on that SSD for the DB. Is it safe to specify -db_dev /dev/sda, so that my current root filesystem, swap, etc. partitions will not be overwritten?

We have a GPT partition table on sda, with sda1 through sda4 in use. I'm hoping pveceph will just create an sda5 partition of db_size and use that. If not, I'll just partition the disk myself and set it up with ceph-volume lvm.

Thanks
 
In browsing some of the source, it appears that two cases are handled when examining a db_dev device: a disk with a GPT partition label will get a new partition created for the block.db, and a device with a volume group named 'ceph' will get a new LV created within that group. I don't know offhand whether any partition on the device could hold a 'ceph' VG, or if it needs to be the raw, unpartitioned device with a VG label, but in my case a GPT partition sounds promising.
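A hedged way to check which of the two cases applies before pointing pveceph at the root disk (plain LVM/partition tooling, nothing PVE-specific):
Bash:
# Show the partition table type and the existing partitions on the root disk
parted /dev/sda print
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT /dev/sda

# Check whether a volume group named 'ceph' already exists
vgs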
 
Do not mix the OS disk with Ceph; Ceph will thrash the performance of the disk. Besides the OS possibly grinding to a halt, Ceph won't benefit either.
 
