[SOLVED] Proxmox Ceph implementation seems to be broken (OSD Creation)

RockG

New Member
Dec 5, 2018
Preface:
I have a hybrid Ceph environment using 16 SATA spinners and 2 Intel Optane NVMe PCIe cards (intended for the DB and WAL). Because of enumeration issues on reboot, the NVMe cards can flip their /dev/ names, which triggers a full cluster rebalance. The recommendation from Intel is to create separate DB and WAL partitions on the NVMe, name them in parted, and reference them as:
/dev/disk/by-partlabel/osd-device-0-db
/dev/disk/by-partlabel/osd-device-0-wal
Then add to the ceph.conf :
bluestore block db path = /dev/disk/by-partlabel/osd-device-0-db
bluestore block wal path = /dev/disk/by-partlabel/osd-device-0-wal
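For reference, a minimal parted sketch of what that partition naming looks like (device name and sizes are only examples for a blank NVMe, not my exact layout):
Code:
# Example only: create GPT partitions named for OSD 0 so the
# /dev/disk/by-partlabel/ links survive device re-enumeration.
# WARNING: mklabel wipes the existing partition table on the target device.
parted --script /dev/nvme0n1 mklabel gpt
parted --script /dev/nvme0n1 mkpart osd-device-0-db 1MiB 40GiB
parted --script /dev/nvme0n1 mkpart osd-device-0-wal 40GiB 42GiB
ls -l /dev/disk/by-partlabel/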
Issues with the Proxmox GUI:
When creating an OSD, you can only select a single device for the DB and WAL together; you cannot put the two on separate devices. The only way to size this partition is to set "bluestore_block_db_size" in the ceph.conf file. This is where my confidence is lost. If the DB and WAL share this partition, why is the partition Proxmox creates only as large as "bluestore_block_db_size", and not the sum of "bluestore_block_db_size" and "bluestore_block_wal_size"? My belief is that we are not really using the WAL on the Optane drives. In this configuration method I am also unable to specify a separate "bluestore_block_wal_path" in the ceph.conf file per Intel's recommendation.
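One way to check which sizes an OSD actually picked up would be to ask the running daemon directly (the OSD id below is just an example; run it on the node hosting that OSD):
Code:
# Show the BlueStore sizing options the running OSD was started with.
ceph daemon osd.0 config show | grep bluestore_block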

Issues with the Proxmox CLI:
I have tried to create the OSDs from the command line, but that is broken as well. Using "pveceph createosd", I can specify separate "/dev/disk/by-partlabel/" paths for both the WAL and DB, but it fails creating the OSD and leaves it in a partially created state (no device class set to HDD on the OSD, and in the crush map the GUI shows the OSD as FILESTORE instead of BLUESTORE), which does not work. I have also tried the "ceph-disk prepare" and "ceph-volume lvm prepare" methods, which break the same way as shown below.

Here is a command example that I ran, and the limited log output.

Command:
pveceph createosd /dev/sdr --bluestore --wal_dev /dev/disk/by-partlabel/osd-device-79-wal --journal_dev /dev/disk/by-partlabel/osd-device-79-db

Log:
2018-12-04 11:59:25.836454 7f2168ac1e00 0 set uid:gid to 64045:64045 (ceph:ceph)
2018-12-04 11:59:25.836468 7f2168ac1e00 0 ceph version 12.2.8 (6f01265ca03a6b9d7f3b7f759d8894bb9dbb6840) luminous (stable), process ceph-osd, pid 56236
2018-12-04 11:59:25.839160 7f2168ac1e00 1 bluestore(/var/lib/ceph/tmp/mnt.Zljbva) mkfs path /var/lib/ceph/tmp/mnt.Zljbva
2018-12-04 11:59:25.840085 7f2168ac1e00 -1 bluestore(/var/lib/ceph/tmp/mnt.Zljbva) _setup_block_symlink_or_file failed to create block.wal symlink to /dev/disk/by-partlabel/osd-device-79-wal: (17) File exists
2018-12-04 11:59:25.840098 7f2168ac1e00 -1 bluestore(/var/lib/ceph/tmp/mnt.Zljbva) mkfs failed, (17) File exists
2018-12-04 11:59:25.840100 7f2168ac1e00 -1 OSD::mkfs: ObjectStore::mkfs failed with error (17) File exists
2018-12-04 11:59:25.840159 7f2168ac1e00 -1 ** ERROR: error creating empty object store in /var/lib/ceph/tmp/mnt.Zljbva: (17) File exists


Proxmox Version 5.2-9

Any help would be greatly appreciated, I am fairly new to Ceph so forgive me if I missed anything.
 
  • Like
Reactions: dmulk
@RockG, with our tooling the WAL/DB is placed on the same device, on its own partition, since by default the WAL is 512MB and can reside on the same partition. See our docs for creating an OSD through the CLI.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_ceph_bluestore

AFAIK, the ceph-disk tool doesn't support disk by-partlabel and doesn't need to. Every partition belonging to an OSD gets a PARTUUID, and when the OSD data partition (the 100MB partition) is mounted, Ceph checks which UUID belongs to its OSD and activates it. This way the enumeration change shouldn't matter.
Code:
root@p5:~# ls -lah /var/lib/ceph/osd/ceph-2
total 64K
drwxr-xr-x 2 ceph ceph  310 Dec  6 09:19 .
drwxr-xr-x 5 ceph ceph 4.0K Nov 26 17:33 ..
-rw-r--r-- 1 root root  393 Dec  6 09:19 activate.monmap
-rw-r--r-- 1 ceph ceph    3 Dec  6 09:19 active
lrwxrwxrwx 1 ceph ceph   58 Dec  6 09:19 block -> /dev/disk/by-partuuid/bb133c52-1f9f-44a0-a59a-318b24e5d192
lrwxrwxrwx 1 ceph ceph   58 Dec  6 09:19 block.db -> /dev/disk/by-partuuid/f07a3c18-8241-4b74-ae2e-51d9fcd4e1b6
-rw-r--r-- 1 ceph ceph   37 Dec  6 09:19 block.db_uuid
-rw-r--r-- 1 ceph ceph   37 Dec  6 09:19 block_uuid
lrwxrwxrwx 1 ceph ceph   58 Dec  6 09:19 block.wal -> /dev/disk/by-partuuid/3474bb2b-bf5d-44cb-bbf7-2419f995b599
-rw-r--r-- 1 ceph ceph   37 Dec  6 09:19 block.wal_uuid
-rw-r--r-- 1 ceph ceph    2 Dec  6 09:19 bluefs
-rw-r--r-- 1 ceph ceph   37 Dec  6 09:19 ceph_fsid
-rw-r--r-- 1 ceph ceph   37 Dec  6 09:19 fsid
-rw------- 1 ceph ceph   56 Dec  6 09:19 keyring
-rw-r--r-- 1 ceph ceph    8 Dec  6 09:19 kv_backend
-rw-r--r-- 1 ceph ceph   21 Dec  6 09:19 magic
-rw-r--r-- 1 ceph ceph    4 Dec  6 09:19 mkfs_done
-rw-r--r-- 1 ceph ceph    6 Dec  6 09:19 ready
-rw-r--r-- 1 ceph ceph    0 Dec  6 09:19 systemd
-rw-r--r-- 1 ceph ceph   10 Dec  6 09:19 type
-rw-r--r-- 1 ceph ceph    2 Dec  6 09:19 whoami
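To see that this is independent of the /dev/nvmeXn1 naming, the PARTUUIDs can be listed directly (the partition below is only an example):
Code:
# PARTUUIDs live in the GPT itself, so these links keep resolving even if the
# kernel enumerates the NVMe cards in a different order on the next boot.
blkid -o value -s PARTUUID /dev/nvme0n1p1
ls -l /dev/disk/by-partuuid/ | grep nvme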
Besides the tooling, a possible reason for the changing device names might be that the PCIe slots of both NVMe cards are multiplexed (16x -> 8x8x); this can cause more trouble later on, especially when the system is under heavy load.

Please post the link to the paper recommending the creation by partlabel.
 
  • Like
Reactions: dmulk
Alwin,
Thanks for getting back to me. Using "disk by-partlabel" was an effort to get around the problem that "pveceph createosd" seems broken, which is why I am posting. I just want to be able to create OSDs with separate DB and WAL partitions of my specified sizes on the NVMe, and the proper symlinks in the OSD directory. This is not working. I tried it again; please see the example/results below:

In my ceph.conf, I have the following lines in there to specify the partition sizes that I need:
bluestore_block_db_size = 42949672960
bluestore_block_wal_size = 2147483648
^ This only seems to work for the block_db and not the block_wal
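Just for context, the byte math behind those two values (plain shell arithmetic, nothing cluster-specific):
Code:
echo $((42949672960 / 1024 / 1024 / 1024))   # 40 -> 40 GiB for the DB
echo $((2147483648 / 1024 / 1024 / 1024))    # 2  -> 2 GiB for the WAL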

Here is the command that I ran to create the OSD:
pveceph createosd /dev/sdr -journal_dev /dev/nvme0n1 -wal_dev /dev/nvme0n1

Here is the output:
using device '/dev/nvme0n1' for block.db
Creating new GPT entries.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
Creating new GPT entries.
The operation has completed successfully.
Setting name!
partNum is 0
REALLY setting name!
The operation has completed successfully.
prepare_device: OSD will not be hot-swappable if block.db is not the same device as the osd data
Setting name!
partNum is 7
REALLY setting name!
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.
Setting name!
partNum is 1
REALLY setting name!
The operation has completed successfully.
The operation has completed successfully.
meta-data=/dev/sdr1 isize=2048 agcount=4, agsize=6400 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=0, rmapbt=0, reflink=0
data = bsize=4096 blocks=25600, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=1608, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.
^No sign of creating a WAL partition

Here is my lsblk for the NVMe drive:
nvme0n1 259:1 0 349.3G 0 disk
├─nvme0n1p1 259:3 0 42G 0 part
├─nvme0n1p2 259:5 0 42G 0 part
├─nvme0n1p3 259:6 0 42G 0 part
├─nvme0n1p4 259:8 0 42G 0 part
├─nvme0n1p5 259:10 0 42G 0 part
├─nvme0n1p6 259:13 0 42G 0 part
├─nvme0n1p7 259:14 0 42G 0 part
└─nvme0n1p8 259:17 0 40G 0 part
^Only one partition was created (nvme0n1p8) for the DB, and no WAL. Even if they are sharing the same partition, I specified the size for the WAL in my ceph.conf, so the partition should be 42G (40G DB + 2G WAL). Here is the actual block size, which matches my ceph.conf for the block_db size only:
root@iw-ceph-05:~# blockdev --getsize64 /dev/nvme0n1p8
42949672960
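The GPT partition names tell the same story. Something like this (same NVMe as above) prints the entries with their names; ceph-disk typically names its partitions "ceph block.db" / "ceph block.wal", so a missing WAL entry would stand out:
Code:
# Print the GPT partition table including partition names;
# no "ceph block.wal" entry shows up here.
sgdisk -p /dev/nvme0n1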

Either way, if the WAL and DB are sharing the same partition (even though I specified "-wal_dev /dev/nvme0n1" on the command line), I still have no symlink to the WAL. See below the output of ls -lah /var/lib/ceph/osd/ceph-79:

drwxr-xr-x 2 ceph ceph 271 Dec 6 09:00 .
drwxr-xr-x 18 ceph ceph 18 Dec 4 17:24 ..
-rw-r--r-- 1 root root 438 Dec 6 09:00 activate.monmap
-rw-r--r-- 1 ceph ceph 3 Dec 6 09:00 active
lrwxrwxrwx 1 ceph ceph 58 Dec 6 09:00 block -> /dev/disk/by-partuuid/39179be6-8ae1-4ec8-b98a-2639e1e95c7f
lrwxrwxrwx 1 ceph ceph 58 Dec 6 09:00 block.db -> /dev/disk/by-partuuid/e52d3576-cccb-466b-b252-fcc81970b88b
-rw-r--r-- 1 ceph ceph 37 Dec 6 09:00 block.db_uuid
-rw-r--r-- 1 ceph ceph 37 Dec 6 09:00 block_uuid
-rw-r--r-- 1 ceph ceph 2 Dec 6 09:00 bluefs
-rw-r--r-- 1 ceph ceph 37 Dec 6 09:00 ceph_fsid
-rw-r--r-- 1 ceph ceph 37 Dec 6 09:00 fsid
-rw------- 1 ceph ceph 57 Dec 6 09:00 keyring
-rw-r--r-- 1 ceph ceph 8 Dec 6 09:00 kv_backend
-rw-r--r-- 1 ceph ceph 21 Dec 6 09:00 magic
-rw-r--r-- 1 ceph ceph 4 Dec 6 09:00 mkfs_done
-rw-r--r-- 1 ceph ceph 6 Dec 6 09:00 ready
-rw-r--r-- 1 ceph ceph 0 Dec 6 09:00 systemd
-rw-r--r-- 1 ceph ceph 10 Dec 6 09:00 type
-rw-r--r-- 1 ceph ceph 3 Dec 6 09:00 whoami

Please advise; perhaps something is wrong with my "pveceph createosd" command. How do I put the DB and WAL on /dev/nvme0n1 in separate partitions? The same partition would be fine as long as I was confident it is working, but because there is no symlink upon creation (in /var/lib/ceph/osd/ceph-79), I feel that something is broken.
 
  • Like
Reactions: dmulk
Hi Alwin,
Thanks again for your response. Since I do have two NVMe devices, I was able to work around the issue by splitting the DB and WAL across the two of them. This is not ideal, however, since I'd rather have only half of my OSDs affected if one of the NVMe cards goes down. I have rebuilt 2 of the 5 nodes this way. On Monday, I will try your suggestion of using the ceph-disk prepare method, see exactly where the issue is, and report back.

In response to your question, I have the sizes for the DB and WAL specified in the [global] portion of the ceph.conf.

I was also hoping you could answer a related question regarding the UUIDs:

/dev/disk/by-uuid/ shows only the uuid pointing to the data partition on the block device (example sda1)

/dev/disk/by-partuuid/ shows all partitions for all devices - block-data, block, DB, and WAL (example sda1, sda2, nvme0n1p1, nvme1n1p1)

/dev/disk/by-partlabel/ shows all partitions for all devices - block-data, block, DB, and WAL, like the example above, because I named the partitions in parted.

So does Ceph use /dev/disk/by-partuuid/ exclusively? If you use /dev/disk/by-partlabel/, you can specify the following in the [osd.#] section of the ceph.conf, like so:
bluestore block db path = /dev/disk/by-partlabel/osd-device-$id-db
bluestore block wal path = /dev/disk/by-partlabel/osd-device-$id-wal

So my question is: since ceph-disk only uses the UUID, if the above is specified in the ceph.conf, does /dev/disk/by-partlabel/ take priority? I just want a little more of an insurance policy in case the NVMe cards swap their device names again. (BTW, I will follow up with the hardware manufacturer on the NVMe multiplexing issue - thank you for bringing that to my attention.)
 
Code:
root@p5:~# ls -lah /var/lib/ceph/osd/ceph-2
total 64K
drwxr-xr-x 2 ceph ceph  310 Dec  6 09:19 .
drwxr-xr-x 5 ceph ceph 4.0K Nov 26 17:33 ..
-rw-r--r-- 1 root root  393 Dec  6 09:19 activate.monmap
-rw-r--r-- 1 ceph ceph    3 Dec  6 09:19 active
lrwxrwxrwx 1 ceph ceph   58 Dec  6 09:19 block -> /dev/disk/by-partuuid/bb133c52-1f9f-44a0-a59a-318b24e5d192
lrwxrwxrwx 1 ceph ceph   58 Dec  6 09:19 block.db -> /dev/disk/by-partuuid/f07a3c18-8241-4b74-ae2e-51d9fcd4e1b6
-rw-r--r-- 1 ceph ceph   37 Dec  6 09:19 block.db_uuid
-rw-r--r-- 1 ceph ceph   37 Dec  6 09:19 block_uuid
lrwxrwxrwx 1 ceph ceph   58 Dec  6 09:19 block.wal -> /dev/disk/by-partuuid/3474bb2b-bf5d-44cb-bbf7-2419f995b599
-rw-r--r-- 1 ceph ceph   37 Dec  6 09:19 block.wal_uuid
-rw-r--r-- 1 ceph ceph    2 Dec  6 09:19 bluefs
-rw-r--r-- 1 ceph ceph   37 Dec  6 09:19 ceph_fsid
-rw-r--r-- 1 ceph ceph   37 Dec  6 09:19 fsid
-rw------- 1 ceph ceph   56 Dec  6 09:19 keyring
-rw-r--r-- 1 ceph ceph    8 Dec  6 09:19 kv_backend
-rw-r--r-- 1 ceph ceph   21 Dec  6 09:19 magic
-rw-r--r-- 1 ceph ceph    4 Dec  6 09:19 mkfs_done
-rw-r--r-- 1 ceph ceph    6 Dec  6 09:19 ready
-rw-r--r-- 1 ceph ceph    0 Dec  6 09:19 systemd
-rw-r--r-- 1 ceph ceph   10 Dec  6 09:19 type
-rw-r--r-- 1 ceph ceph    2 Dec  6 09:19 whoami
This is from an OSD I created. The ceph-osd.target mounts the 100MB partition to /tmp/, checks which OSD ID it belongs to, and mounts the partition to its final location. The OSD daemon then starts and sees which partuuid its DB/WAL and data belong to. Labels may not be unique, whereas the UUID is.
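ceph-disk can show that mapping for a whole node, resolved by partuuid rather than by device name or label (output not shown here):
Code:
# List all disks/partitions and which OSD each ceph-disk-prepared
# partition (data, block, block.db, block.wal) belongs to.
ceph-disk list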
 
Hi Alwin,
I have it all figured out now. Both the GUI and the "pveceph createosd" command line are broken (only when the same device is specified for a separate DB and WAL): they will not add another partition for the WAL. The documentation you pointed me to is also wrong; it specifies:
ceph-disk prepare --bluestore <device> --block.wal <wal-device> --block-db <db-device>
when it should be:
ceph-disk prepare --bluestore <device> --block.wal <wal-device> --block.db <db-device>
^^^they put a dash instead of a period^^^

The following command works for me now:
ceph-disk prepare --bluestore /dev/sda --block.db /dev/nvme0n1 --block.wal /dev/nvme0n1
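In my case udev triggered the activation on its own right after prepare; if it doesn't, the manual step would look something like this (the partition number is just an example):
Code:
# Activate the freshly prepared data partition; normally the ceph-disk udev
# rules do this automatically once 'prepare' finishes.
ceph-disk activate /dev/sda1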

Once you have a resolution for your "pveceph createosd" command line wrapper, I'd be happy to re-title this thread with [Solved] in case any others are having the same issues with the same Proxmox/Ceph versions I am running.
 
Alwin,
I really appreciate all of the help. I am up and running now with the partitioning exactly how I need it. You have been extremely helpful!!!! I am going to change the title of this post to add [Solved]. For reference, I have opened up a Bugzilla ticket which can be found here:

https://bugzilla.proxmox.com/show_bug.cgi?id=2031
 
  • Like
Reactions: dmulk
Rock, from reading Alwin's reply on Bugzilla, I see he is saying only the bluestore_block_db_size value is used when both the DB and WAL are on the same device, and that only one partition is created for both, using the size specified. Do you want to specify the WAL separately only to control the enumeration issue per Intel's recommendations, or do you also want to increase the default size of the WAL? I see you are specifying 2 GB, much larger than the default WAL size. Thx
 
Hi Adam,
The enumeration is no longer an issue. The problem is that the default WAL size is too small. With the GUI or the pveceph createosd command, both seem to go with the default size and ignore what is specified in the ceph.conf file. Also, no block.wal -> symlink is created in /var/lib/ceph/osd/ceph-0.

The ceph-disk prepare command works, however, so I am all set up. I just wanted to post this because of my concern about the missing block.wal symlink, and I think it is important to be able to specify the WAL size, especially with varying drive capacities. I was not sure if this was intended or a bug, since the ceph-disk prepare command works as expected.
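For anyone else hitting this, the quick check I now run after creating OSDs is simply (default Proxmox/Ceph paths assumed):
Code:
# Every BlueStore OSD that should have a dedicated WAL must show a block.wal
# symlink pointing at a /dev/disk/by-partuuid/ device.
ls -l /var/lib/ceph/osd/ceph-*/block.wal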
 
Understood, Rock. Here is my take on what's happening. Not sure if this will make it any clearer to Alwin?

From https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_ceph_bluestore

“pveceph createosd /dev/sd[X]

Note In order to select a disk in the GUI, to be more failsafe, the disk needs to have a GPT [7] partition table. You can create this with gdisk /dev/sd(x). If there is no GPT, you cannot select the disk as DB/WAL.

If you want to use a separate DB/WAL device for your OSDs, you can specify it through the -journal_dev option. The WAL is placed with the DB, if not specified separately.”

The man page for pveceph createosd lists both -journal_dev and -wal_dev options.

Rock specified both the db and WAL in his pveceph command and expected both to be created independently.

It would follow that Rock’s pveceph command below should create a partition for both the db and WAL.

pveceph createosd /dev/sdr -journal_dev /dev/nvme0n1 -wal_dev /dev/nvme0n1

One would think the pveceph createosd wrapper would invoke the ceph-disk command below, which does work. It creates two separate partitions on the device, one for the db and one for the WAL.

ceph-disk prepare --bluestore /dev/sda --block.db /dev/nvme0n1 --block.wal /dev/nvme0n1

It appears pveceph createosd does not properly execute the associated ceph-disk command when both the db and WAL are specified. The result is a DB partition which I would assume also contains the WAL, with the default size of 500MB. I'm not sure how one could confirm this.

It also appears that the pveceph "wrapper" only passes the --block.db /dev/nvme0n1 option to ceph-disk. The result is a single partition, presumably with an embedded WAL. The GUI appears to do the same thing, as it only allows the simple option of specifying an alternate storage location for the DB/WAL.
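One hedged way to confirm where the WAL actually ended up (the OSD id is only an example; run this on the node hosting it): the BlueFS perf counters report separate DB/WAL/slow device totals, and a wal_total_bytes of 0 would indicate there is no dedicated WAL device.
Code:
# Dump the BlueFS space accounting of a running OSD; wal_total_bytes stays 0
# when the WAL shares the DB (or data) device instead of its own partition.
ceph daemon osd.79 perf dump bluefs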
 
