Proxmox Ceph implementation seems to be broken (OSD Creation)

Discussion in 'Proxmox VE: Installation and configuration' started by RockG, Dec 5, 2018.

  1. RockG

    RockG New Member

    Joined:
    Dec 5, 2018
    Messages:
    4
    Likes Received:
    4
    Preface:
    I have a hybrid Ceph environment using 16 SATA spinners and 2 Intel Optane NVMe PCIe cards (intended for DB and WAL). Because of enumeration issues on reboot, the NVMe cards can swap their /dev/ names, which triggers a full cluster rebalance. The recommendation from Intel is to create separate DB and WAL partitions on the NVMe and name them in parted so they can be referenced as:
    /dev/disk/by-partlabel/osd-device-0-db
    /dev/disk/by-partlabel/osd-device-0-wal
    Then add to the ceph.conf :
    bluestore block db path = /dev/disk/by-partlabel/osd-device-0-db
    bluestore block wal path = /dev/disk/by-partlabel/osd-device-0-wal
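    A minimal sketch of how such named partitions could be created with parted (the device, sizes and offsets here are illustrative assumptions, not values from Intel's paper):
    Code:
    # Illustrative only: a 40 GiB DB and a 2 GiB WAL partition for OSD 0 on /dev/nvme0n1.
    # mklabel wipes the existing partition table, so only run this on a fresh device.
    parted -s /dev/nvme0n1 mklabel gpt
    parted -s /dev/nvme0n1 mkpart osd-device-0-db 1MiB 40GiB
    parted -s /dev/nvme0n1 mkpart osd-device-0-wal 40GiB 42GiB
    # The GPT partition names then show up under /dev/disk/by-partlabel/.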
    Issues with the Proxmox GUI:
    When creating an OSD, you can select only one device for the DB and WAL together; you cannot put the two on separate devices. The only way to size this partition is to set "bluestore_block_db_size" in the ceph.conf file. This is where my confidence is lost. If the DB and WAL share this one partition, why is the partition that Proxmox creates only the size specified for "bluestore_block_db_size", and not the sum of "bluestore_block_db_size" and "bluestore_block_wal_size"? My belief is that we are not really using the WAL on the Optane drives. Also, with this configuration method I am unable to specify a separate "bluestore_block_wal_path" in the ceph.conf file per Intel's recommendation.

    Issues with the Proxmox CLI:
    I have tried to create the OSDs from the command line, but that is broken as well. Using "pveceph createosd", I can specify separate "/dev/disk/by-partlabel/" paths for both the WAL and DB, but OSD creation fails and is left in a partially created state (no device class of HDD set on the OSD, and in the crush map the GUI shows the OSD as FILESTORE instead of BLUESTORE), which does not work. I have also tried the "ceph-disk prepare" and "ceph-volume lvm prepare" methods, which break in the same way as shown below.

    Here is a command example that I ran, and the limited log output.

    Command:
    pveceph createosd /dev/sdr --bluestore --wal_dev /dev/disk/by-partlabel/osd-device-79-wal --journal_dev /dev/disk/by-partlabel/osd-device-79-db

    Log:
    2018-12-04 11:59:25.836454 7f2168ac1e00 0 set uid:gid to 64045:64045 (ceph:ceph)
    2018-12-04 11:59:25.836468 7f2168ac1e00 0 ceph version 12.2.8 (6f01265ca03a6b9d7f3b7f759d8894bb9dbb6840) luminous (stable), process ceph-osd, pid 56236
    2018-12-04 11:59:25.839160 7f2168ac1e00 1 bluestore(/var/lib/ceph/tmp/mnt.Zljbva) mkfs path /var/lib/ceph/tmp/mnt.Zljbva
    2018-12-04 11:59:25.840085 7f2168ac1e00 -1 bluestore(/var/lib/ceph/tmp/mnt.Zljbva) _setup_block_symlink_or_file failed to create block.wal symlink to /dev/disk/by-partlabel/osd-device-79-wal: (17) File exists
    2018-12-04 11:59:25.840098 7f2168ac1e00 -1 bluestore(/var/lib/ceph/tmp/mnt.Zljbva) mkfs failed, (17) File exists
    2018-12-04 11:59:25.840100 7f2168ac1e00 -1 OSD::mkfs: ObjectStore::mkfs failed with error (17) File exists
    2018-12-04 11:59:25.840159 7f2168ac1e00 -1 ** ERROR: error creating empty object store in /var/lib/ceph/tmp/mnt.Zljbva: (17) File exists
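
    For reference, a way to inspect how ceph-disk currently classifies the disks and any leftover partitions from earlier attempts (a sketch; this assumes the luminous-era ceph-disk that ships with this PVE release):
    Code:
    # Lists each device/partition and whether it is seen as ceph data, block,
    # block.db or block.wal, which helps spot stale partitions from failed runs.
    ceph-disk list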


    Proxmox Version 5.2-9

    Any help would be greatly appreciated, I am fairly new to Ceph so forgive me if I missed anything.
     
    #1 RockG, Dec 5, 2018
    Last edited: Dec 5, 2018
    dmulk likes this.
  2. dmulk

    dmulk Member

    Joined:
    Jan 24, 2017
    Messages:
    48
    Likes Received:
    2
    Yikes. This looks pretty bad...
     
  3. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    @RockG, with our tooling the WAL/DB is placed on its own partition on the same device. By default the WAL is 512MB and can reside on that partition. See our docs for creating an OSD through the CLI.
    https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_ceph_bluestore

    AFAIK, the ceph-disk tool doesn't support disk by-partlabel and doesn't need to. Every partition belonging to an OSD gets a PARTUUID, and when the OSD data disk (the 100MB partition) is mounted, Ceph checks which UUID belongs to its OSD and activates it. This way the enumeration change shouldn't matter.
    Code:
    root@p5:~# ls -lah /var/lib/ceph/osd/ceph-2
    total 64K
    drwxr-xr-x 2 ceph ceph  310 Dec  6 09:19 .
    drwxr-xr-x 5 ceph ceph 4.0K Nov 26 17:33 ..
    -rw-r--r-- 1 root root  393 Dec  6 09:19 activate.monmap
    -rw-r--r-- 1 ceph ceph    3 Dec  6 09:19 active
    lrwxrwxrwx 1 ceph ceph   58 Dec  6 09:19 block -> /dev/disk/by-partuuid/bb133c52-1f9f-44a0-a59a-318b24e5d192
    lrwxrwxrwx 1 ceph ceph   58 Dec  6 09:19 block.db -> /dev/disk/by-partuuid/f07a3c18-8241-4b74-ae2e-51d9fcd4e1b6
    -rw-r--r-- 1 ceph ceph   37 Dec  6 09:19 block.db_uuid
    -rw-r--r-- 1 ceph ceph   37 Dec  6 09:19 block_uuid
    lrwxrwxrwx 1 ceph ceph   58 Dec  6 09:19 block.wal -> /dev/disk/by-partuuid/3474bb2b-bf5d-44cb-bbf7-2419f995b599
    -rw-r--r-- 1 ceph ceph   37 Dec  6 09:19 block.wal_uuid
    -rw-r--r-- 1 ceph ceph    2 Dec  6 09:19 bluefs
    -rw-r--r-- 1 ceph ceph   37 Dec  6 09:19 ceph_fsid
    -rw-r--r-- 1 ceph ceph   37 Dec  6 09:19 fsid
    -rw------- 1 ceph ceph   56 Dec  6 09:19 keyring
    -rw-r--r-- 1 ceph ceph    8 Dec  6 09:19 kv_backend
    -rw-r--r-- 1 ceph ceph   21 Dec  6 09:19 magic
    -rw-r--r-- 1 ceph ceph    4 Dec  6 09:19 mkfs_done
    -rw-r--r-- 1 ceph ceph    6 Dec  6 09:19 ready
    -rw-r--r-- 1 ceph ceph    0 Dec  6 09:19 systemd
    -rw-r--r-- 1 ceph ceph   10 Dec  6 09:19 type
    -rw-r--r-- 1 ceph ceph    2 Dec  6 09:19 whoami
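    If you want to check which PARTUUID a given partition carries, something like this should do (a sketch; blkid is part of util-linux and present on a standard PVE install, the partition name is just an example):
    Code:
    blkid -s PARTUUID -o value /dev/nvme0n1p1
    ls -l /dev/disk/by-partuuid/ | grep nvme0n1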
    Besides the tooling, a possible reason for the changing device names might be that the PCIe slots of both NVMes are multiplexed (16x -> 8x8x); this can cause more trouble later on, especially when the system is under heavy load.

    Please post the link to the paper recommending the creation by partlabel.
     
    dmulk likes this.
  4. RockG

    RockG New Member

    Joined:
    Dec 5, 2018
    Messages:
    4
    Likes Received:
    4
    Alwin,
    Thanks for getting back to me. Using "disk by-partlabel" was an effort to get around the problem that "pveceph createosd" seems broken, which is why I am posting. I just want to be able to create OSDs with separate DB and WAL partitions of my specified sizes on the NVMe and the proper symlinks in the OSD directory. This is not working. I tried it again; please see the example/results below:

    In my ceph.conf, I have the following lines in there to specify the partition sizes that I need:
    bluestore_block_db_size = 42949672960
    bluestore_block_wal_size = 2147483648
    ^ This only seems to work for the block_db and not the block_wal
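    For clarity, converting those byte values (a quick check using only the two numbers above):
    Code:
    echo $(( 42949672960 / 1024**3 ))                  # 40 -> DB size in GiB
    echo $(( 2147483648 / 1024**3 ))                   # 2  -> WAL size in GiB
    echo $(( (42949672960 + 2147483648) / 1024**3 ))   # 42 -> expected size of a shared DB+WAL partition in GiB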

    Here is the command that I ran to create the OSD:
    pveceph createosd /dev/sdr -journal_dev /dev/nvme0n1 -wal_dev /dev/nvme0n1

    Here is the output:
    using device '/dev/nvme0n1' for block.db
    Creating new GPT entries.
    GPT data structures destroyed! You may now partition the disk using fdisk or
    other utilities.
    Creating new GPT entries.
    The operation has completed successfully.
    Setting name!
    partNum is 0
    REALLY setting name!
    The operation has completed successfully.
    prepare_device: OSD will not be hot-swappable if block.db is not the same device as the osd data
    Setting name!
    partNum is 7
    REALLY setting name!
    Warning: The kernel is still using the old partition table.
    The new table will be used at the next reboot or after you
    run partprobe(8) or kpartx(8)
    The operation has completed successfully.
    Warning: The kernel is still using the old partition table.
    The new table will be used at the next reboot or after you
    run partprobe(8) or kpartx(8)
    The operation has completed successfully.
    Setting name!
    partNum is 1
    REALLY setting name!
    The operation has completed successfully.
    The operation has completed successfully.
    meta-data=/dev/sdr1 isize=2048 agcount=4, agsize=6400 blks
    = sectsz=4096 attr=2, projid32bit=1
    = crc=1 finobt=1, sparse=0, rmapbt=0, reflink=0
    data = bsize=4096 blocks=25600, imaxpct=25
    = sunit=0 swidth=0 blks
    naming =version 2 bsize=4096 ascii-ci=0 ftype=1
    log =internal log bsize=4096 blocks=1608, version=2
    = sectsz=4096 sunit=1 blks, lazy-count=1
    realtime =none extsz=4096 blocks=0, rtextents=0
    Warning: The kernel is still using the old partition table.
    The new table will be used at the next reboot or after you
    run partprobe(8) or kpartx(8)
    The operation has completed successfully.
    ^No sign of a WAL partition being created

    Here is my lsblk output for the NVMe drive:
    nvme0n1 259:1 0 349.3G 0 disk
    ├─nvme0n1p1 259:3 0 42G 0 part
    ├─nvme0n1p2 259:5 0 42G 0 part
    ├─nvme0n1p3 259:6 0 42G 0 part
    ├─nvme0n1p4 259:8 0 42G 0 part
    ├─nvme0n1p5 259:10 0 42G 0 part
    ├─nvme0n1p6 259:13 0 42G 0 part
    ├─nvme0n1p7 259:14 0 42G 0 part
    └─nvme0n1p8 259:17 0 40G 0 part
    ^Only one partition (nvme0n1p8) was created for the DB and none for the WAL. Even if they share the same partition, I specified the WAL size in my ceph.conf, so the partition should be 42G. Here is the actual device size, which matches only the block_db size from my ceph.conf:
    root@iw-ceph-05:~# blockdev --getsize64 /dev/nvme0n1p8
    42949672960

    Either way, even if the WAL and DB are sharing the same partition (despite my specifying "-wal_dev /dev/nvme0n1" on the command line), I still have no symlink to the WAL. See the output of ls -lah /var/lib/ceph/osd/ceph-79 below:

    drwxr-xr-x 2 ceph ceph 271 Dec 6 09:00 .
    drwxr-xr-x 18 ceph ceph 18 Dec 4 17:24 ..
    -rw-r--r-- 1 root root 438 Dec 6 09:00 activate.monmap
    -rw-r--r-- 1 ceph ceph 3 Dec 6 09:00 active
    lrwxrwxrwx 1 ceph ceph 58 Dec 6 09:00 block -> /dev/disk/by-partuuid/39179be6-8ae1-4ec8-b98a-2639e1e95c7f
    lrwxrwxrwx 1 ceph ceph 58 Dec 6 09:00 block.db -> /dev/disk/by-partuuid/e52d3576-cccb-466b-b252-fcc81970b88b
    -rw-r--r-- 1 ceph ceph 37 Dec 6 09:00 block.db_uuid
    -rw-r--r-- 1 ceph ceph 37 Dec 6 09:00 block_uuid
    -rw-r--r-- 1 ceph ceph 2 Dec 6 09:00 bluefs
    -rw-r--r-- 1 ceph ceph 37 Dec 6 09:00 ceph_fsid
    -rw-r--r-- 1 ceph ceph 37 Dec 6 09:00 fsid
    -rw------- 1 ceph ceph 57 Dec 6 09:00 keyring
    -rw-r--r-- 1 ceph ceph 8 Dec 6 09:00 kv_backend
    -rw-r--r-- 1 ceph ceph 21 Dec 6 09:00 magic
    -rw-r--r-- 1 ceph ceph 4 Dec 6 09:00 mkfs_done
    -rw-r--r-- 1 ceph ceph 6 Dec 6 09:00 ready
    -rw-r--r-- 1 ceph ceph 0 Dec 6 09:00 systemd
    -rw-r--r-- 1 ceph ceph 10 Dec 6 09:00 type
    -rw-r--r-- 1 ceph ceph 3 Dec 6 09:00 whoami

    Please advise; perhaps something is wrong with my "pveceph createosd" command. How do I put the DB and WAL on /dev/nvme0n1 in separate partitions? Sharing one partition would be fine as long as I could be confident it is working, but because there is no block.wal symlink after creation (in /var/lib/ceph/osd/ceph-79), I feel that something is broken.
     
    #4 RockG, Dec 6, 2018
    Last edited: Dec 6, 2018
    dmulk likes this.
  5. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    In which section of the ceph.conf did you put these?

    Our code is just a wrapper around ceph-disk, so could you please test the setup with ceph-disk directly? This way we can find out whether ceph-disk or pveceph has the issue. Thanks.
    http://docs.ceph.com/docs/luminous/rados/configuration/bluestore-config-ref/
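
    For example, a direct invocation might look like this (a sketch using your devices from above; please double-check the flags against ceph-disk --help on your version):
    Code:
    ceph-disk prepare --bluestore /dev/sdr --block.db /dev/nvme0n1 --block.wal /dev/nvme0n1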
     
    dmulk likes this.
  6. RockG

    RockG New Member

    Joined:
    Dec 5, 2018
    Messages:
    4
    Likes Received:
    4
    Hi Alwin,
    Thanks again for your response. Since I do have two NVMe devices, I was able to work around the issue by putting the DB on one and the WAL on the other. This is not ideal, however, since I'd rather have only half of my OSDs affected if one of the NVMe devices goes down. I have rebuilt 2 of the 5 nodes this way. On Monday I will try your suggestion of using the ceph-disk prepare method, see exactly where the issue is, and report back.

    In response to your question, I have the sizes for the DB and WAL specified in the [global] portion of the ceph.conf.

    I was also hoping you can answer a related question in regards to the UUID:

    /dev/disk/by-uuid/ shows only the uuid pointing to the data partition on the block device (example sda1)

    /dev/disk/by-partuuid/ shows all partitions for all devices - block-data, block, DB, and WAL (example sda1, sda2, nvme0n1p1, nvme1n1p1)

    /dev/disk/by-partlabel/ shows all partitions for all devices - block-data, block, DB, and WAL, like the example above, because I named the partitions in parted.

    So does Ceph use /dev/disk/by-partuuid/ exclusively? If you use /dev/disk/by-partlabel/, you can set the following in the [osd.#] section of the ceph.conf, like so:
    bluestore block db path = /dev/disk/by-partlabel/osd-device-$id-db
    bluestore block wal path = /dev/disk/by-partlabel/osd-device-$id-wal
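    For a concrete example (hypothetical, using OSD 79 and the naming scheme from above), a per-OSD section would look like:
    Code:
    [osd.79]
    bluestore block db path = /dev/disk/by-partlabel/osd-device-79-db
    bluestore block wal path = /dev/disk/by-partlabel/osd-device-79-wal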

    So my question is: since ceph-disk only uses the UUID, if the above is specified in the ceph.conf, does /dev/disk/by-partlabel/ take priority? I just want a little bit more of an insurance policy in case the NVMe devices swap their device names again. (BTW, I will follow up with the hardware manufacturer on the NVMe multiplexing issue; thank you for bringing that to my attention.)
     
  7. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    Code:
    root@p5:~# ls -lah /var/lib/ceph/osd/ceph-2
    total 64K
    drwxr-xr-x 2 ceph ceph  310 Dec  6 09:19 .
    drwxr-xr-x 5 ceph ceph 4.0K Nov 26 17:33 ..
    -rw-r--r-- 1 root root  393 Dec  6 09:19 activate.monmap
    -rw-r--r-- 1 ceph ceph    3 Dec  6 09:19 active
    lrwxrwxrwx 1 ceph ceph   58 Dec  6 09:19 block -> /dev/disk/by-partuuid/bb133c52-1f9f-44a0-a59a-318b24e5d192
    lrwxrwxrwx 1 ceph ceph   58 Dec  6 09:19 block.db -> /dev/disk/by-partuuid/f07a3c18-8241-4b74-ae2e-51d9fcd4e1b6
    -rw-r--r-- 1 ceph ceph   37 Dec  6 09:19 block.db_uuid
    -rw-r--r-- 1 ceph ceph   37 Dec  6 09:19 block_uuid
    lrwxrwxrwx 1 ceph ceph   58 Dec  6 09:19 block.wal -> /dev/disk/by-partuuid/3474bb2b-bf5d-44cb-bbf7-2419f995b599
    -rw-r--r-- 1 ceph ceph   37 Dec  6 09:19 block.wal_uuid
    -rw-r--r-- 1 ceph ceph    2 Dec  6 09:19 bluefs
    -rw-r--r-- 1 ceph ceph   37 Dec  6 09:19 ceph_fsid
    -rw-r--r-- 1 ceph ceph   37 Dec  6 09:19 fsid
    -rw------- 1 ceph ceph   56 Dec  6 09:19 keyring
    -rw-r--r-- 1 ceph ceph    8 Dec  6 09:19 kv_backend
    -rw-r--r-- 1 ceph ceph   21 Dec  6 09:19 magic
    -rw-r--r-- 1 ceph ceph    4 Dec  6 09:19 mkfs_done
    -rw-r--r-- 1 ceph ceph    6 Dec  6 09:19 ready
    -rw-r--r-- 1 ceph ceph    0 Dec  6 09:19 systemd
    -rw-r--r-- 1 ceph ceph   10 Dec  6 09:19 type
    -rw-r--r-- 1 ceph ceph    2 Dec  6 09:19 whoami
    This is from an OSD I created. The ceph-osd.target mounts the 100MB partition into /tmp/, checks which OSD ID it belongs to, and then mounts the partition at its final location. The OSD daemon then starts and resolves which PARTUUIDs its DB/WAL and data belong to. Labels may not be unique, whereas the UUID is.
     
  8. RockG

    RockG New Member

    Joined:
    Dec 5, 2018
    Messages:
    4
    Likes Received:
    4
    Hi Alwin,
    I have it all figured out now. Both the GUI and the "pveceph createosd" command line are broken (but only when the same device is specified for separate DB and WAL): they will not add another partition for the WAL. The documentation you pointed me to is also wrong. It specifies:
    ceph-disk prepare --bluestore <device> --block.wal <wal-device> --block-db <db-device>
    when it should be:
    ceph-disk prepare --bluestore <device> --block.wal <wal-device> --block.db <db-device>
    ^^^they put a dash instead of a period^^^

    The following command works for me now:
    ceph-disk prepare --bluestore /dev/sda --block.db /dev/nvme0n1 --block.wal /dev/nvme0n1
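
    A quick check that the block.db and block.wal symlinks now actually exist (the OSD id is just whichever one got assigned, 79 in my earlier example):
    Code:
    ls -l /var/lib/ceph/osd/ceph-79/block.db /var/lib/ceph/osd/ceph-79/block.wal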

    Once you have a resolution for your "pveceph createosd" command line wrapper, I'd be happy to re-title this thread with [Solved] in case any others are having the same issues with the same Proxmox/Ceph versions I am running.
     
    #8 RockG, Dec 10, 2018 at 20:59
    Last edited: Dec 10, 2018 at 21:22
    AlexLup and dmulk like this.
  9. dmulk

    dmulk Member

    Joined:
    Jan 24, 2017
    Messages:
    48
    Likes Received:
    2
    Nice work man!
     
    AlexLup likes this.
  10. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,741
    Likes Received:
    151
    dmulk likes this.