Ceph OSD creation fails

Max P

Hi,
I am trying to create an OSD on one of our nodes in our 4 node cluster and I am getting this error:
Code:
command 'ceph-volume lvm create --cluster-fsid e9f42f14-bed0-4839-894b-0ca3e598320e --block.db '' --data /dev/sdi' failed: exit code 1

System state before trying to create the OSD (via the web UI; /dev/sdi is the disk for the new OSD and /dev/nvme0n1 is where the block.db for this OSD should be placed):
Code:
root@pve4:~# lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
......
sdh                    8:112  1   7.3T  0 disk
├─sdh1                 8:113  1   100M  0 part /var/lib/ceph/osd/ceph-44
└─sdh2                 8:114  1   7.3T  0 part
sdi                    8:128  1   7.3T  0 disk
sdj                    8:144  1   7.3T  0 disk
├─sdj1                 8:145  1   100M  0 part /var/lib/ceph/osd/ceph-25
└─sdj2                 8:146  1   7.3T  0 part
....
nvme0n1              259:0    0 260.9G  0 disk
├─nvme0n1p1          259:1    0    20G  0 part
├─nvme0n1p3          259:2    0    20G  0 part
├─nvme0n1p4          259:3    0    20G  0 part
├─nvme0n1p6          259:4    0    20G  0 part
├─nvme0n1p7          259:5    0    20G  0 part
├─nvme0n1p8          259:6    0    20G  0 part
├─nvme0n1p9          259:7    0    20G  0 part
├─nvme0n1p10         259:8    0    20G  0 part
├─nvme0n1p11         259:9    0    20G  0 part
├─nvme0n1p12         259:10   0    20G  0 part
└─nvme0n1p13         259:11   0    20G  0 part
...

root@pve4:~# gdisk -l /dev/nvme0n1
GPT fdisk (gdisk) version 1.0.3

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/nvme0n1: 547002288 sectors, 260.8 GiB
Model: INTEL SSDPED1D280GA
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): 19AFB808-D8FA-4819-B95C-DBF93CD6AECF
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 547002254
Partitions will be aligned on 2048-sector boundaries
Total free space is 79337325 sectors (37.8 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048        41945087   20.0 GiB    FFFF  ceph block.db
   3        83888128       125831167   20.0 GiB    FFFF  ceph block.db
   4       125831168       167774207   20.0 GiB    FFFF  ceph block.db
   6       209717248       251660287   20.0 GiB    FFFF  ceph block.db
   7       251660288       293603327   20.0 GiB    FFFF  ceph block.db
   8       293603328       335546367   20.0 GiB    FFFF  ceph block.db
   9       335546368       377489407   20.0 GiB    FFFF  ceph block.db
  10       377489408       419432447   20.0 GiB    FFFF  ceph block.db
  11       419432448       461375487   20.0 GiB    FFFF  ceph block.db
  12       461375488       503318527   20.0 GiB    FFFF  ceph block.db
  13       503318528       545261567   20.0 GiB    FFFF  ceph block.db
......

Here is the full error message from the web UI (I selected /dev/sdi as the data disk and /dev/nvme0n1 for block.db, and chose 3 GB, since I have since learned that our initial size of 20 GB isn't well suited for RocksDB; I got the same error when using our default size of 20 GB):
Code:
create OSD on /dev/sdi (bluestore)
creating block.db on '/dev/nvme0n1'
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.
Use of uninitialized value $part_or_lv in concatenation (.) or string at /usr/share/perl5/PVE/API2/Ceph/OSD.pm line 465.
using '' for block.db
wipe disk/partition: /dev/sdi
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 0.954768 s, 220 MB/s
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 761d95ef-7526-4c85-b33d-759bea2da16e
Running command: /sbin/vgcreate --force --yes ceph-3823a6f7-bb18-4f78-be2e-30689655458a /dev/sdi
 stdout: Physical volume "/dev/sdi" successfully created.
 stdout: Volume group "ceph-3823a6f7-bb18-4f78-be2e-30689655458a" successfully created
Running command: /sbin/lvcreate --yes -l 1907721 -n osd-block-761d95ef-7526-4c85-b33d-759bea2da16e ceph-3823a6f7-bb18-4f78-be2e-30689655458a
 stdout: Logical volume "osd-block-761d95ef-7526-4c85-b33d-759bea2da16e" created.
--> blkid could not detect a PARTUUID for device:
--> Was unable to complete a new OSD, will rollback changes
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.24 --yes-i-really-mean-it
 stderr: purged osd.24
-->  RuntimeError: unable to use device
TASK ERROR: command 'ceph-volume lvm create --cluster-fsid e9f42f14-bed0-4839-894b-0ca3e598320e --block.db '' --data /dev/sdi' failed: exit code 1

The line |using '' for block.db| looks suspicious, as if a null/empty value is being passed on.
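The task log above also warned that the kernel is still using the old partition table. A quick way to compare the kernel's view with the on-disk GPT (standard util-linux tools; device name as in this thread):
Code:
# what the kernel currently knows about nvme0n1
grep nvme0n1 /proc/partitions

# what the GPT on disk actually contains
partx --show /dev/nvme0n1

# ask the kernel to re-read the table without a reboot
partprobe /dev/nvme0n1    # or: partx -u /dev/nvme0n1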

System state after the failed OSD creation attempt:
Code:
root@pve4:~# lsblk
NAME                                                                                                  MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
....
sdh                                                                                                     8:112  1   7.3T  0 disk
├─sdh1                                                                                                  8:113  1   100M  0 part /var/lib/ceph/osd/ceph-44
└─sdh2                                                                                                  8:114  1   7.3T  0 part
sdi                                                                                                     8:128  1   7.3T  0 disk
└─ceph--3823a6f7--bb18--4f78--be2e--30689655458a-osd--block--761d95ef--7526--4c85--b33d--759bea2da16e 253:6    0   7.3T  0 lvm
sdj                                                                                                     8:144  1   7.3T  0 disk
├─sdj1                                                                                                  8:145  1   100M  0 part /var/lib/ceph/osd/ceph-25
└─sdj2                                                                                                  8:146  1   7.3T  0 part
...
nvme0n1                                                                                               259:0    0 260.9G  0 disk
├─nvme0n1p1                                                                                           259:1    0    20G  0 part
├─nvme0n1p3                                                                                           259:2    0    20G  0 part
├─nvme0n1p4                                                                                           259:3    0    20G  0 part
├─nvme0n1p6                                                                                           259:4    0    20G  0 part
├─nvme0n1p7                                                                                           259:5    0    20G  0 part
├─nvme0n1p8                                                                                           259:6    0    20G  0 part
├─nvme0n1p9                                                                                           259:7    0    20G  0 part
├─nvme0n1p10                                                                                          259:8    0    20G  0 part
├─nvme0n1p11                                                                                          259:9    0    20G  0 part
├─nvme0n1p12                                                                                          259:10   0    20G  0 part
└─nvme0n1p13                                                                                          259:11   0    20G  0 part
...

root@pve4:~# gdisk -l /dev/nvme0n1
GPT fdisk (gdisk) version 1.0.3

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/nvme0n1: 547002288 sectors, 260.8 GiB
Model: INTEL SSDPED1D280GA
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): 19AFB808-D8FA-4819-B95C-DBF93CD6AECF
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 547002254
Partitions will be aligned on 2048-sector boundaries
Total free space is 79337325 sectors (37.8 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048        41945087   20.0 GiB    FFFF  ceph block.db
   3        83888128       125831167   20.0 GiB    FFFF  ceph block.db
   4       125831168       167774207   20.0 GiB    FFFF  ceph block.db
   6       209717248       251660287   20.0 GiB    FFFF  ceph block.db
   7       251660288       293603327   20.0 GiB    FFFF  ceph block.db
   8       293603328       335546367   20.0 GiB    FFFF  ceph block.db
   9       335546368       377489407   20.0 GiB    FFFF  ceph block.db
  10       377489408       419432447   20.0 GiB    FFFF  ceph block.db
  11       419432448       461375487   20.0 GiB    FFFF  ceph block.db
  12       461375488       503318527   20.0 GiB    FFFF  ceph block.db
  13       503318528       545261567   20.0 GiB    FFFF  ceph block.db
  14        41945088        48236543   3.0 GiB     8300

So a new partition (#14) was created on disk and there is enough free space for it (even enough for a 20 GB partition). Notably, though, the lsblk output above does not list nvme0n1p14, so the kernel apparently never picked up the new partition.
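Since ceph-volume complained that blkid could not detect a PARTUUID, it may also be worth checking whether the kernel even exposes a device node for the new partition (p14 is the partition the failed task just created; adjust if yours differs):
Code:
# is there a device node for the new partition at all?
ls -l /dev/nvme0n1p14

# does blkid report a PARTUUID for it?
blkid /dev/nvme0n1p14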

I updated and rebooted this system yesterday, so it should be up to date.
Code:
root@pve4:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.3-6 (running version: 6.3-6/2184247e)
pve-kernel-5.4: 6.3-8
pve-kernel-helper: 6.3-8
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-4.15: 5.4-19
pve-kernel-4.15.18-30-pve: 4.15.18-58
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph: 14.2.19-pve1
ceph-fuse: 14.2.19-pve1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.13-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-9
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-8
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

Any ideas what the reason could be?
 
IIRC we had a similar case a few weeks ago. The TL;DR was that for some reason fetching the partitions of the DB device did not work consistently; the on-disk partition table and what the kernel knows about it were not 100% consistent.

The workaround was to first remove all OSDs on a node, wipe all the OSD and DB/WAL disks (sgdisk -Z /dev/...), and recreate them from scratch. With more recent versions of Ceph, the DB device no longer uses partitions but LVM for the DB/WAL volumes of the different OSDs.

This can take a while, as you should only do it one node at a time and give Ceph time to heal itself between nodes.
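A rough sketch of that per-node cycle (only an illustration; the OSD ID and device names are examples from this thread, and the pveceph options should be double-checked against your PVE version):
Code:
ceph osd out 24                        # let the cluster drain this OSD first
systemctl stop ceph-osd@24.service
pveceph osd destroy 24 --cleanup       # remove the OSD and clean up its volumes
# ...repeat for all OSDs on the node, then:
sgdisk -Z /dev/nvme0n1                 # wipe the DB/WAL disk once no OSD uses it anymore
partprobe /dev/nvme0n1                 # make sure the kernel re-reads the now-empty table
pveceph osd create /dev/sdi --db_dev /dev/nvme0n1   # recreate; recent ceph-volume puts the DB on LVM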
 
I initially also suspected partition table inconsistencies; that's why I rebooted the node, thinking that would fix it since the kernel has to re-read the tables on boot.

The cluster was initially set up with Proxmox 5 (including Ceph) and we upgraded it last year (including Ceph). But that should be a supported upgrade path, right?

We were already thinking of upgrading the SSDs (with the RocksDB sharding sizes in mind). So we could just add the new SSD (requiring only a short downtime), then destroy one OSD at a time and recreate it with the new SSD for DB/WAL. That way only one OSD is down at a time, instead of having the whole node's data redistributed to the other nodes, right?

Or is this better suited for a ticket?
 
We were already thinking of upgrading the SSDs (with the RocksDB sharding sizes in mind). So we could just add the new SSD (requiring only a short downtime), then destroy one OSD at a time and recreate it with the new SSD for DB/WAL. That way only one OSD is down at a time, instead of having the whole node's data redistributed to the other nodes, right?
Sure, sounds even better.

I am not sure if a ticket is needed if you can handle the change yourself :)
 
Would it then be better to first upgrade to the newest Ceph version (I think v15 is also supported on Proxmox right now) and then recreate all OSDs one by one?
I am not sure which changes are in v15, but if there are also default disk-layout changes, this way we wouldn't have to do it all again after the v15 upgrade.
 
If you want to upgrade anyway, then yes. AFAIR there are no major disk layout changes from Nautilus to Octopus. The big one is that the PG autoscaler is enabled by default.
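If you want to inspect the autoscaler before letting it act after the upgrade, something like this works on Nautilus/Octopus (the pool name is just a placeholder):
Code:
ceph osd pool autoscale-status                   # per-pool PG counts and recommendations
ceph osd pool set mypool pg_autoscale_mode off   # opt a single pool out of autoscaling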
 
