Ceph OSD not initializing

Eduardo Taboada

Active Member
Dec 16, 2017
Recently I put new drives into a Proxmox cluster with Ceph. When I create a new OSD, the process hangs and stays in "creating" for a long time. I waited for almost one hour before stopping it.

Then the OSD appears, but it is down and shown as outdated.

Proxmox Version 6.2-11
Ceph Version 14.2.11





Ceph OSD log
create OSD on /dev/sde (bluestore)
wipe disk/partition: /dev/sde
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 0.603464 s, 348 MB/s
Running command: /bin/ceph-authtool --gen-print-key
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 3efe28f5-c30d-475f-b104-a7cd9a76fd0e
Running command: /sbin/vgcreate --force --yes ceph-da7ee397-fb11-4920-bd31-ee01aee9f4b6 /dev/sde
stdout: Physical volume "/dev/sde" successfully created.
stdout: Volume group "ceph-da7ee397-fb11-4920-bd31-ee01aee9f4b6" successfully created
Running command: /sbin/lvcreate --yes -l 953861 -n osd-block-3efe28f5-c30d-475f-b104-a7cd9a76fd0e ceph-da7ee397-fb11-4920-bd31-ee01aee9f4b6
stdout: Logical volume "osd-block-3efe28f5-c30d-475f-b104-a7cd9a76fd0e" created.
Running command: /bin/ceph-authtool --gen-print-key
Running command: /bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-9
--> Executable selinuxenabled not in PATH: /sbin:/bin:/usr/sbin:/usr/bin
Running command: /bin/chown -h ceph:ceph /dev/ceph-da7ee397-fb11-4920-bd31-ee01aee9f4b6/osd-block-3efe28f5-c30d-475f-b104-a7cd9a76fd0e
Running command: /bin/chown -R ceph:ceph /dev/dm-10
Running command: /bin/ln -s /dev/ceph-da7ee397-fb11-4920-bd31-ee01aee9f4b6/osd-block-3efe28f5-c30d-475f-b104-a7cd9a76fd0e /var/lib/ceph/osd/ceph-9/block
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-9/activate.monmap
stderr: 2020-09-21 15:09:31.699 7f07da8cf700 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.bootstrap-osd.keyring: (2) No such file or directory
2020-09-21 15:09:31.699 7f07da8cf700 -1 AuthRegistry(0x7f07d4081d88) no keyring found at /etc/pve/priv/ceph.client.bootstrap-osd.keyring, disabling cephx
stderr: got monmap epoch 3
Running command: /bin/ceph-authtool /var/lib/ceph/osd/ceph-9/keyring --create-keyring --name osd.9 --add-key AQAKpmhf6J70FBAAjymgFRQxOQfcgZQJqCt60A==
stdout: creating /var/lib/ceph/osd/ceph-9/keyring
added entity osd.9 auth(key=
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-9/keyring
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-9/
Running command: /bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 9 --monmap /var/lib/ceph/osd/ceph-9/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-9/ --osd-uuid 3efe28f5-c30d-475f-b104-a7cd9a76fd0e --setuser ceph --setgroup ceph
--> ceph-volume lvm prepare successful for: /dev/sde
TASK ERROR: command 'ceph-volume lvm create --cluster-fsid 2373833d-e7ca-4ddf-8af4-aeeb8c710949 --data /dev/sde' failed: received interrupt
 

The problem is here

Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-13/activate.monmap
stderr: 2020-09-22 09:06:49.013 7faf3d862700 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.bootstrap-osd.keyring: (2) No such file or directory
2020-09-22 09:06:49.013 7faf3d862700 -1 AuthRegistry(0x7faf38081d88) no keyring found at /etc/pve/priv/ceph.client.bootstrap-osd.keyring, disabling cephx
stderr: got monmap epoch 3

ls -la /var/lib/ceph/bootstrap-osd
total 12
drwxr-xr-x 2 ceph ceph 4096 Oct 30 2019 .
drwxr-x--- 14 ceph ceph 4096 Oct 30 2019 ..
-rw-r--r-- 1 root root 71 Oct 30 2019 ceph.keyring

ls -la /etc/pve/priv/
total 4
drwx------ 2 root www-data 0 Oct 30 2019 .
drwxr-xr-x 2 root www-data 0 Jan 1 1970 ..
-rw------- 1 root www-data 1679 Sep 21 18:11 authkey.key
-rw------- 1 root www-data 2365 Dec 9 2019 authorized_keys
drwx------ 2 root www-data 0 Oct 30 2019 ceph
-rw------- 1 root www-data 151 Oct 30 2019 ceph.client.admin.keyring
-rw------- 1 root www-data 228 Oct 30 2019 ceph.mon.keyring
-rw------- 1 root www-data 2364 Nov 1 2019 known_hosts
drwx------ 2 root www-data 0 Oct 30 2019 lock
-rw------- 1 root www-data 3243 Oct 30 2019 pve-root-ca.key
-rw------- 1 root www-data 3 Jul 12 04:01 pve-root-ca.srl

Ceph Config
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 172.30.X.X/24
fsid = 2373833d-e7ca-4ddf-8af4-aeeb8c710949
mon_allow_pool_delete = true
mon_host = 10.0.4.1 10.0.4.2 10.0.4.3
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.X.X.X/28

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring
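
For reference, the [client] keyring line above is presumably why the tools also look for /etc/pve/priv/ceph.client.bootstrap-osd.keyring. Since the very next log line reports "got monmap epoch 3", the warning is most likely harmless, but if you want the file to exist anyway, a minimal sketch (assuming the cluster keyrings are otherwise healthy):

Code:
# assumption: this only satisfies the /etc/pve/priv lookup; the bootstrap key
# itself already lives in /var/lib/ceph/bootstrap-osd/ceph.keyring
ceph auth get client.bootstrap-osd -o /etc/pve/priv/ceph.client.bootstrap-osd.keyring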
 
pveversion -v
Code:
proxmox-ve: 6.2-1 (running kernel: 5.4.60-1-pve)
pve-manager: 6.2-11 (running version: 6.2-11/22fb4983)
pve-kernel-5.4: 6.2-6
pve-kernel-helper: 6.2-6
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-4-pve: 5.0.21-9
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.11-pve1
ceph-fuse: 14.2.11-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-12
pve-cluster: 6.1-8
pve-container: 3.2-1
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-2
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1
 
mon_host = 10.0.4.1 10.0.4.2 10.0.4.3
You may want to XXX these private IPs as well. ;)
In any case, these are private and not routed on the internet.

TASK ERROR: command 'ceph-volume lvm create --cluster-fsid 2373833d-e7ca-4ddf-8af4-aeeb8c710949 --data /dev/sde' failed: received interrupt
The ceph-volume output may be a little bit confusing. I can only see the 'received interrupt' in the output.

You will need to remove the leftovers by hand. The ceph osd tree should list those 4x OSDs without an associated node.
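
For example, a minimal sketch of that manual cleanup, assuming osd.9 is one of the orphaned IDs and holds no data:

Code:
# remove a leftover OSD id from the cluster by hand (assumption: osd.9 is orphaned and empty)
ceph osd crush remove osd.9   # only needed if it still has a CRUSH entry
ceph auth del osd.9           # drop its cephx key
ceph osd rm osd.9             # remove the OSD id itself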
 
You may want to XXX these private IPs as well. ;)
In any case, these are private and not routed on the internet.
Yes, these IPs are not routed.

The ceph-volume output may be a little bit confusing. I can only see the 'received interrupt' in the output.

You will need to remove the leftovers by hand. The ceph osd tree should list those 4x OSDs without an associated node.
These OSDs don't appear in the Ceph OSD tree.
Yes, you can only see 'received interrupt' because I cancelled the process after 2 hours.
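
For what it's worth, a way to see which leftover volumes ceph-volume still knows about on the node (assuming the prepare step finished before the interrupt, as the log above suggests):

Code:
# lists LVM volumes prepared by ceph-volume on this node,
# including OSDs that were never activated
ceph-volume lvm list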
 
I'm trying to restart the process again. I have run:

ceph osd out osd.9
ceph osd safe-to-destroy osd.9
systemctl stop ceph-osd@9.service
pveceph osd destroy 9

And when I try to zap the device, the process runs for at least 1 hour before I kill it.
 
And when I try to zap the device, the process runs for at least 1 hour before I kill it.
How do you zap the device? pveceph can also run a cleanup: pveceph osd destroy 9 --cleanup.
 
Code:
ceph-volume lvm zap /dev/sdf --destroy

Because after pveceph osd destroy 9 the OSD no longer exists:

Bash:
pveceph osd destroy 9 --cleanup
no such OSD '9'

But the folders still exist:

Bash:
drwxrwxrwt  2 ceph ceph  180 Sep 12 09:47 ceph-0
drwxrwxrwt  2 ceph ceph  180 Sep 12 09:47 ceph-1
drwxrwxrwt  2 ceph ceph  300 Sep 22 08:58 ceph-12
drwxrwxrwt  2 ceph ceph  180 Sep 12 09:47 ceph-2
drwxrwxrwt  2 ceph ceph  300 Sep 22 08:58 ceph-9

But there are differences compared to the existing OSDs.
This is ceph-0:

Bash:
ls -la ceph-0/
total 28
drwxrwxrwt 2 ceph ceph  180 Sep 12 09:47 .
drwxr-xr-x 7 ceph ceph 4096 Sep 21 21:16 ..
lrwxrwxrwx 1 ceph ceph   93 Sep 12 09:47 block -> /dev/ceph-3261c290-035f-41c8-9696-96d6fc78e48f/osd-block-d1424ce3-2e2f-44ce-8c1d-19231ad1b716
-rw------- 1 ceph ceph   37 Sep 12 09:47 ceph_fsid
-rw------- 1 ceph ceph   37 Sep 12 09:47 fsid
-rw------- 1 ceph ceph   55 Sep 12 09:47 keyring
-rw------- 1 ceph ceph    6 Sep 12 09:47 ready
-rw------- 1 ceph ceph   10 Sep 12 09:47 type
-rw------- 1 ceph ceph    2 Sep 12 09:47 whoami


And this is ceph-9:
Bash:
ls -la ceph-9
total 52
drwxrwxrwt 2 ceph ceph  300 Sep 22 08:58 .
drwxr-xr-x 7 ceph ceph 4096 Sep 21 21:16 ..
-rw-r--r-- 1 ceph ceph  472 Sep 21 15:09 activate.monmap
lrwxrwxrwx 1 ceph ceph   93 Sep 21 15:09 block -> /dev/ceph-da7ee397-fb11-4920-bd31-ee01aee9f4b6/osd-block-3efe28f5-c30d-475f-b104-a7cd9a76fd0e
-rw------- 1 ceph ceph    2 Sep 21 15:09 bluefs
-rw------- 1 ceph ceph   37 Sep 21 15:09 ceph_fsid
-rw-r--r-- 1 ceph ceph   37 Sep 21 15:09 fsid
-rw------- 1 ceph ceph   56 Sep 22 08:57 keyring
-rw------- 1 ceph ceph    8 Sep 21 15:09 kv_backend
-rw------- 1 ceph ceph   21 Sep 21 15:09 magic
-rw------- 1 ceph ceph    4 Sep 21 15:09 mkfs_done
-rw------- 1 ceph ceph   41 Sep 21 15:09 osd_key
-rw------- 1 ceph ceph    6 Sep 21 15:09 ready
-rw------- 1 ceph ceph   10 Sep 21 15:09 type
-rw------- 1 ceph ceph    2 Sep 21 15:09 whoami
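
Since ceph-9 is only a tmpfs mount left behind by the interrupted create (see the mount -t tmpfs line in the log), one way to clear it by hand, as a sketch and assuming nothing on this node still uses osd.9:

Code:
# assumption: osd.9 was never activated and holds no data
systemctl stop ceph-osd@9.service   # make sure no daemon holds the mount
umount /var/lib/ceph/osd/ceph-9     # the directory itself is a tmpfs mount
rmdir /var/lib/ceph/osd/ceph-9      # remove the now-empty mount point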
 
OK, it works:

Code:
sgdisk -Z /dev/sde
Creating new GPT entries.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.

But I still can't create OSDs on the affected disks.
 

Attachments

  • Captura de pantalla 2020-09-22 a las 15.10.05.png
    Captura de pantalla 2020-09-22 a las 15.10.05.png
    53 KB · Views: 8
  • Captura de pantalla 2020-09-22 a las 15.10.18.png
    Captura de pantalla 2020-09-22 a las 15.10.18.png
    30.6 KB · Views: 9
By hand, run a dd over the first 200 MB and then sgdisk -Z /dev/sdX. It could also need a reboot, as the kernel might not be able to release the device.
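
Something along these lines, as a sketch; replace /dev/sdX with the affected disk and make sure nothing on it is still needed:

Code:
# overwrite the first 200 MiB so old LVM/BlueStore signatures are gone
dd if=/dev/zero of=/dev/sdX bs=1M count=200 conv=fsync
# destroy GPT and MBR data structures
sgdisk -Z /dev/sdX
# if the kernel still holds old device-mapper/partition state, reboot the node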
 
