Recreated Ceph cluster throwing errors

chrispage1

Member
Sep 1, 2021
When I created my Proxmox cluster, I installed Ceph and did some testing.

We had a few issues with Ceph, so I followed some guides on uninstalling it from Proxmox. Now we've recreated our cluster and it's working nicely. However, when I reboot one particular node (pve01), it takes 15-20 minutes for the OSDs to fully start up again. All I can find in the OSD log itself is:

Code:
2022-03-08T09:31:51.897+0000 7fd87a8f6f00  0 set uid:gid to 64045:64045 (ceph:ceph)
2022-03-08T09:31:51.897+0000 7fd87a8f6f00  0 ceph version 16.2.7 (f9aa029788115b5df5eeee328f584156565ee5b7) pacific (stable), process ceph-osd, pid 4829
2022-03-08T09:31:51.897+0000 7fd87a8f6f00  0 pidfile_write: ignore empty --pid-file
2022-03-08T09:31:52.485+0000 7fd87a8f6f00  0 starting osd.3 osd_data /var/lib/ceph/osd/ceph-3 /var/lib/ceph/osd/ceph-3/journal
2022-03-08T09:31:52.501+0000 7fd87a8f6f00  0 load: jerasure load: lrc load: isa
2022-03-08T09:31:53.165+0000 7fd87a8f6f00  0 osd.3:0.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2022-03-08T09:31:55.477+0000 7fd87a8f6f00  0 bluestore(/var/lib/ceph/osd/ceph-3) _open_db_and_around read-only:0 repair:0
2022-03-08T09:49:21.838+0000 7fd87a8f6f00  0 _get_class not permitted to load sdk
2022-03-08T09:49:21.842+0000 7fd87a8f6f00  0 _get_class not permitted to load lua
2022-03-08T09:49:21.842+0000 7fd87a8f6f00  0 _get_class not permitted to load kvs
2022-03-08T09:49:21.842+0000 7fd87a8f6f00  0 <cls> ./src/cls/hello/cls_hello.cc:316: loading cls_hello
2022-03-08T09:49:21.842+0000 7fd87a8f6f00  0 <cls> ./src/cls/cephfs/cls_cephfs.cc:201: loading cephfs
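The stall in a log like the one above can be located mechanically rather than by eye. This is a minimal sketch, assuming every entry starts with a timestamp in the format shown in these lines; `largest_gap` is a hypothetical helper name, not part of Ceph.

```python
from datetime import datetime

def largest_gap(lines):
    """Return (seconds, earlier_line, later_line) for the longest pause
    between consecutive timestamped log entries, or None if fewer than two."""
    stamped = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        try:
            ts = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%S.%f%z")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        stamped.append((ts, line))
    best = None
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, l0, l1)
    return best

# Four entries copied from the OSD log above.
sample = [
    "2022-03-08T09:31:53.165+0000 7fd87a8f6f00  0 osd.3:0.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)",
    "2022-03-08T09:31:55.477+0000 7fd87a8f6f00  0 bluestore(/var/lib/ceph/osd/ceph-3) _open_db_and_around read-only:0 repair:0",
    "2022-03-08T09:49:21.838+0000 7fd87a8f6f00  0 _get_class not permitted to load sdk",
    "2022-03-08T09:49:21.842+0000 7fd87a8f6f00  0 _get_class not permitted to load lua",
]

gap, before, after = largest_gap(sample)
print(f"{gap:.3f} s pause after: {before}")
```

On these lines the longest pause falls right after the `_open_db_and_around` entry, so that is the step worth correlating with other logs on the node.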

You can see that between 09:31 and 09:49 it was just waiting on something, and I'm not sure what. However, digging further into /var/log/ceph-volume.log, I see:

Code:
[2022-03-08 09:32:10,581][ceph_volume.process][INFO  ] Running command: /usr/sbin/ceph-volume lvm trigger 1-f5f2a63b-540d-4277-ba18-a7db63ce5359
[2022-03-08 09:32:10,592][ceph_volume.process][INFO  ] Running command: /usr/sbin/ceph-volume lvm trigger 3-eb671fc9-6db3-444e-b939-ae37ecaa1446
[2022-03-08 09:32:10,825][ceph_volume.process][INFO  ] stderr -->  RuntimeError: could not find osd.2 with osd_fsid e45faa5d-f0af-45a9-8f6f-dac037d69569
[2022-03-08 09:32:10,837][ceph_volume.process][INFO  ] stderr -->  RuntimeError: could not find osd.0 with osd_fsid 16d1d2ad-37c1-420a-bc18-ce89ea9654f9
[2022-03-08 09:32:10,844][systemd][WARNING] command returned non-zero exit status: 1
[2022-03-08 09:32:10,844][systemd][WARNING] failed activating OSD, retries left: 25
[2022-03-08 09:32:10,853][ceph_volume.process][INFO  ] stderr -->  RuntimeError: could not find osd.1 with osd_fsid f5f2a63b-540d-4277-ba18-a7db63ce5359
[2022-03-08 09:32:10,853][ceph_volume.process][INFO  ] stderr -->  RuntimeError: could not find osd.0 with osd_fsid 59992b5f-806b-4bed-9951-bca0ef4e6f0a
[2022-03-08 09:32:10,855][systemd][WARNING] command returned non-zero exit status: 1
[2022-03-08 09:32:10,855][systemd][WARNING] failed activating OSD, retries left: 25
[2022-03-08 09:32:10,865][ceph_volume.process][INFO  ] stderr -->  RuntimeError: could not find osd.3 with osd_fsid eb671fc9-6db3-444e-b939-ae37ecaa1446

So from the logs, ceph-volume is trying to activate osd.0 (under two different FSIDs), osd.1, osd.2 and osd.3. However, our node only has osd.2, osd.3, osd.8 and osd.9, so it looks like there are some leftover assets here.
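To tally exactly which OSD id and FSID pairs the activation attempts fail on, the error lines can be grouped by OSD id. This is a rough sketch that only assumes the "could not find osd.N with osd_fsid ..." wording seen in the log above; `failed_activations` is a made-up helper name.

```python
import re
from collections import defaultdict

# Matches the RuntimeError text shown in /var/log/ceph-volume.log above.
FAILED = re.compile(r"could not find osd\.(\d+) with osd_fsid ([0-9a-f-]{36})")

def failed_activations(log_text):
    """Map each OSD id to the set of FSIDs ceph-volume could not find for it."""
    found = defaultdict(set)
    for match in FAILED.finditer(log_text):
        found[int(match.group(1))].add(match.group(2))
    return dict(found)

# Error lines copied from the log above.
log_text = """\
[2022-03-08 09:32:10,825][ceph_volume.process][INFO  ] stderr -->  RuntimeError: could not find osd.2 with osd_fsid e45faa5d-f0af-45a9-8f6f-dac037d69569
[2022-03-08 09:32:10,837][ceph_volume.process][INFO  ] stderr -->  RuntimeError: could not find osd.0 with osd_fsid 16d1d2ad-37c1-420a-bc18-ce89ea9654f9
[2022-03-08 09:32:10,853][ceph_volume.process][INFO  ] stderr -->  RuntimeError: could not find osd.1 with osd_fsid f5f2a63b-540d-4277-ba18-a7db63ce5359
[2022-03-08 09:32:10,853][ceph_volume.process][INFO  ] stderr -->  RuntimeError: could not find osd.0 with osd_fsid 59992b5f-806b-4bed-9951-bca0ef4e6f0a
[2022-03-08 09:32:10,865][ceph_volume.process][INFO  ] stderr -->  RuntimeError: could not find osd.3 with osd_fsid eb671fc9-6db3-444e-b939-ae37ecaa1446
"""

failures = failed_activations(log_text)
for osd_id in sorted(failures):
    print(osd_id, sorted(failures[osd_id]))
```

Grouping by id makes it easy to spot an id that is being requested under more than one FSID, which would be consistent with leftovers from more than one earlier OSD creation.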

Listing my volumes with ceph-volume lvm list, I can see that OSDs with the given FSIDs definitely don't exist:

Code:
root@pve01:/var/lib/ceph/osd# ceph-volume lvm list | grep "osd fsid"
      osd fsid                  3038f5ae-c579-410b-bb6d-b3590c2834ff
      osd fsid                  b693f0d5-68de-462e-a1a8-fbdc137f4da4
      osd fsid                  4639ef09-a958-40f9-86ff-608ac651ca58
      osd fsid                  c4531f50-b192-494d-8e47-533fe780bfa3
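The comparison above can also be done as a set difference between the FSIDs that ceph-volume is asked to trigger and the FSIDs that actually exist on the node. This is only a sketch: it assumes the "lvm trigger <id>-<fsid>" and "osd fsid <uuid>" text formats seen in the two outputs, and `stale_triggers` is a hypothetical name.

```python
import re

# "<osd id>-<osd fsid>" argument passed to `ceph-volume lvm trigger` in the log.
TRIGGER = re.compile(r"lvm trigger (\d+)-([0-9a-f-]{36})")
# "osd fsid" rows as printed by `ceph-volume lvm list`.
LISTED = re.compile(r"osd fsid\s+([0-9a-f-]{36})")

def stale_triggers(volume_log, lvm_list_output):
    """Return sorted (osd_id, fsid) pairs that are triggered but not present."""
    present = set(LISTED.findall(lvm_list_output))
    return sorted(
        (int(osd_id), fsid)
        for osd_id, fsid in TRIGGER.findall(volume_log)
        if fsid not in present
    )

# Inputs copied from the outputs shown in this post.
volume_log = """\
[2022-03-08 09:32:10,581][ceph_volume.process][INFO  ] Running command: /usr/sbin/ceph-volume lvm trigger 1-f5f2a63b-540d-4277-ba18-a7db63ce5359
[2022-03-08 09:32:10,592][ceph_volume.process][INFO  ] Running command: /usr/sbin/ceph-volume lvm trigger 3-eb671fc9-6db3-444e-b939-ae37ecaa1446
"""
lvm_list_output = """\
      osd fsid                  3038f5ae-c579-410b-bb6d-b3590c2834ff
      osd fsid                  b693f0d5-68de-462e-a1a8-fbdc137f4da4
      osd fsid                  4639ef09-a958-40f9-86ff-608ac651ca58
      osd fsid                  c4531f50-b192-494d-8e47-533fe780bfa3
"""

stale = stale_triggers(volume_log, lvm_list_output)
for osd_id, fsid in stale:
    print(f"osd.{osd_id} {fsid} is triggered but not listed")
```

Anything this reports is an activation request with no matching volume, which is the set that needs tracking down.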

So my question is: how can I tell Ceph to stop looking for these OSDs that no longer exist, and where is this data even stored?

Thanks!
 
