When I created my Proxmox cluster, I installed Ceph and did some testing.
We had a few issues with Ceph, so I followed some guides on uninstalling it from Proxmox. We've since recreated our cluster and it's working nicely. However, when I reboot one particular node (pve01), it takes 15-20 minutes for the OSDs to fully start up again. Looking at the actual OSD log, all I have is this:
Code:
2022-03-08T09:31:51.897+0000 7fd87a8f6f00 0 set uid:gid to 64045:64045 (ceph:ceph)
2022-03-08T09:31:51.897+0000 7fd87a8f6f00 0 ceph version 16.2.7 (f9aa029788115b5df5eeee328f584156565ee5b7) pacific (stable), process ceph-osd, pid 4829
2022-03-08T09:31:51.897+0000 7fd87a8f6f00 0 pidfile_write: ignore empty --pid-file
2022-03-08T09:31:52.485+0000 7fd87a8f6f00 0 starting osd.3 osd_data /var/lib/ceph/osd/ceph-3 /var/lib/ceph/osd/ceph-3/journal
2022-03-08T09:31:52.501+0000 7fd87a8f6f00 0 load: jerasure load: lrc load: isa
2022-03-08T09:31:53.165+0000 7fd87a8f6f00 0 osd.3:0.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2022-03-08T09:31:55.477+0000 7fd87a8f6f00 0 bluestore(/var/lib/ceph/osd/ceph-3) _open_db_and_around read-only:0 repair:0
2022-03-08T09:49:21.838+0000 7fd87a8f6f00 0 _get_class not permitted to load sdk
2022-03-08T09:49:21.842+0000 7fd87a8f6f00 0 _get_class not permitted to load lua
2022-03-08T09:49:21.842+0000 7fd87a8f6f00 0 _get_class not permitted to load kvs
2022-03-08T09:49:21.842+0000 7fd87a8f6f00 0 <cls> ./src/cls/hello/cls_hello.cc:316: loading cls_hello
2022-03-08T09:49:21.842+0000 7fd87a8f6f00 0 <cls> ./src/cls/cephfs/cls_cephfs.cc:201: loading cephfs
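In case it helps anyone reproduce this: the same gap should be visible by tailing the OSD's file log or its systemd journal during boot. The path and unit name below assume the default Proxmox/Debian layout, and osd.3 is just the example from above, so adjust for your setup.
Code:
# follow the OSD's file log while the node comes back up
tail -f /var/log/ceph/ceph-osd.3.log

# or look at the unit's journal for the current boot
journalctl -b -u ceph-osd@3.service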
You can see that between 09:31 and 09:49 it was just waiting on something, and I'm not quite sure what. However, digging further into the /var/log/ceph-volume.log file, I see:
Code:
[2022-03-08 09:32:10,581][ceph_volume.process][INFO ] Running command: /usr/sbin/ceph-volume lvm trigger 1-f5f2a63b-540d-4277-ba18-a7db63ce5359
[2022-03-08 09:32:10,592][ceph_volume.process][INFO ] Running command: /usr/sbin/ceph-volume lvm trigger 3-eb671fc9-6db3-444e-b939-ae37ecaa1446
[2022-03-08 09:32:10,825][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.2 with osd_fsid e45faa5d-f0af-45a9-8f6f-dac037d69569
[2022-03-08 09:32:10,837][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.0 with osd_fsid 16d1d2ad-37c1-420a-bc18-ce89ea9654f9
[2022-03-08 09:32:10,844][systemd][WARNING] command returned non-zero exit status: 1
[2022-03-08 09:32:10,844][systemd][WARNING] failed activating OSD, retries left: 25
[2022-03-08 09:32:10,853][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.1 with osd_fsid f5f2a63b-540d-4277-ba18-a7db63ce5359
[2022-03-08 09:32:10,853][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.0 with osd_fsid 59992b5f-806b-4bed-9951-bca0ef4e6f0a
[2022-03-08 09:32:10,855][systemd][WARNING] command returned non-zero exit status: 1
[2022-03-08 09:32:10,855][systemd][WARNING] failed activating OSD, retries left: 25
[2022-03-08 09:32:10,865][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.3 with osd_fsid eb671fc9-6db3-444e-b939-ae37ecaa1446
So from the logs, ceph-volume is trying to activate osd.0, osd.1, osd.2 and osd.3 with FSIDs it can't find. However, our node only has osd.2, osd.3, osd.8 and osd.9 (with different FSIDs), so it looks like there are some leftover assets here.
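My working guess is that these retries come from stale ceph-volume@ systemd units left over from the old cluster (the "lvm trigger" commands above match that id-fsid naming pattern), but I haven't confirmed that's the right place to look. Something like this should show whether any such units still reference the old OSD/FSID pairs:
Code:
# list all ceph-volume activation units systemd knows about
systemctl list-units --all 'ceph-volume@*.service'

# and check for enabled symlinks left behind
ls /etc/systemd/system/multi-user.target.wants/ | grep ceph-volume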
Listing out my volumes with ceph-volume lvm list, I can see that the OSDs with the FSIDs from those errors definitely don't exist:
Code:
root@pve01:/var/lib/ceph/osd# ceph-volume lvm list | grep "osd fsid"
osd fsid 3038f5ae-c579-410b-bb6d-b3590c2834ff
osd fsid b693f0d5-68de-462e-a1a8-fbdc137f4da4
osd fsid 4639ef09-a958-40f9-86ff-608ac651ca58
osd fsid c4531f50-b192-494d-8e47-533fe780bfa3
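To double-check, one way to compare the FSIDs ceph-volume actually knows about against the ones the failed activations were asking for would be something like the following. It assumes the log path quoted above, so treat it as a rough sketch rather than anything definitive:
Code:
# FSIDs present on this node according to ceph-volume
ceph-volume lvm list | grep "osd fsid" | awk '{print $3}' | sort > /tmp/present

# FSIDs the failed activations were asking for
grep -o 'osd_fsid [0-9a-f-]*' /var/log/ceph-volume.log | awk '{print $2}' | sort -u > /tmp/wanted

# FSIDs being asked for that don't exist here
comm -13 /tmp/present /tmp/wanted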
So my question is: how can I tell Ceph to stop looking for these OSDs that no longer exist, and where is this data even being stored?
Thanks!