Ceph won't start properly after maintenance

ramblurr

New Member
Oct 24, 2020
I have a three-node Proxmox Ceph cluster. I needed to reorganize the hardware in the rack, which meant pulling all of the hardware, including the Ceph/Proxmox nodes.

Before powering off the Proxmox/Ceph nodes I set the norecover, noout, and norebalance flags, then shut the nodes down.
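
For completeness, those flags were set with the usual commands (or the equivalent buttons in the PVE UI), and will need to be unset again once everything is back and healthy:

Code:
# set before shutting the nodes down
ceph osd set noout
ceph osd set norecover
ceph osd set norebalance
# to be unset once the cluster is back up and healthy
ceph osd unset noout
ceph osd unset norecover
ceph osd unset norebalance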

Now, a few days later, the rack rebuild is complete and I've booted the nodes back up. The Proxmox cluster rejoined without issue, but the Ceph cluster is in a strange state:

* All `ceph ...` commands hang
* The Ceph web UI times out
* `systemctl status "ceph*"` shows everything active except the `ceph-volume@lvm-UUID.service` unit on each node
* journalctl for that ceph-volume service shows:

Code:
Nov 28 17:45:48 mill.mgmt.socozy.casa sh[1065]: Running command: /usr/sbin/ceph-volume lvm trigger 5-199f906e-6dae-4fa6-9c2f-2f2e927dafbf

* `ceph-volume lvm activate --all` hangs with:
Code:
root@mill:/var/log/ceph# ceph-volume lvm activate --all
--> Activating OSD ID 2 FSID 8fd557f9-52ed-48fd-9297-ab3ce3372841
Running command: /usr/bin/ceph --cluster ceph --name client.osd-lockbox.8fd557f9-52ed-48fd-9297-ab3ce3372841 --keyring /var/lib/ceph/osd/ceph-2/lockbox.keyring config-key get dm-crypt/osd/8fd557f9-52ed-48fd-9297-ab3ce3372841/luks
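
Since every cluster-facing ceph command hangs, I assume the only thing I can still query is the local monitor over its admin socket (that path shouldn't depend on the cluster network). On this host the mon is named mill, so something like:

Code:
# ask the local mon daemon for its own status, bypassing the network
ceph daemon mon.mill mon_status
# or point at the admin socket directly (default path)
ceph --admin-daemon /var/run/ceph/ceph-mon.mill.asok mon_status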


Here is a sample /etc/pve/ceph.conf from one of my three nodes:

Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 10.9.10.13/23
     fsid = ddc81dc1-c8a4-42b0-8005-ce22ef4d1635
     mon_allow_pool_delete = true
     mon_host = 10.9.10.18 10.9.10.16 10.9.10.13
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 10.9.10.13/23

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.ibnsina]
     public_addr = 10.9.10.16

[mon.mill]
     public_addr = 10.9.10.13

[mon.peirce]
     public_addr = 10.9.10.18
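
Given that config, I suppose the first sanity check is whether each node actually came back up with its 10.9.10.x address and can reach the other two monitor IPs; from mill that would be roughly:

Code:
# confirm the address from public_network/cluster_network is up
ip -br addr | grep 10.9.10
# and that the other two monitor hosts answer on that network
ping -c 3 10.9.10.16
ping -c 3 10.9.10.18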

I'm really not sure how to proceed with troubleshooting this.

PVE: 7.0-11
 
Oh boy, never mind. This was a networking error: the 10G network that the Ceph cluster uses wasn't routing properly. Sorry for the noise.
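
In case anyone lands here with the same symptoms: once the 10G network was routing again, a quick way to confirm the monitors are reachable from each node is something like:

Code:
# mon msgr2 port is 3300 (legacy msgr1 is 6789); repeat from every node
nc -zv 10.9.10.13 3300
nc -zv 10.9.10.16 3300
nc -zv 10.9.10.18 3300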
 