Ceph won't start properly after maintenance

ramblurr

New Member
Oct 24, 2020
I have a three-node Proxmox Ceph cluster. I needed to reorganize the hardware in the rack, which meant removing all of it, including the Ceph/Proxmox nodes.

Before I powered off the Proxmox/Ceph nodes, I set the norecover, noout, and norebalance flags. Then I powered off the nodes.
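For reference, those flags are set and later cleared with the stock `ceph osd set` / `ceph osd unset` commands, roughly like this:

Code:
# before powering the nodes off: keep Ceph from reacting to the planned outage
ceph osd set noout
ceph osd set norecover
ceph osd set norebalance

# once everything is back up and healthy, clear the flags again
ceph osd unset noout
ceph osd unset norecover
ceph osd unset norebalance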

Now, a few days later, the rack rebuild is complete and I've booted the nodes back up. The Proxmox cluster came back up without issue, but the Ceph cluster is in a strange state.

* All `ceph ...` commands hang (see the monitor-socket sketch below the log output)
* The Ceph dashboard in the web UI times out
* `systemctl status "ceph*"` shows everything active except `ceph-volume@lvm-UUID.service` on each node
* journalctl for that ceph-volume service shows:

Nov 28 17:45:48 mill.mgmt.socozy.casa sh[1065]: Running command: /usr/sbin/ceph-volume lvm trigger 5-199f906e-6dae-4fa6-9c2f-2f2e927dafbf

* `ceph-volume lvm activate --all` hangs with:
Code:
root@mill:/var/log/ceph# ceph-volume lvm activate --all
--> Activating OSD ID 2 FSID 8fd557f9-52ed-48fd-9297-ab3ce3372841
Running command: /usr/bin/ceph --cluster ceph --name client.osd-lockbox.8fd557f9-52ed-48fd-9297-ab3ce3372841 --keyring /var/lib/ceph/osd/ceph-2/lockbox.keyring config-key get dm-crypt/osd/8fd557f9-52ed-48fd-9297-ab3ce3372841/luks
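Since every `ceph ...` command hangs, one way to see what the monitors themselves think is going on is to query them locally over their admin sockets, which works even without quorum. A rough sketch, assuming the mon ID matches the short hostname (the Proxmox default):

Code:
# ask the local monitor daemon directly, bypassing the normal cluster connection
ceph daemon mon.$(hostname -s) mon_status

# or address the admin socket explicitly and check quorum state
ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname -s).asok quorum_status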


Here is a sample /etc/pve/ceph.conf from one of my three nodes:

Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 10.9.10.13/23
     fsid = ddc81dc1-c8a4-42b0-8005-ce22ef4d1635
     mon_allow_pool_delete = true
     mon_host = 10.9.10.18 10.9.10.16 10.9.10.13
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 10.9.10.13/23

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.ibnsina]
     public_addr = 10.9.10.16

[mon.mill]
     public_addr = 10.9.10.13

[mon.peirce]
     public_addr = 10.9.10.18
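One thing worth cross-checking against this config is that each node really has its mon/public address up on the expected (10G) interface and a route to the other monitors, e.g.:

Code:
# confirm the 10.9.10.x address is actually configured on this node
ip -br addr | grep 10.9.10

# confirm which interface traffic to another monitor would leave through
ip route get 10.9.10.16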

I'm really not sure how to proceed with troubleshooting this.

PVE: 7.0-11
 
Oh boy, never mind. This was a networking error: the 10G network that the Ceph cluster uses wasn't routing properly. Sorry for the noise.
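For anyone who lands here with the same symptoms: before digging into Ceph itself, it's worth confirming the nodes can actually reach each other's monitors over the public/cluster network. A minimal sanity check against the mon addresses from the ceph.conf above, assuming the default monitor ports:

Code:
# basic reachability between nodes on the Ceph network
ping -c 3 10.9.10.16
ping -c 3 10.9.10.18

# check that the monitor ports answer (msgr2 on 3300, legacy msgr1 on 6789)
nc -zv 10.9.10.16 3300
nc -zv 10.9.10.16 6789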
 