Ceph won't start properly after maintenance

ramblurr

New Member
Oct 24, 2020
I have a three-node Proxmox Ceph cluster. I needed to reorganize the hardware in the rack, which meant pulling all of the hardware, including the Ceph/Proxmox nodes.

Before powering off the Proxmox/Ceph nodes I set the norecover, noout, and norebalance flags, then shut the nodes down.
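
For completeness, those flags were set with the usual commands (or the equivalent buttons in the PVE UI), and will need to be unset again once everything is back and healthy:

Code:
# set before shutting the nodes down
ceph osd set noout
ceph osd set norecover
ceph osd set norebalance
# to be unset once the cluster is back up and healthy
ceph osd unset noout
ceph osd unset norecover
ceph osd unset norebalance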

Now, a few days later, the rack rebuild is complete and I've booted the nodes back up. The Proxmox cluster rejoined without issue, but the Ceph cluster is in a strange state:

* All `ceph ...` commands hang
* The Ceph web UI times out
* `systemctl status "ceph*"` shows everything active except the `ceph-volume@lvm-UUID.service` unit on each node
* journalctl for that ceph-volume service shows:

Code:
Nov 28 17:45:48 mill.mgmt.socozy.casa sh[1065]: Running command: /usr/sbin/ceph-volume lvm trigger 5-199f906e-6dae-4fa6-9c2f-2f2e927dafbf

* `ceph-volume lvm activate --all` hangs with:
Code:
root@mill:/var/log/ceph# ceph-volume lvm activate --all
--> Activating OSD ID 2 FSID 8fd557f9-52ed-48fd-9297-ab3ce3372841
Running command: /usr/bin/ceph --cluster ceph --name client.osd-lockbox.8fd557f9-52ed-48fd-9297-ab3ce3372841 --keyring /var/lib/ceph/osd/ceph-2/lockbox.keyring config-key get dm-crypt/osd/8fd557f9-52ed-48fd-9297-ab3ce3372841/luks
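
Since every cluster-facing ceph command hangs, I assume the only thing I can still query is the local monitor over its admin socket (that path shouldn't depend on the cluster network). On this host the mon is named mill, so something like:

Code:
# ask the local mon daemon for its own status, bypassing the network
ceph daemon mon.mill mon_status
# or point at the admin socket directly (default path)
ceph --admin-daemon /var/run/ceph/ceph-mon.mill.asok mon_status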


Here is a sample /etc/pve/ceph.conf from one of my three nodes:

Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 10.9.10.13/23
     fsid = ddc81dc1-c8a4-42b0-8005-ce22ef4d1635
     mon_allow_pool_delete = true
     mon_host = 10.9.10.18 10.9.10.16 10.9.10.13
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 10.9.10.13/23

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.ibnsina]
     public_addr = 10.9.10.16

[mon.mill]
     public_addr = 10.9.10.13

[mon.peirce]
     public_addr = 10.9.10.18
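
Given that config, I suppose the first sanity check is whether each node actually came back up with its 10.9.10.x address and can reach the other two monitor IPs; from mill that would be roughly:

Code:
# confirm the address from public_network/cluster_network is up
ip -br addr | grep 10.9.10
# and that the other two monitor hosts answer on that network
ping -c 3 10.9.10.16
ping -c 3 10.9.10.18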

I'm really not sure how to proceed with troubleshooting this.

PVE: 7.0-11
 
Oh boy, never mind. This was a networking error: the 10G network that the Ceph cluster uses wasn't routing properly. Sorry for the noise.
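
In case anyone lands here with the same symptoms: once the 10G network was routing again, a quick way to confirm the monitors are reachable from each node is something like:

Code:
# mon msgr2 port is 3300 (legacy msgr1 is 6789); repeat from every node
nc -zv 10.9.10.13 3300
nc -zv 10.9.10.16 3300
nc -zv 10.9.10.18 3300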
 