Hello!
Due to an HD crash I was forced to rebuild a server node from scratch, meaning I installed the OS and Proxmox VE (apt install proxmox-ve postfix open-iscsi) fresh on the server.
Then I ran pvecm add 192.168.10.11 -ring0_addr 192.168.10.12 -ring1_addr 192.168.20.12 to add the node to the existing cluster.
This all worked well.
As a next step I installed Ceph (pveceph install) and finally executed pveceph createmon.
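To recap, the full sequence on the rebuilt node was (IPs as above: 192.168.10.11 is an existing cluster member, the .12 addresses are the new node's ring addresses):

# fresh OS, then Proxmox VE packages
apt install proxmox-ve postfix open-iscsi
# join the existing Proxmox cluster
pvecm add 192.168.10.11 -ring0_addr 192.168.10.12 -ring1_addr 192.168.20.12
# install the Ceph packages and recreate the monitor
pveceph install
pveceph createmon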
The Ceph status shows that the monitor on this node (ld4464) is out of quorum:
ceph health detail
HEALTH_WARN noout flag(s) set; 20 osds down; 3 hosts (22 osds) down; Reduced data availability: 1429 pgs inactive; Degraded data redundancy: 7147773/14446416 objects degraded (49.478%), 1444 pgs degraded, 1845 pgs undersized; mon ld4257 is low on available space; 1/3 mons down, quorum ld4257,ld4465
OSDMAP_FLAGS noout flag(s) set
OSD_DOWN 20 osds down
[...]
MON_DOWN 1/3 mons down, quorum ld4257,ld4465
    mon.ld4464 (rank 1) addr 10.97.206.98:6789/0 is down (out of quorum)
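I assume I could inspect the monitor state further from one of the quorate nodes with the standard Ceph CLI, e.g.:

ceph quorum_status --format json-pretty   # who is in quorum, and who is missing
ceph mon dump                             # current monmap, incl. the address registered for mon.ld4464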
Questions:
Does it make sense to continue like this?
Will it be possible to rebuild the cluster?
In my understanding I must fix the issue with the failed monitor service on node ld4464.
How can I do this?
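My assumption is that I would first check the monitor daemon on ld4464 and, if it cannot be repaired, destroy and recreate it. A sketch of what I have in mind (the destroymon/createmon syntax is assumed from the PVE 5.x tooling):

systemctl status ceph-mon@ld4464      # is the mon daemon running at all?
journalctl -u ceph-mon@ld4464 -b      # why did it fail / not join quorum?
pveceph destroymon ld4464             # remove the broken monitor ...
pveceph createmon                     # ... and recreate it from scratch

Is this the right approach?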
THX