Hi,
I have a small three node cluster. There are two pools with three OSDs each. Each node hosts one OSD from each pool (one HDD, one SSD). Replication rule is 3/2.
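(By 3/2 I mean size=3 and min_size=2 on both pools, i.e. roughly the equivalent of the following, with <pool-name> standing in for my actual pool names:

  ceph osd pool set <pool-name> size 3
  ceph osd pool set <pool-name> min_size 2
)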
When one of the OSDs in one of the nodes started acting up, I decided not just to replace the OSD but to replace the entire node (for other reasons).
So I set up a new node with identical OSDs and added it to the cluster. Ceph started copying PGs to the new node. After a while I shut down the old node, so that the cluster again had three running nodes. Ceph continued copying (now rebalancing/backfilling, I think) PGs to the new node.
But then it suddenly stopped and now just complains about one node with two OSDs being down. It reports a number of PGs as undersized but doesn't do anything about them.
Any ideas why? Can I force Ceph to continue rebalancing?
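In case it's relevant, these are the kinds of status commands I'm looking at (standard Ceph CLI, output omitted here):

  ceph -s
  ceph health detail
  ceph osd tree
  ceph pg dump_stuck undersized

The last one is where I see the undersized PGs listed.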
Thanks!