Low Ceph performance during a Disaster Recovery simulation

alanspa
Hi,
our scenario is the following:
5 nodes in HA
2 nodes installed at one site
2 nodes installed at a second site
1 node installed at a third site, without Ceph storage

All sites are connected over a 10-gigabit LAN.


If I turn off 2 nodes at the same time in the same place, 50% of the Ceph OSDs are down; the data is still there, but performance is so slow that it is impossible to work. Why is that?

I have attached screenshots of the Ceph configuration and some taken during the shutdown of the two nodes.

Question: would performance have improved and returned to normal once the rebalancing finished?

Thank you
 

Attachments

  • Screenshot 2024-05-30 150617.png
  • Screenshot 2024-05-30 151902.png
  • Screenshot 2024-06-01 122504.png
  • Screenshot 2024-06-01 122859.png
I don't know much about this, but Proxmox does not like remote nodes (due to increased latency). Assuming that is not a problem in your setup, Ceph really needs three working nodes (it limps along and rebuilds in a panic with two, I believe), which is not the case in your disaster scenario. I hope other people here who know the details of all this will correct me.
 
If I turn off 2 nodes at the same time in the same place, 50% of the Ceph OSDs are down,
Yes. The missing information here is: what is your "size/min_size" setting? (Found in Pools --> row <yourpool> --> column "Size/min")
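If you prefer the command line, the same values can be read with the standard Ceph pool commands (<yourpool> is just a placeholder for your pool name):

ceph osd pool get <yourpool> size
ceph osd pool get <yourpool> min_size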
the data is still there, but performance is very slow,
The remaining nodes may begin to shuffle a lot of data around to get re-balanced. No?
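If you want to verify that recovery/backfill is what is eating the performance, the cluster status shows it directly; these are standard commands, nothing specific to this setup:

ceph -s         # look for recovery/backfill lines and degraded object counts
ceph pg stat    # summary of PG states (active+clean vs. degraded/backfilling)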


Disclaimer: I am NOT a Ceph specialist.
 
"Size=3/Min Size=2" means: Ceph is going to write three copies. As long as two copies are available everything is fine.

You had four nodes with OSDs. You turned off two nodes with OSDs.

What if two of the three data blocks written are on the turned-off nodes? That placement group is now read-only! Every VM writing data in this area will stop working immediately. (Depending on the application, different things may happen, from simple "cannot write this data" messages to crashed systems.)
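To see whether any placement groups actually went inactive or undersized during such a test, the standard PG queries should show it:

ceph pg dump_stuck inactive      # PGs that cannot serve I/O at all
ceph pg dump_stuck undersized    # PGs currently running with fewer copies than "size"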

The only way to allow two nodes to fail without quickly getting into real trouble is to set at least "size=4/min_size=2".

The number of nodes allowed to fail is the difference between these two numbers. Your setup is "3 - 2 = 1", meaning only a single node may fail.

Please make sure you understand the implications (e.g. less usable space) before you change that setting. Read about it in the official Ceph documentation. As already said, I am not the specialist here...
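If you do decide to go back to size=4/min_size=2, it is a per-pool setting; a sketch with <yourpool> as a placeholder. Regarding space: usable capacity is roughly raw capacity divided by "size", so four copies instead of three means usable capacity drops by about a quarter compared to size=3.

ceph osd pool set <yourpool> size 4
ceph osd pool set <yourpool> min_size 2
ceph df    # afterwards, shows raw vs. stored usage per pool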

Good luck!
 
Until a few months ago it was set to 4, and indeed the disaster recovery simulation went better.

Now everything makes sense.

I hope we can work out from the documentation how much space will be taken up.

Two questions, even if you are not an expert on Ceph:

- What happens if I fill the entire pool? Apart from freezing all the VMs, is there a way to return to normal without losing data by setting the size back to 3?

- In today's scenario, once rebalancing is complete, do you confirm that performance improves?
 
- What happens if I fill the entire pool? Apart from freezing all the VMs, is there a way to return to normal without losing data by setting the size back to 3?

- In today's scenario, once rebalancing is complete, do you confirm that performance improves?
Try under all circumstances not to fill up the whole pool. It can be really troublesome to get back to "normal". There should be no data loss, since everything goes read-only, but I would not bet on it.
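For what it is worth, Ceph starts refusing writes before the disks are physically full; the thresholds and current usage can be checked with standard commands (defaults are typically nearfull at 85%, backfillfull at 90% and full at 95%, but verify on your own cluster):

ceph df                          # overall and per-pool usage
ceph osd dump | grep -i ratio    # shows full_ratio / backfillfull_ratio / nearfull_ratio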

Probably yes. But I've not tested many corner cases and... I am not an expert ;-)
 
If I turn off 2 nodes at the same time in the same place, 50% of the Ceph OSDs are down; the data is still there, but performance is so slow that it is impossible to work. Why is that?
Before we discuss a remedy, we need to discuss what your intended logical configuration is supposed to be.

Ceph is a hierarchical system that operates on the basis of failure domains. A "failure domain" describes a layer of the system at which you intend to tolerate faults. In your case, I count three separate failure domains: OSD, node, and DC (location). EACH of the failure domains needs to contain a full replica. With a 3:2 rule, 2 replicas must be alive for the system to offer full functionality.

You have 3 sites, but only 2 contain OSDs, which means you are logically compromised from the start. If this is by design, we can move on, but understand that each of your DCs contains only 2 OSD nodes. This means neither can ever satisfy a 3-replica requirement on its own, and you can get into a lot of trouble on a link disruption. This configuration is a time bomb; service loss and possibly data loss are all but guaranteed in your future.
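To see how Ceph currently maps these failure domains, the CRUSH hierarchy and rules can be inspected with the standard commands below (<yourpool> is a placeholder):

ceph osd tree                             # which OSDs sit under which host/site buckets
ceph osd crush rule dump                  # replication rules and their failure domain
ceph osd pool get <yourpool> crush_rule   # which rule the pool actually uses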

Now to address the performance loss. Understand that a Proxmox/Ceph hyperconverged cluster requires a minimum of 3 networks (even if, in your case, they travel over the same link): cluster (Corosync) traffic, Ceph private (cluster) traffic, and Ceph public traffic. You describe your inter-DC connection as 10 Gb, but not how many links there are or what the latency is. Your guest traffic travels over the Ceph public network, which is fighting for latency with the other traffic types; if the Ceph private network is filled with frantic rebalancing across DCs, you can guess what happens to your guest performance.
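For reference, the split between the Ceph public and cluster (private) networks lives in /etc/pve/ceph.conf on a Proxmox node; a minimal sketch with purely illustrative subnets:

[global]
    # network used by clients/guests and the MONs ("public")
    public_network = 10.10.10.0/24
    # dedicated network for OSD replication and recovery traffic ("cluster"/private)
    cluster_network = 10.10.20.0/24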
 
What is your Ceph version?

Ceph 17 (Quincy) has a QoS mechanism (mClock) to prioritize client access vs. repair:


ceph config set global osd_mclock_profile high_client_ops
ceph config set global osd_mclock_profile balanced            # default
ceph config set global osd_mclock_profile high_recovery_ops
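To check which profile is currently active (assuming the OSDs run the default mClock scheduler of Quincy and later), something like:

ceph config get osd osd_mclock_profile        # value from the config database
ceph config show osd.0 osd_mclock_profile     # what a specific running OSD uses (osd.0 as an example)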
 
