Scheduled downtime of large CEPH node

kliment.toshkov · Jun 15, 2021

Hello,

We operate a cluster of 10+ nodes, one of which serves as a SAN running CEPH. It holds 20+ disks (OSDs), with one monitor and one manager installed on the SAN.
The remaining nodes are data nodes in 4-bay chassis, each with 2 disks installed in ZFS RAID1 mode.

We have scheduled maintenance + updates for the SAN node, with planned downtime of at least 2 hours.

I need a plan that keeps services on all nodes uninterrupted during the SAN downtime. Right now I see two options:
(1) Install a temporary external NFS server, import it into the cluster and move all drives on the nodes from CEPH to NFS, perform the maintenance, and then move the drives back from NFS to CEPH;
(2) Install temporary drives in the two free bays on the data nodes, install 2 additional monitors and 1 additional manager on selected nodes, then import the disks as OSDs, migrate all the data to them, remove the OSDs on the SAN and then stop it. Perform the maintenance, then return all OSDs to the SAN and remove the OSDs from the data nodes.

Option (2) looks complicated, so I need a very detailed step-by-step plan for it. I am not very familiar with CEPH or the noout option, although my research so far suggests I may need it. I am also not experienced at moving OSDs out of the cluster: do I need to remove them, or only set them as "out" for some time?

I am kindly asking for advice: is there a third option that I have not figured out yet? Opinions on (1) vs (2), and a detailed step-by-step guide on how to perform (2), would be a great help to me.

We intend to keep the existing structure with 1 SAN after we complete the maintenance.

Thank you!

jasonsansone · Jun 16, 2021

I am a little confused. Is Ceph only on one node, and not all 10? The entire purpose of Ceph is that it is a high availability, clustered storage system designed to be run on multiple nodes. The recommendation for a production cluster is a minimum of three nodes. You are saying you have Ceph on a single node and need to take the only one out of service for maintenance?

kliment.toshkov · Jun 16, 2021

Yes to all questions and comments. This is the chosen and approved design, and we have to work with it as it exists. Do you have experience with adding and removing OSDs, using "noout" and triggering a rebalance?

jasonsansone · Jun 16, 2021

Yes. If you are using separate DB/WAL drives, you can't move the OSDs around. With 20 OSDs concentrated on one node, I don't think that 2 OSDs in each of the 9 other nodes will provide enough capacity under a 3x replicated CRUSH map; it depends on OSD size. You are going to end up shuffling around a lot of data with either plan, so expect a strain on disk IOPS and network. You haven't received any other responses (and definitely no good answers) because there isn't one. Your "chosen and approved design" is inherently flawed. Instead of continuing to work around a problem, I would work hard to sell management on fixing the storage architecture. This won't be the last time a SPOF issue arises with your "SAN".
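
The capacity concern above can be checked with rough arithmetic before buying disks. The following is a minimal sketch, not a sizing tool: the 4 TB disk size and the stored-data figures are hypothetical (the thread never states them), and the 0.85 fill ratio is an assumption chosen to stay under Ceph's usual nearfull warning level.

```python
def usable_capacity_tb(osd_count, osd_size_tb, replicas=3, fill_ratio=0.85):
    """Rough usable capacity of a replicated pool, in TB.

    Raw capacity is divided by the replica count, after reserving headroom
    (fill_ratio) so the OSDs do not run into nearfull warnings.
    """
    raw_tb = osd_count * osd_size_tb
    return raw_tb * fill_ratio / replicas


def temporary_osds_suffice(stored_tb, nodes, osds_per_node, osd_size_tb,
                           replicas=3, fill_ratio=0.85):
    """True if the temporary OSDs can hold stored_tb of client data."""
    capacity = usable_capacity_tb(nodes * osds_per_node, osd_size_tb,
                                  replicas, fill_ratio)
    return capacity >= stored_tb


if __name__ == "__main__":
    # Hypothetical figures: 9 data nodes, 2 temporary OSDs each, 4 TB disks.
    print(round(usable_capacity_tb(9 * 2, 4.0), 1), "TB usable")
    print(temporary_osds_suffice(25.0, 9, 2, 4.0))  # 25 TB of client data
    print(temporary_osds_suffice(15.0, 9, 2, 4.0))  # 15 TB of client data
```

With these assumed numbers, 18 OSDs of 4 TB give 72 TB raw but only about 20.4 TB usable at 3x replication, so 25 TB of data would not fit while 15 TB would.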

kliment.toshkov · Jun 16, 2021

Thanks for the opinion, but this design was chosen deliberately, for reasons that are good enough for us. Other than that I agree with the SPOF concern. Data transfer is not an issue, especially since we are not in a hurry and can take out one OSD at a time. Let's focus on the topic if you wish to contribute to the current question. Thank you

kliment.toshkov · Jun 16, 2021

Re-reading your answer, I believe something was not very clear. By adding and removing OSD-s I mean new additional drives. Not literally extracting one drive from chassisX and inserting it into chassisY and so forth.

grin · Jun 16, 2021

Indeed, there is no real good answer, apart from the generic "copy your whole storage to another node by your preferred means and connect it to the cluster and migrate your nodes over".

As for the ceph part: adding new OSDs to a ceph cluster should be relatively painless. You can create new OSDs wherever you please, including on your proxmox nodes, and add them to the cluster via your MONs. If you provide at least the required amount of surplus OSDs, the data starts migrating over as soon as they join, and the migration can then be completed by slowly stopping the original OSDs, so in the end you will have OSDs all over your nodes.
If you have created a few MONs (preferably at least 3, and an odd number), plus mgrs and whatever else you use, on your nodes and added them to the cluster, then once the OSD migration is finished you can simply remove everything on the storage node from the cluster and it will keep working with zero downtime.

In the meantime there will be excessive I/O and data migrations, multiple times.

As for step-by-step:

Buy the disks and put them into your nodes.
Ensure that all network subnets are reachable wherever they have to be; this depends on whether you have separate cluster/public networks or not. In particular, the OSDs must see the MONs and the clients must see the MONs too.
Install 2 new MONs and other redundant daemons (mgr, rgw, ...) on your nodes, and verify that they work well (ceph -s).
Create a few new OSDs on your nodes and verify that they are in the ceph cluster and working.

If something doesn't work, your system will not suffer at this point. You keep trying until your new MON and OSD work.

You can add new OSDs and you can also, at will, remove old OSDs from the storage node.
Always let the cluster heal before you remove more than the redundancy allows, or you will kill ceph. If you do go too far, put back the last few OSDs you removed; do not zap or delete them or you'll be toast. Preferably do not remove anything until all the new OSDs are online and well.
Remove old OSDs one by one and watch the cluster heal. You could remove them all at once, but you would risk lost PGs.
When all old OSDs are gone and the cluster is healthy, remove the old MON and the other daemons from the cluster. It should be in WARNING state until you install another MON to provide redundancy (3 or 5 instances).
Congratulations, you have removed your storage node. Now you have a really redundant and fault-tolerant cluster until you convert it back to a non-redundant, non-tolerant one again, doing the same process backwards, while nobody in the sane world could understand why you do it.
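
The "one by one, let it heal" part of the steps above can be sketched as a small loop. This is an illustration under assumptions, not a tested migration script: the helper names `ceph` and `drain_osds` and the polling interval are made up for the example, and it relies on `ceph osd out` and `ceph osd safe-to-destroy` being available in your Ceph release (the latter is expected to return a non-zero exit code while data still depends on the OSD).

```python
import subprocess
import time


def ceph(*args):
    """Run a ceph CLI command and return its exit code."""
    return subprocess.run(["ceph", *args], check=False).returncode


def drain_osds(osd_ids, run=ceph, sleep=time.sleep, poll_seconds=60):
    """Mark OSDs out one at a time, waiting for each to drain before the next.

    run and sleep are injectable so the loop can be rehearsed without a cluster.
    """
    drained = []
    for osd_id in osd_ids:
        # "out" keeps the daemon running but tells CRUSH to move its data away.
        if run("osd", "out", str(osd_id)) != 0:
            raise RuntimeError(f"could not mark osd.{osd_id} out")
        # Poll until ceph reports that no placement group still needs this OSD.
        while run("osd", "safe-to-destroy", f"osd.{osd_id}") != 0:
            sleep(poll_seconds)
        drained.append(osd_id)
    return drained
```

Calling `drain_osds([0, 1, 2])` would work through the three OSDs in order and never touch the next one while the previous one still holds data. Rehearsing it on a throwaway test cluster first is strongly advisable.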

Do not use noout or norebalance. You definitely need rebalancing all the time, and you need OSDs to get out of the cluster when they're gone.

Sidenote: after such moves there will be a fair amount of cruft in your CRUSH map and other tables, so in the end it may need some manual cleaning. It isn't required; it just prevents a lot of complaints in the system logs.

Disclaimers: No expressed or implied guarantee. If you break it you own both parts. Slippery when wet. Beware of the dog.

kliment.toshkov · Jun 16, 2021

Thank you for the detailed answer. My general plan is exactly the same.
Happy to say that your plan matches my outline 1:1, plus all the valuable advice on not using any additional options, and more details.
That was something I was unclear about, and I felt it was important.

In regard to removing OSDs, I am not in a hurry. I will take one step at a time and wait for the cluster to heal before taking another one out.
I need +2 MONs and +1 MGR on additional nodes, right? Extra drives are already present. Network between all nodes is accessible and backed up with redundancy.

I intend to test the whole concept on a small cluster with 4 virtual machines, just to make myself familiar with all steps.

Right now we have a slight adjustment in the CEPH config:

Code:

rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type osd
    step emit
}

I assume that we should return "chooseleaf" to factory value?
How do I remove OSD? Choose OSD, then "STOP", then "OUT" or in another sequence?

Thank you for your time, I really appreciate it.

grin · Jun 16, 2021

kliment.toshkov said:
I need +2 MONs and +1 MGR on additional nodes, right?

Generally you need at least 3, and an odd number, of MONs to provide a reliable majority quorum at all times; other daemons are not picky, and usually one spare will do fine. In any case I advise you to read the ceph documentation on the redundancy recommendations for each daemon, but usually one plus a spare is fine.
At any time you can consult ceph health or ceph status to see whether ceph is happy with your config. It's really chatty about its feelings.
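
The reason for "at least 3 and odd" follows from majority arithmetic, which this short sketch spells out. The function names are made up for the illustration; the only assumption is that the monitors need a strict majority to form a quorum.

```python
def quorum_size(mon_count):
    """Smallest number of monitors that forms a strict majority."""
    return mon_count // 2 + 1


def tolerated_mon_failures(mon_count):
    """How many monitors can fail while a majority is still reachable."""
    return mon_count - quorum_size(mon_count)


if __name__ == "__main__":
    for mons in range(1, 6):
        print(mons, "MONs: quorum", quorum_size(mons),
              "tolerates", tolerated_mon_failures(mons), "failure(s)")
```

One or two monitors tolerate no failure at all, three tolerate one, and a fourth adds nothing over three, which is why an even count is usually not worth the extra daemon.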

kliment.toshkov said:
I intend to test the whole concept on a small cluster with 4 virtual machines, just to make myself familiar with all steps.

Good idea.

kliment.toshkov said:
I assume that we should return "chooseleaf" to factory value?

I only learn CRUSH when I need it and after that it fades, so I will not comment on that; my guess is that the default is meant for multi-node clusters, so it should indeed be better than the single-node-shaped rule you use. Try not to keep min_size below 2 though, it's not really safe.
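
For reference, the stock replicated rule uses `type host` where the rule quoted above uses `type osd`, so replicas land on different hosts rather than merely on different disks. The sketch below shows that one-word change on a decompiled CRUSH map text; the helper name `set_failure_domain` is invented for the example, and it assumes the usual round trip of `ceph osd getcrushmap -o`, `crushtool -d`, edit, `crushtool -c`, `ceph osd setcrushmap -i`. It also assumes the switch is made only once OSDs exist on enough hosts to place every replica.

```python
import re

# The rule as posted in the thread (failure domain: individual OSDs).
RULE = """rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type osd
    step emit
}"""


def set_failure_domain(crush_text, rule_name, bucket_type):
    """Return crush_text with the chooseleaf bucket type of one rule replaced."""
    out, in_rule = [], False
    for line in crush_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("rule "):
            in_rule = stripped.split()[1] == rule_name
        elif stripped == "}":
            in_rule = False
        elif in_rule and stripped.startswith("step chooseleaf"):
            # Only the trailing "type <bucket>" of the chooseleaf step changes.
            line = re.sub(r"type \S+\s*$", f"type {bucket_type}", line)
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    print(set_failure_domain(RULE, "replicated_rule", "host"))
```

Applying a changed rule triggers data movement, so it belongs in the same "wait for the cluster to heal" discipline as the OSD changes.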

kliment.toshkov said:
How do I remove OSD? Choose OSD, then "STOP", then "OUT" or in another sequence?

In the end you can be polite and pull them (mark them out, then stop them) or be tough and simply stop them and let the system kick them out. I usually try the polite way. I don't readily remember whether this also removes them from CRUSH, so you may need to remove them from there manually when you're finished with everything.
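
The "polite way" for a single OSD can be written down as an ordered command list. This is a sketch under assumptions rather than a confirmed procedure for this cluster: it assumes systemd-managed OSDs with `ceph-osd@<id>` units and a Ceph release that has `ceph osd purge`. As far as I know, marking an OSD out or stopping it does not remove its CRUSH entry, which is what the final purge step takes care of. The function name `osd_removal_commands` is made up for the example, and nothing is executed.

```python
def osd_removal_commands(osd_id, purge=True):
    """Return the commands for removing one OSD politely, in order."""
    commands = [
        # 1. Mark it out so its data migrates away while it keeps serving.
        ["ceph", "osd", "out", str(osd_id)],
        # 2. Repeat this check until it succeeds before going any further.
        ["ceph", "osd", "safe-to-destroy", f"osd.{osd_id}"],
        # 3. Stop the daemon on the node that hosts it.
        ["systemctl", "stop", f"ceph-osd@{osd_id}"],
    ]
    if purge:
        # 4. Drop the OSD from the CRUSH map, auth keys and OSD map in one go.
        commands.append(
            ["ceph", "osd", "purge", str(osd_id), "--yes-i-really-mean-it"]
        )
    return commands


if __name__ == "__main__":
    for command in osd_removal_commands(7):
        print(" ".join(command))
```

Passing `purge=False` leaves the OSD defined in the cluster, which suits a disk that will be brought back after the maintenance.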

g

ps: Netfinity? ;-)

kliment.toshkov · Jun 16, 2021

I've read the manual many times, but it is easy to miss some minor details when doing anything for the first time. That's why I am trying to plan ahead and also to collect advice from the community. Thank you all!

spirit · Jun 17, 2021

Also, if you install new monitors, you need to stop/start the VMs or live migrate them so that they use the new monitors.

About the "out" state of an OSD: it works fine. Just add the new disks and set the current OSDs as "out" (but still "up"). That way they still serve data to clients while, at the same time, migrating their data to the new OSDs.
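
To see which OSDs are in that "out but still up" state, the JSON form of the OSD map can be filtered. A minimal sketch, assuming the output of `ceph osd dump --format json` contains an `osds` list whose entries carry `osd`, `up` and `in` fields with 0/1 values; the sample data below is invented for the illustration.

```python
import json


def draining_osds(osd_dump_json):
    """Return ids of OSDs that are up (serving) but out (data moving away)."""
    dump = json.loads(osd_dump_json)
    return [o["osd"] for o in dump["osds"] if o["up"] == 1 and o["in"] == 0]


if __name__ == "__main__":
    # Hypothetical excerpt: osd.0 is draining, osd.1 is normal, osd.2 is down.
    sample = json.dumps({"osds": [
        {"osd": 0, "up": 1, "in": 0},
        {"osd": 1, "up": 1, "in": 1},
        {"osd": 2, "up": 0, "in": 0},
    ]})
    print(draining_osds(sample))
```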

kliment.toshkov · Jun 29, 2021

all went well, thanks
