Huge problem when adding a Ceph OSD

Charalampos Toulkaridis

Hi,
I run into the following problem when I add a new OSD to the Ceph storage.

After adding a new OSD, the whole storage becomes unstable for the VMs. A lot of VMs crash with the error "hung_task_timeout_secs" (see the attached image). The rebalance process takes about 20 min, but during this time the storage is practically inaccessible for other operations.

It is possible to add an OSD with weight 0 and increase the weight a little bit at a time, as the "Argonaut (v0.48) Best Practices" note suggests at docs.ceph.com/docs/giant/rados/operations/add-or-rm-osds/ . What else could I do?


Code:
root@pve01:~# ceph status
    cluster bc865d82-2de0-439f-ae34-f14a565c023d
     health HEALTH_WARN
            55 pgs backfill
            22 pgs backfilling
            8 pgs peering
            11 pgs stuck inactive
            92 pgs stuck unclean
            33 requests are blocked > 32 sec
            recovery 96309/1702601 objects misplaced (5.657%)
     monmap e3: 3 mons at {0=172.16.0.1:6789/0,1=172.16.0.2:6789/0,2=172.16.0.3:6789/0}
            election epoch 152, quorum 0,1,2 0,1,2
     osdmap e2573: 19 osds: 19 up, 19 in; 77 remapped pgs
      pgmap v12000981: 1088 pgs, 2 pools, 2076 GB data, 537 kobjects
            6460 GB used, 28362 GB / 34823 GB avail
            96309/1702601 objects misplaced (5.657%)
                 996 active+clean
                  55 active+remapped+wait_backfill
                  22 active+remapped+backfilling
                   8 peering
                   7 activating
recovery io 357 MB/s, 90 objects/s
  client io 2691 kB/s rd, 5328 kB/s wr, 157 op/s
root@pve01:~#
PVE Manager Version: pve-manager/4.4-5/c43015a5

The Ceph cluster network runs over InfiniBand.
Right now I only have 20 OSDs on 5 nodes, but following the process above we want to add another 30 OSDs, and that is a real huge problem. All the new OSDs are SSDs, but 12 of the old ones are SATA.
 

Attachments

  • Screen Shot 02-22-17 at 03.28 PM.PNG
You can set

osd max backfills = 1
osd recovery max active = 1

in /etc/pve/ceph.conf, if not already done,

and run

ceph osd set noscrub
ceph osd set nodeep-scrub

on the command line. When the rebalance is done, re-enable scrubbing by unsetting the same flags.
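
For reference, a minimal sketch of that sequence (the flag commands are standard Ceph CLI; the comments about when to run each step are just assumptions about a typical maintenance window):

Code:
# Before adding the new OSDs: pause scrubbing so it does not compete
# with backfill traffic
ceph osd set noscrub
ceph osd set nodeep-scrub

# ...add the OSDs and wait until "ceph -s" reports the cluster as healthy...

# Afterwards: turn scrubbing back on
ceph osd unset noscrub
ceph osd unset nodeep-scrub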
 
As Storm pointed out above, reducing max backfills and max active to 1 will eliminate that issue. It is not actually a bug: too many backfill and recovery threads running at the same time cause a massive read/write load on the Ceph cluster. If the network infrastructure, drives, CPU and memory are not of very high specification, it is simply hard to recover any faster.

In a large-scale Ceph cluster with hundreds of OSDs, dual-socket server CPUs and SSD OSDs or SSD journals, this is not a problem. Keep in mind that reducing max backfills and max active will slow down recovery significantly, so do not lose patience. But your VMs will run fine, and users will not really notice that anything is happening behind the scenes.
 
Thanks to both of you for your answers.
One question: after changing "backfills" and "max active", do I have to reload anything, or is it enough to change the config file?

FYI: All nodes are dual-socket servers (2x12c/24t) with 512GB RAM, and all disks have an SSD as journal. Networking is InfiniBand (40Gbps). Only the number of OSDs is small (we plan to scale it up month by month). Regarding the "recovery io 357 MB/s, 90 objects/s": I have seen it write at 530MB/s.
 
The ceph command Udo suggested is the best way to inject new configuration into a running Ceph cluster. It is also a great way to test different settings and check performance. Once you are happy with them, though, you have to put the changes into ceph.conf, because configuration injected with the command does not persist across a reboot. injectargs itself does not require any reboot.
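
To make the values persistent, the same settings go into the Ceph config file; a minimal sketch, assuming the usual [osd] section of /etc/pve/ceph.conf:

Code:
[osd]
     osd max backfills = 1
     osd recovery max active = 1

The file is read by the OSD daemons when they start; until the next restart, injectargs covers the running daemons.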
 
I tested the commands in a test environment and they look to work.
The accepted syntax is the following:
ceph tell osd.* injectargs '--osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_max_backfills 1'

I will try it on the production system at the weekend (keeping an eye on it at night) and I will inform you.
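
One way to verify that the injected values took effect is to query an OSD's admin socket on the node that hosts it (osd.0 below is just an example ID):

Code:
# run on the node that hosts osd.0
ceph daemon osd.0 config get osd_max_backfills
ceph daemon osd.0 config get osd_recovery_max_active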
 

May I ask what InfiniBand hardware exactly you have in use, and whether you have a redundant network (bonding/stacking)?
I am thinking about using IB as well.
 
Hi,
first of all I can confirm that after changing osd_recovery_max_active and osd_max_backfills to 1, all new OSD disks were added without any problem. Every new OSD (1.92TB SSD) needs about 15 min to rebalance, and I have added about 20 disks during the last 3 weeks without any impact on the VMs.

Chris_lee:
We don't have a redundant network, only the IB. We have 4 managed switches; every server has a dual-port IB PCIe adapter and is connected to 2 switches.
 
Active-backup is sufficient if you run on 40G. I guess I'll try IB in a home lab; prices for the hardware seem to be worth it. Thanks for the info.
Are there still firmware images out there, or are you stuck on the firmware the devices come with?
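
For what it's worth, an active-backup bond over two IPoIB interfaces can be sketched roughly like this in a Debian-style /etc/network/interfaces (the interface names ib0/ib1 and the address are assumptions; IPoIB bonding only works in active-backup mode, and details such as connected mode and MTU are left out here):

Code:
auto bond0
iface bond0 inet static
    address 172.16.0.1
    netmask 255.255.255.0
    bond-slaves ib0 ib1
    bond-mode active-backup
    bond-miimon 100
    bond-primary ib0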
 
Well, I updated them from
version: 3.0.0
date: Mar 26 2010 01:00:02 PM
to
version: 3.9.1
date: Sep 23 2012 12:11:59 PM

but I now see that HP has deleted the update files from the FTP server:
ftp://ftp.hp.com/pub/softlib2/software1/pubsw-linux/p1353577982/
 
I had similar results adding a node to a Proxmox 5.1/Luminous cluster with an all-SSD deployment last night. I added a node with 8 1.6TB SSD drives to a cluster containing 70 existing 480GB drives; the result was chaos: tons of PGs stuck, which had to be cleared manually after the rebalance. It was a total FUBAR and took hours to resolve.

@devs, what is the correct procedure / which Ceph tunables should be used to add OSDs to an existing cluster and remain operational?
 

Here is what happened, in my opinion. This is without knowing what procedure you used while adding that node, so apologies in advance.

If you added the node and all 8 SSDs at the same time, this sort of problem is very much expected. On top of that, your new SSDs are 1.6TB instead of 480GB, which probably made matters that much worse if you added each OSD with its full weight.
I assume you are familiar with the 'weight' of OSDs. The new 1.6TB OSDs obviously have a much higher weight than the 480GB ones. Ceph distributes data based on weights: the higher the weight, the higher the density of data. So in this scenario your entire Ceph cluster went into chaos, redistributing a massive amount of data from every single one of the 70 OSDs.
One correct way to do it would be to add all the OSDs at once with weight 0, then increase the weight slowly, say by 0.10 or 0.25 at a time, wait for redistribution to finish, increase again, wait, and so on. That way Ceph won't go crazy, since it deals with smaller chunks of data at a time. A rough sketch of this follows below.
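
A minimal sketch under some assumptions (the OSD ID osd.70, the step sizes and the osd_crush_initial_weight option are illustrative; the target CRUSH weight for a 1.6TB drive is roughly its size in TiB):

Code:
# Optionally, in the [osd] section of ceph.conf, let new OSDs join CRUSH with zero weight:
#   osd crush initial weight = 0

# Then raise the CRUSH weight of each new OSD in small steps,
# waiting for "ceph -s" to settle between steps:
ceph osd crush reweight osd.70 0.25
# ...wait for backfill to finish...
ceph osd crush reweight osd.70 0.50
# ...continue until the weight matches the drive size (about 1.45 for a 1.6TB drive)...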

Be aware that the 8 1.6TB OSDs will end up holding a much bigger chunk of the data than the others, rather than all OSDs holding an equal number of objects.
Hope this helps.
 
TY Wasim, I had similar thinking as well, but if I use pveceph to create the OSD, it will be created with its full weight. Would it be safe to do so and reweight after deployment, or should I create it by hand with a more appropriately weighted value to begin with?
 
You are correct, pveceph/the GUI creates OSDs with full weight. In most cases that is not really a problem. If your recovery threads are set very low and you add 1 OSD at a time, the Ceph cluster will be able to handle the load fine. If you have a dedicated network for Ceph replication and it is 10Gbps or faster, this will cause no chaos.

We have large Proxmox+Ceph clusters with lots of OSDs, but we try to keep the OSDs of equal size and simply use the GUI to add/remove them. The only time we create OSDs manually is when the scenario is similar to yours. Just keep in mind that anything Ceph-OSD-related takes a long time. So if you imagine replacing your 70 OSDs with 1.6TB drives in the future, expect a very long process. But the beauty is that you can make it as seamless as possible for your users if you just get over the time factor.
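
As an illustration of the "one OSD at a time" approach, a simple sketch of waiting for the cluster to settle between additions (the grep patterns are just assumptions about what the status output contains):

Code:
# after adding each OSD, wait until the cluster is healthy again
until ceph health | grep -q HEALTH_OK; do
    ceph -s | grep -E 'backfill|misplaced|recovery' || true
    sleep 60
done
# ...then add the next OSD...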
 
