Urgent: Ceph Help Needed

ejmerkel

Renowned Member
Sep 20, 2012
Hello,

We have a 3-node Proxmox/Ceph cluster. This morning, all of the OSDs on one node were marked down/out. Right now all the VMs are up, but the IO is getting killed by the recovery of the OSDs and they are basically not responding.

I have added the following in /etc/pve/ceph.conf

Code:
osd max backfills = 1
osd recovery max active = 1

How do I make this active to lessen the IO of the rebuild?

We already had one OSD out, but why would all of the OSDs on one server get marked down/out at once? Here is all I see in the logs.

Code:
2015-09-21 7:00:54.877290 mon.0 10.0.3.11:6789/0 9323720 : [INF] pgmap v12130886: 1536 pgs: 1431 active+clean, 77 active+degraded, 28 active+remapped; 3394 GB data, 9949 GB used, 53353 GB / 63302 GB avail; 1635 kB/s rd, 222 kB/s wr, 78 op/s; 63933/2613351 objects degraded (2.446%)
2015-09-21 07:03:54.142139 mon.1 10.0.3.12:6789/0 3973 : [INF] pgmap v12130959: 1536 pgs: 1518 active+degraded, 18 active+remapped; 3394 GB data, 9949 GB used, 53353 GB / 63302 GB avail; 16683 kB/s rd, 1012 kB/s wr, 868 op/s; 870311/2613351 objects degraded (33.302%)
2015-09-21 07:03:55.670321 mon.1 10.0.3.12:6789/0 3974 : [INF] mon.1 calling new monitor election
2015-09-21 07:03:55.673390 mon.0 10.0.3.11:6789/0 9323722 : [INF] mon.0 calling new monitor election
2015-09-21 07:03:55.675321 mon.0 10.0.3.11:6789/0 9323723 : [INF] mon.0 calling new monitor election
2015-09-21 07:03:55.893161 mon.0 10.0.3.11:6789/0 9323724 : [INF] mon.0@0 won leader election with quorum 0,1,2
2015-09-21 07:03:56.231606 mon.0 10.0.3.11:6789/0 9323725 : [INF] monmap e3: 3 mons at {0=10.0.3.11:6789/0,1=10.0.3.12:6789/0,2=10.0.3.13:6789/0}
2015-09-21 07:03:56.231748 mon.0 10.0.3.11:6789/0 9323726 : [INF] pgmap v12130959: 1536 pgs: 1518 active+degraded, 18 active+remapped; 3394 GB data, 9949 GB used, 53353 GB / 63302 GB avail; 870311/2613351 objects degraded (33.302%)
2015-09-21 07:03:56.232198 mon.0 10.0.3.11:6789/0 9323727 : [INF] mdsmap e1: 0/0/1 up
2015-09-21 07:03:56.251357 mon.0 10.0.3.11:6789/0 9323728 : [INF] osdmap e1155: 18 osds: 12 up, 17 in

My first priority is how to lessen the recovery so the VMs will become responsive again. Thanks in advance for any advice or help!

Best regards,
Eric
 
You can change the running config with "ceph tell osd.*":

Code:
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-threads 1'
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph tell osd.* injectargs '--osd-client-op-priority 63'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
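Note that injectargs only changes the running daemons; the same values in /etc/pve/ceph.conf will then cover the next OSD restart. To check what a running OSD is actually using, something like this should work (osd.0 is just an example id):

Code:
# query the running config of a local OSD via its admin socket
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep -E 'backfills|recovery'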
 

Thank you, that seemed to help. I have a question regarding "--osd-client-op-priority 63": isn't that already the default? I suppose you were just wanting to make sure it was set correctly?

Eric
 
Hi,
yes, it's the default. You can check your values with
Code:
ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show | grep priority
Udo

BTW, perhaps it's much faster to bring the OSDs on that node back to life, and to set "ceph osd set noout" beforehand to stop a rebuild onto the other disks...
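A rough sketch of that sequence, in case it helps (the OSD ids and the sysvinit-style service call are assumptions for this setup; use whatever "ceph osd tree" shows as down on the affected node):

Code:
# tell the cluster not to mark down OSDs out, so no rebalancing starts
ceph osd set noout

# restart the down OSDs on the affected node (osd.12 / osd.13 are example ids)
service ceph start osd.12
service ceph start osd.13

# watch recovery; once the PGs are active+clean again, remove the flag
ceph -w
ceph osd unset noout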