Urgent: Proxmox/Ceph Support Needed

ejmerkel

Renowned Member
Sep 20, 2012
Hello,

Are there any professional paid Proxmox/Ceph support people on the forum who could assist us? Would prefer US based but really need help quickly.

Please email me at eric.merkel at sozotechnologies.com or call me at 317-203-9222 if you can help.

Our Ceph cluster has lost 33% of its disks and it is killing our IO on all the servers. We have max backfills etc. turned down to 1, but it is not helping. Wondering if we can change the number of copies from 3 to 2 on the fly in Ceph?


Best regards,
Eric
 
Hi Eric,
I assume the rebuild is slowing everything down?!

If you have a chance to get the failed node back, you can set noout to avoid the recovery...

Udo
 
Just to add, the command for setting noout is:

> ceph osd set noout

To set it back to normal:

> ceph osd unset noout
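
Worth noting: noout only prevents OSDs from being marked out after they go down; it does not pause recovery or backfill that is already under way. Ceph has separate cluster-wide flags for that, roughly:

Code:
# pause data movement that is already scheduled
ceph osd set nobackfill
ceph osd set norecover

# resume when ready
ceph osd unset nobackfill
ceph osd unset norecover

# currently set flags show up in "ceph -s" and here:
ceph osd dump | grep flags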
 
We have tried running the command

ceph osd set noout

but it does not look like it stopped the recovery. Is there something special you need to do in order to stop a recovery that is already in progress?

 

I did this but the rebuild continues, probably because I marked the OSDs OUT previously. Should I put them back IN and then OUT to stop the rebuild?

Is there a way to change the replicas from 3/1 to 2/1 on the fly?

Best regards,
Eric
 
Not sure why you want to change the replica count from 3 to 2. I don't think that will help with the situation you currently have on hand.

Some details of what caused the 33% loss would definitely help us give you applicable advice, such as what exactly was done (e.g. marking OSDs OUT) before the issue occurred. It would be a bad idea right now to try to change the replica count. Also provide info such as how many nodes, how many OSDs per node, etc.
 

We have 3 nodes with 6 OSDs each (4 TB/OSD). One of the 3 nodes started having high latency (apply/commit). We could not find any networking issues or errors. This node was adversely affecting the other 2 nodes because of the high latency, even though the health of the Ceph cluster was OK.

We tried rebooting the server without any luck. We thought removing the OSDs might get rid of the high IO latency, and we had set the following to minimize the rebuild impact, but it didn't help.

Code:
ceph tell osd.* injectargs '--osd_max_backfills 1'
ceph tell mon.* injectargs '--osd_max_backfills 1'
ceph tell osd.* injectargs '--osd_recovery_max_active 1'
ceph tell mon.* injectargs '--osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_max_single_start 1'
ceph tell mon.* injectargs '--osd_recovery_max_single_start 1'
Just grasping at straws. I have put the OSDs on hv03 back UP/IN. Going to wait for it to finish the rebuild and then

Code:
ceph osd set noout

and mark all of those OSDs as UP/OUT. How does that sound?

Best regards,
Eric
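
As an aside on the injectargs lines above: the mon.* variants are most likely no-ops, since osd_max_backfills and the recovery options are OSD daemon settings. The value an OSD is actually running with can be double-checked via its admin socket on the node hosting it; a minimal sketch (osd.12 is just an example ID):

Code:
# run on the node that hosts the OSD in question
ceph daemon osd.12 config get osd_max_backfills
ceph daemon osd.12 config get osd_recovery_max_active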
 
Mark all the OSDs IN. OUT tells Ceph that those OSDs are not in use, and it will try to move data off them and redistribute it across the rest of the IN OSDs. What does your ceph osd tree look like? Post the results of:
#ceph osd tree
and
#ceph -s

If you have not taken out a large number of OSDs yet, there may still be a chance to bring your Ceph cluster back to health.
 

I have put all of the OSDs on the problem node hv03 back to UP/IN and it is rebuilding. Should I wait until the rebuild is done, then set noout and take them back out?

Here are the outputs you requested. This setup has been working great for months, even during rebuilds.

Code:
ceph osd tree
# id    weight  type name       up/down reweight
-1      65.52   root default
-2      21.84           host fre-he-hv01
0       3.64                    osd.0   up      1
1       3.64                    osd.1   up      1
2       3.64                    osd.2   up      1
3       3.64                    osd.3   up      1
4       3.64                    osd.4   up      1
5       3.64                    osd.5   up      1
-3      21.84           host fre-he-hv02
6       3.64                    osd.6   up      1
7       3.64                    osd.7   up      1
8       3.64                    osd.8   up      1
9       3.64                    osd.9   up      1
10      3.64                    osd.10  up      1
11      3.64                    osd.11  up      1
-4      21.84           host fre-he-hv03
12      3.64                    osd.12  up      1
13      3.64                    osd.13  up      1
14      3.64                    osd.14  up      1
15      3.64                    osd.15  up      1
16      3.64                    osd.16  up      1
17      3.64                    osd.17  up      1

Code:
ceph -s
    cluster 19d554ee-15b1-4076-9b47-4399b1f8a6d4
     health HEALTH_WARN 32 pgs backfill; 1 pgs degraded; 1 pgs recovering; 125 pgs recovery_wait; 158 pgs stuck unclean; recovery 71540/2814405 objects degraded (2.542%); noout flag(s) set
     monmap e3: 3 mons at {0=10.0.3.11:6789/0,1=10.0.3.12:6789/0,2=10.0.3.13:6789/0}, election epoch 616, quorum 0,1,2 0,1,2
     osdmap e2116: 18 osds: 18 up, 18 in
            flags noout
      pgmap v15660051: 1536 pgs, 3 pools, 3567 GB data, 894 kobjects
            10690 GB used, 56335 GB / 67026 GB avail
            71540/2814405 objects degraded (2.542%)
                 125 active+recovery_wait
                1378 active+clean
                  31 active+remapped+wait_backfill
                   1 active+recovering
                   1 active+degraded+remapped+wait_backfill
recovery io 7926 kB/s, 1 objects/s
  client io 1502 kB/s rd, 302 kB/s wr, 90 op/s

Thanks for your help!

Best regards,
Eric
 
For now, let it finish rebuilding and reach HEALTH_OK status. After that, disable noout with:
#ceph osd unset noout
Then observe the Ceph cluster's behavior. What are you trying to achieve in the end? Taking some OSDs out of node 3, or taking the whole node 3 out of the cluster?
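
While waiting for HEALTH_OK, the recovery progress can be followed live, for example with:

Code:
ceph -w    # streams cluster events, including recovery/backfill progress
ceph -s    # one-shot status; shows degraded % and flags such as noout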
 

I am just trying to get the cluster up and running. As you can see, the third hypervisor has very high latency. I am thinking that once the rebuild finishes, if I take those OSDs OFFLINE/OUT and the cluster does not start rebuilding, then I will have 2 nodes up and running normally. Right now the high IO is keeping all the VMs from working.

Code:
ceph osd perf
osdid fs_commit_latency(ms) fs_apply_latency(ms)
    0                     0                    0
    1                     0                    0
    2                     0                    0
    3                     0                    0
    4                     0                    0
    5                     0                    0
    6                     0                    4
    7                     0                    3
    8                     0                    1
    9                     0                    1
   10                     0                    1
   11                     0                    1
   12                  9643                 9651
   13                 13649                13657
   14                  8732                 8737
   15                  9727                 9734
   16                  6932                 6975
   17                  6310                 6316
 
Right now I have a pair of Intel 200 GB SSDs mirrored in RAID 1 as the journal. The other nodes are set up the same but working fine... could there be something wrong with the SSD?

Best regards,
Eric
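
For what it's worth, one way to rule the journal SSD in or out is to look at its SMART data and its live latency on the affected node; a rough sketch, assuming smartmontools and sysstat are installed and /dev/sdX stands in for the journal device:

Code:
# SMART health, wear and reallocated-sector counters (read-only)
smartctl -a /dev/sdX

# per-device utilization and await times, refreshed every second
iostat -x 1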
 
High IO/latency is normal on Ceph OSDs during rebuilding, as you probably already know. Using just about any SSD does not increase Ceph performance. For example, Intel DC S3500/S3700 series SSDs are extremely good for a Ceph journal, whereas many other brands/models will not give you any noticeable gain. Not sure which SSD you are using.

Node 3 is showing high traffic because Ceph is putting data back into those OSDs, which were being emptied due to the OUT marking. OUT means you no longer wish to use those OSDs, or that a very critical error occurred with the drives.

After it is done rebuilding, you MUST reduce the replica size before you mark OUT the node 3 OSDs!! Do not decommission node 3 while your pools are still on replica 3. With replica 3 it will always try to make 3 copies. The replica size should not be larger than the number of nodes in the Ceph cluster, so if you are planning to take out node 3, you must have replica size 2, not 3.

How is the rebuilding going?
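
As background for the replica-size point above: the default CRUSH rule places each replica on a different host, which is why a 2-node cluster cannot satisfy size 3. The failure domain of the rule in use can be inspected, for example, with:

Code:
# look for the "chooseleaf ... type host" step in the rule your pools use
ceph osd crush rule dump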
 

The rebuilding did not complete, but that is because I marked the OSDs on node 3 down. The system seems to have stabilized with the noout option on and shows 33% degraded. I am going to try to figure out what caused the high latency on node 3's OSDs in the first place. If I can't, I am just going to convert this node to local disks.

That being said, how do I change the number of replicas from 3 to 2? Can this be done live without affecting the current traffic?

I am calling it a night... any advice is much appreciated.

Eric
 

http://docs.ceph.com/docs/master/rados/operations/pools/

Read the part under
SET THE NUMBER OF OBJECT REPLICAS

To set the number of object replicas on a replicated pool, execute the following:
ceph osd pool set {poolname} size {num-replicas}

For example:
ceph osd pool set data size 3

You may execute this command for each pool. Note: An object might accept I/Os in degraded mode with fewer than pool size replicas. To set a minimum number of required replicas for I/O, use the min_size setting. For example:
ceph osd pool set data min_size 2

This ensures that no object in the data pool will receive I/O with fewer than min_size replicas.

You should be able to reduce the number of replicas on a replicated pool as long as you keep your new "size" >= your "min_size"; otherwise your cluster will not be able to do any I/O on the pool in question.
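
Before changing anything, the current size and min_size of each pool can be confirmed, for example with:

Code:
ceph osd pool get rbd size
ceph osd pool get rbd min_size

# or for all pools at once
ceph osd dump | grep 'replicated size'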
 
Right now we have 3 pools

data size/min 3/1
metadata size/min 3/1
rbd size/min 3/1

rbd is the main pool, so would I just run the following command?

Code:
ceph osd pool set rbd size 2

Do I need to do this for data and metadata too? Those have 0% usage.

Once this is done, would it be safe to take node 3 down?

Best regards,
Eric
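
For reference, a minimal sketch of what setting size 2 on all three pools would look like (min_size is already 1 here, so only size changes; keep size >= min_size as noted above):

Code:
for pool in rbd data metadata; do
    ceph osd pool set $pool size 2
done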
 
Yes, you need to decrease the replica size for all pools that currently have 3 replicas.

I would suggest taking down 1 OSD at a time from node 3.

Hi Wasim,
I would not recommend switching the replica count to 2 - Eric effectively has only two active replicas right now, so he still has only one additional copy.
But if he switches to replica 2, the rebuild will generate a lot of traffic and the VMs will slow down again.

The only gain after all of that is a healthy cluster.
I would prefer to resolve the issue on the one node...

Udo
 
Yep, agree with you. That's why I suggested that he not change the replica count before he achieves a healthy cluster across all 3 nodes.

Eric, to make things clearer: you must achieve a healthy cluster with all 3 nodes and all OSDs active before you change the replica size. Only after the replica count is changed should you attempt to take out the OSDs from the 3rd node. I am still unsure why you are taking down the 3rd node. Is it to replace it, or is your goal to set up a 2-node Ceph cluster? What's the status of your current cluster? Is it done rebuilding?
 
