Urgent: Proxmox/Ceph Support Needed

ejmerkel

Renowned Member
Sep 20, 2012
Hello,

Are there any professional paid Proxmox/Ceph support people on the forum who could assist us? Would prefer US based but really need help quickly.

Please email me at eric.merkel at sozotechnologies.com or call me at 317-203-9222 if you can help.

Our Ceph cluster has lost 33% of its disks and it is killing our IO on all the servers. We have max backfills etc. turned down to 1, but it is not helping. Wondering if we can change the number of copies from 3 to 2 on the fly in Ceph?


Best regards,
Eric
 
Hi Eric,
I assume the rebuild is slowing everything down?!

If you have a chance to get the failed node back, you can set noout to avoid the recovery...

Udo
 
Just to add, the command for setting noout is:

> ceph osd set noout

To set it back to normal:

> ceph osd unset noout
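
Worth noting: noout only prevents OSDs from being marked out after they go down; it does not pause recovery or backfill that is already under way. Ceph has separate cluster-wide flags for that, roughly:

Code:
# pause data movement that is already scheduled
ceph osd set nobackfill
ceph osd set norecover

# resume when ready
ceph osd unset nobackfill
ceph osd unset norecover

# currently set flags show up in "ceph -s" and here:
ceph osd dump | grep flags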
 
We have tried running the command

ceph osd set noout

but it does not look like it stopped the recovery. Is there something special you need to do in order to stop a recovery that is already in progress?

 

I did this but the rebuild continues, probably because I marked the OSDs OUT previously. Should I put them back IN and then OUT to stop the rebuild?

Is there a way to change the replicas from 3/1 to 2/1 on the fly?

Best regards,
Eric
 
Not sure why you want to change the replica count from 3 to 2. I don't think that will help with the situation you currently have on hand.

Some details of what caused the 33% loss would definitely help us give you applicable advice, such as what exactly was done (e.g. marking OSDs OUT) before the issue occurred. It would be a bad idea right now to try to change the replica count. Also provide info such as how many nodes, how many OSDs per node, etc.
 

We have 3 nodes with 6 OSDs each (4 TB/OSD). One of the 3 nodes started having high latency (apply/commit). We could not find any networking issues or errors. This node was adversely affecting the other 2 nodes because of the high latency, even though the health of the Ceph cluster was OK.

We tried rebooting the server without any luck. We thought removing the OSDs might get rid of the high IO latency, and we had set the following to minimize the rebuild impact, but it didn't help.

Code:
ceph tell osd.* injectargs '--osd_max_backfills 1'
ceph tell mon.* injectargs '--osd_max_backfills 1'
ceph tell osd.* injectargs '--osd_recovery_max_active 1'
ceph tell mon.* injectargs '--osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_max_single_start 1'
ceph tell mon.* injectargs '--osd_recovery_max_single_start 1'
Just grasping at straws. I have put the OSDs on hv03 back UP/IN. Going to wait for it to finish the rebuild and then

Code:
ceph osd set noout

and mark all of those OSDs as UP/OUT. How does that sound?

Best regards,
Eric
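
As an aside on the injectargs lines above: the mon.* variants are most likely no-ops, since osd_max_backfills and the recovery options are OSD daemon settings. The value an OSD is actually running with can be double-checked via its admin socket on the node hosting it; a minimal sketch (osd.12 is just an example ID):

Code:
# run on the node that hosts the OSD in question
ceph daemon osd.12 config get osd_max_backfills
ceph daemon osd.12 config get osd_recovery_max_active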
 
Mark all the OSDs IN. OUT tells Ceph that those OSDs are not in use, and it will try to move data off them and redistribute it across the rest of the IN OSDs. What does your ceph osd tree look like? Post the results of:
#ceph osd tree
and
#ceph -s

If you have not taken out a large number of OSDs yet, there may still be a chance to bring your Ceph cluster back to health.
 

I have put all of the OSDs on the problem node hv03 back to UP/IN and it is rebuilding. Should I wait until the rebuild is done, then set noout and take them back out?

Here are the outputs you requested. This setup has been working great for months, even during rebuilds.

Code:
ceph osd tree
# id    weight  type name       up/down reweight
-1      65.52   root default
-2      21.84           host fre-he-hv01
0       3.64                    osd.0   up      1
1       3.64                    osd.1   up      1
2       3.64                    osd.2   up      1
3       3.64                    osd.3   up      1
4       3.64                    osd.4   up      1
5       3.64                    osd.5   up      1
-3      21.84           host fre-he-hv02
6       3.64                    osd.6   up      1
7       3.64                    osd.7   up      1
8       3.64                    osd.8   up      1
9       3.64                    osd.9   up      1
10      3.64                    osd.10  up      1
11      3.64                    osd.11  up      1
-4      21.84           host fre-he-hv03
12      3.64                    osd.12  up      1
13      3.64                    osd.13  up      1
14      3.64                    osd.14  up      1
15      3.64                    osd.15  up      1
16      3.64                    osd.16  up      1
17      3.64                    osd.17  up      1

Code:
ceph -s
    cluster 19d554ee-15b1-4076-9b47-4399b1f8a6d4
     health HEALTH_WARN 32 pgs backfill; 1 pgs degraded; 1 pgs recovering; 125 pgs recovery_wait; 158 pgs stuck unclean; recovery 71540/2814405 objects degraded (2.542%); noout flag(s) set
     monmap e3: 3 mons at {0=10.0.3.11:6789/0,1=10.0.3.12:6789/0,2=10.0.3.13:6789/0}, election epoch 616, quorum 0,1,2 0,1,2
     osdmap e2116: 18 osds: 18 up, 18 in
            flags noout
      pgmap v15660051: 1536 pgs, 3 pools, 3567 GB data, 894 kobjects
            10690 GB used, 56335 GB / 67026 GB avail
            71540/2814405 objects degraded (2.542%)
                 125 active+recovery_wait
                1378 active+clean
                  31 active+remapped+wait_backfill
                   1 active+recovering
                   1 active+degraded+remapped+wait_backfill
recovery io 7926 kB/s, 1 objects/s
  client io 1502 kB/s rd, 302 kB/s wr, 90 op/s

Thanks for your help!

Best regards,
Eric
 
For now, let it finish rebuilding and reach HEALTH_OK status. After that, disable noout with:
#ceph osd unset noout
Then observe the Ceph cluster's behavior. What are you trying to achieve in the end? Taking some OSDs out of node 3, or taking the whole node 3 out of the cluster?
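
While waiting for HEALTH_OK, the recovery progress can be followed live, for example with:

Code:
ceph -w    # streams cluster events, including recovery/backfill progress
ceph -s    # one-shot status; shows degraded % and flags such as noout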
 

I am just trying to get the cluster up and running. As you can see, the third hypervisor has very high latency. I am thinking that once the rebuild finishes, if I take those OSDs OFFLINE/OUT and the cluster does not start rebuilding, then I will have 2 nodes up and running normally. Right now the high IO is keeping all the VMs from working.

Code:
ceph osd perf
osdid fs_commit_latency(ms) fs_apply_latency(ms)
    0                     0                    0
    1                     0                    0
    2                     0                    0
    3                     0                    0
    4                     0                    0
    5                     0                    0
    6                     0                    4
    7                     0                    3
    8                     0                    1
    9                     0                    1
   10                     0                    1
   11                     0                    1
   12                  9643                 9651
   13                 13649                13657
   14                  8732                 8737
   15                  9727                 9734
   16                  6932                 6975
   17                  6310                 6316
 
Right now I have a pair of Intel 200 GB SSDs mirrored in RAID 1 as the journal. The other nodes are set up the same but working fine... could there be something wrong with the SSD?

Best regards,
Eric
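
For what it's worth, one way to rule the journal SSD in or out is to look at its SMART data and its live latency on the affected node; a rough sketch, assuming smartmontools and sysstat are installed and /dev/sdX stands in for the journal device:

Code:
# SMART health, wear and reallocated-sector counters (read-only)
smartctl -a /dev/sdX

# per-device utilization and await times, refreshed every second
iostat -x 1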
 
High IO/latency is normal on Ceph OSDs during rebuilding, as you probably already know. Using just about any SSD does not increase Ceph performance. For example, Intel DC S3500/S3700 series SSDs are extremely good for a Ceph journal, whereas many other brands/models will not give you any noticeable gain. Not sure which SSD you are using.

Node 3 is showing high traffic because Ceph is putting data back into those OSDs, which were being emptied due to the OUT marking. OUT means you no longer wish to use those OSDs, or that a very critical error occurred with the drives.

After it is done rebuilding, you MUST reduce the replica size before you mark OUT the node 3 OSDs!! Do not decommission node 3 while your pools are still on replica 3. With replica 3 it will always try to make 3 copies. The replica size should not be larger than the number of nodes in the Ceph cluster, so if you are planning to take out node 3, you must have replica size 2, not 3.

How is the rebuilding going?
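
As background for the replica-size point above: the default CRUSH rule places each replica on a different host, which is why a 2-node cluster cannot satisfy size 3. The failure domain of the rule in use can be inspected, for example, with:

Code:
# look for the "chooseleaf ... type host" step in the rule your pools use
ceph osd crush rule dump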
 

The rebuilding did not complete, but that is because I marked the OSDs on node 3 down. The system seems to have stabilized with the noout option on and shows 33% degraded. I am going to try to figure out what caused the high latency on node 3's OSDs in the first place. If I can't, I am just going to convert this node to local disks.

That being said, how do I change the number of replicas from 3 to 2? Can this be done live without affecting the current traffic?

I am calling it a night... any advice is much appreciated.

Eric
 

http://docs.ceph.com/docs/master/rados/operations/pools/

Read the part under
SET THE NUMBER OF OBJECT REPLICAS

To set the number of object replicas on a replicated pool, execute the following:
ceph osd pool set {poolname} size {num-replicas}

For example:
ceph osd pool set data size 3

You may execute this command for each pool. Note: An object might accept I/Os in degraded mode with fewer than pool size replicas. To set a minimum number of required replicas for I/O, use the min_size setting. For example:
ceph osd pool set data min_size 2

This ensures that no object in the data pool will receive I/O with fewer than min_size replicas.

You should be able to reduce the number of replicas on a replicated pool as long as you keep your new "size" >= your "min_size"; otherwise your cluster will not be able to do any I/O on the pool in question.
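
Before changing anything, the current size and min_size of each pool can be confirmed, for example with:

Code:
ceph osd pool get rbd size
ceph osd pool get rbd min_size

# or for all pools at once
ceph osd dump | grep 'replicated size'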
 
Right now we have 3 pools

data size/min 3/1
metadata size/min 3/1
rbd size/min 3/1

rbd is the main pool, so would I just run the following command?

Code:
ceph osd pool set rbd size 2

Do I need to do this for data and metadata too? Those have 0% usage.

Once this is done, would it be safe to take node 3 down?

Best regards,
Eric
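
For reference, a minimal sketch of what setting size 2 on all three pools would look like (min_size is already 1 here, so only size changes; keep size >= min_size as noted above):

Code:
for pool in rbd data metadata; do
    ceph osd pool set $pool size 2
done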
 
Yes, you need to decrease the replica size for all pools that currently have 3 replicas.

I would suggest taking down 1 OSD at a time from node 3.

Hi Wasim,
I would not recommend switching the replica count to 2 - Eric effectively has only two active replicas right now, so he still has only one additional copy.
But if he switches to replica 2, the rebuild will generate a lot of traffic and the VMs will slow down again.

The only gain after all of that is a healthy cluster.
I would prefer to resolve the issue on the one node...

Udo
 
Yep, agree with you. That's why I suggested that he not change the replica count before he achieves a healthy cluster across all 3 nodes.

Eric, to make things clearer: you must achieve a healthy cluster with all 3 nodes and all OSDs active before you change the replica size. Only after the replica count is changed should you attempt to take out the OSDs from the 3rd node. I am still unsure why you are taking down the 3rd node. Is it to replace it, or is your goal to set up a 2-node Ceph cluster? What's the status of your current cluster? Is it done rebuilding?
 
