Understanding Ceph

sourceminer

Jan 7, 2015
Hi proxmox fans.

Excuse my ignorance in this area; I have been trying to wrap my head around the Ceph clustering model. I've watched some videos and read a bunch. I have a 4-node cluster with equivalent hardware, and based on some tutorials I have set up a functioning environment. I would like to be able to have one node go down without any kind of faults.

Here is my problem.
When one host goes down, everything goes down, so clearly I have something set up wrong.
I also notice that when one node goes down it takes down the other hosts' OSDs.
Here are my CRUSH map and Ceph config:

[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.10.10.0/24
filestore xattr use omap = true
fsid = c4a24163-5d13-4d82-8877-8b6ddc050f29
keyring = /etc/pve/priv/$cluster.$name.keyring
osd journal size = 5120
osd pool default min size = 1
public network = 10.10.10.0/24

[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.2]
host = pm02
mon addr = 10.10.10.2:6789

[mon.0]
host = PM01
mon addr = 10.10.10.1:6789

[mon.1]
host = pm03
mon addr = 10.10.10.3:6789

[mon.3]
host = pm04
mon addr = 10.10.10.4:6789

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host pm02 {
id -2 # do not change unnecessarily
# weight 2.720
alg straw
hash 0 # rjenkins1
item osd.0 weight 2.720
}
host pm03 {
id -3 # do not change unnecessarily
# weight 1.050
alg straw
hash 0 # rjenkins1
item osd.1 weight 1.050
}
host pm04 {
id -4 # do not change unnecessarily
# weight 1.810
alg straw
hash 0 # rjenkins1
item osd.5 weight 1.810
}
host PM01 {
id -5 # do not change unnecessarily
# weight 2.700
alg straw
hash 0 # rjenkins1
item osd.2 weight 0.900
item osd.3 weight 0.900
item osd.4 weight 0.900
}
root default {
id -1 # do not change unnecessarily
# weight 8.280
alg straw
hash 0 # rjenkins1
item pm02 weight 2.720
item pm03 weight 1.050
item pm04 weight 1.810
item PM01 weight 2.700
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map
Logs

root@PM01:~# ceph osd pool get cephStor pg_num
pg_num: 150

root@PM01:~# ceph osd pool get cephStor size
size: 3

I found the Ceph Calc.
Pool: cephStor
Size = 3
OSDs = 5
% of data = 100%
PGs per OSD = 200
Total PGs = 256

If I am reading this correctly, the calculator is suggesting a PG count of 256 with 200 per OSD, whereas I currently only have 150.

How do I update this, and is it correct for what I am trying to accomplish?
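
I am guessing the update itself would be something like the following, using the pool name cephStor and the calculator's 256 - just a sketch on my part, please correct me if it is wrong (as I understand it, pgp_num has to be raised along with pg_num, and the count can only ever be increased):

Code:
# raise the placement group count for the pool (increase only, it can never be lowered)
ceph osd pool set cephStor pg_num 256
# raise pgp_num as well so the new PGs are actually used for data placement
ceph osd pool set cephStor pgp_num 256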
 
We have 6 nodes, each node running 2 OSDs with journals on Intel enterprise SSDs. When a node goes down (2 OSDs out of the 12), about 18% of the cluster goes out and there are still a lot of faults: VMs go down, partition corruption... Still searching for a solution.
 

What does your crush map look like?
 
Here it is:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host ceph01 {
id -2 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
}
host ceph03 {
id -3 # do not change unnecessarily
# weight 3.620
alg straw
hash 0 # rjenkins1
item osd.2 weight 1.810
item osd.3 weight 1.810
}
host ceph02 {
id -4 # do not change unnecessarily
# weight 3.620
alg straw
hash 0 # rjenkins1
item osd.4 weight 1.810
item osd.5 weight 1.810
}
host ceph04 {
id -5 # do not change unnecessarily
# weight 3.620
alg straw
hash 0 # rjenkins1
item osd.0 weight 1.810
item osd.1 weight 1.810
}
host ceph05 {
id -6 # do not change unnecessarily
# weight 3.620
alg straw
hash 0 # rjenkins1
item osd.6 weight 1.810
item osd.7 weight 1.810
}
host ceph06 {
id -7 # do not change unnecessarily
# weight 3.620
alg straw
hash 0 # rjenkins1
item osd.8 weight 1.810
item osd.9 weight 1.810
}
host ceph07 {
id -8 # do not change unnecessarily
# weight 3.620
alg straw
hash 0 # rjenkins1
item osd.10 weight 1.810
item osd.11 weight 1.810
}
root default {
id -1 # do not change unnecessarily
# weight 21.720
alg straw
hash 0 # rjenkins1
item ceph01 weight 0.000
item ceph03 weight 3.620
item ceph02 weight 3.620
item ceph04 weight 3.620
item ceph05 weight 3.620
item ceph06 weight 3.620
item ceph07 weight 3.620
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map
 
What did ceph -w say when the 2 OSDs were down, if you have a copy?

I would also remove the empty host from the crushmap if you have no plan to use it going forward.
 
Unfortunately I can't give you right now what Ceph says when 2 OSDs are down. We have other problems with the VMs running on Ceph: partition corruption even when Ceph health is green. For now we are moving VMs out of Ceph until things become clear and they run without problems. There is nothing in the logs (syslog, ceph log), everything looks OK, but partitions go bad.
The unused ceph01 entry is stuck; that server does not exist anymore and we can't reuse it or remove it. Maybe we just don't know how. :)
 
My advice would be to NOT use Proxmox to learn how Ceph works. Go to Ceph's website and bring up a Ceph RBD cluster using the quick start. It's super easy. It'll help you realize that Proxmox just writes wrappers around Ceph commands, and help you understand where you are failing.

Your PG number looks too low as well... Might want to change that:
ceph osd pool set {pool_name} pg_num {value}

How big are those SSDs? Ceph is surprisingly resilient, definitely more stable than I've seen the Proxmox clustering be.

Also... did this thread get hijacked? I just noticed the OP's question got thrown to the wind and hasn't been answered.
 
As there are no OSDs in it, you can just remove the entry from your crush map.

Do you have the output of ceph -w now in the healthy state?
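
Removing it would be roughly this (untested here, and assuming the empty bucket is still named ceph01 as in the map you posted):

Code:
# remove the empty host bucket from the CRUSH map (it holds no OSDs)
ceph osd crush remove ceph01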
 

Didn't realise there were two separate issues going on here, and I missed the OP.

However, to the OP: as just suggested, the default pool that was created has far too small a PG value; you want this around 200 per OSD if you're not looking to expand further in the near future.

If you're not using the storage in production yet, you're probably better off deleting the current pool and creating a new one. However, you will lose all data in the current pool when you delete it.

You also have a massive difference in size per host, so if you lose the biggest host, Ceph may struggle to rebalance with only the smaller capacity left.

It is suggested to try and keep every host fairly balanced in size.
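
If you do go the delete-and-recreate route, it would be roughly along these lines (a sketch only - adjust the pool name and PG count to your own setup, and remember this destroys everything in the pool):

Code:
# WARNING: this permanently destroys all data in the pool
ceph osd pool delete cephStor cephStor --yes-i-really-really-mean-it
# recreate it with a more sensible PG count (pg_num and pgp_num), e.g. the 256 from the calculator
ceph osd pool create cephStor 256 256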
 
Sorry if I hijacked this thread, it was not my intention. I have the same issue: how to make a small Ceph cluster HA. If you think it's better, I'll open another thread. :)
 

If you're also using the default pool then your issue may very well be the same.
 
Hi proxmox fans.

Excuse my ignorance in this area; I have been trying to wrap my head around the Ceph clustering model. I've watched some videos and read a bunch. I have a 4-node cluster with equivalent hardware, and based on some tutorials I have set up a functioning environment. I would like to be able to have one node go down without any kind of faults.

Here is my problem.
When one host goes down, everything goes down, so clearly I have something set up wrong.
I also notice that when one node goes down it takes down the other hosts' OSDs.
...
Hi,
are you sure that the other OSDs are down if one host is down?
Check with
Code:
ceph osd tree
But I assume the problem is something else.
If you shut down a host, the OSDs and the mon die at the same time, and because of that the OSDs' message that they are going down has no effect. So the cluster must first regain quorum, and after that it takes some time to build a new osdmap (+ pgmap) because of the stopped OSDs.

Try whether it works if you stop the OSDs first, before you shut down the host.
And like Dietmar already wrote - use three mons instead of four (but I don't think this is the issue).
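
For example, a rough sketch (assuming the old init-script style service management on these nodes; with systemd it would be the ceph-osd@<id> units instead):

Code:
# optional: keep the cluster from marking the OSDs out and rebalancing while the host is away
ceph osd set noout
# stop the OSD daemons on that host first (the OSD IDs on PM01, as an example)
service ceph stop osd.2
service ceph stop osd.3
service ceph stop osd.4
# ... now shut the host down; once it is back and the OSDs have rejoined:
ceph osd unset noout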

Udo
 
Wow, OK, glad to see I'm not the only one, but now I am trying to figure out which responses go with the initial question; the inline quoting helps.

So to clarify: when PM04 has gone down (not sure why yet), all the VMs freak out. When I look at the Proxmox GUI, it shows that one of the OSDs on each host is also down. Why would Ceph do that?
 
Your PG number looks too low as well... Might want to change that:
ceph osd pool set {pool_name} pg_num {value}

Thanks for this, I will use this to increase to 256? Still trying to figure out what that actually means...

# Ensure you have a realistic number of placement groups. We recommend
# approximately 100 per OSD. E.g., total number of OSDs multiplied by 100
# divided by the number of replicas (i.e., osd pool default size). So for
# 10 OSDs and osd pool default size = 4, we'd recommend approximately
# (100 * 10) / 4 = 250.


As per the documentation it's suggesting 100 per OSD, but why?
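
If I apply that formula to my own setup (the 6 OSDs in my crush map, size 3 - assuming I am reading it right), I get:

Code:
# documented rule of thumb applied to this cluster: 6 OSDs, 3 replicas
echo $(( (100 * 6) / 3 ))   # = 200, which pgcalc then rounds up to the next power of two, 256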
 
The situation is the same, default pool. The only difference is that we have 6 nodes.
Thanks for this, I will use this to increase to 256? Still trying to figure out what that actually means...

# Ensure you have a realistic number of placement groups. We recommend
# approximately 100 per OSD. E.g., total number of OSDs multiplied by 100
# divided by the number of replicas (i.e., osd pool default size). So for
# 10 OSDs and osd pool default size = 4, we'd recommend approximately
# (100 * 10) / 4 = 250.


As per the documentation it's suggesting 100 per OSD, but why?
Use this to set pg_num. :) http://ceph.com/pgcalc/
 
Like they say,
It's also important to know that the PG count can be increased, but NEVER decreased without destroying / recreating the pool. However, increasing the PG Count of a pool is one of the most impactful events in a Ceph Cluster, and should be avoided for production clusters if possible.
We increased pg_num from 128 to 256 when we added new nodes and there was no data loss, only a lot of data movement.
 
In my case, we have 12 OSDs (6 nodes, 2 OSDs per node). Using pgcalc with the rbd pool name, size 3, 12 OSDs, 100% data and a target of 100 per OSD, the result is a PG count of 512. At the moment we have 256. Should I change to 512 or jump to 1024?
According to the Ceph documentation, the ranges are:
  • Less than 5 OSDs set pg_num to 128
  • Between 5 and 10 OSDs set pg_num to 512
  • Between 10 and 50 OSDs set pg_num to 1024
  • If you have more than 50 OSDs, you need to understand the tradeoffs and how to calculate the pg_num value by yourself
  • For calculating pg_num value by yourself please take help of pgcalc tool
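
Running the same formula from earlier in the thread for our case, as a sanity check (my own arithmetic, so please correct me if I am misreading it):

Code:
# 12 OSDs, 3 replicas, target of roughly 100 PGs per OSD
echo $(( (100 * 12) / 3 ))   # = 400; the next power of two above that is 512, matching pgcalc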
 
I still have not been able to find the reason for the number of "128" placement groups anywhere.
Why 128/512? Why not 2, or 10, or 30?
 
