Hello, everyone.
After a lot of reading on the web and trying to tune Ceph, we were not able to make it HA. If one of the nodes is turned off, after some time we get partition corruption inside the VMs.
The idea is that if a node (2 OSDs) goes down, or if 2 OSDs on different nodes go down, the VMs would keep working without data loss.
We are running a Proxmox cluster with Ceph as storage. Our Ceph cluster currently has 7 nodes, each node with 2 OSDs (2 TB HDD), for a total of 14 OSDs.
Some of them have their journal on a DC SSD, the others use the default journal location on the OSD HDD.
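If it helps, these are the commands that show how the OSDs are spread over the hosts and which CRUSH rule is used (as far as we understand, the default replicated rule distributes the copies per host):
root@ceph07:~# ceph osd tree             # which OSDs sit on which host, and their weights
root@ceph07:~# ceph osd crush rule dump  # the rule should place replicas per host, not per OSD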
The software versions are:
root@ceph07:~# pveversion -v
proxmox-ve: 4.4-79 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.35-2-pve: 4.4.35-79
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-107
pve-firmware: 1.1-10
libpve-common-perl: 4.0-90
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-73
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-1
pve-container: 1.0-93
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-1
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve14~bpo80
ceph: 10.2.5-1~bpo80+1
Ceph was upgraded from Hammer to Jewel.
The Ceph pool uses size 3, min_size 2, pg_num 512.
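To confirm these values on the running cluster (the pool name "rbd" below is only an example, ours may differ):
root@ceph07:~# ceph osd dump | grep pool      # shows size, min_size and pg_num for every pool
root@ceph07:~# ceph osd pool get rbd size
root@ceph07:~# ceph osd pool get rbd min_size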
Our ceph.conf contains:
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.10.10.0/24
filestore xattr use omap = true
fsid = b959b08a-0827-4840-89b0-da9f40d6ff22
keyring = /etc/pve/priv/$cluster.$name.keyring
log max recent = 250000
osd journal size = 5120
osd map message max = 10
osd max object name len = 256
osd max object namespace len = 64
osd pool default min size = 2
public network = 10.10.10.0/24
[client]
rbd cache = true
rbd cache max dirty = 67108864
rbd cache max dirty age = 5
rbd cache size = 134217728
[osd]
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7
filestore max sync interval = 15
filestore min sync interval = 10
filestore op threads = 2
filestore queue committing max bytes = 10485760000
filestore queue committing max ops = 5000
filestore queue max bytes = 10485760
filestore queue max ops = 25000
filestore xattr use omap = true
keyring = /var/lib/ceph/osd/ceph-$id/keyring
max open files = 131072
osd client message size cap = 524288000
osd deep scrub stride = 1058576
osd disk threads = 2
osd map cache bl size = 50
osd map cache size = 500
osd map max advance = 10
osd map share max epochs = 10
osd max backfills = 1
osd max write size = 180
osd pg epoch persisted max stale = 10
osd recovery max active = 1
osd recovery max single start = 1
osd recovery op priority = 1
[mon.1]
host = ceph02
mon addr = 10.10.10.2:6789
[mon.2]
host = ceph04
mon addr = 10.10.10.4:6789
[mon.0]
host = ceph03
mon addr = 10.10.10.3:6789
[osd.9]
osd journal = /dev/disk/by-partlabel/journal-9
osd journal size = 10240
[osd.6]
osd journal = /dev/disk/by-partlabel/journal-6
osd journal size = 10240
[osd.2]
osd journal = /dev/disk/by-partlabel/journal-2
osd journal size = 10240
[osd.1]
osd journal = /dev/disk/by-partlabel/journal-1
osd journal size = 10240
[osd.0]
osd journal = /dev/disk/by-partlabel/journal-0
osd journal size = 10240
[osd.8]
osd journal = /dev/disk/by-partlabel/journal-8
osd journal size = 10240
[osd.7]
osd journal = /dev/disk/by-partlabel/journal-7
osd journal size = 10240
[osd.11]
osd journal = /dev/disk/by-partlabel/journal-11
osd journal size = 10240
[osd.10]
osd journal = /dev/disk/by-partlabel/journal-10
osd journal size = 10240
[osd.3]
osd journal = /dev/disk/by-partlabel/journal-3
osd journal size = 10240
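In case the running daemons picked up different values than the file above, the live configuration of a single OSD can be dumped via the admin socket (osd.0 is just an example), and the cluster can be watched while a node is powered off:
root@ceph07:~# ceph daemon osd.0 config show | grep -E 'journal|recovery|backfill'
root@ceph07:~# ceph -s               # overall health and degraded/recovering PGs during the test
root@ceph07:~# ceph health detail    # lists PGs that are degraded, stuck or undersized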
Can someone please give us some suggestions on how to make this work?
Thank you.