Ceph Optimization for HA. 7 nodes, 2 osd each

Dan Nicolae
Hello, everyone.

After a lot of reading on the web and trying to tune Ceph, we were not able to make it HA. If one of the nodes is turned off, after some time we get partition corruption on the VMs.
The idea is that if a node (2 OSDs) goes down, or if 2 OSDs on different nodes go down, the VMs would keep working without data loss...

We are running a Proxmox cluster with Ceph as storage. Our Ceph cluster currently has 7 nodes, each node with 2 OSDs (2TB HDD), for a total of 14 OSDs.

Some of them have their journal on a DC SSD, some of them are using the default journal location on the OSD HDD.

Software versions are:

root@ceph07:~# pveversion -v
proxmox-ve: 4.4-79 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.35-2-pve: 4.4.35-79
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-107
pve-firmware: 1.1-10
libpve-common-perl: 4.0-90
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-73
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-1
pve-container: 1.0-93
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-1
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve14~bpo80
ceph: 10.2.5-1~bpo80+1

Ceph was upgraded from Hammer to Jewel.

Ceph pool is using size 3, min_size 2, pg_num 512

ceph.conf contains:

[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.10.10.0/24
filestore xattr use omap = true
fsid = b959b08a-0827-4840-89b0-da9f40d6ff22
keyring = /etc/pve/priv/$cluster.$name.keyring
log max recent = 250000
osd journal size = 5120
osd map message max = 10
osd max object name len = 256
osd max object namespace len = 64
osd pool default min size = 2
public network = 10.10.10.0/24

[client]
rbd cache = true
rbd cache max dirty = 67108864
rbd cache max dirty age = 5
rbd cache size = 134217728
[osd]
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7
filestore max sync interval = 15
filestore min sync interval = 10
filestore op threads = 2
filestore queue committing max bytes = 10485760000
filestore queue committing max ops = 5000
filestore queue max bytes = 10485760
filestore queue max ops = 25000
filestore xattr use omap = true
keyring = /var/lib/ceph/osd/ceph-$id/keyring
max open files = 131072
osd client message size cap = 524288000
osd deep scrub stride = 1058576
osd disk threads = 2
osd map cache bl size = 50
osd map cache size = 500
osd map max advance = 10
osd map share max epochs = 10
osd max backfills = 1
osd max write size = 180
osd pg epoch persisted max stale = 10
osd recovery max active = 1
osd recovery max single start = 1
osd recovery op priority = 1
[mon.1]
host = ceph02
mon addr = 10.10.10.2:6789

[mon.2]
host = ceph04
mon addr = 10.10.10.4:6789

[mon.0]
host = ceph03
mon addr = 10.10.10.3:6789

[osd.9]
osd journal = /dev/disk/by-partlabel/journal-9
osd journal size = 10240

[osd.6]
osd journal = /dev/disk/by-partlabel/journal-6
osd journal size = 10240

[osd.2]
osd journal = /dev/disk/by-partlabel/journal-2
osd journal size = 10240

[osd.1]
osd journal = /dev/disk/by-partlabel/journal-1
osd journal size = 10240

[osd.0]
osd journal = /dev/disk/by-partlabel/journal-0
osd journal size = 10240

[osd.8]
osd journal = /dev/disk/by-partlabel/journal-8
osd journal size = 10240

[osd.7]
osd journal = /dev/disk/by-partlabel/journal-7
osd journal size = 10240

[osd.11]
osd journal = /dev/disk/by-partlabel/journal-11
osd journal size = 10240

[osd.10]
osd journal = /dev/disk/by-partlabel/journal-10
osd journal size = 10240

[osd.3]
osd journal = /dev/disk/by-partlabel/journal-3
osd journal size = 10240


Can someone please give us some suggestions on how to make this work?

Thank you.
 
Unless you set min_size to 1, you won't get any I/O when you bring down a node with 2 OSDs in that setup - unless you compensate for it via your crushmap.
Hi,
no - this is wrong and also dangerous.

With a normal crush rule you have two writable replicas if one node goes down - all traffic should run without trouble. That is, unless you have a pool with replica = 2 - which is also dangerous.

How many replicas do you have?
Code:
for i in `ceph osd lspools | tr -d ",[0-9]"`
  do
    ceph osd dump | grep \'$i\'
done
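(Or, assuming the pool created via Proxmox is simply named rbd, a quicker per-pool check:)
Code:
ceph osd pool get rbd size       # number of replicas
ceph osd pool get rbd min_size   # replicas needed before I/O is blocked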

Udo
 
You may also find your PG number is too low for the number of OSDs you have.

Can you give an export of your crushmap?
 
Hi,
pgcalc suggests 1024 PGs (if only one pool exists) - so 512 is not too bad. And a wrong PG count produces performance (and weighting) issues, not blocked I/O.

Udo

Correct. As the OP asked for optimisation, I thought it was also worth stating that, along with asking about the crushmap to make sure it was not doing OSD replication instead of HOST replication.
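For reference, that difference shows up in the replicated rule of a decompiled crushmap - just an illustrative sketch, not the OP's actual rule:
Code:
step chooseleaf firstn 0 type host   # replicas spread across hosts: losing one node still leaves 2 copies
step chooseleaf firstn 0 type osd    # replicas spread across OSDs only: 2 copies may end up on the same node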
 
Hello,
Thanks for the answers.

We have only one pool. pg_num is 512, the lowest number according to pgcalc (a quick calculation is sketched after the quote). I chose 512 because pg_num cannot be reduced without deleting the pool, in case the number of OSDs is later reduced.

<quote>
it is mandatory to choose the value of pg_num because it cannot be calculated automatically. Here are a few values commonly used:

  • Less than 5 OSDs set pg_num to 128
  • Between 5 and 10 OSDs set pg_num to 512
  • Between 10 and 50 OSDs set pg_num to 1024
  • If you have more than 50 OSDs, you need to understand the tradeoffs and how to calculate the pg_num value by yourself
  • For calculating pg_num value by yourself please take help of pgcalc tool
</quote>
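For what it's worth, the usual pgcalc rule of thumb is pg_num ≈ (OSD count × 100) / replica count, rounded up to the next power of two - roughly, for our numbers:

Code:
(14 OSDs * 100) / 3 replicas ≈ 467  -> next power of two: 512
(20 OSDs * 100) / 3 replicas ≈ 667  -> next power of two: 1024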

Pool size is 3, min_size 2. I thought about changing min_size to 1, but I was not sure about it. The problem could probably be solved via the crush map, I guess.

About the crush map, it is the default one as Proxmox / Ceph built it.

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host ceph01 {
id -2 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
}
host ceph03 {
id -3 # do not change unnecessarily
# weight 3.620
alg straw
hash 0 # rjenkins1
item osd.2 weight 1.810
item osd.3 weight 1.810
}
host ceph02 {
id -4 # do not change unnecessarily
# weight 3.620
alg straw
hash 0 # rjenkins1
item osd.4 weight 1.810
item osd.5 weight 1.810
}
host ceph04 {
id -5 # do not change unnecessarily
# weight 3.620
alg straw
hash 0 # rjenkins1
item osd.0 weight 1.810
item osd.1 weight 1.810
}
host ceph05 {
id -6 # do not change unnecessarily
# weight 3.620
alg straw
hash 0 # rjenkins1
item osd.6 weight 1.810
item osd.7 weight 1.810
}
host ceph06 {
id -7 # do not change unnecessarily
# weight 3.620
alg straw
hash 0 # rjenkins1
item osd.8 weight 1.810
item osd.9 weight 1.810
}
host ceph07 {
id -8 # do not change unnecessarily
# weight 3.620
alg straw
hash 0 # rjenkins1
item osd.10 weight 1.810
item osd.11 weight 1.810
}
host ceph08 {
id -9 # do not change unnecessarily
# weight 3.624
alg straw2
hash 0 # rjenkins1
item osd.12 weight 1.810
item osd.13 weight 1.814
}
root default {
id -1 # do not change unnecessarily
# weight 25.344
alg straw
hash 0 # rjenkins1
item ceph01 weight 0.000
item ceph03 weight 3.620
item ceph02 weight 3.620
item ceph04 weight 3.620
item ceph05 weight 3.620
item ceph06 weight 3.620
item ceph07 weight 3.620
item ceph08 weight 3.624
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map

PS: the node ceph01 does not exist and I don't know how to remove it from the crushmap. :)
 
Correct. As the OP asked for optimisation, I thought it was also worth stating that, along with asking about the crushmap to make sure it was not doing OSD replication instead of HOST replication.
I was thinking of setting it to 1024 when we grow the cluster to 20 OSDs or more. Right now we have only 14 OSDs, and the lower bound for 1024 is 10 OSDs.
<quote>
  • Less than 5 OSDs set pg_num to 128
  • Between 5 and 10 OSDs set pg_num to 512
  • Between 10 and 50 OSDs set pg_num to 1024
</quote>
Thanks for the suggestion. If it is worth it, I'll increase pg_num to 1024.
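If it does turn out to be worth it, something like this should do it (assuming the pool is named rbd; pg_num can only be increased, never decreased, and a large jump may have to be done in a few smaller steps):
Code:
ceph osd pool set rbd pg_num 1024
ceph osd pool set rbd pgp_num 1024   # set pgp_num afterwards so the data actually rebalances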
 
Hello,
Thanks for the answers.

We have only one pool. pg_num is 512, the lowest number according to pgcalc. I chose 512 because pg_num cannot be reduced without deleting the pool, in case the number of OSDs is later reduced.

<quote>
Hi,
you must use [] instead of <> for quote+code.
PS: the node ceph01 does not exist and I don't know how to remove it from the crushmap. :)
The crushmap does not look bad... perhaps an effect from the empty node ceph01?! It does not really make sense, but who knows...

You can remove ceph01 with the following procedure (commands sketched below):
export the crushmap
decompile the crushmap
edit the decompiled crushmap (remove the ceph01 entry, also in root default)
compile the crushmap
import the crushmap
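Something along these lines should do it - the filenames are just examples:
Code:
ceph osd getcrushmap -o crushmap.bin          # export the current crushmap
crushtool -d crushmap.bin -o crushmap.txt     # decompile to plain text
# edit crushmap.txt: delete the "host ceph01 { ... }" bucket and the
# "item ceph01 weight 0.000" line inside "root default"
crushtool -c crushmap.txt -o crushmap.new     # compile the edited map
ceph osd setcrushmap -i crushmap.new          # inject it into the cluster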

Udo
 
Hi,
no - this is wrong and also dangerous.

With a normal crush rule you have two writable replicas if one node goes down - all traffic should run without trouble. That is, unless you have a pool with replica = 2 - which is also dangerous.

How many replicas do you have?
Code:
for i in `ceph osd lspools | tr -d ",[0-9]"`
  do
    ceph osd dump | grep \'$i\'
done

Udo


Please read what he wrote again. I never said a word about it being safe. Cheers.
 
Hi,
you must use [] instead of <> for quote+code.

The crushmap does not look bad... perhaps an effect from the empty node ceph01?! It does not really make sense, but who knows...

You can remove ceph01 with the following procedure:
export the crushmap
decompile the crushmap
edit the decompiled crushmap (remove the ceph01 entry, also in root default)
compile the crushmap
import the crushmap

Udo
I removed the ceph01 entry from the crushmap. Now I'll torture the Ceph cluster a little bit to see how it reacts. Be right back with the results. :)
 
I removed the ceph01 entry from the crushmap. Now I'll torture the Ceph cluster a little bit to see how it reacts. Be right back with the results. :)
Hi Dan,
check whether the issue still happens if you only stop the OSDs of one node (play with different nodes). Perhaps your issue only shows up on power loss / network loss?
I have seen the effect, too, that Ceph keeps an OSD marked as primary for a PG, and the PG is stalled until that OSD is up again...
Normally that shouldn't happen, but...
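A possible way to run that test (the OSD IDs are just an example for one node; noout keeps Ceph from rebalancing while the OSDs are stopped):
Code:
ceph osd set noout                        # don't mark the stopped OSDs out / start rebalancing
systemctl stop ceph-osd@10 ceph-osd@11    # stop both OSDs of one node, e.g. ceph07
ceph -s                                   # watch health and client I/O from another shell
systemctl start ceph-osd@10 ceph-osd@11
ceph osd unset noout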

Udo
 
Hello, Udo.

Thanks for suggestion.
I'm back with the result of the first tests.

First scenario.

Stop 2 OSDs on the same node. The Ceph cluster recovers without problems. Start the 2 OSDs that were stopped. The Ceph cluster recovers to the initial state. There was no I/O interruption on the VM.

Second scenario.

Stop 2 OSDs on different nodes. The Ceph cluster goes into HEALTH_ERR with, among other messages, "12 pgs are stuck inactive for more than 300 seconds". The VM dies. After a while, the number of stuck PGs slowly decreases. At around 3 stuck PGs, the VM comes back to life with no file system problems. The Ceph cluster recovers without problems. Start the 2 OSDs that were stopped. The cluster recovers to the initial state.

Third scenario.

Power down an entire node. Two OSDs are out. I get HEALTH_ERR & stuck PGs for under a minute, then HEALTH_WARN, and the cluster starts the recovery process. Power the node back on: HEALTH_ERR & stuck PGs for under a minute, then the Ceph cluster recovers to the initial state. No problems on the VMs' file systems.

What happened while the PGs were stuck and no I/O was available?
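Next time the PGs are stuck I can try to capture something like this while the I/O is blocked (pool name rbd and the PG id are placeholders):
Code:
ceph health detail              # lists the stuck / inactive PGs
ceph pg dump_stuck inactive     # shows which OSDs those PGs are mapped to
ceph pg <pgid> query            # detailed state of a single stuck PG
ceph osd pool get rbd min_size  # double-check that min_size is really 2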


Thanks
 
