Ceph - High latency on VM before OSD marked down

fall

Hello

I am testing my new Ceph installation with 3 nodes and 3 OSDs per node.
I have another Proxmox cluster with one Windows VM whose disk is mapped on Ceph.
When I stop one Ceph node, it takes nearly 1 minute before its 3 OSDs are marked down (I think that is normal).
The problem is that disk access in the VM is blocked by I/O latency (i.e. apply latency in the Proxmox GUI) during that minute, until the OSDs are marked down.

How can I resolve this VM freeze?

My Ceph configuration:
- Proxmox 3.3-5
- CEPH Giant 0.87-1


OSD tree:
Code:
# ceph osd tree
# id    weight  type name       up/down reweight
-1      32.76   root default
-2      10.92           host ceph01
0       3.64                    osd.0   up      1
2       3.64                    osd.2   up      1
1       3.64                    osd.1   up      1
-3      10.92           host ceph02
3       3.64                    osd.3   up      1
4       3.64                    osd.4   up      1
5       3.64                    osd.5   up      1
-4      10.92           host ceph03
6       3.64                    osd.6   up      1
7       3.64                    osd.7   up      1
8       3.64                    osd.8   up      1

Crush map:
Code:
# ceph osd crush dump
{ "devices": [
        { "id": 0,
          "name": "osd.0"},
        { "id": 1,
          "name": "osd.1"},
        { "id": 2,
          "name": "osd.2"},
        { "id": 3,
          "name": "osd.3"},
        { "id": 4,
          "name": "osd.4"},
        { "id": 5,
          "name": "osd.5"},
        { "id": 6,
          "name": "osd.6"},
        { "id": 7,
          "name": "osd.7"},
        { "id": 8,
          "name": "osd.8"}],
  "types": [
        { "type_id": 0,
          "name": "osd"},
        { "type_id": 1,
          "name": "host"},
        { "type_id": 2,
          "name": "chassis"},
        { "type_id": 3,
          "name": "rack"},
        { "type_id": 4,
          "name": "row"},
        { "type_id": 5,
          "name": "pdu"},
        { "type_id": 6,
          "name": "pod"},
        { "type_id": 7,
          "name": "room"},
        { "type_id": 8,
          "name": "datacenter"},
        { "type_id": 9,
          "name": "region"},
        { "type_id": 10,
          "name": "root"}],
  "buckets": [
        { "id": -1,
          "name": "default",
          "type_id": 10,
          "type_name": "root",
          "weight": 2146959,
          "alg": "straw",
          "hash": "rjenkins1",
          "items": [
                { "id": -2,
                  "weight": 715653,
                  "pos": 0},
                { "id": -3,
                  "weight": 715653,
                  "pos": 1},
                { "id": -4,
                  "weight": 715653,
                  "pos": 2}]},
        { "id": -2,
          "name": "ceph01",
          "type_id": 1,
          "type_name": "host",
          "weight": 715653,
          "alg": "straw",
          "hash": "rjenkins1",
          "items": [
                { "id": 0,
                  "weight": 238551,
                  "pos": 0},
                { "id": 2,
                  "weight": 238551,
                  "pos": 1},
                { "id": 1,
                  "weight": 238551,
                  "pos": 2}]},
        { "id": -3,
          "name": "ceph02",
          "type_id": 1,
          "type_name": "host",
          "weight": 715653,
          "alg": "straw",
          "hash": "rjenkins1",
          "items": [
                { "id": 3,
                  "weight": 238551,
                  "pos": 0},
                { "id": 4,
                  "weight": 238551,
                  "pos": 1},
                { "id": 5,
                  "weight": 238551,
                  "pos": 2}]},
        { "id": -4,
          "name": "ceph03",
          "type_id": 1,
          "type_name": "host",
          "weight": 715653,
          "alg": "straw",
          "hash": "rjenkins1",
          "items": [
                { "id": 6,
                  "weight": 238551,
                  "pos": 0},
                { "id": 7,
                  "weight": 238551,
                  "pos": 1},
                { "id": 8,
                  "weight": 238551,
                  "pos": 2}]}],
  "rules": [
        { "rule_id": 0,
          "rule_name": "replicated_ruleset",
          "ruleset": 0,
          "type": 1,
          "min_size": 1,
          "max_size": 10,
          "steps": [
                { "op": "take",
                  "item": -1,
                  "item_name": "default"},
                { "op": "chooseleaf_firstn",
                  "num": 0,
                  "type": "host"},
                { "op": "emit"}]}],
  "tunables": { "choose_local_tries": 0,[CODE]
"choose_local_fallback_tries": 0,
"choose_total_tries": 50,
"chooseleaf_descend_once": 1,
"profile": "bobtail",
"optimal_tunables": 0,
"legacy_tunables": 0,
"require_feature_tunables": 1,
"require_feature_tunables2": 1,
"require_feature_tunables3": 0,
"has_v2_rules": 0,
"has_v3_rules": 0}}
[/CODE]

ceph.conf:
Code:
[global]
         auth client required = cephx
         auth cluster required = cephx
         auth service required = cephx
         auth supported = cephx
         cluster network = 10.10.1.0/24
         filestore xattr use omap = true
         fsid = 2dbbec32-a464-4bc5-bb2b-983695d1d0c6
         keyring = /etc/pve/priv/$cluster.$name.keyring
         osd journal size = 5120
         osd pool default min size = 1
         public network = 192.168.80.0/24
         mon osd down out subtree limit = host
         osd max backfills = 1
         osd recovery max active = 1


[osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring


[mon.4]
         host = ceph05
         mon addr = 192.168.80.45:6789


[mon.0]
         host = ceph01
         mon addr = 192.168.80.41:6789


[mon.1]
         host = ceph02
         mon addr = 192.168.80.42:6789


[mon.3]
         host = ceph04
         mon addr = 192.168.80.44:6789


[mon.2]
         host = ceph03
         mon addr = 192.168.80.43:6789

Thanks.
Best regards
 
I may be incorrect, but I think your monitors are not on your cluster network.

Serge
 
Hi,

your main problem is that Ceph is not really designed for 3 OSD nodes with 3 disks each.

Ceph begins to shine the more OSD nodes and the more disks you have.

In your case you suddenly lose 33% of your disks. Use more hosts and more disks and you won't have problems on the scale of suddenly losing 33% of your capacity and resources.

If you shut down your OSD host cleanly, the OSDs will be marked down right away and there is no 1-minute delay to detect that they are down.

You can also tune the timeouts that control after how long an OSD is marked down, but I think the defaults are quite OK.
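
For reference, these are the main options involved (a sketch only; the values shown are the Giant-era defaults as far as I know, not a recommendation):
Code:
[global]
         # seconds an OSD may miss heartbeats before its peers report it down (default 20)
         osd heartbeat grace = 20
         # how many OSDs must report a peer down before the monitors mark it down (default 1)
         mon osd min down reporters = 1
         # seconds a down OSD stays "in" before it is marked out and rebalancing starts (default 300)
         mon osd down out interval = 300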

Maybe your timeouts also come from massive overload: after you lose 33% of all disks, your cluster tries to repair itself and move all the data.
You can prevent this (e.g. for maintenance) by setting the noout flag with "ceph osd set noout".
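
For example, a maintenance window would look roughly like this (the flag applies cluster-wide):
Code:
ceph osd set noout       # prevent down OSDs from being marked out (no rebalancing)
# ... shut down / reboot the OSD node, do the maintenance ...
ceph osd unset noout     # re-enable normal out-marking afterwards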

You almost certainly have to lower the settings that control how much effort Ceph spends on repairing the cluster and moving data.
We use:

[osd]
osd_max_backfills = 1
osd_recovery_max_active = 1
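
If it helps, the same values can also be injected into the running OSDs without a restart (they revert at the next OSD restart unless they are also set in ceph.conf):
Code:
# apply to all running OSDs at once
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'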
 
Hi,
I don't think the problem is related to "3 hosts with 3 OSD disks each". That should work fine, and a rebuild should not happen, because the default replica count is now 3 - so if the pool that PVE uses has a replica count of 3, no rebuild is possible when one node goes down.

But the crush map looks very strange! The weights don't match the "ceph osd tree" output!

EDIT: Sorry, my fault. The output of "ceph osd crush dump" simply differs from what you get via "ceph osd getcrushmap -o crushmap.compiled; crushtool -d crushmap.compiled -o crushmap.decompiled" and then editing crushmap.decompiled...

That output should be comparable to my crush map below.

/Edit

You can try to modify the crush map, reload it and see if anything changes.
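
The usual round trip for that looks roughly like this (file names are just examples):
Code:
ceph osd getcrushmap -o crushmap.compiled              # export the current crush map
crushtool -d crushmap.compiled -o crushmap.decompiled  # decompile to editable text
# ... edit crushmap.decompiled ...
crushtool -c crushmap.decompiled -o crushmap.new       # recompile the edited map
ceph osd setcrushmap -i crushmap.new                   # load it into the cluster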

And you have 5 mons, but only 3 OSD nodes?

The crush map should look like:
Code:
  "buckets": [
        { "id": -1,
          "name": "default",
          "type_id": 10,
          "type_name": "root",
          "weight": 32.76,

...
        { "id": -2,
          "name": "ceph01",
          "type_id": 1,
          "type_name": "host",
          "weight": 10.92,
          "alg": "straw",
          "hash": "rjenkins1",
          "items": [
                { "id": 0,
                  "weight": 3.64,
                  "pos": 0},
                { "id": 2,
                  "weight": 3.64,
                  "pos": 1},
                { "id": 1,
                  "weight": 3.64,
                  "pos": 2}]},
...
Udo
 
Hi,
what does the output of the following command look like on ceph01:
Code:
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep reporters
Udo


Hi,

Here is the result of the command:
Code:
# ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok config show | grep reporters
  "mon_osd_min_down_reporters": "1",

I have decompiled the crush map and the result is the same as your crush map, and the same as what the Proxmox GUI displays:
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1


# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8


# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root


# buckets
host ceph01 {
        id -2           # do not change unnecessarily
        # weight 10.920
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 3.640
        item osd.2 weight 3.640
        item osd.1 weight 3.640
}
host ceph02 {
        id -3           # do not change unnecessarily
        # weight 10.920
        alg straw
        hash 0  # rjenkins1
        item osd.3 weight 3.640
        item osd.4 weight 3.640
        item osd.5 weight 3.640
}
host ceph03 {
        id -4           # do not change unnecessarily
        # weight 10.920
        alg straw
        hash 0  # rjenkins1
        item osd.6 weight 3.640
        item osd.7 weight 3.640
        item osd.8 weight 3.640
}
root default {
        id -1           # do not change unnecessarily
        # weight 32.760
        alg straw
        hash 0  # rjenkins1
        item ceph01 weight 10.920
        item ceph02 weight 10.920
        item ceph03 weight 10.920
}


# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}


# end crush map

I have changed this parameter in ceph.conf:
Code:
mon osd down out subtree limit = host
So the Ceph cluster doesn't rebalance data to other OSDs when I stop one OSD node, but the VM still freezes during the time before the 3 OSDs are marked down by the cluster (between 60 and 150 s).

The choice of 5 mon nodes and 3 OSD nodes (3 of the 5 mon nodes are also the OSD nodes) is due to our architecture and to the HA of the Ceph cluster.
2 OSD nodes are in room 1 (6 OSDs and 2 mons), 1 OSD node and 1 mon node are in room 2 (3 OSDs and 2 mons), and 1 mon node is in room 3.
If a room becomes unreachable, there are always at least 3 active mons (a majority of the 5), so the cluster stays reachable.

For information, this is the /etc/pve/storage.cfg file on the PVE cluster:
Code:
rbd: ceph1
        monhost 192.168.80.41:6789;192.168.80.42:6789;192.168.80.43:6789;192.168.80.44:6789;192.168.80.45:6789
        pool pool1
        content images
        nodes promox1,promox2
        username admin
It maps RBD over the Ceph public network (192.168.80.0/24).
The cluster network (10.10.1.0/24) is reserved on the Ceph cluster for OSD replication.

Thanks
 
Hi,
looks OK!
Can you check whether the issue is OSD-related rather than mon-related?

Does the same happen if you stop only the OSDs? Like:
Code:
service ceph stop osd.0
service ceph stop osd.1
service ceph stop osd.2
Udo
 
Hi,
I think the solution lies in the "mon osd adjust heartbeat grace" parameter.

When I tested a rebuild for the first time, I had a lot of latency on the OSDs due to, I think, the "osd max backfills" parameter (default: 10, changed to 1).
At that point, the time before the OSDs were marked down could last several minutes, even though "osd heartbeat grace" is set to 20 s.

I have changed the "mon osd adjust heartbeat grace" parameter to false and the time before the OSDs are marked down is now exactly 20 s!
The VMs on the PVE cluster freeze for 20 s, but that is an acceptable and fixed duration.
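
For reference, the change boils down to something like this in ceph.conf (a sketch; I am assuming the [global] section, like the other options above):
Code:
[global]
         # do not scale the grace period from laggy estimations; always use the fixed value
         mon osd adjust heartbeat grace = false
         osd heartbeat grace = 20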

In the Ceph documentation, the description of the "mon osd adjust heartbeat grace" parameter is "If set to true, Ceph will scale based on laggy estimations".
We have never found out how it is calculated, over what period, or where the statistics are saved.
Do you have any idea?

Thanks,

Fall
 
