[SOLVED] CEPH resilience: self-heal flawed?

Lephisto

Well-Known Member
Jun 22, 2019
Hi,

I'm experiencing some strange behaviour in pve@6.1-7 / ceph@14.2.6:

I have a Lab setup with 5 physical Nodes, each with two OSDs.

This is the Ceph Config + Crushmap:

Config:
Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 10.42.42.1/24
     fsid = <redacted:fsid>
     mon_allow_pool_delete = true
     mon_host = 10.42.42.1 10.42.42.2 10.42.42.3 10.42.42.4 10.42.42.5
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 10.42.42.1/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

Crush:
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host node1 {
    id -3        # do not change unnecessarily
    id -4 class hdd        # do not change unnecessarily
    # weight 0.232
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 0.116
    item osd.1 weight 0.116
}
host node2 {
    id -5        # do not change unnecessarily
    id -6 class hdd        # do not change unnecessarily
    # weight 0.232
    alg straw2
    hash 0    # rjenkins1
    item osd.2 weight 0.116
    item osd.3 weight 0.116
}
host node3 {
    id -7        # do not change unnecessarily
    id -8 class hdd        # do not change unnecessarily
    # weight 0.232
    alg straw2
    hash 0    # rjenkins1
    item osd.4 weight 0.116
    item osd.5 weight 0.116
}
host node4 {
    id -9        # do not change unnecessarily
    id -10 class hdd        # do not change unnecessarily
    # weight 0.232
    alg straw2
    hash 0    # rjenkins1
    item osd.6 weight 0.116
    item osd.7 weight 0.116
}
host node5 {
    id -11        # do not change unnecessarily
    id -12 class hdd        # do not change unnecessarily
    # weight 0.232
    alg straw2
    hash 0    # rjenkins1
    item osd.8 weight 0.116
    item osd.9 weight 0.116
}
root default {
    id -1        # do not change unnecessarily
    id -2 class hdd        # do not change unnecessarily
    # weight 1.162
    alg straw2
    hash 0    # rjenkins1
    item node1 weight 0.232
    item node2 weight 0.232
    item node3 weight 0.232
    item node4 weight 0.232
    item node5 weight 0.232
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map

So nothing fancy in here, all straightforward.
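
If anyone wants to double-check that the rule really places each replica on a different host, the compiled map can be tested offline, something like this (crushmap.bin is just a temporary file name):

Code:
# dump the compiled crush map and simulate placements for rule 0 with size 3
ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings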

When I now take down the first node, its two OSDs appear as DOWN, and some minutes later the manager marks them out and redistribution of the data throughout the cluster kicks in:

Code:
-11       0.23239     host node5
  8   hdd 0.11620         osd.8    down        0 1.00000
  9   hdd 0.11620         osd.9    down        0 1.00000

The node's OSDs get their reweight set to 0 (OUT), all fine.
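
(For the record, I'm just watching the cluster react with something along these lines, nothing fancy:)

Code:
watch -n 5 'ceph osd tree; ceph -s'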

The second node gets shut down; this is what the complete OSD tree looks like:

Code:
ID  CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF
 -1       1.16196 root default
 -3       0.23239     host node1
  0   hdd 0.11620         osd.0    down        0 1.00000
  1   hdd 0.11620         osd.1    down  1.00000 1.00000
 -5       0.23239     host node2
  2   hdd 0.11620         osd.2      up  1.00000 1.00000
  3   hdd 0.11620         osd.3      up  1.00000 1.00000
 -7       0.23239     host node3
  4   hdd 0.11620         osd.4      up  1.00000 1.00000
  5   hdd 0.11620         osd.5      up  1.00000 1.00000
 -9       0.23239     host node4
  6   hdd 0.11620         osd.6      up  1.00000 1.00000
  7   hdd 0.11620         osd.7      up  1.00000 1.00000
-11       0.23239     host node5
  8   hdd 0.11620         osd.8    down        0 1.00000
  9   hdd 0.11620         osd.9    down        0 1.00000

So one OSD of the downed node gets marked OUT, but the other one never does. This inhibits self-healing, because redistribution of the objects on the OSD that was never marked OUT never kicks in.
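
The stuck state is easy to see; the undersized/degraded PGs just sit there, e.g.:

Code:
ceph health detail
ceph pg dump_stuck undersized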

So am I totally missing something here?
 
Is there any I/O on that cluster?
 
Minimal, just a lab setup.
Run your test while, e.g., a rados bench is running. It could be that the OSD wasn't accessed by at least two peers.
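
Something along these lines should be enough to generate load (the pool name is only an example):

Code:
# 60 seconds of 4M object writes against a test pool
rados bench -p testpool 60 write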
 
Hi Alwin,

thanks for the answer. I will do so.

What's suspicious is that the first node failure is handled properly and the second is not. I tried this about 5 times; it's absolutely reproducible under minimal load. I will re-test this with some heavier I/O load.

Will keep you posted.
 
Double checked: I generated some traffic, not with rados bench but just plain client traffic, while taking a node offline. Still the same; from the second failed node, one OSD won't go out. This is reproducible in my nested-virt setup as well. At first I thought it was some strange virt-in-virt effect, which is why I started testing this scenario on physical hosts.
 
Best check the Ceph logs on all of the nodes. Any settings changed?
 
No settings changed, all pretty much default.

What I see in the logs of the active manager is:

The first OSD (osd.2) of that host behaves correctly:

ceph.log:
Code:
2020-02-27 14:24:02.434055 mon.node2 (mon.1) 290 : cluster [INF] osd.2 marked itself down
2020-02-27 14:34:10.702147 mon.node3 (mon.2) 1555 : cluster [INF] Marking osd.2 out (has been down for 607 seconds)

ceph-mgr.log
Code:
2020-02-27 14:34:10.931 7f9620e8e700  1 mgr[progress] osd.2 marked out
2020-02-27 14:34:10.931 7f9620e8e700  1 mgr[progress] 0 PGs affected by osd.2 going out

The second OSD (osd.3) does not:

Code:
2020-02-27 14:24:02.434240 mon.node2 (mon.1) 291 : cluster [INF] osd.3 marked itself down

There's simply nothing else in the logs; the "Marking osd.xy out" message is just missing for osd.3, and thus that OSD never goes out.
 
This is all I see on another node (the node which has the issue is down, of course):

Code:
root@node4:/var/log/ceph# zgrep -i "osd\.2" *|grep "2020-02-27\ 14"
ceph.log.1.gz:2020-02-27 14:24:02.434055 mon.node2 (mon.1) 290 : cluster [INF] osd.2 marked itself down
ceph.log.1.gz:2020-02-27 14:34:10.702147 mon.node3 (mon.2) 1555 : cluster [INF] Marking osd.2 out (has been down for 607 seconds)
ceph-mgr.node4.log.1.gz:2020-02-27 14:34:10.931 7f9620e8e700  1 mgr[progress] osd.2 marked out
ceph-mgr.node4.log.1.gz:2020-02-27 14:34:10.931 7f9620e8e700  1 mgr[progress] 0 PGs affected by osd.2 going out
root@node4:/var/log/ceph# zgrep -i "osd\.3" *|grep "2020-02-27\ 14"
ceph.log.1.gz:2020-02-27 14:24:02.434240 mon.node2 (mon.1) 291 : cluster [INF] osd.3 marked itself down
 
It's osd.3 in the last example. I will make it more verbose and test again.

I gracefully shut the node down, but it happens as well if I manually stop the OSD services, and also when I just pull the plug :)
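
For the verbose run I'll probably just raise the mon debug level for a bit, e.g. (mon.node4 stands for whichever monitor I end up checking):

Code:
ceph tell mon.node4 injectargs '--debug_mon 10/10'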
 
It might be the "mon_osd_down_out_interval": "600"; it will take up to 10 minutes until the OSD gets marked out.
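
The effective value can be checked on a monitor node via the admin socket, e.g. (mon.node4 is just an example):

Code:
ceph daemon mon.node4 config get mon_osd_down_out_interval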
 
Yeah, sure, after 10 minutes one OSD gets marked out properly. But not the second one.
 
When you look at the logs, you see that 10 minutes after the DOWN, one OSD gets marked OUT and the other one simply doesn't. So the mon_osd_down_out_interval is correctly taken into account.
 
Ah, that could be it.

So in a 5-node setup, for auto-redistribution to kick in after a second node failure, I would set mon_osd_min_in_ratio to 0.6? Am I making a horrible mistake by overriding the default of 0.75 here?

regards
 
Just tested it; now it makes perfect sense. mon_osd_min_in_ratio defaults to 0.75.

As asked above, what are the implications when I set this value to 0.6 in a 5-node setup, to let it self-heal down to the smallest quorum size? I guess there is a good reason this value is 0.75 by default.
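
For my own notes, the arithmetic (assuming the monitor simply compares the current in/total ratio against mon_osd_min_in_ratio before each automatic out):

Code:
# 10 OSDs total, mon_osd_min_in_ratio = 0.75
# 1st node down: both OSDs marked out (ratios 10/10 and 9/10, above 0.75) -> 8 in
# 2nd node down: 1st OSD out at 8/10 = 0.80 -> 7 in; 2nd blocked, since 7/10 = 0.70 < 0.75
# with 0.6 the fourth OSD would still go out (0.70 >= 0.60), leaving 6/10 in
ceph config set mon mon_osd_min_in_ratio 0.6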
 
