[SOLVED] CEPH resilience: self-heal flawed?

Lephisto

Active Member
Jun 22, 2019
Hi,

I'm experiencing some strange behaviour in pve@6.1-7/ceph@14.2.6:

I have a lab setup with 5 physical nodes, each with two OSDs.

This is the Ceph Config + Crushmap:

Config:
Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 10.42.42.1/24
     fsid = <redacted:fsid>
     mon_allow_pool_delete = true
     mon_host = 10.42.42.1 10.42.42.2 10.42.42.3 10.42.42.4 10.42.42.5
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 10.42.42.1/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

Crush:
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host node1 {
    id -3        # do not change unnecessarily
    id -4 class hdd        # do not change unnecessarily
    # weight 0.232
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 0.116
    item osd.1 weight 0.116
}
host node2 {
    id -5        # do not change unnecessarily
    id -6 class hdd        # do not change unnecessarily
    # weight 0.232
    alg straw2
    hash 0    # rjenkins1
    item osd.2 weight 0.116
    item osd.3 weight 0.116
}
host node3 {
    id -7        # do not change unnecessarily
    id -8 class hdd        # do not change unnecessarily
    # weight 0.232
    alg straw2
    hash 0    # rjenkins1
    item osd.4 weight 0.116
    item osd.5 weight 0.116
}
host node4 {
    id -9        # do not change unnecessarily
    id -10 class hdd        # do not change unnecessarily
    # weight 0.232
    alg straw2
    hash 0    # rjenkins1
    item osd.6 weight 0.116
    item osd.7 weight 0.116
}
host node5 {
    id -11        # do not change unnecessarily
    id -12 class hdd        # do not change unnecessarily
    # weight 0.232
    alg straw2
    hash 0    # rjenkins1
    item osd.8 weight 0.116
    item osd.9 weight 0.116
}
root default {
    id -1        # do not change unnecessarily
    id -2 class hdd        # do not change unnecessarily
    # weight 1.162
    alg straw2
    hash 0    # rjenkins1
    item node1 weight 0.232
    item node2 weight 0.232
    item node3 weight 0.232
    item node4 weight 0.232
    item node5 weight 0.232
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map

So nothing fancy in here, all straightforward.
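(For reference, a decompiled crush map like the one above can be dumped with the standard tools; just a sketch, run from any node with admin access:)

Code:
# export the binary crush map and decompile it to plain text
ceph osd getcrushmap -o crush.map.bin
crushtool -d crush.map.bin -o crush.map.txt
cat crush.map.txt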

When I now take a first node down, its two OSDs appear "DOWN", and some minutes later the monitor marks them out and redistribution of data throughout the cluster kicks in:

Code:
-11       0.23239     host node5
  8   hdd 0.11620         osd.8    down        0 1.00000
  9   hdd 0.11620         osd.9    down        0 1.00000

The node's OSDs get their reweight set to 0 (OUT), all fine.

The second node gets shut down, and this is what the complete osd tree looks like:

Code:
ID  CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF
 -1       1.16196 root default
 -3       0.23239     host node1
  0   hdd 0.11620         osd.0    down        0 1.00000
  1   hdd 0.11620         osd.1    down  1.00000 1.00000
 -5       0.23239     host node2
  2   hdd 0.11620         osd.2      up  1.00000 1.00000
  3   hdd 0.11620         osd.3      up  1.00000 1.00000
 -7       0.23239     host node3
  4   hdd 0.11620         osd.4      up  1.00000 1.00000
  5   hdd 0.11620         osd.5      up  1.00000 1.00000
 -9       0.23239     host node4
  6   hdd 0.11620         osd.6      up  1.00000 1.00000
  7   hdd 0.11620         osd.7      up  1.00000 1.00000
-11       0.23239     host node5
  8   hdd 0.11620         osd.8    down        0 1.00000
  9   hdd 0.11620         osd.9    down        0 1.00000

So one OSD of the down node gets OUT'ed, the other one never does. This inhibits self-healing, because redistribution of the objects on the OSD that wasn't OUT'ed never kicks in.

So am I totally missing something here?
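(Marking the stuck OSD out by hand, roughly as below, should of course force the redistribution, but the whole point of the lab is that this is supposed to happen automatically. osd.1 here just stands for whichever OSD is stuck.)

Code:
# manual workaround: mark the stuck OSD out to trigger backfill
ceph osd out osd.1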
 
Is there any I/O on that cluster?
 
Minimal, just a lab setup.
Run your test while e.g. a rados bench is running. It could be that the OSD wasn't accessed by at least two peers.
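Something along these lines, as a rough sketch (the pool name testpool is only an example, use an existing pool):

Code:
# sustained write load for 10 minutes while the node is failed
rados bench -p testpool 600 write --no-cleanup
# optional read load afterwards, then remove the benchmark objects
rados bench -p testpool 600 rand
rados -p testpool cleanup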
 
Hi Alwin,

thanks for the answer. I will do so.

What's suspicious is that the first node failure is handled properly, and the second is not. I tried this like 5 times, it's absolutely reproducible under minimal load. I will re-test this with some heavier I/O load.

Will keep you posted.
 
Double checked: I generated some traffic, not with rados bench but just client traffic, while putting a node offline. Still the same. On the second failed node, one OSD won't go out. This is reproducible in my nested-virt setup as well. At first I thought it was some strange virt-in-virt effect, which is why I started testing this scenario on physical hosts.
 
Best check the Ceph logs on all of the nodes. Any settings changed?
 
No settings changed, all pretty much default.

What I see in the logs on the node running the active manager is:

The first OSD (osd.2) of that host behaves correctly:

ceph.log:
Code:
2020-02-27 14:24:02.434055 mon.node2 (mon.1) 290 : cluster [INF] osd.2 marked itself down
2020-02-27 14:34:10.702147 mon.node3 (mon.2) 1555 : cluster [INF] Marking osd.2 out (has been down for 607 seconds)

ceph-mgr.log
Code:
2020-02-27 14:34:10.931 7f9620e8e700  1 mgr[progress] osd.2 marked out
2020-02-27 14:34:10.931 7f9620e8e700  1 mgr[progress] 0 PGs affected by osd.2 going out

The second OSD (osd.3) does not:

Code:
2020-02-27 14:24:02.434240 mon.node2 (mon.1) 291 : cluster [INF] osd.3 marked itself down

There's simply nothing else in the logs; the "Marking osd.xy out" message for osd.3 is just missing, and thus the OSD never goes out.
 
This is all I see on another node. (The node which has the issue is down, of course.)

Code:
root@node4:/var/log/ceph# zgrep -i "osd\.2" *|grep "2020-02-27\ 14"
ceph.log.1.gz:2020-02-27 14:24:02.434055 mon.node2 (mon.1) 290 : cluster [INF] osd.2 marked itself down
ceph.log.1.gz:2020-02-27 14:34:10.702147 mon.node3 (mon.2) 1555 : cluster [INF] Marking osd.2 out (has been down for 607 seconds)
ceph-mgr.node4.log.1.gz:2020-02-27 14:34:10.931 7f9620e8e700  1 mgr[progress] osd.2 marked out
ceph-mgr.node4.log.1.gz:2020-02-27 14:34:10.931 7f9620e8e700  1 mgr[progress] 0 PGs affected by osd.2 going out
root@node4:/var/log/ceph# zgrep -i "osd\.3" *|grep "2020-02-27\ 14"
ceph.log.1.gz:2020-02-27 14:24:02.434240 mon.node2 (mon.1) 291 : cluster [INF] osd.3 marked itself down
 
It's osd.3 in the last example. I will make the logging more verbose and test again.

I gracefully shut the node down, but it happens as well if I manually stop the OSD services, and also when I just pull the plug :)
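For the verbosity, something like this should do at runtime (just a sketch; 10/10 is an example level, and it has to be repeated for each monitor):

Code:
# raise monitor debug logging on the fly (repeat for mon.node2 etc.)
ceph tell mon.node1 injectargs '--debug_mon 10/10'
# revert to the default afterwards
ceph tell mon.node1 injectargs '--debug_mon 1/5'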
 
It might be the "mon_osd_down_out_interval": "600"; it will take up to 10 minutes until the OSD gets marked out.
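You can check the value the monitors are actually using, roughly like this (a sketch):

Code:
# show the configured down/out interval (in seconds)
ceph config get mon mon_osd_down_out_interval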
 
Yeah, sure, after 10 minutes one OSD gets marked out properly. But not the second one.
 
When you look at the logs, you see that 10 minutes after going down, one OSD gets OUT'ed and the other one simply doesn't. So the mon_osd_down_out_interval is correctly taken into account.
 
Ah, that could be it.

So in a 5 node setup, for auto-redistribution to kick in after a second node failure, I would set "mon osd min in ratio" to 0.6? Am I making a horrible mistake by overriding the default of 0.75 here?

regards
 
Just tested it, and now it makes perfect sense: mon_osd_min_in_ratio defaults to 0.75, so once fewer than 75% of all OSDs are in (here 7 of 10, after osd.2 got marked out), the monitors stop marking further OSDs out. That's exactly why osd.3 stays in.

As asked above, what are the implications of setting this value to 0.6 in a 5 node setup, to let it self-heal down to the smallest quorum size? I guess there is a good reason this value is 0.75 by default.
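For playing with it in the lab, setting it at runtime would look roughly like this (just a sketch; 0.6 is the value discussed above, not a recommendation):

Code:
# check the current value
ceph config get mon mon_osd_min_in_ratio
# allow the monitors to keep marking OSDs out until only 60% remain in
ceph config set mon mon_osd_min_in_ratio 0.6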
 
