CEPH does not mark OSD down after node power failure

Jun 28, 2019
Hi,

I'm doing some testing in a Ceph cluster before putting VMs into production on this environment, but we are seeing a strange problem.

When I reboot a node (clean OS shutdown), everything works great in the Ceph Manager: the node's OSDs become DOWN and everything works as expected.

But if we simulate a node power failure by pulling the power cords out of the server (dirty shutdown), the Ceph Manager still shows the node's OSDs as UP/IN.

The surviving node's logs still show "pgmap v19142: 1024 pgs: 1024 active+clean", and in the Proxmox GUI the OSDs from the failed node still appear as UP/IN.

Some more logs I collected from the surviving node:

/var/log/ceph/ceph.log:
cluster [WRN] Health check update: 129 slow ops, oldest one blocked for 537 sec, daemons [mon,pve01-bnu,mon,pve03-bnu] have slow ops. (SLOW_OPS)

/var/log/syslog:
09:40:41.025 7f9781bdd700 -1 osd.6 207 heartbeat_check: no reply from 189.XXX.XXX.XXX:6830 osd.19 since back 2019-10-24 09:30:17.278044 front 2019-10-24 09:30:17.277976 (oldest deadline 2019-10-24 09:30:42.577666)

/var/log/ceph/ceph-mgr.node02.log:
log_channel(cluster) log [DBG] : pgmap v19222: 1024 pgs: 1024 active+clean; 5.2 GiB data, 11 GiB used, 18 TiB / 18 TiB avail

In this situation, I can't access the storage from the surviving node anymore, and the VMs become unstable (read/write errors).

I can only get the environment stable again if I manually mark the OSDs from the failed node as DOWN, using the command: ceph osd down osd.X
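
As a side note, if all OSDs of the failed host need to be marked down at once, the OSD IDs under its CRUSH host bucket can be listed and passed to `ceph osd down` in one go. This is only a sketch; "pve02-bnu" is a placeholder for the failed host's bucket name.

Code:
# List the OSD IDs under the failed host's CRUSH bucket and mark them all down
# ("pve02-bnu" is a placeholder; replace it with the real host bucket name)
ceph osd down $(ceph osd ls-tree pve02-bnu)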
 
It can take up to 10 minutes until an OSD is marked as out, but on node failure this usually happens faster. What is the size/min_size of the pools? And how are the OSDs distributed?
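
As a rough way to check which timers apply here, the relevant intervals can be read back with `ceph config get` on recent Ceph releases. The option names below are standard Ceph settings; the defaults in the comments are the usual ones and may differ on older clusters.

Code:
# Time a down OSD stays "in" before being marked "out" (usually 600 s = 10 min)
ceph config get mon mon_osd_down_out_interval

# How long peers wait for heartbeat replies before reporting an OSD as failed (usually 20 s)
ceph config get osd osd_heartbeat_grace

# How many reporters, and from which CRUSH level, the monitors need before marking an OSD down
ceph config get mon mon_osd_min_down_reporters
ceph config get mon mon_osd_reporter_subtree_level

# Fallback: without failure reports, an OSD is marked down after this many seconds (usually 900 s)
ceph config get mon mon_osd_report_timeout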
 

It has been almost 20 minutes since the power failure and the OSDs are still UP/IN.

size/min_size is 2/1 (only a test environment)
10 OSDs per node

the environment is:
3 mon nodes
3 mgr nodes
2 osd nodes
 
After almost 30 minutes the OSDs were marked as DOWN.

Why did it take so long? 30 minutes is not an acceptable scenario for a cluster with dozens of VMs running...

Where can I lower this timeout, and why did it take Ceph so long to mark the OSDs down?
 
size/min_size is 2/1 (only a test environment)
Never run with 2/1 in a small setup; this drastically increases the risk of data loss, since the remaining copy may still be in flight and not yet written to any OSD, or the remaining OSD may die.
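
For reference, if enough OSD hosts are available for three replicas, the pool settings can be raised along these lines ("vm-pool" is just a placeholder pool name):

Code:
# Keep 3 copies of each object, and require at least 2 before accepting I/O
ceph osd pool set vm-pool size 3
ceph osd pool set vm-pool min_size 2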

Where can I lower this timeout, and why did it take Ceph so long to mark the OSDs down?
Try this: set it to host.
https://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/#configuration-settings
mon osd down out subtree limit

Description: The smallest CRUSH unit type that Ceph will not automatically mark out. For instance, if set to host and if all OSDs of a host are down, Ceph will not automatically mark out these OSDs.

Type: String

Default: rack
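
If you want to try that suggestion, one possible way is via the monitors' centralized config (the equivalent line could also go into the [global] section of ceph.conf); this is only a sketch:

Code:
# Do not automatically mark OSDs "out" when an entire host is down
ceph config set mon mon_osd_down_out_subtree_limit host

# Read the value back to verify
ceph config get mon mon_osd_down_out_subtree_limit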

EDIT: But in my test clusters, it reacts immediately, without the above change.
 
Ciao,
I have the same scenario, and I can't find the right config for Ceph.
Did you sort it out?
 
Hello, what version of Proxmox VE are you using?

What are the contents of `/etc/pve/ceph.conf`?

After disconnecting a node:

- What's the output of `ceph status`?
- Is there anything interesting in journalctl or the Ceph logs (/var/log/ceph)?
- Is the cluster working fine? E.g., what's the output of `pvecm status`?

In general it shouldn't take more than a few seconds for the changes to be reflected in the web UI with default settings.
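
A possible way to collect the information requested above in one go after pulling the plug (the host name in the journalctl unit is a placeholder):

Code:
# Run on a surviving node after the simulated power failure
cat /etc/pve/ceph.conf
ceph status
ceph osd tree
pvecm status
journalctl -u ceph-mon@proxmox01 --since "15 min ago"
tail -n 100 /var/log/ceph/ceph.log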
 
Ciao Maximiliano,
I'm using the latest version 8, but I had this issue even on 7.
I'll update you this evening with the config and status, but it's the default one applied by the Proxmox configuration (I've made some changes and rolled them back).

It's kind of a "next -> next -> done" configuration, and then a struggle to find the right ceph.conf line to make it work :) Nothing more, nothing less.

I'll update you later with all the configs requested.
 
Please share the Ceph config anyway; it's not quite obvious what you mean by "struggle for the right ceph.conf".
 
Here it is:

Code:
root@proxmox01:~# cat /etc/ceph/ceph.conf
[global]
         auth_client_required = cephx
         auth_cluster_required = cephx
         auth_service_required = cephx
         cluster_network = 192.168.178.8/24
         fsid = 5c6f56d5-b58b-4406-a2c0-d70266fab939
         mon_allow_pool_delete = true
         mon_host = 192.168.178.8 192.168.178.10 192.168.178.9
         ms_bind_ipv4 = true
         ms_bind_ipv6 = false
         osd_pool_default_min_size = 1
         osd_pool_default_size = 2
         public_network = 192.168.178.8/24
[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.proxmox01]
         public_addr = 192.168.178.8

[mon.proxmox02]
         public_addr = 192.168.178.9

[mon.proxmox03]
         public_addr = 192.168.178.10
 
