CEPH does not mark OSD down after node power failure

Jun 28, 2019
Hi,

I'm doing some testing in a Ceph cluster before putting VMs into production on this environment, but we are seeing a strange problem.

When I reboot a node (clean OS shutdown), everything works great in the Ceph Manager: the node's OSDs become DOWN and everything behaves as expected.

But if we simulate a node power failure by pulling the power cords out of the server (dirty shutdown), the Ceph Manager still shows that node's OSDs as UP/IN.

The survivor node's logs still show "pgmap v19142: 1024 pgs: 1024 active+clean", and in the Proxmox GUI the OSDs from the failed node still appear as UP/IN.

Some more logs I collected from the survivor node:

/var/log/ceph/ceph.log:
cluster [WRN] Health check update: 129 slow ops, oldest one blocked for 537 sec, daemons [mon,pve01-bnu,mon,pve03-bnu] have slow ops. (SLOW_OPS)

/var/log/syslog:
09:40:41.025 7f9781bdd700 -1 osd.6 207 heartbeat_check: no reply from 189.XXX.XXX.XXX:6830 osd.19 since back 2019-10-24 09:30:17.278044 front 2019-10-24 09:30:17.277976 (oldest deadline 2019-10-24 09:30:42.577666)

/var/log/ceph/ceph-mgr.node02.log:
log_channel(cluster) log [DBG] : pgmap v19222: 1024 pgs: 1024 active+clean; 5.2 GiB data, 11 GiB used, 18 TiB / 18 TiB avail

In this situation, I can't access the storage from the survivor node anymore, and the VMs become unstable (read/write errors).

I can only get the environment stable again if I manually mark the OSDs from the failed node as DOWN, using the command `ceph osd down osd.X`.
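A minimal sketch of that workaround, assuming the failed node's OSDs are osd.10 through osd.19 (substitute the IDs from your own `ceph osd tree` output):

Code:
# Find which OSDs belong to the failed host
ceph osd tree
# Mark each of the failed node's OSDs down (IDs 10-19 are only an example)
for id in $(seq 10 19); do
    ceph osd down osd.$id
done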
 
It can take up to 10 minutes until an OSD is marked as out, but on node failure this usually happens faster. What are the size/min_size of the pools? And how are the OSDs distributed?
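For context, the 10-minute figure corresponds to `mon_osd_down_out_interval`, which defaults to 600 seconds and controls how long an OSD stays down before it is additionally marked out. A quick way to check the current value, assuming a release with the centralized config database (Mimic or later):

Code:
# Show how long a down OSD waits before being marked out (default: 600 s)
ceph config get mon mon_osd_down_out_interval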
 

It has been almost 20 minutes since the power failure and the OSDs are still UP/IN.

size/min_size is 2/1 (only a test environment)
10 OSDs per node

The environment is:
3 mon nodes
3 mgr nodes
2 osd nodes
 
After almost 30 minutes the OSDs were finally marked as DOWN.

Why did it take so long? 30 minutes is not an acceptable scenario for a cluster with dozens of VMs running.

Where can I lower this timeout, and why did it take Ceph so long to mark the OSDs down?
 
size/min_size is 2/1 (only a test environment)
Never run with 2/1, especially in a small setup; it drastically increases the risk of data loss, since with min_size 1 the only remaining copy of an object may still be in flight and not yet written to any OSD, or the single OSD holding it may die.
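If you want to move a test pool to the recommended 3/2, a minimal sketch, assuming a pool named `vm-pool` (use your own pool names from `ceph osd lspools`):

Code:
# Keep 3 copies of each object and require at least 2 for I/O
ceph osd pool set vm-pool size 3
ceph osd pool set vm-pool min_size 2

Note that with only two OSD nodes and the default host-level failure domain, the third replica has nowhere to go, so 3/2 also needs a third OSD node (or a different CRUSH rule).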

Where can I lower this timeout, and why did it take Ceph so long to mark the OSDs down?
Try this: set the following option to host.
https://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/#configuration-settings
mon osd down out subtree limit
Description: The smallest CRUSH unit type that Ceph will not automatically mark out. For instance, if set to host and if all OSDs of a host are down, Ceph will not automatically mark out these OSDs.
Type: String
Default: rack
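One way to apply that without editing ceph.conf by hand, assuming Mimic or later, is via the monitor config database:

Code:
# Prevent Ceph from automatically marking out all OSDs of a down host
ceph config set mon mon_osd_down_out_subtree_limit host
# Verify the new value
ceph config get mon mon_osd_down_out_subtree_limit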

EDIT: But in my test clusters, it reacts immediately, without the above change.
 
Hi,
I have the same scenario, and I can't find the right configuration for Ceph.
Did you sort it out?
 
Hello, what version of Proxmox VE are you using?

What are the contents of `/etc/pve/ceph.conf`?

After disconnecting a node:

- What's the output of `ceph status`?
- Is there anything interesting in journalctl or the Ceph logs (/var/log/ceph)?
- Is the cluster otherwise working fine? E.g. what's the output of `pvecm status`?

In general it shouldn't take more than a few seconds for the changes to be reflected in the web UI with default settings.
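If it helps, the checks above can be collected in one pass on a surviving node; a sketch (log paths and unit names may differ on your setup):

Code:
# Ceph's view of the cluster and the OSD map
ceph status
ceph osd tree
# Proxmox cluster quorum
pvecm status
# Recent Ceph daemon logs
journalctl -u 'ceph-mon@*' -u 'ceph-osd@*' --since "1 hour ago"
tail -n 50 /var/log/ceph/ceph.log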
 
Hi Maximiliano,
I'm using the latest version 8, but I had this issue even on 7.
I'll update you this evening with the config and status, but it's the default one applied by the Proxmox setup (I've made some changes and rolled them back).

It's kind of a "next -> next -> done" configuration, and then a struggle for the right ceph.conf line to make it work :) nothing more, nothing less.

I'll update you later with all the configs requested.
 
Please share the Ceph config anyway; it's not quite obvious what you mean by "struggle for the right ceph.conf".
 
Here it is:

Code:
root@proxmox01:~# cat /etc/ceph/ceph.conf
[global]
         auth_client_required = cephx
         auth_cluster_required = cephx
         auth_service_required = cephx
         cluster_network = 192.168.178.8/24
         fsid = 5c6f56d5-b58b-4406-a2c0-d70266fab939
         mon_allow_pool_delete = true
         mon_host = 192.168.178.8 192.168.178.10 192.168.178.9
         ms_bind_ipv4 = true
         ms_bind_ipv6 = false
         osd_pool_default_min_size = 1
         osd_pool_default_size = 2
         public_network = 192.168.178.8/24
[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.proxmox01]
         public_addr = 192.168.178.8

[mon.proxmox02]
         public_addr = 192.168.178.9

[mon.proxmox03]
         public_addr = 192.168.178.10
 
