Understanding Ceph Failure Behavior

gdi2k

Renowned Member
Aug 13, 2016
We have a cluster of 3 Proxmox servers and use Ceph as underlying storage for the VMs. It's fantastic and hasn't given us any trouble so far.

As our usage rises and available capacity diminishes, I'm starting to wonder what actually happens in the event of a failure. I'm not too worried about an OSD failure, as there is still plenty of space on the other OSDs to rebalance onto. But what happens if one of the nodes fails? As we approach 60%+ usage, my thinking is that if we lost one node (out of 3, i.e. 33% of capacity) and Ceph tried to rebalance, we could run out of storage space and bad things might happen.

My question is: Will Ceph try to rebalance even if there is not enough storage space to meet its requirements with one node down? Or will it continue to operate in degraded mode (without rebalancing things) until the node is replaced or additional capacity is added?
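
For reference, the pool's replication settings (which largely determine what happens when a node or OSD is lost) can be checked with something like this - pool name 'ceph', as in the ceph df output below:
Code:
ceph osd pool get ceph size
ceph osd pool get ceph min_size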

Our config looks like this:
Code:
ceph osd tree
ID WEIGHT  TYPE NAME        UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 3.34996 root default                                       
-2 1.12999     host smiles1                                   
 0 0.23000         osd.0         up  1.00000          1.00000
 1 0.89999         osd.1         up  1.00000          1.00000
-3 1.10999     host smiles2                                   
 2 0.20999         osd.2         up  1.00000          1.00000
 3 0.89999         osd.3         up  1.00000          1.00000
-4 1.10999     host smiles3                                   
 4 0.20999         osd.4         up  1.00000          1.00000
 5 0.89999         osd.5         up  1.00000          1.00000

Code:
ceph -s
    cluster ab9b66eb-4363-4fca-85dd-e67e47aef05f
     health HEALTH_OK
     monmap e3: 3 mons at {0=10.15.15.50:6789/0,1=10.15.15.51:6789/0,2=10.15.15.52:6789/0}
            election epoch 178, quorum 0,1,2 0,1,2
     osdmap e476: 6 osds: 6 up, 6 in
      pgmap v5531478: 450 pgs, 1 pools, 1036 GB data, 263 kobjects
            2098 GB used, 1350 GB / 3448 GB avail
                 450 active+clean
  client io 33469 kB/s rd, 6063 kB/s wr, 3181 op/s

Code:
ceph df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    3448G     1350G        2098G         60.85
POOLS:
    NAME     ID     USED      %USED     MAX AVAIL     OBJECTS
    ceph     1      1036G     60.10          632G      269868

Code:
[global]
     auth client required = cephx
     auth cluster required = cephx
     auth service required = cephx
     cluster network = 10.15.15.0/24
     filestore xattr use omap = true
     fsid = ab9b66eb-4363-4fca-85dd-e67e47aef05f
     keyring = /etc/pve/priv/$cluster.$name.keyring
     osd journal size = 5120
     osd pool default min size = 1
     public network = 10.15.15.0/24

[osd]
     keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.2]
     host = smiles3
     mon addr = 10.15.15.52:6789

[mon.0]
     host = smiles1
     mon addr = 10.15.15.50:6789

[mon.1]
     host = smiles2
     mon addr = 10.15.15.51:6789

Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host smiles1 {
    id -2        # do not change unnecessarily
    # weight 1.130
    alg straw
    hash 0    # rjenkins1
    item osd.0 weight 0.230
    item osd.1 weight 0.900
}
host smiles2 {
    id -3        # do not change unnecessarily
    # weight 1.110
    alg straw
    hash 0    # rjenkins1
    item osd.2 weight 0.210
    item osd.3 weight 0.900
}
host smiles3 {
    id -4        # do not change unnecessarily
    # weight 1.110
    alg straw
    hash 0    # rjenkins1
    item osd.4 weight 0.210
    item osd.5 weight 0.900
}
root default {
    id -1        # do not change unnecessarily
    # weight 3.350
    alg straw
    hash 0    # rjenkins1
    item smiles1 weight 1.130
    item smiles2 weight 1.110
    item smiles3 weight 1.110
}

# rules
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map

(Proxmox server view: VMs and storages on smiles1, smiles2 and smiles3, plus the recent backup/update task history)
 
...
My question is: Will Ceph try to rebalance even if there is not enough storage space to meet its requirements with one node down? Or will it continue to operate in degraded mode (without rebalancing things) until the node is replaced or additional capacity is added?
Hi,
with your setup Ceph doesn't rebalance if one node fails, because the normal (and safe) setting is a replica count of 3 - meaning you need three nodes and each node holds a copy of the data.
Rebalancing starts if a single OSD dies - then all of its data is moved to the remaining OSD of that node.
Or, if you have a 4th node and a whole node dies.
Rebalancing runs until it is finished or until something happens, like osd_backfill_full_ratio being reached on one OSD.

If Ceph can't write because a disk is full (around 90%, osd_backfill_full_ratio), your I/O blocks - i.e. all VMs stop - until you change the situation (bring back the failed node, add an OSD, raise osd_backfill_full_ratio).
This is the reason why a Ceph cluster should only be filled to 60-70%.
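
To keep an eye on this, the per-OSD usage and the current threshold values can be checked with something like the following (osd.0 is just an example - run the daemon command on the node that hosts that OSD):
Code:
ceph osd df
ceph daemon osd.0 config show | grep full_ratio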

Udo
 
Maybe I'm misunderstanding, but I/O won't stop when osd_backfill_full_ratio is reached - only recovery/backfill will stop.

I/O only stops at mon_osd_full_ratio, which is a separate, higher value (the backfill ratio defaults to less than the full ratio precisely so that a backfill can't fill an OSD completely and cause writes to be blocked).
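
For reference, these are the usual defaults on Ceph releases of this vintage (worth double-checking, since they can be overridden in ceph.conf):
Code:
osd_backfill_full_ratio = 0.85   # backfill/recovery onto this OSD stops
mon_osd_nearfull_ratio  = 0.85   # cluster reports HEALTH_WARN (near full)
mon_osd_full_ratio      = 0.95   # client writes block once an OSD is this full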
 
Hi,
yes, you are right - but osd_full_ratio can be reached quickly if you have only 3 nodes with 2 OSDs each and one OSD dies.
You can only write about (osd_full_ratio - osd_backfill_full_ratio) worth of new data after backfill stops.
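
With the default ratios and a 900 GB OSD (the size of the larger OSDs in this cluster), that headroom works out to roughly:
Code:
# assuming the default ratios of 0.85 (backfill full) and 0.95 (full):
#   backfill onto the OSD stops at   0.85 * 900 GB = 765 GB used
#   client writes block at           0.95 * 900 GB = 855 GB used
# => only about (0.95 - 0.85) * 900 GB = 90 GB of new writes fit in between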

Udo
 
Sure, agree with you there - was just confirming.
 
Thanks for the advice Ashley and Udo - good to hear that Ceph won't do anything crazy should a node fail, and thank you for pointing me in the direction of docs so I can better understand this.

We will make sure to keep a generous amount of space to spare to avoid any issues during failures. We have 3x 1 TB drives in the mail, so we will add one 1 TB OSD to each node, which should bring total usage back down to safer levels.

(this is an SSD-only cluster, so we're talking small storage quantities here!)
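
The rough plan for adding them, assuming the usual Proxmox workflow (the device path is a placeholder, and the exact pveceph syntax may differ between versions):
Code:
# on each node, after installing the new SSD:
pveceph createosd /dev/sdX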

Cheers
 
