[SOLVED] Cluster with Ceph becomes degraded if I shut down or restart a node

vaschthestampede

I have a 4-node cluster with two disks in each node.
When I have to shut down or restart a node for maintenance or a server upgrade, the cluster becomes degraded.

[Screenshot: Ceph status showing degraded PGs during recovery]

Initially the degraded PGs are exactly half of the total; in the screenshot the recovery had already started a few minutes earlier.
The problem is that while there are red (degraded) PGs the virtual machines become completely unresponsive and, if the situation goes on for too long, they crash.

Ceph configuration:
Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 172.25.2.20/25
     fsid = df623737-f7cc-4aeb-92fd-fec2c2b7ad51
     mon_allow_pool_delete = true
     mon_host = 172.25.2.20 172.25.2.10 172.25.2.30
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 2
     public_network = 172.25.2.20/25

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.atlantico]
     public_addr = 172.25.2.20

[mon.indiano]
     public_addr = 172.25.2.30

[mon.pacifico]
     public_addr = 172.25.2.10

Ceph CRUSH map:
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class NVMeCluster
device 1 osd.1 class NVMeCluster
device 2 osd.2 class NVMeCluster
device 3 osd.3 class NVMeCluster
device 4 osd.4 class NVMeCluster
device 5 osd.5 class NVMeCluster
device 6 osd.6 class NVMeCluster
device 7 osd.7 class NVMeCluster

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host atlantico {
    id -3        # do not change unnecessarily
    id -4 class NVMeCluster        # do not change unnecessarily
    # weight 13.973
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 6.986
    item osd.1 weight 6.986
}
host pacifico {
    id -5        # do not change unnecessarily
    id -6 class NVMeCluster        # do not change unnecessarily
    # weight 13.973
    alg straw2
    hash 0    # rjenkins1
    item osd.2 weight 6.986
    item osd.3 weight 6.986
}
host indiano {
    id -7        # do not change unnecessarily
    id -8 class NVMeCluster        # do not change unnecessarily
    # weight 13.973
    alg straw2
    hash 0    # rjenkins1
    item osd.4 weight 6.986
    item osd.5 weight 6.986
}
host artico {
    id -9        # do not change unnecessarily
    id -10 class NVMeCluster        # do not change unnecessarily
    # weight 13.973
    alg straw2
    hash 0    # rjenkins1
    item osd.6 weight 6.986
    item osd.7 weight 6.986
}
root default {
    id -1        # do not change unnecessarily
    id -2 class NVMeCluster        # do not change unnecessarily
    # weight 55.890
    alg straw2
    hash 0    # rjenkins1
    item atlantico weight 13.973
    item pacifico weight 13.973
    item indiano weight 13.973
    item artico weight 13.973
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule miniserver_NVMeCluster {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default class NVMeCluster
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
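(For reference, the compiled CRUSH map is exported and decompiled into this text form with something like the following; the file names are arbitrary.)
Code:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt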

How can I make sure that, during maintenance or upgrades, the PGs are only remapped (yellow), so that the virtual machines stay responsive?
 
I think there is a min_size parameter on each pool, and according to your config it will be 2 by default. If your pool's replicated size is 2, then you need the pool's min_size to be 1 to be able to survive downtime.

You can check your pool parameters with ceph osd pool ls detail

You can set the pool's min_size with ceph osd pool set <pool> min_size 1

While they say it's not recommended for production use (the recommendation is to set the replicated size to 3), I have been running a configuration with a replicated size of 2 and min_size of 1 in my home lab for several years now without issues...
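A minimal sketch, assuming the pool is called <pool> (substitute your own pool name):
Code:
# show size, min_size and other parameters for every pool
ceph osd pool ls detail

# allow I/O with only one replica left (risky, see the warning below)
ceph osd pool set <pool> min_size 1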
 
So I could set either
osd_pool_default_min_size = 1
or
osd_pool_default_size = 3

In the first case I would have more space and less redundancy; in the second, the opposite.
 
Please do NOT set min_size to 1! This increases the chances for data loss and inconsistencies a lot.

If you only have your pools configured to use size=2, then you will be below min_size if one node or OSD goes down. The pool will only be operational again once the node / OSD is back up, or once the OSD has been marked as out, which triggers the recovery back to the number of replicas configured in the "size" parameter. Ceph will automatically mark an OSD as out if it is down for more than 10 minutes.

The default size/min_size of 3/2 will still keep the pool operational if one node or OSD is down, as there will be 2 replicas still available.
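For an existing pool, a sketch of getting to the recommended 3/2 (again with <pool> as a placeholder for the actual pool name):
Code:
ceph osd pool set <pool> size 3
ceph osd pool set <pool> min_size 2

# verify the new values
ceph osd pool get <pool> size
ceph osd pool get <pool> min_size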
 
So I could set either
osd_pool_default_min_size = 1
or
osd_pool_default_size = 3

In the first case I would have more space and less redundancy; in the second, the opposite.

I believe those are the defaults for newly created pools. For an existing pool you need to use the 'ceph osd pool set' command.
And yes, min_size 1 with a replicated size of 2 is risky; only use it for something that you can afford to lose and re-create easily, like a lab... You can have multiple pools with different settings, btw...
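As a sketch of that last point, an additional pool with weaker redundancy could look like this; the pool name 'labpool', the PG count of 32 and the rbd application are example values, not anything from this cluster:
Code:
# create an extra replicated pool with an example PG count
ceph osd pool create labpool 32 32 replicated
ceph osd pool set labpool size 2
ceph osd pool set labpool min_size 1
# mark the intended use so Ceph does not warn about a missing application
ceph osd pool application enable labpool rbd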
 
I just set osd_pool_default_size = 3 through the web GUI.

[Screenshot: Ceph status after increasing the pool size]

I expected to see only remapped PGs.
The rebuild should finish in two or three hours.

Tomorrow I will try to restart a node to confirm that the change has the desired effect.
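For reference, the recovery progress can be followed from the CLI while it runs:
Code:
# one-shot overview, including recovery/backfill progress
ceph -s
# or stream health and recovery events continuously
ceph -w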
 
Now I have this message:
Code:
Degraded data redundancy: 189517/12115939 objects degraded (1.564%), 6 pgs degraded, 6 pgs undersized
pg 2.d is stuck undersized for 13h, current state active+undersized+degraded+remapped+backfill_toofull, last acting [1,3]
pg 2.2f is stuck undersized for 13h, current state active+undersized+degraded+remapped+backfill_toofull, last acting [1,6]
pg 2.32 is stuck undersized for 13h, current state active+undersized+degraded+remapped+backfill_toofull, last acting [1,7]
pg 2.48 is stuck undersized for 13h, current state active+undersized+degraded+remapped+backfill_toofull, last acting [3,0]
pg 2.6e is stuck undersized for 13h, current state active+undersized+degraded+remapped+backfill_toofull, last acting [1,3]
pg 2.76 is stuck undersized for 13h, current state active+undersized+degraded+remapped+backfill_toofull, last acting [1,7]

What can I do?
 
Given the problem described above, I moved to:
osd_pool_default_min_size = 1
AND
osd_pool_default_size = 2

I know it's not ideal, but for now I don't think I can do otherwise.
Tonight I will try to shut down a server, as I will need to add a graphics card.

I have a question: could the fact that I have two disks per server be a problem?
 
I have a question: could the fact that I have two disks per server be a problem?
In general, not really. Ceph (in the default settings) does redundancy on the host level. No 2 replicas of the same data should end up on the same node.

The more resources you give Ceph (nodes, OSDs, ...) the easier it is for it to recover from a failure. If you have a small cluster, you will have to take a closer look at how you set it up.

A common situation is a 3-node cluster, the smallest possible one. Given the default size=3, which is equal to the number of nodes available, you need to pay close attention to how many OSDs you give each node.

In the simplest case of 1 OSD per node, you are fine if a node or an OSD fails. There are no other nodes or OSDs to recover to, given the limitation that only 1 replica can be stored per node, so the cluster will stay in a degraded state until the failed node or OSD is fixed.

If you add more OSDs to each node and only one OSD fails, Ceph can still recover the replicas to the remaining OSDs, as that still fits into the "1 replica per node" limit. This is where things can go south quite quickly. Let's assume 2 OSDs per node, each filled 40%. Now one OSD fails and Ceph will recover the data to the other remaining OSD, which is then 80% full. If the OSDs are fuller than that, you run into problems, as the remaining OSD will be nearfull or actually full. Ceph can handle a lot of things, but the one thing you want to avoid is running out of available space!

Therefore, especially in smaller clusters, it is good to have more but smaller OSDs to be able to handle the loss of a single OSD. The more nodes you have, the less important this is, as Ceph can recover the data from the lost OSD to other nodes as well.
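As a quick check for that, the utilization of every OSD (grouped by host) and of the pools can be inspected with:
Code:
# per-OSD usage, arranged along the CRUSH tree
ceph osd df tree
# per-pool and overall usage
ceph df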


With that in mind, shutting down one server out of 4 should not be too much of a problem, if you have enough space. If you keep the node down for too long, Ceph will automatically set the OSDs in the downed node to "out". This will happen after 10 minutes and cause the recovery of the data on the remaining nodes.
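That 10-minute window is Ceph's mon_osd_down_out_interval option (600 seconds by default). As a sketch, it can be read and, if really needed, adjusted via the config database:
Code:
ceph config get mon mon_osd_down_out_interval
# e.g. extend it to 30 minutes (value in seconds)
ceph config set mon mon_osd_down_out_interval 1800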

If it is a planned maintenance, you can set the following OSD flags for the duration of the maintenance to stop Ceph from doing that.
  • noout
  • norebalance
  • norecover
The noout flag should avoid the automatic "out" for those OSDs, and the other two flags are just in case. Once you are done and the node is back up, remove the flags again to give Ceph the ability to handle problems.
 
Thank you for the extended explanation.

If it is a planned maintenance, you can set the following OSD flags for the duration of the maintenance to stop Ceph from doing that.
  • noout
  • norebalance
  • norecover
The noout flag should avoid the automatic "out" for those OSDs, and the other two flags are just in case. Once you are done and the node is back up, remove the flags again to give Ceph the ability to handle problems.

How?
 
In the GUI, via the "Global Flags" button in the OSD panel, or via the CLI with ceph osd set <flag> and ceph osd unset <flag>.
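For example, around a planned shutdown of one node:
Code:
# before shutting the node down
ceph osd set noout
ceph osd set norebalance
ceph osd set norecover

# after the node is back up and the OSDs are in again
ceph osd unset noout
ceph osd unset norebalance
ceph osd unset norecover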
 
