[SOLVED] 3-node cluster

NdK73

Hello all.

I deployed a 3-node cluster w/ ceph, one node per server room, to be able to keep working when (not if) one of the rooms goes down.
Too bad it seems that Proxmox stops IO even when the Ceph cluster should still be working: the result is that a downed room blocks all the VMs in the cluster as soon as they try to write.

In an old thread @leesteken says "If you want Ceph with redundancy, you need more than three nodes." but that's never said in the docs. On the contrary, the docs say that the minimum is 3 nodes. A difference with that thread is that my config uses the default of:
Code:
osd_pool_default_min_size = 2
osd_pool_default_size = 3

With a downed room, the other nodes still have quorum, so why do they stop writes? And how should the config be changed to avoid this issue?
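
For what it's worth, a quick sketch of how the values actually applied to a given pool can be checked ({pool} is a placeholder for the pool name):
Code:
ceph osd pool get {pool} size      # number of replicas kept for the pool
ceph osd pool get {pool} min_size  # IO for a PG stops once fewer replicas than this are available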

Tks
 
Correct, the resources in the cluster should not stop working.

Are MONs running on each node? That's the requirement for the remaining two to keep quorum --> "OK".
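
For reference, a quick sketch of how MON quorum can be checked from any node:
Code:
ceph mon stat                            # which MONs exist and which are in quorum
ceph quorum_status --format json-pretty  # more detailed quorum information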

Three is the absolute minimum. As soon as anything "bad" happens with one node you are degraded. And you stay degraded as there is no space for self-healing...

Also: https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/

Note: I do NOT use Ceph currently.
 
Correct, the resources in the cluster should not stop working.
That's what I thought. If only Proxmox thought the same... :)

There's a monitor on each host, as default setup, and the quorum is met.

Moreover, the Proxmox interface now hangs when trying to access the Ceph dashboard. Color me even more confused!

Actually ceph does not say anything about being "readonly", just:
Code:
root@virt1:~# ceph health detail
HEALTH_WARN 1/3 mons down, quorum virt1,virt2; 6 osds down; 1 host (6 osds) down; 1 room (6 osds) down; Reduced data availability: 83 pgs inactive; Degraded data redundancy: 793382/2444972 objects degraded (32.450%), 84 pgs degraded, 84 pgs undersized
[WRN] MON_DOWN: 1/3 mons down, quorum virt1,virt2
    mon.virt3 (rank 2) addr [v2:192.168.1.33:3300/0,v1:192.168.1.33:6789/0] is down (out of quorum)
[WRN] OSD_DOWN: 6 osds down
    osd.12 (root=default,room=a51,host=virt3) is down
    osd.13 (root=default,room=a51,host=virt3) is down
    osd.14 (root=default,room=a51,host=virt3) is down
    osd.15 (root=default,room=a51,host=virt3) is down
    osd.16 (root=default,room=a51,host=virt3) is down
    osd.17 (root=default,room=a51,host=virt3) is down
[WRN] OSD_HOST_DOWN: 1 host (6 osds) down
    host virt3 (root=default,room=a51) (6 osds) is down
[WRN] OSD_ROOM_DOWN: 1 room (6 osds) down
    room a51 (root=default) (6 osds) is down
[WRN] PG_AVAILABILITY: Reduced data availability: 83 pgs inactive
[...cut...]
 
As far as I understand it, if you have a 3-node Ceph cluster and you lose one node, Ceph will go into a read-only mode because there is no node left to copy data to (if you use 3/2) to keep the redundancy up.

At least that's what some community members suggest.

And from UdoB's post
Code:
When (not:if) one OSD (or a whole node, at this point there is no difference!) fails, Ceph is immediately degraded. There is no room for Ceph to heal itself, so being "degraded" is permanent. For a stable situation you really want to have nodes that can jump in and return to a stable condition - automatically. For Ceph in this picture this means to have at least four nodes. (In this specific aspect; in other regards you really want to have five or more of them...)

Essentially: we want one more node than Ceph has “size=N”.

And don't confuse a PVE cluster (corosync) with the Ceph cluster; they work differently and have different requirements.
 
As far as I understand it, if you have a 3-node Ceph cluster and you lose one node, Ceph will go into a read-only mode because there is no node left to copy data to (if you use 3/2) to keep the redundancy up.
No, that's not what is happening. It just works perfectly fine with 2 nodes, because it still has quorum (PVE as well as Ceph), yet you are in a degraded state, because you don't have 3 copies of your data, just 2. With 4 nodes left out of 5, the Ceph cluster will also be degraded until the data is redistributed automatically (self-healing) and will be healthy after that. An additional node can fail and everything will be fine with 3 out of 5 nodes running.
 
Well, I'm OK with Ceph being degraded for some time: it should just continue writing to the other two copies!
That's the whole reason to keep 3 copies: continue to write even if one goes down! Otherwise I'd choose just 2 copies, so that if a node goes down the cluster heals on the remaining two.
With 3 data copies, it should go RO *only* when two copies are down: there's no quorum.

I know well that the two clusters use different quorums (although they're on the same hosts), but Ceph's choice to go RO when hitting min_size seems illogical: that would mean a 3-node setup cannot keep working when any node is unavailable.
 
No, that's not what is happening. It just works perfectly fine with 2 nodes, because it still has quorum (PVE as well as Ceph), yet you are in a degraded state, because you don't have 3 copies of your data, just 2. With 4 nodes left out of 5, the Ceph cluster will also be degraded until the data is redistributed automatically (self-healing) and will be healthy after that. An additional node can fail and everything will be fine with 3 out of 5 nodes running.
That's what I understood, too. But practice says it does not work. So I'm trying to understand why, and what I can do to make it work ASAP...
 
Well, as others mentioned, if one node is down, the Ceph MONs and Proxmox VE nodes should still have quorum with 2 out of 3.
Datawise, if you have set size/min_size to 3/2 in all the pools, things should keep working as you should still have 2 of 3 replicas.
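
A quick sketch of how to check that per pool (each pool carries its own size/min_size after creation, independent of the osd_pool_default_* values):
Code:
ceph osd pool ls detail   # shows size, min_size and crush_rule for every pool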

The question is, in what state was the cluster when the one node/room was down?

As in, could the remaining two nodes still talk to each other on all the network? (ping?)

What was the output of ceph -s and ceph osd df tree? Did you get any output or was the command stuck? That would indicate that the Ceph cluster isn't working anymore, which is most likely a networking issue.
 
The cluster was healthy. And the virt3 node is still down :(

The remaining nodes (virt1 and virt2) can see each other and are actually recognized to form a quorum:
Code:
root@virt1:~# ceph -s
  cluster:
    id:     a76df3b3-8b6d-42da-a901-47e80c660df4
    health: HEALTH_WARN
            1/3 mons down, quorum virt1,virt2
            6 osds down
            1 host (6 osds) down
            1 room (6 osds) down
            Reduced data availability: 83 pgs inactive
            Degraded data redundancy: 793382/2444972 objects degraded (32.450%), 84 pgs degraded, 84 pgs undersized
 
  services:
    mon: 3 daemons, quorum virt1,virt2 (age 3h), out of quorum: virt3
    mgr: virt2(active, since 6M), standbys: virt1
    osd: 18 osds: 12 up (since 3h), 18 in (since 6M)
 
  data:
    pools:   2 pools, 129 pgs
    objects: 1.22M objects, 4.6 TiB
    usage:   9.3 TiB used, 56 TiB / 65 TiB avail
    pgs:     64.341% pgs not active
             793382/2444972 objects degraded (32.450%)
             83 undersized+degraded+peered
             45 active+clean
             1  active+undersized+degraded

root@virt1:~# ceph osd df tree
ID   CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME        
 -1         65.49637         -   65 TiB  9.3 TiB  9.2 TiB  163 KiB   53 GiB   56 TiB  14.17  1.00    -          root default     
-15         21.83212         -   22 TiB  3.0 TiB  3.0 TiB   56 KiB   17 GiB   19 TiB  13.79  0.97    -              room a51     
-25         21.83212         -   22 TiB  3.0 TiB  3.0 TiB   56 KiB   17 GiB   19 TiB  13.79  0.97    -                  host virt3
 12    ssd   3.63869   1.00000  3.6 TiB  594 GiB  590 GiB   10 KiB  3.4 GiB  3.1 TiB  15.93  1.12    0    down              osd.12
 13    ssd   3.63869   1.00000  3.6 TiB  558 GiB  555 GiB    6 KiB  3.1 GiB  3.1 TiB  14.97  1.06    0    down              osd.13
 14    ssd   3.63869   1.00000  3.6 TiB  443 GiB  441 GiB    8 KiB  2.6 GiB  3.2 TiB  11.90  0.84    0    down              osd.14
 15    ssd   3.63869   1.00000  3.6 TiB  595 GiB  592 GiB   20 KiB  2.8 GiB  3.1 TiB  15.96  1.13    0    down              osd.15
 16    ssd   3.63869   1.00000  3.6 TiB  409 GiB  407 GiB    4 KiB  2.4 GiB  3.2 TiB  10.98  0.77    0    down              osd.16
 17    ssd   3.63869   1.00000  3.6 TiB  484 GiB  482 GiB    8 KiB  2.6 GiB  3.2 TiB  13.00  0.92    0    down              osd.17
-13         21.83212         -   22 TiB  3.5 TiB  3.5 TiB   52 KiB   19 GiB   18 TiB  15.91  1.12    -              room b038b   
-21         21.83212         -   22 TiB  3.5 TiB  3.5 TiB   52 KiB   19 GiB   18 TiB  15.91  1.12    -                  host virt1
  0    ssd   3.63869   1.00000  3.6 TiB  556 GiB  553 GiB    8 KiB  2.8 GiB  3.1 TiB  14.91  1.05   15      up              osd.0
  1    ssd   3.63869   1.00000  3.6 TiB  519 GiB  515 GiB   16 KiB  3.2 GiB  3.1 TiB  13.92  0.98   14      up              osd.1
  2    ssd   3.63869   1.00000  3.6 TiB  628 GiB  625 GiB    6 KiB  3.1 GiB  3.0 TiB  16.86  1.19   17      up              osd.2
  3    ssd   3.63869   1.00000  3.6 TiB  593 GiB  590 GiB   10 KiB  2.8 GiB  3.1 TiB  15.92  1.12   16      up              osd.3
  4    ssd   3.63869   1.00000  3.6 TiB  633 GiB  629 GiB    5 KiB  3.4 GiB  3.0 TiB  16.98  1.20   18      up              osd.4
  5    ssd   3.63869   1.00000  3.6 TiB  630 GiB  626 GiB    7 KiB  3.4 GiB  3.0 TiB  16.90  1.19   17      up              osd.5
-14         21.83212         -   22 TiB  2.8 TiB  2.8 TiB   55 KiB   18 GiB   19 TiB  12.80  0.90    -              room s14     
-23         21.83212         -   22 TiB  2.8 TiB  2.8 TiB   55 KiB   18 GiB   19 TiB  12.80  0.90    -                  host virt2
  6    ssd   3.63869   1.00000  3.6 TiB  556 GiB  553 GiB   10 KiB  3.2 GiB  3.1 TiB  14.91  1.05   15      up              osd.6
  7    ssd   3.63869   1.00000  3.6 TiB  483 GiB  480 GiB   11 KiB  2.9 GiB  3.2 TiB  12.97  0.92   13      up              osd.7
  8    ssd   3.63869   1.00000  3.6 TiB  370 GiB  368 GiB    5 KiB  2.4 GiB  3.3 TiB   9.93  0.70   10      up              osd.8
  9    ssd   3.63869   1.00000  3.6 TiB  484 GiB  481 GiB    8 KiB  3.2 GiB  3.2 TiB  13.00  0.92   13      up              osd.9
 10    ssd   3.63869   1.00000  3.6 TiB  373 GiB  370 GiB    9 KiB  2.9 GiB  3.3 TiB  10.02  0.71   11      up              osd.10
 11    ssd   3.63869   1.00000  3.6 TiB  594 GiB  591 GiB   12 KiB  3.1 GiB  3.1 TiB  15.94  1.13   16      up              osd.11
                         TOTAL   65 TiB  9.3 TiB  9.2 TiB  171 KiB   53 GiB   56 TiB  14.17                                      
MIN/MAX VAR: 0.70/1.20  STDDEV: 2.26

The commands are not stuck.

From my (limited) Ceph experience, it shouldn't block anything from writing to osd.[0-11] . But VMs are blocked :(
 
The problem is this:
Code:
    pgs:     64.341% pgs not active
             793382/2444972 objects degraded (32.450%)
             83 undersized+degraded+peered

Some PGs are not active, and therefore you have IO issues.
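
A sketch of how the inactive PGs can be listed and inspected further ({pgid} is a placeholder for one of the reported PG IDs):
Code:
ceph pg dump_stuck inactive   # only the PGs that are not active
ceph pg {pgid} query          # detailed state of a single PG, including why it is not active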

Was the cluster healthy before one node/room went down?

Could you also post the output of pveceph pool ls --noborder? Make sure the window is wide enough, output that doesn't fit is cut off.

Also, if you have a custom CRUSH rule, or modified the default replicated_rule, please copy and paste the CRUSH map info you have in the Node->Ceph->Configuration panel on the right side.

A ceph pg dump_stuck might also show something interesting.
 
Yes, the cluster was healthy. Maybe the pgs are not active because they've never been used?

Code:
root@virt1:~# pveceph pool ls --noborder
Name   Size Min Size PG Num min. PG Num Optimal PG Num PG Autoscale Mode PG Autoscale Target Size PG Autoscale Target Ratio Crush Rule Name               %-Used Used
.mgr      3        2      1           1              1 on                                                                   replicate_rooms 7.29225257600774e-06 409743360
main_3    2        2    128                        128 on                                                                   replicate_rooms    0.152933374047279 10144524982008

It's using the replicate_rooms rule to avoid having multiple replicas in the same room, exactly to avoid the current issue:
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd
device 15 osd.15 class ssd
device 16 osd.16 class ssd
device 17 osd.17 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host virt3 {
    id -25        # do not change unnecessarily
    id -37 class ssd        # do not change unnecessarily
    # weight 21.83212
    alg straw2
    hash 0    # rjenkins1
    item osd.12 weight 3.63869
    item osd.13 weight 3.63869
    item osd.14 weight 3.63869
    item osd.15 weight 3.63869
    item osd.16 weight 3.63869
    item osd.17 weight 3.63869
}
room a51 {
    id -15        # do not change unnecessarily
    id -30 class ssd        # do not change unnecessarily
    # weight 21.83212
    alg straw2
    hash 0    # rjenkins1
    item virt3 weight 21.83212
}
host virt2 {
    id -23        # do not change unnecessarily
    id -36 class ssd        # do not change unnecessarily
    # weight 21.83212
    alg straw2
    hash 0    # rjenkins1
    item osd.6 weight 3.63869
    item osd.9 weight 3.63869
    item osd.7 weight 3.63869
    item osd.8 weight 3.63869
    item osd.10 weight 3.63869
    item osd.11 weight 3.63869
}
room s14 {
    id -14        # do not change unnecessarily
    id -34 class ssd        # do not change unnecessarily
    # weight 21.83212
    alg straw2
    hash 0    # rjenkins1
    item virt2 weight 21.83212
}
host virt1 {
    id -21        # do not change unnecessarily
    id -35 class ssd        # do not change unnecessarily
    # weight 21.83212
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 3.63869
    item osd.1 weight 3.63869
    item osd.2 weight 3.63869
    item osd.3 weight 3.63869
    item osd.4 weight 3.63869
    item osd.5 weight 3.63869
}
room b038b {
    id -13        # do not change unnecessarily
    id -38 class ssd        # do not change unnecessarily
    # weight 21.83212
    alg straw2
    hash 0    # rjenkins1
    item virt1 weight 21.83212
}
root default {
    id -1        # do not change unnecessarily
    id -39 class ssd        # do not change unnecessarily
    # weight 65.49637
    alg straw2
    hash 0    # rjenkins1
    item a51 weight 21.83212
    item s14 weight 21.83212
    item b038b weight 21.83212
}

# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule replicate_rooms {
    id 1
    type replicated
    step take default
    step chooseleaf firstn 0 type room
    step emit
}

# end crush map
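
For reference, a rule with a room failure domain like replicate_rooms can also be created from the CLI instead of editing the decompiled CRUSH map (just a sketch, assuming the default root):
Code:
ceph osd crush rule create-replicated replicate_rooms default room
ceph osd crush rule dump replicate_rooms   # verify the generated steps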
Given the current setup it is (well, should be) exactly the same as replicated_rule. But before replacing nodes, there were 9 hosts (3 per room). And IIRC it never had problems when a single room was down.

Code:
root@virt1:~# ceph pg dump_stuck
PG_STAT  STATE                       UP      UP_PRIMARY  ACTING  ACTING_PRIMARY
4.7e     undersized+degraded+peered     [0]           0     [0]               0
4.7c     undersized+degraded+peered     [1]           1     [1]               1
4.7a     undersized+degraded+peered     [0]           0     [0]               0
4.78     undersized+degraded+peered     [6]           6     [6]               6
4.77     undersized+degraded+peered     [7]           7     [7]               7
4.76     undersized+degraded+peered     [5]           5     [5]               5
4.75     undersized+degraded+peered     [4]           4     [4]               4
4.73     undersized+degraded+peered     [1]           1     [1]               1
4.71     undersized+degraded+peered     [8]           8     [8]               8
4.6f     undersized+degraded+peered     [8]           8     [8]               8
4.6e     undersized+degraded+peered    [11]          11    [11]              11
4.6d     undersized+degraded+peered     [1]           1     [1]               1
4.6c     undersized+degraded+peered     [0]           0     [0]               0
4.6b     undersized+degraded+peered     [2]           2     [2]               2
4.6a     undersized+degraded+peered     [1]           1     [1]               1
4.69     undersized+degraded+peered     [7]           7     [7]               7
4.66     undersized+degraded+peered     [4]           4     [4]               4
4.65     undersized+degraded+peered     [1]           1     [1]               1
4.64     undersized+degraded+peered     [9]           9     [9]               9
4.63     undersized+degraded+peered     [2]           2     [2]               2
4.62     undersized+degraded+peered     [0]           0     [0]               0
4.60     undersized+degraded+peered     [7]           7     [7]               7
4.5f     undersized+degraded+peered    [11]          11    [11]              11
4.5d     undersized+degraded+peered     [5]           5     [5]               5
4.5c     undersized+degraded+peered     [5]           5     [5]               5
4.e      undersized+degraded+peered     [8]           8     [8]               8
4.19     undersized+degraded+peered     [2]           2     [2]               2
4.5      undersized+degraded+peered     [6]           6     [6]               6
4.a      undersized+degraded+peered     [3]           3     [3]               3
4.1b     undersized+degraded+peered     [3]           3     [3]               3
4.c      undersized+degraded+peered     [4]           4     [4]               4
4.47     undersized+degraded+peered     [5]           5     [5]               5
4.1e     undersized+degraded+peered    [10]          10    [10]              10
4.1f     undersized+degraded+peered     [6]           6     [6]               6
4.35     undersized+degraded+peered    [10]          10    [10]              10
4.7f     undersized+degraded+peered     [6]           6     [6]               6
4.1      undersized+degraded+peered     [3]           3     [3]               3
5.0      active+undersized+degraded  [10,4]          10  [10,4]              10
4.44     undersized+degraded+peered     [5]           5     [5]               5
4.20     undersized+degraded+peered     [1]           1     [1]               1
4.5b     undersized+degraded+peered     [4]           4     [4]               4
4.1d     undersized+degraded+peered     [4]           4     [4]               4
4.32     undersized+degraded+peered     [2]           2     [2]               2
4.8      undersized+degraded+peered    [11]          11    [11]              11
4.43     undersized+degraded+peered     [8]           8     [8]               8
4.2d     undersized+degraded+peered    [10]          10    [10]              10
4.9      undersized+degraded+peered     [6]           6     [6]               6
4.4c     undersized+degraded+peered     [3]           3     [3]               3
4.2e     undersized+degraded+peered     [0]           0     [0]               0
4.2c     undersized+degraded+peered    [11]          11    [11]              11
4.2b     undersized+degraded+peered     [1]           1     [1]               1
4.26     undersized+degraded+peered     [7]           7     [7]               7
4.59     undersized+degraded+peered    [11]          11    [11]              11
4.2a     undersized+degraded+peered     [4]           4     [4]               4
4.0      undersized+degraded+peered    [10]          10    [10]              10
4.3b     undersized+degraded+peered     [4]           4     [4]               4
4.25     undersized+degraded+peered     [5]           5     [5]               5
4.b      undersized+degraded+peered     [8]           8     [8]               8
4.4e     undersized+degraded+peered     [2]           2     [2]               2
4.28     undersized+degraded+peered     [6]           6     [6]               6
4.2      undersized+degraded+peered     [9]           9     [9]               9
4.27     undersized+degraded+peered    [11]          11    [11]              11
4.f      undersized+degraded+peered     [1]           1     [1]               1
4.42     undersized+degraded+peered    [10]          10    [10]              10
4.11     undersized+degraded+peered     [3]           3     [3]               3
4.54     undersized+degraded+peered     [4]           4     [4]               4
4.13     undersized+degraded+peered     [4]           4     [4]               4
4.56     undersized+degraded+peered     [3]           3     [3]               3
4.15     undersized+degraded+peered     [5]           5     [5]               5
4.16     undersized+degraded+peered     [3]           3     [3]               3
4.17     undersized+degraded+peered     [7]           7     [7]               7
4.4a     undersized+degraded+peered     [2]           2     [2]               2
4.18     undersized+degraded+peered     [9]           9     [9]               9
4.53     undersized+degraded+peered     [9]           9     [9]               9
4.1c     undersized+degraded+peered     [6]           6     [6]               6
4.57     undersized+degraded+peered     [3]           3     [3]               3
4.3c     undersized+degraded+peered     [0]           0     [0]               0
4.3d     undersized+degraded+peered     [4]           4     [4]               4
4.3e     undersized+degraded+peered     [2]           2     [2]               2
4.3f     undersized+degraded+peered     [5]           5     [5]               5
4.40     undersized+degraded+peered     [3]           3     [3]               3
4.46     undersized+degraded+peered     [0]           0     [0]               0
4.4b     undersized+degraded+peered     [2]           2     [2]               2
4.4f     undersized+degraded+peered     [3]           3     [3]               3
ok
 
Code:
Name   Size Min Size
main_3    2        2

There you go. That pool has a size of 2. That means that some PGs have only one replica present, because the only other one was on the lost node; and since min_size is also 2, those PGs go inactive and block IO.

Ceph should recover those once the DOWN OSDs are set to OUT (should happen after 10 min automatically).
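
The 10 minutes come from the mon_osd_down_out_interval option (default 600 seconds). A sketch of how it can be checked, with the caveat that a value set in ceph.conf rather than in the config database has to be looked up there instead:
Code:
ceph config get mon mon_osd_down_out_interval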

To prevent that in the future, give it a size of 3. You should also set a target ratio. Any value will work right now as it is the only pool (ignore .mgr). Then the autoscaler can calculate the optimal PG num for that pool so that each OSD ends up with around 100 PGs, which should also improve performance and data distribution among the OSDs.
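
A sketch of the corresponding commands, using the pool name from the output above (the ratio value is just an example; with a single pool any positive ratio means "use all the space"):
Code:
ceph osd pool set main_3 size 3
ceph osd pool set main_3 target_size_ratio 1.0
ceph osd pool autoscale-status   # check the optimal PG num the autoscaler calculates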
 
Should you ever plan to have more nodes per room, the following CRUSH rule would be better, as it makes sure that replicas need to end up on different hosts:
Code:
rule replicate_3rooms {
    id {RULE ID}
    type replicated
    step take default
    step choose firstn 0 type room
    step chooseleaf firstn 1 type host
    step emit
}
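
A pool would then be switched to that rule with something like (sketch, {pool} being the pool name):
Code:
ceph osd pool set {pool} crush_rule replicate_3rooms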
 
Tks a lot! Couldn't see it. No idea WHY it's 2 instead of 3... Now I've run
ceph osd pool set main_3 size 3
and it started recovery...
It won't finish before virt3 comes back online, but I hope it will prevent this problem the next time a node goes down.
 
One more thing: if you want to prevent people (including yourself ;) ) from changing the size property, you can run
Code:
ceph osd pool set {pool} nosizechange true
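It can be lifted again with ceph osd pool set {pool} nosizechange false whenever an intentional change is needed.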
 
Should you ever plan to have more nodes per room, the following CRUSH rule would be better, as it makes sure that replicas need to end up on different hosts:
Code:
rule replicate_3rooms {
    id {RULE ID}
    type replicated
    step take default
    step choose firstn 0 type room
    step chooseleaf firstn 1 type host
    step emit
}
Not sure I understand. If there's one replica per room, and each host can only be in one room, two replicas cannot be on the same host...
 
Not sure I understand. If there's one replica per room, and each host can only be in one room, two replicas cannot be on the same host...
That is true... Especially if you don't set the "size" larger than 3.
The additional step to distribute it per host is one more failsafe, just in case you have more nodes per room and a pool with a larger size.
 
Ah, ok.
So if I have 3 rooms, 3 nodes per room and size=5 I can be sure that downing a single host won't take down more than one replica. Downing a room, on the other hand, will take down up to two replicas.
Right?

I'm actually thinking about adding OSDs to have a "capacity pool" on HDDs, and thought about how to install just the Ceph components w/o the full Proxmox overhead while maintaining manageability from Proxmox... Probably not worth it: either install more Proxmox nodes or set up a separate Ceph cluster managed via cephadm (I already refined some Salt states to make this really easy)...
 
So if I have 3 rooms, 3 nodes per room and size=5 I can be sure that downing a single host won't take down more than one replica. Downing a room, on the other hand, will take down up to two replicas.
Right?
Yep. Even though the situations sound a bit contrived. But in reality, who knows what sequence of steps might lead to something similar :)

If you want to have different device classes and want specific pools to make use of only one, you need to add one more step to match the device class.
For example, to have a slow and fast pool.

Then the rule should look something like this:
Code:
rule replicate_3rooms {
    id {RULE ID}
    type replicated
    step take default class ssd
    step choose firstn 0 type room
    step chooseleaf firstn 1 type host
    step emit
}

See the additional info in the first step take default line!
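
A sketch of how to check which device classes the OSDs currently have:
Code:
ceph osd crush class ls            # e.g. ["ssd"] in this cluster; "hdd" would show up once HDD OSDs are added
ceph osd crush tree --show-shadow  # per-class shadow hierarchy the rules operate on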

If you want to use Ceph, but without Proxmox VE, you will have to switch it all. Ceph can be deployed in a few ways, and they are not all compatible with each other. So if you install it bare metal with cephadm, that's what you get. Don't manage it with Proxmox VE, because the Proxmox tools expect a specific environment (Proxmox-specific helpers and ways of deploying services) that is not compatible with other deployment options.
You can connect to it from Proxmox VE as an external RBD or CephFS storage.
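
As a rough sketch (storage ID, monitor addresses and pool name are made up; the keyring of the external cluster has to be copied to /etc/pve/priv/ceph/<storage-id>.keyring):
Code:
# /etc/pve/storage.cfg
rbd: ext-ceph
        content images
        monhost 10.0.0.1 10.0.0.2 10.0.0.3
        pool vm_pool
        username admin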
 