Stretched Cluster (dual DC) with Ceph - disaster recovery

Mar 27, 2021
Hello all, new member of the forum here... looking for help and advice.

We are trying to set up a stretched cluster across two DCs with Ceph.
Our setup is the following:
- 6 nodes (3 in each DC)
- 10 Gbit network between the DCs, with a latency of 1.5 ms
- 4x 2 TB NVMe OSDs per node
- Ceph pool using the default 3 replicas
- A Ceph monitor on each node

We are trying to simulate a disaster (e.g. one DC going down) by cutting network connectivity to the 3 nodes in one DC.
We then attempt to recover the cluster and the workloads (about 15 VMs at the moment, for testing).

The first issue we ran into was the lack of quorum in Proxmox. "pvecm expected 3" appears to solve this rather quickly.
The second issue is the lack of quorum for the Ceph monitors (at least 4 out of 6 are required).
From the research we have done so far, the only solution appears to be updating the monmap and removing the monitors we lost (at least one of them, so quorum can be established). This looks a bit intrusive, since it is unclear how Ceph will behave once the "lost" monitors return... Plus the initial test was negative; it basically broke the monitor.

Is there any alternative solution to restore quorum for the Ceph monitors?
And secondly (I do expect many "you are not supposed to do this" responses), is there anything we should improve in the setup, e.g. up the Ceph replicas to 4?

Many thanks!
Cheers, Dimitar
 
The number of nodes should be odd. And you are not forced to put a monitor on each node; it is better to run 3 or 5, so the side with fewer monitors loses when the datacenters are disconnected.
 
Thanks for the response...
The services running on PVE will be accessed exclusively over the internet, and we are trying to mitigate the risk of a DC going down (e.g. catching fire).
If the DC with 3 monitors is down, we are still not able to achieve quorum with the remaining 2 monitors...
I am not that concerned about the link between the DCs going down, since the link is redundant.

Thanks
 
Is there anyone who was successful in updating the monmap?
I am following the instructions here, but the monitors are not coming back up after the map inject.
The monmap is injected correctly, but the ceph-mon@pve.service is failing and I am not able to troubleshoot why (nothing in the monitor logs).

Code:
root@pve13:~# systemctl status ceph.target
● ceph.target - ceph target allowing to start/stop all ceph*@.service instances at once
   Loaded: loaded (/lib/systemd/system/ceph.target; enabled; vendor preset: enabled)
   Active: active since Sun 2021-05-30 10:56:46 UTC; 5min ago

May 30 10:56:46 pve13 systemd[1]: Reached target ceph target allowing to start/stop all ceph*@.service instances at once.
root@pve13:~# systemctl status ceph-mon@pve13.service
● ceph-mon@pve13.service - Ceph cluster monitor daemon
   Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
  Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
           └─ceph-after-pve-cluster.conf
   Active: failed (Result: exit-code) since Sun 2021-05-30 10:57:37 UTC; 4min 31s ago
  Process: 1607259 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id pve13 --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
 Main PID: 1607259 (code=exited, status=1/FAILURE)

May 30 10:57:37 pve13 systemd[1]: ceph-mon@pve13.service: Service RestartSec=10s expired, scheduling restart.
May 30 10:57:37 pve13 systemd[1]: ceph-mon@pve13.service: Scheduled restart job, restart counter is at 5.
May 30 10:57:37 pve13 systemd[1]: Stopped Ceph cluster monitor daemon.
May 30 10:57:37 pve13 systemd[1]: ceph-mon@pve13.service: Start request repeated too quickly.
May 30 10:57:37 pve13 systemd[1]: ceph-mon@pve13.service: Failed with result 'exit-code'.
May 30 10:57:37 pve13 systemd[1]: Failed to start Ceph cluster monitor daemon.
root@pve13:~#
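
For reference, the steps I am following are roughly the "removing monitors from an unhealthy cluster" procedure from the Ceph docs; the mon IDs below are just examples from my setup, so treat this as a sketch rather than a verified recipe:

Code:
# on a surviving monitor node, stop the monitor first
systemctl stop ceph-mon@pve13.service
# extract the current monmap from the stopped monitor
ceph-mon -i pve13 --extract-monmap /tmp/monmap
# remove the unreachable monitors from the map
monmaptool /tmp/monmap --rm pve11
monmaptool /tmp/monmap --rm pve12
# inject the edited map, fix ownership and start the monitor again
ceph-mon -i pve13 --inject-monmap /tmp/monmap
chown -R ceph:ceph /var/lib/ceph/mon/ceph-pve13
systemctl start ceph-mon@pve13.service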
 
Thanks for the response...
The services running on PVE will be accessed exclusively over the internet, and we are trying to mitigate the risk of a DC going down (e.g. catching fire).
If the DC with 3 monitors is down, we are still not able to achieve quorum with the remaining 2 monitors...
I am not that concerned about the link between the DCs going down, since the link is redundant.

Thanks
So basically you are in a situation like two local servers connected together, and Proxmox experts will tell you that this does not work because you also need a quorum.
 
This is very similar, not to say identical, to the issue I have - https://forum.proxmox.com/threads/problem-with-ceph-cluster-without-quorum.81123/

At this stage I have deployed a 7th node on a VM that I can back up and make redundant in both DCs.

I rebuilt the whole cluster and am testing again... this time quorum is maintained for both Proxmox and Ceph (a step forward).
Ceph, however, stops responding after 2 nodes are down... looking at implementing a datacenter bucket in the CRUSH map.
Complicated stuff... I have been reading the whole day and cannot wrap my head around it :(
 
Looks like I am the only one posting here... never mind, I will continue o_O

Updating the buckets in the CRUSH map turned out to be very simple; it is very well described here - https://access.redhat.com/documenta...ions_guide/handling-a-data-center-failure-ops
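
Roughly, the changes came down to commands like these (DC1/DC2 and the host names are placeholders for my layout; a sketch of that procedure, not a verbatim copy):

Code:
# create datacenter buckets and hang them under the CRUSH root
ceph osd crush add-bucket DC1 datacenter
ceph osd crush add-bucket DC2 datacenter
ceph osd crush move DC1 root=default
ceph osd crush move DC2 root=default
# move each host (with its OSDs) into the matching datacenter bucket
ceph osd crush move pve11 datacenter=DC1
ceph osd crush move pve12 datacenter=DC1
ceph osd crush move pve13 datacenter=DC1
ceph osd crush move pve21 datacenter=DC2
ceph osd crush move pve22 datacenter=DC2
ceph osd crush move pve23 datacenter=DC2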

Something I learned is that Ceph pauses the OSDs when nodes are down (it looks like the second node going down causes this).

Code:
root@pve:~# ceph osd unpause
pauserd,pausewr is unset

Next, need to play with min/max replicas a bit more...
 
Dear Proxmox Team, I would really appreciate some brain cells here... please

I am trying to follow the Stretched Cluster instructions from the Ceph docs - https://docs.ceph.com/en/latest/rados/operations/stretch-mode/#stretch-clusters

However, most of the commands do not work / are not recognised, e.g.:

Code:
ceph mon set_location pve11 datacenter=NLW
ceph mon set election_strategy connectivity

I am not able to validate which version of Ceph supports this; supposedly it should already be in Octopus.

Please help, thanks!
 
I am trying to follow the Stretched Cluster instructions from the Ceph docs - https://docs.ceph.com/en/latest/rados/operations/stretch-mode/#stretch-clusters

However, most of the commands do not work / are not recognised, e.g.:

Code:
ceph mon set_location pve11 datacenter=NLW
ceph mon set election_strategy connectivity
I am not able to validate which version of Ceph supports this; supposedly it should already be in Octopus.
at least the docs only reference this for Pacific and up:
https://docs.ceph.com/en/pacific/rados/operations/stretch-mode/#stretch-clusters
https://docs.ceph.com/en/octopus/rados/operations/stretch-mode/#stretch-clusters -> not found

so my guess is that it only works with Pacific
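
For completeness, the Pacific docs describe roughly this sequence (the mon names, locations and the CRUSH rule name below are placeholders adapted from the examples there, not something tested here):

Code:
# switch monitor elections to the connectivity strategy
ceph mon set election_strategy connectivity
# tell each monitor which datacenter it lives in
ceph mon set_location pve11 datacenter=NLW
ceph mon set_location pve21 datacenter=NLE
# the tiebreaker monitor gets its own, third location
ceph mon set_location pve10 datacenter=arbiter
# enable stretch mode: tiebreaker mon, a CRUSH rule for the stretched pools, dividing bucket type
ceph mon enable_stretch_mode pve10 stretch_rule datacenter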
 
many thanks for the response!

This will be the silver bullet for what we are trying to achieve...

Do you happen to know if Pacific adoption is on the roadmap?
We are thinking we can go live and adopt this functionality once it becomes available.
 
Do you happen to know if Pacific adoption is on the roadmap?
Pacific should be available sometime later this year with 7.0

This will be the silver bullet for what we are trying to achieve...
after reading the docs, this "feature" seems dangerous to me, especially this part:

[...]the surviving data center will enter a degraded stretch mode. This will issue a warning, reduce the min_size to 1, and allow the cluster to go active with data in the single remaining site.[...]

in my experience, setting min_size to 1 is never a good idea, and with this, it happens automatically when one site goes down

not to speak of the split-brain possibility (they do not mention that scenario, so I have no idea how they handle it)

all in all, this is not something I would rush into production...
 
Thanks for the comments Dominik, much appreciated.
The split-brain issue we are planning to address with a 3rd datacenter (or a VM that will VPN into the environment).

I am failing to make the Ceph cluster operational after the 3 nodes in DC1 go down.
I have updated the CRUSH map - created two datacenter buckets, added them to the root, and added the hosts inside.
I have tried increasing the replicas to min_size 3 / size 4, and also tried 1/3. Quorum is in place.
With two nodes down the Ceph cluster is still operational, but the moment the 3rd node goes down it looks like it pauses all reads and writes.
Recovery/rebalancing is not triggered, and I really don't know what I am doing wrong...

Here's the cluster health; any suggestions on what I can do to restore operations in such a scenario?

Code:
root@pve21:~# ceph health detail
HEALTH_WARN 2/5 mons down, quorum pve21,pve22,pve10; 1 datacenter (12 osds) down; 12 osds down; 3 hosts (12 osds) down;
Reduced data availability: 26 pgs inactive, 33 pgs stale;
Degraded data redundancy: 67926/130284 objects degraded (52.137%), 26 pgs degraded, 26 pgs undersized;
295 slow ops, oldest one blocked for 221 sec, daemons [osd.10,osd.11,osd.15,osd.16,osd.17,osd.21,osd.22,osd.23,osd.3,osd.4]... have slow ops.
[WRN] MON_DOWN: 2/5 mons down, quorum pve21,pve22,pve10
    mon.pve11 (rank 0) addr [v2:10.0.0.11:3300/0,v1:10.0.0.11:6789/0] is down (out of quorum)
    mon.pve12 (rank 2) addr [v2:10.0.0.12:3300/0,v1:10.0.0.12:6789/0] is down (out of quorum)
[WRN] OSD_DATACENTER_DOWN: 1 datacenter (12 osds) down
    datacenter DC1 (root=default) (12 osds) is down
[WRN] OSD_DOWN: 12 osds down
    osd.0 (root=default,datacenter=DC1,host=pve11) is down
    osd.1 (root=default,datacenter=DC1,host=pve12) is down
    osd.2 (root=default,datacenter=DC1,host=pve13) is down
    osd.6 (root=default,datacenter=DC1,host=pve11) is down
    osd.7 (root=default,datacenter=DC1,host=pve12) is down
    osd.8 (root=default,datacenter=DC1,host=pve13) is down
    osd.12 (root=default,datacenter=DC1,host=pve11) is down
    osd.13 (root=default,datacenter=DC1,host=pve12) is down
    osd.14 (root=default,datacenter=DC1,host=pve13) is down
    osd.18 (root=default,datacenter=DC1,host=pve11) is down
    osd.19 (root=default,datacenter=DC1,host=pve12) is down
    osd.20 (root=default,datacenter=DC1,host=pve13) is down
[WRN] OSD_HOST_DOWN: 3 hosts (12 osds) down
    host pve11 (root=default,datacenter=DC1) (4 osds) is down
    host pve12 (root=default,datacenter=DC1) (4 osds) is down
    host pve13 (root=default,datacenter=DC1) (4 osds) is down
[WRN] PG_AVAILABILITY: Reduced data availability: 26 pgs inactive, 33 pgs stale
    pg 1.0 is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [13]
    pg 2.0 is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [7]
    pg 2.1 is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [13]
    pg 2.2 is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [13]
    pg 2.3 is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [7]
    pg 2.4 is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [13]
    pg 2.5 is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [1]
    pg 2.6 is stuck stale for 2h, current state stale+active+clean, last acting [12,8,19]
    pg 2.7 is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [13]
    pg 2.8 is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [1]
    pg 2.9 is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [13]
    pg 2.a is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [1]
    pg 2.b is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [7]
    pg 2.c is stuck stale for 2h, current state stale+active+clean, last acting [2,0,19]
    pg 2.d is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [1]
    pg 2.e is stuck stale for 2h, current state stale+active+clean+laggy, last acting [18,2,19]
    pg 2.f is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [7]
    pg 2.10 is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [7]
    pg 2.11 is stuck stale for 2h, current state stale+active+clean, last acting [19,12,20]
    pg 2.12 is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [7]
    pg 2.13 is stuck stale for 2h, current state stale+active+clean, last acting [20,19,18]
    pg 2.14 is stuck stale for 2h, current state stale+active+clean, last acting [19,2,0]
    pg 2.15 is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [13]
    pg 2.16 is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [7]
    pg 2.17 is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [1]
    pg 2.18 is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [1]
    pg 2.19 is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [13]
    pg 2.1a is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [1]
    pg 2.1b is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [7]
    pg 2.1c is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [1]
    pg 2.1d is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [7]
    pg 2.1e is stuck stale for 2h, current state stale+active+clean, last acting [6,2,19]
    pg 2.1f is stuck stale for 2h, current state stale+undersized+degraded+peered, last acting [7]
[WRN] PG_DEGRADED: Degraded data redundancy: 67926/130284 objects degraded (52.137%), 26 pgs degraded, 26 pgs undersized
    pg 1.0 is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [13]
    pg 2.0 is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [7]
    pg 2.1 is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [13]
    pg 2.2 is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [13]
    pg 2.3 is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [7]
    pg 2.4 is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [13]
    pg 2.5 is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [1]
    pg 2.7 is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [13]
    pg 2.8 is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [1]
    pg 2.9 is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [13]
    pg 2.a is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [1]
    pg 2.b is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [7]
    pg 2.d is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [1]
    pg 2.f is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [7]
    pg 2.10 is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [7]
    pg 2.12 is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [7]
    pg 2.15 is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [13]
    pg 2.16 is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [7]
    pg 2.17 is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [1]
    pg 2.18 is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [1]
    pg 2.19 is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [13]
    pg 2.1a is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [1]
    pg 2.1b is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [7]
    pg 2.1c is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [1]
    pg 2.1d is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [7]
    pg 2.1f is stuck undersized for 2h, current state stale+undersized+degraded+peered, last acting [7]
[WRN] SLOW_OPS: 295 slow ops, oldest one blocked for 221 sec, daemons [osd.10,osd.11,osd.15,osd.16,osd.17,osd.21,osd.22,osd.23,osd.3,osd.4]... have slow ops.
 
I think I managed, looking for feedback...

Changed "osd_pool_default_min_size = 1" in ceph.conf as well as changed the pool minsize to 1.
This basically keeps the PG's in active state which allows HA to migrate the VMs and continue the services when DC1 failure occurs.
Obviously the crush map buckets/placement groups (I used datacenter) needs to be well defined for this to work.
This is obviously risky state (only one replica), next step is to "out" the OSD's of the failed nodes with "ceph osd out ID".
This allows the cluster to rebalance and achieve the desired replica count of 3.

When DC1 is restored, all we need to do is to "in" the OSD's back to the cluster and wait for rebalance.
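
In command form, the failover and recovery steps above look roughly like this (the pool name "rbd" is a placeholder; a sketch, not a verified runbook - the OSD IDs are the DC1 ones from the health output above):

Code:
# during the DC1 outage: allow IO with a single surviving replica
ceph osd pool set rbd min_size 1
# mark the OSDs of the failed DC1 nodes out so the surviving DC re-replicates
ceph osd out 0 1 2 6 7 8 12 13 14 18 19 20
# once DC1 is back: bring its OSDs back in and let Ceph rebalance
ceph osd in 0 1 2 6 7 8 12 13 14 18 19 20
# then restore the safer minimum once the cluster is healthy again
ceph osd pool set rbd min_size 2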

I will re-test this again tomorrow. Anything that I am missing?
 
So basically you are in a situation like two local servers connected together, and Proxmox experts will tell you that this does not work because you also need a quorum.

Just coming back on this to say that the statement above is correct.

We were not able to modify the monmap, plus this action creates a much bigger risk for the cluster once it is restored.
Instead we will place another node in a 3rd datacenter, which ensures quorum is maintained for both PVE and Ceph.
The CRUSH map describes the physical layout of the buckets (root > datacenter > host > OSD), and we ended up with a 4/2 replication rule (size=4, min_size=2).
Once we bring one DC down, PVE and Ceph maintain quorum, HA kicks in, and we maintain operations with 2 replicas in the second DC.
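
For reference, the 4/2 rule is conceptually the standard two-datacenter CRUSH rule, something like the sketch below (rule name and id are illustrative; check it against your own decompiled CRUSH map before using it):

Code:
# place 2 copies in each of the 2 datacenter buckets -> 4 replicas, 2 per DC
rule replicated_2dc {
    id 1
    type replicated
    # legacy rule fields, not the pool min_size
    min_size 1
    max_size 10
    step take default
    step choose firstn 0 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}

The pool then just points at it with "ceph osd pool set <pool> crush_rule replicated_2dc", plus size 4 and min_size 2.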

For anyone reading this in the future, the best solution would be to stretch the cluster over 3 datacenters.
Unfortunately the 3rd DC we have has a higher latency (15 ms), which is above the recommended latency for Ceph.
 
I have been researching a bit about this today:

- The split-brain situation is avoided because the 5th mon needs to be placed in a different datacenter, cloud, or wherever.

- hepo's 4/2 configuration has the issue stretch mode is trying to fix: as min_size is 2, a PG can be active with only 2 replicas in the same DC. If you lose that DC, the PG is lost.

- When in degraded stretch mode, min_size is 1 because it isn't reasonable to expect a 100% perfect Ceph cluster in one DC. With min_size=2 it would be too easy to block writes in the surviving DC with, for example, a simple OSD failure... No one uses size=2/min_size=2 pools for a reason... :)

Just my understanding of the issue ;)
 
