Ceph rbd mirror force promote

willybong

Hello everyone,

I’m currently setting up my first Ceph mirror configuration and have a few questions regarding its behavior.
For example, I’m uncertain about how to force-promote an image on my DR cluster (site-b) during the synchronization process.
From what I’ve read in the documentation, in a disaster scenario occurring during synchronization, a force-promote operation promotes the last snapshot received by the DR cluster. However, as noted:

"Since this mode is not as fine-grained as journaling, the complete delta between two snapshots will need to be synced prior to use during a failover scenario. Any partially applied set of deltas will be rolled back at the moment of failover."


When I attempt to force-promote an image, I encounter the following error:

Code:
root@pve1-b:~# rbd mirror image promote ceph-pool/vm-103-disk-1 --force
2025-01-09T09:42:40.412+0100 7983d4e006c0 -1 librbd::mirror::snapshot::util:  can_create_primary_snapshot: cannot rollback
2025-01-09T09:42:40.412+0100 7983d4e006c0 -1 librbd::mirror::snapshot::PromoteRequest: 0x7983b0001d40 send: cannot promote
2025-01-09T09:42:40.412+0100 7983d4e006c0 -1 librbd::mirror::PromoteRequest: 0x7983b401a810 handle_promote: failed to promote image: (22) Invalid argument
rbd: error promoting image to primary
2025-01-09T09:42:40.412+0100 7983d84f1780 -1 librbd::api::Mirror: image_promote: failed to promote image

I’ve checked the snapshots on my DR cluster (site-b) and always see the latest snapshot of the image present there.

I have configured periodic snapshots to run every 3 minutes.
On the main cluster (site-a), I always retain the last 5 snapshots, while on the DR cluster (site-b), only the most recent snapshot is kept.
I assume that this latest snapshot is overwritten during the synchronization process.
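For completeness, I set up the schedule roughly like this (the pool name is from my setup, and the exact invocation may differ):

Code:
root@pve1-a:~# rbd mirror snapshot schedule add --pool ceph-pool 3m
root@pve1-a:~# rbd mirror snapshot schedule ls --pool ceph-pool --recursive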

My main question is: How does Ceph handle promotion for an image when the data hasn’t been fully received on the DR cluster (site-b)?

Thank you!

Regards
 
Hi, we also have a Wiki page for Ceph Mirroring [0]. If site A is still available, you first need to demote it:

Promote images on site B

By promoting an image or all images in a pool, we can tell Ceph that they are now the primary ones to be used. In a planned failover, we would first demote the images on site A before we promote the images on site B (see the demote sketch below). In a recovery situation with site A down, we need to `--force` the promotion.
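For a planned failover, the demotion on site A would look roughly like this (single image or whole pool):

Code:
root@site-a $ rbd mirror image demote <pool>/<image>
root@site-a $ rbd mirror pool demote <pool>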

To promote a single image, run the following command:

Code:
root@site-b $ rbd mirror image promote <pool>/<image> --force

To promote all images in a pool, run the following command:

Code:
root@site-b $ rbd mirror pool promote <pool> --force
After this, our guests should start fine.
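To double-check that the images are now primary, something like the following can be used (rbd info reports a mirroring primary flag for mirrored images):

Code:
root@site-b $ rbd info <pool>/<image> | grep -A 3 mirroring
root@site-b $ rbd mirror pool status <pool> --verbose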



[0] https://pve.proxmox.com/wiki/Ceph_RBD_Mirroring#Failover_Recovery
 


Hi @KevinS,

Thank you for your response.
In a planned failover, the system works perfectly.

But the issue I’m facing is that, in a recovery scenario (with no connection between site-a and site-b, similar to a DR scenario), there might be ongoing synchronization processes.
As a result, the complete image may not be fully available on site-b.

For this reason, when I attempt to force-promote an image (VM disk), the command returns the error I mentioned earlier.

In general, during a disaster scenario, I cannot guarantee that all synchronization processes have been fully completed.

The message in the output, "cannot rollback," makes me think that Ceph doesn’t have a restore point to revert to the previous snapshot.
From what I’ve observed, Ceph RBD mirror in snapshot mode keeps, by default, 5 snapshots on site-a and 1 snapshot on site-b (the recovery site). However, if I interrupt the incremental sync process for the single snapshot on site-b, the image is no longer available, even if I use promote --force.
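For reference, the copy state of the single non-primary snapshot on site-b can be inspected with:

Code:
root@pve1-b:~# rbd snap ls --all ceph-pool/vm-103-disk-1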
Is what I’m saying correct?
I hope I’ve been as clear as possible.

Thank you
 
Hi @KevinS,
I wanted to share my findings with you:
When I attempt to force-promote an image whose latest snapshot is only 88% copied and still in a syncing state, the force promotion gets stuck.
Please see snapshot number 1247065 below:

Code:
Image: vm-103-disk-0
Snapshots:
SNAPID   NAME                                                                                           SIZE    PROTECTED  TIMESTAMP                 NAMESPACE                                                 
1247064  .mirror.non_primary.0e0c83ba-b709-4bf9-832d-013ca194b00e.fff33483-93fa-4923-8240-6ca810a68744  21 GiB             Fri Jan 10 16:54:01 2025  mirror (non-primary peer_uuids:[] a946fac3-067c-47c1-a80c-05187eb77a30:431652 copied)
----------------------------------------
Image: vm-103-disk-1
Snapshots:
SNAPID   NAME                                                                                           SIZE   PROTECTED  TIMESTAMP                 NAMESPACE                                                   
1247059  .mirror.non_primary.ab2549b1-78bc-48f5-be58-85cdc470b3ae.3bb6f0ff-b22d-4d6b-b210-2f30edbbc011  5 GiB             Fri Jan 10 16:51:00 2025  mirror (non-primary peer_uuids:[] a946fac3-067c-47c1-a80c-05187eb77a30:431647 copied)
1247065  .mirror.non_primary.ab2549b1-78bc-48f5-be58-85cdc470b3ae.2f65824c-f6bc-49f9-9982-c4804fa9a0e1  5 GiB             Fri Jan 10 16:54:01 2025  mirror (non-primary peer_uuids:[] a946fac3-067c-47c1-a80c-05187eb77a30:431653 88% copied)
----------------------------------------
Image: vm-104-disk-0
Snapshots:
SNAPID   NAME                                                                                           SIZE    PROTECTED  TIMESTAMP                 NAMESPACE                                                 
1247066  .mirror.non_primary.04c18313-1f08-4343-89dd-b236a6e3937a.97d6f893-aa52-4d16-a951-cd9c3410711b  21 GiB             Fri Jan 10 16:54:01 2025  mirror (non-primary peer_uuids:[] a946fac3-067c-47c1-a80c-05187eb77a30:431654 copied)

This command hangs indefinitely:
Code:
rbd mirror image promote ceph-pool/vm-103-disk-1 --force

I tried to pin down where it hangs; interrupting my wrapper script with Ctrl+C gives the following traceback:

Code:
 subprocess.run(promote_command, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
  File "/usr/lib/python3.11/subprocess.py", line 550, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/subprocess.py", line 1199, in communicate
    self.wait()
  File "/usr/lib/python3.11/subprocess.py", line 1262, in wait
    return self._wait(timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/subprocess.py", line 1997, in _wait
    (pid, sts) = self._try_wait(0)
                 ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/subprocess.py", line 1955, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
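As a defensive workaround in my wrapper script (it does not fix the underlying hang), a timeout can be passed to subprocess.run so the script fails instead of blocking forever. A minimal sketch, with a hypothetical hard-coded command in place of how my script actually builds promote_command:

Code:
import subprocess

# Hypothetical example; my script builds promote_command dynamically.
promote_command = ["rbd", "mirror", "image", "promote",
                   "ceph-pool/vm-103-disk-1", "--force"]

try:
    # Give up after 60 seconds instead of blocking forever on a stuck promotion.
    subprocess.run(promote_command, check=True, timeout=60,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
except subprocess.TimeoutExpired:
    # subprocess.run kills the child process before raising this exception.
    print("promotion timed out; the image is probably still syncing")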

This is a very critical situation: if a disaster hits site-a and communication between site-a and site-b is down, I cannot promote site-b.
I hope I’ve been as clear as possible.

Thank you
bye