RBD mirroring slow in Proxmox

tawh

Active Member
Mar 26, 2019
Hello all,

I have two Proxmox clusters, A and B (both updated to the latest version).
Cluster A: 3 hosts; 2 of them each have a single 10T disk plus a 256GB SSD for the OS, using bluestore and bcache. The third host only has minimal hardware and exists purely for cluster quorum; it is not intended to host any data or VMs.
Cluster B: 1 host, with a single 10T disk and a 256GB SSD for the OS, bluestore and bcache.

I configured Ceph on both clusters:
Cluster A: 2 OSDs, 3 mons, 3 mgrs.
I also configured rbd-mirror between the two Ceph clusters.
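For reference, the mirroring itself is set up per image with journaling, roughly along these lines (pool/image names and the peer name below are only placeholders, not the exact commands I ran):
Code:
# on both clusters: enable mirroring on the pool in per-image mode
rbd mirror pool enable rbd image

# on each cluster: register the other cluster as a peer (user/cluster names are examples)
rbd mirror pool peer add rbd client.rbd-mirror-peer@remote

# per image: make sure the journaling feature is on, then enable mirroring
rbd feature enable rbd/vm-100-disk-0 journaling
rbd mirror image enable rbd/vm-100-disk-0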

All hosts are in the same subnet on a 1Gbps LAN.

Problem:
When I force a resync of the image from primary to secondary (bootstrapping),
I get about 600~700 Mbps of network utilization, which can be taken as the practical ceiling for disk speed and network bandwidth here.
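(The forced resync was triggered on the receiving cluster with something like the following; the pool/image name is a placeholder.)
Code:
# run against the non-primary cluster; schedules a full re-copy of the image
rbd mirror image resync rbd/vm-100-disk-0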

After the resync, I copy a 4GB file into the image (the image is actually a Windows OS, so I start the VM and copy the file from outside).
Inside Windows, the copy speed is around 60 MBytes per second (~480Mbps).

However, the replay is very slow and the behaviour is strange.
(a) Without any tuning, the primary host sends ~300Mbps for about 1 second, then the network sits idle for around 10 seconds, and this pattern repeats until the replay finishes.
Code:
[global]
         auth_client_required = cephx
         auth_cluster_required = cephx
         auth_service_required = cephx
         cluster_network = 172.16.0.10/24
         fsid = e5ab22c7-6876-4b68-9f43-d67edd4175c2
         mon_allow_pool_delete = true
         mon_host =  172.16.0.10
         osd_pool_default_min_size = 2
         osd_pool_default_size = 1
         public_network = 172.16.0.10/24

[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring
         rbd_default_features = 125

(b) With the following "optimization" configuration, the network utilization sits at a steady ~32Mbps until the replay ends (see the note below on where each of these options actually takes effect).
Code:
[global]
         auth_client_required = cephx
         auth_cluster_required = cephx
         auth_service_required = cephx
         cluster_network = 172.16.0.10/24
         fsid = e5ab22c7-6876-4b68-9f43-d67edd4175c2
         mon_allow_pool_delete = true
         mon_host =  172.16.0.10
         osd_pool_default_min_size = 2
         osd_pool_default_size = 1
         public_network = 172.16.0.10/24
         rbd_journal_max_payload_bytes = 524288
         rbd_mirror_journal_max_fetch_bytes = 1048576
         rbd_mirror_image_state_check_interval = 5
         rbd_mirror_pool_replayers_refresh_interval = 5
         rbd_mirror_sync_point_update_age = 5

[client]
         keyring = /etc/pve/priv/$cluster.$name.keyring
         rbd_default_features = 125
In both scenarios it took around 15 minutes to complete the replay.
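One note on the options in configuration (b): as far as I understand it, rbd_journal_max_payload_bytes acts on the librbd client that writes the journal (the primary side), while rbd_mirror_journal_max_fetch_bytes acts on the rbd-mirror daemon doing the replay (the secondary side). So each option has to end up in the ceph.conf that the respective process reads, and the process has to be restarted before it takes effect, roughly:
Code:
# on the secondary: restart the rbd-mirror daemon (instance id is a placeholder)
systemctl restart ceph-rbd-mirror@<id>.service

# on the primary: the VM (librbd client) only picks up the new journal
# payload size after the image is reopened, e.g. a VM restart or migration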

How can I use the full bandwidth for replaying (i.e. reach the speed seen during bootstrapping)?
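For what it's worth, this is how I watched the replay on the secondary (pool/image names are placeholders):
Code:
# per-image state; shows "up+replaying" and how far the replay lags behind
rbd mirror image status rbd/vm-100-disk-0

# summary for the whole pool
rbd mirror pool status rbd --verbose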
 
What I want to achieve is to use the full bandwidth for replaying (i.e. like the speed during bootstrapping)
Journal-based mirroring needs to replay the writes exactly as they happened on the primary image in order to be crash-consistent. This is very likely why the transfer takes its time. Snapshot-based mirroring was only introduced recently in Octopus and could speed things up, but it's not yet available on Proxmox VE.
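Just for reference (it cannot be used on Proxmox VE yet), snapshot-based mirroring in Octopus is enabled per image and driven by a snapshot schedule, roughly like this (pool/image name and interval are placeholders):
Code:
rbd mirror pool enable rbd image
rbd mirror image enable rbd/vm-100-disk-0 snapshot
rbd mirror snapshot schedule add --pool rbd --image vm-100-disk-0 5m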
 
Journal-based mirroring needs to replay the writes exactly as they happened on the primary image in order to be crash-consistent. This is very likely why the transfer takes its time. Snapshot-based mirroring was only introduced recently in Octopus and could speed things up, but it's not yet available on Proxmox VE.

Thanks for your reply.

I understand the "replay" behaviour. But the fact of the matter is that I can write to the primary Ceph storage at about 480Mbps, while the secondary Ceph only replays at ~30Mbps. I also configured mirroring in both directions so I can test the reverse, and the result is the same. I have been researching this problem on the Internet for around 3 to 4 months with no progress at all.

If that is simply how rbd mirroring performs, I don't see a use case for that kind of synchronization. o_O Has any user built a practical rbd mirror environment?
 
After the resync, I copy a 4GB file into the image (the image is actually a Windows OS, so I start the VM and copy the file from outside).
Inside Windows, the copy speed is around 60 MBytes per second (~480Mbps).
If you refer to this speed, then please bear in mind that there are a couple of layers in between until the Ceph storage is reached. What performance does a rados bench show?
https://proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
 
If you refer to this speed, then please bear in mind that there are a couple of layers in between until the Ceph storage is reached. What performance does a rados bench show?
https://proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark

Code:
rados bench 60 write -b 4M -t 16 --no-cleanup
Cluster A:
Code:
Total time run:         60.3526
Total writes made:      2798
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     185.443
Stddev Bandwidth:       66.8916
Max bandwidth (MB/sec): 352
Min bandwidth (MB/sec): 64
Average IOPS:           46
Stddev IOPS:            16.7229
Max IOPS:               88
Min IOPS:               16
Average Latency(s):     0.345065
Stddev Latency(s):      0.138755
Max latency(s):         0.959484
Min latency(s):         0.0310857

Cluster B:
Code:
Total time run:         60.6366
Total writes made:      1977
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     130.416
Stddev Bandwidth:       88.9065
Max bandwidth (MB/sec): 492
Min bandwidth (MB/sec): 12
Average IOPS:           32
Stddev IOPS:            22.2266
Max IOPS:               123
Min IOPS:               3
Average Latency(s):     0.490735
Stddev Latency(s):      0.299194
Max latency(s):         1.53251
Min latency(s):         0.02516
 
There is a big deviation (stddev) on those clusters. This will bring down performance. It is more or less expected given the size and hardware of the clusters.

For journal-based mirroring, while the VM writes data onto Ceph, the same data is also read back from it. At best this cuts the performance in half again.
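Since the replay also reads the journal back from the cluster, a read benchmark can be interesting as well. Assuming the objects from a previous write run with --no-cleanup are still there, something like this (pool name is a placeholder):
Code:
# sequential read benchmark against the objects left by the write benchmark
rados bench 60 seq -t 16 -p rbd

# remove the benchmark objects afterwards
rados -p rbd cleanup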
 
There is a big deviation (stddev) on those clusters. This will bring down performance. It is more or less expected given the size and hardware of the clusters.

For journal-based mirroring, while the VM writes data onto Ceph, the same data is also read back from it. At best this cuts the performance in half again.

So does such deviation bring the replay speed of the mirror down to about 1/16 of the bootstrapping speed? From the network utilization graph, the bandwidth used for replaying was very stable at around ~30Mbps.

By the way, with the best or optimal configuration, what speed can be expected for both bootstrapping and replaying? Is there any real-life configuration I could use as a reference?

Thanks.
 
So does such deviation bring the replay speed of the mirror down to about 1/16 of the bootstrapping speed? From the network utilization graph, the bandwidth used for replaying was very stable at around ~30Mbps.
I suppose the difference is the write pattern. Running along the journal might just introduce many small writes.
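One way to see how the journal is drained is to look at it directly while the replay runs, roughly (pool/image names are placeholders):
Code:
# basic journal metadata (object size, splay width, journal pool)
rbd journal info --pool rbd --image vm-100-disk-0

# registered journal clients and their commit positions,
# i.e. how far the mirror replay lags behind the master
rbd journal status --pool rbd --image vm-100-disk-0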

By the way, with the best or optimal configuration, what speed can be expected for both bootstrapping and replaying? Is there any real-life configuration I could use as a reference?
This may be best answered by other users here in the forum.
 
Can any member share their experience with rbd mirror?
If rbd mirror is not practical in terms of performance, are there any block-level real-time replication tools that can be used with Proxmox?

Thanks.
 
Can any member share their experience with rbd mirror?
If rbd mirror is not practical in terms of performance, are there any block-level real-time replication tools that can be used with Proxmox?

Thanks.

I am trying to play with rbd-mirror as well, following the howto on the wiki:

https://pve.proxmox.com/wiki/Ceph_RBD_Mirroring

The primary cluster is composed of seven nodes, each with four 2TB bluestore OSDs with SSD cache.

The secondary cluster has three nodes with a similar configuration. Both clusters share a dedicated 10Gb/s storage network.

The mirrored image is a Debian buster installation, with native (without mirroring) write performance of ~400MB/s for a few GB of zeroes. With mirroring enabled, performance decreases to ~120MB/s, but this was expected. The replay speed is very slow, as @tawh reported; after the write operation (6GB of zeros with dd) completes, I can demote the primary image, but I have to wait several minutes for the secondary to finish replaying before I can promote it.
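The failover sequence I am testing is roughly the following (pool/image names are placeholders); the wait happens between the demote and the promote:
Code:
# on the primary cluster, once writes have stopped
rbd mirror image demote rbd/vm-100-disk-0

# wait until the secondary has caught up, then on the secondary
rbd mirror image status rbd/vm-100-disk-0
rbd mirror image promote rbd/vm-100-disk-0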
 
...
The replay speed is very slow, as @tawh reported; after the write operation (6GB of zeros with dd) completes, I can demote the primary image, but I have to wait several minutes for the secondary to finish replaying before I can promote it.

After some research, I came across this post on the ceph-users mailing list:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028898.html

«If you are trying to optimize for 128KiB writes, you might need to tweak
the "rbd_journal_max_payload_bytes" setting since it currently is defaulted
to split journal write events into a maximum of 16KiB payload [1] in order
to optimize the worst-case memory usage of the rbd-mirror daemon for
environments w/ hundreds or thousands of replicated images.»

Other posts mention tuning "rbd_mirror_journal_max_fetch_bytes":

https://lists.ceph.io/hyperkitty/li...SKEB5X3N4S4/#IHMGNFLWCD5E5R4W5S2BSSKEB5X3N4S4

Anyway, as Jason Dillaman says in the first post, it seems that the default values,

"rbd_journal_max_payload_bytes": "16384"
"rbd_mirror_journal_max_fetch_bytes": "32768"

are chosen on purpose for the aforementioned scenario (hundreds or thousands of replicated images).
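If someone wants to check or override those defaults without editing ceph.conf everywhere, something like the following should work on recent releases (the values are only examples, not a recommendation):
Code:
# show the description and built-in default of each option
ceph config help rbd_journal_max_payload_bytes
ceph config help rbd_mirror_journal_max_fetch_bytes

# override them centrally in the monitor config database (values in bytes)
ceph config set client rbd_journal_max_payload_bytes 131072
ceph config set client rbd_mirror_journal_max_fetch_bytes 262144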

rob
 
After some research, I came across this post on the ceph-users mailing list:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028898.html

«If you are trying to optimize for 128KiB writes, you might need to tweak
the "rbd_journal_max_payload_bytes" setting since it currently is defaulted
to split journal write events into a maximum of 16KiB payload [1] in order
to optimize the worst-case memory usage of the rbd-mirror daemon for
environments w/ hundreds or thousands of replicated images.»

Other posts mention tuning "rbd_mirror_journal_max_fetch_bytes":

https://lists.ceph.io/hyperkitty/li...SKEB5X3N4S4/#IHMGNFLWCD5E5R4W5S2BSSKEB5X3N4S4

Anyway, as Jason Dillaman says in the first post, it seems that the default values,

"rbd_journal_max_payload_bytes": "16384"
"rbd_mirror_journal_max_fetch_bytes": "32768"

are chosen on purpose for the aforementioned scenario (hundreds or thousands of replicated images).

rob

Wow, I had forgotten about this thread since nobody replied for several weeks. Thanks a lot for bringing it up again.
I also tried playing with those parameters, but it didn't help. In the end I gave up on rbd-mirror and used LINSTOR instead.