After enabling Ceph pool one-way mirroring, pool usage grows constantly and the pool could fill up soon

Whatever

After upgrading to PVE 6 and Ceph 14.2.4, I enabled pool mirroring to an independent node (following the PVE wiki).
Since then my pool usage has been growing constantly, even though no VM disk changes are made.

Could anybody help me figure out where my space is going?
Pool usage is going to become critical very soon.

The total size of all disks of all VMs is ~8 TB
1572969909123.png

However, the pool uses almost triple the total size of the VM disks (with replica = 2)

1572970224905.png


1572969968448.png


The pool is defined as follows:
4 nodes: 3x 6 TB disks each (class hdd) + cache tier 3x 480 GB SSD in writeback mode.

1572970312622.png

Ceph cluster is healthy:

Code:
root@pve-node1:~# ceph -s
  cluster:
    id:     c2d639ef-c720-4c85-ac77-2763ecaa0a5e
    health: HEALTH_OK

  services:
    mon: 4 daemons, quorum pve-node3,pve-node4,pve-node2,pve-node1 (age 8d)
    mgr: pve-node2(active, since 8d), standbys: pve-node3, pve-node4, pve-node1
    mds: cephfs:1 {0=pve-node3=up:active}
    osd: 33 osds: 33 up (since 4w), 33 in (since 4w)

  data:
    pools:   4 pools, 1344 pgs
    objects: 5.43M objects, 28 TiB
    usage:   57 TiB used, 31 TiB / 88 TiB avail
    pgs:     1343 active+clean
             1    active+clean+scrubbing+deep

  io:
    client:   3.0 MiB/s rd, 123 MiB/s wr, 253 op/s rd, 338 op/s wr
    cache:    156 MiB/s flush, 9.4 MiB/s evict, 2 op/s promote

On the remote node mirroring is working:
Code:
root@pve-backup:~# rbd mirror pool status rbd
health: OK
images: 18 total
    18 replaying


However, on the cluster nodes the same command gives:
Code:
root@pve-node1:~# rbd mirror pool status rbd --verbose
health: WARNING
images: 18 total
    18 unknown

vm-100-disk-0:
  global_id:   1beee4c3-331e-48bc-8926-fc21ff4cf00f
  state:       down+unknown
  description: status not found
  last_update:

vm-101-disk-0:
  global_id:   3ec05a6e-70c5-48dc-bd6a-df3d6d3a4dc9
  state:       down+unknown
  description: status not found
  last_update:

vm-101-disk-1:
  global_id:   1c32375c-28e0-4d81-aced-d10d58934ae7
  state:       down+unknown
  description: status not found
  last_update:

vm-102-disk-0:
  global_id:   efbd6c50-b27b-490e-95cf-10229f29a3ff
  state:       down+unknown
  description: status not found
  last_update:

vm-103-disk-0:
  global_id:   b62600d6-d8d0-4896-94cd-c74cc5dd4e66
  state:       down+unknown
  description: status not found
  last_update:

vm-104-disk-0:
  global_id:   76adbfe9-9ca1-46cf-b40b-c75999204a41
  state:       down+unknown
  description: status not found
  last_update:

vm-104-disk-1:
  global_id:   4de05037-c917-4ed1-98f5-b3d775481938
  state:       down+unknown
  description: status not found
  last_update:

vm-104-disk-2:
  global_id:   38cf89e3-0c2f-4f08-ab1a-89fb44c5acc4
  state:       down+unknown
  description: status not found
  last_update:

vm-104-disk-3:
  global_id:   9a5b345d-2450-4f64-9dd8-3306632a5ef8
  state:       down+unknown
  description: status not found
  last_update:

vm-105-disk-0:
  global_id:   1850b05c-b54e-4218-b055-f74d9e1dfac4
  state:       down+unknown
  description: status not found
  last_update:

vm-105-disk-1:
  global_id:   61ae2168-a4f3-48a8-8be7-194614e998fc
  state:       down+unknown
  description: status not found
  last_update:

vm-105-disk-2:
  global_id:   3cf1311b-fa74-498a-9a8b-c60236ad1b0e
  state:       down+unknown
  description: status not found
  last_update:

vm-105-disk-3:
  global_id:   fb8d0ad2-962d-43f6-af81-0d1a6abeb9f6
  state:       down+unknown
  description: status not found
  last_update:

vm-106-disk-0:
  global_id:   7ce1b1c3-59e4-4f1e-a934-12ace60a570c
  state:       down+unknown
  description: status not found
  last_update:

vm-107-disk-0:
  global_id:   c0aa5873-9e65-4b89-a53f-49a3fc351716
  state:       down+unknown
  description: status not found
  last_update:

vm-108-disk-0:
  global_id:   e1857775-cb9d-4bfa-b45c-5ed4ad49694e
  state:       down+unknown
  description: status not found
  last_update:

vm-108-disk-1:
  global_id:   fadc0e51-3f0c-4470-a3c2-6073079a1f91
  state:       down+unknown
  description: status not found
  last_update:

vm-111-disk-1:
  global_id:   e5a3aa06-db0e-498d-afbe-602ed0a28b53
  state:       down+unknown
  description: status not found
  last_update:

root@pve-node1:~#  ceph osd pool ls detail
pool 13 'rbd' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 14784 lfor 13285/13285/13643 flags hashpspool,selfmanaged_snaps tiers 15 read_tier 15 write_tier 15 stripe_width 0 application rbd
        removed_snaps [1~2d]
pool 15 'cache' replicated size 2 min_size 1 crush_rule 2 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 14784 lfor 13285/13285/13285 flags hashpspool,incomplete_clones,selfmanaged_snaps tier_of 13 cache_mode writeback target_bytes 1869169767219 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 0s x0 decay_rate 0 search_last_n 0 stripe_width 0 application rbd
        removed_snaps [1~2d]
pool 16 'cephfs_data' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 13340 flags hashpspool stripe_width 0 application cephfs
pool 17 'cephfs_metadata' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 13344 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs

And I can see lots of journal_data objects in the main rbd pool:
Code:
root@pve-node1:~# rados -p rbd ls
...
rbd_data.12760829ba8dbe.0000000000021599
journal_data.13.2449a28e62bde7.1684855
journal_data.13.2449a28e62bde7.1068770
rbd_data.241461564eae34.0000000000015997
journal_data.13.2449a28e62bde7.946935
rbd_data.2449a28e62bde7.00000000000047fa
journal_data.13.2449a28e62bde7.848572
journal_data.13.2449a28e62bde7.1621718
journal_data.13.2449a28e62bde7.1116442
rbd_data.12760829ba8dbe.00000000000115dc
rbd_data.12760829ba8dbe.000000000000b103
journal_data.13.2449a28e62bde7.563886
rbd_data.2449a28e62bde7.000000000004c201
rbd_data.2449a28e62bde7.000000000002c7d4
journal_data.13.2449a28e62bde7.1639724
rbd_data.23bd5e38499a26.0000000000006fb8
journal_data.13.2449a28e62bde7.1481473
journal_data.13.2449a28e62bde7.1355715
rbd_data.23bd5e38499a26.0000000000032e72
...
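
To see which image's journal is holding the space, the listing can be grouped per journal. This is only a minimal sketch: it assumes the object names follow the journal_data.<pool_id>.<image_id>.<object_number> pattern seen in the listing, and count_journal_objects is a made-up helper name, not part of Ceph.

```shell
#!/bin/sh
# Count journal_data objects per journal, reading object names on stdin.
# Assumption: names look like journal_data.<pool_id>.<image_id>.<object_number>.
# count_journal_objects is a hypothetical helper name.
count_journal_objects() {
    awk -F. '$1 == "journal_data" { n[$2 "." $3]++ }
             END { for (k in n) print k, n[k] }' | sort
}

# Usage on a cluster node (not run here):
#   rados -p rbd ls | count_journal_objects
```

Each output line is "<pool_id>.<image_id> <count>". The image id should match the id that rbd info prints for the image, so the biggest journal can be mapped back to a VM disk.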


Any ideas? I would very much appreciate any assistance!
 
FYI: Replica/min 2/1 is not ideal, and we strongly recommend not using that mode if your data is important to you.
2/2 is at least safe (avoids split brain); 3/2 is better.

Ceph mirroring requires RBD image journaling, which uses some space on its own. It seems a bit much, to be honest, but that could explain it.

You did not enable two-way mirroring, did you?
 
I had to lower Replica/min from 3/2 to 2/1 to get some "extra space". Any idea why the journal data is not removed after the backup node has pulled it?
If I'm not mistaken, I set up a one-way mirror. How can I check that?

Thanks
 
Tomas, could you please confirm that, with one-way mirroring, the following command:
Code:
rbd mirror pool status rbd  --verbose

gives normal output when run on the backup node:
Code:
root@pve-backup:~# rbd mirror pool status rbd
health: OK
images: 18 total
    18 replaying

and a warning when run on the main cluster:
Code:
health: WARNING
images: 18 total
    18 unknown

Or is something misconfigured here?
 
Do I understand you correctly that two-way mirroring requires installing the rbd-mirror daemon on both sides (master and backup cluster)?

However, the PVE wiki clearly states:

  • rbd-mirror installed on the backup cluster ONLY (apt install rbd-mirror).

With PVE 6.4 I still get

health: WARNING
images: 18 total
18 unknown

on the master Ceph cluster
 
@Whatever ... hi, i ran into the same problem.
the usage grew to 5 TB within 10 days after enabling the journaling flag.

did you solve the problem?

kind regards,
ronny
 
unbelievable that there is so little information about this
which versions of proxmox and ceph are you using?
 
Hi, I've had the same issue and I'm facing two different kinds of problems:
A) some images have journal_data objects which keep increasing in number until I disable the journaling feature; then the journal_data objects are slowly "released" and my Ceph space starts to return to normal (i.e. usage consistently decreases)

B) some images have journal_data objects that are stuck; for example, a couple of VMs have 2560 and 150k journal_data objects stuck, and the number doesn't change during the day, even though on the DR site "rbd mirror image status" tells me that "entries_behind_primary" is 0

My Ceph version is 15.2.14

Does anyone have an update on why journal_data objects aren't released after a successful rbd mirror?
I've tried looking for documentation on this topic but can't really find anything that helps me understand why the mirroring can't keep up with the journal (case A), or whether it's normal to have journal_data objects even though my mirror is up to date.
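
For anyone comparing numbers, the lag counter can be pulled out of the status text with a small filter. A rough sketch only: it assumes the description line contains either entries_behind_master=N (older releases) or "entries_behind_primary":N (the form quoted above), and entries_behind is a made-up helper name.

```shell
#!/bin/sh
# Extract the replay lag counter from "rbd mirror image status" output on stdin.
# Assumption: the description contains entries_behind_master=N or
# "entries_behind_primary":N; the exact format differs between Ceph releases.
# entries_behind is a hypothetical helper name.
entries_behind() {
    grep -oE 'entries_behind_(master|primary)"?[=:] ?[0-9]+' | grep -oE '[0-9]+$'
}

# Usage on the DR site (not run here):
#   rbd mirror image status rbd/vm-100-disk-0 | entries_behind
```

If this prints 0 while the journal_data object count on the primary stays high, that matches case B: the mirror has caught up but the journal is not being trimmed.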
 
we have also switched to snapshot mode for now and are trying our luck
we will try journal-based mirroring again in the future
 
now we have put new ssds in the server and upgraded to pve 7.2.
one week later, our vms got high i/o wait values.
You could also try to enable krbd for the storage. With it enabled, VMs will be started using the kernel rbd implementation and not the Qemu one.
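
To check whether krbd is already set for a storage, something like the following could be used. It is only a sketch: it assumes the usual /etc/pve/storage.cfg layout ("rbd: <id>" section headers followed by indented "key value" lines), and storage_has_krbd is a made-up helper name.

```shell
#!/bin/sh
# Exit 0 if the given RBD storage section has "krbd 1", reading storage.cfg on stdin.
# Assumption: sections start with "<type>: <id>" at column 0 and options are indented.
# storage_has_krbd is a hypothetical helper name.
storage_has_krbd() {
    awk -v id="$1" '
        /^[a-z]+:/ { in_sec = ($1 == "rbd:" && $2 == id); next }
        in_sec && $1 == "krbd" && $2 == "1" { found = 1 }
        END { exit !found }
    '
}

# Usage on a PVE node (not run here):
#   storage_has_krbd vm-ceph < /etc/pve/storage.cfg && echo "krbd enabled"
# Enabling it should be possible with: pvesm set vm-ceph --krbd 1
```

The storage id vm-ceph is just the name used in the notes in this thread. Running VMs keep their current access path until they are stopped and started again.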
 
thanks for that, we will do this. i think we hit something ... see attachment
it's the i/o wait of one of our linux vms; we updated on 26.4. and the problems started on 2.5.

any idea why the space keeps growing in journal mode?
 

Attachment: io_snaphot_based.png (31.2 KB)
we had snapshot-based mirroring running for about 50 days without problems.
now we have put new ssds in the server and upgraded to pve 7.2.
one week later, our vms got high i/o wait values.

we didn't find anything and had to disable the snapshots.

maybe we hit the bug in:
pve-qemu-kvm=6.2.0-7
https://forum.proxmox.com/threads/p...reeze-if-backing-up-large-disks.109272/page-4


we will test again next week

Did you follow the PVE wiki or another tutorial?
From my perspective the PVE wiki is not 100% suitable for the current Ceph version when setting up one-way mirroring.
I tried several times and always got the errors described in this post
https://forum.proxmox.com/threads/ceph-rbd-mirroring-snapshot-based-not-working.91431/#post-418988
 
i think, more or less, i followed that
i had some trouble understanding the master and backup names. these were used later in the setup
"handle_notify: our own notification, ignoring" sounds strange

here are my notes from the setup ... not clear and a little chaotic ;)

While each cluster sees itself as ceph, the backup cluster sees the master cluster as master. This is set by the name of the config and keyring file.

- enable mirroring for pool
# master cluster
rbd mirror pool enable vm-ceph pool|image
(snapshot mirroring only works with mode image)

- add cluster peer manually

# master cluster
ceph auth get-or-create client.rbd-mirror.master mon 'profile rbd' osd 'profile rbd' -o /etc/pve/priv/master.client.rbd-mirror.master.keyring
scp /etc/ceph/ceph.conf root@<rbd-mirror-node>:/etc/ceph/master.conf
scp /etc/pve/priv/master.client.rbd-mirror.master.keyring root@<rbd-mirror-node>:/etc/pve/priv/


# backup cluster
ceph auth get-or-create client.rbd-mirror.backup mon 'profile rbd' osd 'profile rbd' -o /etc/pve/priv/ceph.client.rbd-mirror.backup.keyring

- start rbd-mirror (install first)

# backup cluster
systemctl enable ceph-rbd-mirror.target
cp /lib/systemd/system/ceph-rbd-mirror@.service /etc/systemd/system/ceph-rbd-mirror@.service
sed -i -e 's/setuser ceph.*/setuser root --setgroup root/' /etc/systemd/system/ceph-rbd-mirror@.service
systemctl enable ceph-rbd-mirror@rbd-mirror.backup.service
systemctl start ceph-rbd-mirror@rbd-mirror.backup.service

- add peer

# backup cluster
rbd mirror pool enable vm-ceph pool|image
rbd mirror pool peer add vm-ceph client.rbd-mirror.master@master

- verify
rbd mirror pool info vm-ceph
rbd mirror pool status vm-ceph --verbose


- remove peer

# both clusters
rbd mirror pool info vm-ceph
-> UUID: ffdef20a-19a2-40c5-8f0e-26ca54ed71a0
rbd mirror pool peer remove vm-ceph ffdef20a-19a2-40c5-8f0e-26ca54ed71a0
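
With mode image, mirroring also has to be switched on per image. A small dry-run sketch of that step, which only prints the commands so they can be reviewed first. Assumptions: the pool was enabled with mode image, the Ceph release supports snapshot-based mirroring (Octopus or later), and print_snapshot_mirror_cmds is a made-up helper name.

```shell
#!/bin/sh
# Print (do not run) the commands that would enable snapshot-based mirroring
# for each image name read from stdin. The pool name is the first argument.
# print_snapshot_mirror_cmds is a hypothetical helper name.
print_snapshot_mirror_cmds() {
    pool=$1
    while IFS= read -r image; do
        [ -n "$image" ] || continue   # skip empty lines
        printf 'rbd mirror image enable %s/%s snapshot\n' "$pool" "$image"
    done
}

# Usage on the master cluster (not run here):
#   rbd ls vm-ceph | print_snapshot_mirror_cmds vm-ceph
```

Once the printed commands look right, they can be piped to sh. A mirror snapshot schedule still has to be added separately.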

native german speaker :)



we actually need to switch back from the standby to the live cluster
so i have to follow these steps in the other direction. so, i will test again


ronny
 
Thanks! I will try to set up mirroring one more time following your notes
 
Any chance people with issues could paste the output of
rbd journal status and rbd journal info?
They should show more about the current state of the journal.
I just had an issue where a VM would not start (the start timed out), but the KVM process remained.
I checked with the commands above, and it turns out that the remote cluster is fine (it shows that it's replaying the latest journal commits).
The local cluster, however, is stuck replaying commits.

Thus the journal objects do not disappear, since they are needed for the local image to work.

What happens at reboot is that when the VM starts, all the journal commits are replayed; when they are complete the KVM process can continue.

There is also rbd journal reset, which causes the same thing but live (I/O to the disk stops in the VM); it also triggers a full sync on the remote cluster.

rbd journal takes one image as a parameter.

Not sure why the local cluster is not working as intended though. Maybe it's slow and the default settings are too low. I'm trying to send a mail to the mailing list about it :)
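
To compare how far each journal client has committed, the status output can be reduced to one line per client. This is a sketch under assumptions: it expects the "registered clients" lines to look like [id=<uuid>, commit_position=[positions=[[object_number=..., tag_tid=..., entry_tid=N], ...]], state=...], with the local client having an empty id. journal_client_positions is a made-up helper name.

```shell
#!/bin/sh
# Print "<client id> <highest committed entry_tid>" for each registered client,
# reading "rbd journal status" output on stdin.
# Assumption: client lines start with "[id=" and list entry_tid=N values;
# the local (image) client has an empty id and is shown as "(local)".
# journal_client_positions is a hypothetical helper name.
journal_client_positions() {
    awk '
        /\[id=/ {
            id = $0
            sub(/^.*\[id=/, "", id); sub(/,.*$/, "", id)
            if (id == "") id = "(local)"
            max = -1; rest = $0
            while (match(rest, /entry_tid=[0-9]+/)) {
                tid = substr(rest, RSTART + 10, RLENGTH - 10) + 0
                if (tid > max) max = tid
                rest = substr(rest, RSTART + RLENGTH)
            }
            print id, max
        }
    '
}

# Usage on the primary cluster (not run here):
#   rbd journal status --pool rbd --image vm-100-disk-0 | journal_client_positions
```

If the (local) client's number stays far behind the mirror peer's, that would fit the situation described above where the local side is stuck and the journal cannot be trimmed.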