After upgrading to Ceph Reef, getting DB spillover.

BloodBlight

I have a slightly unusual configuration, but nothing too crazy, I don't think. I have 2 OSDs (HDDs) per DB/WAL device: the DB is on an SSD and the WAL is on a tiny Optane drive. This is a 3-node cluster using erasure coding and a custom balancing algorithm (something I cooked up). Okay, "weirdness" out of the way.

After upgrading to Ceph Reef (not right away, about two days after), I now get this warning:
Code:
4 OSD(s) experiencing BlueFS spillover

Ceph health detail shows:
Code:
[WRN] BLUEFS_SPILLOVER: 4 OSD(s) experiencing BlueFS spillover
     osd.3 spilled over 3.2 GiB metadata from 'db' device (19 GiB used of 45 GiB) to slow device
     osd.4 spilled over 4.3 GiB metadata from 'db' device (19 GiB used of 45 GiB) to slow device
     osd.5 spilled over 3.1 GiB metadata from 'db' device (20 GiB used of 45 GiB) to slow device
     osd.6 spilled over 4.5 GiB metadata from 'db' device (18 GiB used of 45 GiB) to slow device

Note that the amount spilling over is very small compared to the size of the DBs and their remaining free space (each is still over 50% free).
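In case anyone wants to check the same numbers per OSD, the BlueFS perf counters should show the same split (run on the node hosting the OSD, and swap in your own osd ID):

Bash:
# db_used_bytes vs slow_used_bytes is the BlueFS data on the fast DB device vs the slow device
ceph daemon osd.3 perf dump | grep -E 'db_used_bytes|slow_used_bytes'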

I checked bluestore_max_alloc_size; it is set to 0 on all OSDs.
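(For reference, I checked it with something along these lines, just looping over the affected osd IDs:)

Bash:
# print the running value of bluestore_max_alloc_size for each affected OSD
for i in 3 4 5 6; do
    echo -n "osd.$i: "
    ceph config show osd.$i bluestore_max_alloc_size
done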

Ideas? Things to check?
 
Small update to this. OSD 3 seems to have reduced its spillage, while the others increased. There is no replication happening at this time, and very little I/O in general... OSD 5 seems to have increased by more than a GB of DB spillover. This seems really odd.

Code:
[WRN] BLUEFS_SPILLOVER: 4 OSD(s) experiencing BlueFS spillover
     osd.3 spilled over 3.1 GiB metadata from 'db' device (20 GiB used of 45 GiB) to slow device
     osd.4 spilled over 4.6 GiB metadata from 'db' device (19 GiB used of 45 GiB) to slow device
     osd.5 spilled over 4.3 GiB metadata from 'db' device (19 GiB used of 45 GiB) to slow device
     osd.6 spilled over 4.7 GiB metadata from 'db' device (18 GiB used of 45 GiB) to slow device
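If anyone wants to watch how these numbers drift over time, something as simple as this does the job:

Bash:
# re-run 'ceph health detail' every 60 seconds and keep only the spillover lines
watch -n 60 "ceph health detail | grep -i 'spilled over'"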
 
The method nh2 mentioned works. Here is an example for osd.1 for those who do not want to search for it themselves:

Bash:
# compact the OSD's RocksDB first (can be done while the OSD is running)
ceph tell 'osd.1' compact

# the OSD has to be stopped before bluefs-bdev-migrate can run
systemctl stop ceph-osd@1.service

# move the spilled-over BlueFS data from the slow (block) device back to the DB device
ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-1/ --devs-source /var/lib/ceph/osd/ceph-1/block --dev-target /var/lib/ceph/osd/ceph-1/block.db

systemctl start ceph-osd@1.service
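Afterwards you can confirm the warning cleared for that OSD with something like:

Bash:
# the BLUEFS_SPILLOVER line for osd.1 should be gone now
ceph health detail | grep -i 'spilled over'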
 
My cluster had 21 of these when I upgraded, so I wrote a script to take care of it.


Bash:
#!/bin/bash

set -ex

# OSD ID is passed as the first argument
export OSD=$1

# look up the OSD's fsid via ceph-volume
export FSID=$(ceph-volume lvm list ${OSD} | grep -m 1 "osd fsid" | awk '{print $3}')

# pull the VG/LV name of the OSD's main (block) device out of the same listing
export VG_NAME=$(ceph-volume lvm list ${OSD} | grep '\[block\]' | awk '{print $2}' | awk -F '/' '{print $3"/"$4}')

# the OSD must be stopped while BlueFS data is migrated between devices
systemctl stop ceph-osd@${OSD}.service
ceph-volume lvm migrate --osd-id ${OSD} --osd-fsid ${FSID} --from db wal --target ${VG_NAME}
systemctl start ceph-osd@${OSD}.service

Usage, to fix OSDs 1-7 on a host:
Bash:
for x in 1 2 3 4 5 6 7; do ./migrate_db.sh $x; done

This must be run on each host for only the OSDs that appear on that host.
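If you do not want to list the IDs by hand, a rough sketch like this picks up whatever OSDs have a data directory on the local host (assuming the usual /var/lib/ceph/osd/ceph-<id> layout):

Bash:
# run the migration script for every OSD whose data directory exists on this host
for d in /var/lib/ceph/osd/ceph-*; do
    ./migrate_db.sh "${d##*-}"   # strip everything up to the last '-' to get the OSD ID
done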

Also, this is a script that does things to your ceph installation without confirmation and is provided as a reference only. Make backups and all that.