After upgrading to Ceph Reef, getting DB spillover.

BloodBlight

I have a slightly unusual configuration, but nothing too crazy, I don't think. I have 2 OSDs (HDDs) per DB/WAL device: the DB is on an SSD and the WAL is on a tiny Optane drive. This is a 3-node cluster using erasure coding and a custom balancing algorithm (something I cooked up). Okay, "weirdness" out of the way.

After upgrading to Ceph Reef (not right away, about two days after), I now get this warning:
Code:
4 OSD(s) experiencing BlueFS spillover

Ceph health detail shows:
Code:
[WRN] BLUEFS_SPILLOVER: 4 OSD(s) experiencing BlueFS spillover
     osd.3 spilled over 3.2 GiB metadata from 'db' device (19 GiB used of 45 GiB) to slow device
     osd.4 spilled over 4.3 GiB metadata from 'db' device (19 GiB used of 45 GiB) to slow device
     osd.5 spilled over 3.1 GiB metadata from 'db' device (20 GiB used of 45 GiB) to slow device
     osd.6 spilled over 4.5 GiB metadata from 'db' device (18 GiB used of 45 GiB) to slow device

Note that the amount spilling over is very small compared to the size of the DBs and their remaining free space (each is still over 50% free).
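In case anyone wants to check the same numbers per OSD, the BlueFS perf counters should show the same split (run on the node hosting the OSD, and swap in your own osd ID):

Bash:
# db_used_bytes vs slow_used_bytes is the BlueFS data on the fast DB device vs the slow device
ceph daemon osd.3 perf dump | grep -E 'db_used_bytes|slow_used_bytes'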

I checked bluestore_max_alloc_size; it is set to 0 on all OSDs.
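(For reference, I checked it with something along these lines, just looping over the affected osd IDs:)

Bash:
# print the running value of bluestore_max_alloc_size for each affected OSD
for i in 3 4 5 6; do
    echo -n "osd.$i: "
    ceph config show osd.$i bluestore_max_alloc_size
done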

Ideas? Things to check?
 
Small update to this. OSD 3 seems to have reduced its spillage, while the others increased. There is no replication happening at this time, and very little I/O in general... OSD 5 seems to have increased by more than a GB of DB spillover. This seems really odd.

Code:
[WRN] BLUEFS_SPILLOVER: 4 OSD(s) experiencing BlueFS spillover
     osd.3 spilled over 3.1 GiB metadata from 'db' device (20 GiB used of 45 GiB) to slow device
     osd.4 spilled over 4.6 GiB metadata from 'db' device (19 GiB used of 45 GiB) to slow device
     osd.5 spilled over 4.3 GiB metadata from 'db' device (19 GiB used of 45 GiB) to slow device
     osd.6 spilled over 4.7 GiB metadata from 'db' device (18 GiB used of 45 GiB) to slow device
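If anyone wants to watch how these numbers drift over time, something as simple as this does the job:

Bash:
# re-run 'ceph health detail' every 60 seconds and keep only the spillover lines
watch -n 60 "ceph health detail | grep -i 'spilled over'"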
 
The method nh2 mentioned works. Here is an example for osd.1 for those who do not want to search for it themselves:

Bash:
# compact the OSD's RocksDB first (can be done while the OSD is running)
ceph tell 'osd.1' compact

# the OSD has to be stopped before bluefs-bdev-migrate can run
systemctl stop ceph-osd@1.service

# move the spilled-over BlueFS data from the slow (block) device back to the DB device
ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-1/ --devs-source /var/lib/ceph/osd/ceph-1/block --dev-target /var/lib/ceph/osd/ceph-1/block.db

systemctl start ceph-osd@1.service
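Afterwards you can confirm the warning cleared for that OSD with something like:

Bash:
# the BLUEFS_SPILLOVER line for osd.1 should be gone now
ceph health detail | grep -i 'spilled over'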
 
My cluster had 21 of these when I upgraded, so I wrote a script to take care of it.


Bash:
#!/bin/bash

set -ex

# OSD ID is passed as the first argument
export OSD=$1

# look up the OSD's fsid via ceph-volume
export FSID=$(ceph-volume lvm list ${OSD} | grep -m 1 "osd fsid" | awk '{print $3}')

# pull the VG/LV name of the OSD's main (block) device out of the same listing
export VG_NAME=$(ceph-volume lvm list ${OSD} | grep '\[block\]' | awk '{print $2}' | awk -F '/' '{print $3"/"$4}')

# the OSD must be stopped while BlueFS data is migrated between devices
systemctl stop ceph-osd@${OSD}.service
ceph-volume lvm migrate --osd-id ${OSD} --osd-fsid ${FSID} --from db wal --target ${VG_NAME}
systemctl start ceph-osd@${OSD}.service

Usage, to fix OSDs 1-7 on a host:
Bash:
for x in 1 2 3 4 5 6 7; do ./migrate_db.sh $x; done

This must be run on each host for only the OSDs that appear on that host.
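If you do not want to list the IDs by hand, a rough sketch like this picks up whatever OSDs have a data directory on the local host (assuming the usual /var/lib/ceph/osd/ceph-<id> layout):

Bash:
# run the migration script for every OSD whose data directory exists on this host
for d in /var/lib/ceph/osd/ceph-*; do
    ./migrate_db.sh "${d##*-}"   # strip everything up to the last '-' to get the OSD ID
done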

Also, this is a script that does things to your ceph installation without confirmation and is provided as a reference only. Make backups and all that.