BlueFS spillover detected on 30 OSD(s)

troycarpenter

Hi all,

After upgrading one cluster from Proxmox VE 5.4 to 6.0, I performed the Ceph upgrade procedure listed here:
https://pve.proxmox.com/wiki/Ceph_Luminous_to_Nautilus

Somewhere along the way, in the midst of all the messages, I got the following WARN: BlueFS spillover detected on 30 OSD(s). In the information I see messages like the following for each of my OSDs:
osd.12 spilled over 762 MiB metadata from 'db' device (432 MiB used of 1024 MiB) to slow device
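(For reference, these per-OSD lines presumably come from ceph health detail; that is also where the spilled amount per OSD can be read off:)

Bash:
ceph health detail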

There are no other warnings. Everything (mons, managers, mds) is up and running. What part of the upgrade caused this, what does it mean, and how do I fix it?

There's no mention of that in the upgrade reference.
 
I thought I would try re-creating the OSDs with Nautilus, but now it's creating the DB LV at about 370 GB, which I guess is 10% of the OSD size. However, my SSD is only 1.7TB, so after creating 4 of the 10 OSDs in that server, it runs out of space on the SSD.

I have tried using the size limiter both in the GUI and in the CLI to 100GB, but the system still creates the DB LV at 370GB. Now I need to figure that problem out.
 
That is sadly a bug that slipped through (see https://bugzilla.proxmox.com/show_bug.cgi?id=2292).
As a workaround you can set the value in ceph.conf:
bluestore_block_db_size
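For example, a minimal sketch of what could be added to /etc/pve/ceph.conf before creating the OSDs (the option is given in bytes; the 100 GiB value here is just an example):

Code:
[global]
# DB size for newly created OSDs, in bytes (here 100 GiB)
bluestore_block_db_size = 107374182400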
Thanks for confirming that. I was about to post a followup with my observations that the bluestore setting was the only way I could get it to create the partition size I needed.

The other thing that got me is that you can't just delete one converted OSD from Luminous and re-add it with Nautilus, because Nautilus wants to use LVM and needs the SSD clear to create the initial PV/VG for the DB store. Which means that once I deleted one of the converted OSDs, I had to delete all of them on that host before I could re-add them.
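Presumably the leftover Luminous DB partitions can also be wiped off the SSD in one go before re-creating, something like this (the device name is a placeholder, and this destroys everything on that disk):

Bash:
# remove all partitions, PVs/VGs/LVs and Ceph data from the old DB SSD
ceph-volume lvm zap /dev/sdX --destroy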
 
Hello!
I'm facing the same issue, with the only difference that 192 OSDs are affected.

When I created the OSDs in PVE 5 + Luminous, a 1GB partition was created on the SSD for the DB (metadata).

Question:
How can I determine the amount of spilled metadata?

I ran this command (documented in the Ceph help)
ceph daemon osd.10 bluestore bluefs available
w/o success.
root@ld5505:~# ceph daemon osd.10 bluestore bluefs available
no valid command found; 10 closest matches:
dump_mempools
dump_historic_slow_ops {<filterstr> [<filterstr>...]}
dump_blocked_ops {<filterstr> [<filterstr>...]}
dump_blacklist
dump_historic_ops_by_duration {<filterstr> [<filterstr>...]}
dump_historic_ops {<filterstr> [<filterstr>...]}
config set <var> <val> [<val>...]
config help {<var>}
config unset <var>
config show
admin_socket: invalid command


I tried the ceph-bluestore-tool as well, but this fails with an error, too.
root@ld5505:~# ceph-bluestore-tool bluefs-bdev-sizes --device osd.10 --path /var/lib/ceph/osd/ceph-10/
too many positional options have been specified on the command line

root@ld5505:~# ls -l /var/lib/ceph/osd/ceph-10/
total 60
-rw-r--r-- 1 root root 402 Jun 7 14:25 activate.monmap
-rw-r--r-- 1 ceph ceph 3 Jun 7 14:25 active
lrwxrwxrwx 1 ceph ceph 58 Jun 7 14:25 block -> /dev/disk/by-partuuid/85030952-ab20-4b5f-bb05-d860648aa712
lrwxrwxrwx 1 ceph ceph 58 Jun 7 14:25 block.db -> /dev/disk/by-partuuid/aa784a14-bdd1-48a8-a264-2fb267231928
-rw-r--r-- 1 ceph ceph 37 Jun 7 14:25 block.db_uuid
-rw-r--r-- 1 ceph ceph 37 Jun 7 14:25 block_uuid
-rw-r--r-- 1 ceph ceph 2 Jun 7 14:25 bluefs
-rw-r--r-- 1 ceph ceph 37 Jun 7 14:25 ceph_fsid
-rw-r--r-- 1 ceph ceph 37 Jun 7 14:25 fsid
-rw------- 1 ceph ceph 57 Jun 7 14:25 keyring
-rw-r--r-- 1 ceph ceph 8 Jun 7 14:25 kv_backend
-rw-r--r-- 1 ceph ceph 21 Jun 7 14:25 magic
-rw-r--r-- 1 ceph ceph 4 Jun 7 14:25 mkfs_done
-rw-r--r-- 1 ceph ceph 6 Jun 7 14:25 ready
-rw-r--r-- 1 ceph ceph 3 Aug 23 09:50 require_osd_release
-rw-r--r-- 1 ceph ceph 0 Aug 21 11:18 systemd
-rw-r--r-- 1 ceph ceph 10 Jun 7 14:25 type
-rw-r--r-- 1 ceph ceph 3 Jun 7 14:25 whoami


The most important question is:
How much disk space is required for the DB (metadata)?
Does this size depend on the size of the relevant OSD?
In my case, each OSD is 1.64TB... what would be the optimal size for the DB (metadata)?

THX

Update:
After stopping the relevant OSD, the ceph-bluestore-tool command works.
However, I cannot interpret the output.
root@ld5505:~# ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-10/
inferring bluefs devices from bluestore path
slot 2 /var/lib/ceph/osd/ceph-10/block -> /dev/sdbo2
slot 1 /var/lib/ceph/osd/ceph-10/block.db -> /dev/sdbl1
1 : device size 0x40000000 : own 0x[2000~3fffe000] = 0x3fffe000 : using 0x153fe000
2 : device size 0x1a327831000 : own 0x[c931a00000~10c4300000] = 0x10c4300000 : using 0x0


What does this mean?
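If I read the output format correctly, the values are hexadecimal byte counts: per device the tool prints the total size, the extents BlueFS owns, and how much of that is actually in use. They can be decoded on the shell, e.g.:

Bash:
# decode the hex byte counts from bluefs-bdev-sizes
printf '%d\n' 0x40000000       # slot 1 (block.db) device size: 1073741824 bytes = 1 GiB
printf '%d\n' 0x153fe000       # bytes BlueFS currently uses on the db device (~340 MiB)
printf '%d\n' 0x10c4300000     # extent BlueFS owns on the slow/block device (~67 GiB); "using 0x0" means none of it is in use

So for this OSD the 1 GiB DB partition has roughly 340 MiB in use and nothing has spilled to the slow device at the moment. On a running OSD the same numbers (and the spilled amount, as slow_used_bytes) should presumably also be visible via the bluefs perf counters:

Bash:
ceph daemon osd.10 perf dump bluefs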
 

Well, my issue is not the OSD performance, so tuning was not what I was asking about.
The issue is that my setup originated from Proxmox 5 + Ceph Luminous, where every HDD OSD got a 1GB DB/journal partition on SSD.
According to Ceph this is by far too small for block.db (see here):
It is recommended that the block.db size isn’t smaller than 4% of block. For example, if the block size is 1TB, then block.db shouldn’t be less than 40GB.
For my 1.64TB OSDs that would mean roughly 66GB per OSD (0.04 × 1.64TB), compared to the 1GB partitions I have now.

In my understanding there are 2 options:
1. Ignore the warning "BlueFS spillover detected on ... OSD"
2. Re-create every single OSD

Option 2 is not nice if the cluster consists of hundreds of OSDs, as this must be done sequentially: mark the relevant OSD as out, wait for Ceph to finish remapping and return to active+clean, then re-create it.
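A rough per-OSD outline of option 2 on PVE 6 / Nautilus might look like this (a sketch only; the device names and the 64 GB DB size are placeholders):

Bash:
# take the OSD out and wait until all PGs are active+clean again
ceph osd out 12
ceph -s          # repeat until the cluster is healthy again
# stop and destroy the OSD, cleaning up its partitions/LVs
systemctl stop ceph-osd@12
pveceph osd destroy 12 --cleanup
# re-create it with an adequately sized DB on the SSD
pveceph osd create /dev/sdX --db_dev /dev/sdY --db_size 64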

In my opinion you must inform your users about this, as it can heavily impact cluster operations after upgrading to Nautilus.
 
Well, my issue is not the OSD performance, so tuning was not what I was asking about.
In the link it is written how to permanently set the DB size on Luminous when a new OSD is created. This is tuning.

According to Ceph this is by far too small for block.db (see here):
This has been known by Ceph since Luminous, but ceph-disk did not take it into account. The 4% is also an estimate and may still lead to spillover.

1. Ignore the warning "BlueFS spillover detected on ... OSD"
Sure, if the performance impact is negligible for you.

2. Re-create every single OSD
This is not necessarily needed. Depending on available space and setup, the DB can be moved (starting with Nautilus) or its partition increased (starting with Luminous), see man ceph-bluestore-tool. With our API, or a shell script and ceph-volume, the OSD destruction/creation can be automated too.
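For the "increase the partition" path, a minimal sketch (assuming the partition or LV backing block.db has already been enlarged, and using OSD 12 as a stand-in):

Bash:
systemctl stop ceph-osd@12
# let BlueFS grow into the newly available space on the block.db device
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-12
systemctl start ceph-osd@12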

In my opinion you must inform your users about this, as it can heavily impact cluster operations after upgrading to Nautilus.
This has had an impact ever since BlueStore was introduced. With the above-mentioned tools the situation can be mitigated and is not set in stone (even on a production cluster).
 
Hi,
thanks Alwin for the explanation.

However, there's one thing that is not mentioned.
With Nautilus, all OSDs are now created using LVM when using the command pveceph createosd <device>.
Before, this command created primary partitions with GPT.
Or is this command obsolete now? It is still documented here.

This means I cannot easily extend the size of the relevant DB partition.
With LVM this is different, though.
 
However, there's one thing that is not mentioned.
With Nautilus, all OSDs are now created using LVM when using the command pveceph createosd <device>.
Before, this command created primary partitions with GPT.
Or is this command obsolete now? It is still documented here.
Until Ceph Luminous, the ceph-disk utility was used; it is now obsolete and has been replaced by ceph-volume with Nautilus. For newly created OSDs, the content that used to live on the 100MB mounted XFS partition is now written to the LVs as tags.

This means I cannot easily extend the size of the relevant DB partition.
With LVM this is different, though.
I don't understand. You can resize a partition or an LV; handling an LV might be easier, though. And ceph-volume can split the DB/WAL LVs (partitions) evenly when it is used in batch mode.
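To see the metadata that is now stored as LV tags on a Nautilus-created OSD, something like the following should work:

Bash:
# list OSDs known to ceph-volume, including block/block.db devices and their tags
ceph-volume lvm list
# or look at the raw LV tags directly
lvs -o +lv_tags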
 
Well, the partitions on the SSD were created sequentially.
The layout now looks like this:
sdbl 67:240 0 372,6G 0 disk
├─sdbl1 67:241 0 1G 0 part
├─sdbl2 67:242 0 1G 0 part
├─sdbl3 67:243 0 1G 0 part
├─sdbl4 67:244 0 1G 0 part
├─sdbl5 67:245 0 1G 0 part
├─sdbl6 67:246 0 1G 0 part
├─sdbl7 67:247 0 1G 0 part
├─sdbl8 67:248 0 1G 0 part
├─sdbl9 67:249 0 1G 0 part
├─sdbl10 67:250 0 1G 0 part
├─sdbl11 67:251 0 1G 0 part
├─sdbl12 67:252 0 1G 0 part
├─sdbl13 67:253 0 1G 0 part
├─sdbl14 67:254 0 1G 0 part
├─sdbl15 67:255 0 1G 0 part
├─sdbl16 259:20 0 1G 0 part
├─sdbl17 259:21 0 1G 0 part
├─sdbl18 259:22 0 1G 0 part
├─sdbl19 259:23 0 1G 0 part
├─sdbl20 259:24 0 1G 0 part
└─sdbl21 259:25 0 1G 0 part


I don't know how to extend a partition w/o moving all the subsequent partitions.

Of course I could delete the partition and create a new one with the required size.
However, then I would need to adjust the UUID (the block.db symlink) under the relevant /var/lib/ceph/osd/ceph-<id>/.
 
I don't know how to extend a partition w/o moving all the subsequent partitions.
I see. Yes, some partitions would need to be moved. But as you wrote in another post that your cluster is now on Nautilus, you could create a new and bigger partition at the end and use ceph-bluestore-tool to move the DB (offline).

Of course I could delete the partition and create a new one with the required size.
However, then I would need to adjust the UUID (the block.db symlink) under the relevant /var/lib/ceph/osd/ceph-<id>/.
This should be taken care of by the ceph-bluestore-tool, when moving the DB.
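A sketch of that offline move with ceph-bluestore-tool might look roughly like this (the target partition is a placeholder for the new, larger partition):

Bash:
systemctl stop ceph-osd@10
# move the DB from the old 1 GB partition to the new, larger one
ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-10 \
    --devs-source /var/lib/ceph/osd/ceph-10/block.db \
    --dev-target /dev/disk/by-partuuid/<new-db-partition>
systemctl start ceph-osd@10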
 
Based on my calculation I need much more SSD disk space.
260x HDD 2TB = 520TB total
5% for DB = 26TB
distributed over 4 nodes = 6.5TB per node

Once I have the required SSD drives I will create a new DB storage location.

Can you please advise how to proceed for the following 2 scenarios:
1. HDD - single-drive config, i.e. no separate DB/WAL device, created on LVM
2. HDD+SSD - data on HDD, separate DB on SSD with a 1GB partition

This is the content of /var/lib/ceph/osd/ceph-<id> for the two scenarios:
root@ld5505:~# ls -l /var/lib/ceph/osd/ceph-11/
total 52
-rw-r--r-- 1 ceph ceph 418 Aug 27 08:31 activate.monmap
lrwxrwxrwx 1 ceph ceph 93 Aug 27 08:31 block -> /dev/ceph-546d5cea-6e20-4527-a60b-40c8f64275b3/osd-block-8018a4c9-3c9e-48ec-9099-e6a5fc7268c8
-rw-r--r-- 1 ceph ceph 2 Aug 27 08:31 bluefs
-rw-r--r-- 1 ceph ceph 37 Aug 27 08:31 ceph_fsid
-rw-r--r-- 1 ceph ceph 37 Aug 27 08:31 fsid
-rw------- 1 ceph ceph 56 Aug 27 08:31 keyring
-rw-r--r-- 1 ceph ceph 8 Aug 27 08:31 kv_backend
-rw-r--r-- 1 ceph ceph 21 Aug 27 08:31 magic
-rw-r--r-- 1 ceph ceph 4 Aug 27 08:31 mkfs_done
-rw-r--r-- 1 ceph ceph 41 Aug 27 08:31 osd_key
-rw-r--r-- 1 ceph ceph 6 Aug 27 08:31 ready
-rw-r--r-- 1 ceph ceph 3 Aug 27 08:31 require_osd_release
-rw-r--r-- 1 ceph ceph 10 Aug 27 08:31 type
-rw-r--r-- 1 ceph ceph 3 Aug 27 08:31 whoami

root@ld5507:~# ls -l /var/lib/ceph/osd/ceph-57/
total 60
-rw-r--r-- 1 root root 402 Jul 2 14:01 activate.monmap
-rw-r--r-- 1 ceph ceph 3 Jul 2 14:01 active
lrwxrwxrwx 1 ceph ceph 58 Jul 2 14:01 block -> /dev/disk/by-partuuid/f4d306ec-9dd5-4f75-8b4b-d53519464aff
lrwxrwxrwx 1 ceph ceph 58 Jul 2 14:01 block.db -> /dev/disk/by-partuuid/ca9eba14-c192-4fae-98e0-2b0e37049c90
-rw-r--r-- 1 ceph ceph 37 Jul 2 14:01 block.db_uuid
-rw-r--r-- 1 ceph ceph 37 Jul 2 14:01 block_uuid
-rw-r--r-- 1 ceph ceph 2 Jul 2 14:01 bluefs
-rw-r--r-- 1 ceph ceph 37 Jul 2 14:01 ceph_fsid
-rw-r--r-- 1 ceph ceph 37 Jul 2 14:01 fsid
-rw------- 1 ceph ceph 57 Jul 2 14:01 keyring
-rw-r--r-- 1 ceph ceph 8 Jul 2 14:01 kv_backend
-rw-r--r-- 1 ceph ceph 21 Jul 2 14:01 magic
-rw-r--r-- 1 ceph ceph 4 Jul 2 14:01 mkfs_done
-rw-r--r-- 1 ceph ceph 6 Jul 2 14:01 ready
-rw-r--r-- 1 ceph ceph 3 Aug 23 09:56 require_osd_release
-rw-r--r-- 1 ceph ceph 0 Aug 21 12:41 systemd
-rw-r--r-- 1 ceph ceph 10 Jul 2 14:01 type
-rw-r--r-- 1 ceph ceph 3 Jul 2 14:01 whoami


THX
 
Based on my calculation I need much more SSD disk space.
260x HDD 2TB = 520TB total
5% for DB = 26TB
distributed over 4 nodes = 6.5TB per node
Are you certain that this will provide better performance? With 65 OSDs per node and the distribution on host level, a single OSD might not get that many hits at all. Of course, that heavily depends on the workload.

Can you please advise how to proceed for the following 2 scenarios:
1. HDD - single-drive config, i.e. no separate DB/WAL device, created on LVM
2. HDD+SSD - data on HDD, separate DB on SSD with a 1GB partition
I don't exactly know what you mean, but I suppose you are asking how to replace those DB/WAL partitions. Best re-create the OSDs, as it seems that the OSD's metadata (eg. the tag ceph.db_device) is not updated when ceph-bluestore-tool is used to migrate. One would need to run lvchange afterwards for every OSD and activate the OSD again so that the block.db symlink is changed accordingly (or do the extra steps oneself). When re-creating the OSD, you can choose the disk for the DB and it will add each DB as an LV to the VG that the PV is associated with.
Bash:
pveceph osd create /dev/sda --db_dev /dev/sdc --db_size 10
pveceph osd create /dev/sdb --db_dev /dev/sdc --db_size 10

root@pve6ceph01:~# lsblk
NAME                                                                                                  MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                                                                                                     8:16   0   32G  0 disk
└─ceph--bb62defa--4fcd--4f64--802a--39c31b030bb4-osd--block--31f0ed2d--be53--4061--bf13--862c6f8ddf5f 253:6    0   31G  0 lvm 
sdb                                                                                                     8:32   0   32G  0 disk
└─ceph--4e963150--589e--4368--8cc0--ef4f12edb39b-osd--block--54ce2440--35ef--46c5--8d31--cbdced7c9da8 253:8    0   31G  0 lvm 
sdc                                                                                                     8:64   0   32G  0 disk
├─ceph--db--volume-osd--db--d01252d6--1941--477d--bb9e--1d93add82271                                  253:0    0   10G  0 lvm 
└─ceph--db--volume-osd--db--897fcab6--f1e8--401f--ba00--c788fd905e3f                                  253:7    0   10G  0 lvm
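For completeness, if the DB were migrated with ceph-bluestore-tool instead of re-creating the OSD, fixing up the LV metadata by hand might look roughly like this (a sketch only; VG/LV names and devices are placeholders, and the ceph.db_uuid tag may need the same treatment):

Bash:
# point the block LV's db_device tag at the new DB device
lvchange --deltag "ceph.db_device=/dev/old-db-device" \
         --addtag "ceph.db_device=/dev/new-db-device" \
         ceph-<vg>/osd-block-<osd-fsid>
# re-activate so the block.db symlink under /var/lib/ceph/osd/ceph-<id> is recreated
ceph-volume lvm activate --all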
 
Are you certain that this will provide better performance? With 65 OSDs per node and the distribution on host level, a single OSD might not get that many hits at all. Of course, that heavily depends on the workload.

The workload for the HDD OSDs is only:
OLTP database backup / restore

This means that any DB server has a single RBD mapped to back up / restore the database.

Would you confirm that for this workload a dedicated SSD for block.db is not required?
 
Would you confirm that for this workload a dedicated SSD for block.db is not required?
I can't assess how well the system is performing right now and whether the load is handled adequately. This is up to your estimation; I just wanted to throw this in as a thought. To get a feel for it, you could monitor your system / disk utilization to see whether they are running well below capacity (bandwidth, latency). And finally, test by re-creating the OSDs without a separate DB. I doubt that you would need to re-do all of them, as you should see a load change on those OSDs anyhow.
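A simple way to get that data (assuming the sysstat package is installed) is to watch per-device utilization and latency over time:

Bash:
# extended per-device statistics every 5 seconds; watch %util and the *_await columns on the DB SSDs
iostat -x 5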
 
