Bluestore WAL/DB on SSD sizing

Haider Jarral

Well-Known Member
Aug 18, 2018
Hello all,

I recently decided to use SSDs to improve the performance of my cluster. Here is my cluster setup:

4 nodes
36 HDDs x 465 GB per node
CPU(s): 8 x Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz (2 sockets) per node
RAM: 128 GB per node

I want to move all the WAL/DBs to the new SSD to improve performance. Originally they were on the HDDs.

The more I read about WAL/DB sizing, recovery, and backfilling, the more it boggles my mind. I know these questions have been asked many times, but I am still confused. Here are my questions:

1. Would it be beneficial to increase the WAL/DB size while creating new OSDs, and how do I do that?
2. Is there a faster way to recreate OSDs? Each of my OSDs takes 20-30 minutes to backfill when I add it.
3. I am using the attached script for moving the WAL/DB to the SSD.
4. The default Ceph config shows these sizes:

ceph --show-config | grep bluestore_block
bluestore_block_create = true
bluestore_block_db_create = false
bluestore_block_db_path =
bluestore_block_db_size = 0
bluestore_block_path =
bluestore_block_preallocate_file = false
bluestore_block_size = 10737418240
bluestore_block_wal_create = false
bluestore_block_wal_path =
bluestore_block_wal_size = 100663296
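
For readability, those byte values work out to a 96 MiB WAL and a 10 GiB block size, while bluestore_block_db_size = 0 means no explicit DB size is configured. Quick shell arithmetic (nothing Ceph-specific):

echo "$((100663296 / 1024 / 1024)) MiB"              # bluestore_block_wal_size -> 96 MiB
echo "$((10737418240 / 1024 / 1024 / 1024)) GiB"     # bluestore_block_size -> 10 GiB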

And these are the partitions it created on my SSD. It only consumed about 30 GB of the 240 GB SSD.

fdisk -l | grep sds
Disk /dev/sds: 223.6 GiB, 240057409536 bytes, 468862128 sectors
/dev/sds1 2048 2099199 2097152 1G unknown
/dev/sds2 2099200 3278847 1179648 576M unknown
/dev/sds3 3278848 5375999 2097152 1G unknown
/dev/sds4 5376000 6555647 1179648 576M unknown
/dev/sds5 6555648 8652799 2097152 1G unknown
/dev/sds6 8652800 9832447 1179648 576M unknown
/dev/sds7 9832448 11929599 2097152 1G unknown
/dev/sds8 11929600 13109247 1179648 576M unknown
/dev/sds9 13109248 15206399 2097152 1G unknown
/dev/sds10 15206400 16386047 1179648 576M unknown
/dev/sds11 16386048 18483199 2097152 1G unknown
/dev/sds12 18483200 19662847 1179648 576M unknown
/dev/sds13 19662848 21759999 2097152 1G unknown
/dev/sds14 21760000 22939647 1179648 576M unknown
/dev/sds15 22939648 25036799 2097152 1G unknown
/dev/sds16 25036800 26216447 1179648 576M unknown
/dev/sds17 26216448 28313599 2097152 1G unknown
/dev/sds18 28313600 29493247 1179648 576M unknown
/dev/sds19 29493248 31590399 2097152 1G unknown
/dev/sds20 31590400 32770047 1179648 576M unknown
/dev/sds21 32770048 34867199 2097152 1G unknown
/dev/sds22 34867200 36046847 1179648 576M unknown
/dev/sds23 36046848 38143999 2097152 1G unknown
/dev/sds24 38144000 39323647 1179648 576M unknown
/dev/sds25 39323648 41420799 2097152 1G unknown
/dev/sds26 41420800 42600447 1179648 576M unknown
/dev/sds27 42600448 44697599 2097152 1G unknown
/dev/sds28 44697600 45877247 1179648 576M unknown
/dev/sds29 45877248 47974399 2097152 1G unknown
/dev/sds30 47974400 49154047 1179648 576M unknown
/dev/sds31 49154048 51251199 2097152 1G unknown
/dev/sds32 51251200 53348351 2097152 1G unknown
/dev/sds33 53348352 55445503 2097152 1G unknown
/dev/sds34 55445504 56625151 1179648 576M unknown
/dev/sds35 56625152 58722303 2097152 1G unknown


5. If increasing the WAL/DB partition size is beneficial, can I increase the existing partition sizes on my SSD, or do I have to recreate them?
6. Are there any parameters to speed up recovery/backfilling? I tinkered with these two values, but it did not help much.


ceph tell 'osd.*' injectargs '--osd-max-backfills 10'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 5'
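
A quick way to verify the injected values actually took effect is to ask an OSD daemon directly (run on the node that hosts the OSD; osd.0 is just an example):

ceph daemon osd.0 config get osd_max_backfills
ceph daemon osd.0 config get osd_recovery_max_active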


Here is proxmox information

pveversion -v

proxmox-ve: 5.2-2 (running kernel: 4.15.18-5-pve)
pve-manager: 5.2-9 (running version: 5.2-9/4b30e8f9)
pve-kernel-4.15: 5.2-8
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-29
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-27
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-36
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1



Lastly, my biggest concern: if my SSD dies, what will happen, and how can I prevent it from failing all the OSDs that use it?
 

Attachments

  • ceph_migration.txt (1.4 KB)
Anyone o_O Pretty Please :)

I was hoping to do the conversion to SSD over the weekend. Any insight/help is highly appreciated.
 
I was hoping to find comments about your concerns here. These are really interesting questions, at the heart of the thinking about migrating to BlueStore and optimizing an existing cluster.
Did you find answers to your questions in the end? I would be happy to have a further conversation about that if you're still around.
 
proxmox-ve: 5.2-2 (running kernel: 4.15.18-5-pve)
All in all, best to upgrade to Proxmox VE 6.1, since it has Ceph Nautilus and many other improvements.

1. Would it be beneficial to increase the WAL/DB size while creating new OSDs, and how do I do that?
Depends on the size they are now. Something in the range of 3/30/300 GB of RocksDB disk space. If the size is too small, the DB spills over to the data device. In Nautilus, you will get a warning in such an event.
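
A rough way to check whether the DB has already spilled over onto the slow (HDD) device is to look at an OSD's BlueFS counters; a sketch, assuming osd.0 runs on the node you are logged in to:

ceph daemon osd.0 perf dump | grep -E 'db_used_bytes|slow_used_bytes'
# slow_used_bytes > 0 means RocksDB data has spilled over onto the HDD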

2. Is there a faster way to recreate OSDs? Each of my OSDs takes 20-30 minutes to backfill when I add it.
Only by manually creating a new partition, stopping the OSD, and moving the DB with the ceph-bluestore-tool. But this is a more error-prone approach.
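
For the record, that manual route looks roughly like the sketch below. It assumes Ceph Nautilus (where ceph-bluestore-tool gained the bluefs-bdev-new-db and bluefs-bdev-migrate commands), a hypothetical OSD id 12 and a hypothetical new SSD partition /dev/sds36; treat it as an outline, not a tested recipe:

systemctl stop ceph-osd@12
# attach the new SSD partition as the DB device of the OSD
ceph-bluestore-tool bluefs-bdev-new-db --path /var/lib/ceph/osd/ceph-12 --dev-target /dev/sds36
# move the existing RocksDB data from the main (HDD) device onto the new DB device
ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-12 \
    --devs-source /var/lib/ceph/osd/ceph-12/block --dev-target /var/lib/ceph/osd/ceph-12/block.db
chown -h ceph:ceph /var/lib/ceph/osd/ceph-12/block.db
systemctl start ceph-osd@12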

3. I am using the attached script for moving the WAL/DB to the SSD.
IIUC, this destroys the OSD and creates a new one. You can add the DB partition size to the ceph.conf; then it will create partitions of that size automatically.
https://forum.proxmox.com/threads/where-can-i-tune-journal-size-of-ceph-bluestore.44000/post-223919
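
As a concrete sketch, the sizes go into the [osd] section of ceph.conf in bytes. The 5 GiB / 1 GiB values below are only illustrative; pick sizes that actually fit on your SSD once multiplied by the number of OSDs sharing it:

# /etc/pve/ceph.conf (example values only)
[osd]
bluestore_block_db_size  = 5368709120    # 5 GiB DB partition per new OSD
bluestore_block_wal_size = 1073741824    # 1 GiB WAL partition per new OSD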

4. The default Ceph config shows these sizes:
See the above thread.

5. If increasing the WAL/DB partition size is beneficial, can I increase the existing partition sizes on my SSD, or do I have to recreate them?
IIRC, you will need to expand the BlueFS of the DB partition afterwards.
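
If you go that route, the expansion step would look something like this: grow the partition first with your partitioning tool of choice, then let BlueFS grow into it (a sketch with a hypothetical OSD id 12):

systemctl stop ceph-osd@12
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-12
systemctl start ceph-osd@12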

6. Are there any parameters to speed up recovery/backfilling? I tinkered with these two values, but it did not help much.
Depending on the throughput of the disks, it might not get considerably faster. It could even heavily tax the system, and client IO might suffer from it.
https://docs.ceph.com/docs/luminous/rados/configuration/osd-config-ref/#backfilling
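
If you do experiment with it, here is a hedged example of loosening the throttles temporarily, including the per-op sleep Luminous applies to HDD OSDs; watch client latency and dial it back if your VMs get sluggish:

ceph tell 'osd.*' injectargs '--osd-max-backfills 4 --osd-recovery-max-active 4 --osd-recovery-sleep-hdd 0'
# roughly the Luminous defaults, to revert once recovery is done
ceph tell 'osd.*' injectargs '--osd-max-backfills 1 --osd-recovery-max-active 3 --osd-recovery-sleep-hdd 0.1'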
 
Thank you, that's very helpful. Can I upgrade Ceph and Proxmox without any issues in production? Does it require a reboot? What is the best way to do it in a clustered setup?
 
