Bluestore WAL/DB on SSD sizing

Haider Jarral

Well-Known Member
Aug 18, 2018
Hello all,

I recently decided to add an SSD in order to improve the performance of my cluster. Here is my cluster setup:

4 Nodes
36 x 465 GB HDDs / node
CPU(s) 8 x Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz (2 Sockets) /node
RAM 128GB / node

I wanted to move all my WAL/DBs to the new SSD to improve performance; originally they were on the HDDs.

The more I read about WAL/DB sizing, recovery, and backfilling, the more it boggles my mind. I know these questions have been asked many times, but I am still confused. Here are my questions:

1. Would it be beneficial to increase the WAL/DB size while creating new OSDs, and how do I do that?
2. Is there a faster way to recreate OSDs? Each of my OSDs takes 20-30 minutes to backfill when I add it.
3. I am using the attached script for moving the WAL/DB to the SSD.
4. The default Ceph config shows these sizes:

ceph --show-config | grep bluestore_block
bluestore_block_create = true
bluestore_block_db_create = false
bluestore_block_db_path =
bluestore_block_db_size = 0
bluestore_block_path =
bluestore_block_preallocate_file = false
bluestore_block_size = 10737418240
bluestore_block_wal_create = false
bluestore_block_wal_path =
bluestore_block_wal_size = 100663296

And these are the partitions it created on my SSD. It only consumed 30 GB of the 250 GB SSD.

fdisk -l | grep sds
Disk /dev/sds: 223.6 GiB, 240057409536 bytes, 468862128 sectors
/dev/sds1 2048 2099199 2097152 1G unknown
/dev/sds2 2099200 3278847 1179648 576M unknown
/dev/sds3 3278848 5375999 2097152 1G unknown
/dev/sds4 5376000 6555647 1179648 576M unknown
/dev/sds5 6555648 8652799 2097152 1G unknown
/dev/sds6 8652800 9832447 1179648 576M unknown
/dev/sds7 9832448 11929599 2097152 1G unknown
/dev/sds8 11929600 13109247 1179648 576M unknown
/dev/sds9 13109248 15206399 2097152 1G unknown
/dev/sds10 15206400 16386047 1179648 576M unknown
/dev/sds11 16386048 18483199 2097152 1G unknown
/dev/sds12 18483200 19662847 1179648 576M unknown
/dev/sds13 19662848 21759999 2097152 1G unknown
/dev/sds14 21760000 22939647 1179648 576M unknown
/dev/sds15 22939648 25036799 2097152 1G unknown
/dev/sds16 25036800 26216447 1179648 576M unknown
/dev/sds17 26216448 28313599 2097152 1G unknown
/dev/sds18 28313600 29493247 1179648 576M unknown
/dev/sds19 29493248 31590399 2097152 1G unknown
/dev/sds20 31590400 32770047 1179648 576M unknown
/dev/sds21 32770048 34867199 2097152 1G unknown
/dev/sds22 34867200 36046847 1179648 576M unknown
/dev/sds23 36046848 38143999 2097152 1G unknown
/dev/sds24 38144000 39323647 1179648 576M unknown
/dev/sds25 39323648 41420799 2097152 1G unknown
/dev/sds26 41420800 42600447 1179648 576M unknown
/dev/sds27 42600448 44697599 2097152 1G unknown
/dev/sds28 44697600 45877247 1179648 576M unknown
/dev/sds29 45877248 47974399 2097152 1G unknown
/dev/sds30 47974400 49154047 1179648 576M unknown
/dev/sds31 49154048 51251199 2097152 1G unknown
/dev/sds32 51251200 53348351 2097152 1G unknown
/dev/sds33 53348352 55445503 2097152 1G unknown
/dev/sds34 55445504 56625151 1179648 576M unknown
/dev/sds35 56625152 58722303 2097152 1G unknown


5. If increasing the WAL/DB partition size is beneficial, can I grow the existing partitions on my SSD, or do I have to recreate them?
6. Are there any parameters to speed up recovery/backfilling? I tinkered with these two values, but it did not help much.


ceph tell 'osd.*' injectargs '--osd-max-backfills 10'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 5'


Here is the Proxmox version information:

pveversion -v

proxmox-ve: 5.2-2 (running kernel: 4.15.18-5-pve)
pve-manager: 5.2-9 (running version: 5.2-9/4b30e8f9)
pve-kernel-4.15: 5.2-8
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-40
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-29
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-27
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-36
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1



Lastly, my biggest concern is: if my SSD dies, what will happen, and how can I prevent it from failing all my OSDs?
 

Attachments

  • ceph_migration.txt
Anyone o_O Pretty Please :)

I was hoping to do the conversion to SSD over the weekend. Any insight/help is highly appreciated.
 
I was hoping so much to find comments about your concerns. These are really interesting questions, at the heart of any reflection on migrating to BlueStore and optimizing an existing cluster.
Did you find answers to your questions in the end? I would be happy to have a further conversation about that if you're still around here.
 
proxmox-ve: 5.2-2 (running kernel: 4.15.18-5-pve)
All in all, it is best to upgrade to Proxmox VE 6.1, since it has Ceph Nautilus and many other improvements.

1. Would it be beneficial to increase the WAL/DB size while creating new OSDs, and how do I do that?
Depends on the size they are now. Something on the order of 3/30/300 GB for RocksDB disk space. If the size is too small, the DB spills over to the data device. In Nautilus, you will get a warning in such an event.
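As a concrete sketch of how such sizes translate into ceph.conf values (the 30 GiB DB / 2 GiB WAL figures below are assumptions for illustration, not a recommendation for this cluster), the byte values can be computed like this:

```shell
# Hypothetical sketch: convert GiB sizes to the byte values that
# bluestore_block_db_size / bluestore_block_wal_size expect.
db_gib=30    # assumed DB size, following the 3/30/300 GB rule of thumb
wal_gib=2    # assumed WAL size
db_bytes=$(( db_gib * 1024 * 1024 * 1024 ))
wal_bytes=$(( wal_gib * 1024 * 1024 * 1024 ))
# Print a conf stanza with the computed values:
printf '[osd]\nbluestore_block_db_size = %s\nbluestore_block_wal_size = %s\n' \
  "$db_bytes" "$wal_bytes"
```

The printed stanza would go into /etc/ceph/ceph.conf before creating the new OSDs, so that the partitions get sized accordingly.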

2. Is there a faster way to recreate OSDs? Each of my OSDs takes 20-30 minutes to backfill when I add it.
Only by manually creating a new partition, stopping the OSD, and moving the DB with ceph-bluestore-tool. But this is a more error-prone approach.
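On Ceph Nautilus and later, that in-place move could look roughly like the sketch below. The OSD id, the target partition, and the paths are assumptions; DRY_RUN=1 (the default here) only prints the commands instead of running them, so nothing is touched until you deliberately set DRY_RUN=0.

```shell
# Hypothetical sketch (Nautilus+): attach a new SSD partition as a dedicated
# DB device for an existing OSD without rebuilding it.
OSD=0                        # assumed OSD id
TARGET=/dev/sds36            # hypothetical new SSD partition
OSD_DIR=/var/lib/ceph/osd/ceph-$OSD
# Guard: with DRY_RUN=1 (the default) just echo each command.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }

run systemctl stop ceph-osd@"$OSD"
# Attach the new partition as the OSD's DB device:
run ceph-bluestore-tool bluefs-bdev-new-db --path "$OSD_DIR" --dev-target "$TARGET"
# Migrate the existing RocksDB data from the main device onto it:
run ceph-bluestore-tool bluefs-bdev-migrate --path "$OSD_DIR" \
    --devs-source "$OSD_DIR"/block --dev-target "$OSD_DIR"/block.db
run systemctl start ceph-osd@"$OSD"
```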

3. I am using the attached script for moving the WAL/DB to the SSD.
IIUC, this destroys the OSD and creates a new one. You can add the DB partition size to ceph.conf; then it will create partitions of the set size automatically.
https://forum.proxmox.com/threads/where-can-i-tune-journal-size-of-ceph-bluestore.44000/post-223919

4. The default Ceph config shows these sizes
See the above thread.

5. If increasing the WAL/DB partition size is beneficial, can I grow the existing partitions on my SSD, or do I have to recreate them?
IIRC, you will need to expand the BlueFS of the DB partition afterwards.
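A rough sketch of that expand step, after the partition itself has been grown with a partitioning tool (the OSD id is an assumption; as above, DRY_RUN=1, the default, only prints the commands):

```shell
# Hypothetical sketch: let BlueFS claim the space of a grown DB partition.
OSD=0                        # assumed OSD id
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }

run systemctl stop ceph-osd@"$OSD"
# Expand BlueFS to cover the enlarged underlying partition:
run ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-"$OSD"
run systemctl start ceph-osd@"$OSD"
```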

6. Are there any parameters to speed up recovery/backfilling? I tinkered with these two values, but it did not help much.
Depending on the throughput of the disks, it might not be considerably faster. It could even heavily tax the system, and client IO might suffer.
https://docs.ceph.com/docs/luminous/rados/configuration/osd-config-ref/#backfilling
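One workable pattern is to raise the limits only for the duration of the rebalance and then drop them back to the Luminous defaults (osd_max_backfills = 1, osd_recovery_max_active = 3) once the cluster is healthy again. A sketch, with the raised values being assumptions (DRY_RUN=1, the default, only prints the commands):

```shell
# Hypothetical sketch: temporarily raise recovery limits, then restore defaults.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }

# During rebalancing (raised values are assumptions; watch client latency):
run ceph tell 'osd.*' injectargs '--osd-max-backfills 4 --osd-recovery-max-active 8'
# Once backfilling finishes, back to the Luminous defaults:
run ceph tell 'osd.*' injectargs '--osd-max-backfills 1 --osd-recovery-max-active 3'
```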
 
Thank you, that's very helpful. Can I upgrade Ceph and Proxmox without any issues in production? Does it require a reboot? What is the best way to do it in a clustered setup?
 
