Change WAL and DB location for running (slow) OSDs

lifeboy

I need to do something about the horrible performance I get from the HDD pool on a production cluster (I get benchmark speeds of around 500KB/s!). As disk usage has increased, performance has dropped. I'm not sure why this is, since I have a test cluster with a higher storage % allocation that performs much better on much older and slower hardware. However, here's what I'm planning to do.

Config:
2 x NVMe 1TB on each of 4 nodes in SSD pool
4 x SAS 2TB HDD on each of 4 nodes in HDD pool

Step 1: Remove one NVMe from the SSD pool. Split it into 5 LVM logical volumes: 1 of 800GB and 4 of 50GB. Add the 800GB one back as an OSD to the SSD pool.
Step 2: Remove one HDD at a time on the node that Step 1 has been performed on. Delete the OSD and recreate it assigning one 50GB partition to the WAL (I'm not planning to allocate the RocksDB, which should go with the WAL automatically if I read that correctly).
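
The two steps could be sketched roughly as follows. This is a dry run that only prints the commands, under stated assumptions: the device path /dev/nvme0n1, the volume group name nvme_vg and the LV names osd_ssd and db_N are hypothetical, and the HDD OSD is given a block.db volume rather than a WAL volume. As far as I understand the BlueStore documentation, it is the WAL that follows the DB when only a DB device is specified, not the other way round.

```shell
# Dry run: prints the commands instead of executing them.
# /dev/nvme0n1, nvme_vg, osd_ssd and db_N are hypothetical names.
plan_nvme_split() {
    dev="$1"; vg="$2"; count="$3"; size_gb="$4"
    echo "pvcreate ${dev}"
    echo "vgcreate ${vg} ${dev}"
    # Create the small DB volumes first so the OSD volume can take what is left.
    for i in $(seq 1 "${count}"); do
        echo "lvcreate -L ${size_gb}G -n db_${i} ${vg}"
    done
    echo "lvcreate -l 100%FREE -n osd_ssd ${vg}"
    # Step 1: the large volume goes back into the SSD pool as an OSD.
    echo "ceph-volume lvm create --bluestore --data ${vg}/osd_ssd"
}

# Step 2: recreate one HDD OSD with its RocksDB (and WAL) on an NVMe volume.
plan_hdd_osd() {
    hdd="$1"; db_lv="$2"
    echo "ceph-volume lvm create --bluestore --data ${hdd} --block.db ${db_lv}"
}

plan_nvme_split /dev/nvme0n1 nvme_vg 4 50
plan_hdd_osd /dev/sdb nvme_vg/db_1
```

Printing the commands first makes it possible to review the layout before anything destructive is run on a production node.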

On completion of the process, all my HDD-based OSDs will have a 50GB WAL/RocksDB partition on NVMe storage, which should make my HDD storage pool substantially faster. I'm hoping to get an average of between 6 and 8 times speed improvement, to around 300MB/s instead of the 500KB/s I get now.
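
The two figures in this estimate can be checked against each other with a little arithmetic. A quick sketch, assuming decimal units (1 MB = 1000 KB) and taking both numbers from the paragraph above:

```shell
current_kbs=500      # benchmark figure from the post, in KB/s
target_mbs=300       # hoped-for figure from the post, in MB/s

# Factor needed to get from the current figure to the target.
factor=$(( target_mbs * 1000 / current_kbs ))
echo "required speed-up: ${factor}x"

# What a 6x to 8x improvement on the current figure would give, in KB/s.
low=$(( current_kbs * 6 ))
high=$(( current_kbs * 8 ))
echo "6x gives ${low} KB/s, 8x gives ${high} KB/s"
```

So 300MB/s corresponds to roughly a 600-fold improvement over 500KB/s, while a 6 to 8 times improvement would land at about 3 to 4MB/s.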

Does this look like a good plan of action?

Is 50GB a good size for a 2TB OSD or can I make it less?

Are there things I should be doing differently?

All advice much appreciated.
 
I got advice from a seasoned ceph expert to do the following:

Split the NVMe drive into 3 OSDs (with LVM) to make full use of the speed the NVMe offers. So I created 2 additional volumes (5% of the NVMe, 47GB each) on the NVMe to hold the RocksDB and WAL for 2 HDD drives. I'm in the process of making these changes now, but since rebalancing has to happen between each drive change, it will take some time.
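
The 47GB figure can be reproduced with a short calculation. This sketch assumes the nominal drive sizes are decimal (1TB = 10^12 bytes) and that LVM reports sizes in GiB; it also expresses the DB volume relative to one 2TB HDD, since older BlueStore documentation suggested a block.db of roughly 1% to 4% of the data device.

```shell
nvme_bytes=1000000000000            # assumed: nominal 1 TB NVMe = 10^12 bytes
hdd_bytes=2000000000000             # assumed: nominal 2 TB HDD = 2 * 10^12 bytes
gib=$(( 1024 * 1024 * 1024 ))

# 5% of the NVMe, rounded to the nearest GiB.
share_bytes=$(( nvme_bytes * 5 / 100 ))
share_gib=$(( (share_bytes + gib / 2) / gib ))
echo "5% of the NVMe is about ${share_gib} GiB per DB/WAL volume"

# DB volume as a share of the HDD, in tenths of a percent (integer arithmetic).
permille=$(( share_bytes * 1000 / hdd_bytes ))
echo "that is ${permille} per mille of one HDD"
```

Under these assumptions each DB/WAL volume is about 2.5% of the HDD it serves.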
 
You don’t have to recreate the OSD to move the WAL/DB. No need to rebalance.
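
A minimal sketch of how this is usually done with ceph-volume, printed as a dry run. It assumes a Ceph release that ships the `new-db` and `migrate` subcommands, and the OSD id, fsid and target LV shown are placeholders. Note that `--from db` only applies to an OSD that already has a separate DB volume; for an OSD whose RocksDB still lives on the main device, the documented route is to attach a new DB volume first and then migrate `--from data`.

```shell
# Dry run: prints the commands for one OSD instead of executing them.
# The OSD id, fsid and target LV are placeholders supplied by the caller.
plan_db_move() {
    osd_id="$1"; osd_fsid="$2"; target_lv="$3"
    # Prevent the cluster from marking the OSD out while it is stopped.
    echo "ceph osd set noout"
    echo "systemctl stop ceph-osd@${osd_id}"
    # Attach a new, empty DB volume to the OSD.
    echo "ceph-volume lvm new-db --osd-id ${osd_id} --osd-fsid ${osd_fsid} --target ${target_lv}"
    # Move the RocksDB data that currently lives on the main (data) device.
    echo "ceph-volume lvm migrate --osd-id ${osd_id} --osd-fsid ${osd_fsid} --from data --target ${target_lv}"
    echo "systemctl start ceph-osd@${osd_id}"
    echo "ceph osd unset noout"
}

plan_db_move 0 "<osd-fsid>" "vg_name/db_lv"
```

Because the data stays on the HDD and only the metadata moves, the OSD keeps its identity and no rebalancing is triggered.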
 
That's excellent, I wasn't aware of it! It will save a lot of time, since rebalancing HDD-based OSDs is time-consuming, to put it mildly!

However, when I attempt to do this I get an error that, as far as I can tell, is not documented anywhere:

Code:
# lsblk /dev/sdb
NAME                                                                                                  MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sdb                                                                                                     8:16   0  1.8T  0 disk
└─ceph--025b887e--4f06--468f--845c--0ddf9ad04990-osd--block--4de2a617--4452--420d--a99b--9e0cd6b2a99b 253:4    0  1.8T  0 lvm 
  └─0GVWr9-dQ65-LHcx-y6fD-z7fI-10A9-gVWZkY                                                            253:15   0  1.8T  0 crypt

# ceph-osd -i 14 --get-osd-fsid
2023-08-01T21:36:40.270+0200 7fae3505a240 -1 asok(0x55d5ee6b8000) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-osd.14.asok': (17) File exists
4de2a617-4452-420d-a99b-9e0cd6b2a99b
(not sure why this gives an error when getting the osd-fsid?)

# ceph-volume lvm migrate --osd-id 14 --osd-fsid 4de2a617-4452-420d-a99b-9e0cd6b2a99b --from db --target NodeC-nvme1/NodeC-nvme-LV-RocksDB1
--> Source device list is empty
Unable to migrate to : NodeC-nvme1/NodeC-nvme-LV-RocksDB1

I did stop osd.14 before attempting to move the DB/WAL...

Can you help please?

Update: I found this

Code:
# ls -la /var/lib/ceph/osd/ceph-14/block
lrwxrwxrwx 1 ceph ceph 50 Dec 25  2022 /var/lib/ceph/osd/ceph-14/block -> /dev/mapper/0GVWr9-dQ65-LHcx-y6fD-z7fI-10A9-gVWZkY

Should I refer to that id instead? Edit: No, that doesn't work either and gives a different error: "Unable to find any LV for source OSD"
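
For context, the /dev/mapper entry in that symlink is the dm-crypt mapping that lsblk shows as type crypt on top of the logical volume, not the LV itself. As far as I know, ceph-volume finds an OSD's volumes through LVM tags such as ceph.osd_id and ceph.osd_fsid, which can be viewed with `ceph-volume lvm list` or `lvs -o lv_name,vg_name,lv_tags`. The sketch below only parses a tag string; the sample string is an assumption built from the id and fsid shown above, with other tags omitted.

```shell
# Extract the value of one tag from a comma-separated LVM tag string.
tag_value() {
    printf '%s\n' "$1" | tr ',' '\n' | sed -n "s/^$2=//p"
}

# Assumed sample in the style ceph-volume writes; a real OSD carries more tags.
tags="ceph.osd_id=14,ceph.osd_fsid=4de2a617-4452-420d-a99b-9e0cd6b2a99b,ceph.type=block,ceph.encrypted=1"

echo "osd id:    $(tag_value "$tags" ceph.osd_id)"
echo "osd fsid:  $(tag_value "$tags" ceph.osd_fsid)"
echo "type:      $(tag_value "$tags" ceph.type)"
echo "encrypted: $(tag_value "$tags" ceph.encrypted)"
```

On a real node the tag string would come from the lv_tags column of `lvs` rather than from a hard-coded sample.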
 