CEPH shared SSD for DB/WAL?

barius

New Member
Feb 26, 2020
Hello,

I am preparing to order hardware for a 3-node hyper-converged cluster. I won't bore you with exact configuration details but I would be thankful for some general advice on this point.

To optimise performance with a limited budget (all-SSD storage is not an option), I have read that it would be good to put the DB+WAL on a fast SSD and use slow(er) disks for the main OSD storage. It also seems that I should create one OSD per main storage disk, but that partitions (on the SSD) are OK for the DB/WAL. The DB/WAL is small compared to the main storage (10% ???) so I prefer not to dedicate 2 drive bays for each OSD. Especially since more, smaller OSDs would reduce the rebuild time compared to fewer large OSDs.

Is it a bad idea to share an SSD for the DB/WAL of several OSDs? My main concern is that if (when!!!) the SSD fails, all the OSDs with their DB/WAL on the same SSD will have to be destroyed and rebuilt. Right?

Secondly, if I decide to risk sharing the SSD as above, how do I actually implement this? (Assume sdh is the HDD and sds is the SSD.)
From the CEPH site (Multiple Devices) I have the impression that a partition will be created automatically. The PVECEPH manual says to do something like this but does not discuss partitions.
Code:
pveceph osd create /dev/sdh -db_dev /dev/sds

Or do I have to create the partition myself, then do something like this?
Code:
pveceph osd create /dev/sdh -db_dev /dev/sds1

Many thanks in advance for your valuable advice

Barius
 
Is it a bad idea to share an SSD for the DB/WAL of several OSDs? My main concern is that if (when!!!) the SSD fails, all the OSDs with their DB/WAL on the same SSD will have to be destroyed and rebuilt. Right?
A common recommendation is to have 1 SSD per 4-6 HDDs, since it can speed up writes dramatically.
If I understand correctly, it is indeed the case that if the SSD fails, all HDDs it holds data for will need rebuilding.

The DB/WAL is small compared to the main storage (10% ???)
In the CEPH docs I've seen 4% recommended; I think the recommended minimum is ~30GB per OSD.

I'm pretty sure you have to do the partitioning yourself, taking proper care of partition alignment, but I'm not sure about your other questions.
 
What did you end up doing @barius? I'm in a similar situation here. Building the cluster this weekend: 4 nodes, 2x 4TB HDDs and a 500GB SSD each. From the numbers I've seen, the SSD should be large enough to upgrade to 4x 4TB down the road.

Is the partition for the WAL and DB on the SSD shared between the OSDs on the machine, or did you have to partition it differently? Thank you for the help!
 
I've had a 4-node Proxmox cluster running Ceph with a Fusion-io card in each machine for about a year. I have 4x 1TB 2.5" HDDs in each machine and partition 300GB for the DB. No issues here.
 
Great news! So I'll create 2 partitions on the SSD, one for the WAL and one for the DB, and all the OSDs on that machine can use the same partitions then.
 
Great news! So I'll create 2 partitions on the SSD, one for the WAL and one for the DB, and all the OSDs on that machine can use the same partitions then.
It is my understanding (I may be wrong) that you need a separate partition on the SSD for each OSD. Did you find it worked to share a single partition?
it is indeed so that if the SSD would fail, all HDDs it holds data for will need rebuilding.
That is indeed what my (limited, but sufficient?) research has indicated: you would need to remove the OSDs and add them to the cluster again to refill with data. (E.g. see http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/024267.html)
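For reference, a rough sketch of that remove/re-add cycle on a Proxmox node, assuming hypothetical OSD IDs 3 and 4 had their DB on the failed SSD and reusing the sdh/sds device names from the first post:
Code:
# mark the affected OSDs out so Ceph re-replicates their data elsewhere
ceph osd out osd.3 osd.4
# once it is safe, stop and destroy them (cleanup also removes the LVM volumes)
systemctl stop ceph-osd@3 ceph-osd@4
pveceph osd destroy 3 --cleanup 1
pveceph osd destroy 4 --cleanup 1
# after replacing the SSD, recreate the OSDs with their DB on the new device
pveceph osd create /dev/sdh --db_dev /dev/sds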
 
It is my understanding (I may be wrong) that you need a separate partition on the SSD for each OSD. Did you find it worked to share a single partition?
You can let Ceph handle this automatically; it nicely creates the partitions accordingly.
 
I am still at the early stages of my journey with Ceph, but I can say I was able to share an SSD between OSDs. I didn't find a way to accomplish this using the Proxmox GUI, only the CLI. The OSDs are 4TB and the SSD is 500GB, 2 OSDs per SSD...
The command used was:
Code:
pveceph osd create /dev/sda --db_dev /dev/sdd --db_size 200
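For a second HDD sharing the same SSD, the command can simply be repeated against the same --db_dev; the size is given in GiB. A sketch with a hypothetical second HDD at /dev/sdb:
Code:
# second 4TB HDD, block.db on the remaining half of the same 500GB SSD
pveceph osd create /dev/sdb --db_dev /dev/sdd --db_size 200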
 
I am trying to do the same setup (a shared SSD for block.db and block.wal) and actually couldn't manage to do it right yet. It would be a good feature for Proxmox to be able to do this from the GUI. I hope the Proxmox team is reading this thread and will consider adding a feature so that we can put the DB and WAL on the same shared SSD, targeting not only a whole disk per OSD but also an LVM volume as the target store.
 
I am trying to do the same setup (a shared SSD for block.db and block.wal) and actually couldn't manage to do it right yet. It would be a good feature for Proxmox to be able to do this from the GUI. I hope the Proxmox team is reading this thread and will consider adding a feature so that we can put the DB and WAL on the same shared SSD, targeting not only a whole disk per OSD but also an LVM volume as the target store.
Where did you get hung up?
 
Where did you get hung up?
I tried to select the OS disk but got an error like (I don't remember the exact phrase) "no ceph volume group found". But today, after reading your reply, I added a new blank SSD to the test server and everything worked as expected.

So, to my understanding, we cannot use a single SSD for both the Proxmox OS and the Ceph DB/WAL files.

On our production servers we are currently using Hyper-V; we are planning to migrate to Proxmox and are evaluating and testing its capabilities in our test environment.

On the production servers we have RAID 1 with 2 SSDs for fault tolerance of the hypervisor installation. And 6 to 10 spinning drives for VM disks per server. If we cannot use the OS disk side by side with the Ceph DB/WAL, this means we will need to install 2 more SSDs in RAID 1 for fault tolerance of the WAL/DB, which will cost us a decrease of 12 to 20 TB of raw disk space per server.

My question is: is this proposed setup acceptable for performance and fault tolerance? If not, what kind of setup do you suggest?
 
I really wish I could help any further, but I'm just a hobbyist learning as well. Maybe, and I say maybe, a dual-SD PCI card could be a use case for that? I've noticed some of my Dell servers have that PCI card; it mirrors to the other SD. Another option could be a PCI card with dual NVMe...
For my use case (again, not professional), I took the DVD drive out and used its SATA and power connectors for the SSD to keep my 8 hot-swap bays. Keep in mind that it's SATA 2 and not SATA 3, but I was after the IOPS and not the full bandwidth (still pretty close).
Please let me know of your findings.
 
What did you end up doing @barius? I'm in a similar situation here. Building the cluster this weekend: 4 nodes, 2x 4TB HDDs and a 500GB SSD each. From the numbers I've seen, the SSD should be large enough to upgrade to 4x 4TB down the road.

Is the partition for the WAL and DB on the SSD shared between the OSDs on the machine, or did you have to partition it differently? Thank you for the help!
Finally I bought 4x 4TB spinning drives and 2x 900GB SSDs for Ceph on each node. I didn't want all the OSDs on one node to die if just one SSD failed. So the DB+WAL of two OSDs share one SSD and the DB+WAL of the two remaining OSDs share the other SSD.

I can't really say how it performs because I only just set it up (hence the delayed reply) and I don't have any benchmarking planned.

Concerning the question of partitioning, it was not necessary to partition the SSD to share it between two OSDs. Ceph manages it with LVM -- it creates a PV and VG covering the whole SSD and then allocates the required space as an LV for each OSD.

So with sde and sdf as my HDD and sdc as the SSD, it was as simple as this:
Code:
sudo pveceph osd create /dev/sde --encrypted 1 --db_dev /dev/sdc
sudo pveceph osd create /dev/sdf --encrypted 1 --db_dev /dev/sdc
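If you are curious how Ceph carved up the SSD, the resulting layout can be inspected afterwards with the standard LVM tools (a quick check, assuming the defaults above):
Code:
# one PV/VG on the SSD, with one block.db LV per OSD carved out of it
sudo pvs
sudo lvs -o lv_name,vg_name,lv_size,devices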
 
Jumping in to share my $0.02, as I've been heavily researching this myself; a newbie like others here...

First, let's state for the record what exact SSD hardware you need for any and all journal (FileStore), block.db, block.wal, etc. devices, including a straight SSD added as an OSD: enterprise SSDs, and nothing else. No, your lot of Samsung 980 Pro M.2 drives are not enterprise drives, nor is any mainstream consumer drive you've used, and your performance with them will be horrible. Enterprise SSDs are a whole different level, mainly because of one fact you can verify in your SSD's tech details: how it handles Power Loss Protection. Enterprise SSDs have super-capacitors that keep power to the device for a short time after the machine loses power -- enough to write everything in the queue to flash before the power is gone. Consumer drives, like the Samsung 980 Pro M.2, use a transaction log instead.

Why is Power Loss Protection key to performance for the Ceph cluster, whether as a backing disk or as the block.db? Because Ceph issues a flush/fsync immediately after each write. An enterprise drive will acknowledge this fsync immediately without waiting to flush, as it has the super-capacitor to guarantee the writes. Consumer SSDs, like the Samsung 980 Pro M.2, will not -- they flush everything to NAND first and only then acknowledge. This tiny detail of fsync handling is paramount to Ceph performance and is the key to 10x or even 30x the performance. Are you getting horrible IOPS or bandwidth in your Ceph cluster? This is most likely why, unless your block.db device is too small.
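If you want to check whether a given SSD is up to the job before trusting it with DB/WAL, the usual quick test is a single-threaded 4k sync-write run with fio, which mimics that fsync-heavy pattern. A sketch; /dev/sdX is a placeholder, and the test writes to the device, so only point it at a blank disk:
Code:
# 4k synchronous, queue-depth-1 writes: drives with power-loss protection
# typically sustain thousands of IOPS here, consumer drives often only hundreds
fio --name=db-sync-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based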

For the record, you can pick up inexpensive enterprise drives on eBay. I got over a dozen Samsung PM883 240GB drives at $30/ea (usually $50-ish) by waiting nearly 6 months for a "Lot of" deal. There are also the older PM863s, and many Intel DC drives. Even SATA2 3Gb/s drives are fine, as you want the IOPS and the instant fsync response, not really the bandwidth. Btw, you want the smaller disks, like the 240GB. Read on for why.

Now, some facts from the official Ceph docs related to the BlueStore backend (I'm assuming everyone here is using BlueStore).

A quote from the page to clear up DB/WAL:

The BlueStore journal will always be placed on the fastest device available, so using a DB device will provide the same benefit that the WAL device would while also allowing additional metadata to be stored there (if it will fit). This means that if a DB device is specified but an explicit WAL device is not, the WAL will be implicitly colocated with the DB on the faster device.

So, we only need to specify the block.db device.
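In pveceph terms that means passing only --db_dev and no WAL device. Under the hood pveceph hands this to ceph-volume, where a roughly equivalent plain-Ceph call (a sketch, reusing barius's sdh/sds device names) would be:
Code:
# data on the HDD, block.db (and therefore the WAL) on the SSD
ceph-volume lvm create --bluestore --data /dev/sdh --block.db /dev/sds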

About that sizing...

barius said:
The DB/WAL is small compared to the main storage (10% ???)
Byron said:
In the CEPH docs I've seen 4% recommended; I think the recommended minimum is ~30GB per OSD.

Actually, the Ceph docs recommend:

The general recommendation is to have block.db size in between 1% to 4% of block size. For RGW workloads, it is recommended that the block.db size isn’t smaller than 4% of block, because RGW heavily uses it to store metadata (omap keys). For example, if the block size is 1TB, then block.db shouldn’t be less than 40GB. For RBD workloads, 1% to 2% of block size is usually enough.

So for RGW workloads, i.e. the RADOS Gateway with its S3-compatible API, you'd want no less than 4% of the total backing block device.

If you're really in a bind and really need RGW but can't afford the enterprise SSDs for the block.db devices, you could get away with a smaller 1-2% of block size and use RGW Data Caching. That's my plan anyway if I ever need RGW.

barius said:
Is it a bad idea to share an SSD for the DB/WAL of several OSDs? My main concern is that if (when!!!) the SSD fails, all the OSDs with their DB/WAL on the same SSD will have to be destroyed and rebuilt. Right?

Yes, that is correct. I took this fact, combined it with the 2% block.db sizing on the enterprise SSDs (no RGW) and the maximum number of OSDs the chassis can hold (8x hot-swap), and came up with my own formula:

4x 4TB disks (expandable to 6x disks)
2x 240GB Enterprise SSDs (80GB partitions for each block.db)

At 2% of 4TB, that's an 80GB partition. 240GB / 80GB = 3 partitions per SSD. So 2x 240GB SSDs allow me a total of 6x 4TB HDDs in the chassis. Considering I am building 3 chassis and I already have 12x 4TB drives, this works out for me.
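A sketch of what that layout could look like as pveceph commands on one node, with hypothetical device names (sdb-sde for the four 4TB HDDs, sdf and sdg for the two 240GB SSDs) and the 80 GiB figure from the math above:
Code:
# first SSD carries block.db for the first two HDDs
pveceph osd create /dev/sdb --db_dev /dev/sdf --db_size 80
pveceph osd create /dev/sdc --db_dev /dev/sdf --db_size 80
# second SSD carries block.db for the other two HDDs
pveceph osd create /dev/sdd --db_dev /dev/sdg --db_size 80
pveceph osd create /dev/sde --db_dev /dev/sdg --db_size 80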

Also, I could swap that 8x 3.5" hot-swap Supermicro chassis for a 12x 3.5" hot-swap chassis and get a total of 9x 4TB HDDs with 3x 240GB SSDs. Perfect math and expandability.

There's one other redundancy fact hidden in this setup: if I lose an SSD, I only lose 2x OSDs (or 3x OSDs if I expand to 6x disks). This is why I chose 240GB and not the 480GB drives -- well, that and cost! To be clear, I purposely bought 2x 240GB instead of one single 480GB, as I plan to carve each 240GB into 2x 80GB chunks for now, leaving 80GB open for any future 4TB drives I add. Also, the deal I got on the 240GB drives was killer, and two were much cheaper than a single 480GB. So, win-win all around.

Everyone's systems, nodes, and needs are different. Do the math (2% if not using RGW, 4% if using RGW; more for heavy workloads, or less if you lean on the RGW cache).

I will be setting this up with Nextcloud for photo/document/drive sharing internally. Some of you may say, "But you absolutely want to use RGW with Nextcloud, it's built for that using the S3 interface!" Actually, my use case is their 40+ years of digitized photos and old videos, already organized in a massive collection. Nextcloud keeps zero ordering/folders/etc. in its S3 backend, dumping everything into a root folder with obtuse names. Hence a straight filesystem object store, not S3.

---

Remember one final fact: by default BlueStore will prefer to write to the block.db device and fsync there, but there are times when BlueStore will bypass your block.db altogether and go straight to the backing device. IIRC, things like large streaming writes.
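If I read the BlueStore docs right, the knob behind that behaviour is bluestore_prefer_deferred_size_hdd/_ssd: writes at or below that size are deferred through the WAL on the fast device, while larger writes go straight to the main device. The running value can be checked with something like:
Code:
# writes at or below this size are deferred via the WAL on the DB/WAL device
ceph config get osd bluestore_prefer_deferred_size_hdd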
 
So, to my understanding, we cannot use a single SSD for both the Proxmox OS and the Ceph DB/WAL files.

As far as I have read, correct. However, you are free to install Debian and partition the drives how you like. As long as you can present the partitions as raw block devices to Proxmox, you'll be set.

On the production servers we have RAID 1 with 2 SSDs for fault tolerance of the hypervisor installation.

Personally, I dropped RAID1/ZFS RAID1 from my OS drives. I treat the servers now as cattle instead of pets. I have Ansible set up my machines, so if an OS drive fails all I need to do is install Proxmox and then run my Ansible scripts. In the past, this took about 15 minutes end-to-end.

This also freed up a SAS/SATA port. *hint* *hint* ;-)

Next step is to set up PXE and assign configs per MAC address, for auto-provisioning when the machine turns on. The holy grail of automation: swap the SSD, turn on the server, PXE boot sees the machine has no OS and goes off to install automagically.

And 6 to 10 spinning drives for VM disks per server. If we cannot use the OS disk side by side with the Ceph DB/WAL, this means we will need to install 2 more SSDs in RAID 1 for fault tolerance of the WAL/DB, which will cost us a decrease of 12 to 20 TB of raw disk space per server.

You typically will not RAID/RAID1 your WAL/DB block device. That's what Ceph's replication is for: to ensure another full copy of the data is on another node somewhere.

You can also have multiple pools (with different OSDs) with different replication levels.
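For example (with a hypothetical pool name), the replication level is just a per-pool setting:
Code:
# keep 3 copies of every object, allow I/O as long as at least 2 are available
ceph osd pool set vm-pool size 3
ceph osd pool set vm-pool min_size 2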

And then you have backups/block device diffs on top of that.

My question is: is this proposed setup acceptable for performance and fault tolerance? If not, what kind of setup do you suggest?

Performance... As long as you use Enterprise-grade SSDs (see my previous post above), performance should be on par.
 
Finally I bought 4x 4TB spinning drives and 2x 900GB SSDs for Ceph on each node. I didn't want all the OSDs on one node to die if just one SSD failed. So the DB+WAL of two OSDs share one SSD and the DB+WAL of the two remaining OSDs share the other SSD.

I can't really say how it performs because I only just set it up (hence the delayed reply) and I don't have any benchmarking planned.

Concerning the question of partitioning, it was not necessary to partition the SSD to share it between two OSDs. Ceph manages it with LVM -- it creates a PV and VG covering the whole SSD and then allocates the required space as an LV for each OSD.

So with sde and sdf as my HDD and sdc as the SSD, it was as simple as this:
Code:
sudo pveceph osd create /dev/sde --encrypted 1 --db_dev /dev/sdc
sudo pveceph osd create /dev/sdf --encrypted 1 --db_dev /dev/sdc

This is the most helpful entry for me; I've spent probably 4 hours of research trying to add HDD OSDs with their DB/WAL on an SSD, combing through years of bad info floating around out there. It took me a moment to figure out why the Proxmox GUI was not doing this. The rabbit hole of manual partition creation and tomfoolery can be 100 percent avoided by this one line. If only bad data over 1 year old could be scrubbed from the internet forever. Bonus points to pveceph for auto-managing the DB size as well as the disk labels. Now I can grab SSDs for my other nodes. Thanks for this!
 
This is the most helpful entry for me; I've spent probably 4 hours of research trying to add HDD OSDs with their DB/WAL on an SSD, combing through years of bad info floating around out there. It took me a moment to figure out why the Proxmox GUI was not doing this. The rabbit hole of manual partition creation and tomfoolery can be 100 percent avoided by this one line. If only bad data over 1 year old could be scrubbed from the internet forever. Bonus points to pveceph for auto-managing the DB size as well as the disk labels. Now I can grab SSDs for my other nodes. Thanks for this!
Can you show the output of the db/wal partition structure? Just curious. Thanks.
 
Can you show the output of the db/wal partition structure? Just curious. Thanks.

I had it in a notepad but can't seem to find it!

It was pretty clever. It printed something like 'processing /dev/sdb xxxxxGiB, using 256GiB on /dev/sda for db/wal' with loads more technical info. You didn't have to actually specify any sizes for anything; it was all automatic. So if you had a 1TB OSD it would use 200GiB for instance, then 2TB would do 280 or so. This is from memory, but I remember 4x 2TB HDD OSDs fitting on a 1TB SSD with 50 megs or so to spare. I have no idea what would happen if you went over.

And unfortunately I don't have any spare SSDs to throw in and test any more - it's all set up now.
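For anyone needing that breakdown later, the per-OSD data and DB volumes (with their sizes and backing devices) can be listed at any time on the node in question; a quick sketch:
Code:
# dumps each OSD's block and block.db LVs, their sizes and underlying devices
ceph-volume lvm list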
 
I am evaluating VMware vSphere and Proxmox VE. I think the relationship between an OSD and its DB/WAL in Ceph may be very similar to a disk group in vSAN: once the cache disk fails, the entire disk group stops working.
One very confusing point in Ceph for me: what is the relationship between Cache Tiering and the DB/WAL in BlueStore? And if an HCI cluster needs to be built now, what approach should be adopted to handle mixed SSD and HDD disks?
 
