Bluestore / SSD / Size=2?

markmarkmia

Feb 5, 2018
I do see posts saying size=2, min_size=1 is a bad idea. But some of the "worst" reasons this is/was a bad idea (data inconsistency when there are two mismatched copies of an object because a rebalance started or writes happened and then an OSD comes back to life, or something similar) seem like they may be addressed by Bluestore checksums.

With SSDs (which should rebalance fast after a dead OSD) and Bluestore (with checksums), is size=2, min_size=1 relatively safe now? Aside from losing a second SSD (or the host it's on) during a rebuild, are there any other risks? Otherwise this seems to be about on par with the risks of using RAID10, or am I missing something?
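(For reference, when I say size=2, min_size=1 I mean the per-pool settings below; "vm-ssd" is just a placeholder pool name, not a real one on my cluster.)

```
# Keep two copies of every object in the pool...
ceph osd pool set vm-ssd size 2
# ...and keep accepting I/O as long as at least one copy is available.
ceph osd pool set vm-ssd min_size 1
```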
 
While the checksum verifies that whatever object was written to the disk has not been altered and is still the same on that disk, it is not possible to recover if you don't have another valid copy of that object.

As an example, take the case of size=2, min_size=1: if an OSD dies, you have only one copy of the object left, and if that copy is also not valid then there is no way to recover. Also, during recovery (or a write) an object can be in flight and not reside on any OSD anymore, leaving Ceph with only the copy in RAM.
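For completeness, this is roughly how such a bad copy would show up; a minimal sketch, with 2.1a standing in for a real PG ID:

```
# Re-read and checksum every replica of every object in the PG.
ceph pg deep-scrub 2.1a
# Once the scrub has run, list objects whose copies disagree.
rados list-inconsistent-obj 2.1a --format=json-pretty
# Repair rewrites the bad copy from a good one, which only works
# if a valid copy still exists somewhere.
ceph pg repair 2.1a
```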
 
Is it common for an object to be not valid? If I compare a Ceph cluster on relatively reliable servers (dual power supplies, ECC RAM, UPS-backed) with redundant switches/links and enterprise-grade MLC SSDs in the 400-500 GB range, is my risk of data loss (roughly speaking) with size=2, min_size=1 any worse than, say, my EqualLogic SAN with 1 TB 7200 rpm SAS drives in RAID10? (edit: I'm specifically referring to Bluestore here, with its copy-on-write behavior, which I also understood helps reduce the risk of write conflicts/corruption when running temporarily degraded)

I'm obviously trying to balance cost with reasonable risk.

On the other hand, now that Ceph supports EC pools without a cache tier, is there any chance Proxmox will support that soon? I'd be keen to test and see how the performance is with 4+2 EC (edit: EC with SSDs only; I'm really interested in running pure-SSD Ceph for VM storage. I have no problem with size=3 replication on spinning rust, because it's cheap and the rebuild times are obviously slower, with a higher chance of UREs).

I can set all of the parameters with the 'ceph' command line to update the CRUSH map and get it to work; the only issue is that when creating an RBD image you select a replicated pool (to store the metadata) and pass an argument specifying the EC pool as the data pool. If you recall which Proxmox script creates the RBD images for VMs, I could go and modify it and do some testing? :)
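In case it's useful, this is roughly what I've been doing by hand with plain ceph/rbd commands (all pool and image names below are placeholders, not what Proxmox would generate):

```
# 4+2 erasure-code profile with host as the failure domain.
ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
# Create the EC data pool and allow partial overwrites (needed for RBD).
ceph osd pool create ec-data 128 128 erasure ec42
ceph osd pool set ec-data allow_ec_overwrites true
# Metadata stays in an existing replicated pool (here "rbd-meta");
# only the image data lands in the EC pool.
rbd create --size 100G --data-pool ec-data rbd-meta/vm-test-disk
```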
 
Is it common for an object to be not valid?

No. But this is similar to putting a gun with one bullet to your head and pulling the trigger; common or uncommon doesn't matter if the chamber was loaded.

There is a very simple calculation you can make here: what is the cost of losing data, in terms of the direct consequence to your customers and the indirect cost of your time in recovery? If it's more than the cost of the added hardware, then it's a no-brainer.
 
I agree, that's the standard calculation. But RAID10 seems to be acceptable to most people, as the chances of a fault are statistically quite low as long as you aren't using enormous drives (or consumer drives with a worse BER). I'm trying to understand (as I somewhat, but obviously not fully, understand Ceph) whether size=2 on smaller SSDs (so faster rebuilds) on enterprise-grade equipment is the same as, more, or less risky than a typical SAN in RAID10 in terms of probability of data loss. I'm just wondering if there are Ceph-specific gotchas not addressed by Bluestore that would make this riskier than typical RAID10.

If it's the same or even less risky, and RAID10 has served me well (as it has others), then I'd probably take my chances on size=2 (with daily backups). Though I certainly wouldn't if it were 4 TB or 8 TB spinning disks, as the rebuild could take a long time (during which other drives or hosts could fail; I get that).

I've got some non-critical loads on a size=2 SSD pool for now, but I think I'd opt for 4+2 direct EC on SSD if/when Proxmox supports it and the performance hit isn't too noticeable, for peace of mind.

Thanks for your input Alex!
 
Erasure coding has a higher computational cost and carries a higher possibility of data loss; in the simplest case it acts like RAID-5 and sustains one OSD failure. On the ceph-users mailing list there are reports that using erasure coding for RBD images decreases performance. In the best case you gain 25% disk space.

As @alexskysilk said, it all depends on how the cost of a (downtime) recovery compares to the gained performance/space. There is a calculation floating around somewhere on the ceph mailing list that states, IIRC, that in small clusters the chance of data loss with size 2/1 is ~11%. This decreases with increasing size of the Ceph cluster, as it takes less time to recover into a good state and PGs are more widespread. And in my opinion, this is underscored by the fact that the devs moved away from size 2/1 to 3/2.

I've got some non-critical loads on a size=2 SSD pool for now, but I think I'd opt for 4+2 direct EC on SSD if/when Proxmox supports it and the performance hit isn't too noticeable, for peace of mind.
AFAIK, there are no plans to support EC pools.
 
I've spun this around every way I can think of and have not found a scenario where EC pools make sense in a small cluster.

They have immense value in large to very large clusters. IMNSHO, EC pools start to make sense when your pool consists of at least 12 nodes (8+3 EC pools, with at least one additional node to allow some entropy in PG assignment and to permit recovery to fully stable operation with a node down). Note that this is 12 nodes, not just 12 OSDs: you need to ensure relatively independent failure modes between all of the PGs in the pool, or whatever you've gained in storage utilization you will give up in resiliency.

It's also not clear that you really want EC pools for VM RBDs; it is not at all clear that the cost in performance is worth it for the relatively small storage demand of VM images. You also add additional risk of data loss with EC pools, especially small ones. It would probably work, but the gain vs. pain quotient does not look favorable. EC pools rock for massive data storage; not so much for RBD.

Of course, this is just my opinion. YMMV.
 
Thanks PigLover. I had been thinking of a 4+2 EC pool for RBD; I had heard it got a bad rap with a cache tier in front of it, but everyone's use case is a bit different. I have thought it over and I agree: I think even with SSDs, having to do (in 4+2) six I/Os to write a stripe (to 6 OSDs) for a single operation is probably not going to be very good for performance versus replication. I posted a question to the ceph-users list hoping to get a bit more insight into how risky small SSDs in replica 2/1 are versus regular spinning RAID10 (which I've already been using for years). I may also consider enabling LZ4 and using a 3/2 pool for critical VMs. Love how flexible Ceph is.

Longer term: I saw somebody who has Ceph set up so that reads are primarily done from SSDs and the replicas all go to spinners that have SSD/Optane WALs (so writes ACK fast). That also seems like a good way to get pure-SSD performance with better data durability... I just don't have any 3.5" bay capacity in any servers at the moment to try it.
 
Longer term: I saw somebody who has Ceph set up so that reads are primarily done from SSDs and the replicas all go to spinners that have SSD/Optane WALs (so writes ACK fast). That also seems like a good way to get pure-SSD performance with better data durability
For writes that might work; for reads you need quorum per PG, which will require you to have sufficient members of equal speed. If you only have one SSD in the PG you'll end up with the read latency of a spinning disk.
 
To add for posterity, in case anyone else is googling this topic down the road: here is probably the single biggest risk, from some back-reading I've done on the ceph-users list (credit goes to ceph-users member Wido for the explanation; I'm restating it in my own words).

In a 2/1 scenario even with SSDs you could have a situation where:

- A host goes down for a reboot (now some PGs are at min_size, but the cluster keeps chugging along, including accepting writes to PGs that currently have only a single copy)
- Now an OSD fails in one of those same PGs while the rebooting host is still not back up.
- The rebooting host comes back up and the "missing" PG objects are back, sort of: writes that went to the now-dead OSD "never happened", because the replica on the rebooted host never received them and the OSD that took them is gone. I believe Ceph detects this inconsistency, but you'll have a big mess to clean up and will probably have to restore from backups.

The other bad scenario in 2/1, I think, is simply that two disks fail in succession (the second before the first can rebuild, or a URE is found on the single remaining disk during the rebuild). The risk of both of these scenarios is greatly reduced in an all-SSD environment, I would still think, and monitoring wear level and other SSD SMART stats would help with pulling SSDs nearing EOL out in advance, mitigating this risk to a large degree.

So yes, definitely some risk (improbable though it may be in a cluster my size). It could be manageable and still worth the risk depending on the situation/use case.

The whole-host-down case (e.g. for maintenance) seems to be the highest risk (the data corruption scenario above) IMHO... some ideas for mitigation:

- Backups right before any planned maintenance that involves node reboots
- Create a separate 3/2 pool for critical VMs (and leave VMs where you can tolerate losing data back to last night's backup on 2/1)
- Shut down all VMs on the Ceph pool during the maintenance window to avoid any possible write inconsistencies if something goes wrong
- Reboot your nodes slowly; before each reboot, change that host's OSD reweights to 0 to drain all data to other OSDs, then change them back to 1 after the reboot (make sure you have enough free space across the rest of the cluster first; see the sketch after this list)
- If your Ceph cluster is small, temporarily migrate your storage to NFS (that FreeNAS box everyone has to hold their ISOs and backups ;-) during the maintenance window (note: I think migrating disks away from Ceph and then back may break thin provisioning, so this may negate the savings of going 2/1 anyway)
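Roughly what I mean for the reweight/noout part; 3 and 4 are placeholder IDs for the OSDs on the host being rebooted:

```
# Option A: drain the host's OSDs ahead of a longer maintenance window.
ceph osd reweight 3 0
ceph osd reweight 4 0
# ...wait for the cluster to return to HEALTH_OK, reboot, then restore:
ceph osd reweight 3 1
ceph osd reweight 4 1

# Option B: for a quick reboot, tell Ceph not to mark down OSDs "out"
# (and so not to rebalance) while the host is briefly away.
ceph osd set noout
# ...reboot the node, wait for its OSDs to rejoin, then:
ceph osd unset noout
```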

If I understand correctly, an unplanned host-down will trigger rebalancing after the 300-second window, whereas when it's a requested reboot the noout flag is set on the host's OSDs?

However, at the end of the day, I think I've come around and agree that 3/2 is likely the way to go if you want to sleep well at night and not worry too much about maintenance reboots. I'll maybe do that and enable LZ4 to make up for some of the lost space if the performance hit is tolerable.
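For anyone else looking later, the compression bit is set per pool on Bluestore; a minimal sketch, with "vm-critical" as a placeholder pool name:

```
# lz4 must be available in your Ceph build (snappy is the usual default).
ceph osd pool set vm-critical compression_algorithm lz4
# "aggressive" compresses all writes unless a client hints otherwise.
ceph osd pool set vm-critical compression_mode aggressive
```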

Sorry for the novel!
 
For writes that might work; for reads you need quorum per PG, which will require you to have sufficient members of equal speed. If you only have one SSD in the PG you'll end up with the read latency of a spinning disk.

Thanks Alex. I may have mis-stated what I had read. I think the primary OSD for every object was on an SSD; I'm not sure of the exact CRUSH configuration details, but the effect was that all writes went to the SSDs and then replicated to the spinners (with SSD WAL), and all reads went to the SSD unless it failed, in which case rebuilds, and temporarily reads, came from the HDDs. I think the affinity setting was used as well, but I'm still largely a Ceph newb, so I'm not 100% sure.
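If I understood those posts right, the knob involved is primary affinity, which biases which OSD acts as a PG's primary (and so serves the reads); a rough sketch, with osd.0 as an HDD and osd.1 as an SSD (placeholder IDs, and older releases may additionally need 'mon osd allow primary affinity = true'):

```
# Make the HDD OSD ineligible to act as primary, so reads land on the SSD.
ceph osd primary-affinity osd.0 0
# Leave the SSD OSD at full primary affinity (the default is 1).
ceph osd primary-affinity osd.1 1
```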
 
