Adding this for posterity in case anyone else is googling this topic down the road. Here is probably the single biggest risk, based on some back-reading I've done on the ceph-users list (credit goes to ceph-users member Wido for the explanation; I'm restating it in my own words).
In a 2/1 (size=2, min_size=1) scenario, even with SSDs, you could have a situation where:
- A host goes down for a reboot (some PGs are now at min_size, but the cluster keeps chugging along, including accepting writes that currently land on only a single live copy)
- Now an OSD that shares PGs with the rebooting (still not back up) host fails.
- The rebooting host comes back up and the "missing" PG objects are back, sort of -- but the writes that went to the now-dead OSD while the host was down effectively "never happened": the replica on the rebooted host never received them, and the only OSD that did is gone. I believe Ceph detects this inconsistency, but you'll have a big mess to clean up and will probably have to restore from backups.
The other bad scenario with 2/1 is simply two disks failing in close succession (the second before the first can rebuild, or a URE turning up on the single remaining copy during the rebuild). I would still think the risk of both scenarios is greatly reduced in an all-SSD environment, and monitoring wear level and other SSD SMART stats would help you pull SSDs nearing end-of-life in advance, mitigating the risk to a large degree.
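For reference, here's roughly how I'd check what a pool is actually set to and keep an eye on SSD wear -- just a sketch, where the pool name and device paths are placeholders for whatever you have, and the exact SMART attribute names vary by vendor:

```
# Check the current replication settings on a pool (pool name is just an example)
ceph osd pool get rbd size
ceph osd pool get rbd min_size

# Keep an eye on SSD wear so worn drives can be swapped out early
# (device paths are placeholders; attribute names differ between vendors)
smartctl -A /dev/sda      # SATA SSD: look for wear/endurance attributes
smartctl -a /dev/nvme0    # NVMe: look at "Percentage Used" in the health log
```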
So yes, definitely some risk (improbable though it may be in a cluster of my size). It could be manageable and still worth taking depending on the situation/use case.
The whole-host-down case (e.g. for maintenance) seems to be the highest risk (the data-corruption scenario above), imho... some ideas for mitigation:
- Backups right before any planned maintenance that involves node reboots
- Create a separate 3/2 pool for critical VMs, and keep the VMs where losing everything since last night's backup is tolerable on the 2/1 pool (a rough sketch of the pool setup is below this list)
- Shut down all VMs on the Ceph pool during the maintenance window to avoid any possible write inconsistencies if something goes wrong
- Reboot your nodes one at a time; before each reboot, set that node's OSD reweights to 0 so its data drains to the other OSDs, then set them back to 1 after the reboot (make sure the rest of the cluster has enough free space first -- see the second sketch below the list)
- If your Ceph cluster is small, temporarily migrate your storage to NFS (that FreeNAS box everyone has around to hold their ISOs and backups ;-) during the maintenance window (note: I think migrating disks away from Ceph and then back may break thin provisioning, which could negate the space savings of going 2/1 anyway)
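A rough sketch of the separate 3/2 pool idea (the pool name and PG count are just examples, and I'm assuming an RBD/VM use case):

```
# Create a dedicated pool for critical VMs (name and pg_num are just examples)
ceph osd pool create critical-vms 128 128
ceph osd pool set critical-vms size 3
ceph osd pool set critical-vms min_size 2
ceph osd pool application enable critical-vms rbd

# Leave the existing 2/1 pool for VMs where last night's backup is good enough
```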
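And the reweight-before-reboot idea might look something like this (the OSD IDs are placeholders for the ones on the node being rebooted; it shuffles a lot of data around, so check ceph df for free space first):

```
# Drain the OSDs on the node you're about to reboot (IDs are examples)
ceph osd reweight 0 0.0
ceph osd reweight 1 0.0

# Wait for the data to finish moving before rebooting
ceph -s    # proceed once PGs are back to active+clean

# ...reboot the node, then restore the weights...
ceph osd reweight 0 1.0
ceph osd reweight 1 1.0
```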
If I understand correctly, an unplanned host-down will trigger rebalancing once the down-out window expires (mon_osd_down_out_interval -- I had 300 seconds in my head, though I believe the default is 600), whereas for a requested reboot the noout flag is set on the host's OSDs so they never get marked out?
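That seems to be the usual routine for a planned reboot: set noout cluster-wide for the duration. On recent Ceph releases you can also query the interval itself; a sketch:

```
# Check how long the mons wait before marking a down OSD "out" and rebalancing
ceph config get mon mon_osd_down_out_interval

# Before a planned reboot: stop OSDs from being marked out automatically
ceph osd set noout

# ...reboot the node, wait for its OSDs to rejoin and PGs to go active+clean
# (watch ceph -s), then remove the flag so normal recovery behaviour returns
ceph osd unset noout
```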
However, at the end of the day, I think I've come around and agree that 3/2 is likely the way to go if you want to sleep well at night and not worry too much about maintenance reboots. I'll maybe do that and enable LZ4 compression to win back some of the lost space, if the performance hit is tolerable.
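If I do go that route, my understanding is it comes down to the pool's size/min_size plus BlueStore's per-pool compression settings, roughly like this (the pool name is a placeholder for whatever your RBD pool is called, and compression only applies to data written after it's enabled):

```
# Bump the existing pool to 3/2 (pool name is just an example)
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2

# Enable LZ4 compression on the pool (BlueStore; affects new writes only)
ceph osd pool set rbd compression_algorithm lz4
ceph osd pool set rbd compression_mode aggressive
```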
Sorry for the novel!