Ceph below min_size: why not read-only?

Dec 2, 2020
According to the Ceph docs, if a pool has fewer than min_size OSDs available, I/O is blocked; that includes writes, but also reads. This seems counter-intuitive to me. Does anyone know why this is the case?
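For reference, the two settings in question can be inspected per pool like this (the pool name is just a placeholder):

    # show how many copies a pool keeps and how many it needs to serve I/O
    ceph osd pool get mypool size        # e.g. 3
    ceph osd pool get mypool min_size    # e.g. 2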

Examples:
  • Standard 3/2 pool, replicated with size 3, min_size 2: with 2 OSDs I can fully read and write, but with only 1 everything stops. Why should I not be able to read from the remaining OSD?
    (One might argue that some reads could cause metadata writes; in case that is a problem, the next example tries to avoid it.)
  • CephFS metadata and base data pools standard 3/2 replicated, plus an additional 4+2 EC data pool on 6 separate OSDs. The EC pool is 6/5, size 6, min_size 5, as suggested in the docs (see the sketch after this list). This seems reasonable, yet when shutting down 2 OSDs of the EC pool (e.g. during maintenance of one host), it is completely blocked, even though all metadata changes (I hope) go only to the replicated pools, which are still fully available.
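For context, the EC pool from the second example would be created roughly like this (names are placeholders; with only 6 OSDs spread over fewer hosts the failure domain has to be osd):

    # 4 data + 2 coding shards, one shard per OSD
    ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=osd
    ceph osd pool create ecpool 128 128 erasure ec42
    ceph osd pool get ecpool min_size    # defaults to k+1 = 5 here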
In both scenarios one could lower min_size temporarily to regain access to the data, but the docs warn about this, I guess rightfully. It would be more sensible if one could manually put a pool into read-only mode, but I have not found any reasonable way to do that (even though PGs apparently can have a read-only state). Any ideas? Thanks!
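For completeness, the risky workaround the docs warn about would be just (placeholder pool name again):

    # risky: the pool accepts reads AND writes with a single copy until reverted
    ceph osd pool set mypool min_size 1
    # ... access the data ...
    ceph osd pool set mypool min_size 2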

Ceph is all about data consistency. It is not guaranteed that the single remaining copy is a valid copy. This can only be assured when there is a "majority" of copies available that are all the same. It is basically the same principle as with the quorum of the MONs.

BTW: what counts is not the number of remaining OSDs but the number of available copies of each PG (placement group).
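You can check this per PG; the acting set shows which copies are actually available, e.g.:

    # list the PGs of a pool with their acting OSD sets
    ceph pg ls-by-pool mypool
    # full state of a single PG (the PG id is a placeholder)
    ceph pg 1.2f query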
 
Also keep in mind possible "split brain" situations, where both parts of the brain have some copy: Ceph can never know which one should be read from if neither reaches min_size, as it has no way to compare against or reach the other copy.

If you want to access your PGs, the admin will have to lower min_size on the appropriate "side" of the "brain", provided that the MONs have quorum, of course.
 
An actual "overall" split-brain scenario should already be prevented by min_size, so it seems the worst thing, that could happen if we allow reads below min_size, is reading outdated (but pool-wise consistent) data, right?

In the more complex example with the 4+2 EC pool, suppose we temporarily create a partially inconsistent state (e.g. by writing data with only 5 OSDs up, then taking 2 down and bringing the outdated one back up). Ceph already needs to detect this, as it can (hopefully) only recover shards from consistent data. According to the docs, it does recover below min_size (https://docs.ceph.com/en/reef/rados/operations/erasure-code/#erasure-coded-pool-recovery), so if Ceph can read internally, clients surely could as well? And I guess reading outdated data is not possible in that example either, since we cannot read from a minority?
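(As an aside, the maintenance scenario could also be avoided entirely at the CRUSH level, assuming there are at least k+m = 6 hosts: with a failure domain of host, no two shards of a PG land on the same host, so taking one host down costs each PG only one shard:

    # hypothetical profile name; requires at least 6 hosts
    ceph osd erasure-code-profile set ec42host k=4 m=2 crush-failure-domain=host

But with the 6 OSDs spread over fewer hosts, as in my example, that is not an option.)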

If so, the first (replicated) example may be a problem, but why not allow reads in the second (4+2 EC) example?

If you want to access your PGs, the admin will have to lower min_size on the appropriate "side" of the "brain", [...]
But that's exactly the problem: I do not want to lower min_size, as this would indeed be risky if any write happens. It would only be reasonable if I could (manually) put the pool into read-only mode first, but I don't know whether that's actually possible. Do you? Thanks!
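The closest approximation I can think of (untested, purely an assumption on my part, not a documented read-only mode): a pool that exceeds its quota blocks writes but still serves reads, so setting the quota below current usage before lowering min_size might act as a poor man's read-only pool:

    # untested sketch, placeholder pool name; quota enforcement is not instantaneous
    ceph osd pool set-quota mypool max_bytes 1    # far below usage -> pool marked full, writes blocked
    ceph osd pool set mypool min_size 1           # regain (read) access
    # ... recover data ...
    ceph osd pool set mypool min_size 2
    ceph osd pool set-quota mypool max_bytes 0    # 0 removes the quota

Whether that is actually safe against the metadata-write problem from my first example, I don't know.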