Ceph pool size (is 2/1 really a bad idea?)

topquark

Situation: I've got a 3-node cluster and want to use Ceph for HA storage for VMs.
When creating a Ceph pool the default is 3/2, meaning you only have 33% of your total storage available as capacity. So I'm thinking of making a 2/1 pool, which gives me 50% of my capacity. This would mean one node can fail, or any single OSD can fail, and the cluster would still keep running. Why is this a bad idea?

Specifically, I've read some things about min_size being the size at which no writes are allowed. Is this true? If so, why is 3/2 the default? After all, won't the cluster stop running on a single node failure until data redundancy is back up to 3? Wouldn't 3/1 be better then?
 
https://docs.ceph.com/docs/master/rados/operations/pools/
says that

min_size:
Sets the minimum number of replicas required for I/O.

So no, min_size is actually the minimum number of replicas at which the pool will still accept I/O (so with 3/2 a PG can drop to 2 replicas and still be written to).
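For reference, this is how you can check and adjust those two values on an existing pool (the pool name vm-pool is just a placeholder):

# show the current replication settings
ceph osd pool get vm-pool size
ceph osd pool get vm-pool min_size

# set the recommended 3/2
ceph osd pool set vm-pool size 3
ceph osd pool set vm-pool min_size 2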

2/1 is generally a bad idea because it is very easy to lose data, e.g. bit rot on one disk while the other fails, flapping OSDs, etc.
The more OSDs you have, the more likely such data loss becomes with this setting.

A more prominent example of how it can fail is this story from 2016:
https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
Even though they did not lose any (much?) data, it was a lot of work to get the cluster running again.
 
Do you have a calculator for size/min?
Like, 3/2 will be around 33% of the total size..

Also, how about pg_autoscale mode? It seems the default is on, so is using that mode the best way?
 
Do you have a calculator for size/min?
How much space is being used? Calculate with roughly 1/3 of the raw capacity as usable space. But keep in mind that you need enough resources/space left to handle the loss of OSDs and complete nodes.
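A quick way to see the raw capacity and how much each pool can still store after replication is:

# cluster-wide raw usage; the per-pool MAX AVAIL column already accounts for the replica count
ceph df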

The smallest possible cluster size of 3 nodes is a special case. If a full node dies, the cluster operates with 2 replicas until the 3rd node is back up. But what if only one OSD in a node fails? Then Ceph will try to get back up to 3 replicas with the OSDs it still has.
So assume you have 2 OSDs per node and they are all about 45% full. One OSD in a node now dies, and Ceph will recreate the replicas of the failed OSD on the remaining OSD in that node. That remaining OSD will end up being about 90% full...

The more OSDs you have, the easier it is for Ceph to balance out the space usage. This is not that much of an issue if you have a larger cluster with more nodes, as it can then also spread the data out to other nodes.
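To keep an eye on how full individual OSDs get during such a recovery, the usual monitoring commands are enough:

# per-OSD utilization, grouped by host
ceph osd df tree

# overall health and recovery progress
ceph -s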

Regarding the autoscaler: if you have an idea how much space your pools will use, set the "target_ratio" accordingly. This helps the autoscaler to set the pg_num well right away, instead of waiting for the pool(s) to fill up.

It will only change the pg_num automatically if the difference between the current pg_num and the ideal one is at least a factor of 3. If the difference is smaller, it will still show you the suggested value, but you will have to set the pg_num to that value manually.
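In practice that looks like this (vm-pool is again just an example name):

# see what the autoscaler currently suggests for each pool
ceph osd pool autoscale-status

# tell it that this pool is expected to use (roughly) all of the cluster's capacity
ceph osd pool set vm-pool target_size_ratio 1.0

# optional: only warn instead of changing pg_num automatically
ceph osd pool set vm-pool pg_autoscale_mode warn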
 
Noted. If we use only one pool, then setting target_ratio will be useless..
In that case, just set it to 1; you still tell the autoscaler that this pool will be using all the space in the cluster.
 
In that case, just set it to 1; you still tell the autoscaler that this pool will be using all the space in the cluster.
How about if we don't set the ratio to 1? Is the behavior the same? The single pool will use the entire space in the cluster anyway, right?

Also, with CephFS there are 2 pools that serve 1 CephFS, namely cephfs_data and cephfs_metadata.
cephfs_data will contain many files, but cephfs_metadata only stores metadata, so its size stays relatively small, correct?

Also, if we set a target size of 2GB on a particular pool (A), how do we check that the maximum size of pool (A) is actually about 2GB?
 
If you don't set any target for the autoscaler, it depends on the scale mode of the autoscaler. I think in 16.2.6 the default was downscale mode, while before (and now again) it is upscale mode. Upscale mode means that the pg_num starts rather low and will increase if needed. Downscale mode sizes the pg_num to the maximum and reduces it for a pool that turns out not to take up as much space as other pools, once they start to grow.

Setting a target ratio is always a good idea to let the autoscaler know where the journey is going.

Also, if we set a target size of 2GB on a particular pool (A), how do we check that the maximum size of pool (A) is actually about 2GB?
This is not a limit; it just gives the autoscaler an idea of how large the pools will become, so it can adjust the pg_num for each pool accordingly.
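If you actually want to enforce a maximum size, that is a separate mechanism from the autoscaler hint. Roughly, using your pool A as the example:

# hint for the autoscaler only, does not limit anything (2 GiB)
ceph osd pool set A target_size_bytes 2147483648

# a real hard limit is a pool quota
ceph osd pool set-quota A max_bytes 2147483648
ceph osd pool get-quota A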
 
Okay.. so the minimum is 3 nodes, and at most 1 node can be down,
because we use pool size 3/2.

If we have 4 nodes, can 2 nodes be down?

Also, if 1 node goes down, all VMs running on that node will be transferred to another running node, and those VMs go down and boot up again..
Is there any way to get really, really high availability, so that even the VMs belonging to that node (which is currently down) stay up 100% without a reboot or anything? Is that possible?
 
If we have 4 nodes, can 2 nodes be down?
As always, it depends ;)

If both die within a very short time, then no. With a size 3/2 and 4 nodes, the 3 replicas will be spread over all 4 nodes. If two of those die, you will have some PGs that will have lost 2 of their 3 replicas. Until Ceph is able to recreate them to get them at least to 2 replicas (on the 2 remaining nodes), the affected pool(s) will be put into read only mode.

If a node fails and stays offline for some time, Ceph will consider those OSDs as out (default after 10 min) and will recreate the lost replicas on the remaining nodes. If that is done and the cluster is healthy again with 3 out of 4 nodes, you could lose another node and the cluster would still work, in a degraded state.

Of course, you will need to have enough free space on the remaining nodes to recreate the lost replicas of the 4th node.
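Two related knobs worth knowing in this context (standard Ceph commands, nothing specific to your setup):

# how long Ceph waits before marking a down OSD as out (default 600 seconds)
ceph config get mon mon_osd_down_out_interval

# for planned maintenance, prevent rebalancing while a node reboots
ceph osd set noout
# ... do the maintenance ...
ceph osd unset noout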

Also, if 1 node goes down, all VMs running on that node will be transferred to another running node, and those VMs go down and boot up again..
Is there any way to get really, really high availability, so that even the VMs belonging to that node (which is currently down) stay up 100% without a reboot or anything? Is that possible?
The PVE HA stack can and will start those VMs on other nodes. If you need HA without even a short interruption, you will have to set up HA on the application level. You can still use HA groups to make sure that such VM pairs never run on the same node: create a group for each VM, select different nodes for each, and enable the "restricted" checkbox to make sure they will never run on the same node.
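On the CLI such a setup could look roughly like this; the group names, node names and VMIDs are only examples:

# one restricted group per VM, each pinned to different nodes
ha-manager groupadd grp-app-a --nodes node1,node2 --restricted 1
ha-manager groupadd grp-app-b --nodes node3,node4 --restricted 1

# add the VMs as HA resources in their respective groups
ha-manager add vm:101 --group grp-app-a
ha-manager add vm:102 --group grp-app-b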
 
Okay, thanks.. is it possible to create a VM on CephFS instead of on Ceph block storage?
 
Okay, thanks.. is it possible to create a VM on CephFS instead of on Ceph block storage?
Why would you want to do that?

With a few hacks a lot is possible, but be aware that the CephFS may not react for a few seconds, maybe even minutes, if another MDS takes over and has to catch up! Not what you want for your underlying VM storage!

Also, running a setup that is far away from the defaults will increase the chances that you run into issues, since no one is testing that, and it makes it harder to help you because assumptions about defaults are no longer correct.
 
Why would you want to do that?

With a few hacks a lot is possible, but be aware that the CephFS may not react for a few seconds, maybe even minutes, if another MDS takes over and has to catch up! Not what you want for your underlying VM storage!

Also, running a setup that is far away from the defaults will increase the chances that you run into issues, since no one is testing that, and it makes it harder to help you because assumptions about defaults are no longer correct.
Okay..

Assume right now we have 4 nodes with the latest Proxmox 7.1 in 1 cluster with Ceph..
Later we will add 2 more nodes, but with the newest Proxmox version, for example 8.0.
Can we add nodes with a different version of Proxmox? Do we need to use the same Ceph version as well?
 
Can we add nodes with a different version of Proxmox? Do we need to use the same Ceph version as well?
Please install updates on a regular basis. In a cluster, you can live migrate VMs between nodes to keep them running while the node needs to reboot.

Mixing Proxmox VE versions long term is not supported and should only be done during upgrades of the whole cluster.

The same goes for Ceph within a cluster. When it comes to external clients, it is possible to have a bit of a version difference.

Keeping mixed Proxmox VE versions will result in issues regarding live migration. While we make sure that a live migration from an older version to a newer version will always work, we cannot guarantee that a live migration from a newer to an older Proxmox VE version will work all the time.

If you have mixed versions of Proxmox VE in your cluster, it is also possible that the API versions on the nodes differ. This can be an issue if you use the GUI on node 1 to do something on node 2 which has a different version installed and might not understand the request.
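To see what is actually running on each node before and after such an upgrade, these read-only commands are handy:

# package versions on a Proxmox VE node
pveversion -v

# which Ceph version each daemon in the cluster is running
ceph versions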
 
Please install updates on a regular basis. In a cluster, you can live migrate VMs between nodes to keep them running while the node needs to reboot.

Mixing Proxmox VE versions long term is not supported and should only be done during upgrades of the whole cluster.

The same goes for Ceph within a cluster. When it comes to external clients, it is possible to have a bit of a version difference.

Keeping mixed Proxmox VE versions will result in issues regarding live migration. While we make sure that a live migration from an older version to a newer version will always work, we cannot guarantee that a live migration from a newer to an older Proxmox VE version will work all the time.

If you have mixed versions of Proxmox VE in your cluster, it is also possible that the API versions on the nodes differ. This can be an issue if you use the GUI on node 1 to do something on node 2 which has a different version installed and might not understand the request.
Okay, so what is the best practice to update the versions... Proxmox version first and then Ceph version, or vice versa?
 
Okay, so what is the best practice to update the versions... Proxmox version first and then Ceph version, or vice versa?
Between major versions? There are / will be upgrade guides that specify the steps needed. For example, from PVE 6 to 7: https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0

Major Ceph releases can happen within a Proxmox VE release. There is usually also an upgrade guide that elaborates on how to proceed.

For example, going from PVE 6 to 7, you could keep running Ceph 15 with PVE 7, or upgrade to Ceph 16 once the cluster is running PVE 7.
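For the 6 to 7 jump there is also a checklist script (pve6to7) described in that guide; a typical run on each node before upgrading looks like this:

# built-in upgrade checklist, run on every node prior to the upgrade
pve6to7 --full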
 
As always, it depends ;)

If both die within a very short time, then no. With a size 3/2 and 4 nodes, the 3 replicas will be spread over all 4 nodes. If two of those die, you will have some PGs that will have lost 2 of their 3 replicas. Until Ceph is able to recreate them to get them at least to 2 replicas (on the 2 remaining nodes), the affected pool(s) will be put into read only mode.

If a node fails and stays offline for some time, Ceph will consider those OSDs as out (default after 10 min) and will recreate the lost replicas on the remaining nodes. If that is done and the cluster is healthy again with 3 out of 4 nodes, you could lose another node and the cluster would still work, in a degraded state.

Of course, you will need to have enough free space on the remaining nodes to recreate the lost replicas of the 4th node.

I am experimenting with my new cluster, still in pre-production: 4 nodes, 8 OSDs (2 per node) of 2.2TB each, replicas 3/2, PG autoscale set to warn, Ceph version 16.2.7. What you described above did not work. First, I simulated a 2-node failure, one node right after the other, which, as you described, would not work, and Ceph was out of order (timeouts in the PVE web interface). Then I figured I needed to let it rebalance first before taking down the second (out of 4) node. I waited for about 45 minutes and everything was rebalanced, but when I took the second node down, Ceph timed out and was unusable.

My pool is set to 3/2 with 128 PGs; the size shown is 6TB and the usage (again, it is just for testing) is 2.79%, i.e. 169GB.

My config is default, as I am using the same type of drives. The one thing I added was a replicated rule for SSDs, but that was only for testing, since I also had regular HDDs in there at some point; so in order to create two pools I added:

rule ssd {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 0 type host
        step emit
}

which is a copy of the default replicated_rule, just named ssd, so I could create two pools; right now only one pool is configured.
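For reference, I believe the same rule could also have been generated directly instead of editing the CRUSH map by hand (standard Ceph command, rule name ssd as above):

# replicated rule restricted to OSDs of device class "ssd", failure domain host
ceph osd crush rule create-replicated ssd default host ssd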

Here is my global config:
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.0.102.220/24
mon_allow_pool_delete = true
mon_host = 10.0.101.220 10.0.101.222 10.0.101.221 10.0.101.223
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.0.101.220/24

Any suggestions on what I am doing wrong, or not doing, to make this work with only two out of 4 nodes up?


Thank you
 
I think your issue is that you have 4 MON nodes, which means that after 2 nodes are down your cluster is no longer quorate: 50% of the monitors are down, and the remaining 50% cannot be certain whether they are the "surviving" part of the cluster or just subject to a split-brain scenario.

I think your options are using 3 MON nodes, or adding a 5th MON-only node.
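To verify the quorum situation you can look at the monitor map, and on Proxmox VE a monitor can be removed again with pveceph (the node name below is just a placeholder):

# list the monitors and show which of them currently form the quorum
ceph mon stat
ceph quorum_status --format json-pretty

# remove the 4th monitor so that an odd number (3) remains
pveceph mon destroy <nodename>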
 
