[SOLVED] 3-node Ceph: missing 11TB space, no failover anymore

FelixJ

Hi Everyone,
I have been running a 3-node Proxmox+Ceph cluster in my home lab for 2 years now, serving as RBD storage for virtual machines.
When I installed it, I did some testing to ensure that if one node failed, the remaining 2 nodes would keep the system up while the 3rd node was being replaced.
Recently I had to reboot a node on that cluster and realized that the redundancy was gone.

Each of the 3 nodes has 4x4TB OSDs, which makes 16TB per node or 48TB in total.
As mentioned, I use Proxmox, so I used its interface to set up the OSDs and pools.
I have 2 pools: one for my virtual machines, one for CephFS.
Each pool has size/min_size 3/2, 256 PGs and the autoscaler on.
And now here's what I don't understand: for whatever reason, my cluster seems to be over-provisioned:

As the command outputs below show, ceph-iso_data consumes 19TB according to ceph df; however, the mounted ceph-iso filesystem is only 9.2TB big.
Same goes for my ceph-vm storage, which Ceph believes is 8.3TB but in reality is only 6.3TB (according to the Proxmox GUI).

The problem now is obvious: out of my 48TB of raw data I should not be using more than 16TB, or I can't afford to lose a node.
Now Ceph tells me that in total I am using 27TB, but compared to the mounted volumes/storages I am not using more than 16TB.
So, where have the 11TB (27-16) gone?

What am I not understanding?

Thank you for any hint on that.
regards,
Felix


Code:
ceph df
--- RAW STORAGE ---
CLASS  SIZE    AVAIL   USED    RAW USED  %RAW USED
hdd    44 TiB  17 TiB  27 TiB    27 TiB      61.70
TOTAL  44 TiB  17 TiB  27 TiB    27 TiB      61.70
 
--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1    1      0 B        0      0 B      0    3.0 TiB
ceph-vm                 2  256  2.7 TiB  804.41k  8.3 TiB  47.76    3.0 TiB
ceph-iso_data           3  256  6.1 TiB    3.11M   19 TiB  67.23    3.0 TiB
ceph-iso_metadata       4   32  3.1 GiB  132.51k  9.3 GiB   0.10    3.0 TiB

rados df
POOL_NAME                 USED  OBJECTS  CLONES   COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED       RD_OPS      RD       WR_OPS       WR  USED COMPR  UNDER COMPR
ceph-iso_data           19 TiB  3105013       0  9315039                   0        0         0        75202  97 GiB        28776  9.2 MiB         0 B          0 B
ceph-iso_metadata      9.3 GiB   132515       0   397545                   0        0         0  15856613330  13 TiB  28336539064   93 TiB         0 B          0 B
ceph-vm                8.3 TiB   804409       0  2413227                   0        0         0     94160784  40 TiB     62581002  4.4 TiB         0 B          0 B
device_health_metrics      0 B        0       0        0                   0        0         0            0     0 B            0      0 B         0 B          0 B
total_objects    4041937
total_used       27 TiB
total_avail      17 TiB
total_space      44 TiB

df -h
Size    Used  Avail  Use%   Mounted on
9,2T    6,2T  3,1T   67%    /mnt/pve/ceph-iso
 
With Ceph and 3/2 you should not use more than 66% of the 16TB you have, because when one node dies, Ceph needs to reshuffle data to the remaining two nodes.
 
Can you please post the output of ceph osd df tree inside [code][/code] tags? Please also put the output in your first post in code tags, as it is otherwise pretty much impossible to read table-formatted CLI output once the spacing is gone.
 
Code:
ceph osd df tree
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP      META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME           
-1         43.66425         -   44 TiB   27 TiB   27 TiB   9.5 GiB   62 GiB   17 TiB  61.92  1.00    -          root default         
-3         14.55475         -   15 TiB  9.0 TiB  9.0 TiB   3.3 GiB   20 GiB  5.5 TiB  61.92  1.00    -              host virtstore-01
 0    hdd   3.63869   1.00000  3.6 TiB  2.6 TiB  2.6 TiB   798 MiB  5.6 GiB  1.1 TiB  70.55  1.14  150      up          osd.0       
 1    hdd   3.63869   1.00000  3.6 TiB  2.0 TiB  1.9 TiB   1.2 GiB  4.7 GiB  1.7 TiB  53.61  0.87  124      up          osd.1       
 6    hdd   3.63869   1.00000  3.6 TiB  2.4 TiB  2.4 TiB   583 MiB  5.2 GiB  1.2 TiB  66.63  1.08  146      up          osd.6       
 7    hdd   3.63869   1.00000  3.6 TiB  2.1 TiB  2.1 TiB   689 MiB  4.8 GiB  1.6 TiB  56.89  0.92  125      up          osd.7       
-5         14.55475         -   15 TiB  9.0 TiB  9.0 TiB   3.5 GiB   20 GiB  5.5 TiB  61.92  1.00    -              host virtstore-02
 2    hdd   3.63869   1.00000  3.6 TiB  2.5 TiB  2.5 TiB   979 MiB  6.1 GiB  1.1 TiB  69.54  1.12  150      up          osd.2       
 3    hdd   3.63869   1.00000  3.6 TiB  2.3 TiB  2.3 TiB  1019 MiB  5.1 GiB  1.3 TiB  63.56  1.03  144      up          osd.3       
 8    hdd   3.63869   1.00000  3.6 TiB  2.3 TiB  2.3 TiB   885 MiB  4.9 GiB  1.3 TiB  62.99  1.02  133      up          osd.8       
 9    hdd   3.63869   1.00000  3.6 TiB  1.9 TiB  1.9 TiB   688 MiB  4.2 GiB  1.8 TiB  51.61  0.83  118      up          osd.9       
-7         14.55475         -   15 TiB  9.0 TiB  9.0 TiB   2.8 GiB   21 GiB  5.5 TiB  61.92  1.00    -              host virtstore-03
 4    hdd   3.63869   1.00000  3.6 TiB  2.1 TiB  2.1 TiB   914 MiB  5.1 GiB  1.5 TiB  57.70  0.93  134      up          osd.4       
 5    hdd   3.63869   1.00000  3.6 TiB  1.9 TiB  1.9 TiB   1.2 GiB  4.7 GiB  1.7 TiB  52.87  0.85  128      up          osd.5       
10    hdd   3.63869   1.00000  3.6 TiB  2.7 TiB  2.7 TiB   299 MiB  6.3 GiB  949 GiB  74.53  1.20  154      up          osd.10       
11    hdd   3.63869   1.00000  3.6 TiB  2.3 TiB  2.3 TiB   423 MiB  5.1 GiB  1.4 TiB  62.60  1.01  129      up          osd.11       
                        TOTAL   44 TiB   27 TiB   27 TiB   9.5 GiB   62 GiB   17 TiB  61.92
 
With Ceph and 3/2 you should not use more than 66% of the 16TB you have, because when one node dies, Ceph needs to reshuffle data to the remaining two nodes.
Thank you for pointing that out; however, that is not my question... My question is where those 11TB are that differ between pool usage and real size, especially concerning my CephFS "ceph-iso".
I might have found a reason, however I am not sure:
I use the CephFS mainly as storage for Proxmox Backup Server .chunks.
I stumbled over the parameter bluestore_min_alloc_size_hdd, which apparently has a default value of 64K.
Now, my .chunks store consists mainly of files smaller than 64K, which means the rest of each block would be padded with zeros (right?), so might that be where my storage is disappearing? And if so, how can I tune my OSDs to use 4K rather than 64K?
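For reference, this is how I would check the current value on one of my OSDs, and what I assume would set 4K for newly created OSDs (as far as I understand, the value is baked into an OSD at creation time, so existing OSDs would have to be destroyed and re-created for a change to take effect):

Code:
# current value of a running OSD (run on the node that hosts osd.0)
ceph daemon osd.0 config get bluestore_min_alloc_size_hdd

# would only affect OSDs created after this point (my assumption)
ceph config set osd bluestore_min_alloc_size_hdd 4096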
Felix
 
Recently I had to reboot a node on that cluster and realized that the redundancy was gone.
What do you mean by that? Did the Ceph cluster keep working or was IO blocked?

If that is the case, please post the output of pveceph pool ls --noborder. Either save it in a file or make sure your terminal window is wide enough. The last columns are %-used and used.

Just to make sure, you do have a MON, MGR and MDS configured on each node?

If the cluster kept working, then I think there is a misunderstanding in how Ceph works.

By default, the size/min_size values are 3/2. They should never be lowered and in some special situations it can make sense to increase them.

What do they mean? "size" defines how many copies / replicas of each data object should be there in the cluster. The "min_size" defines how many need to be present for the pool to keep working. Once you have fewer than that, IO will be blocked until Ceph can get more replicas.

By default, the failure domain for the rule that places the objects is on the host level. This means that each copy of an object needs to be placed on a different host in the cluster. Therefore, with a 3-node cluster and a size=3, each node contains one replica. If a node is down, you will be down to 2 replicas. Ceph will throw quite some warnings, but the cluster should keep working.
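You can double-check those settings for your pools and the rule they use, for example (using your ceph-vm pool and the default replicated_rule):

Code:
ceph osd pool get ceph-vm size
ceph osd pool get ceph-vm min_size
ceph osd pool get ceph-vm crush_rule
ceph osd crush rule dump replicated_rule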

TL;DR: in a 3-node cluster with size=3, the loss of one node should not cause the cluster to stop working.

The one thing to keep an eye on in a 3-node cluster is how many OSDs are in a node and how full they are. Because if only the disk of an OSD fails, we still have enough nodes to satisfy the failure domain, and Ceph will try to get back up to 3 replicas for the objects that were on that lost disk. That means that the remaining OSDs in that node will get fuller, and it can be quite easy for them to get too full.
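A rough example with your numbers: your OSDs are ~3.6 TiB each and on average ~62% full (~2.3 TiB). If one of the 4 OSDs in a node fails, its ~2.3 TiB has to be re-created on the remaining 3 OSDs of that node, roughly 0.75 TiB each, which would push them to about 83%, already very close to the default nearfull warning threshold of 85%.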


One thing that I notice in the ceph osd df tree output is that some OSDs are quite a bit fuller than others. ~75% and ~52% are the highest and lowest value. That will have an impact on the estimated available space as the fullest OSD will limit that.
Which version of Ceph are you running? If the CRUSH algorithm does not distribute the data well, we have the balancer that will actively try to balance it better. In newer versions (since Pacific, 16) it is enabled by default. If you run something older, you will have to enable it first: https://docs.ceph.com/en/octopus/rados/operations/balancer/
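Roughly, following the linked documentation:

Code:
ceph balancer status
# upmap mode requires that all clients are at least Luminous
ceph balancer mode upmap
ceph balancer on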
 
What do you mean by that? Did the Ceph cluster keep working or was IO blocked?
No, not at all. Planned hardware maintenance.
If that is the case, please post the output of pveceph pool ls --noborder. Either save it in a file or make sure your terminal window is wide enough. The last columns are %-used and used.

Just to make sure, you do have a MON, MGR and MDS configured on each node?
Yes, sure. Every host runs one mon, mgr and mds.
If the cluster kept working, then I think there is a misunderstanding in how Ceph works.
That is part of my question, to reassure myself that I have understood how it should work.
By default, the size/min_size values are 3/2. They should never be lowered and in some special situations it can make sense to increase them.

What do they mean? "size" defines how many copies / replicas of each data object should be there in the cluster. The "min_size" defines how many need to be present for the pool to keep working. Once you have fewer than that, IO will be blocked until Ceph can get more replicas.

By default, the failure domain for the rule that places the objects is on the host level. This means that each copy of an object needs to be placed on a different host in the cluster. Therefore, with a 3-node cluster and a size=3, each node contains one replica. If a node is down, you will be down to 2 replicas. Ceph will throw quite some warnings, but the cluster should keep working.

TL;DR: in a 3-node cluster with size=3, the loss of one node should not cause the cluster to stop working.

The one thing to keep an eye on in a 3-node cluster is how many OSDs are in a node and how full they are. Because if only the disk of an OSD fails, we still have enough nodes to satisfy the failure domain, and Ceph will try to get back up to 3 replicas for the objects that were on that lost disk. That means that the remaining OSDs in that node will get fuller, and it can be quite easy for them to get too full.
It seems I have basically understood how it should work; thank you for that summary. That was very helpful.
One thing that I notice in the ceph osd df tree output is that some OSDs are quite a bit fuller than others. ~75% and ~52% are the highest and lowest value. That will have an impact on the estimated available space as the fullest OSD will limit that.
Which version of Ceph are you running? If the CRUSH algorithm does not distribute the data well, we have the balancer that will actively try to balance it better. In newer versions (since Pacific, 16) it is enabled by default. If you run something older, you will have to enable it first: https://docs.ceph.com/en/octopus/rados/operations/balancer/

I have to update to PVE 7 anyway, but I'd prefer to do that knowing that the cluster will stay alive during node restarts... so after the update I will be on Pacific anyway.

I checked how the copies are distributed using ceph pg ls, and of course I could not check all 454 of them, but I think they are well distributed across the nodes, so that might not be my issue...
As already answered to ness1602, it appears to me that my problem might have to do with bluestore_min_alloc_size_hdd. I have found almost 55000 files in my PBS .chunks store that are smaller than 64K. Could that be the reason for the divergence between effectively usable space and the Ceph pool size as displayed in the Proxmox GUI?
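For reference, this is roughly how I counted them (the path below is just where my PBS datastore lives on the CephFS mount, adjust <datastore> accordingly):

Code:
find /mnt/pve/ceph-iso/<datastore>/.chunks -type f -size -64k | wc -l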

regards,
Felix
 
No, not at all. Planned hardware maintenance.
but I'd prefer to do that knowing that the cluster will stay alive during node restarts
And that is where I am a bit confused. You did shut down one of the 3 nodes and then what happened exactly?

The Ceph cluster should have worked fine throughout it, meaning that guests with disk images on it or services accessing the CephFS should work fine.

The only thing that I could imagine is if the active MDS was on the node that got shut down. In that case, it could take a few moments for the new active MDS to catch up. This can cause the CephFS to be unresponsive for a bit.
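You can see which MDS is currently active and which one is on standby with:

Code:
ceph fs status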

The OSDs are not full enough to cause issues yet, and the difference in used space should also not have an effect if the cluster works.

Since you have a 3-node cluster, each node stores one of the 3 copies. If you shut down one node, you only have 2 copies which will result in some warnings from Ceph, but again, that should not stop the cluster from working.

That is why I want to clear that up. If the cluster did stop working, then there must be something else that is not optimal.
 
And that is where I am a bit confused. You did shut down one of the 3 nodes and then what happened exactly?
The circular graphic that is usually nice and green and represents the current state of the OSDs became completely red, and after a while the VMs were unresponsive and the CephFS went offline in the GUI.
In words: all PGs were active+undersized+degraded.
I would have expected it to be 2/3 green and 1/3 yellow, indicating that 1/3 of my OSDs is degraded, together with a bunch of warnings: active+degraded.
The Ceph cluster should have worked fine throughout it, meaning that guests with disk images on it or services accessing the CephFS should work fine.

The only thing that I could imagine is if the active MDS was on the node that got shut down. In that case, it could take a few moments for the new active MDS to catch up. This can cause the CephFS to be unresponsive for a bit.

The OSDs are not full enough to cause issues yet, and the difference in used space should also not have an effect if the cluster works.

Since you have a 3-node cluster, each node stores one of the 3 copies. If you shut down one node, you only have 2 copies which will result in some warnings from Ceph, but again, that should not stop the cluster from working.

That is why I want to clear that up. If the cluster did stop working, then there must be something else that is not optimal.
 
all PGs were active+undersized+degraded
I would have expected it to be 2/3 green and 1/3 yellow, indicating that 1/3 of my OSDs is degraded, together with a bunch of warnings: active+degraded
That is normal, because all PGs will have one of their replicas on the node that is down.
What is not normal, is that the guests became unresponsive and the CephFS down.

I just tested that scenario in one of my test clusters, 3 nodes, killing one results in the following output while the VM keeps working.
Code:
# ceph -s
  cluster:
    id:     e78d9b15-d5a1-4660-a4a5-d2c1208119e9
    health: HEALTH_WARN
            1/3 mons down, quorum cephtestupgrade1,cephtestupgrade2
            2 osds down
            1 host (2 osds) down
            Degraded data redundancy: 582/1746 objects degraded (33.333%), 232 pgs degraded, 281 pgs undersized
 
  services:
    mon: 3 daemons, quorum cephtestupgrade1,cephtestupgrade2 (age 3m), out of quorum: cephtestupgrade3
    mgr: cephtestupgrade2(active, since 3m), standbys: cephtestupgrade1
    mds: cephfs:1 {0=cephtestupgrade1=up:active} 1 up:standby
    osd: 7 osds: 5 up (since 3m), 7 in (since 8d)
 
  data:
    pools:   4 pools, 281 pgs
    objects: 582 objects, 2.0 GiB
    usage:   9.9 GiB used, 101 GiB / 111 GiB avail
    pgs:     582/1746 objects degraded (33.333%)
             232 active+undersized+degraded
             49  active+undersized
 
  io:
    client:   19 MiB/s rd, 26 MiB/s wr, 627 op/s rd, 191 op/s wr

Can you please post the output of ceph -s and also of pveceph pool ls --noborder. Either save it in a file or make sure your terminal window is wide enough. The last columns are %-used and used.

Somewhere, something seems to be not configured optimally.
 
New behavior: the VMs are still up. That is truly new. The difference is that 2 days ago I increased the number of PGs and set a pool target size, which btw seems to be a problem again, as there is a complaint about it:

Code:
ceph -s
  cluster:
    id:     38fd1dee-7958-4203-bd1a-3c5b82e1e9ab
    health: HEALTH_WARN
            1/3 mons down, quorum virtstore-01,virtstore-02
            Degraded data redundancy: 4054649/12163947 objects degraded (33.333%), 544 pgs degraded, 545 pgs undersized
            1 subtrees have overcommitted pool target_size_bytes
            26831 slow ops, oldest one blocked for 1871 sec, daemons [osd.10,osd.4,osd.5,mon.virtstore-03] have slow ops.
 
  services:
    mon: 3 daemons, quorum virtstore-01,virtstore-02 (age 31m), out of quorum: virtstore-03
    mgr: virtstore-02(active, since 2d), standbys: virtstore-01
    mds: ceph-iso:1 {0=virtstore-02=up:active} 1 up:standby
    osd: 12 osds: 8 up (since 30m), 8 in (since 20m)
 
  data:
    pools:   4 pools, 545 pgs
    objects: 4.05M objects, 9.2 TiB
    usage:   18 TiB used, 11 TiB / 29 TiB avail
    pgs:     4054649/12163947 objects degraded (33.333%)
             544 active+undersized+degraded
             1   active+undersized
 
  io:
    client:   17 KiB/s wr, 0 op/s rd, 2 op/s wr

This time the CephFS also remained online:

Code:
ceph fs status
ceph-iso - 2 clients
========
RANK  STATE       MDS          ACTIVITY     DNS    INOS
 0    active  virtstore-02  Reqs:    0 /s  1480k  1475k
       POOL          TYPE     USED  AVAIL
ceph-iso_metadata  metadata  9385M  5478G
  ceph-iso_data      data    18.6T  5478G
STANDBY MDS 
virtstore-01
MDS version: ceph version 15.2.13 (1f5c7871ec0e36ade641773b9b05b6211c308b9d) octopus (stable)

Can it be that there is a connection between too few PGs and the fact that the pools went inactive?

Code:
pveceph pool ls --noborder
Name                  Size Min Size PG Num min. PG Num Optimal PG Num PG Autoscale Mode PG Autoscale Target Size PG Autoscale Target Ratio Crush Rule Name              %-Used Used
ceph-iso_data            3        2    256                        256 on                           8796093022208                           replicated_rule   0.635566234588623 20519628967697
ceph-iso_metadata        3        2     32          16             16 on                                                                   replicated_rule 0.00083575613098219 9841697635
ceph-vm                  3        2    256                        128 on                           4398046511104                           replicated_rule   0.436302989721298 9106884800874
device_health_metrics    3        2      1           1              1 on                                                                   replicated_rule                   0 0

regards,
Felix
 
Okay, so far things look okay. And this time it seems like everything keeps working, even with one node down. As it should be.

Now, regarding the size thing. First off, the target_size and target_ratio are only there to give the autoscaler an idea of how much space a pool is likely going to use. With that information it can better calculate how many placement groups each pool should get.

I would recommend either setting ratios, or, if you know that a pool will only consume a certain size, setting it to that and then setting the remaining pool to a ratio. The available space in the cluster will always fluctuate a bit, as it is an estimate.
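For example, with your pools (just a sketch, pick values that match how you actually plan to use the space):

Code:
# clear the absolute target sizes (the source of the "overcommitted" warning) ...
ceph osd pool set ceph-vm target_size_bytes 0
ceph osd pool set ceph-iso_data target_size_bytes 0

# ... and give the autoscaler relative ratios instead
ceph osd pool set ceph-vm target_size_ratio 0.3
ceph osd pool set ceph-iso_data target_size_ratio 0.7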

Can it be that there is a connection between too few PGs and the fact that the pools went inactive?
No. If the cluster was healthy before you shut down the node, it should have behaved like now.

Regarding your question from the first post:
The problem now is obvious: out of my 48TB of raw data I should not be using more than 16TB, or I can't afford to lose a node.
Now Ceph tells me that in total I am using 27TB, but compared to the mounted volumes/storages I am not using more than 16TB.
So, where have the 11TB (27-16) gone?
You need to differentiate between the raw storage capacity of the cluster, which is the sum of what all nodes can store, and the net space actually stored. If you have a pool with size=3, then all the data in the pool is stored 3 times (1x per node), and therefore the raw used space is roughly 3 times the used space you see in the Proxmox VE GUI or in the STORED column of the pools section of ceph df.
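With the numbers from your ceph df output (all your pools have size=3), that adds up roughly as follows; the small difference is allocation overhead:

Code:
ceph-vm:        2.7 TiB stored x 3 ~  8.3 TiB raw used
ceph-iso_data:  6.1 TiB stored x 3 ~   19 TiB raw used
total:         ~8.8 TiB stored x 3 ~   27 TiB raw used

So the "missing" 11TB is simply the second and third copy of your data. The ~6.2TB used that df -h shows for the mounted CephFS corresponds to the STORED column, not to USED.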

You could have pools with a different size, and that would result in a different ratio of stored to raw used data.

You should also never fill up your Ceph cluster! Once an OSD reaches high usage levels, you will see warnings like "nearfull". Also consider that you could lose one OSD in a node. In such a case, the remaining OSDs should still have enough space available to store the data that was on the failed one and still have enough free space to not run into any issues.
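The relevant thresholds can be checked like this (the values in the comment are the Ceph defaults):

Code:
ceph osd dump | grep ratio
# full_ratio 0.95, backfillfull_ratio 0.9, nearfull_ratio 0.85 by default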
 
Dear Aaron,
thank you very much for taking the time to elaborate on all that. I'm glad my cluster is healthy now and will survive with a node down for maintenance, at least for a while!
I think that case is solved!
regards,
Felix
 
You're welcome. I went ahead and marked the thread as solved. You can do so yourself if you edit the first post and select the prefix from the drop-down next to the title :)
 
