Bluestore SSD wear

Hi,

I have a three-node cluster; each node has 35 OSDs plus one SSD as the BlueStore DB device.
Replica 2, failure domain host.


The cluster is used for cold storage, and I'm starting to dump data on it.
I know that my bottleneck is the single SSD on each node: all the incoming traffic will hit it.
Now, my question is: what happens if that SSD dies? Do I need to replace it and redeploy all the OSDs on that node?

How can I mitigate this ?

Thank you.
 
Yes, all OSDs that use it for their WAL and DB will have to be recreated.
Usually the recommendation is 4-6 OSDs per WAL/DB disk.

What kind of SSD is it that you're using?
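"N OSDs per WAL/DB disk" simply means that several OSDs are created with their RocksDB/WAL placed on the same SSD, each getting its own slice of it. A minimal sketch of what that looks like on Proxmox VE; the device names and DB size below are placeholders, not taken from this thread:

Code:
# four data disks sharing one SSD/NVMe for their RocksDB/WAL
pveceph osd create /dev/sda --db_dev /dev/nvme0n1 --db_size 100
pveceph osd create /dev/sdb --db_dev /dev/nvme0n1 --db_size 100
pveceph osd create /dev/sdc --db_dev /dev/nvme0n1 --db_size 100
pveceph osd create /dev/sdd --db_dev /dev/nvme0n1 --db_size 100

pveceph carves out a separate LV on the DB device for each OSD, so the SSD needs enough capacity for all of them.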
 
I don't fully understand the concept of "4-6 OSDs per WAL/DB disk".
You mean that a single SSD should only be partitioned/assigned to 4-6 OSDs?
For this setup, I put 35 OSDs per WAL/DB SSD :)
For other clusters, I put 6 OSDs per NVMe.

I use two Intel Datacenter 960 GB SSDs that I had lying around and a Samsung EVO 1 TB (I think it's consumer grade); after a week it has 3% wearout.
My use case is to dump the data on the cluster and then shut it down, to be powered up again if there's ever a need for the data.

After I have all my data in place, can I set the cluster read-only (noout/norebalance/etc.) and then "image" the BlueStore disks?
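For the flags, something along these lines should work; note that Ceph has no single "read-only" flag, the pause flag blocks both reads and writes. This is only a sketch of the flag handling, whether imaging the BlueStore disks afterwards is sensible is a separate question:

Code:
ceph osd set noout        # don't mark OSDs out while the cluster is down
ceph osd set norebalance  # don't start moving data around
ceph osd set pause        # block all client I/O (reads and writes)
# ... shut down / image ...
ceph osd unset pause
ceph osd unset norebalance
ceph osd unset noout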
 
DO NOT use consumer SSDs/NVMes for Ceph!

Especially not for 35 OSDs on a single one.
I'd suggest moving those OSDs away and splitting them up to multiple WAL/DB disks.

How many OSDs do you have in your whole cluster?
How many nodes with 35 OSDs on a single WAL/DB disk?
 
Right now I'm rebalancing the cluster. I've marked out and stopped all OSDs on one node. The problem is that even with a 10G link between the nodes, recovery speed is slow (110 MiB/s), with osd_max_backfills 8 and osd_recovery_max_active 16 (set roughly as in the sketch after this post).

After the cluster is healthy, I will move on to the next server and do it all over again.

Now, my issue is that the object store bucket is not responding.
On replica 2, with quorum but some rebalancing traffic, why is that bucket not responding?
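The two recovery settings mentioned above are usually changed at runtime roughly like this; the values are just the ones quoted above, and on this setup the single DB SSD per node (which the first post already calls the bottleneck) may well be the real limit rather than these knobs:

Code:
# persistent, via the config database (Nautilus and later)
ceph config set osd osd_max_backfills 8
ceph config set osd osd_recovery_max_active 16
# or injected into the running OSDs only
ceph tell 'osd.*' injectargs '--osd-max-backfills 8 --osd-recovery-max-active 16'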
 
Please provide the output of ceph -s.

Do you have enough OSDs and free space on your other nodes?

Usually it would be best to destroy one OSD, wait for everything to rebalance/backfill and recreate it using a different DB/WAL disk. Then wait for it to rebalance and backfill again.
This is to make sure you still have all data available.

If you destroy too many OSDs at once, depending on the setup and distribution, you can lose data permanently.
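As a sketch, that per-OSD cycle could look like this on Proxmox VE; the OSD ID and device paths below are placeholders:

Code:
ceph osd out 0                     # let the data drain off this OSD
# wait until ceph -s shows no degraded/misplaced objects
systemctl stop ceph-osd@0
pveceph osd destroy 0 --cleanup
pveceph osd create /dev/sda --db_dev /dev/nvme1n1   # recreate against the new DB/WAL disk
# wait for backfill to finish before moving on to the next OSD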
 

This is a new cluster that I will use for storing cold data. All the data on this cluster I also have nearby on a different Ceph cluster, which is performing as expected (with some question marks regarding the R/W ratios).

Code:
root@pve--1:~# ceph -s
  cluster:
    id:     a37b3b74-098f-4afb-a2fc-65471ebf4f28
    health: HEALTH_WARN
            unable to send alert email
            Reduced data availability: 84 pgs inactive
            Degraded data redundancy: 10845713/34267948 objects degraded (31.650%), 84 pgs degraded, 84 pgs undersized
            5 daemons have recently crashed

  services:
    mon: 3 daemons, quorum pve--1,pve--2,pve--3 (age 2d)
    mgr: pve--1(active, since 46h)
    osd: 26 osds: 26 up (since 102m), 26 in (since 110m); 84 remapped pgs
    rgw: 2 daemons active (2 hosts, 2 zones)

  data:
    pools:   14 pools, 513 pgs
    objects: 17.13M objects, 8.6 TiB
    usage:   13 TiB used, 413 TiB / 426 TiB avail
    pgs:     16.374% pgs not active
             10845713/34267948 objects degraded (31.650%)
             2135347/34267948 objects misplaced (6.231%)
             427 active+clean
             84  undersized+degraded+remapped+backfilling+peered
             2   active+clean+scrubbing+deep

  io:
    recovery: 83 MiB/s, 164 objects/s

  progress:
    Global Recovery Event (102m)
      [=======================.....] (remaining: 19m)

Considering that the recovery estimate is stated as "62.23% (108.10 MiB/s - 5d 13h left)", I will remove all OSDs and add them again.

ceph -s shows 20 minutes, the PVE dashboard shows six days. Which one is accurate?
 
In that case it should be a lot faster than doing it one by one.

The recovery speed changes all the time, so I'd say neither is right. They both use the same information in the end.
 
I'm doing it from scratch.

Nevertheless, my question remains: on a three-node cluster with replica 2, if one node goes down (let's say unrecoverably), why am I unable to access the buckets on the object storage?
 
No, it should work.

In your case the issue seems to be the following:
Reduced data availability: 84 pgs inactive
That means some data is simply not available and, as a result, the requirement of min_size 2 is not satisfied. And if there are objects without the min_size number of replicas available, I/O will be completely blocked.
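For reference, size/min_size can be checked and changed per pool like this; the pool name is only an example, and lowering min_size to 1 means accepting writes with a single remaining copy:

Code:
ceph osd pool ls detail                                # shows size and min_size for every pool
ceph osd pool get default.rgw.buckets.data min_size    # example pool name
ceph osd pool set default.rgw.buckets.data min_size 1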
 
Yes, you are right, the replica was 2/2 (size/min_size). So if I set it to 2/1, I should still have access to the data.
 
No, it won't help in this case since you seem to have removed some OSDs.
Those 84 PGs do not exist on any of the 26 OSDs you have up and in.

If those other OSDs are already destroyed, there's no way to get those PGs back.
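For what it's worth, the affected PGs and their state can be listed like this (the PG ID in the last line is just an example):

Code:
ceph health detail            # names the inactive/undersized PGs
ceph pg dump_stuck inactive   # compact list of stuck PGs
ceph pg 7.1f query            # detailed state of a single PG, e.g. which OSDs it is waiting for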
 
