SSD Journal Failure

infinityM · May 17, 2020

Hey Guys,

Ok so i've been reading up a bit. And according to most guides everyone recommends using a seperate SSD drive for journals...
I have 1 concern though. From what I read, if the SSD goes down, the entire OSD cluster on that server also goes down...

So couple of quick questions...

1. Since Ceph stores data to 2 pg then replicates, I assume any pending data is immediately lost when journal fails?
2. When the journal fails, What does one need to doto bring it back online?
3. Can proxmox handle the journal? Meaning we regularly add drives. Will we need to manually add each journal partition or can it be automated to a sense to an SSD? (meaning we define the SSD, and the system handles the partitioning as and when OSD's are added/replaced)
4. When an OSD is replaced, I assume there's a process journal wise to replacing it. What is this process?
5. Is it worth the trouble?

My main concern would be if we add the SSD, 1 SSD failure could bring an entire server (which has 15 drives) down. I currently have 6 servers with 6-8 drives each, which I am planning to convert to 3 servers with 15 6TB SAS drives each.

I would love if someone can elaborate a bit for me. I understand the basic's but I am a bit new to ceph still and don't want to have data loss again (had a rough start already

)

Thanks Guys

LnxBil · May 17, 2020

infinityM said:
1. Since Ceph stores data to 2 pg then replicates, I assume any pending data is immediately lost when journal fails?

If it's not mirrored, yes. Therefore mirror!

infinityM said:
My main concern would be if we add the SSD, 1 SSD failure could bring an entire server (which has 15 drives) down. I currently have 6 servers with 6-8 drives each, which I am planning to convert to 3 servers with 15 6TB SAS drives each.

The rule of thumb is/was 4 Drives per OSD, you you should have much more SSDs in there.

infinityM · May 17, 2020

LnxBil said:
The rule of thumb is/was 4 Drives per OSD, you you should have much more SSDs in there.

Hey LnxBil,

I assume you mean 4 OSD's per SSD? If so, why only 4? Cause that then means that for 15 drives, mirrored I'd need 8 SSD's?

Seems highly inefficient?
Maybe rather using 4 SSD's in SW Raid 4 might be better for all 15 drives? Or is that a no-go?
Again, I am very new to the concept so please advise if I am miss-understanding the concept.

LnxBil · May 18, 2020

infinityM said:
I assume you mean 4 OSD's per SSD? If so, why only 4?

Yes, sorry, I meant that 4-5 journals should be on one SSD. That is the number I read everywhere but it hugely depends on the performance of your SSD. Best information is in Sebastiens Blog. If you read on the CEPH mailing list, there are examples of higher number of journals per SSD, but they do not perform well. I guess you can start with your setup, but it will not be optimal. Yet I have to say, it is very hard to find the 4 and 5 journals per ssd. Maybe my information is just old?

Alwin · May 18, 2020

infinityM said:
Ok so i've been reading up a bit. And according to most guides everyone recommends using a seperate SSD drive for journals...

A journal was used with the filestore backend. Since the Bluestore backend was introduced the journal was replaced by DB/WAL. Those are from the RocksDB and hold the metadata of the objects on the OSD.

LnxBil said:
If it's not mirrored, yes. Therefore mirror!

Please don't. This defeats the purpose of having a faster media for the RocksDB.

infinityM said:
My main concern would be if we add the SSD, 1 SSD failure could bring an entire server (which has 15 drives) down. I currently have 6 servers with 6-8 drives each, which I am planning to convert to 3 servers with 15 6TB SAS drives each.

You will not get around using a multiple of SSDs to speed up small writes/reads. Depending on the IO/s that a SSD can sustain, more or less OSD can be connected to it. Yes, when the SSD fails, all OSDs connected to that SSD will be dead. But this is to be expected and a node failure will result in the same. But one of the strong points of Ceph is that the data is replicated across nodes and only after it has done it successfully, it sends the acknowledgement to the client. With a replica of 3 the cluster still has two copies left.

See our Ceph precondition section in our docs, our benchmark paper and the corresponding forum thread for more information.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition
https://proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

infinityM · May 19, 2020

Alwin said:
A journal was used with the filestore backend. Since the Bluestore backend was introduced the journal was replaced by DB/WAL. Those are from the RocksDB and hold the metadata of the objects on the OSD.

Thank you very much for the info!
1 question though, Using the proxmox interface, I see when adding an OSD, I can define the DB & WAL... How does this logic then now work?

Can I define any number of OSD's per SSD using the interface? Should I pre-partition it or does it want to handle the partitioning? or how does it work?
What is the difference between DB & WAL?

Alwin · May 20, 2020

infinityM said:
Can I define any number of OSD's per SSD using the interface? Should I pre-partition it or does it want to handle the partitioning? or how does it work?
What is the difference between DB & WAL?

Have you already consulted our docs on this? I think it will answer your questions.

https://pve.proxmox.com/pve-docs/chapter-pveceph.html#pve_ceph_osds

elmo · May 20, 2020

Alwin said:
A journal was used with the filestore backend. Since the Bluestore backend was introduced the journal was replaced by DB/WAL. Those are from the RocksDB and hold the metadata of the objects on the OSD.

Please don't. This defeats the purpose of having a faster media for the RocksDB.

You will not get around using a multiple of SSDs to speed up small writes/reads. Depending on the IO/s that a SSD can sustain, more or less OSD can be connected to it. Yes, when the SSD fails, all OSDs connected to that SSD will be dead. But this is to be expected and a node failure will result in the same. But one of the strong points of Ceph is that the data is replicated across nodes and only after it has done it successfully, it sends the acknowledgement to the client. With a replica of 3 the cluster still has two copies left.

See our Ceph precondition section in our docs, our benchmark paper and the corresponding forum thread for more information.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition
https://proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

I apologize beforehand in case I got this wrong, but what you are saying does not really make sense to me. Yes, I understand the general concept and the technical limitation, but the whole reason for setting up a multi-node cluster using Ceph is so that you have clustered capabilities should
something go wrong. But your entire architectural design is such that all components in your system can take a failure should something bad happen.
And here, since SSD cache tiering has been deprecated from what I understand, you are more or less forced to implement a single point of failure, unless you are prepared to run a system with spinning disks only. Mirroring SDD's doesn't seem bad to me. Have I understood this correctly? If so, what are the options, except from running an all-flash system?

Alwin · May 20, 2020

Well, from one perspective, for example 16x OSDs that are placed on a mirrored SSD. So, all 16x OSD will effectively use the IOps and bandwidth of a single SSD. Instead of two (2x 8x OSDs). This will introduce the SSD as bottleneck. Now, one could argue that using 8x SSDs with 4x OSDs per mirror could bring back the performance benefit. But I doubt that there will be the budget or the physical space in the server to realize this option. Alongside the performance goes the wear level. Ceph OSDs do a lot of writes.

Then there is the part in, how do you create the SSD mirror, mdraid or physical RAID card? The latter is another add-in card that either won't fit or is pricey. Besides that Ceph doesn't work well with RAID controllers and has other issues. The option mdraid, will very likely kill the performance once a volume needs to be re-synced, since the whole RAID volume needs to be synced. Regardless of holding data or not. And for sure that doesn't help the wear level either.

The redundancy level of Ceph is in the numbers of nodes, OSDs, Ceph services.

More often then not, the redundancy for HA services on a Proxmox VE cluster is mixed with the redundancy of Ceph. While the hyper-converged approach makes it easy to run, it still needs to be considered while planning the cluster.

EDIT: corrected some obvious typos.

elmo · May 30, 2020

Alwin said:
Well, from one perspective, for example 16x OSDs that are placed on a mirrored SSD. So, all 16x OSD will effectively use the IOps and bandwidth of a single SSD. Instead of two (2x 8x OSDs). This will introduce the SSD as bottleneck. Now, one could argue that using 8x SSDs with 4x OSDs per mirror could bring back the performance benefit. But I doubt that there will be the budget or the physical space in the server to realize this option. Alongside the performance goes the wear level. Ceph OSDs do a lot of writes.

Then there is the part in, how do you create the SSD mirror, mdraid or physical RAID card? The latter is another add-in card that either won't fit or is pricey. Besides that Ceph doesn't work well with RAID controllers and has other issues. The option mdraid, will very likely kill the performance once a volume needs to be re-synced, since the whole RAID volume needs to be synced. Regardless of holding data or not. And for sure that doesn't help the wear level either.

The redundancy level of Ceph is in the numbers of nodes, OSDs, Ceph services.

More often then not, the redundancy for HA services on a Proxmox VE cluster is mixed with the redundancy of Ceph. While the hyper-converged approach makes it easy to run, it still needs to be considered while planning the cluster.

EDIT: corrected some obvious typos.

Thank you for your input Alwin. This is why a fast cache-pool could/should be leveraged. I'm sure there's a reason for this being deprecated but as such, it leaves few options. If you have a small static cluster you should be able to implement a caching solution that would give you a benefit as opposed to a potential issue. This is what I meant wen saying mirroring doesn't seem that bad. However, for a larger growing clusters, the approach with DB/WAL on a separate device becomes unmanageable and very expensive very quickly. Hence the need for careful planning as you mentioned.

infinityM · Jun 1, 2020

Alwin said:
Have you already consulted our docs on this? I think it will answer your questions.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#pve_ceph_osds

This was helpfull thank you

infinityM · Jun 1, 2020

Alwin said:
Well, from one perspective, for example 16x OSDs that are placed on a mirrored SSD. So, all 16x OSD will effectively use the IOps and bandwidth of a single SSD. Instead of two (2x 8x OSDs). This will introduce the SSD as bottleneck. Now, one could argue that using 8x SSDs with 4x OSDs per mirror could bring back the performance benefit. But I doubt that there will be the budget or the physical space in the server to realize this option. Alongside the performance goes the wear level. Ceph OSDs do a lot of writes.

Then there is the part in, how do you create the SSD mirror, mdraid or physical RAID card? The latter is another add-in card that either won't fit or is pricey. Besides that Ceph doesn't work well with RAID controllers and has other issues. The option mdraid, will very likely kill the performance once a volume needs to be re-synced, since the whole RAID volume needs to be synced. Regardless of holding data or not. And for sure that doesn't help the wear level either.

The redundancy level of Ceph is in the numbers of nodes, OSDs, Ceph services.

More often then not, the redundancy for HA services on a Proxmox VE cluster is mixed with the redundancy of Ceph. While the hyper-converged approach makes it easy to run, it still needs to be considered while planning the cluster.

EDIT: corrected some obvious typos.

This does not answer the questions though... It only raises more...

Maybe referencing it to an example may be better... If you don't mind.

Let's say I have 3 servers, each with 10 SAS drives 6TB each drive...
What would be the optimal way to increase the read/write speeds of the drives using SSD's, but in such a way that should an SSD fail, there's no data loss.

I hope the above method may bring some clarity in an easier to understand way to us all

Alwin · Jun 2, 2020

I can point you to the architecture guide of Ceph. It explains the working of Ceph.
https://docs.ceph.com/docs/nautilus/architecture/

Search

Search

SSD Journal Failure

infinityM

Well-Known Member

LnxBil

Distinguished Member

infinityM

Well-Known Member

LnxBil

Distinguished Member

Alwin

Proxmox Retired Staff

infinityM

Well-Known Member

Alwin

Proxmox Retired Staff

elmo

Member

Alwin

Proxmox Retired Staff

elmo

Member

infinityM

Well-Known Member

infinityM

Well-Known Member

Alwin

Proxmox Retired Staff