[SOLVED] Help with Ceph concepts and design

tipex (Member, UK)
Jan 7, 2023
Below is my current home lab setup with regard to disks. This is a pretty standard ZFS-style setup that you will all be familiar with. Note that it's still in the experimental stage, so I can tear things down if I want.

Node one:
  • OS = 2 x 500GB SSDs in ZFS raid 1 mirror
  • VM pool = 2 x 1TB SSDs in ZFS raid 1 mirror
  • TrueNAS disk pass-through via HBA in IT mode = 2 x 4TB SSDs in ZFS raid 1 mirror. The plan being that I can add more disks over time and convert to RAID-Z1 or RAID-Z2 (the ZFS equivalents of raid 5/6) if I need more storage space.

Node two is exactly the same.


My reasoning for two servers is that if I need to take one down for maintenance, I can migrate VMs over to the other and therefore keep my network running. Also, node one's TrueNAS is the file server while node two's TrueNAS is the backup server.

I liked the concept of high availability but it was not something I was interested in… until I tried it this weekend by adding a Raspberry Pi QDevice. Now I think it's ace and want to use it :cool:. Failing over small VMs is not a problem, but my TrueNAS setup with the disks being passed through via an HBA is a different story :oops:. I've always known the ultimate setup is having a separate storage area network, but for a home lab it results in too many servers $$$.

At the weekend I discovered that you can run Ceph on the Proxmox nodes. This changes everything!!! If I run Ceph I could then fail over TrueNAS. I might actually just use plain old Samba in a container rather than TrueNAS, but you get the idea: I need to be able to move a VM with TBs of data.

I’ve been reading up on Ceph but there are a few things I’m not sure about:
  • It seems you need a minimum of 3 nodes. Sounds a lot like Proxmox clustering. The question is: do all 3 nodes need to be full-blown servers with all the disks? With Proxmox clustering I can get away with a simple QDevice. I can easily get a 3rd device for Ceph and also use it as the 3rd Proxmox node, and therefore free up my Raspberry Pi for other uses. My main concern is disks. If I need to buy more 4TB SSDs for the file server, that's a lot of money for a home lab. Does Ceph need to store the data across all 3 nodes? In my mind I'm thinking: why can't two nodes have a full copy of the data while the 3rd node is there just for voting? This might not be how Ceph works though, which is why I've written this post. I wouldn't plan on having any VMs running on the 3rd node.

  • Do I have too many disks with my current setup? For example, having two 1TB drives for the VM pool in a ZFS raid 1 mirror makes sense when using ZFS, but do I still need the two disks if using Ceph? If one disk failed, another node would have the data, so there'd be no need for the 2nd disk?

  • If the data is present on all nodes, which node would serve the data to Proxmox? Ideally Proxmox would get the data from the Ceph disks that are on the same physical machine. It seems wasteful from a network bandwidth point of view to be using a VM's disk stored on a different node if there is a copy of the data on the same node.

As you can probably tell from my post I am very new to Ceph.
 
Hello,

It seems you need a minimum of 3 nodes. Sounds a lot like Proxmox clustering. The question is: do all 3 nodes need to be full-blown servers with all the disks? With Proxmox clustering I can get away with a simple QDevice. I can easily get a 3rd device for Ceph and also use it as the 3rd Proxmox node, and therefore free up my Raspberry Pi for other uses. My main concern is disks. If I need to buy more 4TB SSDs for the file server, that's a lot of money for a home lab. Does Ceph need to store the data across all 3 nodes? In my mind I'm thinking: why can't two nodes have a full copy of the data while the 3rd node is there just for voting? This might not be how Ceph works though, which is why I've written this post. I wouldn't plan on having any VMs running on the 3rd node.
Ceph requires 3 nodes; having 3 copies of the data is the bare minimum for Ceph to guarantee data integrity in a reliable way.
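For reference, the replica counts are just per-pool settings; size 3 / min_size 2 is the Proxmox default. A minimal sketch, where "vm-pool" is only a placeholder name and the exact pveceph flags may differ between PVE versions:

Code:
    # Create a pool with 3 copies, still serving IO while 2 copies are healthy
    pveceph pool create vm-pool --size 3 --min_size 2

    # Or adjust an existing pool with plain Ceph tooling
    ceph osd pool set vm-pool size 3
    ceph osd pool set vm-pool min_size 2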


Do I have too many disks with my current setup? For example, having two 1TB drives for the VM pool in a ZFS raid 1 mirror makes sense when using ZFS, but do I still need the two disks if using Ceph? If one disk failed, another node would have the data, so there'd be no need for the 2nd disk?

With Ceph, the more disks you have, the better. It's preferable to have four 1TB disks rather than one 4TB disk, and if possible all of the same capacity. If you lose one disk, Ceph will recover the data onto the remaining OSDs so you still have 3 copies of each object.

If you have only one OSD per node and one fails, there is nowhere to replicate the data to; depending on your needs, running in a degraded state until you find a replacement might be perfectly OK (you still have two replicas of each object). If you have two disks per node and lose one OSD, Ceph will re-replicate the entire contents of that OSD onto the remaining OSDs, which requires them to have enough free space; do note that Ceph will have problems if even *one* of your OSDs goes above ~90% full. In short, running one OSD per node can be preferable to having two per node, depending on your use case. Do note that if you go below two replicas, *all* IO on the Ceph pool will be blocked.
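If it helps, you can keep an eye on the fill levels and the thresholds involved from the CLI; something along these lines (output details vary by Ceph release):

Code:
    ceph osd df                  # per-OSD utilisation, to spot a single OSD filling up
    ceph osd dump | grep ratio   # nearfull/backfillfull/full thresholds (0.85 / 0.90 / 0.95 by default)
    ceph health detail           # warns once any OSD crosses the nearfull threshold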


If the data is present on all nodes, which node would serve the data to Proxmox? Ideally Proxmox would get the data from the Ceph disks that are on the same physical machine. It seems wasteful from a network bandwidth point of view to be using a VM's disk stored on a different node if there is a copy of the data on the same node.

Each object is in a placement group, and each placement group has a defined primary OSD which will be preferred for reading. So no, reads are not served from the local copy; they are spread across the cluster according to each placement group's primary OSD.
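You can see this for yourself: for any object name, Ceph will tell you which placement group it maps to and which OSDs hold it, with the primary listed first. A quick sketch, where "vm-pool" and the object name are placeholders:

Code:
    # Which PG, and which OSDs (primary first), would hold this object
    ceph osd map vm-pool some-object-name

    # Acting sets and primaries for every PG in the cluster
    ceph pg dump pgs_brief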
 
Thanks for your reply, Maximiliano.

So it sounds like I need to decide if I want to build a full 3rd node then. Time to get a bank loan to fund my excessive home lab :p


For my setup I would like to be able to suffer a single disk failing somewhere in the cluster and also a full host going down. If we take the simple case of the VM storage pool, which I use for the VM disks, I can see the following possible options:

Option A (I think this would be the preferred setup?)
Node 1:
- OSD = 1TB
- OSD = 1TB

Node 2:
- OSD = 1TB
- OSD = 1TB

Node 3:
- OSD = 1TB
- OSD = 1TB


Option B (from what I read it's better to have one OSD per disk, so this option is probably rubbish or maybe not even possible at all?)
Node 1:
- OSD = 2 x 1TB

Node 2:
- OSD = 2 x 1TB

Node 3:
- OSD = 2 x 1TB


Option C (this cuts down on disks but is possibly not a good option?)
Node 1:
- OSD = 1TB

Node 2:
- OSD = 1TB

Node 3:
- OSD = 1TB


Once I understand the above I can think through the large pool for use by the file server, which I think would be CephFS.
 
In general, one disk is one OSD (although some people run multiple OSDs on a single NVMe for performance, this has some drawbacks and requires very performant NVMes).
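In Proxmox terms that is usually one command per disk; splitting a single NVMe into several OSDs is done with ceph-volume instead. A rough sketch (device paths are placeholders and flags can differ by release):

Code:
    # The normal case: one whole disk becomes one OSD
    pveceph osd create /dev/sdX

    # The advanced case: carve one fast NVMe into two OSDs
    ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1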

For a small cluster, 1 OSD per node is OK. If you use two, you cannot have any of them filled over ~40% of their raw capacity without risking problems if a single OSD fails and its data has to be rebalanced onto the other OSD in that node.

Do also note that since you store 3 replicas, your usable space is 1/3 of the total raw space. So, taking the failure scenario into account, you end up with less usable space than with a single OSD per node.
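As a rough worked example against option A above: two 1TB OSDs per node is 6TB raw, so at 3 replicas the theoretical ceiling is about 2TB usable. But to survive one OSD failing and its contents rebalancing onto the other OSD in the same node, each OSD has to stay around 40-45% used (two 45%-full OSDs merge into one ~90%-full OSD, which is already at the backfillfull threshold), so in practice you can safely store roughly 0.8-0.9TB. A single 1TB OSD per node, as in option C, gives you about the same usable space, up to the ~85% nearfull mark, with half the disks.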

Might I suggest you consider the option of having two nodes using ZFS and replicating the data periodically between them? That might work perfectly well for your use case. On the other hand, for Ceph I would not recommend anything other than either 1 or 4+ OSDs per node on a 3-node cluster.
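For what it is worth, Proxmox's built-in storage replication is just a scheduled job per guest; a minimal sketch, where the VM ID, target node and schedule are placeholders:

Code:
    # Replicate VM 100's ZFS disks to node2 every 15 minutes
    pvesr create-local-job 100-0 node2 --schedule "*/15"

    # Check when each job last ran and whether it succeeded
    pvesr status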
 
I’ve been thinking about this in the background.

For small VMs, using ZFS replication is fine for my use case. Where it falls down is my file server VM, which would have too much data for replication to be a sensible option. That's why I started looking at Ceph. Samba could then mount CephFS.
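From what I've read, that bit looks simple enough: mount CephFS on the host (or in the container) and point a normal Samba share at it. Something like the below, where the monitor address, credentials and share path are placeholders from my notes rather than anything I've tested:

Code:
    # Kernel mount of CephFS
    mount -t ceph 192.168.1.11:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

    # /etc/samba/smb.conf share on top of it
    [tank]
        path = /mnt/cephfs/tank
        read only = no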

I think if I stick with ZFS replication I need to accept that I can't fail over the file server VM. This then means it's pointless failing over some of the other VMs, because they rely on the file server for their storage and so won't work properly if the file server is down.

Typical VMs:
- File server
- Backup server
- NextCloud – Relies on file server
- ZoneMinder for security cameras – Relies on file server
- Plex – Relies on file server



I would want a pool to store the VMs on, a CephFS pool for the file server to use, and another CephFS pool for the backup server.

I’ve already bought the 4TB drives so I need to stick with them now. Plus I don’t have enough drive bays to get the storage I want from loads of 1TB drives.

This is how I imagine each of the 3 nodes would then look with Ceph:
- VM pool = OSD = 1TB
- File server CephFS pool = OSD = 4TB
- Backup server CephFS pool = OSD = 4TB

Say I needed more storage at some point in the future, I would do this:
- VM pool = OSD = 1TB
- File server CephFS pool = OSD = 2 x 4TB
- Backup server CephFS pool = OSD = 2 x 4TB

Is the above sensible?
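If I did go this way, my understanding is that I'd keep the 1TB and 4TB disks in separate pools by tagging them with CRUSH device classes and giving each pool its own rule, roughly like this (the class, rule, pool names and OSD IDs are placeholders, I haven't tried it):

Code:
    # Tag the 4TB disks with a custom device class (clearing the auto-detected class first)
    ceph osd crush rm-device-class osd.3 osd.4 osd.5
    ceph osd crush set-device-class big osd.3 osd.4 osd.5

    # A replication rule that only uses OSDs of that class, one copy per host
    ceph osd crush rule create-replicated big-rule default host big

    # Point the file server pool at that rule
    ceph osd pool set fileserver_data crush_rule big-rule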

What about running the backup server in the same cluster? I would like to stick to the 3-2-1 backup rule. Without Ceph this is fine, as the file server and backup server would be on different hosts and therefore meet the criterion of keeping 2 copies on different media. If I use Ceph, I guess this is still true because they would be on different CephFS pools?

It feels like running a 3-node Ceph cluster is a bit like running raid 5. Yes, you have some redundancy, but you are at risk during rebuilds. I've had raid 5 arrays die on me during rebuilds, at which point I decided that raid 6 was the minimum I should use for large storage arrays. I tend to start with raid 1 for 2 disks and then jump to raid 6 if I need more than 2 disks.

I'm really not sure what to do. Ceph feels like the solution, but it's complicated and to do it properly I'd maybe need to invest quite a bit more money.
 
I have decided not to bother with Ceph because of the cost involved in creating a 3rd full data-bearing server and the fact that each server can only usefully store a fraction of the overall storage capacity. I love the idea of Ceph but it feels like the wrong choice for what I'm doing.

Instead I'm going to use ZFS replication. This works fine for small VMs, which is almost all of my VMs. The tricky bit is my TrueNAS VM, which has several disks passed through via a SAS card. I did some testing, and what I can do is replicate the VM but not use high availability. If the server dies I can manually migrate the VM to my other Proxmox node, as the config and disk are available on the other node. Then I manually plug the disks into the other server and pass them through to the TrueNAS VM, as the other server also has a SAS card. It sounds a bit messy but in practice it takes 5 minutes to do.
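For anyone finding this later, the manual part is only a couple of commands. With both nodes still up it's a normal offline migration; if the node has actually died, the VM config can be moved inside the cluster filesystem from the surviving node. The VM ID and node names below are placeholders for my setup:

Code:
    # Planned maintenance: both nodes up, VM shut down
    qm migrate 100 node2

    # Node 1 is dead: move the config to node 2, run on the surviving node
    mv /etc/pve/nodes/node1/qemu-server/100.conf /etc/pve/nodes/node2/qemu-server/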
 
