Ceph cluster suggestion

MMartinez

Hello,

I've been using a Ceph storage cluster on a Proxmox cluster for a year and a half and we are very satisfied with the performance and the behaviour of Ceph.

That cluster runs on a 4-node Dell C6220 server with dual 10GbE NICs, which has been a very good server for us.

Now we've ordered a second C6220 with the same configuration and we're wondering which configuration would be better for us. Two options:

1) Having two different Ceph clusters on the same Proxmox cluster. Each C6220 would be a Ceph cluster, and each would serve one pool to all the nodes in the Proxmox cluster over the 10GbE network.

2) Having just one Ceph cluster of 8 nodes on a Proxmox cluster.

From a confidence point of view, I prefer the first one, as I can upgrade Proxmox in a more controlled way: before upgrading one Ceph cluster, I would move the critical machines from its pool to the other cluster, which should not be affected by the upgrade.

As I don't have a test Ceph cluster for now, I prefer to have two production Ceph clusters of 4 nodes instead of one of 8 nodes.

In terms of capacity, I believe it would be the same, as we have 3 copies of each PG.

Maybe in terms of performance it could be better to use one 8-node cluster, since with two 4-node clusters I would have to distribute the VMs between the two clusters to balance usage. But performance is already good enough for us with just one 4-node cluster, and it would double by adding a second one.

So, what would you recommend? Any other considerations?

Kind regards,

Manuel Martínez
 
Performance is not the only factor that improves with a larger cluster. More OSD:s and more nodes mean a greater spread of data when repairing after an OSD or node failure. This translates to reduced recovery times, as each remaining OSD that can take over data from the failed OSD/node needs to store a smaller share of it. It also spreads network usage over more hosts, meaning less saturation of each link. Reduced recovery times mean less risk of yet another OSD or node failure during recovery.
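
To put rough numbers on the recovery argument, here is a back-of-the-envelope sketch in shell. It assumes 4 OSDs per node, evenly filled OSDs, and a whole node failing. The figure of 1000 GB per OSD and the function name `recovery_gb_per_osd` are made up for illustration, and real CRUSH placement will not be this uniform.

```shell
#!/bin/sh
# Rough estimate: GB each surviving OSD must absorb when one node fails.
# Assumes data is spread evenly; real CRUSH placement is less uniform.
recovery_gb_per_osd() {
    nodes=$1          # nodes in the cluster
    osds_per_node=$2  # OSDs on each node
    gb_per_osd=$3     # data stored per OSD (hypothetical figure)
    lost=$(( osds_per_node * gb_per_osd ))
    survivors=$(( (nodes - 1) * osds_per_node ))
    echo $(( lost / survivors ))
}

echo "4 nodes: $(recovery_gb_per_osd 4 4 1000) GB per surviving OSD"
echo "8 nodes: $(recovery_gb_per_osd 8 4 1000) GB per surviving OSD"
```

With these made-up numbers, each surviving OSD takes in roughly 333 GB in the 4-node case and 142 GB in the 8-node case.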

There are of course reasons why you would have two clusters, some of which you have stated, such as an alternate cluster for testing or for upgrading. Another one might be security, if it would be beneficial to limit network access between the clusters. Yet another would be if you have some VM guests that really put a strain on the cluster and you want a simple way to limit their impact on other guests. Maybe you could go with one 3-host and one 5-host cluster, using the 3-node one for backups.

Maybe you should set up a new cluster, since you likely have greater experience and insight now than when you started with the first cluster, so you will configure the networking and other things better than for the current cluster, and in time you can integrate the old machines into the new cluster.

I would consider these things that are not obvious the first time one sets up a cluster:
  • Separate ceph public and private networks, or bonding more interfaces for ceph.
  • Two physically separate proxmox corosync networks (ring0 and ring1) to handle networking failures better.
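
As a small aid for the first point, this is a sketch of how one might check which public and cluster networks a node is configured with. It assumes the config lives at /etc/pve/ceph.conf, as on a Proxmox-managed Ceph setup; the function name `show_ceph_networks` and the subnets in the sample file are hypothetical.

```shell
#!/bin/sh
# Print the public/cluster network lines from a ceph.conf-style file.
# Both "public network" and "public_network" spellings are accepted.
show_ceph_networks() {
    conf=${1:-/etc/pve/ceph.conf}
    grep -E '^[[:space:]]*(public|cluster)[ _]network[[:space:]]*=' "$conf"
}

# Example against a throwaway file with hypothetical subnets.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[global]
     cluster network = 10.10.20.0/24
     public network = 10.10.10.0/24
     fsid = 00000000-0000-0000-0000-000000000000
EOF
show_ceph_networks "$sample"
rm -f "$sample"
```

If both lines show the same subnet, public and cluster traffic share one network. For the second point, `corosync-cfgtool -s` on a live node reports the status of the corosync rings.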

Before you use the new cluster you should also do stress testing and see how the cluster behaves when losing OSD:s/nodes and network equipment.
 
Thanks Bengt,

Reducing recovery time when an OSD has failed is a good point. Thanks, I was not aware of that.

I've had two small problems with failing OSDs, and it is nice to know how to reduce recovery time and risk.

We are also using some FreeNAS servers as NFS and iSCSI storage. This was our only shared storage system until we gave Ceph a try.

Since last year we've been moving our critical VMs to a Ceph pool, as we have gained confidence in Ceph storage and have seen that it performs very well. Now we want to add more Ceph nodes (and OSDs) to have more Ceph storage and be able to move more VMs to Ceph.

So, limiting the resources of test VMs on Ceph storage is not something that actually worries me much, as I think we will probably keep them on FreeNAS storage.

Do you really recommend using bonds on Ceph network interfaces? If I'm not wrong, I was strongly advised not to do it, and we followed that suggestion, so after we migrated to 10GbE we stopped using bonds on the Proxmox servers' NICs.

Your idea of a 5-node cluster and a 3-node cluster is interesting, I'll think about it. Thanks

Kind Regards,

Manuel Martínez
 
I actually do not recommend anything since I'm not in a position to do so :)

I was not aware of any general recommendations against NIC bonding for ceph. The official ceph documentation suggests bonding as an option, as well as having several NIC:s and addresses for the ceph network per node. But maybe having multiple NIC:s and addresses is recommended over bonding since it is a somewhat simpler setup in some ways, and not as dependent on networking hardware?
 
I know you are not encouraging me to do anything! I'm glad to receive your opinion on this subject. Thanks!! ;)

In fact, the recommendation I'm talking about was to stop using 2x1Gbit NICs and switch to 10GbE.

What I've read in the past is that a bond adds a complexity layer that doesn't add much benefit in terms of performance (it does in fault tolerance) so I understood it was better to avoid it.

Look at these threads to see the responses in their context:
Following Alwin's suggestion, we decided to implement that network model and it has worked perfectly for us.

His suggestion, for our needs and with our hardware, was:
  • Ceph should use 10GbE; the public and cluster networks don't need to be split.
  • Corosync on a separate network, preferably with two rings (different interfaces)
  • FreeNAS storage access on its own network (optionally 10GbE)
  • Client traffic on its own network (better manageability and security)

We could probably improve it by adding more 10GbE NICs, but it works fine for us.

Kind Regards,

Manuel Martínez
 
In fact, the recommendation I'm talking about was to stop using 2x1Gbit NICs and switch to 10GbE.

In fact, I would suggest looking at 40 gig. I just picked up sixteen 40 gig Ethernet cards for $35 each on eBay and a Cisco 3132 40 gig switch for $815.
 
It's good to know. At the moment, we're using 10K rpm SAS disks as Ceph OSDs, 4 disks per node, and we haven't reached the 10GbE limits.
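
A quick sanity check on that, as a sketch only: the 150 MB/s sequential throughput per 10K SAS disk is an assumed ballpark, not something measured on this cluster, and the function name `disk_mbit_per_node` is made up. It also ignores replication traffic and random I/O, which behave very differently.

```shell
#!/bin/sh
# Compare aggregate sequential disk throughput per node against link speed.
# mb_per_s per disk is an assumed ballpark, not a measured value.
disk_mbit_per_node() {
    disks=$1      # OSD disks per node
    mb_per_s=$2   # assumed sequential MB/s per disk
    echo $(( disks * mb_per_s * 8 ))
}

link_mbit=10000   # 10GbE
node_mbit=$(disk_mbit_per_node 4 150)
echo "disks: ${node_mbit} Mbit/s, link: ${link_mbit} Mbit/s"
```

Under that assumption, four spinning disks per node add up to about 4800 Mbit/s, roughly half of a 10GbE link.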

Perhaps in the future, if we switch to SSDs, we will consider using 40 or 100GbE.

By the way, can you tell me a bit more about the 40GbE cards that you ordered for $35?

Thanks Natthan,

Manuel Martínez
 
Hi again,

Now that I've added my new nodes to our Proxmox cluster and installed the Ceph packages, I've realised that, as I already have an initialized Ceph network and a ceph.conf file on the pmxcfs storage, my new Ceph nodes become part of the existing Ceph cluster. So the configuration I was interested in (especially to avoid problems during package upgrades) seems to be difficult or maybe impossible to implement.

Is there an easy way to configure two Ceph clusters that share the same Ceph network on one Proxmox cluster? Do you think it would be advisable?

Regards,

Manuel
 
Hello,

After realising that it is not possible to create two Ceph clusters on one Proxmox cluster, I'm looking for a way to have just one Ceph cluster but two Ceph pools, with one condition: each pool has to exclusively use OSDs from selected Ceph nodes located in different physical containers.

I've seen that this can be done on Luminous using device classes. In fact, gdi2k explains how he did it last year in this post: https://forum.proxmox.com/threads/recommended-method-for-secondary-ceph-pool.35821/#post-219889

The procedure looks easy and fits what I want to do, that is:

1) Add 4 more ceph nodes to the cluster (the 4 nodes are on a single physical container - Dell C6220)
2) Add 16 OSDs of different weights to the cluster, and mark them in some way to be able to differentiate the "class" of these disks. For example class "B"
3) Create one crush rule for these new disks (class B)
4) Create a new pool "CephPoolB"
5) Migrate all the VM disks to the new pool
6) Destroy the old pool
7) Replace the old 16 OSDs with another 16 bigger disks. These new OSDs will be added and marked as class "A"
8) Create a new pool "CephPoolA", and that's all.
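
Steps 2 to 4 could look roughly like the sketch below. It prints the commands by default and only executes them when APPLY=1 is set. The OSD IDs, the pg_num of 256, the rule name `rule-B` and the helper names (`run`, `tag_osds`, `pool_for_class`) are hypothetical; only the `ceph` subcommands themselves are standard Luminous CLI.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set APPLY=1 to really run them against a cluster.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Tag the given OSDs with a custom device class.
# An OSD that already has a class needs it removed first.
tag_osds() {
    class=$1; shift
    for osd in "$@"; do
        run ceph osd crush rm-device-class "$osd"
        run ceph osd crush set-device-class "$class" "$osd"
    done
}

# Create a replicated rule limited to one device class, then a pool on it.
pool_for_class() {
    class=$1; pool=$2; pg_num=$3
    run ceph osd crush rule create-replicated "rule-$class" default host "$class"
    run ceph osd pool create "$pool" "$pg_num" "$pg_num" replicated "rule-$class"
}

tag_osds B osd.16 osd.17          # hypothetical OSD IDs
pool_for_class B CephPoolB 256    # hypothetical pg_num
```

Reviewing the printed commands before setting APPLY=1 seems prudent, since changing device classes or rules on a live cluster can trigger data movement.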

Would you consider this the right way to do it with Proxmox 5.4? The procedure described by gdi2k is from June 2018 and there might be other ways to do it nowadays.

Regards,

Manuel Martínez
 
It is still done the same way AFAIK, using CRUSH rules to instruct ceph where to store data. I think the simplest way for you would be to group your hosts into "chassis", "racks" or some other bucket type in the CRUSH hierarchy, instead of creating new device classes or assigning incorrect device classes just to group your data.

That would mean you need to do roughly this:
  1. Read http://docs.ceph.com/docs/luminous/rados/operations/crush-map/
  2. Create instances of the chosen level, say you pick "rack" to group your hosts, create "rackA" and "rackB".
    1. "ceph osd crush add-bucket rackA rack"
  3. Move the defined racks into your CRUSH hierarchy below your root (by default named "default")
    1. "ceph osd crush move rackA root=default"
  4. Move your hosts into the corresponding rack, so you have "rackA": { host1, host2, host3, etc }, "rackB": { host4, host5, etc }
    1. "ceph osd crush move host1 rack=rackA"
    2. Check the crush map in the Ceph "OSD" tab in Proxmox to verify hosts are grouped correctly
  5. Create two new replication rules, specifying "rackA" and "rackB" respectively as the root from which to pick OSD:s, with host as the failure domain and hdd (or ssd, nvme) as the device class (unless you want to skip the device class):
    1. "ceph osd crush rule create-replicated host-replication-rackA-hdd rackA host hdd"
  6. Create pools using the new rules (e.g. "rackA-pool")
  7. Verify that the pools only pick OSD:s from the correct host group
    1. "ceph pg ls-by-pool rackA-pool" should only show OSD from your host group for the up and acting columns
Reservations for forgotten steps and typos.
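
Put together, the steps above could be scripted roughly as follows. This is a sketch, not a tested procedure: it prints the commands by default and only executes them when APPLY=1 is set. The host names, the pg_num of 256 and the helper names (`run`, `group_hosts_in_rack`, `pool_in_rack`) are hypothetical; the `ceph` commands are the ones listed in the steps, plus `ceph osd tree` and `ceph osd pool create`.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set APPLY=1 to really run them against a cluster.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Create a rack bucket under the default root and move hosts into it.
group_hosts_in_rack() {
    rack=$1; shift
    run ceph osd crush add-bucket "$rack" rack
    run ceph osd crush move "$rack" root=default
    for host in "$@"; do
        run ceph osd crush move "$host" rack="$rack"
    done
}

# Create a rule rooted at the rack, a pool using it, and list its PGs.
pool_in_rack() {
    rack=$1; pool=$2; pg_num=$3; class=${4:-hdd}
    rule="host-replication-$rack-$class"
    run ceph osd crush rule create-replicated "$rule" "$rack" host "$class"
    run ceph osd pool create "$pool" "$pg_num" "$pg_num" replicated "$rule"
    run ceph pg ls-by-pool "$pool"
}

group_hosts_in_rack rackA host1 host2 host3 host4   # hypothetical host names
run ceph osd tree                                   # check the hierarchy
pool_in_rack rackA rackA-pool 256                   # hypothetical pg_num
```

Keep in mind that moving hosts within the CRUSH hierarchy of a cluster that already holds data can trigger rebalancing, so trying this on a test cluster first is the safer order.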
 
Thanks a lot, Bengt, that seems to fit my needs.

I've installed a Proxmox test cluster on some VMs using nested virtualization and I'll do some tests.

Kind regards,

Manuel