[SOLVED] CEPH "Caching" by joining an external CEPH cluster with PVE Hyperconverged CEPH

Tmanok

Renowned Member
An idea has been percolating in my mind for some time now... My PVE hyperconverged CEPH cluster does not perform how I would like it to and I am considering non-hyperconverged cluster options for future growth (capacity). We have two "hadoop" high density SuperMicro HDD Nodes, purchasing a third would be trivial. I am considering something like this for each node:
  • 12x2TB HDDs
  • 1xPCIe to NVMe for Caching
  • 2x 10GbE
  • 64GB-128GB of DDR3 RAM
  • Dual E5-2600 series V0 Intel Xeons
If I had three of these, that would give me more than enough space, but possibly not enough performance. My current four PVE nodes are hyperconverged with a single CEPH pool which consists of:
  • 4x 500GB SATA SSDs
  • 1x LSI 9300-8i
  • 2x 10Gbps SFP+
  • HPE Gen 8 Rackmount Servers
  • Dual E5-2470 V2 OR E5-2690 V2 Intel Xeons
  • Sufficient memory per node: 48GB-160GB @ 1600MHz
The question is: should you join an external CEPH cluster to a PVE-CEPH hyperconverged cluster? In this case I have two basic options if I do join them together: the three SuperMicro high-density servers could serve as a secondary pool, or they could join the existing SSD pool in the PVE cluster.

Some additional questions might be:
  • CEPH should automatically store data closest to where it is being read and written. I already know that the existing pool is performing below my expectations; would I see worse, similar, or better performance by joining three non-PVE nodes into the same pool?
  • Would it make more sense to install PVE on those three SuperMicro servers?
  • Is this all a fool's errand?
Thank you in advance for your time. I am in an exciting position heading many projects, but I don't have anyone locally who can guide me in this field of advanced storage. I hope this post is somewhat fun to answer, or at the very least interesting to conceptualize. These forums have been very informative for me and have greatly improved my knowledge base; I appreciate everyone who participates here.
 
A few thoughts from my side.

Firstly, I hope the 2x 10Gbit NICs are the ones used for Ceph and you do have more NICs for the Proxmox VE cluster (Corosync), VM traffic and such ;)

Mixing different Ceph deployments is not a good idea. PVE deploys and manages Ceph in its own way and with its own packages. If you mix that with something else, for example Cephadm, you will have issues! You could of course add an external Ceph cluster as storage in your PVE cluster.

Cache: How would you use the NVME for caching in Ceph? In a cache pool? That will not really help you if you have VM disks stored there, as there is no clear distinction between hot (cacheable) and cold data. Consider everything as hot data.

If you add the additional nodes to the Ceph cluster (install PVE with Ceph on them), you now have a Ceph cluster with different OSD classes (SSD & HDD). In such a situation, it is best to create CRUSH rules that specify which OSD class to use and assign them to the pools, so you will have an SSD and an HDD pool. Otherwise the performance will be a very mixed bag, depending on where the involved placement groups are located.
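Such device-class rules could look roughly like this (pool names are placeholders, double-check everything against your own setup before applying it):

    # one replicated rule per device class, failure domain "host"
    ceph osd crush rule create-replicated replicated_ssd default host ssd
    ceph osd crush rule create-replicated replicated_hdd default host hdd
    # assign the rules to the respective pools
    ceph osd pool set <ssd-pool> crush_rule replicated_ssd
    ceph osd pool set <hdd-pool> crush_rule replicated_hdd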


If you do have enough space, I would rather investigate the current performance and if there might be some other things that you can change to increase the performance.
Overall performance monitoring in your cluster is definitely helpful to identify possible bottlenecks.

One thing could be the network. You have 4 nodes with 4 SSDs each on a 10Gbit network. Take a look at the older Ceph benchmark paper from 2018, page 4. We tested a similar situation and looked at how much more performance could be gained by adding more nodes, depending on the network speed. The 10Gbit network was already quite at the limit.

You can also do a ceph tell osd.X bench on all your OSDs and check if you have some slow outliers. If you do have them, it might help to destroy and recreate them (one at a time). Otherwise, there might be some hardware issue (close to failure, or just connected via a slow HBA/backplane/...).
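For reference, that looks like this for a single OSD (osd.0 is just an example id; the JSON output includes a bytes_per_sec figure you can compare across OSDs):

    ceph tell osd.0 bench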
 
Hi Aaron!

Firstly, I hope the 2x 10Gbit NICs are the ones used for Ceph and you do have more NICs for the Proxmox VE cluster (Corosync), VM traffic and such ;)
Well... I have 1x 10Gbps for CEPH Public (VM DATA) and 1x 10Gbps for CEPH Cluster + PVE Corosync. Then each node has 4x 1Gbps for WAN and VNICs.

Mixing different Ceph deployments is not a good idea.
That is completely fair, I hadn't given much thought to using various deployment methods; you're right, that sounds pretty horrendous. Constantly in CEPH WARN or ERROR state lol.

I would rather investigate the current performance
This is probably the more responsible option; I'm merely thinking about upgrades in my head :p

Currently, performance is somewhat sub-par. My VMs hardly ever seem to exceed 15MB/s, which is pitiful considering there is 500MB/s per drive in the CEPH cluster and a really good LSI 9300-8i in each of my servers. As a simple example, I downloaded the Ubuntu Desktop 20.04.2.0 ISO to a Debian VM on my SSD pool. Downloading the file happens at WAN speed for the VM (about 250Mbps or 31.25MB/s), but copying it averages 42MB/s (peak 47MB/s) and CEPH reports a peak of maybe 53MB/s.

This is about 1/10th the expected performance, even accounting for CEPH performing 6MB/s-15MB/s of other operations during peak. Obviously, this is complicated further by those other operations not being sequential reads and writes, and placement groups living on multiple disks, across multiple nodes, etc etc.

The 10Gbit network was already quite at the limit.
I read that paper in 2019 and again late last year; it was a good paper, but in my scenario I do not believe this is network related. If I use bmon, Glances, or iftop while performing this same copy process, my peak network throughput is only 71.5MB/s (572Mbps, i.e. 0.572Gbps). The sustained throughput is maybe 20MB/s or less, which is about what I'm used to seeing.

Windows VMs make this problem even worse, maybe because I've configured their drive parameters ineffectively, maybe because the KVM drivers are not ideal; they were originally migrated from a Hyper-V host and perform worse than my natively created Windows VMs.

(close to failure, or just connected via a slow HBA/backplane/...).
I have questioned the backplane, these are DL360e Gen8 HPE Proliant servers with one DL380p Gen8 server among them. They're running with adequate memory and processing power, so my understanding is that either my HBAs are misconfigured or the backplanes are possibly not cutting it. The SAS cabling is new and routed very nicely, I built these servers from barebones, their BIOS and out of band management are up to date...

You can also do a ceph tell osd.X bench on all your OSDs and check if you have some slow outliers.
This is my next move, along with faster memory on one of the nodes. Very few in-depth benchmarks have taken place because of the rush I've been in to set up this infrastructure. Will report back benchmarks.
 
Well... I have 1x 10Gbps for CEPH Public (VM DATA) and 1x 10Gbps for CEPH Cluster + PVE Corosync. Then each node has 4x 1Gbps for WAN and VNICs.
Having only one Corosync link and that one on the same physical network as Ceph can be quite problematic. Ideally, Corosync would have at least one dedicated network for itself [0]. You can also configure multiple links [1] for Corosync for it to switch between if one becomes unavailable.

A reliable Corosync connection between the nodes is especially important if you use the Proxmox VE HA stack! It is used to determine if a node is still part of the cluster. If the node lost the connection to the cluster, there are usually two reasons. The node is actually down, or the node just lost the connection either because the network is down (e.g. pulled/broken cable) or because another service is using up all the bandwidth (such as Ceph) and the latency for the corosync packets is going up too much, to the point where Corosync determines the link as not usable anymore. In that case, the node will wait for a minute or two before it will fence itself (hard reset) to make sure that the guests on it are off before the remaining nodes will start them.

If you have the only Corosync link on the same network as a potential bandwidth intensive service (Ceph, Storage, Backup, ...) you might run into the situation that all nodes have a lot of traffic at the same time -> bandwidth fully used -> latency for Corosync packets going up -> nodes lose Corosync connection between each other -> each node fence itself -> looks like the whole cluster just rebooted itself.


You can try to use the writeback cache mode for the VM disks. It is possible that many small writes might kill your performance.
How are the VMs configured?
How full is the ceph cluster?
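Setting the writeback cache mode from the CLI would look roughly like this (VMID 101 and the storage/volume name are just placeholders for your setup; the same option is available in the GUI when editing the disk):

    # re-specify the full drive string, otherwise options not listed fall back to defaults
    qm set 101 --scsi0 ceph-vm:vm-101-disk-0,cache=writeback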

I have questioned the backplane, these are DL360e Gen8 HPE Proliant servers with one DL380p Gen8 server among them. They're running with adequate memory and processing power, so my understanding is that either my HBAs are misconfigured or the backplanes are possibly not cutting it. The SAS cabling is new and routed very nicely, I built these servers from barebones, their BIOS and out of band management are up to date...
How are they connected? One HBA for how many disks? Those are easily oversubscribed, which will cost you at least some performance.
Should I run it on an OSD that is IN a pool? Or should I mark it as OUT, wait until the system has rebalanced and then try to benchmark the OSD?
Should be fine during normal operation. It does not take long but should tell you if you have some hard outliers that are much slower.


[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_requirements
[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
 
Hi Aaron,

Networking
Thank you for the obligatory note about Corosync, it is important to mention. However, I'm familiar with the networking requirements and have no concerns. Corosync only shares the CEPH monitor network, and NOT the public (data/OSD) network. Traffic on the corosync + ceph-mon link seldom goes above 100Mbps despite it being a 10Gbps link, and latency is below 0.2ms at any given time. Additionally, I have never witnessed an instance where one of my nodes was randomly unreachable. My available ports on each server are 4x 1Gbps and 2x 10Gbps, so I'm working carefully within those limitations. If you feel that giving Corosync a 1Gbps network link, or the monitor network a 1Gbps network link instead, is a good idea, then I'm willing to try that, but I have not seen my corosync + ceph-monitor network ever pass more than, say, 200Mbps.

You can try to use the writeback cache mode for the VM disks.
I use write back cache on all my virtual disks, it certainly offers the best performance.

How are they connected? One HBA for how many disks?
LSI 9300-8i with 4x SATA SSDs per node. That's 550MB/s × 4 in a single direction, i.e. 2.2GB/s, or a rough 4.4GB/s "full-duplex" if you count reads and writes simultaneously. The LSI 9300-8i sits in a PCIe 3.0 x8 slot with no bifurcation, which gives it at most 7.88GB/s, so I don't see using half of that causing a real bottleneck. Lastly, this card has been tested to achieve above 380MB/s per SATA/SAS port with 8 storage devices attached, and the cables for it are all new, so I don't see it causing a problem.
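One way I can double-check that is to look at the negotiated PCIe link on the HBA (the PCI address below is just an example for illustration):

    # find the HBA's PCI address, then compare link capability vs. negotiated status
    lspci | grep -i lsi
    lspci -vv -s 03:00.0 | grep -i 'lnkcap\|lnksta'
    # LnkSta should report Speed 8GT/s, Width x8 for full PCIe 3.0 x8 bandwidth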

Should be fine during normal operation.
Ok I will run the benchmark and post below.

Hypothesis
My hypotheses on the possible causes of my storage performance issues are:
  • CEPH Configuration
  • Backplane (DL360e Gen8)
  • LSI device drivers
Observed throughput speeds
Some up to date numbers I gathered while upgrading to PVE 7.0 and CEPH Pacific:
(All of my VMs and CTs were off during this time)
  • Single VM backup: 30MB/s
  • Two VM backup: 65-80MB/s
  • 5-6 VMs backing up: 150-170MB/s
  • Rebalancing: 250MB/s-510MB/s
It was not my intention to rebalance; that only happened because I was upgrading RAM in one of my nodes and forgot to set noout like a dummy. But the interesting thing to me is that each VM seems to have limited performance, and even when three nodes are actively rebalancing, they only hit the peak performance of what should essentially be a single disk. When I think about CEPH, sometimes I wonder why reading three parallel copies of data doesn't provide at least 2x the throughput when there are this many disks involved...
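For my own future reference, that's the standard Ceph noout flag:

    ceph osd set noout      # before planned maintenance: stopped OSDs are not marked out
    ceph osd unset noout    # afterwards: back to normal behaviour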

Benchmark
Lowest throughput OSD: osd.0 at 420MB/s
Highest throughput OSD: 472MB/s
Average throughput: ~455MB/s

Is there a way to benchmark all of the disks at once?

How full is the ceph cluster?
CEPH Dashboard reports 42% used: 3.05TiB out of 7.28TiB. There are 16x 500GB OSDs, i.e. 8TB or 7.28TiB raw.
CEPH config: "osd_pool_default_min_size = 2, osd_pool_default_size = 3"; current PGs = 128, optimal = 256 since the fourth node was added. Probably a good call to update that to 256 now. What do you think?


Thanks Aaron for your time and patience,

Tmanok
 
How do you backup and where do you backup to?
Have you seen the last Ceph benchmark paper?
Backups might not be the best benchmark metric, as it also highly depends on where you backup to.

I would start at the lowest level. First benchmark a single disk, then run rados benchmarks directly on the Ceph cluster, and then you can start benchmarking inside VMs. Each layer adds its own complexity and possible limits.


The one thing you should be aware of is that it can happen that a single client will not fully utilize the whole Ceph cluster's possible bandwidth. We had that in the benchmark paper when doing the rados benchmarks. Doing the benchmark on two nodes in parallel gave better overall performance for the whole cluster, and it was even better when we ran it on all 3 nodes in parallel. We did not have more nodes to go even further.

With VMs it might be even more so, as the virtualization layer also costs some performance. You could set the VMs to use SCSI with IO threading and the VirtIO SCSI controller so that a separate thread handles each VM disk's IO. This can help if the VM's IO is CPU limited because the disk IO is running in the same thread as the virtualization itself.
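A rough CLI example of that combination (VMID 101 and the volume name are placeholders; the same options are available in the GUI when editing the controller and the disk):

    # the VirtIO SCSI single controller gives each disk its own IO thread
    qm set 101 --scsihw virtio-scsi-single
    # attach the disk via SCSI with iothread enabled (re-specify the full drive string)
    qm set 101 --scsi0 ceph-vm:vm-101-disk-0,iothread=1,cache=writeback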
 
CEPH Dashboard reports 42% used: 3.05TiB out of 7.28TiB. There are 16x 500GB OSDs, i.e. 8TB or 7.28TiB raw.
CEPH config: "osd_pool_default_min_size = 2, osd_pool_default_size = 3"; current PGs = 128, optimal = 256 since the fourth node was added. Probably a good call to update that to 256 now. What do you think?

Have a look at the Ceph PG calculator. Select "All in One" in the drop-down menu above it. Add the pools you have and how many OSDs each pool has (should you separate them by device type). It will then calculate the optimal number of PGs.

That message is coming from the autoscaler, I assume? Did you set a target_size or target_ratio for the pool(s)? This helps the autoscaler determine the pg_num right away, so it will not have to adjust it as the pool grows.

If you have only 1 pool and 16 OSDs, the PG calculator says that 512 PGs would be ideal.
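For reference, the usual rule of thumb is roughly (number of OSDs × 100) / replica size, rounded to a power of two: 16 × 100 / 3 ≈ 533, which rounds to 512.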

Also, the Ceph OSD benchmarks seem to be quite okay. No OSD is particularly slower. If you want to run it on all OSDs at (almost) the same time, you can do so with ceph tell osd.* bench.
 
How do you backup and where do you backup to?
Currently PVE backs up to a PBS VM running on my file server. The file server has plenty of RAM and CPU power for its job, and 8x 4TB HDDs. The link is only gigabit at this time; I've recently installed a 10Gbps SFP+ NIC in the file server, but it is not configured yet.



Backups might not be the best benchmark metric, as it also highly depends on where you backup to.
Fair, my file server could be more or less loaded at a given time of day. However, for each additional VM being backed up at a given time, I saw a clear and consistent increase and later decrease in throughput. E.g. the first VM starts backing up and uses 30-40MB/s; a second VM starts backing up a few moments after the first, and now the total usage is 70-80MB/s, etc.

First benchmark a single disk
Ok, I require some guidance here. My concern is overwriting files or damaging the file system whenever someone mentions fio or rados benchmarking. Is it safe to run a rados or fio benchmark while an OSD is within an active (production) CEPH pool? Or should I remove it from my pool and create a secondary pool for benchmarking?

then you can start benchmarking inside VMs
Oh I've certainly done that as well:
[Attachment: RDSH04.png]
Windows VM performance. Peak performance was less than 200MB/s on the write and about 90MB/s on the read.

a single client will not fully utilize the whole Ceph clusters possible bandwidth
Thank you for linking that benchmark paper; yes, that's very clearly explained on pages 5 and 6. It gave some insight into how CEPH works, but I didn't see "why" that was the case: for example, was there a network bottleneck, or is it a latency issue within CEPH that causes a single client to never saturate all of the available bandwidth?


set the VMs to use SCSI with io-Threading and the Virtio-SCSI controller to have a separate thread for each VM disk handling IO
I may need some more clarification, but this is how the majority of my VMs are configured:
[Attachment: Windows Server VM HW Config.png]
Of course the number of cores and amount of RAM differs between VMs, but the rest is similar. (Sometimes the E1000 NIC gets replaced by a VirtIO NIC because of a bug where some Windows VMs drop their NIC connection.)

This can help if the VMs IO is CPU limited if the disk IO is running in the same thread as the virtualization itself.
You mean there is a way to prevent a CEPH OSD from using the same CPU thread as the VM?? Please explain.

Did you set a target_size or target_ratio for the pool(s)?
There is no target size set, mostly because I'm concerned that I will set a target that is too large. And to answer your other question, the 256 optimal came from "Datacentre>Node>Ceph>Pool" if you look there, the pool will have columns for "# of PGs" and "Optimal # of PGs".

Also, the Ceph OSD benchmarks seem to be quite okay. No OSD is particularly slower. If you want to run it on all OSDs at (almost) the same time, you can do so with ceph tell osd.* bench.
Ok I will post the results of the parallel ceph bench tomorrow evening, tonight I may be busy celebrating something important in my personal life.

Could you please suggest safe fio benchmarks to run on my cluster? Anything that risks data loss is off the table for me.
Thank you so very much for your time Aaron!

Tmanok
 
Ok, I require some guidance here. My concern is overwriting files or damaging the file system whenever someone mentions fio or rados benchmarking. Is it safe to run a rados or fio benchmark while an OSD is within an active (production) CEPH pool? Or should I remove it from my pool and create a secondary pool for benchmarking?
To benchmark a single disk, you will have to take it out of the cluster as a direct FIO benchmark will be destructive. Meaning, marking the OSD as out, waiting for the cluster to become healthy again and then destroying the OSD. The Rados benchmark needs to be done with a working cluster and pool. Ideally you can do that in some off time when you can power off the VMs to not have any load on the cluster other than the benchmark. If that is not possible, at least a time when the load of the benchmark will not have too much of a negative effect on the VMs. And let it run for a while to reduce the impact of short term load spikes or any caches. We usually let our benchmarks run for 10 minutes.
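Such a destructive single-disk benchmark could look roughly like this once the OSD is destroyed (/dev/sdX is just a placeholder for the now-unused disk; this will overwrite its contents):

    # sequential 4k sync writes for 10 minutes, similar in spirit to the benchmark papers
    fio --ioengine=libaio --filename=/dev/sdX --direct=1 --sync=1 \
        --rw=write --bs=4k --numjobs=1 --iodepth=1 \
        --runtime=600 --time_based --name=single-disk-write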
Thank you for linking that benchmark paper; yes, that's very clearly explained on pages 5 and 6. It gave some insight into how CEPH works, but I didn't see "why" that was the case: for example, was there a network bottleneck, or is it a latency issue within CEPH that causes a single client to never saturate all of the available bandwidth?
If the network had been the bottleneck, then the overall performance would not have increased with multiple benchmarks at the same time. No, it is more that a single client itself is not fast enough to saturate the whole cluster, running itself into some local CPU bound problems, maybe also some latency.

I may need some more clarification, but this is how the majority of my VMs are configured:
Okay, so the disk bus type is set to IDE which is primarily there for compatibility reasons. Ideally you would switch that to SCSI. In combination with the VirtIO SCSI controller (already set) you should see better disk IO because the VirtIO implementation is a much thinner layer than emulating a full IDE device (or SATA for that matter).
To switch a Windows VM to VirtIO SCSI you need to do a few steps, unfortunately. They are described in our Proxmox VE Wiki (last section). The VirtIO driver ISO for Windows can be downloaded from the Fedora project.
In our experience, it is very important to first attach a small dummy disk with the SCSI bus to the running VM before you shut it down and change the actual disk to SCSI. Otherwise Windows will not be able to boot from a SCSI disk.
To change the bus type of the existing disk, you need to detach it. It will show up as an unused disk. You can then edit it to attach it again. Make sure to set the bus type to SCSI. Also don't forget to adjust the boot order in the VM's Options afterwards. This is something that I like to forget.
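As a rough CLI outline with a placeholder VMID 101 and storage name (the Wiki article remains the authoritative guide):

    # 1. attach a small 1GB dummy SCSI disk to the running VM so Windows activates the VirtIO SCSI driver
    qm set 101 --scsi1 ceph-vm:1
    # 2. once the driver is active, shut the VM down and detach the IDE system disk (it becomes unused0)
    qm set 101 --delete ide0
    # 3. re-attach the same volume on the SCSI bus and fix the boot order
    qm set 101 --scsi0 ceph-vm:vm-101-disk-0
    qm set 101 --boot order=scsi0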

You mean there is a way to prevent a CEPH OSD from using the same CPU thread as the VM?? Please explain.
No, usually, the Disk IO of a VM is handled in the same thread as the VM itself. IO Threading moves the handling of the Disk IO into a separate thread. This has nothing to do with Ceph but only the virtualization (KVM/Qemu).
There is no target size set, mostly because I'm concerned that I will set a target that is too large. And to answer your other question, the 256 optimal came from "Datacentre>Node>Ceph>Pool" if you look there, the pool will have columns for "# of PGs" and "Optimal # of PGs".
That number comes from the Ceph autoscaler. If there is no target_size or target_ratio defined, it will only take the current pool size into account. This means that if the pool grows, the autoscaler could decide that a new, larger pg_num is better, and you will have to do some rebalancing, which could affect the performance of the cluster. If you have somewhat of an idea how much of the whole cluster will be used by the pool, you can tell that to the autoscaler by setting either the target_size or target_ratio parameter. It can then determine the best pg_num right away, saving you a rebalance later on.
The target_ratio takes precedence if you define both, and is in my opinion also the better parameter. If you only have one pool, then it is easy, set the target_ratio to 1 as it is the only pool.
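On the CLI that would be roughly the following (the pool name is a placeholder; it can also be set when editing the pool in the GUI):

    ceph osd pool set <pool> target_size_ratio 1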

You should also be aware that the autoscaler only becomes active if the better pg_num is off from the current one by a factor of 3. So if the ideal pg_num is 512 and the current one is 256, the autoscaler should not become active, as it is off by a factor of 2, but you can still set it manually.
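If you set it manually on the CLI, remember both values (pool name is a placeholder):

    ceph osd pool set <pool> pg_num 512
    ceph osd pool set <pool> pgp_num 512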
 
First of all, Aaron, you're awesome. That was very direct, clear, and informative. Gold star for you.

FIO benchmark will be destructive
marking the OSD as out, waiting for the cluster to become healthy again and then destroying the OSD.
Ok, see this is exactly why I didn't perform benchmarks after creating my pool, there were some fears after reading various posts about what would happen. This makes sense, and I will be sure to follow your instructions very precisely before I perform any fio benchmarks.

The Rados benchmark needs to be done with a working cluster and pool.
We usually let our benchmarks run for 10 minutes.
This suggests that you have control over the RADOS benchmark duration, good to know; I will figure that out when I perform a benchmark. And O.K., there are various times of the evening when disk activity is phenomenally low. That being said, I have about 40 backups scheduled every day, so I will likely turn off my VMs first anyway.

local CPU bound problems, maybe also some latency.
Ah ok, this makes sense. The reason why I considered the network being a possible bottleneck is that, if it was a bottleneck, then hypothetically testing on all three simultaneously would still give you a better result than a single test. And when the network is no longer a bottleneck, you would simply see a much improved result both using a single host and multiple, but with a less defined difference between single host vs multiple hosts. At least, that's how my traditional method of storage looks at it.

IDE which is primarily there for compatibility reasons
Dang I knew this would be the issue, I migrated from Hyper-V to KVM/QEMU with these VMs so I had to implement some compatibility...
To switch a Windows VM to VirtIO SCSI you need to do a few steps, unfortunately.
Yeah, that's why I haven't done them... Eventually we will move to two Windows Server VMs created using PVE so they use VirtIO storage and VirtIO SCSI controller.
Disk IO of a VM is handled in the same thread as the VM itself.
Oh! So how do I set up IO threading? I'm pretty sure that's only available if I switch the disks to VirtIO by creating a new virtual disk (blue checkmark):

[Attachment: Creating VM or VDisk.png]

The target_ratio takes precedence if you define both, and is in my opinion also the better parameter. If you only have one pool, then it is easy, set the target_ratio to 1 as it is the only pool.
Alright, you were right: once I added a target_ratio of 1, it suggested 512 PGs. Will having more placement groups on a half-full cluster increase performance though? I could have sworn there was also something in the literature about using more memory with more PGs. That being said, tonight was backup night and I watched the cluster hit 426MB/s read! That's a record for this cluster and gives me a lot more confidence in its performance.

the autoscaler only becomes active if the better pg_num is off from the current one by a factor of 3. So if the ideal pg_num is 512 and the current one is 256, the autoscaler should not become active, as it is off by a factor of 2, but you can still set it manually.
Woah, that seems like critical information; I feel like it's something you could very easily overlook. Is your guess that I would see an even larger increase in performance if I scaled up to 512 PGs and PGPs?

Thanks Aaron, I think between my old Windows VMs being on the IDE compatibility layer, Ceph having the wrong number of PGs, and some stress on my memory (recently upgraded to 48GB minimum per node, now only using 34-38GB), performance already seems to have improved very recently.

Tmanok
 
Exciting update, the same VM that previously ran Crystal Disk Mark just gave me some really great results!
[Attachment: Screenshot_20210814-084820_Microsoft Remote Desktop.jpg]
I tried with both 1 and 2 tests in a row and saw similar results, but the incredible thing is that after doubling PGs from 128 to 256, I've nearly doubled throughput in every test... This may be a very simple benchmark tool but irl performance when using these servers is also improving. I'll let you know what happens after upgrading to 512 PGs.

Thanks,

Tmanok
 
I tried with both 1 and 2 tests in a row and saw similar results, but the incredible thing is that after doubling PGs from 128 to 256, I've nearly doubled throughput in every test... This may be a very simple benchmark tool but irl performance when using these servers is also improving. I'll let you know what happens after upgrading to 512 PGs.
Great to hear :) You only changed the pg_num?

Regarding the rados benchmarks, have a look at the Ceph benchmark paper as it lists the commands.
For a write benchmark: rados -p <pool> bench <time in seconds> write -b 4M -t 16 --no-cleanup. Only use --no-cleanup if you want to do read benchmarks afterwards: rados -p <pool> bench 600 seq -t 16. Once you are done, clean up the benchmark data with rados -p <pool> cleanup.
 
Great to hear :) You only changed the pg_num?

Regarding the rados benchmarks, have a look at the Ceph benchmark paper as it lists the commands.
For a write benchmark: rados -p <pool> bench <time in seconds> write -b 4M -t 16 --no-cleanup. Only use --no-cleanup if you want to do read benchmarks afterwards: rados -p <pool> bench 600 seq -t 16. Once you are done, clean up the benchmark data with rados -p <pool> cleanup.
Hi Aaron,

No no I changed both pg_num and pgp_num of course. I still need your help answering this question from earlier, however:
If you feel that giving Corosync a 1Gbps network link, or the monitor network a 1Gbps network link instead, is a good idea, then I'm willing to try that, but I have not seen my corosync + ceph-monitor network ever pass more than, say, 200Mbps.
Do you recommend giving Corosync a dedicated 1Gbps link instead of sharing a 10Gbps link with the Ceph monitor network? They're both low-bandwidth and latency-sensitive, so I thought they wouldn't be too bad together.

Additionally, I'm considering a new network topology where two 10Gbps (16-port) switches each take one SFP+ 10Gbps connection from each server, those two links are turned into an XOR/balanced bond, and a Linux bridge sits on top. The bridge would then provide a 20Gbps (2x 10Gbps) fault-tolerant link for Corosync, Ceph Monitor, and Ceph Public (data). Is this sane, or will the latency be too high?

Currently I'm concerned that both my Corosync + Ceph-monitor network and my Ceph-public network are each on their own single link (a high-risk failure domain). Here's a quick example of the current topology (not accurate, only depicts one node and simplifies the rest of the network):
[Attachment: Example Current Networking.png]

And here's the change I'm considering / proposing:
[Attachment: Example of Proposed Network Design.png]
That way I can double my max throughput while both switches are operating, and I can continue operating during a switch or SFP+ module failure. If something were to occur on the link connecting switch1 with switch2, however, that could become an issue.

The other possible solution if I need lower latency, would be to do something like this:
[Attachment: Example Proposed 3.png]
Then that way CEPH would be on a dual 10Gbps LAN and Corosync would be on a dual 1Gbps LAN. I have enough ports for this.

Thanks. By the way, each of my nodes only has 4x 1Gbps and 2x 10Gbps; there is no more PCIe expansion available for more links...

Tmanok
 
No no I changed both pg_num and pgp_num of course.
Good point, and thanks for mentioning it. Definitely needed if you change it via the CLI. If you change it via the GUI (possible since PVE 6.4) it is just the value for the number of PGs. :)
Do you recommend giving Corosync a dedicated 1Gbps link instead of sharing a 10Gbps link with the Ceph monitor network? They're both low-bandwidth and latency-sensitive, so I thought they wouldn't be too bad together.

Yes, Corosync is essential for a working cluster, especially if you use HA. The official recommendation is to have at least one dedicated physical network for it. And even though you have never seen the network your Ceph monitors are on exceed 200Mbit, you never know whether a situation will arise where that is not the case anymore and it uses up quite a bit more.

Regarding the network changes: do go for the dedicated Corosync network on the 1Gbit NICs. You do not need to configure a bond on it as Corosync itself can handle multiple links: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
This means, you can configure two different subnets directly on the NICs themselves for Corosync and configure the Corosync links accordingly to use them.
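As a rough sketch, each node entry in /etc/pve/corosync.conf then carries one address per link, e.g. ring0_addr on the dedicated Corosync network and ring1_addr on a fallback network (names and addresses are placeholders; remember to increase config_version when editing the file):

    node {
      name: pve1
      nodeid: 1
      quorum_votes: 1
      ring0_addr: 10.10.1.11
      ring1_addr: 10.10.2.11
    }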

For the Ceph network, a bond and switches which are stacked will give you the best fault tolerance. If you do not need guests to access the network directly (which I assume for the Ceph network) you do not need to add a vmbr interface on top of the bond. Configure the IPs directly on the bond. If you want to keep separate Ceph cluster and Ceph public networks (IP wise), you could define them in two VLANs.
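A minimal /etc/network/interfaces sketch of that idea, with placeholder NIC names and subnets and assuming LACP-capable switches:

    auto bond0
    iface bond0 inet manual
        bond-slaves enp1s0f0 enp1s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4

    auto bond0.10
    iface bond0.10 inet static
        address 10.10.10.11/24
        # Ceph public network

    auto bond0.20
    iface bond0.20 inet static
        address 10.10.20.11/24
        # Ceph cluster network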
 
If you change it via the GUI (possible since PVE 6.4) it is just the value for the number of PGs. :)
Oh cool, thanks for letting me know, I should really pay more attention to the manual (honestly never saw that anywhere in the release notes, I must be blind).
Yes, Corosync is essential for a working cluster, especially if you use HA.
Ok! I'll give it a dedicated interface and segment.
Regarding the network changes: do go for the dedicated Corosync network on the 1Gbit NICs. You do not need to configure a bond on it as Corosync itself can handle multiple links: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
This means, you can configure two different subnets directly on the NICs themselves for Corosync and configure the Corosync links accordingly to use them.
Indeed! Somehow I never considered corosync would manage multiple links like that automatically. That saves me some time with my network switches.
For the Ceph network, a bond and switches which are stacked will give you the best fault tolerance. If you do not need guests to access the network directly (which I assume for the Ceph network) you do not need to add a vmbr interface on top of the bond. Configure the IPs directly on the bond. If you want to keep separate Ceph cluster and Ceph public networks (IP wise), you could define them in two VLANs.
My thoughts exactly, I need redundancy. I'm also purchasing larger 10Gbps SFP+ switches; currently I'm rocking dual 8-port Mikrotiks. There is no real reason for the vmbr bridge NICs other than "just in case". If I wanted to apply the IPs directly to the bond, how would I apply two subnets to the bond? Do you mean something like:
Bond: 192.168.40.0/24
vlan10: 192.168.40.0/25 (Ceph Public)
vlan20: 192.168.40.128/25 (Ceph Mon)

Thanks Aaron for all of your insight and support, I feel much better now that things are performing more acceptably and there are plans to improve the resilience of the infrastructure even further. Still have not yet had time to run benchmarks or increase to 512 PGs, but I will soon.

Tmanok
 
Exciting update, the same VM that previously ran Crystal Disk Mark just gave me some really great results!
I tried with both 1 and 2 tests in a row and saw similar results, but the incredible thing is that after doubling PGs from 128 to 256, I've nearly doubled throughput in every test... This may be a very simple benchmark tool but irl performance when using these servers is also improving. I'll let you know what happens after upgrading to 512 PGs.

Thanks,

Tmanok
Hi,

just to compare, I tried the same benchmark on our 3-node cluster with Ceph in a simple MS 2019 VM:
[Attachment: bench.JPG]

Proxmox V7 out of the box, 9 OSDs, no modifications (compression active).
 
I will say, with the default settings (calculate your PGs for the correct size), performance should be OK out of the box.
So, YES.
Performance really depends on your existing hardware.
 
