HA LXC and external storage (CEPH?)

proxwolfe

Well-Known Member
Jun 20, 2020
Hi,

here's what I want to achieve - hopefully the esteemed experts around here can point me in the right direction:

I have an LXC running nextcloud. For storage, currently, I have a separate disk mounted on the host and mounted from there into the LXC. That works well.

Recently, I set up a pve cluster to make nextcloud highly available.

But I expect there to be a problem if/when the LXC is transferred to another node and run from there. Of course, it takes its root disk with it. But the separate disk exists only on the original node, so the LXC won't be able to access it.

I could put additional disks in all nodes at the identical mount point and keep them in sync, so that wherever the LXC goes, it finds its data already there. But there has to be a better way.

I was thinking of using CephFS, hoping that it would be available on all "participating" nodes and that the LXC could then automatically use the node-local CephFS.

But then I read that in order to use CephFS in an LXC, it is recommended to mount CephFS on the host and mount it into the LXC from there. Which brings me back to the above solution (probably without the need to keep the node-local CephFS instances in sync manually).
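
For reference, the host-side bind mount I read about would look roughly like this (container ID and paths are just examples):

Code:
# bind-mount a CephFS path that is already mounted on the host into CT 101
pct set 101 --mp0 /mnt/pve/cephfs/nextcloud,mp=/mnt/ncdata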

Is there any other way to have an LXC access the same storage on any node? My use case can't be that singular.

(Of course, I could use NAS storage for nextcloud, but I was hoping to considerably improve performance by keeping the storage local to the cluster.

If nothing else comes up, maybe the way to go would be to have nextcloud use Ceph as an object store. On the other hand, if I need to provide a fixed IP for that, it would again be local to only one of the nodes and involve network traffic from the others.)

Any help appreciated!
 
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_storage_types

you want a shared storage for HA. instead of using cephfs, you'd probably just want to use a regular RBD volume managed by PVE if you go down the Ceph route. but storing the NC data directly in a Ceph pool is of course also an option - that is then outside of PVE's scope of management though, so you need to set up some kind of backup for that data yourself, and so on.
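
As a sketch, a pool plus a matching PVE storage entry can be created roughly like this (pool and storage names are just examples):

Code:
# create a replicated pool and automatically add an 'rbd' storage entry for it
pveceph pool create nc-pool --add_storages

# or add the storage entry manually for an existing pool
pvesm add rbd nc-rbd --pool nc-pool --content images,rootdir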
 
yeah, you can just create the container using a storage of type 'rbd'. you can split out a mountpoint just for the nextcloud files, but you don't have to.
but note that Ceph does require some hardware / resources to give meaningful performance, so maybe play around a bit with it before doing semi-production setups.
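
As a sketch, a separate RBD-backed mount point for the Nextcloud data could be added like this (container ID, storage name and size are placeholders):

Code:
# allocate a 200 GiB volume from the 'nc-rbd' storage and mount it at /mnt/ncdata in CT 101
pct set 101 --mp0 nc-rbd:200,mp=/mnt/ncdata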
 
yeah, you can just create the container using a storage of type 'rbd'.
Oh, you mean to keep the data on the lxc's root disk. That would be a possibility. But I was going to separate the root disk and the data - it feels more disaster-proof / easier to recover to me (maybe I'm wrong).

you can split out a mountpoint just for the nextcloud files, but you don't have to.
But if I do, I can mount it directly into the container (and I don't have to mount on the host first and from there into the lxc, as is recommended for CephFS), right?

That then would seem to address my requirement that the lxc take the data storage with it wherever it goes (and I don't have to prepare each node separately for the lxc to find its data there).

Thanks!
 
cephfs is mainly there for cases where you want many systems accessing a shared file system - that is not the case with your nextcloud container. rbd is just a thin layer in front of the object store that exposes a block device backed by objects - it has less overhead than cephfs but only allows a single client to access an image at a time.
 
Thanks for the additional info - I am learning so much using pve (and pbs) in my lab.

But can we please go briefly back to my question (or assumption):

If I mount the rbd into the lxc, I can do that directly, without going through the host first, right? And when I do, the lxc will be able to access the rbd no matter which node in the cluster it is running on, without additional preparation (all nodes are part of the pve cluster and of the ceph cluster), right?

Thanks!
 
yes, a volume on an RBD storage is available on each node that can access the ceph cluster (all nodes in a standard hyperconverged PVE+Ceph setup), and PVE takes care of activating and mounting it when starting the container.
 
Great! I feel I am getting close to where I need to be with this.

Just brought up the last of the minimum of three OSDs required for Ceph to work. The health indicator turned green.

I also created a Ceph pool that shows up in Storage and on every node. But in each node's storage list (underneath the VMs) the pool is shown with a "?" on the disk symbol - and when I move my mouse over it, a tooltip says "unknown". Is there anything else I need to do with the pool/rbd? Initialize maybe?

And since we are talking about it: What performance can I expect from my rbd, given my setup: Three nodes with one OSD each. The target replica number is three. So I would expect a complete set of data on each of the disks. Does that mean that the information can be accessed on each node with the speed of a local disk drive? Or is the network involved anyway?

And what about writing - is writing happening with the speed of a local disk or only with the speed of the slowest disk on any of the nodes (i.e. a write is only complete when the last replica is written on another node)?

Thanks!
 
I also created a Ceph pool that shows up in Storage and on every node. But in each node's storage list (underneath the VMs) the pool is shown with a "?" on the disk symbol - and when I move my mouse over it, a tooltip says "unknown". Is there anything else I need to do with the pool/rbd? Initialize maybe?
Okay, so I ran
Code:
systemctl restart pvestatd
on each node and the question mark is gone.

What performance can I expect from my rbd, given my setup: Three nodes with one OSD each. The target replica number is three. So I would expect a complete set of data on each of the disks. Does that mean that the information can be accessed on each node with the speed of a local disk drive? Or is the network involved anyway?

And what about writing - is writing happening with the speed of a local disk or only with the speed of the slowest disk on any of the nodes (i.e. a write is only complete when the last replica is written on another node)?
I found
Code:
ceph tell osd.* bench
and it told me that on the nodes/osds the write performance was as follows (bytes_per_sec):
Test run on    node1          node2          node3
OSD0           85,711,706     88,741,965     86,685,077
OSD1           95,690,456     92,092,628     97,718,971
OSD2           104,138,235    103,654,102    101,557,560

The disks in each node are identical (WD Enterprise 2 TB, 3.5", 5,400 rpm). The first two nodes are also identical in hardware; the third node is slightly stronger. All nodes are connected via 1 GbE dedicated to Ceph. No CTs/VMs were running on any of the nodes at the time of testing.

Apparently it does not make much of a difference on which node the data is put into the rbd, but data written out to OSD2 is always written fastest (which is plausible to me, as node3, which houses OSD2, has slightly stronger hardware).

Interestingly, data put into the rbd on node3 is not written to OSD2 as fast as when it is put in on the other nodes. That I did not expect.

Well, I have no idea whether this test makes any sense to start with. It is not intended to be overly comprehensive, just to give me a first idea of how things work. Of course, I could run it several times and average the results. I did run it a couple of times, but the results did not deviate substantially.

For comparison purposes, I ran
Code:
hdparm -tv /dev/sdc
on a node that is identical to node2 and that also houses the same hard drive that, however, is not used as an OSD in the Ceph cluster yet.

It gave me
148.04 MB/sec

So the raw throughput of the local disk is about 50% higher than the write speed I got for the rbd.

And just for fun I also tested the local SSD:
304.75 MB/sec

Given all that, what would be the logical next step to optimise performance? What is my bottleneck?

Would it be better to give Ceph a faster network (10 GbE), should I use SSDs instead of the HDDs, or should I throw in SSDs to support the HDDs as journaling devices? I have a couple of SSDs lying around. I do not need petabytes of storage; 1-2 TB is probably going to be sufficient for a while.

Just for perspective: this is my home lab. I have no enterprise objectives (the enterprise HDDs I just had lying around). I do want to replace my cloud drive (OneDrive) in the long run. My hardware is low-end SOHO server stuff, and my DSL uplink is 40 Mbit, so there is no point in maxing out everything locally when the ultimate bottleneck is my uplink (I might invest in a second line, expecting to roughly double the uplink to 80 Mbit). I do like to play around, however, and learn a thing or two doing that.

Any comments welcome!
 
1. you do not have to use an RBD device - you can go with CephFS
2. ceph performance grows with the number of disks
3. when one disk can do 300 MB/s, then 1 Gbit/s is of course a bottleneck
4. SSD is always better; you can also use it as a cache tier

question: which HDDs are you using?
 
1. you do not have to use an RBD device - you can go with CephFS
Understood. But based on Fabian's answer above, RBD is the simpler and (I am assuming, therefore, less resource-intensive) approach. What would be the benefit of using CephFS instead of just RBD?
2. ceph performance grows with the number of disks
Okay, so I could improve speed but that would mean I need to add in disks that I don't really need in terms of storage (I already have more than I need now).
3. when one disk can do 300 MB/s, then 1 Gbit/s is of course a bottleneck
Yes, but that is a local ssd that is not part of Ceph (yet) and that I tested just to get an impression of what performance I have available.
4. SSD is always better; you can also use it as a cache tier
And what would be the best approach - replace the HDDs with SSDs, or add the SSDs as journals (is that the same as the cache tier you mentioned, or is that yet another way to use them in Ceph)?
question: which HDDs are you using?
My HDDs are WD Enterprise 2 TB; the SSD is some random M.2 SATA disk (Samsung PM781, I believe). As I said, the gear is low cost... but my goals are moderate ;)
 
2. how many HDDs and how many SSDs do you have?
I have another 5 WD Red 3 TB and 5 HGST 6 TB drives, plus a couple of inexpensive 250 GB and 500 GB SSDs. Each node has room for 4 (if necessary, 5) drives. One slot in each node is, however, already occupied by the boot drive and one by the current Ceph pool drive.
Thanks for the pointer. It seems I will need to do some thinking about how I access my data and whether this usage lends itself to caching.

Is it possible to set this up in the pve web gui, or does this only work on the command line?
 
What would be the logical next step to optimise performance? What is my bottleneck?

Would it be better to give Ceph a faster network (10 GbE), should I use SSDs instead of the HDDs, or should I throw in SSDs to support the HDDs as journaling devices?
Reading through the docs, I found a quick fix (I'm hoping): each node has another free 1 GbE NIC, so I could bond those with the ones already dedicated to Ceph and, hopefully, double the network throughput.

I will try to do that and then run my test again to see if the write speed has increased (ideally to the level of the local disk access I measured with the one drive not in the Ceph pool).
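
The bond would be sketched in /etc/network/interfaces roughly like this (interface names, address and bond mode are placeholders and have to match the switch configuration):

Code:
auto bond0
iface bond0 inet static
        address 10.10.10.11/24
        bond-slaves eno2 eno3
        bond-miimon 100
        bond-mode balance-rr
# activate with ifupdown2: ifreload -a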
 
that all sounds great. You have enough HDDs and SSDs. Maybe you can optimize step by step and decide whether you want to go ahead.
Journaling on the SSDs is great - do it! If you have enough SSDs, try the cache pool.
It will be great. You will find only basic functions in the web console.
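
Roughly, on the command line, re-creating an OSD with a separate DB/WAL (journal) device looks like this (device paths and the OSD id are placeholders):

Code:
# take the HDD-only OSD out, wait for rebalancing, then destroy it
pveceph osd destroy 0 --cleanup

# re-create the OSD with its RocksDB/WAL on the SSD
pveceph osd create /dev/sdb --db_dev /dev/sdc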
 
an HDD-only ceph pool will never be very fast (throughput might not be bad if it's big enough and the workload matches). you basically have additional network latency and bottlenecks on top of those of the disks, and for writes you need to wait for replication to finish before the write is "done".

see our ceph benchmark for what is possible in small hyperconverged clusters if you throw some hardware at it, and also for some more suggestions on how to check your cluster's performance https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2020-09-hyper-converged-with-nvme.76516/
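
One way to measure what the pool itself delivers (rather than single OSDs) is rados bench, for example (pool name is a placeholder):

Code:
# 60-second write test, keeping the objects for a follow-up read test
rados bench -p nc-pool 60 write --no-cleanup
# sequential read test, then remove the benchmark objects
rados bench -p nc-pool 60 seq
rados -p nc-pool cleanup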
 
an HDD-only ceph pool will never be very fast (throughput might not be bad if it's big enough and the workload matches). you basically have additional network latency and bottlenecks on top of those of the disks, and for writes you need to wait for replication to finish before the write is "done".
Understood, thanks!

So would you agree that the following two measures will improve performance:
- recreate the HDD-only OSDs, adding an SSD for each HDD?
- add another 1 GbE NIC on each node for the Ceph network?

Thanks!
 
that depends on whether those SSDs are suitable for journal usage (high IOPS and endurance, power-loss protection, ...) - else you'll tear through them quite fast.
 
that depends on whether those SSDs are suitable for journal usage (high IOPS and endurance, power-loss protection, ...) - else you'll tear through them quite fast.
Naively, I would expect (any) SSD to have higher IOPS than my HDDs.

But endurance is an important consideration. The SSDs I have available are low-end stuff and probably not very durable, so the speed benefit might become expensive quickly (although I am not expecting to write huge amounts of data).

That then leaves me with improving network throughput. Does bonding make sense (or are there caveats I should be aware of as well)?

Thanks!
 
