3 node ceph build

nes

New Member
Mar 18, 2017
Hello,

I am planning to build a 3-node Proxmox Ceph cluster
and am looking for advice on whether my choices are good enough for a (very) small corporate cluster.

Basically it would run a Windows AD server and a Linux Postgres LXC (both serving ~50 clients),
another LXC with Nextcloud (low traffic),
and some additional Windows 7 and XP VMs (rarely used, but they have to stay accessible).
The main goal is resiliency against the failure of one of the servers,
e.g. a server fails and its VMs are restarted on another host.
Performance only has to be "good enough" for the 50 clients,
which mostly use one program that accesses the Postgres DB.


The hosts will have:
CPU: i5-3450 (one host has an i5-2500)
RAM: 32 GB ECC (the max these CPUs support)
We probably need 24 GB of RAM for all the VMs we want to run, so at most 8 GB is left for Proxmox -
so probably no ZFS, to save some RAM?

HDD: WD Re 2 TB or something similar (WD Gold?)
No RAID necessary, as Ceph should handle failures afaiu (not completely sure).
If we did need RAID, we could only afford slower/cheaper disks.

SSD for the journal:
Samsung 850 - not sure if I should buy a Pro 256 GB or an Evo 512 GB;
they are almost the same price, and I could overprovision the Evo and still have more space.

Networking: mesh network with Intel X540-T2 or (quite a bit cheaper) the D-Link DXE-820T.
The D-Link seems to have good specs, but I can't find reviews as it is quite new (I think).

Would this be "good enough" for our purposes?
Should I keep the DB files off Ceph, put them on a separate SSD instead, and do master/slave replication?

Any other improvements?
 
RAM: 32 GB ECC (the max these CPUs support)
We probably need 24 GB of RAM for all the VMs we want to run, so at most 8 GB is left for Proxmox.
Hi,
32 GB of RAM looks too small to me...
SSD for the journal:
Samsung 850 - not sure if I should buy a Pro 256 GB or an Evo 512 GB;
they are almost the same price, and I could overprovision the Evo and still have more space.
Do not take these disks - you will get poor performance and they will die fast.
Use a DC SSD for the journal, like the Intel DC S3610.
Networking: mesh network with Intel X540-T2 or (quite a bit cheaper) the D-Link DXE-820T.
The D-Link seems to have good specs, but I can't find reviews as it is quite new (I think).
I don't know the twisted-pair 10G NICs, but I have had good experiences with Intel 10G NICs.
Would this be "good enough" for our purposes?
Should I keep the DB files off Ceph, put them on a separate SSD instead, and do master/slave replication?
You will get much better DB performance with local SSDs.

Udo
 
Hi Udo,

Thanks for the response.

32 GB - sadly there is no way to expand this
(except by buying new servers).
I could run the VMs spread out across the 3 hosts; only if one fails would it get cramped on one of the remaining hosts.
Is it possible to define a failover with less memory for the VM than it had on the original host?

SSD: OK, how about the Samsung SSD SM863 240 GB - this also has a TBW of >1.5 PB
(and a better write rate and IOPS than the one from Intel).
Maybe the sustained write rate is not as good,
but I don't think we have continuous writes the whole time -
the main writing will be from Postgres, and even being generous
I think we don't have more than 20 inserts/sec, and even that only for short periods.
OTOH I don't really know how Ceph handles the redistribution of the writes...

Using local SSDs for Postgres -
so I would have 3 LXCs - 1 master / 2 slaves - can they use a partition on the journal SSDs?


Nenad
 
SSD: OK, how about the Samsung SSD SM863 240 GB - this also has a TBW of >1.5 PB
Just to reiterate on the SSDs and data-written values: have a look at this post where someone ran the numbers:
SSD €/TBW

Maybe the sustained write rate is not as good,
but I don't think we have continuous writes the whole time -
the main writing will be from Postgres, and even being generous
I think we don't have more than 20 inserts/sec, and even that only for short periods.
OTOH I don't really know how Ceph handles the redistribution of the writes...
Please have a look at how journals work in Ceph:
https://www.sebastien-han.fr/blog/2014/02/17/ceph-io-patterns-the-bad/

And also keep this in mind:
[Ceph write-path diagram from the Ceph documentation]

Say you have a replication factor of 3. For a replication-3 pool there will be 6 writes for any client write: 3 to the journals and then 3 to the OSDs.

On a correctly set up 3-node cluster, the 2 additional replica writes should each land on a separate host's journal and then on that host's assigned OSD.
The write rate of your journal will determine where your write IO ceiling is.
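To put very rough numbers on that (a minimal sketch with made-up speeds, not measurements from your hardware):

Code:
# Rough filestore write-ceiling estimate (illustrative numbers only).
replication = 3            # pool size: every client write lands on 3 hosts
journal_mb_s = 450         # assumed sequential write speed of one journal SSD
osd_hdd_mb_s = 150         # assumed sequential write speed of one OSD HDD

# With one journal SSD and one HDD OSD per host, each replica host writes the
# data twice: once to its journal, once to its OSD. The slower of the two
# roughly caps the client-visible write rate.
ceiling_mb_s = min(journal_mb_s, osd_hdd_mb_s)
total_written_per_mb = 2 * replication   # the "6 writes per client write" above

print(f"client write ceiling ~ {ceiling_mb_s} MB/s")
print(f"data written cluster-wide per 1 MB from a client ~ {total_written_per_mb} MB")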


Networking: mesh network with Intel X540-T2 or (quite a bit cheaper) the D-Link DXE-820T.
The D-Link seems to have good specs, but I can't find reviews as it is quite new (I think).

I'd always go for a switch option. That way you will be able to expand your cluster later on (it gets expensive fast when you have a mesh network reliant on dual-NIC 10G cards ... and want to add a 4th node).
Have you thought about how and whether you want to separate the Ceph public (client) and cluster networking?
for reference see:
http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/

You will get much better DB performance with local SSDs.
I can only echo this.
You probably want to keep this in mind as well:
https://pve.proxmox.com/wiki/Storage

There is always the option to create a Ceph pool based on SSD-only OSDs (via CRUSH hook scripts) that act as their own journal device.
 
Hi,

€/TBW
OK, thanks.
TBW is usually not a hard limit on SSDs (as far as I gathered from different tests),
but of course it feels safer to have some guaranteed numbers.

So I will need to spend quite a bit more on the SSDs ...


Switch - well, that gets even more expensive.

The lowest-cost options I have found would be
a TP-Link T1700G-28TQ switch (4x SFP+ and 24x 1G) for 400€,
or the
Ubiquiti EdgeSwitch ES-16-XG (16-port 10G) for 600€ -
a lot more 10G ports, but it had some bad reviews in the past...

But maybe I can now go a lot cheaper on the network cards?
There are Mellanox ConnectX 10 Gigabit kits with 2 cards and 1 cable on eBay for 75€
(from Germany or the UK, so the quality should probably be OK).
The price difference to the Intel cards is quite dramatic (!?)
Are these viable cards?


Ceph replication -
in the blog post he also talks about fragmentation being an unsolved problem (in 2014) -
is this still true?

There is always the option to create a Ceph pool based on SSD-only OSDs (via CRUSH hook scripts) that act as their own journal device.

I have to read up on the whole CRUSH subject, but afaiu:
the Postgres DB probably needs no more than 5-10 GB - so even if I triple this to be safe and add the journal,
it would be somewhere around 20-40 GB for the DB Ceph pool, correct?
So there would be plenty of space left for the journal of the main pool.



<good info about the replication - thanks>

 
So my revised plan is:

3 hosts with
CPU: i5-3450 (one host has an i5-2500)
RAM: 32 GB

HDD: WD Re 2 TB or something similar (WD Gold?)
No ZFS, to save some RAM (as Ceph should be good enough).

SSD:
Intel SSD DC S3700 200 GB (3 PB TBW)
or
Intel SSD DC S3610 200 GB (1.1 PB TBW)


Still unsure about the network
(I need to save some money for the SSDs):
the Ubiquiti EdgeSwitch ES-16-XG (16-port 10G)
with really cheap Mellanox ConnectX 10 Gigabit cards with SFP+,

or
doing a mesh with 6 Mellanox cards (if the servers can take that many cards) -
probably the cheapest, *if* they work reliably,

or
a mesh with 3 x 2-port Mellanox cards - not so cheap anymore,
and I would need to buy 3 extra SFP+ cables, which are also not cheap,

or
doing a mesh with the Intel X550-T2 cards
(the most expensive option).


Any comments?
 
My 2 cents: stick with the S3700 or S3710 drives. I prefer 10GbE vs. InfiniBand as it seems more standard; then I can easily bring more 10GbE to the LAN (vs. storage) side of the network. (I know IB has lower latency.) If you don't mind used gear to save some money on the 10GbE switch, check out the Quanta LB6M or LB8 switches on eBay.
 
There is always the option to create a Ceph pool based on SSD-only OSDs (via CRUSH hook scripts) that act as their own journal device.
I have to read up on the whole CRUSH subject, but afaiu:
the Postgres DB probably needs no more than 5-10 GB - so even if I triple this to be safe and add the journal,
it would be somewhere around 20-40 GB for the DB Ceph pool, correct?
So there would be plenty of space left for the journal of the main pool.

If your Postgres DB uses 10 GB of space and you use 3 nodes with a single dedicated SSD each for the task, with replication set to 3, then you are looking at 30 GB of data residing on those SSDs in total (10 GB each). In reality you probably want to size this pool larger - roughly twice that, plus a 20% buffer.
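A minimal sizing sketch of that rule of thumb (the 10 GB DB size, growth factor, and buffer are just the assumptions from above):

Code:
# Sizing sketch for a dedicated SSD-only Ceph pool (assumed numbers from the thread).
db_size_gb = 10        # current Postgres DB size
replication = 3        # pool size / number of replicas
growth_factor = 2      # room for the DB to roughly double
buffer = 1.2           # +20% safety margin

per_osd_gb = db_size_gb * growth_factor * buffer   # space needed on each SSD OSD
raw_total_gb = per_osd_gb * replication             # raw space across the cluster

print(f"per SSD OSD: ~{per_osd_gb:.0f} GB, raw total: ~{raw_total_gb:.0f} GB")
# -> per SSD OSD: ~24 GB, raw total: ~72 GB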

Just to reiterate: Udo was talking about dedicated, locally attached SSDs; I was talking about dedicated SSDs for a dedicated SSD-only Ceph pool.

I'd personally check what TBW/€ my choice of SSD would give me, then put that into context with the features of said SSD, and then make my decision based on my financial pain points.

We do use some SM863 SSDs at work (about 100 of them) and they have been good so far, but we use the 2 TB models (rated at 12.3 PBW).

PS: You probably want to read up on SSD write amplification while you are at it, to get an idea why people advise going for high-TBW, high-controller-quality SSDs when using them in journal/caching/high-write scenarios.

Here are a bunch of topics on this Forum:
https://forum.proxmox.com/threads/high-ssd-wear-after-a-few-days.24840/
https://forum.proxmox.com/threads/ssd-endurance-evaluation.32559/
https://forum.proxmox.com/threads/ssd-setup-with-huge-write-loads.31723/
There have been a lot more; I just took the first 3 Google hits :p

And a wiki link that IMHO explains the "basics" pretty well:
https://en.wikipedia.org/wiki/Write_amplification

Still unsure about the network
(I need to save some money for the SSDs):
the Ubiquiti EdgeSwitch ES-16-XG (16-port 10G)
with really cheap Mellanox ConnectX 10 Gigabit cards with SFP+,

or
doing a mesh with 6 Mellanox cards (if the servers can take that many cards) -
probably the cheapest, *if* they work reliably,

or
a mesh with 3 x 2-port Mellanox cards - not so cheap anymore,
and I would need to buy 3 extra SFP+ cables, which are also not cheap,

or
doing a mesh with the Intel X550-T2 cards
(the most expensive option).


I think I have mentioned this before; just in case, let's mention it again:

  • If you will never expand your 3-node cluster, go with a 10G mesh (or InfiniBand).
  • If you may want to expand your 3-node cluster, then go with a switch.

  • Also, think about how you want to assure QoS for your nodes' networks, especially the Proxmox cluster network (most important), Ceph public and Ceph cluster. In some situations it makes sense to buy 2x 10G switches (poor man's QoS) instead of 1x 10G switch that is capable of proper QoS. This becomes especially important as the number of OSDs on your nodes grows.
 
SSD:

OK, I will then buy the S3700 200 GB (3 PB TBW) -
it costs more money than I really have, but better that than a broken installation ...
(And if I leave 10-20% empty, the controller can use this to extend the TBW even more, right?)

I read up on the threads about SSDs failing with PVE
(thanks for the links),
and found a really strange one about "Proxmox 4.x is killing my SSDs"
(even though they were not used for the journal and, even more strangely, this did not happen on PVE 3),
and of course I know a little bit about write amplification
(and AFAIK no vendor will tell you how much their SSD really has),
but from the threads it doesn't sound like there is a consensus on what the real reason is
or what a good avoidance strategy would be
(e.g. "don't write small chunks" - how/where could I configure this? And why is this not the default?) -

only that using a high-TBW SSD is better
(as you can burn through a lot more TB before it becomes a problem),
but nobody can really be sure that their installation will be spared the high-write-amount problems, or?



Network:


As I now need money for the SSDs, I was looking on eBay for cheaper alternatives to the Intel X540s -

and ...
I saw some used 40 GbE (!) Mellanox cards which seem to be quite affordable (half the price of the Intel cards).
If I only use a mesh, then I also only need 3 x 40G SFP+ cables, which can also be found for reasonable prices...
(a 40 GbE switch is beyond my means)

Would this be good for the cluster, or are 10G cards enough
(because the rest of the system is not fast enough)?

InfiniBand vs. Ethernet:
IB should have 50% lower latencies and use much less energy (according to some IB site),
so that would be better, or?

But how complicated is it to get this up and running?
I saw some older threads and blog posts which made it look quite involved -
is this still true with PVE 4.4?


QoS:

Maybe I don't understand the question (or I am naive regarding network hardware - I am mostly a software guy),
but my plan (for now) is:
the Proxmox cluster (3 nodes) will be connected by the mesh;
the outside world can be connected via the 2x 1G ports which every server has.

We have a big switch (actually 2, connected) to which the rest of the clients are connected.
If the switch dies, all the clients are cut off, so there is nothing to gain from a second switch for the servers.
(And I cannot connect every client PC to two switches - or can I?)

There is probably some error in my logic, so please tell me what I can improve.




Thanks for all the answers so far!
 
Network:
Would this be good for the cluster, or are 10G cards enough?

InfiniBand vs. Ethernet:
IB should have 50% lower latencies and use much less energy (according to some IB site),
so that would be better, or?

QoS:

We have a big switch (actually 2, connected) to which the rest of the clients are connected.
If the switch dies, all the clients are cut off, so there is nothing to gain from a second switch for the servers.
(And I cannot connect every client PC to two switches - or can I?)

Network: if you can buy 40G cards for cross-connect, do it. Or look for the X520-DA2 (dual 10G SFP+).
IB vs. Ethernet: stick with Ethernet (less complicated, standard for everything), but it's true, used InfiniBand switches are cheap.
QoS: You have 2 options:
1] 2 switches in a stack - the stack looks like a "single" switch to the server, so multiple connections with LACP (redundancy) are no problem, but stackable switches cost more (and the stack can fail too).
2] 2 standalone switches - with Spanning Tree Protocol you can have connections from the server to different switches, and the interfaces will be in an active/passive role.
 
IB: https://pve.proxmox.com/wiki/Infiniband
Doesn't sound too complicated at first sight...
But the 40 GbE cards I wanted are gone - I will have to see if I want to import them from the UK or US (?).

QoS:
So if I have 2 switches and one fails, the clients connected to the second switch will still be able to talk to the server
(but not the ones connected to the failed one).
As I said, I think all our clients (approx. 50) are connected to one big switch -
but I will have to check what our network guy has really done.
 
I read up on the threads about SSDs failing with PVE
(thanks for the links),
and found a really strange one about "Proxmox 4.x is killing my SSDs"
(even though they were not used for the journal and, even more strangely, this did not happen on PVE 3),
and of course I know a little bit about write amplification
(and AFAIK no vendor will tell you how much their SSD really has),
but from the threads it doesn't sound like there is a consensus on what the real reason is
or what a good avoidance strategy would be
(e.g. "don't write small chunks" - how/where could I configure this? And why is this not the default?) -

only that using a high-TBW SSD is better
(as you can burn through a lot more TB before it becomes a problem),
but nobody can really be sure that their installation will be spared the high-write-amount problems, or?

It all boils down to this, in essence:
  • The only way to be sure is to constantly measure how much data is written (read: "run software that graphs the SMART values of all your flash devices").
  • Write amplification exists. Its effects are worse in low-end SSDs because the mitigation tools are either not there or terrible (this is the controller chip), there is less spare NAND, or the NAND is of inferior quality.
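As a rough illustration of the first bullet, something along these lines could be run periodically to log how much a drive has written (the SMART attribute name differs per vendor, so treat this as a sketch rather than a drop-in script):

Code:
# Sketch: log an SSD's "total written" SMART value so you can graph wear over time.
# The attribute name varies by vendor (e.g. Total_LBAs_Written vs. Host_Writes_32MiB),
# so adjust ATTRIBUTE for your drives.
import subprocess, time

DEVICE = "/dev/sda"                # the SSD to watch (assumed)
ATTRIBUTE = "Total_LBAs_Written"   # vendor-specific attribute (assumed)

out = subprocess.check_output(["smartctl", "-A", DEVICE], text=True)
for line in out.splitlines():
    if ATTRIBUTE in line:
        raw_value = line.split()[-1]   # last column is the raw value
        print(time.strftime("%Y-%m-%d %H:%M"), DEVICE, ATTRIBUTE, raw_value)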

In my experience, for both professional and personal usage as a Proxmox root disk or a caching device, you use a HIGH-TBW SSD, unless all of the following are true:
  • You have spares on hand.
  • You have a truly redundant setup (where it counts - RAID on the SSDs).
  • You graph your storage's SMART values (so you know when to replace them based on TBW values [rating minus 20%]).
  • You need a lot of storage devices now (for performance reasons) but do not have the cash for high-TBW devices now, and you are aware that the longer you run this setup, the more money you are burning (read up on TBW/€ for different models).
  • You can hot-swap.
Or you just aren't bothered by nodes failing for short periods of time and possibly needing a complete reinstall :p





SSD:

OK, I will then buy the S3700 200 GB (3 PB TBW) -
it costs more money than I really have, but better that than a broken installation ...
(And if I leave 10-20% empty, the controller can use this to extend the TBW even more, right?)
[...]
As I now need money for the SSDs, I was looking on eBay for cheaper alternatives to the Intel X540s -


The less stuff you write on these disks, the more spare blocks you end up having. You will not extend the TBW; rather, you make the disk last longer (by not writing as much stuff on it).
Just to reiterate and dumb it down some: the problem is not the files that you write once and then read 100k times.
It is the files that you write-delete-write-delete-wri.....

The best advice I can give you, when money is an issue and 3.8 PB is going to be overkill:
if you already have a test setup running somewhere, you could benchmark your TBW/week needs. Then double it (this is you doubling your VMs as you notice how awesome your setup is), then double it again (this is your buffer), and multiply it by 52 weeks and 5 years.
Then find an SSD that has at least that TBW rating and fits your usage scenario (high on write IO, high on read IO, or a happy compromise somewhere down the middle).
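A minimal sketch of that budgeting rule (the measured TB/week is a made-up example value):

Code:
# TBW budgeting sketch following the rule of thumb above (example numbers only).
measured_tb_per_week = 0.5   # assumed: what your test setup writes per week
vm_growth_factor = 2         # you double your VMs over time
safety_factor = 2            # extra buffer
weeks, years = 52, 5

required_tbw = measured_tb_per_week * vm_growth_factor * safety_factor * weeks * years
print(f"look for an SSD rated for at least ~{required_tbw:.0f} TBW")
# -> with these example numbers: ~520 TBW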


InfiniBand vs. Ethernet:
IB should have 50% lower latencies and use much less energy (according to some IB site),
so that would be better, or?

But how complicated is it to get this up and running?
I saw some older threads and blog posts which made it look quite involved -
is this still true with PVE 4.4?

If you understand networking principles and the technology, it's about the same. I had an IB-capable NAS/Proxmox cluster at home for testing; I threw it out because I got cheap used dual 40G cards and switches from work.

AFAIK @udo is one of the 3 other people I remember running IB at one point on Proxmox. Maybe we should page him :p


I saw some used 40 GbE (!) Mellanox cards which seem to be quite affordable (half the price of the Intel cards).
If I only use a mesh, then I also only need 3 x 40G SFP+ cables, which can also be found for reasonable prices...
(a 40 GbE switch is beyond my means)

Would this be good for the cluster, or are 10G cards enough
(because the rest of the system is not fast enough)?

If a 40G mesh + cables is cheaper than dual-10G NICs and 2x 10G stacked switches, then yes, go 40G.
Especially when you know that you can break a 40G port out into 4x 10G cables (it is called a breakout cable - you should Google that) and find out you can use those with 10G NICs (as a mesh) and 10G switches ...

With cables... make sure you get compatible ones. If you come from 1G networking, this can be daunting...

My opinion: better to have more bandwidth than needed than less, especially when it is cheaper.



QoS: You have 2 options:
1] 2 switches in a stack - the stack looks like a "single" switch to the server, so multiple connections with LACP (redundancy) are no problem, but stackable switches cost more (and the stack can fail too).
2] 2 standalone switches - with Spanning Tree Protocol you can have connections from the server to different switches, and the interfaces will be in an active/passive role.
QoS:
So if I have 2 switches and one fails, the clients connected to the second switch will still be able to talk to the server
(but not the ones connected to the failed one).
As I said, I think all our clients (approx. 50) are connected to one big switch -
but I will have to check what our network guy has really done.

What you guys are talking about is redundancy.
PS: when using LACP, use balance-tcp mode and Open vSwitch on Proxmox. See below for why.


What I am talking about is ensuring QoS (Quality of Service), as in prioritizing the flow of data from a source to its destination based on specific characteristics (i.e. subnet and/or VLAN).


Let's break this down (dumb it down):
In an ideal setup, you have at least 4 "networks" (you can number/assign them as you like, btw, and even put them on VLANs):

Proxmox Public (10.1.X.Y/16) - your clients connect here.
Proxmox Cluster (10.2.X.Y/16) - your cluster talks here.
Ceph Client (10.3.X.Y/16) - your Proxmox servers talk to the Ceph MONs/MDS/OSDs here.
Ceph Cluster (10.4.X.Y/16) - your Ceph OSDs replicate on this network.

Link speeds of NICs:
1 Gbit/s = 125 MB/s (or the speed of a normal HDD)
10 Gbit/s = 1250 MB/s (or the speed of 2 SSDs)
40 Gbit/s = 5000 MB/s (you get the idea)
(overhead neglected)


So why is this important?
  • Imagine... you have 2x 1G and 1x 10G available.
    • You assign 2x 1G to Proxmox Public.
    • You assign Proxmox Cluster + Ceph Client + Ceph Cluster to the 10G NIC.
The second your OSDs start replicating, your Proxmox cluster (e.g. Corosync) becomes upset, throws a tantrum, and desyncs your cluster or a node (also sometimes referred to as cluster/node flapping).

So what most people do is called "poor man's QoS".

For that you need at least 4 NICs in every single node:
1 xG NIC for Proxmox Cluster and a separate switch to connect these links,
1 xG NIC for Proxmox Public and a separate switch to connect these links,
1 xG NIC for Ceph Cluster and a separate switch to connect these links,
1 xG NIC for Ceph Client and a separate switch to connect these links.

But this can get expensive really fast, especially at link speeds above 1G, because switches are expensive at that point. And it is also a terribly inefficient use of the total link capacity.

Then there are people that do 10G for Ceph and just limit Ceph OSD replication speeds. While it works, it can have a large performance impact and also be very time-consuming when, e.g., Ceph needs to do a rebalance after a failed disk.

Then there are people that make sure they have switches that can do QoS (as in prioritize flows based on criteria). And then there are even people that operate an SDN (software-defined network).

What these people do is LACP all links (where possible) and then just make sure that Proxmox cluster network flows have the highest priority, Ceph cluster network flows have the lowest priority, and Proxmox public + Ceph client share the happy medium.

Your network flows are not congested anymore, and when Ceph needs to rebalance, it gets as much bandwidth as your network can spare (and the rest does not crawl to a halt).



Now why did I mention Open vSwitch and balance-tcp above?
For Open vSwitch it is easy: it uses fewer CPU cycles to do the same amount of work compared to a native Linux bridge.

For balance-tcp you need to know how Ceph networking works:
http://docs.ceph.com/docs/master/_images/ditaa-2452ee22ef7d825a489a08e0b935453f2b06b0e6.png
Every Ceph node on your network assigns each of its OSDs and MONs a unique port, so they are reachable under different <IP:port> combinations.

What balance-tcp does is load-balance flows across all available network links based on source IP and source port AND destination IP and destination port. For a 4x 10G bond that would mean a total bandwidth capacity of 40G and a maximum bandwidth allotment of 10G per flow.
An active-passive bond, on the other hand, basically means that you have a master and a standby link; the standby link only gets used when the master is down. So in the same 4x 10G example you are looking at 10G of capacity and a 10G maximum bandwidth allotment per flow.
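A simplified illustration of how a per-flow hash spreads Ceph's many <IP:port> connections over a bond (a toy model of the idea, not OVS's actual hashing):

Code:
# Toy model of per-flow link selection in a 4-link balance-tcp bond.
links = ["eth0", "eth1", "eth2", "eth3"]

def pick_link(src_ip, src_port, dst_ip, dst_port):
    # Each flow is pinned to one link; because every OSD/MON listens on its
    # own port, different flows end up spread across all links.
    return links[hash((src_ip, src_port, dst_ip, dst_port)) % len(links)]

# Three connections from one host to different OSD ports on the same peer:
for port in (6800, 6801, 6802):
    print(port, "->", pick_link("10.3.0.1", 51000, "10.3.0.2", port))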





Hope that answers "SOME" of the questions you did not know you had :p
 
AFAIK @udo is one of the 3 other people I remember running IB at one point on Proxmox. Maybe we should page him :p
Hi,
I had two PVE clusters where two nodes were directly connected with IB (Mellanox - afaik 20 Gbit/s) for DRBD sync.
Sometimes I had trouble with split-brain conditions, and a drbd-verify also didn't work.
After I switched to 10G Ethernet, no further split-brains occurred!

Perhaps it was an issue which could have been avoided with tuning... but I had the same thing earlier (on another cluster) with Dolphin cards... IMHO the 10G drivers are better tested in the kernel - many more people use them.

Udo
 
Hi Q-wulf,
hi Udo,
thanks for the replies.
I will have more questions tomorrow :)
 
Hi,
I had to check some other things first ...

SSD TBW etc. - yeah, I am trying to get some (lightly) used 400 GB S3700s, so even more TBW to burn :)

I plan to use one disk for the Proxmox root disk *and* for the Ceph OSD - if possible (and not a dumb idea).

If not possible (or dumb), I will buy a small extra disk for the root disk.
(Afaik the root disk is not doing very much, or is it? -
mainly writing lots of log files, which would burn out a lower-TBW SSD.)


QoS and redundancy:

How about this -
1G - Proxmox Public
1G - Proxmox Cluster
40G - Ceph Client and Cluster

AFAIU this should prevent the node flapping, or?
Big question:
what happens when the Ceph network fails but the Proxmox cluster network stays alive?
And vice versa?

I can only afford to build a 40G mesh network (switches are crazy expensive);
the Proxmox cluster network will probably also be a mesh (for now).

That should be doable and eliminate possible problems with the switch...
(Remember that I will only have 3 Proxmox servers,
with 2-3 critical VMs,
of which only one will generate most of the traffic
-> 50 (mostly) data-entry clients accessing a Postgres DB.)


@Q-wulf: thanks for your (very interesting) info about Open vSwitch and balance-tcp.

AFAIU this is for the use case where I would LACP all links through a switch, or?

For this I would need one of the really expensive switches which can do 40G -
using a breakout cable would allow me to connect to a (still expensive) 10G switch,
but I don't see how that would be better (for a 3-node setup) than a 40G mesh...

As you can guess, I'm still a little bit confused about the hardware connection/cabling options...
 
I plan to use one disk for the Proxmox root disk *and* for the Ceph OSD - if possible (and not a dumb idea).

If not possible (or dumb), I will buy a small extra disk for the root disk.
(Afaik the root disk is not doing very much, or is it? -
mainly writing lots of log files, which would burn out a lower-TBW SSD.)

Using an SSD for OS and OSD is not stupid in and of itself.
It just makes things more complex, to the point where it becomes nonsensical.
Best practice for OSDs is to use storage of the same size and performance characteristics.

If you were to use different sizes and write speeds, you'd need to manually adjust the weight of each OSD to ensure that PGs are spread according to the performance characteristics of the storage. But then at some point you end up having some OSDs full while others aren't, and then you realize this was just a terrible idea in the first place, and your "fix" to make use of that little bit of spare SSD space cost you dearly in other places.

You can read up on this here:
http://docs.ceph.com/docs/master/rados/operations/data-placement/
http://docs.ceph.com/docs/master/rados/operations/placement-groups/
http://docs.ceph.com/docs/master/architecture/#mapping-pgs-to-osds
and here (search for "weight")
http://docs.ceph.com/docs/master/rados/operations/crush-map/


@Q-wulf: thanks for your (very interesting) info about Open vSwitch and balance-tcp.

AFAIU this is for the use case where I would LACP all links through a switch, or?

For this I would need one of the really expensive switches which can do 40G -
using a breakout cable would allow me to connect to a (still expensive) 10G switch,
but I don't see how that would be better (for a 3-node setup) than a 40G mesh...

As you can guess, I'm still a little bit confused about the hardware connection/cabling options...


Open vSwitch is always a good idea to use instead of Linux bridging, because it uses less CPU to do the same task.
(Disclaimer: I have been told here that OVS per-port rate limiting does not work. For a mesh setup like yours (1G+1G+40G) you probably will not need it, unless you want to limit how much bandwidth one VM can use on your Proxmox Public link compared to another VM on the same host.)

Balance-tcp is only interesting when used with LACP/bonding.
For bonding, you need a switch that does LACP (10G or 40G).

The 40G to 4x 10G breakout cables allow you to
a) connect to 10G switch(es) - there is a use case for both options, or
b) directly connect a 40G NIC to 4x 10G NICs (or 2x dual-port 10G NICs).

It is not better than a mesh, UNLESS you are moving past 3 nodes, at which point a mesh almost always becomes more expensive than using a switch.
For an n-node mesh you need (n-1) NICs per host (n being the number of hosts), or n*(n-1) NICs in total.

3 nodes = 6 NICs
4 nodes = 12 NICs
5 nodes = 20 NICs
6 nodes = 30 NICs

you get the idea.
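A tiny sketch of that scaling (cable count added for completeness - it is simply half the NIC count, since each cable connects two NICs):

Code:
# Full-mesh scaling: every node needs a direct link to every other node.
def mesh_nics(n):
    return n * (n - 1)        # (n - 1) NIC ports per host

def mesh_cables(n):
    return n * (n - 1) // 2   # one cable per pair of hosts

for n in range(3, 7):
    print(f"{n} nodes: {mesh_nics(n)} NICs, {mesh_cables(n)} cables")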

I just thought I'd point out that there is such a thing as breakout cables for 40G NICs, just in case you might find a cheap 10G switch :) Hey, who knows, it might even be interesting 2-3 years down the line :)

That should be doable and eliminate possible problems with the switch...
(Remember that I will only have 3 Proxmox servers,
with 2-3 critical VMs,
of which only one will generate most of the traffic
-> 50 (mostly) data-entry clients accessing a Postgres DB.)

How about this -
1G - Proxmox Public
1G - Proxmox Cluster
40G - Ceph Client and Cluster

That would be more than you'd need.




Hope that answers most of the questions you had.
Going on vacation now.
PS: This is a topic similar to yours; maybe it is of interest to you:
https://forum.proxmox.com/threads/hardware-concept-for-ceph-cluster.32814/
 
