I read up on the threads about SSDs failing with PVE,
(thanks for the links)
found a really strange one about "Proxmox 4.x is killing my SSDs"
(even though they were not used for the journal, and even more strangely this did not happen on PVE 3)
and of course I know a little bit about write amplification
(and AFAIK no vendor will tell you how much their SSDs really have),
but from the threads it doesn't sound like there is a consensus on what the real reason is
or what a good avoidance strategy would be
(e.g. "don't write small chunks" - how/where could I configure this? and why is this not the default?) -
only that using a high-TBW SSD is better
(as you can burn through a lot more TB before it becomes a problem),
but nobody can really be sure that their installation will be spared the high-write-volume problems, can they?
It all boils down to this in essence:
- The only way to be sure is to constantly measure how much data is written (read: "run software that graphs the SMART values of all your flash devices" - see the sketch after this list)
- Write amplification exists. Its effects are worse in low-end SSDs because the mitigation tools (i.e. the controller chip) are either not there or terrible, there is less spare NAND, or the NAND is of inferior quality.
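For the "graph your SMART values" part, a minimal sketch of what such a logger could look like (assumes smartmontools is installed; the device name and the Total_LBAs_Written attribute are assumptions - attribute names and the LBA size vary by vendor, so check your model first):

```python
#!/usr/bin/env python3
# Minimal sketch: log host writes from SMART so you can graph them over time.
# Assumes smartmontools is installed and the drive reports Total_LBAs_Written;
# other vendors use different attribute names, so adjust the regex for your model.
import re
import subprocess
import time

DEVICE = "/dev/sda"          # hypothetical device name, adjust to your SSD
LBA_SIZE = 512               # many drives count 512-byte LBAs in this attribute

def total_bytes_written(device: str) -> int:
    out = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=True
    ).stdout
    match = re.search(r"Total_LBAs_Written.*?(\d+)\s*$", out, re.MULTILINE)
    if not match:
        raise RuntimeError("attribute not found - check the vendor-specific name")
    return int(match.group(1)) * LBA_SIZE

if __name__ == "__main__":
    # Append one sample per run (e.g. via cron); feed the log into your graphing tool.
    tb = total_bytes_written(DEVICE) / 1e12
    print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {DEVICE} {tb:.3f} TB written")
```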
In my experience, for both professional and personal usage as a Proxmox root disk or a caching device, you use a high-TBW SSD, unless
all of the following conditions are true:
- You have spares on hand
- You have a truly redundant setup (where it counts: RAID on the SSDs)
- You graph your storage's SMART values (so you know when to change them based on TBW values [rating -20%])
- You need a lot of storage devices now (for performance reasons), but do not have the cash for high-TBW devices now. You are aware that the longer you run this setup, the more money you are burning (read up on TBW/€ for different models)
- You can hot-swap
or you just aren't bothered by nodes failing for short periods of time and possibly having to be reinstalled completely
SSD :
OK, I will then buy the S3700 200GB (3 PB TBW) -
it costs more money than I really have, but better this than a broken installation ...
(and if I leave 10-20% empty then the controller can use this to extend the TBW even more, right?)
[...]
As I now need money for the SSDs, I was looking on eBay for cheaper alternatives to the Intel X540s -
The less you write to these disks, the more spare blocks you end up having. You will not extend the TBW; rather, you make the drive last longer (by not writing as much to it).
Just to reiterate and dumb it down some: the problem is not the files that you write once and then read 100k times.
It is the files that you write-delete-write-delete-wri.....
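To put rough numbers on the two ideas above (write amplification, and spare area from unpartitioned space), here is a back-of-the-envelope sketch - every number in it is made up for illustration, nothing is measured from a real drive:

```python
# Illustrative only: made-up counters, not measurements from any specific SSD.

def write_amplification(nand_bytes_written: float, host_bytes_written: float) -> float:
    """WA = what the controller actually writes to NAND / what the host asked for."""
    return nand_bytes_written / host_bytes_written

def effective_spare(raw_capacity_gb: float, partitioned_gb: float, factory_spare_gb: float) -> float:
    """Space the controller can use for wear leveling and garbage collection."""
    return factory_spare_gb + (raw_capacity_gb - partitioned_gb)

# Host wrote 10 TB, controller wrote 25 TB to NAND -> WA factor of 2.5
print(write_amplification(25e12, 10e12))   # 2.5

# 200 GB drive, only 170 GB partitioned, plus an assumed 24 GB of factory spare
print(effective_spare(200, 170, 24))       # 54 GB of spare area
```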
The best advice I can give you when money is an issue and 3.8 PB is going to be overkill:
if you already have a test setup running somewhere, you could benchmark your TBW-per-week needs. Then double it (this is you doubling your VMs as you notice how awesome your setup is). Then double it again (this is your buffer) and multiply it by 52 weeks and 5 years.
Then find an SSD that has at least that TBW rating and fits your usage scenario (high on write IO, high on read IO, or a happy compromise somewhere down the middle).
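The same rule of thumb as a tiny sketch (the weekly figure is a made-up example, not a recommendation - it would come from your own test setup, e.g. by graphing the SMART counters as in the earlier snippet):

```python
# Sizing rule of thumb described above, with example numbers.
measured_tb_per_week = 1.5            # example measurement from a test setup

growth_factor = 2                     # you doubling your VMs
safety_factor = 2                     # your buffer
weeks_per_year = 52
years = 5

required_tbw = measured_tb_per_week * growth_factor * safety_factor * weeks_per_year * years
print(f"Look for a drive rated for at least {required_tbw:.0f} TBW")
# With 1.5 TB/week this comes out to 1560 TBW (~1.5 PB)
```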
InfiniBand vs Ethernet :
IB should have 50% lower latencies and use much less energy (according to some IB site),
so that would be better, wouldn't it?
But how complicated is it to get this up and running ?
I saw some older threads and blog posts which made it look quite involved -
is this still true with PVE 4.4?
If you understand networking principles and the technology, it's about the same. I had an IB-capable NAS/Proxmox cluster at home for testing. Threw it out because I got cheap used dual 40G cards and switches from work.
AFAIK @udo is one of the 3 other people I remember running IB on Proxmox at one point. Maybe we should page him.
I saw some used 40GbE (!) Mellanox cards which seem to be quite affordable (half the price of the Intel cards) -
if I only use a mesh then I also only need 3x 40G QSFP+ cables, which can also be found for reasonable prices..
(a 40GbE switch is beyond my means)
Would this be good for the cluster or are 10G cards enough?
(because the rest of the system is not fast enough)
If a 40G mesh + cables is cheaper than dual 10G NICs and 2x 10G stacked switches, then yes, go 40G.
Especially when you know that you can break a 40G cable into 4x 10G cables (it is called a breakout cable - you should google that) and find out you can use those with 10G NICs (as mesh) and 10G switches ...
With cables... make sure you get compatible ones. If you come from 1G networking this can be daunting...
My opinion: better to have more bandwidth than needed than less. Especially when it is cheaper.
QoS: You have 2 options:
1] 2 switches in a stack - the stack looks like a "single" switch to the server, multiple connections with LACP (redundancy) are no problem, but stackable switches cost more (and the stack can fail too)
2] 2 standalone switches - with Spanning Tree Protocol you can have connections from the server to different switches and the interfaces will be in an active/passive role.
QoS :
So if I have 2 switches and one fails, then the clients connected to the second switch will still be able to talk to the server
(but not the ones connected to the failed one).
As I said, I think all our clients (approx. 50) are connected to one big switch -
but I will have to check what our network guy has really done.
What you guys are talking about is Redundancy.
PS: when using LACP, use balance-tcp mode and Open vSwitch on Proxmox. See below for why.
What I am talking about is ensuring QoS (Quality of Service), as in prioritizing the
flow of data from a source to its destination based on specific characteristics (i.e. subnet and/or VLAN).
Let's break this down (dumb it down):
In an ideal setup, you have at least 4 "networks" (you can number/assign them as you feel like btw, even put them on VLANs); there is a small sanity-check sketch after this list:
Proxmox Public (10.1.X.Y/16) - your clients connect here.
Proxmox Cluster (10.2.X.Y/16) - your cluster talks here.
Ceph Public (10.3.X.Y/16) - your Proxmox servers talk to the Ceph MONs/MDS/OSDs here.
Ceph Cluster (10.4.X.Y/16) - your Ceph OSDs replicate on this network.
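As promised, a small sketch for such an addressing plan: it only writes the four networks down as data and checks that the subnets do not overlap. The subnets are just the example ranges from the list; nothing Proxmox- or Ceph-specific is configured here.

```python
# Sanity-check the four-network addressing plan described above.
import ipaddress
from itertools import combinations

networks = {
    "proxmox_public":  ipaddress.ip_network("10.1.0.0/16"),   # clients connect here
    "proxmox_cluster": ipaddress.ip_network("10.2.0.0/16"),   # Corosync traffic
    "ceph_public":     ipaddress.ip_network("10.3.0.0/16"),   # Proxmox <-> MON/MDS/OSD
    "ceph_cluster":    ipaddress.ip_network("10.4.0.0/16"),   # OSD replication
}

for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2):
    if net_a.overlaps(net_b):
        raise SystemExit(f"{name_a} and {name_b} overlap - fix your addressing plan")

print("addressing plan looks sane:", ", ".join(str(n) for n in networks.values()))
```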
Link speeds of NICs:
1 Gbit/s = 125 MB/s (or the speed of a normal HDD)
10 Gbit/s = 1250 MB/s (or the speed of 2 SSDs)
40 Gbit/s = 5000 MB/s (you get the idea)
(overhead neglected)
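If you want to redo that conversion yourself, here is the arithmetic as a sketch (the ~5% overhead factor is a rough guess, not a measured value; the table above simply neglects it):

```python
# Line rate in Gbit/s -> usable MB/s, optionally with a rough overhead factor.
def usable_mb_per_s(gbit_per_s: float, efficiency: float = 1.0) -> float:
    return gbit_per_s * 1000 / 8 * efficiency

for speed in (1, 10, 40):
    print(f"{speed:>2} Gbit/s ~ {usable_mb_per_s(speed):.0f} MB/s raw, "
          f"{usable_mb_per_s(speed, 0.95):.0f} MB/s with ~5% overhead")
```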
So why is this important ?
- Imagine... you have 2x 1G and 1x 10G available.
- You assign 2x 1G to Proxmox Public
- You assign Proxmox Cluster + Ceph Public + Ceph Cluster to the 10G NIC.
The second your OSDs start replicating, your Proxmox cluster communication (e.g. Corosync) starts to get upset, throws a tantrum and desyncs your cluster or a node (also sometimes referred to as cluster/node flapping).
So what most people do is called "poor man's QoS".
For that you need at least 4 NICs in every single node:
1 xG NIC for Proxmox Cluster and a separate switch to connect these links
1 xG NIC for Proxmox Public and a separate switch to connect these links
1 xG NIC for Ceph Cluster and a separate switch to connect these links
1 xG NIC for Ceph Public and a separate switch to connect these links
But this can get expensive really fast, especially at higher than 1G link speeds, because switches get expensive at that point. And it is also a terribly inefficient use of the total link capacity.
Then there are people that do 10G for Ceph and just limit Ceph OSD replication speeds. While it works, it can have a large performance impact and can also be very time-consuming when, e.g., Ceph needs to do a re-balance after a failed disk.
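For reference, "limiting replication speed" usually means throttling recovery/backfill so it does not starve client I/O. A sketch of that, wrapped in Python (the option names osd_max_backfills and osd_recovery_max_active are the classic ones; check the docs for your Ceph release, as names and defaults have shifted over versions):

```python
# Sketch: throttle Ceph recovery/backfill at runtime via injectargs.
# Run on a node that has a working Ceph admin keyring.
import subprocess

def throttle_recovery(max_backfills: int = 1, recovery_max_active: int = 1) -> None:
    subprocess.run(
        ["ceph", "tell", "osd.*", "injectargs",
         f"--osd_max_backfills {max_backfills} "
         f"--osd_recovery_max_active {recovery_max_active}"],
        check=True,
    )

if __name__ == "__main__":
    throttle_recovery()
```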
Then there are people that make sure they have switches that can do QoS (as in prioritize flows based on criteria). And then there are even people that operate an SDN (software-defined network).
What these people do is LACP all links (where possible) and then just make sure that Proxmox Cluster network flows have the highest priority, Ceph Cluster network flows have the lowest priority, and Proxmox Public + Ceph Public share the happy medium.
Your network flows are not congested anymore. And when Ceph needs to re-balance, it gets as much bandwidth as your network can spare (and does not crawl to a halt).
Now why did I mention Open vSwitch and balance-tcp above?
For Open vSwitch it is easy: it uses fewer CPU cycles to do the same amount of work compared to a native Linux bridge.
For balance-tcp you need to know how Ceph networking works:
http://docs.ceph.com/docs/master/_images/ditaa-2452ee22ef7d825a489a08e0b935453f2b06b0e6.png
Every Ceph node on your network assigns each of its OSDs and MONs a unique port, so they are reachable under different <IP:port> combinations.
What balance-tcp does is load-balance flows across all available network links based on source IP and source port AND destination IP and destination port. For a 4x 10G network that would mean a total bandwidth capacity of 40G and a maximum bandwidth allotment of 10G per flow.
An active-passive (active-backup) bond, on the other hand, basically means that you have a master and a standby link; the standby link only gets used when the master is down. So in the same 4x 10G example you are looking at 10G total capacity and a 10G maximum bandwidth allotment per flow.
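A toy illustration of why that per-port hashing spreads Ceph traffic so well: each (src IP, src port, dst IP, dst port) flow lands on one of the bonded links, and since every OSD/MON listens on its own port, different OSD connections can land on different links. The hash below is just Python's built-in one, not the hash OVS actually uses, and the interface names and ports are made-up examples; it only shows the distribution idea.

```python
# Toy model of balance-tcp style flow distribution over a 4-link bond.
LINKS = ["eth0", "eth1", "eth2", "eth3"]   # example names for a 4x 10G bond

def pick_link(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> str:
    return LINKS[hash((src_ip, src_port, dst_ip, dst_port)) % len(LINKS)]

# One client talking to four OSDs on the same host: four flows, up to four links.
for osd_port in (6800, 6801, 6802, 6803):
    print(f"10.3.0.1:51000 -> 10.3.0.2:{osd_port} goes out via",
          pick_link("10.3.0.1", 51000, "10.3.0.2", osd_port))

# With active-backup, every one of these flows would share the single active link.
```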
Hope that answers "SOME" of the questions you did not know you had