Thanks for the input!
For the Ceph network backbone we use 40 Gbps InfiniBand. If you are building small with a 12-port 10GbE switch, keep in mind that you will need one dedicated port for Ceph cluster sync. To be really efficient you will need 3 switches: 1 for the Proxmox cluster, 1 for the Ceph public network and 1 for the Ceph cluster sync network.
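For reference, the split itself is just two settings in ceph.conf - a minimal sketch, the subnets here are made-up placeholders:
[global]
    # client/VM and monitor traffic (the "public" side that Proxmox talks to)
    public network = 10.10.10.0/24
    # OSD-to-OSD replication, recovery and rebalancing traffic
    cluster network = 10.10.20.0/24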
Did not know about the cluster sync network! In none of the examples I've seen has there been any mention of it. So what does the sync network do - rebalancing and other OSD-to-OSD traffic, or something else? Does it need as much bandwidth?
Of course I hadn't been considering 40 Gbps InfiniBand - it would probably be cheaper than 10GbE - but I've had terrible experiences with InfiniBand adapters and Linux drivers living up to the advertised specs.
Do you run IPoIB or plain native InfiniBand?
Our nodes have only one PCIe slot (small form factor), and the 10GbE adapters that fit only provide 2x10GbE + 2x1GbE.
Yes, we did have some data corruption, with one big incident last year. But every time it was not Proxmox's or Ceph's fault - mostly user error. The last big one was caused by none other than me.
I cannot remember the details, but I tried something I should not have tried on a production cluster. We had good backups to fall back on, so the actual data loss was minimal, but the hassle and the downtime were painful.
I think this goes for any system: don't test and trial on a live system, and know what you are doing before trying something. In my case I did not read up fully, so I missed a step or two, which caused the disaster.
Ouch! Good thing to have backups
I don't know if I can by any means cram backups for the data into our budget. Backups for the OS and pure-SSD storage are planned, but for bulk storage there is no budget. We need to get below 3€/TiB/month in operational expenses (incl. hardware!) for our target segment. It's that tough!
That's why I'm going back and forth between using Ceph and just doing plain ol' RAID5 per machine. But I would REALLY, REALLY like to have live migration as an option for server maintenance etc.
We have tons of disk space vacant: even after overprovisioning, only about 60-70% gets used, and the business sense in me says we need to ramp that up to 90%!
Then there's the scalability and the other possibilities. All of this I'm weighing against the risk of losing a significant portion of our customers' data, since, once again for cost reasons, we want to use the largest possible drive models.
Then again, I also need to weigh in that disks under lower usage don't fail as often.
We have all disks active all the time, averaging 40-50% utilization as per iostat. Of course, there is a significant number below 30% and a significant number above 60% too.
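(For the record, those utilization numbers are just what plain iostat reports, e.g.:
# iostat -dx 60
and I'm reading the %util column per disk; the first report is the average since boot, the later ones are per 60-second interval.)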
I fully understand the need to save on initial cost by using recycled hardware. We followed a similar path years ago. But I will caution you, and so will anybody else with some level of Ceph experience: try not to go too cheap with hard drives. Especially with the way you are thinking, I see a huge disaster coming. Here is what I mean:
Although you can mix and match different sizes of HDD in a Ceph cluster, you have to maintain some form of balance. For example, let's say you have 512GB, 1TB and 2TB HDDs. Don't end up with a few nodes holding the majority of one kind. The following scenario is a bad idea:
Node 1: 4 x 512GB, 2 x 1TB
Node 2: 1 x 512GB, 1 x 2TB
Node 3: 3 x 2TB
..................................
The following scenario is a good idea:
Node 1: 2 x 512GB, 2 x 1TB, 1 x 2TB
Node 2: 2 x 512GB, 2 x 1TB, 1 x 2TB
Node 3: 2 x 512GB, 2 x 1TB, 1 x 2TB
...................................
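A quick way to sanity-check the balance once the cluster is built (host and OSD names will of course differ on your setup):
# ceph osd tree
This lists every host with its OSDs and their CRUSH weights (roughly the size in TB), so you can eyeball whether the per-host totals are similar. On newer releases
# ceph osd df
also shows how full each individual OSD is.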
Thank you for reminding me! Sometimes when building things I'm just going too fast to think about all the details - I might have forgotten to balance them out!
Use caution when rebooting a node to add drives. When your Ceph cluster is up and running, every time there is a node/HDD failure the cluster will go into rebalancing mode. Even when nothing has actually failed and you have simply rebooted, the cluster will still think something has failed. So tell the cluster not to rebalance before every reboot; the following one-line command will do the trick:
# ceph osd set noout
After you have rebooted, simply unset the noout flag like this:
# ceph osd unset noout
This way you can reboot without triggering rebalancing. And by no means should you reboot multiple nodes or HDDs at the same time - try to do one node at a time.
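So a full maintenance reboot of a single node looks roughly like this (just my sketch of the sequence above):
# ceph osd set noout
(reboot the node, add the drives)
# ceph -s
Wait until health is back to OK and all PGs are active+clean, then:
# ceph osd unset noout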
Which brings me to the question: what if we have a total lights-out event?
It's rare, but it will most definitely happen at some point. It's just plain guaranteed to happen no matter what - this happens even to the big players sometimes, so I'm sure we cannot avoid it!
So if all nodes go down at roughly the same time, or say only 1-2 mons and 10-20% of the OSDs stay online, what happens?
Is all data lost at that point?
It can't be - that would be a rather disastrous drawback for Ceph, as this kind of situation is pretty much guaranteed to happen eventually.
Why RAID10 for the OS drive? Are you putting the Ceph journals on the same OS SSD?
I guess in this case RAID6 might actually be better. Besides, I use software RAID (I don't trust any HW RAID adapters - every single one I've ever used has been total and utter garbage!).
Anyway, the idea is redundancy.
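For what it's worth, with plain md software RAID a four-disk RAID6 array is a one-liner (device names here are only an example, not our actual layout):
# mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
That gives two disks' worth of redundancy without any HW RAID controller in the picture.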
I do not know the depth of your Ceph knowledge, so if I am saying what you already know, I apologize. I would not recommend running a Ceph cluster above 85% usage continuously. Also, did you take into consideration how Ceph uses space with replicas and all? For example, let's say your total raw cluster size is 144TB. That does not mean you can store that much user data. With replica 3, any data you store gets written a total of 3 times. So let's say you have stored 15TB of customer data: Ceph will actually consume about 45TB of space due to replica 3. With replica 2 it would use 30TB. You get the idea. Some see this as a drawback, but I see it as a very small price to pay considering what Ceph does.
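You can see the raw vs. replicated consumption on a live cluster with:
# ceph df
which shows the global raw capacity and usage plus per-pool usage. The replica count is set per pool, e.g. (using the default "rbd" pool purely as an example):
# ceph osd pool get rbd size
# ceph osd pool set rbd size 3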
And this right here is why I have not considered Ceph before - I needed to wait for erasure coding to mature a bit!
So erasure coding 18+2 means that for 18TB of customer data only 20TB gets consumed!
Then add an SSD cache tier (cache tiering support was recently added to Ceph) with replica 2. I cannot justify to myself going for replica 3.
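Roughly what I have in mind, as far as I understand the commands so far (pool names and PG counts are placeholders, and with the default failure domain an 18+2 profile needs at least 20 separate hosts):
# ceph osd erasure-code-profile set ec-18-2 k=18 m=2
# ceph osd pool create coldpool 1024 1024 erasure ec-18-2
# ceph osd pool create cachepool 256
# ceph osd pool set cachepool size 2
# ceph osd tier add coldpool cachepool
# ceph osd tier cache-mode cachepool writeback
# ceph osd tier set-overlay coldpool cachepool
So 18/20 = 90% of the raw space in the EC pool holds actual data, versus 50% with replica 2.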
We obviously have different perspectives and goals here - my most important goal is to get sensible costs with maximal capacity and decent performance.
Though with Ceph, reliability and redundancy need to be a major objective - just not to the point where our business plan becomes unviable.
For example, we need to be able to provision 16GiB RAM + 4TB storage with at least 200 IOPS at any given time for less than 20€ a month - and in fact far less, since that is only our benchmark; the design goal needs to be somewhere around 13€, preferably 10€.
Our initial common compute node will be: 48GiB of RAM, 12 cores and 4 drives (whether HDD or SSD; OS + Ceph OSDs). Our operational cost for such a system is roughly 13-14€ a month.
Though the next models will probably be 96GiB of RAM, 8 cores and 3 drives.
It may sound impossible, but with smart purchasing and a very frugal approach I believe this can be done by leveraging economies of scale at every step of the way.
You wouldn't believe how little I pay for one such node as described above - they are almost free in practice compared to the drive, ops and infrastructure expenditure.
We use systems very similar to the ones you linked - thanks for the link, it confirms that the chassis specs I've been considering are the right ones!
Though ultimately I'd love to use the Backblaze pod as the storage chassis, but 48 drives per chassis... that's a bit too much (rich) for me!
I think our chassis could actually take the SC mobo used in the model you linked:
http://www.supermicro.com/products/motherboard/Xeon/C600/X10DRT-P.cfm
It looks like exactly the same form factor, and some of our nodes actually use a similar older-gen SC mobo.
Though we use the 3.5" drive model, since initially we did not do virtualization and needed the lower price and higher capacity of 3.5" drives.