Urgent: Proxmox/Ceph Support Needed

The way I read it, the plan is to take that node (which has high-load issues that are affecting nodes 1 and 2) out of the equation while he goes and fixes/reinstalls node 3.
The high load may be the result of all the rebalancing Ceph is trying to do. Eric's original post says the cluster lost 33% of its disks, but we don't know what caused the loss. After the loss I believe he marked the OSDs OUT, which started a rebalance that, as far as I can tell, never finished. High I/O is obviously normal while Ceph is rebalancing, and it will slow the cluster down significantly, especially on a small 3-node cluster with 3 replicas.
 
We got the Ceph cluster down to 2 replicas. It finished rebuilding, but during that process one of the two remaining Ceph nodes marked all of its OSDs as down. Not sure why it did that, but the cluster is still running OK and we are temporarily moving things off to local storage instead.
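
For anyone wanting to do the same, dropping a replicated pool from 3 to 2 copies is a per-pool setting; here is a minimal sketch, assuming the pool is called "rbd" (substitute your own pool names):

Code:
ceph osd pool set rbd size 2        # reduce the replica count from 3 to 2
ceph osd pool set rbd min_size 1    # keep accepting I/O with only one clean copy (only during recovery)
ceph -s                             # overall health plus recovery/backfill progress
ceph -w                             # watch recovery events live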

Node 3, the one that was having all the issues, appears to have had a bad SSD that the journal was on. I had the journal set up as 2 SSDs in a RAID1 configuration, hoping that if one ever failed it would be OK. Unfortunately the array in its degraded state was too slow to keep up, causing all kinds of issues.
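
For reference, with filestore OSDs you can see where each OSD's journal lives and check the health of that SSD; a rough sketch (device names are placeholders):

Code:
ls -l /var/lib/ceph/osd/ceph-*/journal   # each OSD symlinks its journal device/partition
smartctl -a /dev/sdX                     # SMART health and wear of the journal SSD
cat /proc/mdstat                         # if the journal RAID1 is Linux mdraid, check its state here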

We appreciate all the advice and help everyone has lent. I think until we have a better understanding of Ceph we're going to steer clear of it.

Best regards,
Eric
 
How many OSDs does your SSD journal handle?

Check this thread:
https://forum.proxmox.com/threads/24176-Newbie-need-your-input
Post #13 and the following comments by "udo" cover SSDs and the increase in failure likelihood as you scale up the number of OSDs handled per journal SSD.



PS:
Personally, after doing a bunch of tests on my test machine, then the test cluster, then verifying on the office production cluster, and finally moving to our storage clusters, I have completely moved off SSD-backed journals and run the journals on their OSDs instead. We now use those SSDs as replicated SSD caching tiers for all pools instead.
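
For anyone curious what that looks like in practice, a cache tier is attached to a backing pool with the standard Ceph tiering commands; a minimal sketch, where "hdd-pool", "ssd-cache" and the thresholds are example names/values, not our exact settings:

Code:
ceph osd tier add hdd-pool ssd-cache                        # attach the SSD pool as a tier
ceph osd tier cache-mode ssd-cache writeback                # cache reads and writes
ceph osd tier set-overlay hdd-pool ssd-cache                # send client I/O through the cache
ceph osd pool set ssd-cache hit_set_type bloom              # required hit-set tracking
ceph osd pool set ssd-cache target_max_bytes 1000000000000  # ~1 TB cache size
ceph osd pool set ssd-cache cache_target_dirty_ratio 0.4    # start flushing at 40% dirty
ceph osd pool set ssd-cache cache_target_full_ratio 0.8     # start evicting at 80% full

The ssd-cache pool also needs its own CRUSH rule that only selects the SSD OSDs (see the CRUSH sketch further down).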

It should also be noted that we never used/use enterprise-grade SSDs and stick with consumer-grade hardware all the way. It should also be noted that our clusters are >=4 nodes and have a minimum of 10 OSDs per node. Pools are planned so they can lose 20% of nodes/OSDs and still function with their base settings (i.e. EC pools with k=80, m=20 on a 120++ OSD cluster).
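
To make the k=80/m=20 example concrete, an EC pool is built from an erasure-code profile; a hedged sketch with made-up names and PG counts (on hammer-era releases the option is ruleset-failure-domain, newer releases call it crush-failure-domain):

Code:
ceph osd erasure-code-profile set k80m20 k=80 m=20 ruleset-failure-domain=osd
ceph osd erasure-code-profile get k80m20         # double-check before creating the pool
ceph osd pool create ec-bulk 2048 2048 erasure k80m20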

Be advised that this generally goes against the Ceph hardware recommendations. However, based on our use cases we get a lot better performance out of the clusters, and we remove one failure domain (the separate journal device). I'm pretty sure it only works because of the scale involved.


Instead of X 800€ Intel S3700 400GB SSDs we now use Y 240GB SSDs for 75€ each. Yes, they likely fail faster, but by having them in abundance and 100% expecting them to fail, we can overcome this downside easily, which makes it economical (I can buy 10 consumer-grade SSDs for 1 enterprise SSD). The key here is to benchmark each use case and to diligently verify expected behaviour for every possible failure domain (including compounded failure-domain issues).
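
As for benchmarking, a simple baseline to run is rados bench against a throwaway pool (pool name, PG count and runtimes are just examples):

Code:
ceph osd pool create benchpool 128 128
rados bench -p benchpool 60 write --no-cleanup   # 60s write test, keep objects for the read tests
rados bench -p benchpool 60 seq                  # sequential reads of those objects
rados bench -p benchpool 60 rand                 # random reads
rados -p benchpool cleanup
ceph osd pool delete benchpool benchpool --yes-i-really-really-mean-it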


For a 3-node cluster with fewer than 30 or so OSDs, I'd be very sceptical about deploying that for production. Might cause me some sleepless nights :P
 
Personally, after doing a bunch of tests on my test machine, then the test cluster, then verifying on the office production cluster, and finally moving to our storage clusters, I have completely moved off SSD-backed journals and run the journals on their OSDs instead. We now use those SSDs as replicated SSD caching tiers for all pools instead.
If you have never tried with data-center-grade SSDs, how can you then assume that journals on SSDs do not noticeably increase performance?
 
If you have never tried with data-center-grade SSDs, how can you then assume that journals on SSDs do not noticeably increase performance?


That statement refers to production use. We did have 3x 400GB DC S3700s for our test cluster, but they have since been put out to greener pastures, aka the CEO's workstation :P
We benched the living daylights out of it before shelling out big bucks on our production clusters (50 nodes total). Rather than leaving things to gut feeling, "vendor guidance" or best practice, I personally would rather benchmark for a week and know for sure.



You use SSDs to accelerate; at that point €/TB becomes somewhat relevant, and power consumption comes in a distant second after performance.

At that point it comes down to "investment + running cost", then "available drive space", and then "failure domains". Since our pods would theoretically allow us to fit some more internal backplanes, we assume that we could put 30+ SSDs in.

We used 5 OSDs per DC S3700 400GB journal, for 800 €.
We now use 5 OSDs (each with its own journal) + 10x ADATA Premier 256GB (as caching tier), also for 800 €.
We benchmarked both and the winner, performance-wise, was clearly the 10 SSDs.

It all simply comes down to a matter of scale.


We then extrapolated what we'd need for 40 OSDs (our goal of OSDs per node) --> 8 DC S3700 400GBs --> 6400€.
That's basically doubling our setup cost on top of what our hardware (including the caching SSDs) would cost. For the performance we'd get, they are just too expensive.
We noticed that 8 SSDs as cache tier per node would give us the performance we require (hence our use case) at a fraction of the cost (640 €).

And then come failure domains. You lose a journal and you then have to replace said journal and rebuild the OSDs, or lose the OSDs and recycle them. If you go with an over-provisioned SSD cache pool, you lose an SSD, you replace said SSD.
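
Replacing a failed cache-tier SSD is just the usual dead-OSD routine; a rough sketch, where OSD id 12 and /dev/sdX are placeholders:

Code:
ceph osd out 12                   # stop placing data on the dead OSD
ceph osd crush remove osd.12      # remove it from the CRUSH map
ceph auth del osd.12              # remove its auth key
ceph osd rm 12                    # remove it from the cluster
# swap the physical SSD, then recreate the OSD on it (on Proxmox e.g. via pveceph)
pveceph createosd /dev/sdX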

Let me use an analogy here: if you could get the same type of underwear for a fraction of the cost, at what point do you stop washing your underwear and instead get a new pair each morning and throw it away in the evening?

So we basically accept that we might go through 2-3 SSDs in the time we'd go through one DC S3700, buy spares, and be done with it.




Now it should be noted that Santa Claus has already confirmed the wishlist of the IT department:
3x 8-slot M.2 NVMe cards complete with "x16 to 4x x4" PCIe converters, adapters and flexible risers for even hotter caching tiers. Pretty sure there is also a package with a node's worth of SSHDs coming - aychprox, if you're reading this... it's all your fault (and udo's by association) :P



ps.:
Hope that answers your curiosity.
And yes, I know it is counter-intuitive and might not work for all use cases, hence me pointing it out in big black letters.
 
The SSDs do not have power loss protection.
But they should not need to provide that feature in order for us to be protected against data loss.

Some HBAs have batteries (but do not need to); those that do not stem from cannibalised systems do not have batteries.

We use them as a caching pool that has the same replication scheme as the backing pool.

As CRUSH hierarchy levels (from leaf up) we use: osd, host, LeftOrRightSideOfTower, StorageTower, Room, Building, Campus, Region (not in use yet).
Examples:
HDD pool with replica size = 4 over the leaf "host": CRUSH will replicate the SSD caching pool's data onto SSDs on 4 different hosts.
HDD pool with replica size = 3 over the leaf "Building": CRUSH will replicate the SSD caching pool's data onto SSDs in 3 different buildings.
Each HDD pool gets an SSD cache pool assigned, replicated and EC pools alike.
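
For the curious: extra levels like "Building" only exist once you add them as bucket types in the CRUSH map, and a rule then spreads replicas over that level; a hedged sketch with placeholder names (on pre-Luminous releases the pool option is crush_ruleset, newer ones call it crush_rule):

Code:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt       # add the custom types/hierarchy by hand
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
ceph osd crush rule create-simple rep-by-building default Building
ceph osd pool set ssd-cache crush_ruleset 3     # assign the rule (3 is just an example rule id)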

Now, the way I understand Ceph: when e.g. a write request from a client comes in via a node, it is placed according to CRUSH on the first SSD, then replicated according to the pool settings, and only once it is replicated is the write acknowledged.

It should be noted that we use the BTRFS file system for all OSDs (SSD and HDD alike), for transparency's sake.
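
For completeness, the btrfs choice is a filestore-era setting that can either go into ceph.conf as a default or be passed when a disk is prepared; a sketch, not necessarily our exact config:

Code:
# ceph.conf
[osd]
osd mkfs type = btrfs
osd mount options btrfs = rw,noatime

# or per disk at creation time
ceph-disk prepare --fs-type btrfs /dev/sdX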
 