Proxmox VE 4.4/CEPH/RAID issues

Nicolas Dey · Jun 6, 2017

Hello,

We have been experiencing strange issues since we have installed our cluster.

First, let's give you a glance of our configuration:
* 3 times the same server (RAM, disks, CPUs),
* Connected to our network from one side,
* All on UPS, switches included,
* Connected to a 10GB switch for CEPH (private) on the other side,
* RAID card (Avago MegaRaid SAS MFI 3108) which is not used for RAID, but just configured as "pass-through" so we can see all individually (maybe not the best part of the design, we should have removed it completely), no battery,
* 2x Samsung SSD 1TB (SSD model compatible with RAID card, I've checked),
* 2x 2TB spinning.

Configuration at a glance:
* Ceph is using the 2 individual spinning disks on each server as volumes,
* Ceph journal is on first SSD,
* Remaining space of first SSD is a local volume for VM,
* Other SSD is a local volume for VMs,
* Ceph is configured in HA, so any failing server will migrate automatically the VM to another healthy one (tests OK).

Please let me know if you want specific logs/details that could help in the diagnostic.

Issue description:
Randomly, and too often (~5 times in a month), we have lost one or the other SSD from our configuration on one or another server, or 2 at the same time: the RAID card flags the SSD as failing, and removes it from the system. When it is the one with CEPH journal on it, we partly loose CEPH redundancy + some VMs are dying (no local disks); when it is the other one, we "only" loose VM.
We can repair this state by rebooting the server, marking the SSD disk as good and importing former RAID configuration, but we could face a total disaster if we are not lucky and loose all CEPH SSD at once.
Note that we never noticed the spinning disks to be flagged as faulty.

Due to this problem, we are seriously considering stepping back to VMware. Of course, I'm not sure this is 100% related to Proxmox, by the way, I'm just looking here for any suggestion, recommendation, or a similar experience with this issue.
Thank you, and best regards,
-- Nicolas

alexskysilk · Jun 6, 2017

I'm having trouble reconciling

" RAID card (Avago MegaRaid SAS MFI 3108) which is not used for RAID, but just configured as "pass-through" and
"importing former RAID configuration"

are you sure you have the cards set to passthrough, or are you making single disk volumesets? In either case, the answers are likely in your logs. It is likely that regardless of your certainty that your samsung drives are compatible with your controller, it may be limited to specific matching firmware revs on your drive AND controller, AND mode of operation. Its highly likely you'll have the same problems with vmware until you solve this.

Nicolas Dey · Jun 6, 2017

Hi alexskysilk,

alexskysilk said:
are you sure you have the cards set to passthrough, or are you making single disk volumesets?

You're right, the RAID part was confusing:
* We are using RAID chip,
* All disks are set as individual RAID0,
* From the system, once booted, they are listed as AVAGO, not Samsung anymore (there are beyond the RAID card).

alexskysilk said:
In either case, the answers are likely in your logs. It is likely that regardless of your certainty that your samsung drives are compatible with your controller, it may be limited to specific matching firmware revs on your drive AND controller, AND mode of operation. Its highly likely you'll have the same problems with vmware until you solve this.

What do you mean by 'mode of operation'? Nevertheless, I see your point… and I agree.
I have to say that we have another similar cluster with exact same MotherBoard (same BIOS version), same RAID card, different CPU (very similar anyway), different SSD (older ones, still Samsung Pro), but globally very similar (CEPH, one RAID 0 volume per HD, …). We never had any issue with this cluster. I have to admit it is used less intensively.

Thanks,
-- Nicolas

Ashley · Jun 7, 2017

What Samsung SSD are you using?

Nicolas Dey · Jun 7, 2017

Hi Ashley,

The SSD we are using are:
Samsung SSD 850 PRO 1TB, FW:EXM03B6Q

These drives are listed in the compatibility matrix of the RAID Adapter: https://docs.broadcom.com/docs/MegaRAID-SAS-Gen3CompatibilityList
…
Samsung SSD SATA 6Gb/s 850 PRO, MZ7KE1T0HMJB 2B6Q 1TB 2.5" NA 512 No
…

Nevertheless, as alexskysilk was saying before ("[…] it may be limited to specific matching firmware revs on your drive AND controller, AND mode of operation […]"). Nevertheless, the FW version is not the same as the recommended one. I was able to update two SSD to 4B6Q (latest FW), as we cannot downgrade a Samsung SSD FW (http://www.samsung.com/global/business/semiconductor/minisite/SSD/M2M/html/support/faqs_01.html). Our only guess for now is that we suppose we had a FW incompatibility which is fixed in latest one…

On the other cluster which has no issue, the SSD FW is much more older FW:GXM1003Q.

Thanks,
-- Nicolas

fortechitsolutions · Jun 7, 2017

Hi, for what it is worth. (a) Dogma says that despite what logic might otherwise say, LSI raid cards running a bunch of single-disk "Raid0" volumes, is **!!NOT!!** the same thing as converting the LSI Raid card to be a JBOD_mode card in which all disks are indeed attached as single drives. Go digging in ZFS Filer Forums for more detail if you like... but I recall reading the threads and there is a lot of general discussion that says, more or less:
-- either don't use the darned LSI card, period
-- or be darn sure you have it flipped into JBOD mode, period.
-- otherwise, there is absolutely, no point in bothering.
-- because of the way that LSI implements the 1-disk Raid0 volumes / and how it does not play nice / in the way that we might hope / and -- well, just go read and it is all laid out; many many pages of happy ranting

Otherwise for what it is worth, my initial review of ceph design last year suggested to me that a 'decent sensible' config for ceph would be more on the order of 6+ nodes with at least 6+ spindles per host for minimum of 36 spinning rust bulk storage drives. This was driven by reading / observations that when you lose a node, and you have too few disks in the pool, then the storage-rebalance traffic spike basically "sends everything to shit". To put it politely. However, I never went to the point of testing because my deploy config was of interest, was much smaller - 2x2Tb drive locally plus 2xSSD locally; for 3 identical proxmox nodes, period. Pretty 'modest' scale. So I just didn't bother testing ceph. Maybe I was being lazy.

Possibly someone who has a production ceph config can comment, on what they have found to be their "Minimum viable build size" in terms of nodes and spindles.

But.

That being said, just because you have problems with Proxmox and Ceph config, IMHO does not mean that you want to run off into the arms of the evil vixen (VMWARE). Your benefits of proxmox are so many, that really, just because of one storage subsystem testing type, I would not take the bleeding nose to mean the war is lost.

Rather, what are your other options and how can you make best from what you have ?
--- Local storage, shared nothing, live migrations that take a while but work fine.
--- non-local storage, convert one of your boxes into a mega-store filer, do 10gig NFS filer and shared NFS storage exported to your other nodes?
--- note in reality, hardware node proxmox boxes fail less often than humans going "oops!" - so I have observed with quite a lot of deployments - that over-building a nice fancy HA Complex topology, still leaves the biggest risk as part of the picture (ie, people who go "doh!" after they hit the wrong button). And thus, there is a lot to be said with

-- nice simple clean layout
-- easy to understand, hard for people to get confused
-- fewer moving parts, less chance it 'just breaks because dammit this is complex'
-- and so overally, yes, you don't have certain features, but, end of the day you still have very good reliable platform, which "just works".

Just my 2 cents!

Tim

Nicolas Dey · Jun 8, 2017

Some news that could interest this thread…
A colleague as found some interesting threads regarding a known issue (back in 2015) on Samsung SSD Pro 8* series with the Linux kernel:

http://www.spinics.net/lists/raid/msg49440.html - "It turns out that there is misunderstanding between raid driver and scsi/ata driver. […] in case of trim, there are some problems."
https://git.kernel.org/pub/scm/linu...c?id=9a9324d3969678d44b330e1230ad2c8ae67acf81 - (Kernel Git - interesting comment)
https://linux.slashdot.org/story/15/07/30/1814200/samsung-finds-fixes-bug-in-linux-trim-code - "Samsung Finds, Fixes Bug In Linux Trim Code"
…

I'm not yet able to find if/when the patch as been issued, if it has been accepted and released, and if Linux Kernel version Proxmox VE 4.4 is based on already has the patch. It *seems* not, but no other clue than our issues.

At this point, a Proxmox staff member answer would be highly valuable, I guess.

For now, our reseller has kindly offered to RMA our 850 Pro and send us S863 instead. As our other cluster has SM863 and never presented this issue, we are rather optimistic.

@Tim, sadly, AFAIK (and as hard as I tried), we cannot get rid of the RAID card from the system BIOS, or from the RAID BIOS.

Thank you,
-- Nicolas

fabian · Jun 8, 2017

Nicolas Dey said:
Some news that could interest this thread…
A colleague as found some interesting threads regarding a known issue (back in 2015) on Samsung SSD Pro 8* series with the Linux kernel:

http://www.spinics.net/lists/raid/msg49440.html - "It turns out that there is misunderstanding between raid driver and scsi/ata driver. […] in case of trim, there are some problems."

https://git.kernel.org/pub/scm/linu...c?id=9a9324d3969678d44b330e1230ad2c8ae67acf81 - (Kernel Git - interesting comment)

https://linux.slashdot.org/story/15/07/30/1814200/samsung-finds-fixes-bug-in-linux-trim-code - "Samsung Finds, Fixes Bug In Linux Trim Code"

…

I'm not yet able to find if/when the patch as been issued, if it has been accepted and released, and if Linux Kernel version Proxmox VE 4.4 is based on already has the patch. It *seems* not, but no other clue than our issues.

At this point, a Proxmox staff member answer would be highly valuable, I guess.

is upstream since 4.1-rc4, so included in all PVE 4.x (since 4.0 Beta 2) and PVE 5.0 Beta kernels.

Nicolas Dey · Jun 9, 2017

Hi Fabian,
Good to know, thanks for your answer.
We too have performances issues with Samsung 850 Pro, I think everything is pointing to these drives: https://forum.proxmox.com/threads/slow-ceph-journal-on-samsung-850-pro.27733/

I'll let you know once we have change them with SM863A (SM863 not available anymore).
Thanks,
-- Nicolas

Nicolas Dey · Jul 11, 2017

Hello,
An update as promised: Samsung 8xx Pro are not good if used for CEPH journal. We have changed all of them, and have been using SM863 and SM863A for the last 3 weeks: we did not have any issue since then!
Apparently, we have some server reboot we have to investigate now, but the main issue is solved.
Hope this helps,
-- Nicolas

Search

Search

Proxmox VE 4.4/CEPH/RAID issues

Nicolas Dey

New Member

alexskysilk

Distinguished Member

Nicolas Dey

New Member

Ashley

Member

Nicolas Dey

New Member

fortechitsolutions

Renowned Member

Nicolas Dey

New Member

fabian

Proxmox Staff Member

Nicolas Dey

New Member

Nicolas Dey

New Member

We value your privacy