[SOLVED] Poor VM Performance

emilhozan

Member
Aug 27, 2019
Hey all,

This is my first time posting here (I believe, at least), so please bear with me. I'm trying to figure out some performance concerns we're having with some hosted VMs. Our testing consisted of both dd (which I've read isn't necessarily a meaningful benchmark with PVE) and fio.

I have two clusters, one with 10 servers and another with 5 servers, both using Ceph for storage, and both are getting poor write performance within VMs.


After reading many articles and discussions, I have tried quite a few things, all to no avail, EXCEPT the SSD route. The reason I haven't tried it yet: how much better can I expect it to be? Would I need top-quality SSDs to really make an impact?

My PVE host itself gets decent performance (using dd, at least; see attachment 5.12.44); the hosted VMs are the issue at hand. Most of the test VMs are CentOS 7.

The physical hardware:
R610s all around
4x 4TB 5400 RPM drives for storage (I know, I know, poor speeds, but they were the best-suited drives for what we need and our budget)
96 GB of RAM each
1x SAS drive for the PVE host OS (I forget the speed, but definitely better than 5400 RPM)
10Gb isolated network for Ceph
1Gb for WAN / cluster sync


That said, I saw the sample fio command described in this wiki link. That seems to be a read test though, not a write test (per the command flags, i.e. "--rw=read"). I looked up other fio tutorials and found this command:

Code:
fio --name=seqwrite --rw=write --direct=1 --ioengine=libaio --bs=4M --numjobs=4 --size=1G --runtime=600 --group_reporting

My results are in screenshot 5.17.16. I interpret that (164,282 KB/s, i.e. roughly 164 MB/s) as not too shabby. Am I right here?

The dd result isn't too bad either (screenshot 5.12.44).

I read that the default object/block size for Ceph pools is 4M, per that same wiki article that discusses fio. This is the block size I used for both tests, dd and fio.
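
For reference, the dd write test on the host was something along these lines (the target path is just an example; oflag=direct is there so the page cache doesn't inflate the numbers):

Code:
dd if=/dev/zero of=/root/ddtest.img bs=4M count=256 oflag=direct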


Now when I test within a CentOS 7 VM, with "bs=4K" for dd (a 4K fio sketch follows the list):

dd results are in screenshot 5.22.21
fio (with "--bs=4M") results are in screenshot 5.38.31
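
For anyone wanting to reproduce the small-block numbers inside a VM, a 4K random-write fio run along these lines should give an IOPS figure (job name, size and runtime are arbitrary):

Code:
fio --name=randwrite4k --rw=randwrite --direct=1 --ioengine=libaio --bs=4K --numjobs=1 --iodepth=1 --size=1G --runtime=60 --group_reporting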



So what can I do?
Do the resources allocated to the VMs matter in this case? We'd prefer not to use any disk caching for obvious reasons (I saw many "no-nos" in this regard on other forums); see the config sketch below.
I tried the "none" authentication route (disabling cephx), but that seemed to make things worse, so I reverted.
I tried reducing the replication size to just 1; same result.
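
For reference, the disk line in the VM config (/etc/pve/qemu-server/<vmid>.conf) currently looks roughly like the following; the storage name, VM ID and size are placeholders, with the cache mode left at none:

Code:
scsi0: ceph_vm:vm-101-disk-0,cache=none,size=32G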

I haven't tried SSDs; we're willing to go that route but would prefer not to. We have a handful of servers that don't need the write performance, but some require it, so we're thinking of a dedicated SSD pool via CRUSH rules.


If I am missing any data, please let me know and I'll get it ASAP.

Thanks
 

Attachments

  • Screen Shot 2019-09-20 at 5.17.16 PM.png (7.2 KB)
  • Screen Shot 2019-09-20 at 5.22.21 PM.png (8.3 KB)
  • Screen Shot 2019-09-20 at 5.12.44 PM.png (8.1 KB)
  • Screen Shot 2019-09-20 at 5.38.31 PM.png (7.2 KB)
I didn't try SSDs and we're willing to go this route but would prefer not to.

You cannot run a Ceph cluster on 4x 5400 rpm drives per node and expect performance, sorry. That's the plain truth.
Enterprise-class SSDs, even cheap, used 128 GB Intel SLC ones, will boost your system tremendously. As a first test, I'd buy 5x used 128 GB enterprise-class SSDs from this list for 100-150 Euro in total and split each into 4 partitions, one per harddisk, as the cache for each harddisk.

Just to understand what is happening in your setup: you want to look at I/O latency, not throughput. The two are directly related, but it's easier to reason about latency: one write to your Ceph pool has to take as little time as possible for the system to "feel fast".
Each write in your VM only returns once the data has been written at least twice (if that is what your CRUSH map / pool size specifies), and each copy has to go over the network and land on a disk in another machine before the write can return. On spinning disks this is really slow. In an SSD setup, the write returns as soon as the data hits the SSD, which can commit writes much, much faster, so you get very short I/O times. The quality and size of your SSDs will determine how much data can be written quickly to your Ceph pool.
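
In practice that means creating each HDD OSD with its BlueStore DB/WAL on one of those SSD partitions. A minimal sketch, assuming BlueStore and with purely example device names (adapt sdb and sdc1 to your own disks):

Code:
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/sdc1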
 
Hey LnxBill,

That's what I suspected and I appreciate the suggested SSDs as well.

However, here is my dilemma based on your response: I do not have any free PCIe slots anymore. Our units only have two, and both are in use at the moment, one for our SFP+ module and one for our RAID controller.

By chance, do you have any SATA suggestions? Or can you let me know what to look for in SATA options? Should I go only by IOPS figures? Anything I should stay away from?

Thank you SO MUCH for your help and confirmation.
 
I take that back! Just looked at a few more and saw some non-SATA drives.

Either way, is there anything I should look for in SATA drives that would help? Anything I should steer clear of?
 
Your 4K IOPS are horrid, but that could be due to the drives themselves and not anything wrong with your setup.

100MB/s on your write test is actually pretty good for such a small cluster. I'm more interested in your read tests.

You can use the following tests for more performance statistics:

Code:
rados bench -p primary_data 10 write --no-cleanup
rados bench -p primary_data 10 seq
rados bench -p primary_data 10 rand
rados -p primary_data cleanup

primary_data being the pool.
 
Also, sorry, one more question.

Is it better to create a dedicated SSD pool, or to use SSDs for journaling?

To be clear about our use case: only a few VMs require high disk write performance. The others will be fine with what we have at the moment.

@paradox55
Standby.
 
@paradox55

rados bench -p primary_data 10 write --no-cleanup
1569189217140.png

rados bench -p primary_data 10 seq
1569189263200.png

rados bench -p primary_data 10 rand

1569189301522.png

rados -p primary_data cleanup

Done.
 

@paradox55
In that case, is it better to create a dedicated SSD pool, or to use SSDs for journaling?

To be clear of our use case: only a few VMs require high disk write performance. The other ones will be okay with what we're facing at the moment.
 
Are they 2.5" laptop drives? Even if they were *higher end* desktop drives, there's a reason why most servers have SAS ports. Just skip right to a decent SSD: look for something that was common in Dells and avoid "read optimized" SSDs if possible. There are many decent 2nd-hand Dell/Toshiba SAS SSDs that are sometimes even cheaper than SATA SSDs. Before buying, try to find verified write speeds; some can be dismal, e.g.: https://www.storagereview.com/toshiba_mkx001grzb_enterprise_ssd_review

Several other nice options: https://www.ebay.com/sch/i.html?_from=R40&_nkw=toshiba+sas+ssd&_sacat=0&_sop=15
https://business.toshiba-memory.com...oducts/enterprise-ssd/px04svb-px04svqxxx.html
Not that great being "read optimized", but you said you don't need much write performance... Note that is a much newer dual-channel SAS SSD; if your card or backplane is not dual-channel capable, you will get 1/2 the advertised speed, and of course they advertise 12Gb SAS, so if you are on 6Gb it's going to be 1/2 * 1/2... but I think response time is all you are looking for?

On top of that, I simply would not use the *10-series chassis for anything; they originally came with the PERC 6/i, which came from the PE2900, while an R720 is valued at only around $300 more, considering the chassis alone. Then you get something like 4x more server for your $$. That brings up the question of your card: what HBA is it? It sounds like your budget is pretty low, and I'm not trying to bash you for that, but a 15-machine cluster? That is something like a $300,000 cluster new, and you are skating by on 3% worth of hardware... You may be just on the edge of making it work; maybe something newer will get you the response you want at a lower machine density. R720s are pretty affordable in 16-bay chassis, and not much more in 26-bay. More drives, better CPUs, and you could probably run 8-10 machines total for your 2 clusters. An H310 re-flashed to LSI IT mode will get you a decent HBA for $30-40.
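
If you can get a candidate drive in hand before committing, a common sanity check for Ceph journal/DB suitability is a queue-depth-1 synced 4K write run with fio, something like the sketch below (the device path is a placeholder, and writing to the raw device destroys any data on it):

Code:
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=ssd-sync-write-test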
 
@totalimpact
The HBAs are H200s.

Write performance does not seem to be a big issue for the most part. Again, there are just a handful of servers that require higher write IOPS.

For these, a dedicated SSD pool seems to be the best route to go.
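
For future readers, the rough shape of what we're planning is a device-class-based CRUSH rule plus a pool that uses it. This is only a sketch, with placeholder names and PG counts, and it assumes the SSD OSDs show up with the "ssd" device class:

Code:
# replicated rule that only selects OSDs with device class "ssd"
ceph osd crush rule create-replicated ssd_rule default host ssd
# pool for the write-heavy VMs, then point it at that rule
ceph osd pool create ssd_pool 128 128
ceph osd pool set ssd_pool crush_rule ssd_rule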

Thank you all for your help. I'll update this once we figure out what to do and post the results for future viewers as well.
 
Just wanted to swoop back around after testing with an SSD pool made up of 5x of these Seagate drives. I'll admit, waaay better performance. I only semi-understand the queue depth issue with dd and Ceph, but I did notice that if you use bs=100M and count=10, the numbers get way better.
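
For reference, that dd variant is simply the following (target path is just an example; oflag=direct keeps the page cache out of the measurement):

Code:
dd if=/dev/zero of=/root/ddtest.img bs=100M count=10 oflag=direct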

rados bench -p primary_data 10 write --no-cleanup
1570600542931.png

rados bench -p primary_data 10 seq
1570600605112.png

rados bench -p primary_data 10 rand
1570600639348.png

rados -p primary_data cleanup


Additional screenshots: 1570601158748.png, 1570601174892.png, 1570601189533.png, 1570601210425.png


Thanks for the help with this, as well as with my other post regarding Ceph pools and rules. That was quite fun to learn! If you (as a future reader) run into this same requirement, here's my link detailing the learning excursion I had over the past week or so.
 
