[SOLVED] Poor VM Performance

emilhozan

Member
Aug 27, 2019
Hey all,

This is my first time posting here (I believe, at least), so please bear with me. I'm trying to figure out some performance concerns we're having with some hosted VMs. Our testing consisted of both dd (which I've read isn't necessarily a meaningful benchmark with PVE) and fio.

I have two clusters, one with 10 servers and another with 5 servers, both using Ceph for storage, and both are getting poor write performance within VMs.


After reading many articles and discussions, I have tried quite a few things, all to no avail, EXCEPT the SSD route. The reason I haven't tried it yet: how much better can I expect it to be? Would I need top-quality SSDs to really make an impact?

My PVE host itself gets decent performance (using dd, at least; see attachment 5.12.44); the hosted VMs are the issue at hand. Most of the test VMs are CentOS 7.

The physical hardware:
R610s all around
4x 4TB 5400 RPM drives for storage (I know, I know, poor speeds, but they were the best-suited drives for what we need and our budget)
96 GB of RAM each
1x SAS drive for the PVE host OS (I forget the speed, but definitely better than 5400 RPM)
10Gb isolated network for Ceph
1Gb for WAN / cluster sync


That said, I saw the sample fio command described in this wiki link. That seems to be a read test though, not a write test (per the command flags, i.e. "--rw=read"). I looked up other fio tutorials and found this command:

Code:
fio --name=seqwrite --rw=write --direct=1 --ioengine=libaio --bs=4M --numjobs=4 --size=1G --runtime=600 --group_reporting

My results are in screenshot 5.17.16. I interpret that (164,282 KB/s, i.e. roughly 164 MB/s) as not too shabby. Am I right here?

The dd result isn't too bad either (screenshot 5.12.44).

I read that the default object/block size for Ceph pools is 4M, per that same wiki article that discusses fio. This is the block size I used for both tests, dd and fio.
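
For reference, the dd write test on the host was something along these lines (the target path is just an example; oflag=direct is there so the page cache doesn't inflate the numbers):

Code:
dd if=/dev/zero of=/root/ddtest.img bs=4M count=256 oflag=direct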


Now when I test within a CentOS 7 VM, with "bs=4K" for dd (a 4K fio sketch follows the list):

dd results are in screenshot 5.22.21
fio (with "--bs=4M") results are in screenshot 5.38.31
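
For anyone wanting to reproduce the small-block numbers inside a VM, a 4K random-write fio run along these lines should give an IOPS figure (job name, size and runtime are arbitrary):

Code:
fio --name=randwrite4k --rw=randwrite --direct=1 --ioengine=libaio --bs=4K --numjobs=1 --iodepth=1 --size=1G --runtime=60 --group_reporting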



So what can I do?
Do the resources allocated to the VMs matter in this case? We'd prefer not to use any disk caching for obvious reasons (I saw many "no-nos" in this regard on other forums); see the config sketch below.
I tried the "none" authentication route (disabling cephx), but that seemed to make things worse, so I reverted.
I tried reducing the replication size to just 1; same result.
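
For reference, the disk line in the VM config (/etc/pve/qemu-server/<vmid>.conf) currently looks roughly like the following; the storage name, VM ID and size are placeholders, with the cache mode left at none:

Code:
scsi0: ceph_vm:vm-101-disk-0,cache=none,size=32G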

I haven't tried SSDs; we're willing to go that route but would prefer not to. We have a handful of servers that don't need the write performance, but some require it, so we're thinking of a dedicated SSD pool via CRUSH rules.


If I am missing any data, please let me know and I'll get it ASAP.

Thanks
 

Attachments

  • Screen Shot 2019-09-20 at 5.17.16 PM.png (7.2 KB)
  • Screen Shot 2019-09-20 at 5.22.21 PM.png (8.3 KB)
  • Screen Shot 2019-09-20 at 5.12.44 PM.png (8.1 KB)
  • Screen Shot 2019-09-20 at 5.38.31 PM.png (7.2 KB)
I didn't try SSDs and we're willing to go this route but would prefer not to.

You cannot run a Ceph cluster on 4x 5400 rpm drives per node and expect performance, sorry. That's the plain truth.
Enterprise-class SSDs, even cheap, used 128 GB Intel SLC ones, will boost your system tremendously. As a first test, I'd buy 5x used 128 GB enterprise-class SSDs from this list for 100-150 Euro in total and split each into 4 partitions, one per harddisk, as the cache for each harddisk.

Just to understand what is happening in your setup: you want to look at I/O latency, not throughput. The two are directly related, but it's easier to reason about latency: one write to your Ceph pool has to take as little time as possible for the system to "feel fast".
Each write in your VM only returns once the data has been written at least twice (if that is what your CRUSH map / pool size specifies), and each copy has to go over the network and land on a disk in another machine before the write can return. On spinning disks this is really slow. In an SSD setup, the write returns as soon as the data hits the SSD, which can commit writes much, much faster, so you get very short I/O times. The quality and size of your SSDs will determine how much data can be written quickly to your Ceph pool.
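
In practice that means creating each HDD OSD with its BlueStore DB/WAL on one of those SSD partitions. A minimal sketch, assuming BlueStore and with purely example device names (adapt sdb and sdc1 to your own disks):

Code:
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/sdc1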
 
Hey LnxBill,

That's what I suspected and I appreciate the suggested SSDs as well.

However, here is my dilemma based on your response: I do not have any free PCIe slots anymore. Our units only have two, and both are in use at the moment, one for our SFP+ module and one for our RAID controller.

By chance, do you have any SATA suggestions? Or can you let me know what to look for in SATA options? Should I go only by IOPS figures? Anything I should stay away from?

Thank you SO MUCH for your help and confirmation.
 
I take that back! Just looked at a few more and saw some non-SATA drives.

Either way, is there anything I should look for in SATA drives that would help? Anything I should steer clear of?
 
Your 4K IOPS are horrid, but that could be due to the drives themselves and not anything wrong with your setup.

100MB/s on your write test is actually pretty good for such a small cluster. I'm more interested in your read tests.

You can use the following tests for more performance statistics:

Code:
rados bench -p primary_data 10 write --no-cleanup
rados bench -p primary_data 10 seq
rados bench -p primary_data 10 rand
rados -p primary_data cleanup

primary_data being the pool.
 
Also, sorry, one more question.

Is it better to create a dedicated SSD pool, or to use SSDs for journaling?

To be clear about our use case: only a few VMs require high disk write performance. The others will be fine with what we have at the moment.

@paradox55
Standby.
 
@paradox55

rados bench -p primary_data 10 write --no-cleanup
1569189217140.png

rados bench -p primary_data 10 seq
1569189263200.png

rados bench -p primary_data 10 rand

1569189301522.png

rados -p primary_data cleanup

Done.
 

@paradox55
In that case, is it better to create a dedicated SSD pool, or to use SSDs for journaling?

To be clear of our use case: only a few VMs require high disk write performance. The other ones will be okay with what we're facing at the moment.
 
Are they 2.5" laptop drives? Even if they were *higher end* desktop drives, there's a reason why most servers have SAS ports. Just skip right to a decent SSD: look for something that was common in Dells and avoid "read optimized" SSDs if possible. There are many decent 2nd-hand Dell/Toshiba SAS SSDs that are sometimes even cheaper than SATA SSDs. Before buying, try to find verified write speeds; some can be dismal, e.g.: https://www.storagereview.com/toshiba_mkx001grzb_enterprise_ssd_review

Several other nice options: https://www.ebay.com/sch/i.html?_from=R40&_nkw=toshiba+sas+ssd&_sacat=0&_sop=15
https://business.toshiba-memory.com...oducts/enterprise-ssd/px04svb-px04svqxxx.html
Not that great being "read optimized", but you said you don't need much write performance... Note that is a much newer dual-channel SAS SSD; if your card or backplane is not dual-channel capable, you will get 1/2 the advertised speed, and of course they advertise 12Gb SAS, so if you are on 6Gb it's going to be 1/2 * 1/2... but I think response time is all you are looking for?

On top of that, I simply would not use the *10-series chassis for anything; they originally came with the PERC 6/i, which came from the PE2900, while an R720 is valued at only around $300 more, considering the chassis alone. Then you get something like 4x more server for your $$. That brings up the question of your card: what HBA is it? It sounds like your budget is pretty low, and I'm not trying to bash you for that, but a 15-machine cluster? That is something like a $300,000 cluster new, and you are skating by on 3% worth of hardware... You may be just on the edge of making it work; maybe something newer will get you the response you want at a lower machine density. R720s are pretty affordable in 16-bay chassis, and not much more in 26-bay. More drives, better CPUs, and you could probably run 8-10 machines total for your 2 clusters. An H310 re-flashed to LSI IT mode will get you a decent HBA for $30-40.
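
If you can get a candidate drive in hand before committing, a common sanity check for Ceph journal/DB suitability is a queue-depth-1 synced 4K write run with fio, something like the sketch below (the device path is a placeholder, and writing to the raw device destroys any data on it):

Code:
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=ssd-sync-write-test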
 
@totalimpact
The HBAs are H200s.

Write performance does not seem to be a big issue for the most part. Again, there are just a handful of servers that require higher write IOPS.

For these, a dedicated SSD pool seems to be the best route to go.
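
For future readers, the rough shape of what we're planning is a device-class-based CRUSH rule plus a pool that uses it. This is only a sketch, with placeholder names and PG counts, and it assumes the SSD OSDs show up with the "ssd" device class:

Code:
# replicated rule that only selects OSDs with device class "ssd"
ceph osd crush rule create-replicated ssd_rule default host ssd
# pool for the write-heavy VMs, then point it at that rule
ceph osd pool create ssd_pool 128 128
ceph osd pool set ssd_pool crush_rule ssd_rule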

Thank you all for your help. I'll update this once we figure out what to do and post the results for future viewers as well.
 
Just wanted to swoop back around after testing with an SSD pool made up of 5x of these Seagate drives. I'll admit, waaay better performance. I only semi-understand the queue depth issue with dd and Ceph, but I did notice that if you use bs=100M and count=10, the numbers get way better.
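
For reference, that dd variant is simply the following (target path is just an example; oflag=direct keeps the page cache out of the measurement):

Code:
dd if=/dev/zero of=/root/ddtest.img bs=100M count=10 oflag=direct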

rados bench -p primary_data 10 write --no-cleanup
1570600542931.png

rados bench -p primary_data 10 seq
1570600605112.png

rados bench -p primary_data 10 rand
1570600639348.png

rados -p primary_data cleanup


Additional screenshots: 1570601158748.png, 1570601174892.png, 1570601189533.png, 1570601210425.png


Thanks for the help with this, as well as with my other post regarding Ceph pools and rules. That was quite fun to learn! If you (as a future reader) run into this same requirement, here's my link detailing the learning excursion I had over the past week or so.
 
