Poor write performance on Ceph-backed virtual disks

AllanM

Hello!

I've had this issue on my home cluster since I built it in late 2019, but always figured it was caused by being on old/slow hardware and using a mixed bag of consumer SSD/HDDs...

~50MB/s is as good as I can get out of ceph write performance from any VM, to any pool type (SSD or HDD), on either of the following clusters.

Home cluster is a 4-node made from old Supermicro fat twins with 2 X E5-2620 V2 and 80GB RAM per node. Ceph has a dedicated 1G public network and a dedicated 1G cluster network. 12 X 500GB SSDs (3 consumer drives per node) and 12 X 4TB HDDs (3 per node, a mix of blues and white labels, all confirmed conventional recording).

Work cluster is a 6-node made from Supermicro 2113S-WTRT's with 1 X 7402P and 256GB RAM per node. Ceph has a dedicated 10G public network and a dedicated 10G cluster network. 24 X 2TB SSD's (4 X Kingston DC500R per node) and 18 X 16TB HDD's (3 X Toshiba MG08ACA16TE per node). The big spinners here are connected to each node via LSI SAS 9300-8e's in a 1-to-1 direct-attach (non-expander) enclosure.

Pools are configured with SSD/HDD "pool rules" per the Proxmox manual on both clusters, so that I can assign certain virtual disks to SSD-only or HDD-only storage. "Write back" mode helped on the home cluster with continuous write performance from a TrueNAS VM, bringing performance up to ~50MB/s, but at work our Windows file server and PBS cap out around 50MB/s write regardless of the cache settings on the virtual disk. It starts off a lot faster until the VM OS write cache fills up. The virtual disks are configured as virtio block devices in all cases.
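For reference, the SSD/HDD split is done with device-class CRUSH rules, roughly like this (the rule and pool names here are just examples from my setup):

  # one replicated rule per device class, per the PVE/Ceph docs
  ceph osd crush rule create-replicated replicated_ssd default host ssd
  ceph osd crush rule create-replicated replicated_hdd default host hdd
  # point each pool at the matching rule
  ceph osd pool set vm-ssd crush_rule replicated_ssd
  ceph osd pool set vm-hdd crush_rule replicated_hdd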

This ~50MB/s cap is observed on virtualized Proxmox Backup, Windows Server, TrueNAS, Security Onion Sensors, etc. Doesn't matter if it's linux, unix, or windows.

The file copy operations that are bringing things to a crawl here are not high "operation" workloads. These are just big video and zip/7z files being copied to disks hosted on the HDD pools. The ~50MB/s cap I'm observing on both of these clusters is in sequential write operations, not random/small-file. (It's much worse when random/small).

Home Cluster is on "free" repos and has been upgraded to Ceph Octopus (made no difference). Work cluster is on enterprise repos and still running Nautilus.

At home, 50MB/s is fine by me. I really don't care since it's just for personal file storage and this is plenty fast for us to share and backup files at small home-scale. I assumed this was to be expected with virtualization losses, overhead, slow old CPU's, slow network, and mixed bag of garbage drives.

At work, this is not acceptable. We need way more performance than 50MB/s here. Our internet connection is ~120MB/s :\ I have many TB's to move around and back up regularly. We're getting ready to bring a lot more users into the domain hosted from this.

What's the trick? How do we speed this up? Need more OOOMF!!! lol
 

Attachment: Screenshot 2021-01-18 200216.png

Is the enclosure shared or just space for growth?

I don't understand the question but will try to offer some clarification.

The enclosure is a 24 X 3.5" bay direct-attach (non-expander) design. Groups of 4 drive bays are mapped to 6 X mini-SAS connections. Each node of the cluster is directly attached to 4 of the drive bays in the enclosure via the LSI card and a mini-SAS external cable.

I have 2 clusters here with 4 different pools, each with radically different hardware configurations, performance, and drive types exhibiting almost exactly the same performance bottleneck. I'm pretty sure this isn't a hardware issue.

Please share some more information. Network, Ceph, client configuration.

What would you like to know about the network/ceph/client configs?


A side note: those are read-intensive SSDs, they have lower write speeds.

Read-intensive means they have lower endurance, not lower write speeds. These drives have full data path power protection and very good write performance. There is no excuse for ~50MB/s write speeds to an array of 24 of these.

With that said, it doesn't matter if we're working with this pool of fast SSD's, or a pool of terrible quality consumer SSD's in a smaller cluster on way slower hardware with half as many SSD's, or HDD pools of consumer or enterprise drives, I'm seeing about the same write performance in VM's either way, suggesting that the bottleneck is NOT the hardware.
 
I was just reading the latest ceph benchmark PDF from the proxmox folks here for any possible insight.

In the single thread sequential write test on a Windows VM, they're only getting 600MB/s on drives that would do near 3000MB/s if directly attached.

I'm seeing a similar relative performance loss here.

On my home cluster, when changing/adding a lot of OSD's, I found some ceph "tuning" options that allowed me to "crank up" the rebalancing performance dramatically by just letting it do more at once. Default configs were very conservative to protect VM performance. I suspect the default ceph config is also likely conservative with regards to protecting available I/O for many running VM's.

I suspect I need to twist some knobs on ceph to improve this.
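If anyone's curious, the rebalancing knobs I cranked were along these lines (the values are just what I happened to use, not a recommendation):

  # allow more parallel backfill/recovery per OSD (the defaults are deliberately conservative)
  ceph config set osd osd_max_backfills 4
  ceph config set osd osd_recovery_max_active 8
  # or injected at runtime without touching the config database
  ceph tell 'osd.*' injectargs '--osd-max-backfills 4 --osd-recovery-max-active 8'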
 
I don't understand the question but will try to offer some clarification.

The enclosure is a 24 X 3.5" bay direct-attach (non-expander) design. Groups of 4 drive bays are mapped to 6 X mini-SAS connections. Each node of the cluster is directly attached to 4 of the drive bays in the enclosure via the LSI card and a mini-SAS external cable.
That's what I meant by a shared enclosure: multiple nodes connecting to it. It's just for my understanding. ;)

Read-intensive means they have lower endurance, not lower write speeds. These drives have full data path power protection and very good write performance. There is no excuse for ~50MB/s write speeds to an array of 24 of these.
Sadly that's not the only thing; the controller also plays a big part. My statement was meant in comparison to what I have experienced. But in any case, you are right, that's not the cause of write speeds this low.

With that said, it doesn't matter if we're working with this pool of fast SSD's, or a pool of terrible quality consumer SSD's in a smaller cluster on way slower hardware with half as many SSD's, or HDD pools of consumer or enterprise drives, I'm seeing about the same write performance in VM's either way, suggesting that the bottleneck is NOT the hardware.
Well, consumer SSDs are a totally different topic. You can search the forum, there are many threads with different issues concerning consumer SSDs. But let's stick with the SSD pool to investigate the issue. :)

In the single thread sequential write test on a Windows VM, they're only getting 600MB/s on drives that would do near 3000MB/s if directly attached.
A Ceph pool has 3 copies by default. That means it needs to write everything 3x. The client (PVE node) talks to the primary OSD and sends its data to it. That OSD is then responsible for sending the data on to its secondary and tertiary partner OSDs. And all of that happens over the network; Ceph has no concept of data locality.

The Micron 9300 MAX in the 3.2 TB version is capable of doing ~2.5 GB/s raw (directly on disk). But to be fair, the cluster in the benchmark paper only has 3 nodes, which makes the performance more predictable.

The ~600 MB/s is from inside the VM. There are a couple more layers in between, e.g. the librbd cache, the single thread, and how the VM does its own caching. Anyhow, please do some fio and rados benchmarks, like in the paper. That way we can narrow down where the 50 MB/s is coming from.
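For example, something along these lines (pool name and device are placeholders; note the fio run on a raw disk destroys its data, so only use a disk you can afford to wipe and re-add as an OSD):

  # pool-level write benchmark (4M objects, 16 threads), then a sequential read of the same objects
  rados bench -p <poolname> 60 write -b 4M -t 16 --no-cleanup
  rados bench -p <poolname> 60 seq -t 16
  rados -p <poolname> cleanup

  # raw single-thread sequential write of one disk (DESTROYS data on /dev/sdX)
  fio --ioengine=libaio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4M --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=seq-write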
 
Tried a few things....

Enabled the autoscaler, which shrunk the number of pgs quite a bit. Performance dropped a bit after rebalancing.

Gave the autoscaler some information about how large the pool is likely to be down the road, and it grew the number of pgs quite a bit. Performance dropped even further.
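(For the record, the autoscaler bits were roughly the following; the pool name and ratio are just examples from my setup.)

  # enable the autoscaler module (needed on Nautilus) and let it manage pg_num for the pool
  ceph mgr module enable pg_autoscaler
  ceph osd pool set vm-hdd pg_autoscale_mode on
  # hint at how much of the cluster the pool is expected to consume eventually
  ceph osd pool set vm-hdd target_size_ratio 0.8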

Updated firmware on the LSI controllers.. performance dropped even more.

Tried more write cache modes and options within the windows VM... everything I try keeps making it slower...


CURRENTLY WRITING AT 5MB/s FROM WINDOWS FILE SERVER TO THIS POOL.

This needs to be fixed. How can we speed up writes to a spinning disk pool? I need 100MB/s minimum sequential. The drives are capable of 250MB/s and there are 18 of them. Suggestions?
 
I'm currently struggling to figure out whether it's the home cluster or the production cluster that you've slowed down to 5MB/s. 50MB/s is very close to 500 Mbit of bandwidth, and you're transferring that multiple times across your network. Remember: 3 copies. I want to see iperf tests to ensure that you can do the full 1Gbit. Also, you should enable jumbo frames to help reduce the load on your CPU and your network gear.
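Something like this is what I'd run to check (the IP and interface name are just examples):

  # on one node
  iperf3 -s
  # from another node, over the ceph cluster network
  iperf3 -c 10.10.10.1 -P 4 -t 30
  # quick MTU test for jumbo frames (make it permanent in /etc/network/interfaces; the switch ports must allow it too)
  ip link set dev enp3s0 mtu 9000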
 
I'm currently struggling to figure out whether it's the home cluster or the production cluster that you've slowed down to 5MB/s. 50MB/s is very close to 500 Mbit of bandwidth, and you're transferring that multiple times across your network. Remember: 3 copies. I want to see iperf tests to ensure that you can do the full 1Gbit. Also, you should enable jumbo frames to help reduce the load on your CPU and your network gear.
Hello!

We're getting 5MB/s write speed on the production cluster now. That's on a 10Gb network.

The SSD pool on this cluster rebalances and recovers at speeds of 500-2000MB/s. This is not a network or CPU issue.

With max backfills and recovery_max_active cranked up a bit the spinning disk pool rebalances and recovers at speeds of ~150-250MB/s.

There's something weird going on in the layers of interaction from the Windows Server VM down through the virtIO disk drivers.

Also worth mentioning that I tried attaching the disk both as a SCSI device on the VirtIO SCSI controller and as a VirtIO Block device. All other things being equal, the VirtIO SCSI configuration is about half the speed of VirtIO Block.
 
Try accessing it using krbd, or turn that off if it's already on. Seems odd. Is your Ceph cluster separate, or is it managed by the Proxmox cluster?

I turned on krbd for this pool, then shut down and booted the 2 VMs with virtual disks in this pool.
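(For reference, I flipped it on the storage definition like this, then did the stop/start; "ceph-hdd" is just my storage ID.)

  # have PVE map RBD images through the kernel client instead of librbd
  pvesm set ceph-hdd --krbd 1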

Performance appears to have improved about 60%. So my instance of PBS is writing ~80MB/s instead of ~50MB/s to this pool, and the Windows File server is now moving at a scorching 8MB/s instead of 5MB/s.

Solid gains. Good suggestion. Now how do we recover the missing 90% from the windows file server... ;)


Totally just noticed the Windows part. What about going into the advanced options and enabling IO thread? Also try older virtio drivers, 1.80-85ish. I've not used the latest drivers in years. I vaguely remember trying them a long time ago and seeing a reduction in speed.

How are the virtio SCSI drivers under Linux?

linux VM's are writing about 10X faster to this pool right now... lol. Though this is odd because I was getting about the same performance in both before updating the SAS firmware and tinkering with pg autoscaler.

I am going to try rolling back to 1.185 drivers and see what happens. Good idea!..... [edit in] just went back to 1.185. No change.

---------------------------


Not sure if this means anything, but I ran a rados bench on the pool. A 20-second sequential write averages 200MB/s, and a subsequent sequential read of the same data is ~900MB/s. Those numbers are more in line with what I would expect from the pool.


--------------------------

Sounds like I'm not alone on the Windows VM issue here: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ANHJQZLJT474B457VVM4ZZZ6HBXW4OPO/
 
How are you testing performance, so that I can try to replicate it on Windows? Also, I logged in and started poking around for this virtio driver ISO. It looks like the version I have running is virtio-win-0.1.141.iso. It's quite old and likely left over from when I was testing OpenNebula.

My Ceph cluster is all SSD (16 of them) running on Proxmox 6.3 (Nautilus), with 4 X 1TB Intel D3-S4510 SSDs per host and 10G LAGs going to Arista switches in an MLAG fashion.
128 Placement groups on 3/2 size/min - replicated
This is separate from my compute cluster. Originally my Ceph was built on Ubuntu, but after having random network interfaces flapping I decided to just try Proxmox for the easy Ceph config, and what do you know, the cluster has been rock-solid ever since.

Windows 2012 test VM
4GB RAM
1 socket, 4 cores - no NUMA
BIOS - SeaBIOS
SCSI controller - VirtIO SCSI
Hard disk - VirtIO - IO thread - no limitations
 
I just did a basic copy and paste of a 2016 windows image that is almost 6GB in size and the transfer mustered over 100MB per second.

No Cache
IO thread on or off didn't make much of a difference; on possibly had a slightly faster end-of-test speed.
85-125MB per sec

Write Back
IO Thread
290-300MB per sec
Hello mmidgett,

Any file copy operation or samba share file copy operation suffers the severe performance bottleneck when writing to the spinning pool.

When I copy a file from the spinning pool to the SSD pool, I also get about 100MB/s in non-cache mode just like you, and windows "behaves" the way I would expect it to. It performs the operation at the actual speed of the underlying equipment. I see about 100MB/s in the windows transfer progress dialog, and about 100MB/s in the ceph logs, and about 100MB/s in the VM summary information.

When I copy a file from the SSD pool to the spinning pool, it reads a whole bunch of data really fast into system RAM, so it "appears" to be copying at like 300MB/s, but then after the RAM fills up it appears to stall, then eventually starts reporting the "actual" speed of the transfer, which is about 8MB/s right now. The actual write speed shown in the ceph logs bounces around from ~5-12MB/s. The VM summary after a while averages out to ~8MB/s.
 
I just set the virtual disk to direct sync mode to "test a theory." Big surprise here: 8MB/s.

So how do we get the VM to respect the actual cache settings?
 
Try write back and possibly older drivers. They have less cruft from the new features.

Write back starts off fast... like 150MB/s as reported in ceph logs and VM summary graphs, but within a few minutes drops to 25MB/s.

It also has a very nasty problem. When I cancel a file copy mid-way, windows "cancels" it, but there's a crap ton of data still waiting to be flushed on the underlying system (ceph cache), so the drive basically acts unresponsive until all the data is flushed. The 25MB/s may seem like an advantage, but it's actually much worse, because the virtual drive can wind up totally unresponsive for LONG periods of time.


Go to Device Manager, edit the Red Hat drive controller, and set it for performance.

I do not see anywhere to set for performance in this location.
 
When you upgraded to Octopus, did you destroy each OSD? I think I remember seeing performance problems on another post if you didn't. The upgrade docs don't say anything about it.

https://pve.proxmox.com/wiki/Ceph_Nautilus_to_Octopus

Production cluster is still on Nautilus.

I did a bunch of testing at home to find a "best config" to try for the production cluster. The best performance I can get on a Windows guest seems to be krbd, virtio block, iothread, writeback, and then configuring Windows to disable write-cache buffer flushing on the device.
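(Roughly what that looks like as a disk config line; the VMID, storage ID, and disk name are just examples from my test VM.)

  # virtio block device, writeback cache, dedicated IO thread, on the krbd-enabled storage
  qm set 101 --virtio0 ceph-hdd:vm-101-disk-0,cache=writeback,iothread=1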

This actually resulted in a continuous write performance of ~200MB/s, for about 5 minutes, then it drops to 20MB/s and stays there. Unfortunately, 5 minutes isn't good enough. We have terabytes worth to move. 20MB/s isn't going to cut it.


As @mmidgett mentioned the upgrade, see this thread.
https://forum.proxmox.com/threads/d...ter-upgrade-to-ceph-octopus.81542/post-363772

And please do the formal testing, like in the Ceph benchmark paper; it makes it easier for us to compare the results of the different layers.


I'll try to do some formal testing.... a lot of those tests require removing a drive from the pool before running them (otherwise they destroy data), so they're not convenient on a production cluster.



---------------

[edit in]
Just upgraded the production cluster to Octopus.

Observing ~30MB/s instead of ~20MB/s
 
